This shows you the differences between two versions of the page.
services:nagios [2016/06/29 21:32] dgalloway | services:nagios [2022/06/29 17:25] (current) djgalloway [LRC Health] | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== WIP - Nagios ====== | + | ====== Nagios ====== |
===== Summary ===== | ===== Summary ===== | ||
- | We have a nagios instance at ''nagios.front.sepia.ceph.com''. Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway. | + | We have a nagios instance at ''nagios.front.sepia.ceph.com''. Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also get alerts for the Jenkins CI stuff. |
+ | |||
+ | NRPE is configured on nagios-monitored hosts using the common role in [[services:ceph-cm-ansible|ceph-cm-ansible]]. | ||
===== Checks ===== | ===== Checks ===== | ||
- | Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. | + | Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these are configurable and found in ''/etc/nagios-plugins/config/''. |
** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] ** | ** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] ** | ||
Line 13: | Line 15: | ||
Calls ''/usr/libexec/raid.pl'' on applicable hosts. | Calls ''/usr/libexec/raid.pl'' on applicable hosts. | ||
+ | |||
+ | ==== LRC Health ==== | ||
+ | This checks in with reesi001, where a [[https://github.com/valerytschopp/ceph-nagios-plugins|custom nagios plugin]] is in place. When anything but ''HEALTH_OK'' is returned, the check currently whitelists (ignores) the 'failing to respond to cache pressure' message. | ||
+ | <code> | ||
+ | root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg | ||
+ | command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure' | ||
+ | </code> | ||
+ |
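The whitelist behaviour described under LRC Health can be sketched roughly as follows. This is an illustration only, not the code of the ceph-nagios-plugins check: the function name ''nagios_status'', the substring match (the real plugin may match differently), and the sample health messages are all assumptions made for the example. The exit codes are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# The message whitelisted in nrpe_local.cfg on reesi001.
WHITELIST = "failing to respond to cache pressure"


def nagios_status(overall, messages, whitelist=WHITELIST):
    """Map a Ceph health summary to a Nagios exit code (hypothetical helper).

    overall  -- overall health string, e.g. "HEALTH_OK" or "HEALTH_WARN"
    messages -- list of individual health messages reported by the cluster
    """
    if overall == "HEALTH_OK":
        return OK

    # Ignore every message that contains the whitelisted text
    # (simple substring match, assumed for this sketch).
    remaining = [m for m in messages if whitelist not in m]

    # If messages were reported and all of them were whitelisted, treat as OK.
    if messages and not remaining:
        return OK

    if overall == "HEALTH_WARN":
        return WARNING
    if overall == "HEALTH_ERR":
        return CRITICAL
    return UNKNOWN


if __name__ == "__main__":
    # Sample message text is made up for illustration.
    sample = ["1 clients failing to respond to cache pressure"]
    print(nagios_status("HEALTH_WARN", sample))
```

With this logic, a cluster whose only complaint is the whitelisted message does not raise an alert, while any additional message still produces a WARNING or CRITICAL result.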