This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
|
services:nagios [2016/06/29 21:32] dgalloway created |
services:nagios [2022/06/29 17:25] (current) djgalloway [LRC Health] |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Nagios ====== | ====== Nagios ====== | ||
| ===== Summary ===== | ===== Summary ===== | ||
| - | We have a nagios instance at ''nagios.front.sepia.ceph.com''. Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway. | + | We have a nagios instance at ''nagios.front.sepia.ceph.com''. Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also get alerts for the Jenkins CI stuff. |
| + | |||
| + | NRPE is configured on nagios-monitored hosts using the common role in [[services:ceph-cm-ansible|ceph-cm-ansible]]. | ||
| ===== Checks ===== | ===== Checks ===== | ||
| - | Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable host. | + | Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these are configurable; their definitions are found in ''/etc/nagios-plugins/config/''. |
| ** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] ** | ** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] ** | ||
| Line 13: | Line 15: | ||
| Calls ''/usr/libexec/raid.pl'' on applicable hosts. | Calls ''/usr/libexec/raid.pl'' on applicable hosts. | ||
| + | |||
| + | ==== LRC Health ==== | ||
| + | This check runs on reesi001, where a [[https://github.com/valerytschopp/ceph-nagios-plugins|custom nagios plugin]] is installed and invoked through NRPE. The plugin alerts when anything but ''HEALTH_OK'' is returned, except that the 'failing to respond to cache pressure' warning is currently whitelisted. | ||
| + | <code> | ||
| + | root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg | ||
| + | command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure' | ||
| + | </code> | ||
| + | |||
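
The whitelist behaviour described for the LRC Health check can be illustrated with a small sketch. This is not the plugin's actual code: the function name ''nagios_status'', the shape of its inputs, and the rule that a fully whitelisted warning set maps to OK are all assumptions made for illustration. Only the Nagios exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) and the whitelisted phrase come from established convention and the configuration above.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def nagios_status(health, messages, whitelist=None):
    """Map a Ceph health summary to a Nagios exit code (illustrative only).

    health    -- overall status string, e.g. "HEALTH_OK" or "HEALTH_WARN"
    messages  -- list of health detail strings (hypothetical input shape)
    whitelist -- optional regular expression; matching messages are ignored
    """
    if health == "HEALTH_OK":
        return OK

    if whitelist is not None and messages:
        pattern = re.compile(whitelist)
        remaining = [m for m in messages if not pattern.search(m)]
        if not remaining:
            # Assumption: if every reported problem is whitelisted,
            # the check is treated as healthy.
            return OK

    return CRITICAL if health == "HEALTH_ERR" else WARNING


if __name__ == "__main__":
    wl = "failing to respond to cache pressure"
    # Exits with the computed status code, as a Nagios plugin would.
    raise SystemExit(
        nagios_status("HEALTH_WARN", ["1 clients " + wl], whitelist=wl)
    )
```

Under these assumptions, a cluster whose only complaint is the whitelisted cache-pressure warning does not page anyone, while any additional warning still produces an alert.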