====== Nagios ======

===== Summary =====

We have a Nagios instance at ''nagios.front.sepia.ceph.com''.

Alerts for testnodes are sent to ceph-infra AT redhat DOT com, while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also get alerts for the Jenkins CI stuff.

NRPE is configured on Nagios-monitored hosts using the common role in [[services:ceph-cm-ansible|ceph-cm-ansible]].

===== Checks =====

Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these are configurable; their definitions are found in ''/etc/nagios-plugins/config/''.

**[[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]]** Calls ''/usr/libexec/smart.sh'' on applicable hosts.

**[[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/raid.pl|RAID]]** Calls ''/usr/libexec/raid.pl'' on applicable hosts.

==== LRC Health ====

This check queries reesi001, where a [[https://github.com/valerytschopp/ceph-nagios-plugins|custom Nagios plugin]] is in place. When anything but ''HEALTH_OK'' is returned, the check currently whitelists the message 'failing to respond to cache pressure'.

  root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg
  command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure'
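
To illustrate the whitelist behaviour described for the LRC Health check, here is a minimal Python sketch of how a health status and its messages might be mapped to a Nagios exit code. It is not the plugin's actual implementation: the function name ''evaluate_health'', its arguments, and the choice to treat the whitelist as a regular expression are all assumptions made for illustration. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_health(status, messages, whitelist=None):
    """Map a Ceph health status to a Nagios exit code.

    Hypothetical logic, not taken from check_ceph_health:
    - HEALTH_OK is always OK.
    - If every reported message matches the whitelist pattern, report OK.
    - Otherwise HEALTH_ERR is CRITICAL and anything else is WARNING.
    """
    if status == "HEALTH_OK":
        return OK

    # Assumption: the whitelist is treated as a regular expression.
    pattern = re.compile(whitelist) if whitelist else None
    remaining = [m for m in messages if not (pattern and pattern.search(m))]

    # Only suppress the alert when there were messages and all were whitelisted.
    if messages and not remaining:
        return OK
    return CRITICAL if status == "HEALTH_ERR" else WARNING


if __name__ == "__main__":
    whitelist = "failing to respond to cache pressure"
    # Example message text is made up for illustration.
    code = evaluate_health(
        "HEALTH_WARN",
        ["1 clients failing to respond to cache pressure"],
        whitelist,
    )
    print(code)
```

Under these assumptions, a ''HEALTH_WARN'' caused only by the whitelisted message would not raise an alert, while any additional, non-whitelisted message would still do so.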