We have a Nagios instance at nagios.front.sepia.ceph.com. Alerts for testnodes are sent to ceph-infra AT redhat DOT com, while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also get alerts for the Jenkins CI stuff.
NRPE is configured on Nagios-monitored hosts using the common role in ceph-cm-ansible.
Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these are configurable; their configuration is found in /etc/nagios-plugins/config/.
Calls /usr/libexec/smart.sh on applicable hosts.
Calls /usr/libexec/raid.pl on applicable hosts.
This check queries reesi001, where a custom Nagios plugin is in place. When anything but HEALTH_OK is returned, the plugin currently whitelists 'failing to respond to cache pressure'.
root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg
command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure'
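To illustrate the idea behind the whitelist, here is a minimal Python sketch of how a health check could ignore whitelisted messages. It is not the code of check_ceph_health: the function name `evaluate_health`, its parameters, and the mapping of HEALTH_WARN and HEALTH_ERR to exit codes are assumptions made for illustration. Only the exit codes themselves (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_health(status, messages, whitelist=None):
    """Return (exit_code, summary) for a reported cluster health status.

    Hypothetical helper, not part of any real plugin:
      status    -- overall status string, e.g. "HEALTH_OK" or "HEALTH_WARN"
      messages  -- list of health detail strings
      whitelist -- optional regex; matching messages are ignored
    """
    if status == "HEALTH_OK":
        return OK, "HEALTH_OK"

    # Drop every message that matches the whitelist pattern.
    if whitelist:
        pattern = re.compile(whitelist)
        messages = [m for m in messages if not pattern.search(m)]

    # If only whitelisted messages were present, do not alert.
    if not messages:
        return OK, "HEALTH_OK (all messages whitelisted)"

    # Assumed mapping: HEALTH_ERR is critical, anything else is a warning.
    code = CRITICAL if status == "HEALTH_ERR" else WARNING
    return code, status + ": " + "; ".join(messages)


if __name__ == "__main__":
    code, summary = evaluate_health(
        "HEALTH_WARN",
        ["1 clients failing to respond to cache pressure"],
        whitelist="failing to respond to cache pressure",
    )
    print(code, summary)
```

In this sketch, a HEALTH_WARN whose only message is the cache pressure warning is reported as OK, while any additional non-whitelisted message still produces an alert.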