Nagios

Summary

We run a Nagios instance at nagios.front.sepia.ceph.com. Alerts for testnodes are sent to ceph-infra AT redhat DOT com, while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also receive the Jenkins CI alerts.

NRPE is configured on Nagios-monitored hosts by the common role in ceph-cm-ansible.

Checks

Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these are configurable; their configuration is found in /etc/nagios-plugins/config/.

SMART

Calls /usr/libexec/smart.sh on applicable hosts.

RAID

Calls /usr/libexec/raid.pl on applicable hosts.
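
As a rough illustration of how the SMART and RAID scripts might be exposed through NRPE, the fragment below follows the same command[...] syntax used in nrpe_local.cfg elsewhere on this page. The command names check_smart and check_raid are assumptions made for this sketch; only the two script paths come from this page, and the real definitions are whatever the common role in ceph-cm-ansible deploys.

```
# Hypothetical NRPE command definitions (the command names are assumed, not confirmed)
command[check_smart]=/usr/libexec/smart.sh
command[check_raid]=/usr/libexec/raid.pl
```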

LRC Health

This check queries reesi001, where a custom Nagios plugin is installed. The plugin is currently configured to whitelist the 'failing to respond to cache pressure' warning, so that warning is ignored when the cluster returns anything other than HEALTH_OK.

root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg
command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure'
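
The effect of the whitelist can be sketched in a few lines of shell. This is only an illustration of the idea, not the code of check_ceph_health: the function name classify_health and its argument layout are hypothetical, and treating HEALTH_ERR as always critical is an assumption. It uses the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the real check_ceph_health implementation.
# Usage: classify_health STATUS WHITELIST_REGEX [MESSAGE...]
# Prints a Nagios status word and returns the matching plugin exit code:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
classify_health() {
    status=$1
    whitelist=$2
    shift 2
    case "$status" in
        HEALTH_OK)
            echo OK; return 0 ;;
        HEALTH_ERR)
            # Assumption: errors are never suppressed by the whitelist.
            echo CRITICAL; return 2 ;;
        HEALTH_WARN)
            # A warning state with no messages cannot be whitelisted.
            if [ "$#" -eq 0 ]; then
                echo WARNING; return 1
            fi
            # Any message that does not match the whitelist keeps the warning.
            for msg in "$@"; do
                if ! printf '%s\n' "$msg" | grep -Eq -- "$whitelist"; then
                    echo WARNING; return 1
                fi
            done
            # Every message matched the whitelist, so report OK.
            echo OK; return 0 ;;
        *)
            echo UNKNOWN; return 3 ;;
    esac
}
```

For example, classify_health HEALTH_WARN 'failing to respond to cache pressure' '1 clients failing to respond to cache pressure' prints OK, while the same call with the message '1 osds down' prints WARNING.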