This shows you the differences between two versions of the page.
services:nagios [2016/09/20 13:41] dgalloway |
services:nagios [2022/06/29 17:25] (current) djgalloway [LRC Health] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== WIP - Nagios ====== | + | ====== Nagios ====== |
===== Summary ===== | ===== Summary ===== | ||
- | We have a nagios instance at ''nagios.front.sepia.ceph.com''. Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway. | + | We have a nagios instance at ''nagios.front.sepia.ceph.com''. Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also get alerts for the Jenkins CI stuff. |
+ | |||
+ | NRPE is configured on nagios-monitored hosts using the common role in [[services:ceph-cm-ansible|ceph-cm-ansible]]. | ||
===== Checks ===== | ===== Checks ===== | ||
- | Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. | + | Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these are configurable and found in ''/etc/nagios-plugins/config/''. |
** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] ** | ** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] ** | ||
Line 15: | Line 17: | ||
==== LRC Health ==== | ==== LRC Health ==== | ||
- | This checks in with mira021 where a [[https://github.com/valerytschopp/ceph-nagios-plugins|custom nagios plugin]] is in place. It currently whitelists 'failing to respond to cache pressure' when anything but ''HEALTH_OK'' is returned. | + | This checks in with reesi001 where a [[https://github.com/valerytschopp/ceph-nagios-plugins|custom nagios plugin]] is in place. It currently whitelists 'failing to respond to cache pressure' when anything but ''HEALTH_OK'' is returned. |
<code> | <code> | ||
- | root@mira021:~# tail -n 1 /etc/nagios/nrpe_local.cfg | + | root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg |
command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure' | command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure' | ||
</code> | </code> | ||
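
To make the structure of the ''nrpe_local.cfg'' entry above easier to read, here is a minimal Python sketch that splits a ''command[name]=...'' line into the check name and its argument list. The helper ''parse_nrpe_command'' is hypothetical and is not part of NRPE or the ceph-nagios-plugins repository. It also assumes shell-style quoting (via ''shlex''), which only approximates how NRPE hands the command line to the shell.

```python
import re
import shlex

# Hypothetical helper for illustration only; NRPE does not ship this.
# Matches lines of the form: command[<name>]=<command line>
COMMAND_RE = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<cmdline>.+)$")


def parse_nrpe_command(line):
    """Return (name, argv) for one NRPE command definition line."""
    match = COMMAND_RE.match(line.strip())
    if match is None:
        raise ValueError("not an NRPE command definition: %r" % line)
    # Assumption: shell-style splitting is close enough to show the arguments.
    return match.group("name"), shlex.split(match.group("cmdline"))


# The line shown in the wiki page, copied verbatim.
line = (
    "command[check_ceph_health]="
    "/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health "
    "--name client.nagios -k /etc/ceph/client.nagios.keyring "
    "--whitelist 'failing to respond to cache pressure'"
)

name, argv = parse_nrpe_command(line)
print(name)
for arg in argv:
    print("  ", arg)
```

The quoted whitelist string comes out as a single argument, which is why the quotes in the config line matter: without them the plugin would receive each word as a separate argument.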