services:nagios

Differences

This shows you the differences between two versions of the page.

services:nagios [2016/06/29 21:32]
dgalloway created
services:nagios [2022/06/29 17:25] (current)
djgalloway [LRC Health]
Line 1: Line 1:
 ====== Nagios ======
 ===== Summary =====
-We have a nagios instance at ''nagios.front.sepia.ceph.com''.  Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway.
+We have a nagios instance at ''nagios.front.sepia.ceph.com''.  Alerts for testnodes are sent to ceph-infra AT redhat DOT com while alerts for production services are sent to dmick and dgalloway.  Alfredo and Andrew also get alerts for the Jenkins CI stuff.
 + 
 +NRPE is configured on nagios-monitored hosts using the common role in [[services:ceph-cm-ansible|ceph-cm-ansible]].
  
 ===== Checks =====
-Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts.
+Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts.  Some of these are configurable; the configuration is found in ''/etc/nagios-plugins/config/''.
  
 ** [[https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/files/libexec/smart.sh|SMART]] **
Line 13: Line 15:
  
 Calls ''/usr/libexec/raid.pl'' on applicable hosts.
 +
 +==== LRC Health ====
 +This check queries reesi001, where a [[https://github.com/valerytschopp/ceph-nagios-plugins|custom nagios plugin]] is installed.  When anything but ''HEALTH_OK'' is returned, it currently whitelists the 'failing to respond to cache pressure' warning.
 +<code>
 +root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg
 +command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure'
 +</code>
 +
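The whitelist behaviour described for the LRC Health check can be illustrated with a small shell sketch. This is not the plugin's code: ''health_to_nagios'' is a hypothetical helper, and it assumes the whitelist is an extended regular expression matched against each health message, and that a non-OK status whose messages are all whitelisted is reported as OK. It maps a Ceph health status to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical sketch of whitelist handling; NOT the real check_ceph_health logic.
# usage: health_to_nagios STATUS WHITELIST_REGEX [MESSAGE...]
health_to_nagios() {
    status=$1
    whitelist=$2
    shift 2

    # HEALTH_OK is always OK, regardless of any messages.
    if [ "$status" = "HEALTH_OK" ]; then
        echo "OK"
        return 0
    fi

    # Count the messages that the whitelist does not cover.
    remaining=0
    for msg in "$@"; do
        if [ -n "$whitelist" ] && printf '%s\n' "$msg" | grep -Eq -- "$whitelist"; then
            continue
        fi
        remaining=$((remaining + 1))
    done

    # Assumption: if every reported message was whitelisted, report OK.
    if [ "$#" -gt 0 ] && [ "$remaining" -eq 0 ]; then
        echo "OK (whitelisted)"
        return 0
    fi

    # Otherwise map the Ceph status to a Nagios exit code.
    case $status in
        HEALTH_WARN) echo "WARNING"; return 1 ;;
        HEALTH_ERR)  echo "CRITICAL"; return 2 ;;
        *)           echo "UNKNOWN"; return 3 ;;
    esac
}
```

For example, ''health_to_nagios HEALTH_WARN 'failing to respond to cache pressure' '1 clients failing to respond to cache pressure''' prints ''OK (whitelisted)'' and returns 0, while the same call with an unrelated message prints ''WARNING'' and returns 1.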
services/nagios.1467235969.txt.gz · Last modified: 2016/06/29 21:32 by dgalloway