Nagios

Summary

A Nagios instance runs at nagios.front.sepia.ceph.com. Alerts for testnodes are sent to ceph-infra AT redhat DOT com, while alerts for production services are sent to dmick and dgalloway. Alfredo and Andrew also receive alerts for the Jenkins CI services.

NRPE (Nagios Remote Plugin Executor) is configured on Nagios-monitored hosts by the common role in ceph-cm-ansible.

Checks

Load, Disk Space, and HTTP are built-in Nagios checks performed on applicable hosts. Some of these checks are configurable; their definitions are in /etc/nagios-plugins/config/.
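To illustrate how a threshold-based check such as Disk Space reports its result, here is a minimal Python sketch of the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It is not the built-in plugin used in the lab; the function names and the 20%/10% thresholds are assumptions chosen for illustration only.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_free_percent(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to a Nagios status code.

    The default thresholds are hypothetical, not the lab's real settings.
    """
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
    """Return (status_code, status_line) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; nothing useful to say.
        return UNKNOWN, f"DISK UNKNOWN - no size reported for {path}"
    free_pct = 100.0 * usage.free / usage.total
    status = classify_free_percent(free_pct, warn_pct, crit_pct)
    return status, f"DISK {LABELS[status]} - {free_pct:.1f}% free on {path}"
```

A real plugin would print the status line and exit with the status code, for example `status, line = check_disk("/"); print(line); raise SystemExit(status)`.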

SMART

Calls /usr/libexec/smart.sh on applicable hosts.

RAID

Calls /usr/libexec/raid.pl on applicable hosts.

LRC Health

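This check queries reesi001, where a custom Nagios plugin is installed. The plugin is currently configured to whitelist the 'failing to respond to cache pressure' message when the cluster reports anything other than HEALTH_OK. The last line of the NRPE configuration on reesi001 defines the command: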

root@reesi001:~# tail -n 1 /etc/nagios/nrpe_local.cfg
command[check_ceph_health]=/usr/lib/nagios/plugins/ceph-nagios-plugins/src/check_ceph_health --name client.nagios -k /etc/ceph/client.nagios.keyring --whitelist 'failing to respond to cache pressure'
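The effect of the --whitelist option can be sketched in Python as follows. This is not the code of check_ceph_health; it is a simplified illustration that assumes the whitelist is a regular expression applied to each health message, and that a non-OK status whose messages are all whitelisted is reported as OK. The function name evaluate_health and the sample messages are hypothetical.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_health(status, messages, whitelist=None):
    """Return (exit_code, text) for a Ceph health status and its messages.

    Assumption: messages matching the `whitelist` regex are ignored, and
    the check only alerts if at least one message remains afterwards.
    """
    if status == "HEALTH_OK":
        return OK, "HEALTH_OK"

    remaining = list(messages)
    if whitelist:
        pattern = re.compile(whitelist)
        remaining = [m for m in messages if not pattern.search(m)]

    if not remaining:
        return OK, f"{status} (all messages whitelisted)"

    # HEALTH_ERR maps to CRITICAL; any other non-OK status maps to WARNING.
    code = CRITICAL if status == "HEALTH_ERR" else WARNING
    return code, f"{status}: " + "; ".join(remaining)
```

With the whitelist from the command above, a cluster whose only complaint is a client failing to respond to cache pressure would not alert under this sketch, while any additional message would still produce a WARNING or CRITICAL result.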
services/nagios.txt · Last modified: 2022/06/29 17:25 by djgalloway