shaman.ceph.com

Summary

There are three VMs in the OVH CI region that make up shaman.

shaman.ceph.com is just a load balancing VM
1.shaman.ceph.com is the primary shaman node that has the postgres DB with all the repo information
2.shaman.ceph.com is a READ ONLY backup in the event 1.shaman.ceph.com goes down

User Access

User: ubuntu (or infra admin username)
Key: CI private key (or your private key)
Port: 2222
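
The access details above translate into a single ssh invocation. The helper below is only a sketch that assembles that command line without running it; the function name and the optional key_path argument are illustrative, and the key location is left for you to fill in.

```python
import shlex


def ssh_command(host, user="ubuntu", port=2222, key_path=None):
    """Build the ssh argv for a shaman node (sketch; does not run ssh)."""
    argv = ["ssh", "-p", str(port)]
    if key_path:
        # key_path is a placeholder: point it at the CI key or your own key.
        argv += ["-i", key_path]
    argv.append("%s@%s" % (user, host))
    return argv


if __name__ == "__main__":
    # Print a copy-pasteable command for the primary node.
    print(" ".join(shlex.quote(arg) for arg in ssh_command("1.shaman.ceph.com")))
```

Pass user="yourname" if you log in with your own infra admin account instead of ubuntu.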

Ops Tasks

Starting/Restarting service

systemctl start|stop|restart|status shaman

Updating/Redeploying shaman

1. If needed, copy deploy/playbooks/examples/deploy_production.yml to deploy/playbooks/
2. Get and set the following credentials. These can be found in /opt/shaman/src/shaman/prod.py on 1.shaman.ceph.com
  1. api_user
  2. api_key
  3. rabbit_host
  4. rabbit_user
  5. rabbit_pw
  6. github_secret
3. Run the playbook (see below)
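
Before running the playbook it can help to confirm that none of the six credentials was left blank. The sketch below assumes you have collected the values into a plain dict; how the playbook actually consumes them is not covered here, and the function name is made up for illustration.

```python
# The six credential names listed in the steps above.
REQUIRED = (
    "api_user",
    "api_key",
    "rabbit_host",
    "rabbit_user",
    "rabbit_pw",
    "github_secret",
)


def missing_credentials(settings):
    """Return the required names that are absent or empty in `settings`."""
    return [name for name in REQUIRED if not settings.get(name)]


if __name__ == "__main__":
    # Hypothetical, partially filled-in settings.
    example = {"api_user": "someone", "api_key": ""}
    print("Still missing: %s" % ", ".join(missing_credentials(example)))
```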

**It is extremely important that the postgres tag is skipped**

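Set --limit to one node at a time to avoid disrupting the CI or lab testing. The command below lists both nodes in --limit; to update one node at a time, run it once per node with a single hostname.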

ansible-playbook --tags="deploy_app" --skip-tags="postgres,nginx" --extra-vars="master_ip=158.69.71.144 standby_ip=158.69.71.192" deploy_production.yml --limit="1.shaman.ceph.com,2.shaman.ceph.com"
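
To follow the one-node-at-a-time advice, the same command can be generated once per host. This is a sketch that only builds the argument lists; it does not run ansible-playbook, the function name is illustrative, and the order in which nodes should be updated is an assumption rather than something this page specifies.

```python
def deploy_commands(nodes, master_ip, standby_ip, playbook="deploy_production.yml"):
    """Return one ansible-playbook argv per node, each limited to that node."""
    commands = []
    for node in nodes:
        commands.append([
            "ansible-playbook",
            "--tags=deploy_app",
            # The postgres tag must always be skipped.
            "--skip-tags=postgres,nginx",
            "--extra-vars=master_ip=%s standby_ip=%s" % (master_ip, standby_ip),
            playbook,
            "--limit=%s" % node,
        ])
    return commands


if __name__ == "__main__":
    for argv in deploy_commands(
        ["1.shaman.ceph.com", "2.shaman.ceph.com"],
        master_ip="158.69.71.144",
        standby_ip="158.69.71.192",
    ):
        print(argv)
```

Each list can be passed to subprocess.run() from the deploy/playbooks/ directory, checking that shaman is healthy on one node before moving on to the next.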

Pulling Slave Stats

I needed to determine what percentage of jobs were running on static vs. ephemeral slaves. Alfredo wrote a Python script to pull this data out of the shaman database. This script totals how many jobs ran on static vs. ephemeral slaves over a two-week period (since that's how long we keep dev builds).

ssh 2.shaman.ceph.com
cd /opt/shaman/src/shaman

Copy the script below to two_week_stats.py

import datetime

from shaman import models
from shaman.models import Build, Project

# Start the models layer in read-only mode before querying.
models.start_read_only()


def report():
    # Look back 15 days to cover the two-week retention window.
    two_weeks = datetime.datetime.utcnow() - datetime.timedelta(days=15)

    ceph_project = Project.filter_by(name='ceph').one()
    builds = Build.filter_by(project=ceph_project).filter(Build.completed > two_weeks).all()

    # Per-node build counts, keyed by node name.
    ovh_builds = {}         # OVH nodes ('__' in the node name)
    lab_builds = {}         # lab nodes ('slave-' in the node name)
    rest_of_the_world = {}  # anything else; counted but not printed

    for build in builds:
        node_name = build.extra['node_name']
        if '__' in node_name:
            mapping = ovh_builds
        elif 'slave-' in node_name:
            mapping = lab_builds
        else:
            mapping = rest_of_the_world

        mapping[node_name] = mapping.get(node_name, 0) + 1

    # print() with a single argument behaves the same on Python 2 and 3.
    for mapping in [ovh_builds, lab_builds]:
        count = 0
        for key, value in mapping.items():
            print("%s %s" % (key, value))
            count += value
        print("TOTAL: %s" % count)
        print("=" * 60)
        print("")

pecan shell --shell ipython prod.py

Then, at the IPython prompt:

In [1]: from shaman import models

In [2]: models.start_read_only()

In [3]: import two_week_stats

In [4]: two_week_stats.report()
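
The report prints per-node counts and a total for each group, so the percentage split still has to be worked out by hand. The sketch below applies the same node-name rules to a plain list of names and returns each group's share. It runs without the shaman database, the sample node names are invented, and unlike the report it includes the "other" group in the total.

```python
from collections import Counter


def classify(node_name):
    """Group a node by the same substring rules the report uses."""
    if "__" in node_name:
        return "ovh"
    if "slave-" in node_name:
        return "lab"
    return "other"


def share_by_group(node_names):
    """Return {group: percentage of builds}, one entry per group seen."""
    counts = Counter(classify(name) for name in node_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: 100.0 * n / total for group, n in counts.items()}


if __name__ == "__main__":
    # Invented node names, one entry per build.
    sample = ["10.0.0.1__abc", "10.0.0.2__def", "slave-example01", "misc-node"]
    for group, pct in sorted(share_by_group(sample).items()):
        print("%s: %.1f%%" % (group, pct))
```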

Nagios Checks

There's a custom Nagios check in place that queries the /api/nodes/next endpoint.

This check is in place to make sure the postgres database is writeable. An incident occurred in 2019 where OVH rebooted all 3 shaman-related VMs at the same time and the DB was read-only for an unknown reason.

root@nagios:~# cat /usr/lib/nagios/plugins/check_shaman 
#!/bin/bash
# Checks shaman /api/nodes/next endpoint

if curl -s -I -u XXXXX:XXXXX https://${1}/api/nodes/next | grep -q "200 OK"; then
    echo "OK - Shaman /api/nodes/next endpoint healthy"
    exit 0
else
    echo "CRITICAL - Shaman /api/nodes/next endpoint failed"
    exit 2
fi
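
For comparison, the same check can be sketched in Python, comparing the numeric status code instead of grepping the status line for "200 OK" (the "OK" text is not present in every HTTP version's status line). This is an illustrative alternative, not the plugin that is deployed; the function names are made up, and the credentials are passed in rather than hard-coded.

```python
import base64
import urllib.error
import urllib.request


def nagios_result(status_code):
    """Map an HTTP status code (or None) to a Nagios exit code and message."""
    if status_code == 200:
        return 0, "OK - Shaman /api/nodes/next endpoint healthy"
    return 2, "CRITICAL - Shaman /api/nodes/next endpoint failed"


def fetch_status(host, user, password, timeout=10):
    """Send a HEAD request with basic auth; return the status code or None.

    Note: urllib follows redirects, whereas curl without -L does not.
    """
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    request = urllib.request.Request(
        "https://%s/api/nodes/next" % host,
        method="HEAD",
        headers={"Authorization": "Basic %s" % token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, ...
        return None
```

A wrapper script would call fetch_status() with the host from its first argument, print the message from nagios_result(), and exit with the returned code.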