shaman.ceph.com

Summary

https://github.com/ceph/shaman

There are three VMs in the OVH CI region that make up shaman.

User Access

User: ubuntu (or infra admin username)
Key: CI private key (or your private key)
Port: 2222
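
For example, to reach the first node (the key path here is hypothetical; use wherever your copy of the CI private key lives):

ssh -i ~/.ssh/ci_key -p 2222 ubuntu@1.shaman.ceph.com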

Ops Tasks

Starting/Restarting service

systemctl start|stop|restart|status shaman
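
After restarting, it's worth confirming the service actually came back. Shaman runs under systemd, so the usual checks apply:

systemctl status shaman
journalctl -u shaman --since "10 minutes ago"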

Updating/Redeploying shaman

  1. If needed, copy deploy/playbooks/examples/deploy_production.yml to deploy/playbooks/
    1. Get and set the following credentials (see the sketch after this list). They can be found in /opt/shaman/src/shaman/prod.py on 1.shaman.ceph.com:
      1. api_user
      2. api_key
      3. rabbit_host
      4. rabbit_user
      5. rabbit_pw
      6. github_secret
  2. Run the playbook (see below)
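
A minimal sketch of the credential block, assuming the playbook takes these as plain vars (the authoritative structure is in deploy/playbooks/examples/deploy_production.yml; the values come from prod.py):

    api_user: <from prod.py>
    api_key: <from prod.py>
    rabbit_host: <from prod.py>
    rabbit_user: <from prod.py>
    rabbit_pw: <from prod.py>
    github_secret: <from prod.py>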

**It is extremely important that the postgres tag is skipped**

Set --limit to one node at a time to avoid disrupting the CI or lab testing.

ansible-playbook --tags="deploy_app" --skip-tags="postgres,nginx" --extra-vars="master_ip=158.69.71.144 standby_ip=158.69.71.192" deploy_production.yml --limit="1.shaman.ceph.com"

Once the first node is healthy again, repeat with --limit="2.shaman.ceph.com".

Pulling Slave Stats

I needed to determine what percentage of jobs were running on static vs. ephemeral slaves. Alfredo wrote a Python script to pull this data out of the shaman database. The script totals how many jobs ran on static vs. ephemeral slaves over a two-week period (since that's how long we keep dev builds).

Doing this on 2.shaman.ceph.com ensures you're working in a read-only capacity.

  1. ssh 2.shaman.ceph.com
  2. cd /opt/shaman/src/shaman
  3. Copy the script below to two_week_stats.py
    import datetime

    from shaman import models
    from shaman.models import Build, Project

    models.start_read_only()


    def report():
        # dev builds are kept for two weeks; pad by a day to be safe
        two_weeks = datetime.datetime.utcnow() - datetime.timedelta(days=15)

        ceph_project = Project.filter_by(name='ceph').one()
        builds = Build.filter_by(project=ceph_project).filter(Build.completed > two_weeks).all()

        # bucket each build by the slave pool that ran it, keyed on node name
        ovh_builds = {}          # ephemeral OVH slaves (node names contain '__')
        irvingi_builds = {}      # static 'slave-*' nodes
        braggi_builds = {}
        adami_builds = {}
        rest_of_the_world = {}   # anything that matched none of the above

        for build in builds:
            node_name = build.extra['node_name']
            if '__' in node_name:
                mapping = ovh_builds
            elif 'slave-' in node_name:
                mapping = irvingi_builds
            elif 'braggi' in node_name:
                mapping = braggi_builds
            elif 'adami' in node_name:
                mapping = adami_builds
            else:
                mapping = rest_of_the_world

            mapping[node_name] = mapping.get(node_name, 0) + 1

        # print a per-node breakdown and a total for each pool
        for mapping in [ovh_builds, irvingi_builds, braggi_builds, adami_builds, rest_of_the_world]:
            count = 0
            for key, value in mapping.items():
                print("%s %s" % (key, value))
                count += value
            print("TOTAL: %s" % count)
            print("=" * 60)
            print("")
  4. /opt/shaman/bin/pecan shell --shell ipython prod.py
  5. Then, in the pecan shell:
    In [1]: from shaman import models
    
    In [2]: models.start_read_only()
    
    In [3]: import two_week_stats
    
    In [4]: two_week_stats.report()
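
report() prints one "node_name count" line per slave, then a pool total and a separator for each pool. Roughly like this (node names and counts here are made up):

    braggi01 12
    braggi02 9
    TOTAL: 21
    ============================================================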

Delete builds/repos from database

HYPOTHETICALLY ;-) if a repo/build containing an embargoed security fix got pushed to shaman, you can delete the entries from shaman's DB. The packages will still be on the chacra servers, but shaman won't know about them. You can always delete them from chacra too if necessary.

ssh 1.shaman.ceph.com
sudo su - postgres

postgres@1:~$ psql -d shaman
psql (9.5.23)
Type "help" for help.

shaman=# \dt
             List of relations
 Schema |      Name       | Type  | Owner  
--------+-----------------+-------+--------
 public | alembic_version | table | shaman
 public | archs           | table | shaman
 public | builds          | table | shaman
 public | nodes           | table | shaman
 public | projects        | table | shaman
 public | repos           | table | shaman
(6 rows)

shaman=# delete from public.builds where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
DELETE 6

shaman=# select id from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
   id   
--------
 197001
 197010
 197011
 197012
 197030
 196999
(6 rows)

shaman=# delete from public.archs where repo_id = '197001';
DELETE 1
shaman=# delete from public.archs where repo_id = '197010';
DELETE 2
shaman=# delete from public.archs where repo_id = '197011';
DELETE 2
shaman=# delete from public.archs where repo_id = '197012';
DELETE 2
shaman=# delete from public.archs where repo_id = '197030';
DELETE 2
shaman=# delete from public.archs where repo_id = '196999';
DELETE 1
shaman=# delete from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
DELETE 6
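
The six per-repo arch deletes can be collapsed into a single statement keyed on the same sha1 (equivalent to the sequence above; run it before deleting from public.repos, since the subselect needs those rows):

shaman=# delete from public.archs where repo_id in (select id from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce');
DELETE 10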

Nagios Checks

There's a custom Nagios check in place that queries the /api/nodes/next endpoint.

This check is in place to make sure the postgres database is writeable. An incident occurred in 2019 where OVH rebooted all 3 shaman-related VMs at the same time and the DB was read-only for an unknown reason.

root@nagios:~# cat /usr/lib/nagios/plugins/check_shaman 
#!/bin/bash
# Checks shaman /api/nodes/next endpoint
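# -I sends a HEAD request; -u supplies the (redacted) basic-auth credentials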

if curl -s -I -u XXXXX:XXXXX https://${1}/api/nodes/next | grep -q "200 OK"; then
    echo "OK - Shaman /api/nodes/next endpoint healthy"
    exit 0
else
    echo "CRITICAL - Shaman /api/nodes/next endpoint failed"
    exit 2
fi
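
The target host comes in as the script's first argument (the ${1} in the URL); a healthy run looks like:

root@nagios:~# /usr/lib/nagios/plugins/check_shaman shaman.ceph.com
OK - Shaman /api/nodes/next endpoint healthy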