production:shaman.ceph.com

production:shaman.ceph.com [2018/05/01 18:40] djgalloway [Updating/Redeploying shaman]
production:shaman.ceph.com [2023/04/06 00:51] (current) dmick [Summary]

There are three VMs in the [[https://wiki.sepia.ceph.com/doku.php?id=services:ovh#production_services|OVH CI region]] that make up shaman.
  * shaman.ceph.com is just a load balancing VM.  Accesses are proxied to either 1.shaman.ceph.com or 2.shaman.ceph.com with an 'upstream shaman' clause in /etc/nginx/nginx.conf, which is then referred to by the site config for shaman.ceph.com.
  * 1.shaman.ceph.com is the primary shaman node that has the postgres DB with all the repo information.
  * 2.shaman.ceph.com is a **READ ONLY** backup in the event 1.shaman.ceph.com goes down.
  * 2.shaman.ceph.com can handle write requests because pecan, the web framework, is also aware of the primary/hot standby configuration, and so will redirect writes to 1.shaman.ceph.com on its own, if they appear.
  
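The balancer layout described above can be sketched in nginx terms.  This is a hypothetical minimal config, not the actual file; the authoritative version is the ''upstream'' clause in ''/etc/nginx/nginx.conf'' and the shaman.ceph.com site config on the balancer itself:

<code nginx>
# Hypothetical sketch only -- consult /etc/nginx/nginx.conf on
# shaman.ceph.com for the real upstream and site configuration.
upstream shaman {
    server 1.shaman.ceph.com;
    server 2.shaman.ceph.com;
}

server {
    server_name shaman.ceph.com;
    location / {
        # Requests to shaman.ceph.com are proxied to one of the backends
        proxy_pass http://shaman;
    }
}
</code>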
===== User Access =====
</code>
  
===== Ops Tasks =====
  
==== Starting/Restarting service ====
  
==== Updating/Redeploying shaman ====
  - If needed, copy ''deploy/playbooks/examples/deploy_production.yml'' to ''deploy/playbooks/''
  - Get and set the following credentials.  These can be found in ''/opt/shaman/src/shaman/prod.py'' on 1.shaman.ceph.com:
    - ''api_user''
    - ''api_key''
    - ''rabbit_host''
    - ''rabbit_user''
    - ''rabbit_pw''
    - ''github_secret''
  - Run the playbook (see below)
  
**It is extremely important that the postgres tag is skipped**

Set ''%%--limit%%'' to one node at a time to avoid disrupting the CI or lab testing.

<code>
ansible-playbook --tags="deploy_app" --skip-tags="postgres,nginx" --extra-vars="master_ip=158.69.71.144 standby_ip=158.69.71.192" deploy_production.yml --limit="1.shaman.ceph.com,2.shaman.ceph.com"
</code>

==== Pulling Slave Stats ====
I needed to determine what percentage of jobs were running on static vs. ephemeral slaves.  Alfredo wrote a python script to pull this data out of the shaman database.  This script totals how many jobs ran on static vs. ephemeral slaves over a 2-week period (since that's how long we keep dev builds).

Doing this on 2.shaman.ceph.com ensures you're in a read-only capacity.

  - ''ssh 2.shaman.ceph.com''
  - ''cd /opt/shaman/src/shaman''
  - Copy the script below to ''two_week_stats.py''<code>
import datetime

from shaman import models
from shaman.models import Build

models.start_read_only()


def report():
    # Dev builds are kept for two weeks; look back 15 days to be safe
    two_weeks = datetime.datetime.utcnow() - datetime.timedelta(days=15)

    ceph_project = models.Project.filter_by(name='ceph').one()
    builds = Build.filter_by(project=ceph_project).filter(Build.completed > two_weeks).all()

    ovh_builds = {}
    irvingi_builds = {}
    braggi_builds = {}
    adami_builds = {}
    rest_of_the_world = {}

    for build in builds:
        node_name = build.extra['node_name']
        if '__' in node_name:
            mapping = ovh_builds
        elif 'slave-' in node_name:
            mapping = irvingi_builds
        elif 'braggi' in node_name:
            mapping = braggi_builds
        elif 'adami' in node_name:
            mapping = adami_builds
        else:
            mapping = rest_of_the_world

        # Tally builds per node name
        try:
            mapping[node_name] += 1
        except KeyError:
            mapping[node_name] = 1

    # Python 2 print statements: this runs inside the shaman virtualenv
    for mapping in [ovh_builds, irvingi_builds, braggi_builds, adami_builds]:
        count = 0
        for key, value in mapping.items():
            print key, value
            count += value
        print "TOTAL: %s" % count
        print "=" * 60
        print
</code>
  - ''%%/opt/shaman/bin/pecan shell --shell ipython prod.py%%''
  - Then<code>
In [1]: from shaman import models

In [2]: models.start_read_only()

In [3]: import two_week_stats

In [4]: two_week_stats.report()
</code>
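To sanity-check the node-name classification rules without touching the database, the same bucketing can be reproduced from a plain list of names with standard shell tools.  A sketch with hypothetical node names (the real values come from each build's ''node_name'' field):

<code bash>
# Bucket node names using the same substring rules as two_week_stats.py:
# '__' => ovh, 'slave-' => irvingi, 'braggi' => braggi, 'adami' => adami.
tally() {
    awk '{
        if (index($0, "__"))          c["ovh"]++
        else if (index($0, "slave-")) c["irvingi"]++
        else if (index($0, "braggi")) c["braggi"]++
        else if (index($0, "adami"))  c["adami"]++
        else                          c["other"]++
    } END { for (k in c) print k, c[k] }' | sort
}

# Hypothetical node names, one per line
printf '%s\n' centos7__ephemeral-1 slave-ubuntu01 braggi03 braggi03 adami02 | tally
# prints:
#   adami 1
#   braggi 2
#   irvingi 1
#   ovh 1
</code>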

==== Delete builds/repos from database ====
HYPOTHETICALLY ;-) if a repo/build got pushed to shaman that contains an embargoed security fix, you can delete the entries from shaman's DB.  The packages will still be on chacra servers but shaman won't know about them.  You can always [[production:chacra.ceph.com#manually_delete_a_repo_from_postgres_db|delete]] them from chacra too if necessary.

<code>
ssh 1.shaman.ceph.com
sudo su - postgres

postgres@1:~$ psql -d shaman
psql (9.5.23)
Type "help" for help.

shaman=# \dt
             List of relations
 Schema |      Name       | Type  | Owner
--------+-----------------+-------+--------
 public | alembic_version | table | shaman
 public | archs           | table | shaman
 public | builds          | table | shaman
 public | nodes           | table | shaman
 public | projects        | table | shaman
 public | repos           | table | shaman
(6 rows)

shaman=# delete from public.builds where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
DELETE 6

shaman=# select id from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
   id
--------
 197001
 197010
 197011
 197012
 197030
 196999

shaman=# delete from public.archs where repo_id = '197001';
DELETE 1
shaman=# delete from public.archs where repo_id = '197010';
DELETE 2
shaman=# delete from public.archs where repo_id = '197011';
DELETE 2
shaman=# delete from public.archs where repo_id = '197012';
DELETE 2
shaman=# delete from public.archs where repo_id = '197030';
DELETE 2
shaman=# delete from public.archs where repo_id = '196999';
DELETE 1
shaman=# delete from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
DELETE 6
</code>
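The repeated per-''repo_id'' deletes above can be collapsed into one statement with a subquery.  A sketch using the same example sha1, in the same psql session; run the ''select'' from the transcript first to confirm what will be removed, and keep it all in one transaction:

<code sql>
-- Sketch: remove builds, archs, and repos matching one sha1.
-- Run inside psql -d shaman on 1.shaman.ceph.com.
begin;
delete from public.builds where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
delete from public.archs where repo_id in
    (select id from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce');
delete from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
commit;
</code>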

===== Nagios Checks =====
There's a custom Nagios check in place that queries the ''/api/nodes/next'' endpoint.

This check is in place to make sure the postgres database is writeable.  An incident occurred in 2019 where OVH rebooted all 3 shaman-related VMs at the same time and the DB was read-only for an unknown reason.

<code>
root@nagios:~# cat /usr/lib/nagios/plugins/check_shaman
#!/bin/bash
# Checks shaman /api/nodes/next endpoint

if curl -s -I -u XXXXX:XXXXX https://${1}/api/nodes/next | grep -q "200 OK"; then
    echo "OK - Shaman /api/nodes/next endpoint healthy"
    exit 0
else
    echo "CRITICAL - Shaman /api/nodes/next endpoint failed"
    exit 2
fi
</code>
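The pass/fail decision comes down to grep-ing curl's status line.  A self-contained sketch of that logic with simulated status lines (no network, no credentials):

<code bash>
# Mirror the grep used by check_shaman: healthy iff the headers
# contain the literal string "200 OK".
check_header() {
    if printf '%s\r\n' "$1" | grep -q "200 OK"; then
        echo "OK"
    else
        echo "CRITICAL"
    fi
}

check_header "HTTP/1.1 200 OK"                   # -> OK
check_header "HTTP/1.1 503 Service Unavailable"  # -> CRITICAL
check_header "HTTP/2 200"                        # -> CRITICAL (no "OK" text)
</code>

Note the last case: if curl negotiates HTTP/2, the status line is ''HTTP/2 200'' with no ''OK'' text, so the grep would report CRITICAL on a healthy endpoint.  Forcing ''curl --http1.1'' (or grepping for just the status code) avoids that false alarm.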