There are three VMs in the [[https://wiki.sepia.ceph.com/doku.php?id=services:ovh#production_services|OVH CI region]] that make up shaman.
  * shaman.ceph.com is just a load balancing VM.  Accesses are proxied to either 1.shaman.ceph.com or 2.shaman.ceph.com with an 'upstream shaman' clause in /etc/nginx/nginx.conf, which is then referenced by the site config for shaman.ceph.com (see the sketch below this list).
  * 1.shaman.ceph.com is the primary shaman node that has the postgres DB with all the repo information
  * 2.shaman.ceph.com is a **READ ONLY** backup in the event 1.shaman.ceph.com goes down
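
A minimal sketch of what that proxying looks like in nginx, assuming the backends serve plain HTTP and using the master/standby IPs from the deploy command further down; the real 'upstream' clause lives in /etc/nginx/nginx.conf on shaman.ceph.com and the actual site config may differ:

<code>
# /etc/nginx/nginx.conf -- 'upstream shaman' clause (sketch)
upstream shaman {
    server 158.69.71.144;   # 1.shaman.ceph.com (master_ip in the deploy command)
    server 158.69.71.192;   # 2.shaman.ceph.com (standby_ip in the deploy command)
}

# Site config for shaman.ceph.com (sketch; TLS certificate directives omitted)
server {
    listen 443 ssl;
    server_name shaman.ceph.com;

    location / {
        proxy_pass http://shaman;
    }
}
</code>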
  
===== User Access =====
  
===== Ops Tasks =====
  
==== Starting/Restarting service ====
  
==== Updating/Redeploying shaman ====
  - If needed, copy ''deploy/playbooks/examples/deploy_production.yml'' to ''deploy/playbooks/''
    - Get and set the following credentials (a sketch of the resulting vars follows this list).  These can be found in ''/opt/shaman/src/shaman/prod.py'' on 1.shaman.ceph.com
      - ''api_user''
      - ''api_key''
      - ''rabbit_host''
      - ''rabbit_user''
      - ''rabbit_pw''
      - ''github_secret''
  - Run the playbook (see below)
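
A hedged sketch of where those values end up in ''deploy_production.yml''. The variable names come from the list above; the values are placeholders, and the exact layout should follow the example file you copied:

<code>
# deploy/playbooks/deploy_production.yml (excerpt, sketch only)
# Real values live in /opt/shaman/src/shaman/prod.py on 1.shaman.ceph.com
api_user: "admin"                 # placeholder
api_key: "REDACTED"               # placeholder
rabbit_host: "rabbit.example.com" # placeholder
rabbit_user: "shaman"             # placeholder
rabbit_pw: "REDACTED"             # placeholder
github_secret: "REDACTED"         # placeholder
</code>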
  
**%%**It is extremely important that the postgres tag is skipped**%%**

Set ''%%--limit%%'' to one node at a time to avoid disrupting the CI or lab testing.
  
<code>
ansible-playbook --tags="deploy_app" --skip-tags="postgres,nginx" --extra-vars="master_ip=158.69.71.144 standby_ip=158.69.71.192" deploy_production.yml --limit="1.shaman.ceph.com,2.shaman.ceph.com"
</code>
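
For example, following the one-node-at-a-time advice above, run the same command against the standby first, then repeat it with ''%%--limit="1.shaman.ceph.com"%%'' once the standby looks healthy:

<code>
ansible-playbook --tags="deploy_app" --skip-tags="postgres,nginx" --extra-vars="master_ip=158.69.71.144 standby_ip=158.69.71.192" deploy_production.yml --limit="2.shaman.ceph.com"
</code>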

==== Pulling Slave Stats ====
I needed to determine what percentage of jobs were running on static vs. ephemeral slaves.  Alfredo wrote a Python script to pull this data out of the shaman database.  The script totals how many jobs ran on static vs. ephemeral slaves over a two-week period (since that's how long we keep dev builds).

Running this on 2.shaman.ceph.com ensures you're operating in a read-only capacity.

  - ''ssh 2.shaman.ceph.com''
  - ''cd /opt/shaman/src/shaman''
  - Copy the script below to ''two_week_stats.py''<code>
import datetime

from shaman import models
from shaman.models import Build

models.start_read_only()


def report():
    # Dev builds are only kept for two weeks; look back 15 days to cover the window.
    two_weeks = datetime.datetime.utcnow() - datetime.timedelta(days=15)

    ceph_project = models.Project.filter_by(name='ceph').one()
    builds = Build.filter_by(project=ceph_project).filter(Build.completed > two_weeks).all()

    # Per-slave-class tallies, keyed by node name.
    ovh_builds = {}
    irvingi_builds = {}
    braggi_builds = {}
    adami_builds = {}
    rest_of_the_world = {}

    for build in builds:
        node_name = build.extra['node_name']
        if '__' in node_name:
            mapping = ovh_builds
        elif 'slave-' in node_name:
            mapping = irvingi_builds
        elif 'braggi' in node_name:
            mapping = braggi_builds
        elif 'adami' in node_name:
            mapping = adami_builds
        else:
            mapping = rest_of_the_world

        try:
            mapping[node_name] += 1
        except KeyError:
            mapping[node_name] = 1

    for mapping in [ovh_builds, irvingi_builds, braggi_builds, adami_builds]:
        count = 0
        for key, value in mapping.items():
            print key, value
            count += value
        print "TOTAL: %s" % count
        print "=" * 60
        print
</code>
  - ''%%/opt/shaman/bin/pecan shell --shell ipython prod.py%%''
  - Then<code>
In [1]: from shaman import models

In [2]: models.start_read_only()

In [3]: import two_week_stats

In [4]: two_week_stats.report()
</code>
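
As a quick sanity check before running the full report, you can count the builds in the window from the same shell. This is a hedged example reusing the model calls from the script above; it assumes the query object returned by ''filter()'' supports ''count()'':

<code>
In [5]: import datetime

In [6]: from shaman.models import Build

In [7]: two_weeks = datetime.datetime.utcnow() - datetime.timedelta(days=15)

In [8]: ceph = models.Project.filter_by(name='ceph').one()

In [9]: Build.filter_by(project=ceph).filter(Build.completed > two_weeks).count()
</code>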

==== Delete builds/repos from database ====
HYPOTHETICALLY ;-) if a repo/build containing an embargoed security fix got pushed to shaman, you can delete the entries from shaman's DB.  The packages will still be on chacra servers, but shaman won't know about them.  You can always [[production:chacra.ceph.com#manually_delete_a_repo_from_postgres_db|delete]] them from chacra too if necessary.

<code>
ssh 1.shaman.ceph.com
sudo su - postgres

postgres@1:~$ psql -d shaman
psql (9.5.23)
Type "help" for help.

shaman=# \dt
             List of relations
 Schema |      Name       | Type  | Owner
--------+-----------------+-------+--------
 public | alembic_version | table | shaman
 public | archs           | table | shaman
 public | builds          | table | shaman
 public | nodes           | table | shaman
 public | projects        | table | shaman
 public | repos           | table | shaman
(6 rows)

shaman=# delete from public.builds where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
DELETE 6

shaman=# select id from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
   id
--------
 197001
 197010
 197011
 197012
 197030
 196999

shaman=# delete from public.archs where repo_id = '197001';
DELETE 1
shaman=# delete from public.archs where repo_id = '197010';
DELETE 2
shaman=# delete from public.archs where repo_id = '197011';
DELETE 2
shaman=# delete from public.archs where repo_id = '197012';
DELETE 2
shaman=# delete from public.archs where repo_id = '197030';
DELETE 2
shaman=# delete from public.archs where repo_id = '196999';
DELETE 1
shaman=# delete from public.repos where sha1 = 'f73b19678311b996984c30e7c0eb96a22ffa29ce';
DELETE 6
</code>
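
If a sha1 maps to many repos, deleting the ''archs'' rows one repo_id at a time gets tedious. A hedged alternative with the same effect is a standard PostgreSQL subquery delete; replace the placeholder sha1 and double-check it before running:

<code>
shaman=# delete from public.builds where sha1 = '<embargoed sha1>';
shaman=# delete from public.archs where repo_id in (select id from public.repos where sha1 = '<embargoed sha1>');
shaman=# delete from public.repos where sha1 = '<embargoed sha1>';
</code>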

===== Nagios Checks =====
There's a custom Nagios check in place that queries the ''/api/nodes/next'' endpoint.

This check is in place to make sure the postgres database is writable.  An incident occurred in 2019 where OVH rebooted all 3 shaman-related VMs at the same time and the DB was read-only for an unknown reason.

<code>
root@nagios:~# cat /usr/lib/nagios/plugins/check_shaman
#!/bin/bash
# Checks shaman /api/nodes/next endpoint

if curl -s -I -u XXXXX:XXXXX https://${1}/api/nodes/next | grep -q "200 OK"; then
    echo "OK - Shaman /api/nodes/next endpoint healthy"
    exit 0
else
    echo "CRITICAL - Shaman /api/nodes/next endpoint failed"
    exit 2
fi
</code>
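
To exercise the plugin by hand from the Nagios host, pass the shaman hostname as the first argument; per the script above it prints an OK line and exits 0 on success, or a CRITICAL line and exits 2 on failure:

<code>
root@nagios:~# /usr/lib/nagios/plugins/check_shaman shaman.ceph.com ; echo "exit: $?"
</code>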