tasks:scheduled-maintenance [2018/04/03 21:30] (current) djgalloway
===== Maintenance Matrix =====
^ Service ^ Disrupts Community ^ Disrupts Developers ^ Backups ^ Ansible ((Can the host be restored/rebuilt with Ansible?)) ^ HA ^ Failover ^ Risk ^ Stakeholders ^ Applications ^ SME ^ Other Notes ^
| www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | N/A | Wordpress, nginx | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins |
| download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
| tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky. None of us are ruby experts. |
| docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway | |
| chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
| chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
| shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
| {apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
| jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
| prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
| git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
| teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
| pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles. Still on baremetal mira |
| paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
| Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
| conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
| DHCP (store01) | No | Yes | Yes | Yes | No | No | Medium | Devs | dhcpd | dgalloway | |
| DNS | Could | Yes | N/A | Yes | RHEV/OVH | ns1/ns2 | Low | Devs | named | dgalloway | |
| FOG | No | Yes | No | Yes | RHEV | No | Medium | Devs | fog | dgalloway | |
| LRC | No | Could | No | Some | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
| gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
| RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | |
| Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
===== Scheduled Maintenance Plans =====
==== CI Infrastructure Procedure ====
Updating the dev chacra nodes ({1..5}.chacra.ceph.com) has little chance of affecting upstream teuthology testing except while the chacra service is redeployed or a host is rebooted. Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance. To be extra safe, you can pause the Sepia queue and wait ~30 minutes to make sure no package manager processes are run against a chacra node.
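The "extra safe" check above can be sketched as a quick pre-flight loop. This is a hypothetical helper, not an existing lab script; it assumes you have ssh access to the chacra dev nodes and that ''pgrep'' is available on them.

```shell
# Hypothetical pre-flight helper (not an existing lab script): report any
# package manager still running on each chacra dev node before maintenance.
check_chacra_quiet() {
    for host in "$@"; do
        echo "== $host =="
        # The [a]pt bracket trick stops pgrep from matching the remote
        # shell whose own command line contains the pattern.
        ssh "$host" 'pgrep -a -f "[a]pt|[y]um|[d]nf" || echo "no package manager running"'
    done
}

# Usage (host list from the wiki text above):
# check_chacra_quiet {1..5}.chacra.ceph.com
```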
  - Notify ceph-devel@
  - Log into each Jenkins instance, **Manage Jenkins** -> **Prepare for Shutdown**
  - Again in Jenkins, go to **Manage Jenkins** -> **Manage Plugins**
  - ''apt install linux-image-generic'' (or equivalent to **just** update the kernel)
  - Reboot the host so you're running the latest kernel
  - **Update Slaves**
    - ssh to each static smithi slave (smithi{119..128})
    - ssh to each slave-{centos,ubuntu}-* slave, update packages, and **shut down**
    - Put each irvingi node in Maintenance mode under the **Hosts** tab in the [[https://mgr01.front.sepia.ceph.com/ovirt-engine/webadmin/?locale=en_US#hosts-events|RHEV Web UI]]
    - In the RHEV Web UI, highlight each irvingi host and click **Update**
    - Bring the slave-{centos,ubuntu}-* VMs back online after the irvingis are updated
    - Make sure all static slaves reconnect to Jenkins
  - **Update chacra, mita, shaman, prado**
    - If no service redeploy is needed for chacra, shaman, or mita, just ssh to each of those hosts and ''apt update && apt upgrade && reboot''
    - If a redeploy is needed, see each service's individual wiki page
  - Once all the other CI hosts are up to date, update each Jenkins instance: ''apt upgrade''
    - This should restart Jenkins but if it doesn't, ''systemctl start jenkins''
    - ''systemctl enable jenkins''
  - Spot check a few jobs to make sure all plugins are working properly
    - You can check this by commenting ''jenkins test make check'' in a PR
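The two update paths in the steps above (kernel-only versus full upgrade-and-reboot) can be sketched as small helpers. These are illustrative sketches, not existing lab tooling; they assume Ubuntu hosts reachable over ssh with passwordless sudo.

```shell
# Illustrative helpers for the update paths described above (assumptions:
# Ubuntu hosts, ssh access, passwordless sudo). Not existing lab tooling.
kernel_only_update() {
    # Mirrors the first step: update *just* the kernel, then reboot
    ssh "$1" 'sudo apt update && sudo apt install -y linux-image-generic && sudo reboot'
}

full_update() {
    # Mirrors the chacra/shaman/mita path: full upgrade, then reboot
    ssh "$1" 'sudo apt update && sudo apt upgrade -y && sudo reboot'
}

# Usage: full_update chacra.ceph.com
```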
| ---- | ---- | ||
| - | ==== Public Facing Sites ==== | + | ==== Public Facing Sites Procedure ==== |
| === tracker.ceph.com and www.ceph.com === | === tracker.ceph.com and www.ceph.com === | ||
For the most part, these hosts' packages can be updated and the hosts rebooted at any time. If you're feeling friendly, send a heads-up to ceph-devel and/or ceph-users first.

== Post-update Tasks ==
  * Log in to tracker.ceph.com and modify a bug
  * Spot check a few pages on www.ceph.com
  * Log in to www.ceph.com if you have a Wordpress login
| ---- | ---- | ||
| Line 75: | Line 87: | ||
| === docs.ceph.com === | === docs.ceph.com === | ||
| As long as there isn't a [[https://jenkins.ceph.com/computer/docs.ceph.com/|job]] running, this host can be updated and rebooted whenever. | As long as there isn't a [[https://jenkins.ceph.com/computer/docs.ceph.com/|job]] running, this host can be updated and rebooted whenever. | ||
| + | |||
| == Post-update Tasks == | == Post-update Tasks == | ||
| * Does http://docs.ceph.com load? | * Does http://docs.ceph.com load? | ||
=== download.ceph.com ===
Rebooting this host is disruptive to upstream testing and should be part of a planned outage announced in advance to the ceph-users and ceph-devel mailing lists.
| + | |||
| + | == Post-update Tasks == | ||
| + | * Does https://download.ceph.com load? | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== Sepia Lab Procedure ==== | ||
| + | - Send a planned outage notice to sepia at lists dot ceph.com | ||
| + | - [[services:teuthology#gracefully_stop_workers|Instruct teuthology workers to die]] | ||
| + | - Wait for there to be no jobs running. This can take up to 12 hours. | ||
| + | - If you don't want to wait: | ||
| + | - Ask Yuri if ''scheduled_teuthology'' runs can be killed | ||
| + | - Ask individual devs if their runs can be killed | ||
| + | - Kill idle workers<code> | ||
| + | ssh teuthology.front.sepia.ceph.com | ||
| + | sudo su - teuthworker | ||
| + | bin/worker_kill_idle | ||
| + | </code> | ||
| + | - Once there are no jobs running and no workers (''teuthworker@teuthology:~$ bin/worker_count'') | ||
| + | - Update packages on teuthology.front.sepia.ceph.com | ||
| + | - ''%%shutdown -r +5 "Rebooting during planned maintenance. SAVE YOUR WORK!"%%'' | ||
| + | - Update and reboot: | ||
| + | - labdashboard.front.sepia.ceph.com | ||
| + | - circle.front.sepia.ceph.com | ||
| + | - cobbler.front.sepia.ceph.com | ||
| + | - conserver.front.sepia.ceph.com | ||
| + | - drop.ceph.com | ||
| + | - git.ceph.com | ||
| + | - fog.front.sepia.ceph.com | ||
| + | - ns1.front.sepia.ceph.com | ||
| + | - ns2.front.sepia.ceph.com | ||
| + | - nsupdate.front.sepia.ceph.com | ||
| + | - vpn-pub.ovh.sepia.ceph.com | ||
| + | - satellite.front.sepia.ceph.com | ||
| + | - sentry.front.sepia.ceph.com | ||
| + | - pulpito.front.sepia.ceph.com | ||
| + | - Finally, update and reboot gw.sepia.ceph.com ((See [[services:rhev#emergency_rhev_web_ui_access_w_o_vpn]] for emergency backup plan)) | ||
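The "no jobs running and no workers" condition above can be automated with a small polling loop. This is a sketch, assuming ''bin/worker_count'' prints a bare integer when run as teuthworker on teuthology.front.sepia.ceph.com.

```shell
# Sketch of a polling loop for the "no jobs, no workers" condition above.
# Assumes the command passed in prints a bare integer count.
wait_for_zero() {
    while [ "$( $1 )" -gt 0 ]; do
        sleep 60    # re-check every minute
    done
}

# Run as teuthworker on teuthology.front.sepia.ceph.com:
# wait_for_zero bin/worker_count
```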
| + | |||
| + | === Sepia Lab Post Maintenance Tasks === | ||
| + | - ''ssh $(whoami)@8.43.84.131'' (circle.front/GFW proxy) | ||
| + | - Log in to [[https://cobbler.front.sepia.ceph.com/cobbler_web/system/list|Cobbler]] | ||
| + | - ''console smithi001.front.sepia.ceph.com'' (Do you get a SOL session?) | ||
| + | - ''ssh drop.ceph.com'' | ||
| + | - Does git.ceph.com load? Is it up to date? | ||
| + | - Log in to [[http://fog.front.sepia.ceph.com/fog/management/index.php|FOG]] ((Updating the database can break FOG. See https://forums.fogproject.org/topic/10006/ubuntu-is-fog-s-enemy)) | ||
| + | - ''dig smithi001.front.sepia.ceph.com @vpn-pub.ovh.sepia.ceph.com'' | ||
| + | - ''dig smithi001.front.sepia.ceph.com @ns1.front.sepia.ceph.com'' | ||
| + | - ''dig smithi001.front.sepia.ceph.com @ns2.front.sepia.ceph.com'' | ||
| + | - ''%%teuthology-lock --lock-many 1 -m ovh --os-type ubuntu --os-version 16.04%%'' (Do you get a functioning OVH node? ''dig'' its hostname against vpn-pub and make sure nsupdate is working) | ||
| + | - ''%%teuthology-lock --lock-many 1 -m smithi --os-type rhel%%'' | ||
| + | - Run ceph-cm-ansible against that host and verify it subscribes to Satellite and can yum update | ||
| + | - Does http://sentry.ceph.com/sepia load? | ||
| + | - Does http://pulpito.ceph.com load? | ||
| + | - Verify all the reverse proxies in ''/etc/nginx/sites-enabled'' on gw.sepia.ceph.com are accessible | ||
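The reverse-proxy check in the last step can be scripted roughly as follows. This is a sketch to run on gw.sepia.ceph.com; the ''server_name'' extraction is a plain-text heuristic, not a real nginx config parser.

```shell
# Rough reverse-proxy spot check (run on gw.sepia.ceph.com). The awk line
# is a plain-text heuristic for server_name, not a real nginx parser.
check_proxies() {
    for conf in /etc/nginx/sites-enabled/*; do
        [ -f "$conf" ] || continue
        for name in $(awk '/server_name/ {gsub(";",""); print $2}' "$conf"); do
            # Print the HTTP status for each vhost, or FAIL if curl errors out
            printf '%-40s %s\n' "$name" \
                "$(curl -s -o /dev/null -w '%{http_code}' "https://$name" || echo FAIL)"
        done
    done
}

# check_proxies
```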
| + | |||
| + | Finally, once all post-maintenance tasks are complete,<code> | ||
| + | ssh teuthology.front.sepia.ceph.com | ||
| + | sudo su - teuthworker | ||
| + | bin/worker_start smithi 25 | ||
| + | ^D^D | ||
| + | </code> | ||
| + | |||
| + | Check how many running workers there should be in ''/home/teuthworker/bin/worker_start'' and start **1/4** of them at a time. If too many start at once, they can overwhelm the teuthology VM with ansible processes or overwhelm FOG with Deploy tasks. | ||
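The 1/4-at-a-time advice can be sketched as a batching loop. TOTAL here is illustrative (read the real number out of ''/home/teuthworker/bin/worker_start''), and the ''echo'' guard means this prints the commands rather than running them.

```shell
# Batching sketch for the "start 1/4 of the workers at a time" advice.
# TOTAL is illustrative; take the real count from bin/worker_start.
TOTAL=100
BATCH=$(( TOTAL / 4 ))
STARTED=0
while [ "$STARTED" -lt "$TOTAL" ]; do
    # Drop the 'echo' to actually start a batch
    echo "bin/worker_start smithi $BATCH"
    STARTED=$(( STARTED + BATCH ))
    sleep 1     # in practice, wait several minutes between batches
done
```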
| + | |||
| + | ===== Boilerplate Outage Notices ===== | ||
| + | ==== CI ==== | ||
| + | <code> | ||
| + | Hi All, | ||
| + | |||
| + | A scheduled maintenance of the CI Infrastructure is planned for YYYY-MM-DD at HH:MM UTC. | ||
| + | |||
| + | We will be updating and rebooting the following hosts: | ||
| + | jenkins.ceph.com | ||
| + | 2.jenkins.ceph.com | ||
| + | chacra.ceph.com | ||
| + | {1..5}.chacra.ceph.com | ||
| + | shaman.ceph.com | ||
| + | 1.shaman.ceph.com | ||
| + | 2.shaman.ceph.com | ||
| + | |||
| + | This means: | ||
| + | - Jenkins will be paused and stop processing new jobs so PR checks will be delayed | ||
| + | - Once there are no jobs running, all hosts will be updated and rebooted | ||
| + | - Repos on chacra nodes will be temporarily unavailable | ||
| + | |||
| + | Let me know if you have any questions/concerns. | ||
| + | |||
| + | Thanks, | ||
| + | </code> | ||
| + | |||
| + | ==== Sepia Lab ==== | ||
| + | <code> | ||
| + | Hi All, | ||
| + | |||
| + | A scheduled maintenance of the Sepia Lab Infrastructure is planned for YYYY-MM-DD at HH:MM UTC. | ||
| + | |||
| + | We will be updating and rebooting the following hosts: | ||
| + | teuthology.front.sepia.ceph.com | ||
| + | labdashboard.front.sepia.ceph.com | ||
| + | circle.front.sepia.ceph.com | ||
| + | cobbler.front.sepia.ceph.com | ||
| + | conserver.front.sepia.ceph.com | ||
| + | fog.front.sepia.ceph.com | ||
| + | ns1.front.sepia.ceph.com | ||
| + | ns2.front.sepia.ceph.com | ||
| + | nsupdate.front.sepia.ceph.com | ||
| + | vpn-pub.ovh.sepia.ceph.com | ||
| + | satellite.front.sepia.ceph.com | ||
| + | sentry.front.sepia.ceph.com | ||
| + | pulpito.front.sepia.ceph.com | ||
| + | drop.ceph.com | ||
| + | git.ceph.com | ||
| + | gw.sepia.ceph.com | ||
| + | |||
| + | This means: | ||
| + | - teuthology workers will be instructed to die and new jobs will not be started until the maintenance is complete | ||
| + | - DNS may be temporarily unavailable | ||
| + | - All aforementioned hosts will be temporarily unavailable for a brief time | ||
| + | - Your VPN connection will need to be restarted | ||
| + | |||
| + | I will send a follow-up "all clear" e-mail as a reply to this one once the maintenance is complete. | ||
| + | |||
| + | Let me know if you have any questions/concerns. | ||
| + | |||
| + | Thanks, | ||
| + | </code> | ||