tasks:scheduled-maintenance

tasks:scheduled-maintenance [2017/12/21 21:06]
djgalloway [Frequency/Timing]
tasks:scheduled-maintenance [2018/04/03 21:30] (current)
djgalloway
Line 5: Line 5:
  
 ===== Frequency/Timing =====
-The PnT Labs team has scheduled monthly maintenance on the second Friday of every month.  While this maintenance only has the potential to disrupt operations in the downstream/Octo lab, it could be a good candidate.  Maintenance on Fridays allows for troubleshooting and rollbacks, if necessary, either late Friday evening or on Saturdays during a time when the lab is usually only running scheduled test suites.
+The PnT Labs team has scheduled monthly maintenance on the second Friday of every month.  Since this maintenance only has the potential to disrupt operations in the downstream/Octo lab, it could be a good candidate.  Maintenance on Fridays allows for troubleshooting and rollbacks, if necessary, either late Friday evening or on Saturdays during a time when the lab is usually only running scheduled test suites.
  
 Maintenance should not disrupt testing cycles when an upstream release is imminent.
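
To make the "second Friday of every month" rule concrete, here is a minimal sketch that computes that date. It assumes GNU ''date'' (for the ''-d'' option); the function name ''second_friday'' is made up for illustration and is not part of any lab tooling.

```shell
# second_friday YEAR MM -> prints the second Friday of that month as YYYY-MM-DD.
# Assumes GNU date. MM should be zero-padded (e.g. 04).
second_friday() (
  year=$1
  month=$2
  # Day of week for the 1st of the month: 1 = Monday ... 7 = Sunday
  dow=$(date -u -d "${year}-${month}-01" +%u)
  # Days from the 1st to the first Friday (weekday 5), plus one more week
  offset=$(( (5 - dow + 7) % 7 + 7 ))
  date -u -d "${year}-${month}-01 +${offset} days" +%F
)
```

For example, ''second_friday 2018 04'' prints ''2018-04-13''.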
Line 13: Line 13:
 ===== Maintenance Matrix =====
  
-^ Service ^ Disrupts Community ^ Disrupts Developers ^ Backups ^ Ansible ^ HA ^ Failover ^ Risk ^ Stakeholders ^ Applications ^ SME ^ Other Notes ^
-|www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | Wordpress | lvaz, Age of Peers | |
-|download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
-|tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky.  None of us are ruby experts. |
-|docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | Nginx | dgalloway | |
-|chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
-|chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
-|shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
-|{apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
-|jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
-|prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
-|git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
-|teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
-|pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles.  Still on baremetal mira |
-|paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
-|Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
-|conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
-|DHCP (store01) | No | Yes | Yes | No | No | No | Medium | Devs | dhcpd | dgalloway | |
-|DNS | Could | Yes | N/A | Yes | RHEV/OVH | Not currently | Low | Devs | named | dgalloway | |
-|FOG | No | Yes | No | Little | RHEV | No | Medium | Devs | fog | dgalloway | |
-|LRC | No | Could | No | No | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
-|gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
-|RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | Packages are a mix between CentOS gluster and RHGS |
-|Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
+^ Service ^ Disrupts Community ^ Disrupts Developers ^ Backups ^ Ansible ((Can the host be restored/rebuilt with Ansible?)) ^ HA ^ Failover ^ Risk ^ Stakeholders ^ Applications ^ SME ^ Other Notes ^
+| www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | N/A | Wordpress, nginx | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins |
+| download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
+| tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky.  None of us are ruby experts. |
+| docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway | |
+| chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
+| chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
+| shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
+| {apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
+| jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
+| prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
+| git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
+| teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
+| pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles.  Still on baremetal mira |
+| paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
+| Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
+| conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
+| DHCP (store01) | No | Yes | Yes | Yes | No | No | Medium | Devs | dhcpd | dgalloway | |
+| DNS | Could | Yes | N/A | Yes | RHEV/OVH | ns1/ns2 | Low | Devs | named | dgalloway | |
+| FOG | No | Yes | No | Yes | RHEV | No | Medium | Devs | fog | dgalloway | |
+| LRC | No | Could | No | Some | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
+| gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
+| RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | |
+| Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
  
 +===== Scheduled Maintenance Plans =====
 +==== CI Infrastructure Procedure ====
 +Updating the dev chacra nodes ({1..5}.chacra.ceph.com) is unlikely to affect upstream teuthology testing except while the chacra service is being redeployed or a host is rebooting.  Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance.  To be extra safe, you could pause the Sepia queue and wait ~30min to make sure no package manager processes are running against a chacra node.
 +
 +  - Notify ceph-devel@
 +  - Log into each Jenkins instance, **Manage Jenkins** -> **Prepare for Shutdown**
 +  - Again in Jenkins, go to **Manage Jenkins** -> **Manage Plugins**
 +  - Select **All** at the bottom and click **Download now and install after restart**
 +  - Wait for all jobs to finish and make sure plugins are downloaded
 +  - Once all jobs are completed, ssh to each Jenkins instance
 +    - Updating the jenkins package or rebooting will restart the service, so stop and disable it first:
 +    - ''​systemctl stop jenkins''​
 +    - ''​systemctl disable jenkins''​
 +    - ''​apt update''​
 +    - ''​apt install linux-image-generic''​ (or equivalent to **just** update the kernel)
 +    - Reboot the host so you're running the latest kernel
 +  - **Update Slaves**
 +    - ssh to each static smithi slave (smithi{119..128})
 +    - ssh to each slave-{centos,​ubuntu}-* slave, update packages, and **shut down**
 +    - Put each irvingi node in Maintenance mode under the **Hosts** tab in the [[https://​mgr01.front.sepia.ceph.com/​ovirt-engine/​webadmin/?​locale=en_US#​hosts-events|RHEV Web UI]]
 +    - In the RHEV Web UI, highlight each irvingi host and click **Update**
 +    - Bring slave-{centos,ubuntu}-* VMs back online after the irvingi hosts have been updated
 +    - Make sure all static slaves reconnect to Jenkins
 +  - **Update chacra, mita, shaman, prado**
 +    - If no service redeploy is needed for chacra, shaman, or mita, just ssh to each of those hosts and ''​apt update && apt upgrade && reboot''​
 +    - If a redeploy is needed, see each service'​s individual wiki page
 +  - Once all the other CI hosts are up to date, update each Jenkins instance: ''​apt upgrade''​
 +  - This should restart Jenkins but if it doesn'​t,​ ''​systemctl start jenkins''​
 +  - ''​systemctl enable jenkins''​
 +  - Spot check a few jobs to make sure all plugins are working properly
 +    - You can check this by commenting ''​jenkins test make check''​ in a PR
 +    - Make sure the GitHub hooks are working (Was a job triggered? When the job finishes, does it update the status in the PR?)
 +    - Make sure postbuild scripts are running when they'​re supposed to (Is the FAILURE script running when the build PASSED? ​ That's a problem)
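
As a rough sketch of the "no redeploy needed" path above, the following prints (but does not run) the update command for each chacra and shaman host. The host list is taken from the CI outage notice on this page and may not match the current inventory; mita and prado are left out because their hostnames are not spelled out in the procedure. The function names are made up for illustration.

```shell
# List the chacra/shaman hosts named in the CI outage notice (an assumption:
# adjust to match the real inventory).
ci_hosts() {
  echo chacra.ceph.com
  for i in 1 2 3 4 5; do
    echo "${i}.chacra.ceph.com"
  done
  echo shaman.ceph.com
  echo 1.shaman.ceph.com
  echo 2.shaman.ceph.com
}

# Dry run: print the command an admin would run on each host.
print_ci_updates() {
  ci_hosts | while read -r host; do
    echo "ssh ${host} 'apt update && apt upgrade && reboot'"
  done
}
```

Printing the commands first makes it easy to review the host list before anything is rebooted.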
 +
 +----
 +
 +==== Public Facing Sites Procedure ====
 +=== tracker.ceph.com and www.ceph.com ===
 +For the most part, these hosts' packages can be updated and hosts rebooted whenever. ​ If you're feeling friendly, you could send a heads up to ceph-devel and/or ceph-users.  ​
 +
 +== Post-update Tasks ==
 +  * Log in to tracker.ceph.com and modify a bug
 +  * Spot check a few pages on www.ceph.com
 +  * Log in to www.ceph.com if you have a login to wordpress
 +
 +----
 +
 +=== docs.ceph.com ===
 +As long as there isn't a [[https://​jenkins.ceph.com/​computer/​docs.ceph.com/​|job]] running, this host can be updated and rebooted whenever.
 +
 +== Post-update Tasks ==
 +  * Does http://​docs.ceph.com load?
 +  * Is the host reattached as a slave in [[https://​jenkins.ceph.com/​computer/​docs.ceph.com/​|Jenkins]]?​
 +
 +----
 +
 +=== download.ceph.com ===
 +Rebooting this host is disruptive to upstream testing and should be part of a planned outage announced in advance on the ceph-users and ceph-devel mailing lists.
 +
 +== Post-update Tasks ==
 +  * Does https://​download.ceph.com load?
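
The "does it load?" checks for the public sites could be scripted along these lines. This is only a sketch: it assumes ''curl'' is installed, treats any 2xx or 3xx response as "loads", and the function names are hypothetical.

```shell
# Succeed for 2xx/3xx HTTP status codes, fail for anything else
# (curl reports 000 when the connection itself fails).
status_ok() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Print only the HTTP status code for a URL.
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Report OK/FAIL for one URL.
check_site() {
  if status_ok "$(http_status "$1")"; then
    echo "OK   $1"
  else
    echo "FAIL $1"
  fi
}
```

Usage would be something like ''check_site https://download.ceph.com''. Logging in to tracker.ceph.com or Wordpress still has to be checked by hand.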
 +
 +----
 +
 +==== Sepia Lab Procedure ====
 +  - Send a planned outage notice to sepia at lists dot ceph.com
 +  - [[services:​teuthology#​gracefully_stop_workers|Instruct teuthology workers to die]]
 +    - Wait for there to be no jobs running. ​ This can take up to 12 hours.
 +    - If you don't want to wait:
 +      - Ask Yuri if ''​scheduled_teuthology''​ runs can be killed
 +      - Ask individual devs if their runs can be killed
 +  - Kill idle workers<​code>​
 +ssh teuthology.front.sepia.ceph.com
 +sudo su - teuthworker
 +bin/​worker_kill_idle
 +</​code>​
 +  - Once there are no jobs running and no workers (''​teuthworker@teuthology:​~$ bin/​worker_count''​)
 +    - Update packages on teuthology.front.sepia.ceph.com
 +    - ''​%%shutdown -r +5 "​Rebooting during planned maintenance. ​ SAVE YOUR WORK!"​%%''​
 +    - Update and reboot:
 +      - labdashboard.front.sepia.ceph.com
 +      - circle.front.sepia.ceph.com
 +      - cobbler.front.sepia.ceph.com
 +      - conserver.front.sepia.ceph.com
 +      - drop.ceph.com
 +      - git.ceph.com
 +      - fog.front.sepia.ceph.com
 +      - ns1.front.sepia.ceph.com
 +      - ns2.front.sepia.ceph.com
 +      - nsupdate.front.sepia.ceph.com
 +      - vpn-pub.ovh.sepia.ceph.com
 +      - satellite.front.sepia.ceph.com
 +      - sentry.front.sepia.ceph.com
 +      - pulpito.front.sepia.ceph.com
 +  - Finally, update and reboot gw.sepia.ceph.com ((See [[services:​rhev#​emergency_rhev_web_ui_access_w_o_vpn]] for emergency backup plan))
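
Waiting for the workers to drain can take hours, so a polling loop may help. The sketch below assumes ''bin/worker_count'' prints a bare integer, which this page does not confirm; ''wait_until_zero'' is a made-up helper name.

```shell
# wait_until_zero COUNT_CMD [INTERVAL_SECONDS]
# Polls COUNT_CMD until it prints 0, sleeping INTERVAL_SECONDS between tries.
# Assumption: COUNT_CMD prints a single integer on stdout.
wait_until_zero() (
  count_cmd=$1
  interval=${2:-60}
  while :; do
    n=$($count_cmd)
    if [ "$n" = "0" ]; then
      break
    fi
    echo "still ${n} worker(s) running; sleeping ${interval}s" >&2
    sleep "$interval"
  done
  echo "no workers running"
)
```

As the teuthworker user on the teuthology host this might be called as ''wait_until_zero bin/worker_count 300''.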
 +
 +=== Sepia Lab Post Maintenance Tasks ===
 +  - ''​ssh $(whoami)@8.43.84.131''​ (circle.front/​GFW proxy)
 +  - Log in to [[https://​cobbler.front.sepia.ceph.com/​cobbler_web/​system/​list|Cobbler]]
 +  - ''​console smithi001.front.sepia.ceph.com''​ (Do you get a SOL session?)
 +  - ''​ssh drop.ceph.com''​
 +  - Does git.ceph.com load?  Is it up to date?
 +  - Log in to [[http://​fog.front.sepia.ceph.com/​fog/​management/​index.php|FOG]] ((Updating the database can break FOG.  See https://​forums.fogproject.org/​topic/​10006/​ubuntu-is-fog-s-enemy))
 +  - ''​dig smithi001.front.sepia.ceph.com @vpn-pub.ovh.sepia.ceph.com''​
 +  - ''​dig smithi001.front.sepia.ceph.com @ns1.front.sepia.ceph.com''​
 +  - ''​dig smithi001.front.sepia.ceph.com @ns2.front.sepia.ceph.com''​
 +  - ''​%%teuthology-lock --lock-many 1 -m ovh --os-type ubuntu --os-version 16.04%%''​ (Do you get a functioning OVH node?  ''​dig''​ its hostname against vpn-pub and make sure nsupdate is working)
 +  - ''​%%teuthology-lock --lock-many 1 -m smithi --os-type rhel%%''​
 +    - Run ceph-cm-ansible against that host and verify it subscribes to Satellite and can yum update
 +  - Does http://​sentry.ceph.com/​sepia load?
 +  - Does http://​pulpito.ceph.com load?
 +  - Verify all the reverse proxies in ''/​etc/​nginx/​sites-enabled''​ on gw.sepia.ceph.com are accessible
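
The three ''dig'' checks in the list above can be folded into one loop. A minimal sketch, assuming ''dig'' is installed and that an empty ''+short'' answer means the lookup failed; the ''DIG'' override variable and the function name are hypothetical and exist only so the loop can be exercised without touching the real name servers.

```shell
# Name servers queried in the post-maintenance checklist.
RESOLVERS="vpn-pub.ovh.sepia.ceph.com ns1.front.sepia.ceph.com ns2.front.sepia.ceph.com"

# check_dns NAME -> one OK/FAIL line per resolver.
check_dns() (
  name=$1
  dig_cmd=${DIG:-dig}   # hypothetical override, defaults to the real dig
  for ns in $RESOLVERS; do
    answer=$($dig_cmd +short "$name" "@$ns")
    if [ -n "$answer" ]; then
      echo "OK   $ns -> $answer"
    else
      echo "FAIL $ns returned no answer"
    fi
  done
)
```

For example: ''check_dns smithi001.front.sepia.ceph.com''.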
 +
 +Finally, once all post-maintenance tasks are complete,<​code>​
 +ssh teuthology.front.sepia.ceph.com
 +sudo su - teuthworker
 +bin/​worker_start smithi 25
 +^D^D
 +</​code>​
 +
 +Check how many running workers there should be in ''/​home/​teuthworker/​bin/​worker_start''​ and start **1/4** of them at a time.  If too many start at once, they can overwhelm the teuthology VM with ansible processes or overwhelm FOG with Deploy tasks.
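
To illustrate the "1/4 at a time" advice, this sketch splits a target worker count into four batches and prints the corresponding commands. It assumes each ''bin/worker_start smithi N'' call starts N additional workers, which this page does not state explicitly; ''worker_batches'' is a made-up name.

```shell
# worker_batches TOTAL -> prints one worker_start command per batch,
# each batch being (at most) a quarter of TOTAL, rounded up.
worker_batches() (
  total=$1
  batch=$(( (total + 3) / 4 ))
  remaining=$total
  while [ "$remaining" -gt 0 ]; do
    n=$batch
    if [ "$n" -gt "$remaining" ]; then
      n=$remaining
    fi
    echo "bin/worker_start smithi $n"
    remaining=$(( remaining - n ))
  done
)
```

With a target of 100 workers this prints ''bin/worker_start smithi 25'' four times; leave a pause between batches so ansible and FOG can keep up.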
 +
 +===== Boilerplate Outage Notices =====
 +==== CI ====
 +<​code>​
 +Hi All,
 +
 +A scheduled maintenance of the CI Infrastructure is planned for YYYY-MM-DD at HH:MM UTC.
 +
 +We will be updating and rebooting the following hosts:
 +jenkins.ceph.com
 +2.jenkins.ceph.com
 +chacra.ceph.com
 +{1..5}.chacra.ceph.com
 +shaman.ceph.com
 +1.shaman.ceph.com
 +2.shaman.ceph.com
 +
 +This means:
 +  - Jenkins will be paused and stop processing new jobs so PR checks will be delayed
 +  - Once there are no jobs running, all hosts will be updated and rebooted
 +  - Repos on chacra nodes will be temporarily unavailable
 +
 +Let me know if you have any questions/​concerns.
 +
 +Thanks,
 +</​code>​
 +
 +==== Sepia Lab ====
 +<​code>​
 +Hi All,
 +
 +A scheduled maintenance of the Sepia Lab Infrastructure is planned for YYYY-MM-DD at HH:MM UTC.
 +
 +We will be updating and rebooting the following hosts:
 +teuthology.front.sepia.ceph.com
 +labdashboard.front.sepia.ceph.com
 +circle.front.sepia.ceph.com
 +cobbler.front.sepia.ceph.com
 +conserver.front.sepia.ceph.com
 +fog.front.sepia.ceph.com
 +ns1.front.sepia.ceph.com
 +ns2.front.sepia.ceph.com
 +nsupdate.front.sepia.ceph.com
 +vpn-pub.ovh.sepia.ceph.com
 +satellite.front.sepia.ceph.com
 +sentry.front.sepia.ceph.com
 +pulpito.front.sepia.ceph.com
 +drop.ceph.com
 +git.ceph.com
 +gw.sepia.ceph.com
 +
 +This means:
 +  - teuthology workers will be instructed to die and new jobs will not be started until the maintenance is complete
 +  - DNS may be temporarily unavailable
 +  - All aforementioned hosts will be briefly unavailable
 +  - Your VPN connection will need to be restarted
 +
 +I will send a follow-up "all clear" e-mail as a reply to this one once the maintenance is complete.
 +
 +Let me know if you have any questions/​concerns.
 +
 +Thanks,
 +</​code>​
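
The ''YYYY-MM-DD'' and ''HH:MM'' placeholders in the notices above can be filled in with ''sed''. A small sketch; ''fill_notice'' is a hypothetical helper that reads the boilerplate on stdin.

```shell
# fill_notice DATE TIME < notice.txt
# Replaces the first YYYY-MM-DD and HH:MM placeholder on each line.
fill_notice() (
  when_date=$1
  when_time=$2
  sed -e "s/YYYY-MM-DD/${when_date}/" -e "s/HH:MM/${when_time}/"
)
```

For example, ''fill_notice 2018-04-13 21:00 < sepia-notice.txt'' (the file name is only an example) writes the completed notice to stdout.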
tasks/scheduled-maintenance.1513890398.txt.gz · Last modified: 2017/12/21 21:06 by djgalloway