tasks:scheduled-maintenance (current revision: 2018/04/03 21:30 by djgalloway)
===== Maintenance Matrix =====
^ Service ^ Disrupts Community ^ Disrupts Developers ^ Backups ^ Ansible ((Can the host be restored/rebuilt with Ansible?)) ^ HA ^ Failover ^ Risk ^ Stakeholders ^ Applications ^ SME ^ Other Notes ^
| www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | N/A | Wordpress, nginx | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins |
| download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
| tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky. None of us are ruby experts. |
| docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway | |
| chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
| chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
| shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
| {apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
| jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
| prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
| git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
| teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
| pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles. Still on baremetal mira |
| paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
| Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
| conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
| DHCP (store01) | No | Yes | Yes | Yes | No | No | Medium | Devs | dhcpd | dgalloway | |
| DNS | Could | Yes | N/A | Yes | RHEV/OVH | ns1/ns2 | Low | Devs | named | dgalloway | |
| FOG | No | Yes | No | Yes | RHEV | No | Medium | Devs | fog | dgalloway | |
| LRC | No | Could | No | Some | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
| gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
| RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | |
| Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
===== Scheduled Maintenance Plans =====
Updating the dev chacra nodes ({1..5}.chacra.ceph.com) has little chance of affecting upstream teuthology testing except while the chacra service is being redeployed or a host is rebooting. Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance. To be extra safe, you could pause the Sepia queue and wait ~30min to make sure no package manager processes run against a chacra node.
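The one-node-at-a-time approach described above can be sketched as a short shell loop. This is only a sketch: the ''root'' SSH user, the 120s settle delay, and running the whole thing non-interactively are assumptions, so the destructive ''main'' call is left commented out.

```shell
#!/usr/bin/env bash
# Sketch: update the dev chacra nodes one at a time so repos stay mostly available.
# Assumptions (not from the wiki): root SSH access and a 120s wait between nodes.
set -euo pipefail

chacra_hosts() {
    # Brace expansion yields 1.chacra.ceph.com .. 5.chacra.ceph.com
    printf '%s\n' {1..5}.chacra.ceph.com
}

update_host() {
    local host=$1
    # reboot drops the connection, so tolerate a non-zero ssh exit status
    ssh "root@${host}" 'apt-get update && apt-get -y upgrade && reboot' || true
    sleep 120   # give the node time to come back before touching the next one
}

main() {
    while read -r host; do
        update_host "$host"
    done < <(chacra_hosts)
}

# main          # uncomment to actually perform the rolling update
chacra_hosts    # dry run: just list the nodes that would be touched
```
Keeping the loop serial is the point: at most one chacra node's repos are offline at any moment.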
  - Notify ceph-devel@
  - Log into each Jenkins instance, **Manage Jenkins** -> **Prepare for Shutdown**
  - Again in Jenkins, go to **Manage Jenkins** -> **Manage Plugins**
  - ''apt install linux-image-generic'' (or equivalent to **just** update the kernel)
  - Reboot the host so you're running the latest kernel
  - **Update Slaves**
    - ssh to each static smithi slave (smithi{119..128} as of this writing), update packages, and reboot
    - ssh to each slave-{centos,ubuntu}-* slave, update packages, and **shut down**
    - Put each irvingi node in Maintenance mode under the **Hosts** tab in the [[https://mgr01.front.sepia.ceph.com/ovirt-engine/webadmin/?locale=en_US#hosts-events|RHEV Web UI]]
    - In the RHEV Web UI, highlight each irvingi host and click **Update**
    - Bring the slave-{centos,ubuntu}-* VMs back online after the irvingis are updated
    - Make sure all static slaves reconnect to Jenkins
  - **Update chacra, mita, shaman, prado**
    - If no service redeploy is needed for chacra, shaman, or mita, just ssh to each of those hosts and ''apt update && apt upgrade && reboot''
    - If a redeploy is needed, see each service's individual wiki page
  - Once all the other CI hosts are up to date, update each Jenkins instance: ''apt upgrade''
    - This should restart Jenkins but if it doesn't, ''systemctl start jenkins''
    - ''systemctl enable jenkins''
  - Spot check a few jobs to make sure all plugins are working properly
    - You can check this by commenting ''jenkins test make check'' in a PR
    - Make sure the Github hooks are working (Was a job triggered? When the job finishes, does it update the status in the PR?)
    - Make sure postbuild scripts are running when they're supposed to (Is the FAILURE script running when the build PASSED? That's a problem)
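The static-slave update pass above can be sketched as a dry run that prints the commands it would issue. The smithi range is the one listed "as of this writing"; the ''slave-centos-01''/''slave-ubuntu-01'' VM names are hypothetical stand-ins for the real slave-{centos,ubuntu}-* inventory.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the static-slave update pass: smithi slaves get rebooted,
# while the slave VMs are shut down so the irvingi hypervisors can be updated.
set -euo pipefail

plan_slave_updates() {
    for host in smithi{119..128}; do
        echo "ssh ${host} 'apt-get update && apt-get -y upgrade && reboot'"
    done
    for host in slave-centos-01 slave-ubuntu-01; do   # hypothetical VM names
        echo "ssh ${host} 'apt-get update && apt-get -y upgrade && shutdown -h now'"
    done
}

plan_slave_updates   # prints the plan; pipe into 'sh' only once reviewed
```
Printing the plan first makes it easy to confirm that only the intended hosts are touched before anything runs.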
| ---- | ---- | ||
=== download.ceph.com ===
Rebooting this host is disruptive to upstream testing and should be part of a planned outage announced in advance to the ceph-users and ceph-devel mailing lists.
== Post-update Tasks ==
  - ''dig smithi001.front.sepia.ceph.com @ns1.front.sepia.ceph.com''
  - ''dig smithi001.front.sepia.ceph.com @ns2.front.sepia.ceph.com''
  - ''%%teuthology-lock --lock-many 1 -m ovh --os-type ubuntu --os-version 16.04%%'' (Do you get a functioning OVH node? ''dig'' its hostname against vpn-pub and make sure nsupdate is working)
  - ''%%teuthology-lock --lock-many 1 -m smithi --os-type rhel%%''
  - Run ceph-cm-ansible against that host and verify it subscribes to Satellite and can yum update
  - Does http://sentry.ceph.com/sepia load?
  - Does http://pulpito.ceph.com load?
  - Verify all the reverse proxies in ''/etc/nginx/sites-enabled'' on gw.sepia.ceph.com are accessible
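The ns1/ns2 consistency checks in the list above can be sketched as a small loop. Since ''dig'' and lab DNS reachability can't be assumed outside the VPN, the sketch only prints the queries it would run; in real use you would compare the two ''+short'' answers (e.g. with ''diff'').

```shell
#!/usr/bin/env bash
# Sketch: build the post-maintenance DNS checks for a sample record.
# Assumption: checking one smithi record against both nameservers is enough
# to confirm named came back on each.
set -euo pipefail

name=smithi001.front.sepia.ceph.com

dns_checks() {
    for ns in ns1 ns2; do
        echo "dig +short ${name} @${ns}.front.sepia.ceph.com"
    done
}

dns_checks   # dry run: prints one query per nameserver
```
Running both queries and diffing their output catches the common failure mode where one nameserver restarted with a stale zone.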
Finally, once all post-maintenance tasks are complete,<code>
Check how many running workers there should be in ''/home/teuthworker/bin/worker_start'' and start **1/4** of them at a time. If too many start at once, they can overwhelm the teuthology VM with ansible processes or overwhelm FOG with Deploy tasks.
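The quarter-at-a-time restart can be sketched as below. The total of 32 workers is an assumed example; read the real count from ''/home/teuthworker/bin/worker_start'' as the paragraph above says.

```shell
#!/usr/bin/env bash
# Sketch: plan worker startup in batches of 1/4 of the total, pausing between
# batches so ansible and FOG aren't overwhelmed by simultaneous deploys.
set -euo pipefail

total=32                       # assumption: real value comes from worker_start
batch=$(( (total + 3) / 4 ))   # a quarter of the total, rounded up

plan_batches() {
    local started=0 n
    while [ "$started" -lt "$total" ]; do
        n=$(( total - started < batch ? total - started : batch ))
        echo "start ${n} workers (then wait for them to settle)"
        started=$(( started + n ))
    done
}

plan_batches   # prints the four batches for total=32
```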
| + | |||
| + | ===== Boilerplate Outage Notices ===== | ||
| + | ==== CI ==== | ||
| + | <code> | ||
| + | Hi All, | ||
| + | |||
| + | A scheduled maintenance of the CI Infrastructure is planned for YYYY-MM-DD at HH:MM UTC. | ||
| + | |||
| + | We will be updating and rebooting the following hosts: | ||
| + | jenkins.ceph.com | ||
| + | 2.jenkins.ceph.com | ||
| + | chacra.ceph.com | ||
| + | {1..5}.chacra.ceph.com | ||
| + | shaman.ceph.com | ||
| + | 1.shaman.ceph.com | ||
| + | 2.shaman.ceph.com | ||
| + | |||
| + | This means: | ||
| + | - Jenkins will be paused and stop processing new jobs so PR checks will be delayed | ||
| + | - Once there are no jobs running, all hosts will be updated and rebooted | ||
| + | - Repos on chacra nodes will be temporarily unavailable | ||
| + | |||
| + | Let me know if you have any questions/concerns. | ||
| + | |||
| + | Thanks, | ||
| + | </code> | ||
| + | |||
| + | ==== Sepia Lab ==== | ||
| + | <code> | ||
| + | Hi All, | ||
| + | |||
| + | A scheduled maintenance of the Sepia Lab Infrastructure is planned for YYYY-MM-DD at HH:MM UTC. | ||
| + | |||
| + | We will be updating and rebooting the following hosts: | ||
| + | teuthology.front.sepia.ceph.com | ||
| + | labdashboard.front.sepia.ceph.com | ||
| + | circle.front.sepia.ceph.com | ||
| + | cobbler.front.sepia.ceph.com | ||
| + | conserver.front.sepia.ceph.com | ||
| + | fog.front.sepia.ceph.com | ||
| + | ns1.front.sepia.ceph.com | ||
| + | ns2.front.sepia.ceph.com | ||
| + | nsupdate.front.sepia.ceph.com | ||
| + | vpn-pub.ovh.sepia.ceph.com | ||
| + | satellite.front.sepia.ceph.com | ||
| + | sentry.front.sepia.ceph.com | ||
| + | pulpito.front.sepia.ceph.com | ||
| + | drop.ceph.com | ||
| + | git.ceph.com | ||
| + | gw.sepia.ceph.com | ||
| + | |||
| + | This means: | ||
| + | - teuthology workers will be instructed to die and new jobs will not be started until the maintenance is complete | ||
| + | - DNS may be temporarily unavailable | ||
| + | - All aforementioned hosts will be temporarily unavailable for a brief time | ||
| + | - Your VPN connection will need to be restarted | ||
| + | |||
| + | I will send a follow-up "all clear" e-mail as a reply to this one once the maintenance is complete. | ||
| + | |||
| + | Let me know if you have any questions/concerns. | ||
| + | |||
| + | Thanks, | ||
| + | </code> | ||