====== Planned/Scheduled Maintenance Procedures ======
Historically, infrastructure maintenance has been performed on an as-needed, "if it ain't broke, don't fix it" basis. Now that most Infrastructure services are highly available and the lab network, services, and test suites have stabilized, we can afford occasional downtime to perform system updates, hardware swaps, OS upgrades, etc.
This document (like all good documentation) will be ever-evolving.
===== Frequency/Timing =====
The PnT Labs team has scheduled monthly maintenance on the second Friday of every month. Since that maintenance only has the potential to disrupt operations in the downstream/Octo lab, the same window could be a good candidate for our maintenance as well. Performing maintenance on Fridays allows for troubleshooting and rollbacks, if necessary, either late Friday evening or on Saturday, when the lab is usually only running scheduled test suites.
Maintenance should not disrupt testing cycles when an upstream release is imminent.
Maintenance should be announced on the [[http://status.sepia.ceph.com|Status Portal]] and the sepia-users mailing list at least a week in advance, and all stakeholders (list still TBD) should give a green light before proceeding.
===== Maintenance Matrix =====
^ Service ^ Disrupts Community ^ Disrupts Developers ^ Backups ^ Ansible ((Can the host be restored/rebuilt with Ansible?)) ^ HA ^ Failover ^ Risk ^ Stakeholders ^ Applications ^ SME ^ Other Notes ^
| www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | N/A | Wordpress, nginx | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins |
| download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
| tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky. None of us are Ruby experts. |
| docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway | |
| chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
| chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
| shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
| {apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
| jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
| prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
| git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
| teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
| pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles. Still on baremetal mira |
| paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
| Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
| conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
| DHCP (store01) | No | Yes | Yes | Yes | No | No | Medium | Devs | dhcpd | dgalloway | |
| DNS | Could | Yes | N/A | Yes | RHEV/OVH | ns1/ns2 | Low | Devs | named | dgalloway | |
| FOG | No | Yes | No | Yes | RHEV | No | Medium | Devs | fog | dgalloway | |
| LRC | No | Could | No | Some | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
| gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
| RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | |
| Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
===== Scheduled Maintenance Plans =====
==== CI Infrastructure Procedure ====
Updating the dev chacra nodes ({1..5}.chacra.ceph.com) has little chance of affecting upstream teuthology testing except while the chacra service is redeployed or a host is rebooted. Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance. To be extra safe, you could pause the Sepia queue and wait ~30min to make sure no package manager processes get run against a chacra node.
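A minimal sketch of that extra check, assuming the dev chacra nodes are reachable over ssh and apt/dpkg are the package managers in play:

  # Look for package manager activity on each dev chacra node.
  # pgrep -f matches full command lines and the pattern is deliberately broad,
  # so eyeball the output rather than trusting exit codes blindly.
  for n in 1 2 3 4 5; do
    echo "--- ${n}.chacra.ceph.com"
    ssh "${n}.chacra.ceph.com" 'pgrep -af "apt|dpkg" || echo "no package manager running"'
  done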
- Notify ceph-devel@
- Log into each Jenkins instance, **Manage Jenkins** -> **Prepare for Shutdown**
- Again in Jenkins, go to **Manage Jenkins** -> **Manage Plugins**
- Select **All** at the bottom and click **Download now and install after restart**
- Wait for all jobs to finish and make sure plugins are downloaded
- Once all jobs are completed, ssh to each Jenkins instance
- Updating the jenkins package or rebooting will restart the service, so (the full sequence is consolidated in a sketch after this list):
- ''systemctl stop jenkins''
- ''systemctl disable jenkins''
- ''apt update''
- ''apt install linux-image-generic'' (or equivalent to **just** update the kernel)
- Reboot the host so you're running the latest kernel
- **Update Slaves**
- ssh to each static smithi slave (smithi{119..128}) and update packages
- ssh to each slave-{centos,ubuntu}-* slave, update packages, and **shut down**
- Put each irvingi node in Maintenance mode under the **Hosts** tab in the [[https://mgr01.front.sepia.ceph.com/ovirt-engine/webadmin/?locale=en_US#hosts-events|RHEV Web UI]]
- In the RHEV Web UI, highlight each irvingi host and click **Update**
- Bring slave-{centos,ubuntu}-* VMs back online after the irvingi hosts finish updating
- Make sure all static slaves reconnect to Jenkins
- **Update chacra, mita, shaman, prado**
- If no service redeploy is needed for chacra, shaman, or mita, just ssh to each of those hosts and ''apt update && apt upgrade && reboot''
- If a redeploy is needed, see each service's individual wiki page
- Once all the other CI hosts are up to date, update each Jenkins instance: ''apt upgrade''
- This should restart Jenkins, but if it doesn't, ''systemctl start jenkins''
- ''systemctl enable jenkins''
- Spot check a few jobs to make sure all plugins are working properly
- You can check this by commenting ''jenkins test make check'' in a PR
- Make sure the Github hooks are working (Was a job triggered? When the job finishes, does it update the status in the PR?)
- Make sure postbuild scripts are running when they're supposed to (Is the FAILURE script running when the build PASSED? That's a problem)
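The Jenkins-specific steps above, consolidated into a hedged shell sketch (Ubuntu/apt and a root shell on each Jenkins instance assumed):

  # Stop and disable jenkins first so neither the package upgrade nor the
  # reboot can restart the service prematurely.
  systemctl stop jenkins
  systemctl disable jenkins
  apt update
  apt install linux-image-generic   # or equivalent: update *just* the kernel
  reboot
  # ...later, after chacra/mita/shaman/prado are up to date:
  apt upgrade                       # usually restarts jenkins on its own
  systemctl enable jenkins
  systemctl start jenkins           # harmless if the upgrade already started it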
----
==== Public Facing Sites Procedure ====
=== tracker.ceph.com and www.ceph.com ===
For the most part, these hosts' packages can be updated and the hosts rebooted at any time. If you're feeling friendly, you could send a heads-up to ceph-devel and/or ceph-users.
== Post-update Tasks ==
* Log in to tracker.ceph.com and modify a bug
* Spot check a few pages on www.ceph.com
* Log in to www.ceph.com if you have a WordPress login
----
=== docs.ceph.com ===
As long as there isn't a [[https://jenkins.ceph.com/computer/docs.ceph.com/|job]] running, this host can be updated and rebooted whenever.
== Post-update Tasks ==
* Does http://docs.ceph.com load?
* Is the host reattached as a slave in [[https://jenkins.ceph.com/computer/docs.ceph.com/|Jenkins]]?
----
=== download.ceph.com ===
Rebooting this host is disruptive to upstream testing and should be part of a planned pre-announced outage to the ceph-users and ceph-devel mailing lists.
== Post-update Tasks ==
* Does https://download.ceph.com load?
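For a quick availability spot-check of the public sites covered in the sections above, a sketch like this only verifies the final HTTP status code after redirects, not page content or logins:

  for url in https://www.ceph.com https://tracker.ceph.com \
             http://docs.ceph.com https://download.ceph.com; do
    # -L follows redirects; -w prints only the final status code
    code=$(curl -sL -o /dev/null -w '%{http_code}' "$url")
    echo "${url} -> ${code}"
  done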
----
==== Sepia Lab Procedure ====
- Send a planned outage notice to sepia at lists dot ceph.com
- [[services:teuthology#gracefully_stop_workers|Instruct teuthology workers to die]]
- Wait for there to be no jobs running. This can take up to 12 hours.
- If you don't want to wait:
- Ask Yuri if ''scheduled_teuthology'' runs can be killed
- Ask individual devs if their runs can be killed
- Kill idle workers
  ssh teuthology.front.sepia.ceph.com
  sudo su - teuthworker
  bin/worker_kill_idle
- Once there are no jobs running and no workers remain (check with ''teuthworker@teuthology:~$ bin/worker_count''):
- Update packages on teuthology.front.sepia.ceph.com
- ''%%shutdown -r +5 "Rebooting during planned maintenance. SAVE YOUR WORK!"%%''
- Update and reboot the following hosts (a convenience loop appears after this list):
- labdashboard.front.sepia.ceph.com
- circle.front.sepia.ceph.com
- cobbler.front.sepia.ceph.com
- conserver.front.sepia.ceph.com
- drop.ceph.com
- git.ceph.com
- fog.front.sepia.ceph.com
- ns1.front.sepia.ceph.com
- ns2.front.sepia.ceph.com
- nsupdate.front.sepia.ceph.com
- vpn-pub.ovh.sepia.ceph.com
- satellite.front.sepia.ceph.com
- sentry.front.sepia.ceph.com
- pulpito.front.sepia.ceph.com
- Finally, update and reboot gw.sepia.ceph.com ((See [[services:rhev#emergency_rhev_web_ui_access_w_o_vpn]] for emergency backup plan))
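For the update-and-reboot step above, a hedged convenience loop over the ''.front'' hosts. It assumes apt-based systems (substitute yum where applicable, e.g. on satellite) and runs serially on purpose so ns1 and ns2 are never down at the same time:

  for h in labdashboard circle cobbler conserver fog ns1 ns2 \
           nsupdate satellite sentry pulpito; do
    ssh "${h}.front.sepia.ceph.com" \
        'sudo apt update && sudo apt upgrade -y && sudo reboot'
  done
  # drop.ceph.com, git.ceph.com, and vpn-pub.ovh.sepia.ceph.com are handled the
  # same way; gw.sepia.ceph.com always goes last (see the footnote above).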
=== Sepia Lab Post Maintenance Tasks ===
- ''ssh $(whoami)@8.43.84.131'' (circle.front/GFW proxy)
- Log in to [[https://cobbler.front.sepia.ceph.com/cobbler_web/system/list|Cobbler]]
- ''console smithi001.front.sepia.ceph.com'' (Do you get a SOL session?)
- ''ssh drop.ceph.com''
- Does git.ceph.com load? Is it up to date?
- Log in to [[http://fog.front.sepia.ceph.com/fog/management/index.php|FOG]] ((Updating the database can break FOG. See https://forums.fogproject.org/topic/10006/ubuntu-is-fog-s-enemy))
- ''dig smithi001.front.sepia.ceph.com @vpn-pub.ovh.sepia.ceph.com'' (this and the next two checks are looped in the sketch after this list)
- ''dig smithi001.front.sepia.ceph.com @ns1.front.sepia.ceph.com''
- ''dig smithi001.front.sepia.ceph.com @ns2.front.sepia.ceph.com''
- ''%%teuthology-lock --lock-many 1 -m ovh --os-type ubuntu --os-version 16.04%%'' (Do you get a functioning OVH node? ''dig'' its hostname against vpn-pub and make sure nsupdate is working)
- ''%%teuthology-lock --lock-many 1 -m smithi --os-type rhel%%''
- Run ceph-cm-ansible against that host and verify it subscribes to Satellite and can yum update
- Does http://sentry.ceph.com/sepia load?
- Does http://pulpito.ceph.com load?
- Verify all the reverse proxies in ''/etc/nginx/sites-enabled'' on gw.sepia.ceph.com are accessible
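The three ''dig'' checks above can be looped; an empty answer, SERVFAIL, or timeout means that resolver did not come back cleanly:

  for ns in vpn-pub.ovh.sepia.ceph.com \
            ns1.front.sepia.ceph.com \
            ns2.front.sepia.ceph.com; do
    echo "--- ${ns}"
    dig +short smithi001.front.sepia.ceph.com @"${ns}"
  done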
Finally, once all post-maintenance tasks are complete:
  ssh teuthology.front.sepia.ceph.com
  sudo su - teuthworker
  bin/worker_start smithi 25
  ^D^D
Check ''/home/teuthworker/bin/worker_start'' for how many workers should be running and start **1/4** of them at a time. If too many start at once, they can overwhelm the teuthology VM with ansible processes or overwhelm FOG with Deploy tasks.
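A hypothetical staggering helper, assuming ''worker_start'' takes a machine type and a count (as in the example above), that repeated invocations add more workers, and that the configured total is 100; check the script for the real numbers:

  total=100            # the real count lives in /home/teuthworker/bin/worker_start
  batch=$((total / 4))
  for i in 1 2 3 4; do
    bin/worker_start smithi "${batch}"
    sleep 600          # assumed pause; let each batch's ansible/FOG tasks drain
  done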
===== Boilerplate Outage Notices =====
==== CI ====
Hi All,
A scheduled maintenance of the CI Infrastructure is planned for YYYY-MM-DD at HH:MM UTC.
We will be updating and rebooting the following hosts:
  jenkins.ceph.com
  2.jenkins.ceph.com
  chacra.ceph.com
  {1..5}.chacra.ceph.com
  shaman.ceph.com
  1.shaman.ceph.com
  2.shaman.ceph.com
This means:
- Jenkins will be paused and will stop processing new jobs, so PR checks will be delayed
- Once there are no jobs running, all hosts will be updated and rebooted
- Repos on chacra nodes will be temporarily unavailable
Let me know if you have any questions/concerns.
Thanks,
==== Sepia Lab ====
Hi All,
A scheduled maintenance of the Sepia Lab Infrastructure is planned for YYYY-MM-DD at HH:MM UTC.
We will be updating and rebooting the following hosts:
  teuthology.front.sepia.ceph.com
  labdashboard.front.sepia.ceph.com
  circle.front.sepia.ceph.com
  cobbler.front.sepia.ceph.com
  conserver.front.sepia.ceph.com
  fog.front.sepia.ceph.com
  ns1.front.sepia.ceph.com
  ns2.front.sepia.ceph.com
  nsupdate.front.sepia.ceph.com
  vpn-pub.ovh.sepia.ceph.com
  satellite.front.sepia.ceph.com
  sentry.front.sepia.ceph.com
  pulpito.front.sepia.ceph.com
  drop.ceph.com
  git.ceph.com
  gw.sepia.ceph.com
This means:
- teuthology workers will be instructed to die and new jobs will not be started until the maintenance is complete
- DNS may be temporarily unavailable
- All aforementioned hosts will be unavailable for a brief time
- Your VPN connection will need to be restarted
I will send a follow-up "all clear" e-mail as a reply to this one once the maintenance is complete.
Let me know if you have any questions/concerns.
Thanks,