Historically, infrastructure maintenance has been performed on an as-needed, "if it ain't broke, don't fix it" basis. Now that most infrastructure services are highly available and the lab's network, services, and test suites have stabilized, we can afford occasional downtime to perform system updates, hardware swaps, OS upgrades, etc.
This document (like all good documentation) will be ever-evolving.
The PnT Labs team has scheduled monthly maintenance on the second Friday of every month. Since that maintenance only has the potential to disrupt operations in the downstream/Octo lab, the same day could be a good candidate for Sepia lab maintenance. Performing maintenance on Fridays also allows for troubleshooting and rollbacks, if necessary, late Friday evening or on Saturday, when the lab is usually only running scheduled test suites.
Maintenance should not disrupt testing cycles when an upstream release is imminent.
Maintenance should be announced on the Status Portal and the sepia-users mailing list at least a week in advance, and all stakeholders (list yet to be determined) should give a green light before proceeding.
Service | Disrupts Community | Disrupts Developers | Backups | Ansible-managed | HA | Failover | Risk | Stakeholders | Applications | SME | Other Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | N/A | Wordpress, nginx | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins |
download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky. None of us are ruby experts. |
docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway | |
chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
{apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles. Still on baremetal mira |
paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
DHCP (store01) | No | Yes | Yes | Yes | No | No | Medium | Devs | dhcpd | dgalloway | |
DNS | Could | Yes | N/A | Yes | RHEV/OVH | ns1/ns2 | Low | Devs | named | dgalloway | |
FOG | No | Yes | No | Yes | RHEV | No | Medium | Devs | fog | dgalloway | |
LRC | No | Could | No | Some | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | |
Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
Updating the dev chacra nodes ({1..5}.chacra.ceph.com) has little chance of affecting upstream teuthology testing except while the chacra service is redeployed or a host is rebooted. Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance. To be extra safe, you could pause the Sepia queue and wait ~30min to make sure no package manager processes get run against a chacra node.
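That safety check can be scripted (a minimal sketch; it assumes passwordless ssh to each node, and the process-name list is an assumption):

    # Sketch: confirm no package-manager process is mid-transaction on the
    # dev chacra nodes before rebooting or redeploying them.
    for n in 1 2 3 4 5; do
      host="${n}.chacra.ceph.com"
      if ssh "$host" "pgrep -x 'apt|apt-get|dpkg|yum|dnf'" > /dev/null; then
        echo "$host: package manager still running; hold off"
      else
        echo "$host: clear"
      fi
    done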
On the Jenkins hosts (jenkins.ceph.com, 2.jenkins.ceph.com), wait until no jobs are running, then stop and disable the service so it doesn't come back up mid-maintenance, update, reboot, and re-enable:

    systemctl stop jenkins
    systemctl disable jenkins
    apt update
    apt install linux-image-generic   # or equivalent, to just update the kernel
    apt upgrade                       # or all at once: apt update && apt upgrade && reboot
    reboot
    systemctl start jenkins
    systemctl enable jenkins
Once Jenkins is back up, verify it is processing jobs by commenting `jenkins test make check` on a PR.
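For a quicker smoke test before waiting on a PR check, you could confirm the web UI answers at all (a sketch; it assumes both hosts serve the UI over HTTPS):

    # Sketch: confirm the Jenkins web UIs respond after the reboot
    for host in jenkins.ceph.com 2.jenkins.ceph.com; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/")
      echo "$host -> HTTP $code"
    done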
For the most part, these hosts' packages can be updated and the hosts rebooted whenever. If you're feeling friendly, you could send a heads-up to ceph-devel and/or ceph-users first.
As long as there isn't a job running, this host can be updated and rebooted whenever.
Rebooting this host is disruptive to upstream testing and should be part of a planned outage announced in advance on the ceph-users and ceph-devel mailing lists.
`scheduled_teuthology` runs can be killed, and idle workers shut down:

    ssh teuthology.front.sepia.ceph.com
    sudo su - teuthworker
    bin/worker_kill_idle

Check how many workers are still running:

    teuthworker@teuthology:~$ bin/worker_count

Once the workers have drained, schedule the reboot with a warning:

    shutdown -r +5 "Rebooting during planned maintenance. SAVE YOUR WORK!"
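If you'd rather not poll by hand, a small loop can wait for the workers to drain (a sketch; it assumes `bin/worker_count` prints a bare number):

    # Sketch: as teuthworker, block until all workers have exited
    while true; do
      count=$(bin/worker_count)
      [ "$count" -eq 0 ] && break
      echo "$count workers still running; waiting..."
      sleep 60
    done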
After the maintenance, run through these post-maintenance checks:

- `ssh $(whoami)@8.43.84.131` (circle.front/GFW proxy)
- `console smithi001.front.sepia.ceph.com` (Do you get a SOL session?)
- `ssh drop.ceph.com`
- `dig smithi001.front.sepia.ceph.com @vpn-pub.ovh.sepia.ceph.com`
- `dig smithi001.front.sepia.ceph.com @ns1.front.sepia.ceph.com`
- `dig smithi001.front.sepia.ceph.com @ns2.front.sepia.ceph.com` (a scripted version of the three DNS checks is sketched after this list)
- `teuthology-lock --lock-many 1 -m ovh --os-type ubuntu --os-version 16.04` (Do you get a functioning OVH node? `dig` its hostname against vpn-pub and make sure nsupdate is working)
- `teuthology-lock --lock-many 1 -m smithi --os-type rhel`
- Make sure the sites in `/etc/nginx/sites-enabled` on gw.sepia.ceph.com are accessible (see the sketch after this list)
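The DNS and nginx items above lend themselves to a quick script (a sketch; the vhost check assumes the files in `/etc/nginx/sites-enabled` are named after the hostnames they serve, which may not hold):

    # Sketch: check smithi001 against all three nameservers
    for ns in vpn-pub.ovh.sepia.ceph.com ns1.front.sepia.ceph.com ns2.front.sepia.ceph.com; do
      answer=$(dig +short smithi001.front.sepia.ceph.com "@$ns")
      echo "$ns: ${answer:-NO ANSWER}"
    done

    # Sketch: spot-check each nginx vhost on the gateway over HTTPS
    ssh gw.sepia.ceph.com '
      for site in /etc/nginx/sites-enabled/*; do
        host=$(basename "$site")
        code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/")
        echo "$host -> HTTP $code"
      done
    '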
Finally, once all post-maintenance tasks are complete, restart the teuthology workers:

    ssh teuthology.front.sepia.ceph.com
    sudo su - teuthworker
    bin/worker_start smithi 25
    # Ctrl-D twice to exit the teuthworker shell and the ssh session
Check /home/teuthworker/bin/worker_start to see how many workers there should be, and start a quarter of them at a time. If too many start at once, they can overwhelm the teuthology VM with ansible processes or overwhelm FOG with Deploy tasks.
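A staggered start could look like this (a sketch; the batch size of 25 and the 10-minute gap are assumptions to be adjusted against the totals in /home/teuthworker/bin/worker_start):

    # Sketch: start smithi workers in four batches so the teuthology VM
    # (ansible) and FOG (Deploy tasks) aren't hit with everything at once.
    for batch in 1 2 3 4; do
      ssh teuthology.front.sepia.ceph.com \
        "sudo su - teuthworker -c 'bin/worker_start smithi 25'"
      echo "batch $batch started; waiting before the next..."
      sleep 600
    done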
Sample announcement for CI infrastructure maintenance:

    Hi All,

    A scheduled maintenance of the CI Infrastructure is planned for
    YYYY-MM-DD at HH:MM UTC.

    We will be updating and rebooting the following hosts:

    jenkins.ceph.com
    2.jenkins.ceph.com
    chacra.ceph.com
    {1..5}.chacra.ceph.com
    shaman.ceph.com
    1.shaman.ceph.com
    2.shaman.ceph.com

    This means:
    - Jenkins will be paused and stop processing new jobs, so PR checks will be delayed
    - Once there are no jobs running, all hosts will be updated and rebooted
    - Repos on chacra nodes will be temporarily unavailable

    Let me know if you have any questions/concerns.

    Thanks,
Sample announcement for Sepia lab maintenance:

    Hi All,

    A scheduled maintenance of the Sepia Lab Infrastructure is planned for
    YYYY-MM-DD at HH:MM UTC.

    We will be updating and rebooting the following hosts:

    teuthology.front.sepia.ceph.com
    labdashboard.front.sepia.ceph.com
    circle.front.sepia.ceph.com
    cobbler.front.sepia.ceph.com
    conserver.front.sepia.ceph.com
    fog.front.sepia.ceph.com
    ns1.front.sepia.ceph.com
    ns2.front.sepia.ceph.com
    nsupdate.front.sepia.ceph.com
    vpn-pub.ovh.sepia.ceph.com
    satellite.front.sepia.ceph.com
    sentry.front.sepia.ceph.com
    pulpito.front.sepia.ceph.com
    drop.ceph.com
    git.ceph.com
    gw.sepia.ceph.com

    This means:
    - teuthology workers will be instructed to die and new jobs will not be
      started until the maintenance is complete
    - DNS may be temporarily unavailable
    - All aforementioned hosts will be temporarily unavailable for a brief time
    - Your VPN connection will need to be restarted

    I will send a follow-up "all clear" e-mail as a reply to this one once the
    maintenance is complete.

    Let me know if you have any questions/concerns.

    Thanks,