Historically, infrastructure maintenance has been performed on an as-needed, "if it ain't broke, don't fix it" basis. Now that most infrastructure services are highly available and the lab's network, services, and test suites have stabilized, we can afford occasional downtime to perform system updates, hardware swaps, OS upgrades, etc.
This document (like all good documentation) will be ever-evolving.
The PnT Labs team has scheduled monthly maintenance on the second Friday of every month. Since that maintenance only has the potential to disrupt operations in the downstream/Octo lab, the same day could be a good candidate for Sepia lab maintenance. Performing maintenance on Fridays also allows for troubleshooting and rollbacks, if necessary, late Friday evening or on Saturday, when the lab is usually only running scheduled test suites.
Maintenance should not disrupt testing cycles when an upstream release is imminent.
Maintenance should be announced on the Status Portal and the sepia-users mailing list at least a week in advance, and all stakeholders (list yet to be determined) should give a green light before proceeding.
Service | Disrupts Community | Disrupts Developers | Backups | Ansible-managed | HA | Failover | Risk | Stakeholders | Applications | SME | Other Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | N/A | Wordpress, nginx | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins |
download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway | |
tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky. None of us are ruby experts. |
docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway | |
chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo | |
chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo | |
shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo | |
{apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira |
jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo | |
prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo | |
git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick | |
teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com |
pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles. Still on baremetal mira |
paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira |
Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images |
conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway | |
DHCP (store01) | No | Yes | Yes | Yes | No | No | Medium | Devs | dhcpd | dgalloway | |
DNS | Could | Yes | N/A | Yes | RHEV/OVH | ns1/ns2 | Low | Devs | named | dgalloway | |
FOG | No | Yes | No | Yes | RHEV | No | Medium | Devs | fog | dgalloway | |
LRC | No | Could | No | Some | Yes | Ish | Medium | Devs | ceph | dgalloway,sage | |
gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway | |
RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | |
Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version |
Updating the dev chacra nodes ({1..5}.chacra.ceph.com) has little chance of affecting upstream teuthology testing except while the chacra service is redeployed or a host is rebooted. Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance. To be extra safe, you could pause the Sepia queue and wait ~30min to make sure no package manager processes get run against a chacra node.
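That safety check can be scripted (a minimal sketch; it assumes passwordless ssh to each node, and the process-name list is an assumption):

    # Sketch: confirm no package-manager process is mid-transaction on the
    # dev chacra nodes before rebooting or redeploying them.
    for n in 1 2 3 4 5; do
      host="${n}.chacra.ceph.com"
      if ssh "$host" "pgrep -x 'apt|apt-get|dpkg|yum|dnf'" > /dev/null; then
        echo "$host: package manager still running; hold off"
      else
        echo "$host: clear"
      fi
    done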
On the Jenkins hosts (jenkins.ceph.com, 2.jenkins.ceph.com), wait until no jobs are running, then stop and disable the service so it doesn't come back up mid-maintenance, update, reboot, and re-enable:

    systemctl stop jenkins
    systemctl disable jenkins
    apt update
    apt install linux-image-generic   # or equivalent, to just update the kernel
    apt upgrade                       # or all at once: apt update && apt upgrade && reboot
    reboot
    systemctl start jenkins
    systemctl enable jenkins
Once Jenkins is back up, verify it is processing jobs by commenting `jenkins test make check` on a PR.
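For a quicker smoke test before waiting on a PR check, you could confirm the web UI answers at all (a sketch; it assumes both hosts serve the UI over HTTPS):

    # Sketch: confirm the Jenkins web UIs respond after the reboot
    for host in jenkins.ceph.com 2.jenkins.ceph.com; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/")
      echo "$host -> HTTP $code"
    done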
For the most part, these hosts' packages can be updated and the hosts rebooted whenever. If you're feeling friendly, you could send a heads-up to ceph-devel and/or ceph-users first.
As long as there isn't a job running, this host can be updated and rebooted whenever.
Rebooting this host is disruptive to upstream testing and should be part of a planned outage announced in advance on the ceph-users and ceph-devel mailing lists.
`scheduled_teuthology` runs can be killed, and idle workers shut down:

    ssh teuthology.front.sepia.ceph.com
    sudo su - teuthworker
    bin/worker_kill_idle

Check how many workers are still running:

    teuthworker@teuthology:~$ bin/worker_count

Once the workers have drained, schedule the reboot with a warning:

    shutdown -r +5 "Rebooting during planned maintenance. SAVE YOUR WORK!"
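If you'd rather not poll by hand, a small loop can wait for the workers to drain (a sketch; it assumes `bin/worker_count` prints a bare number):

    # Sketch: as teuthworker, block until all workers have exited
    while true; do
      count=$(bin/worker_count)
      [ "$count" -eq 0 ] && break
      echo "$count workers still running; waiting..."
      sleep 60
    done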
After the maintenance, run through these post-maintenance checks:

- `ssh $(whoami)@8.43.84.131` (circle.front/GFW proxy)
- `console smithi001.front.sepia.ceph.com` (Do you get a SOL session?)
- `ssh drop.ceph.com`
- `dig smithi001.front.sepia.ceph.com @vpn-pub.ovh.sepia.ceph.com`
- `dig smithi001.front.sepia.ceph.com @ns1.front.sepia.ceph.com`
- `dig smithi001.front.sepia.ceph.com @ns2.front.sepia.ceph.com` (a scripted version of the three DNS checks is sketched after this list)
- `teuthology-lock --lock-many 1 -m ovh --os-type ubuntu --os-version 16.04` (Do you get a functioning OVH node? `dig` its hostname against vpn-pub and make sure nsupdate is working)
- `teuthology-lock --lock-many 1 -m smithi --os-type rhel`
- Make sure the sites in `/etc/nginx/sites-enabled` on gw.sepia.ceph.com are accessible (see the sketch after this list)
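The DNS and nginx items above lend themselves to a quick script (a sketch; the vhost check assumes the files in `/etc/nginx/sites-enabled` are named after the hostnames they serve, which may not hold):

    # Sketch: check smithi001 against all three nameservers
    for ns in vpn-pub.ovh.sepia.ceph.com ns1.front.sepia.ceph.com ns2.front.sepia.ceph.com; do
      answer=$(dig +short smithi001.front.sepia.ceph.com "@$ns")
      echo "$ns: ${answer:-NO ANSWER}"
    done

    # Sketch: spot-check each nginx vhost on the gateway over HTTPS
    ssh gw.sepia.ceph.com '
      for site in /etc/nginx/sites-enabled/*; do
        host=$(basename "$site")
        code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/")
        echo "$host -> HTTP $code"
      done
    '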
Finally, once all post-maintenance tasks are complete, restart the teuthology workers:

    ssh teuthology.front.sepia.ceph.com
    sudo su - teuthworker
    bin/worker_start smithi 25
    # Ctrl-D twice to exit the teuthworker shell and the ssh session
Check /home/teuthworker/bin/worker_start to see how many workers there should be, and start a quarter of them at a time. If too many start at once, they can overwhelm the teuthology VM with ansible processes or overwhelm FOG with Deploy tasks.
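A staggered start could look like this (a sketch; the batch size of 25 and the 10-minute gap are assumptions to be adjusted against the totals in /home/teuthworker/bin/worker_start):

    # Sketch: start smithi workers in four batches so the teuthology VM
    # (ansible) and FOG (Deploy tasks) aren't hit with everything at once.
    for batch in 1 2 3 4; do
      ssh teuthology.front.sepia.ceph.com \
        "sudo su - teuthworker -c 'bin/worker_start smithi 25'"
      echo "batch $batch started; waiting before the next..."
      sleep 600
    done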
Sample announcement for CI infrastructure maintenance:

    Hi All,

    A scheduled maintenance of the CI Infrastructure is planned for
    YYYY-MM-DD at HH:MM UTC.

    We will be updating and rebooting the following hosts:

    jenkins.ceph.com
    2.jenkins.ceph.com
    chacra.ceph.com
    {1..5}.chacra.ceph.com
    shaman.ceph.com
    1.shaman.ceph.com
    2.shaman.ceph.com

    This means:
    - Jenkins will be paused and stop processing new jobs, so PR checks will be delayed
    - Once there are no jobs running, all hosts will be updated and rebooted
    - Repos on chacra nodes will be temporarily unavailable

    Let me know if you have any questions/concerns.

    Thanks,
Sample announcement for Sepia lab maintenance:

    Hi All,

    A scheduled maintenance of the Sepia Lab Infrastructure is planned for
    YYYY-MM-DD at HH:MM UTC.

    We will be updating and rebooting the following hosts:

    teuthology.front.sepia.ceph.com
    labdashboard.front.sepia.ceph.com
    circle.front.sepia.ceph.com
    cobbler.front.sepia.ceph.com
    conserver.front.sepia.ceph.com
    fog.front.sepia.ceph.com
    ns1.front.sepia.ceph.com
    ns2.front.sepia.ceph.com
    nsupdate.front.sepia.ceph.com
    vpn-pub.ovh.sepia.ceph.com
    satellite.front.sepia.ceph.com
    sentry.front.sepia.ceph.com
    pulpito.front.sepia.ceph.com
    drop.ceph.com
    git.ceph.com
    gw.sepia.ceph.com

    This means:
    - teuthology workers will be instructed to die and new jobs will not be
      started until the maintenance is complete
    - DNS may be temporarily unavailable
    - All aforementioned hosts will be temporarily unavailable for a brief time
    - Your VPN connection will need to be restarted

    I will send a follow-up "all clear" e-mail as a reply to this one once the
    maintenance is complete.

    Let me know if you have any questions/concerns.

    Thanks,