OVH = OVH
RHEV = Sepia RHE instance
Baremetal = Host in Sepia lab


Planned/Scheduled Maintenance Procedures

Historically, infrastructure maintenance has been on an as-needed “if it ain't broke, don't fix it” basis. Now that most Infrastructure services are highly available and the lab network/services/suite have stabilized, we're able to afford occasional downtime to perform system updates, hardware swaps, OS upgrades, etc.

This document (like all good documentation) will be ever-evolving.

Frequency/Timing

The PnT Labs team has scheduled monthly maintenance on the second Friday of every month. Since that maintenance only has the potential to disrupt operations in the downstream/Octo lab, the same day could be a good candidate for Sepia lab maintenance. Performing maintenance on Fridays allows for troubleshooting and rollbacks, if necessary, either late Friday evening or on Saturday, when the lab is usually only running scheduled test suites.

Maintenance should not disrupt testing cycles when an upstream release is imminent.

Maintenance should be announced on the Status Portal and the sepia-users mailing list at least a week in advance, and all stakeholders (yet to be determined) should give a green light before proceeding.
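The dates implied by this schedule can be worked out with a few lines of shell. The following is only a sketch: it assumes GNU date (for the -d option), and the function names are made up for illustration. It prints the second Friday of a given month and the latest announcement date one week earlier, which always falls on the first Friday of the same month.

```shell
#!/bin/sh
# Sketch only: assumes GNU date (-d). Function names are hypothetical.

# first_friday_dom YEAR MONTH -> day of month (1-7) of the first Friday
first_friday_dom() (
    year=$1
    month=${2#0}    # strip a leading zero so "08" is not read as octal
    # %u is the ISO weekday of the 1st of the month: 1=Monday ... 7=Sunday
    dow=$(date -u -d "$(printf '%04d-%02d-01' "$year" "$month")" +%u)
    # Days from the 1st to the first Friday (weekday 5), as a day of month
    echo $(( (5 - dow + 7) % 7 + 1 ))
)

# maintenance_date YEAR MONTH -> second Friday as YYYY-MM-DD
maintenance_date() (
    dom=$(( $(first_friday_dom "$1" "$2") + 7 ))
    printf '%04d-%02d-%02d\n' "$1" "${2#0}" "$dom"
)

# announce_by YEAR MONTH -> latest announcement date (one week earlier)
announce_by() (
    dom=$(first_friday_dom "$1" "$2")
    printf '%04d-%02d-%02d\n' "$1" "${2#0}" "$dom"
)
```

For example, maintenance_date 2018 04 prints 2018-04-13 and announce_by 2018 04 prints 2018-04-06.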

Maintenance Matrix

Service | Disrupts Community | Disrupts Developers | Backups | Ansible | HA | Failover | Risk | Stakeholders | Applications | SME | Other Notes
www.ceph.com | Yes | No | Yes | Some | OVH | No | Medium | | Wordpress | lvaz, Age of Peers | Leo regularly updates Wordpress and its plugins
download.ceph.com | Yes | Yes | Yes | Most | OVH | No | Medium | All | Nginx | dgalloway |
tracker.ceph.com | Yes | Yes | Yes | No | OVH | No | High | All | Redmine | dgalloway,dmick | Redmine and its plugins are tricky. None of us are ruby experts.
docs.ceph.com | Yes | No | Yes | Some | OVH | No | Low | All | Nginx | dgalloway |
chacra.ceph.com | No | Yes | Yes | Yes | OVH | No | Low | Release team, Core devs? | chacra, celery, postgres, nginx | dgalloway,alfredo |
chacra dev instances | No | Yes | No | Yes | OVH | Yes | Low | Devs | chacra, celery, postgres, nginx | dgalloway,alfredo |
shaman | No | Yes | No | Yes | OVH | Yes | Medium | Devs | shaman, ? | dgalloway,alfredo |
{apt-mirror,gitbuilder}.ceph.com | No | Yes | No | No | No | No | High | Devs | Apache | dgalloway,dmick | Still on single baremetal mira
jenkins{2}.ceph.com | Yes | Yes | Some | Yes | OVH | No | Medium | Devs | Jenkins, mita, celery, nginx | dgalloway,alfredo |
prado.ceph.com | Could | Could | No | Yes | OVH | No | Low | Devs | prado, nginx | dgalloway,alfredo |
git.ceph.com | Yes? | Yes | No | Yes | RHEV | No | Medium | Devs, others? | git, git-daemon, apache | dgalloway,dmick |
teuthology VM | No | Yes | Some | Most | RHEV | No | Low | Devs | Teuthology | dgalloway,zack | Relies on Paddles, git.ceph.com, apt-mirror, download.ceph.com, chacra, gitbuilder.ceph.com
pulpito.front | No | Not really | No | Yes | No | No | Medium | QE? | pulpito | zack | Relies on paddles. Still on baremetal mira
paddles.front | No | Yes | Yes | Yes | No | No | Medium | Devs | paddles | dgalloway,zack | Still on baremetal mira
Cobbler | No | No | No | Yes | RHEV | No | Low | dgalloway | Cobbler, apache | dgalloway | Really only needed for creating FOG images
conserver.front | No | Yes? | Some | No | RHEV | No | Low | Devs | conserver | dgalloway |
DHCP (store01) | No | Yes | Yes | No | No | No | Medium | Devs | dhcpd | dgalloway |
DNS | Could | Yes | N/A | Yes | RHEV/OVH | Not currently | Low | Devs | named | dgalloway |
FOG | No | Yes | No | Little | RHEV | No | Medium | Devs | fog | dgalloway |
LRC | No | Could | No | No | Yes | Ish | Medium | Devs | ceph | dgalloway,sage |
gw.sepia.ceph.com | Could | Yes | Yes | Yes | RHEV | No | Medium | All | openvpn, nginx | dgalloway |
RHEV | No | Could | Yes | No | Yes | Ish | Medium | All | RHEV, gluster | dgalloway | Packages are a mix between CentOS gluster and RHGS
Gluster | No | Could | No | No | Yes | Ish | Medium | All | Gluster | dgalloway | RHGS compatibility must remain aligned with RHEV version

Scheduled Maintenance Plan

CI Infrastructure

Updating the dev chacra nodes ({1..5}.chacra.ceph.com) has little chance of affecting upstream teuthology testing except while the chacra service is being redeployed or a host is rebooting. Because of this, it's relatively safe to perform CI maintenance separately from Sepia lab maintenance. To be extra safe, you could pause the Sepia queue and wait ~30min to make sure no package manager processes are running against a chacra node.

  1. Log into each Jenkins instance and go to Manage Jenkins → Prepare for Shutdown
  2. Again in Jenkins, go to Manage Jenkins → Manage Plugins
  3. Select All at the bottom and click Download now and install after restart
  4. Wait for all jobs to finish and make sure plugins are downloaded
  5. Once all jobs are completed, ssh to each Jenkins instance
    1. Updating the jenkins package or rebooting will restart the service, so stop and disable it first:
    2. systemctl stop jenkins
    3. systemctl disable jenkins
    4. apt update
    5. apt install linux-image-generic (or equivalent to just update the kernel)
    6. Reboot the host so you're running the latest kernel
  6. If no service redeploy is needed for chacra, shaman, or mita, just ssh to each of those hosts and run apt update && apt upgrade && reboot
  7. If a redeploy is needed, see each service's individual wiki page
  8. Once all the other CI hosts are up to date, update each Jenkins instance: apt upgrade
    1. This should restart Jenkins but if it doesn't, systemctl start jenkins
    2. systemctl enable jenkins
  9. Spot check a few jobs to make sure all plugins are working properly
    1. You can check this by commenting "jenkins test make check" in a PR
    2. Make sure the Github hooks are working (Was a job triggered? When the job finishes, does it update the status in the PR?)
    3. Make sure postbuild scripts are running when they're supposed to (Is the FAILURE script running when the build PASSED? That's a problem)
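
The per-host Jenkins commands from the steps above could be wrapped roughly as follows. This is a sketch, not a tested runbook: the helper names (run, jenkins_pre_update, jenkins_post_update) and the DRY_RUN variable are invented for illustration, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the per-host Jenkins steps. Helper names and DRY_RUN are
# hypothetical. Dry-run is the default: commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Before the other CI hosts are updated: keep Jenkins down across the reboot
# and update only the kernel.
jenkins_pre_update() {
    run systemctl stop jenkins
    run systemctl disable jenkins
    run apt update
    run apt install linux-image-generic
    run reboot
}

# After the other CI hosts are up to date: upgrade and bring Jenkins back.
# Starting the service is harmless if the upgrade already restarted it.
jenkins_post_update() {
    run apt upgrade
    run systemctl start jenkins
    run systemctl enable jenkins
}
```

Running jenkins_pre_update with the default settings prints the five commands prefixed with "+", which is a cheap way to review the sequence before running it for real as root on each Jenkins instance with DRY_RUN=0.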

Public Facing Sites

tracker.ceph.com and www.ceph.com

For the most part, these hosts' packages can be updated and hosts rebooted whenever. If you're feeling friendly, you could send a heads up to ceph-devel and/or ceph-users.

Post-update Tasks

Check login to tracker.ceph.com and spot check a few pages on www.ceph.com afterwards.
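
Part of this spot check can be scripted. The sketch below assumes curl is installed; the function names are made up, and which URLs to pass is left to the operator. It only confirms that pages answer with a 2xx or 3xx status, so it does not replace logging in to the tracker by hand.

```shell
#!/bin/sh
# Sketch only: assumes curl is available. Function names are hypothetical.

# status_ok CODE -> succeeds for 2xx and 3xx HTTP status codes
status_ok() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# spot_check URL... -> prints one line per URL, returns non-zero on any failure
spot_check() {
    failed=0
    for url in "$@"; do
        # -s: silent, -o /dev/null: discard the body, -w: print the status code
        code=$(curl -s -o /dev/null --max-time 20 -w '%{http_code}' "$url")
        if status_ok "$code"; then
            echo "OK   $code $url"
        else
            echo "FAIL $code $url"
            failed=1
        fi
    done
    return "$failed"
}
```

A typical call would be spot_check https://tracker.ceph.com/ https://www.ceph.com/ plus any other pages worth checking.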

docs.ceph.com

As long as there isn't a job running, this host can be updated and rebooted whenever.

Post-update Tasks

download.ceph.com

Rebooting this host is disruptive to upstream testing and should be part of a planned pre-announced outage.

tasks/scheduled-maintenance.1522787103.txt.gz · Last modified: 2018/04/03 20:25 by djgalloway