====== FOG ======

===== Summary =====

[[https://fogproject.org/|FOG]] is in use in the Sepia lab. It enables us to reimage baremetal testnodes before every job.

Our instance is currently hosted at http://fog.front.sepia.ceph.com/fog.

**Username:** fog\\
**Password:** Standard root password

===== How-Tos =====

==== Adding a new distro ====

The [[https://github.com/ceph/ceph-build/pull/1706|Jenkins job]] does this automatically now.

==== Capturing OS images ====

This can be done manually by basically deciphering the bash monster in https://github.com/ceph/ceph-build/tree/master/sepia-fog-images. To use the Jenkins job instead:

  - Navigate to https://jenkins.ceph.com/job/sepia-fog-images/build
  - Choose which machine types and distros you want to capture images for
  - Click **Build**
  - ...
  - Profit

If capturing any image fails, the job is configured to cancel the OS capture and leave the testnodes locked so you can debug/investigate.

==== Image capture control flow ====

The Jenkins job sepia-fog-images runs on the teuthology host, and:

  - clones/updates and sets up teuthology to use teuthology-lock
  - clones/updates ceph-cm-ansible
  - locks a machine of the requested type(s) (or uses hosts passed in as arguments), setting their descriptions to "Locked to capture FOG image for Jenkins build ###"
  - uses /usr/local/sbin/set-next-server.sh on the store01 DHCP server to set the targets to PXE boot from cobbler (rather than fog) and restarts dhcpd
  - sshes to ubuntu@cobbler.front to set the right cobbler profile for the host and enable netboot
  - powercycles the hosts in question
  - while the hosts are rebooting, uses curl and the FOG API to GET an image id or POST an image template to create the image, and then sets up fog for image capture
  - sleeps for 10s to allow the hosts to become inaccessible, then starts polling for the sentinel file /ceph-qa-ready, which is created at the very end of the process (the cobbler install flow is documented below)
  - if there's an error or /ceph-qa-ready isn't present, retries for up to 2 hours. If normal completion is seen, sets DHCP back to PXE-from-fog and runs ansible-playbook (from teuthology, against the host) with tools/prep-fog-capture.yml, which removes some files from the prior installation:
    * /etc/udev/rules.d/70-persistent-net.rules
    * /.cephlab_net_configured
    * /ceph-qa-ready
  - disables network configuration, kills the /var/lib/ceph mount and removes it from fstab, removes any ssh host keys, unsubscribes from RHEL, removes a katello.facts file, disables periodic dnf makecache jobs, cleans the dnf cache, stops ntp/chrony, and sets the hwclock
  - restarts dhcpd
  - waits for any in-progress fog images to complete, pausing the teuthology queue if there are any
  - powercycles the targets to boot into FOG and capture, and waits for FOG task completion
  - runs teuthology-lock --unlock on any locked hosts and unpauses the queue if needed

==== Cobbler install flow for reimaging process ====

  - do a normal preseed/kickstart install with cobbler-defined preseed/kickstart files. Some extra definitions:
    * a smallish set of packages to install
    * grub serial console setup
    * install an /etc/rc.local to run once on first reboot
    * install with ext4 on the appropriate drive, without swap
    * set up subscription manager
    * add the cm user with the admin_users' keys and passwordless sudo
    * turn off cobbler PXE boot
  - after rebooting to the fresh install, /etc/rc.local runs:
    * search the nics for any active interfaces, and set them up for DHCP; if they receive no DHCP address, unconfigure them, the assumption being that they're not on any network we should configure
    * touch /.cephlab_net_configured when done
    * try to get a hostname from reverse DNS and configure it
    * generate SSH host keys
    * ping the cobbler host to make sure it's reachable
    * curl the cblr/svc/op/trig/mode/post/system/, which triggers cobbler to run the /var/lib/cobbler/triggers/install/post/cephlab_ansible.sh script against the target host
  - cephlab_ansible.sh will (running on the cobbler host):
    * use scl on CentOS 7 to get python 3.8
    * clone/update ceph-cm-ansible and ceph-sepia-secrets (root's ssh key allows access to the latter on github)
    * look for port 22 to be open; there is apparently a way this trigger might run before the install is done, and if so, 22 won't be available; when it's run from the /etc/rc.local in step 2 above, it will find 22 open
    * create a /var/log/ansible and put log output there in a file named
    * for CentOS 8 Stream, run tools/convert-to-centos-stream.yml
    * if the Cobbler profile is named '*-stock', stop there
    * run ansible cephlab.yml, skipping users,pubkeys,zap
  - cephlab.yml runs:
    * teuthology.yml for teuthology
    * testnodes.yml for testnodes
    * container-host.yml for docker/podman installation
    * cobbler.yml for cobbler hosts
    * same with paddles and pulpito
    * finally, for testnodes, touches the /ceph-qa-ready sentinel, used by the fog capture process above to notice that the installation is finished and proceed with the capture process
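
The sentinel handshake between the cobbler install and the capture job can be pictured as a simple poll-with-deadline loop. The following is only a rough Python sketch of that idea, not the actual job (which is a bash script): the function names, the ssh-based check, and the 60-second polling interval are assumptions made for illustration. Only the /ceph-qa-ready path and the 2-hour retry window come from the description above.

```python
import subprocess
import time

SENTINEL = "/ceph-qa-ready"  # sentinel path named in the flow above


def sentinel_present(host):
    """Hypothetical check: True if the sentinel exists on the host, tested over ssh.

    Any ssh failure (host still installing or rebooting) counts as "not ready".
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=10", host, "test", "-e", SENTINEL],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def wait_for_sentinel(check, timeout_s=2 * 60 * 60, interval_s=60,
                      sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    The 2-hour default mirrors the retry window described above; the
    polling interval is an assumed value. sleep and clock are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True   # install finished, capture can proceed
        if clock() >= deadline:
            return False  # gave up; hosts stay locked for debugging
        sleep(interval_s)
```

A caller would wrap the host check in a closure, for example ''wait_for_sentinel(lambda: sentinel_present("testnode.example"))'', where the hostname is a placeholder.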