Backups
Cluster Installation Backup
The cluster manager does not include facilities to create backups of a cluster installation. The cluster administrator is responsible for deciding on the best way to back up the cluster, out of the many possible choices.
Having a backup method in place is strongly recommended, as is verifying that restoration from backup actually works.
One option that may be appropriate in some cases is simply cloning the head node. A clone can be created by PXE booting the new head node and following the procedure in Section 17.4.8 of the Bright Cluster Manager Administrator Manual.
When setting up a backup mechanism, include the full filesystem of the head node (i.e. including all software images). Unless the compute node hard drives are used to store important data, it is not necessary to back them up.
If no backup infrastructure is already in place at the cluster site, the following open source (GPL) software packages may be used to maintain regular backups.
Bacula. Bacula requires ports 9101-9103 to be accessible on the head node. Including the following lines in the Shorewall rules file for the head node allows access to those ports from an IP address of 93.184.216.34 on the external network:
ACCEPT net:93.184.216.34 fw tcp 9101
ACCEPT net:93.184.216.34 fw tcp 9102
ACCEPT net:93.184.216.34 fw tcp 9103
The Shorewall service should then be restarted to enforce the added rules.
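The three rules above differ only in the port number, so they can be generated for any backup server address. The following is a minimal sketch: the function name bacula_rules is hypothetical, and the rules file path /etc/shorewall/rules in the usage comment is an assumption (it is the usual default, but is not stated in this section).

```shell
#!/bin/sh
# Print one Shorewall ACCEPT rule per Bacula port (9101-9103) for a given
# source address in the external ("net") zone.
# bacula_rules is a hypothetical helper name, not part of Shorewall or Bacula.
bacula_rules() {
    src_ip="$1"
    for port in 9101 9102 9103; do
        printf 'ACCEPT net:%s fw tcp %s\n' "$src_ip" "$port"
    done
}

# Example use (assumed rules file path; review the output before appending):
#   bacula_rules 93.184.216.34 >> /etc/shorewall/rules
#   service shorewall restart
```

The function only prints the rules, so the output can be inspected before it is added to the rules file.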
rsnapshot. rsnapshot allows periodic incremental filesystem snapshots to be written to a local or remote filesystem. Despite its simplicity, it can be a very effective tool to maintain frequent backups of a system. More information is available at http://www.rsnapshot.org.
Local Database and Data Backups and Restoration
The CMDaemon database is stored in the MySQL cmdaemon database and contains most of the stored settings of the cluster. Monitoring data values are stored as binaries in the filesystem, under /var/spool/cmd/monitoring. The administrator is expected to run a regular backup mechanism for the cluster to allow restores of all files from a recent snapshot. As an additional, separate, convenience:
For the CMDaemon database, the entire database is also backed up nightly on the cluster file system itself (“local rotating backup”) for the last seven days.
For the monitoring data, the raw data records are not backed up locally, since these can get very large. However, the configuration of the monitoring data, which is stored in the CMDaemon database, is backed up for the last seven days too.
Database Corruption and Repairs
A corrupted MySQL database is often caused by an improper shutdown of the node. To deal with this, MySQL checks itself for corrupted tables when starting up, and tries to repair any that it finds. Detected corruption causes an event notice to be sent to cmsh or Base View. When there is database corruption, InfoMessages in the /var/log/cmdaemon log may mention:
Unexpected eof found in association with a table in the database.
can’t find file when referring to an entire missing table.
locked tables.
error numbers from table handlers.
Error while executing a command.
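A quick way to look for the indicators listed above is to search the log for them. The following is a minimal sketch: scan_cmdaemon_log is a hypothetical helper name, and the patterns are loose, case-insensitive approximations of the messages described here rather than an official or exhaustive list.

```shell
#!/bin/sh
# Print lines from a CMDaemon log that resemble the corruption indicators
# described in this section. Defaults to /var/log/cmdaemon.
# scan_cmdaemon_log is a hypothetical helper name; the patterns are
# approximations and may need adjusting for the messages actually seen.
scan_cmdaemon_log() {
    logfile="${1:-/var/log/cmdaemon}"
    grep -i -E \
        -e 'unexpected eof' \
        -e 'can.t find file' \
        -e 'locked' \
        -e 'error [0-9]+ from' \
        -e 'database corruption' \
        -e 'error while executing' \
        "$logfile"
}
```

grep exits with a non-zero status when nothing matches, so the function can also be used as a condition in a script, for example: scan_cmdaemon_log && echo "possible corruption".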
If a basic repair is to be conducted on a database, CMDaemon should first be stopped.
[root@headnode ~]# service cmd stop
[root@headnode ~]# myisamchk --recover /var/lib/mysql/mysql/user.MYI
[root@headnode ~]# service cmd start
If basic repair fails, more extreme repair options can then be tried. The myisamchk(1) man page describes the available options.
If CMDaemon is unable to start up due to a corrupted database, messages in the /var/log/cmdaemon file might show something like:
Oct 11 15:48:19 headnode CMDaemon: Info: Initialize cmdaemon database
Oct 11 15:48:19 headnode CMDaemon: Info: Attempt to set provisioning Network (280374976710700) not an element of networks
Oct 11 15:48:19 headnode CMDaemon: Fatal: Database corruption! Load Master Node with key: 280374976782569
Oct 11 15:48:20 headnode CMDaemon: Info: Sending reconnect command to all nodes which were up before master went down ...
Oct 11 15:48:26 headnode CMDaemon: Info: Reconnect command processed.
Here it is the CMDaemon Database corruption message that the administrator should be aware of, and which suggests database repairs are required for the CMDaemon database. The severity of the corruption, in this case not even allowing CMDaemon to start up, may mean that a restoration from backup is needed. How to restore from backup is covered next.
Restoring from Local Backup
If the MySQL database repair tools of the previous section do not fix the problem, then for a failover configuration, the dbreclone option should normally provide a CMDaemon and Slurm database that is current. The dbreclone option does not clone the monitoring database.
Cloning Databases
The cm-clone-monitoring-db.sh helper script that comes with CMDaemon can be used to clone the monitoring database.
Cloning Extra Databases
The file /cm/local/apps/cluster-tools/ha/conf/extradbclone.xml.template can be used as a template to create a file extradbclone.xml in the same directory. The extradbclone.xml file can then be used to define additional databases to be cloned. Running the /cm/local/apps/cmd/scripts/cm-update-mycnf script then updates /etc/my.cnf. The database can then be cloned with this new MySQL configuration by running cmha dbreclone <passive> where <passive> is the hostname of the passive head node.
If the head node is not part of a failover configuration, then a restoration from local backup can be done. The local backup directory is /var/spool/cmd/backup, with contents that look like:
[root@headnode ~]# cd /var/spool/cmd/backup/
[root@headnode backup]# ls -l
total 280
...
-rw------- 1 root root 33804 Oct 10 04:02 backup-Mon.sql.gz
-rw------- 1 root root 33805 Oct 9 04:02 backup-Sun.sql.gz
...
The CMDaemon database snapshots are stored as backup-<day of week>.sql.gz. In the example, the latest backup available in the listing for CMDaemon turns out to be backup-Tue.sql.gz. The latest backup can then be ungzipped and piped into the MySQL database for the user cmdaemon. The password, <password>, can be retrieved from /cm/local/apps/cmd/etc/cmd.conf, where it is configured in the DBPass directive (Appendix C of the Bright Cluster Manager Administrator Manual).
gunzip backup-Tue.sql.gz
service cmd stop   # just to make sure
mysql -ucmdaemon -p<password> cmdaemon < backup-Tue.sql
Running service cmd start should then bring CMDaemon up again, this time with the database restored to its state at the time the snapshot was taken. This means that any changes made to the cluster manager after the snapshot was taken are lost.
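The restore steps can also be scripted so that the newest snapshot is selected automatically and the compressed file is left in place. The following is a minimal sketch, not a tool shipped with the cluster manager: the function names latest_backup, read_dbpass, and restore_cmdaemon_db are hypothetical, and the parsing assumes that cmd.conf holds the password on a line of the form DBPass = "<password>".

```shell
#!/bin/sh
# Sketch of a restore from the local rotating backup.
# latest_backup, read_dbpass and restore_cmdaemon_db are hypothetical names.

# Print the most recently modified backup-*.sql.gz in a directory.
latest_backup() {
    dir="${1:-/var/spool/cmd/backup}"
    # ls -t sorts by modification time, newest first.
    ls -t "$dir"/backup-*.sql.gz 2>/dev/null | head -n 1
}

# Print the DBPass value from cmd.conf.
# Assumes a line of the form: DBPass = "secret"
read_dbpass() {
    conf="${1:-/cm/local/apps/cmd/etc/cmd.conf}"
    sed -n 's/^[[:space:]]*DBPass[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$conf" | head -n 1
}

# Stop CMDaemon, stream the newest snapshot into MySQL, then start CMDaemon.
# gunzip -c writes to stdout, so the compressed snapshot is not modified.
restore_cmdaemon_db() {
    snapshot="$(latest_backup "$1")"
    [ -n "$snapshot" ] || { echo "no snapshot found" >&2; return 1; }
    pass="$(read_dbpass)"
    service cmd stop
    gunzip -c "$snapshot" | mysql -ucmdaemon -p"$pass" cmdaemon
    service cmd start
}
```

Selecting by modification time rather than by day name avoids having to work out which weekday file is the most recent. It is still worth confirming with ls -l that the chosen snapshot predates the corruption before restoring from it.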
Monitoring data values are not kept in a database, but in files.