Backups#
Cluster Installation Backup#
The cluster manager does not include facilities to create backups of a cluster installation. The cluster administrator is responsible for deciding on the best way to back up the cluster, out of the many possible choices.
Setting up a backup method is strongly recommended, and so is verifying that restoration from backup actually works.
One option that may be appropriate for some cases is simply cloning the head node. A clone can be created by PXE booting the new head node and following the procedure in Section 15.4.8 of the BCM Administrator Manual.
When setting up a backup mechanism, include the full filesystem of the head node (i.e. including all software images). Unless the compute node hard drives are used to store important data, it is not necessary to back them up.
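As an illustration of one possible approach, the head node filesystem (including the software images under /cm/images) could be copied to a remote backup host with rsync. The sketch below is only a minimal example using assumed names: backuphost and /backup/headnode are placeholders, and the exclusion list would need tuning for the local setup.

# copy the head node filesystem to a remote host, skipping pseudo-filesystems
[root@headnode ~]# rsync -aAXH --numeric-ids \
    --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/run --exclude=/tmp \
    / backuphost:/backup/headnode/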
If no backup infrastructure is already in place at the cluster site, the following open source (GPL) software packages may be used to maintain regular backups:
Bacula: Bacula requires ports 9101-9103 to be accessible on the head node. Including the following lines in the Shorewall rules file for the head node allows access via those ports from an IP address of 93.184.216.34 on the external network:

ACCEPT net:93.184.216.34 fw tcp 9101
ACCEPT net:93.184.216.34 fw tcp 9102
ACCEPT net:93.184.216.34 fw tcp 9103

The Shorewall service should then be restarted to enforce the added rules.

rsnapshot: rsnapshot allows periodic incremental filesystem snapshots to be written to a local or remote filesystem. Despite its simplicity, it can be a very effective tool for maintaining frequent backups of a system. More information is available at http://www.rsnapshot.org.
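As a minimal sketch of how rsnapshot might be configured for this purpose, the excerpt below keeps six hourly and seven daily snapshots of the head node filesystem under an assumed snapshot directory /backup/snapshots. The paths and retention counts are illustrative, and rsnapshot requires tab characters (not spaces) between the fields of its configuration file.

# /etc/rsnapshot.conf (excerpt) -- fields must be separated by tabs
snapshot_root   /backup/snapshots/
retain  hourly  6
retain  daily   7
# back up the entire head node filesystem, including the software images
backup  /       headnode/
exclude /proc/*
exclude /sys/*

The hourly and daily runs are then typically driven from cron, by calling rsnapshot hourly and rsnapshot daily at the appropriate intervals.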
Local Database and Data Backups and Restoration#
The CMDaemon database is stored in the MySQL cmdaemon database and contains most of the stored settings of the cluster. Monitoring data values are stored as binaries in the filesystem, under /var/spool/cmd/monitoring.
The administrator is expected to run a regular backup mechanism for the cluster to allow restores of all files from a recent snapshot. As an additional, separate, convenience:
For the CMDaemon database, the entire database is also backed up nightly on the cluster filesystem itself (local rotating backup) for the last seven days.
For the monitoring data, the raw data records are not backed up locally, since these can get very large. However, the configuration of the monitoring data, which is stored in the CMDaemon database, is also backed up for the last seven days.
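In addition to the automatic nightly rotation, an on-demand snapshot of the CMDaemon database can be taken manually with mysqldump. This is only a sketch: it uses the cmdaemon database user and the DBPass password from /cm/local/apps/cmd/etc/cmd.conf (as in the restore example later in this section), and the output path is arbitrary.

[root@headnode ~]# mysqldump -ucmdaemon -p<password> cmdaemon | gzip > /root/cmdaemon-manual.sql.gz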
Database Corruption and Repairs#
A corrupted MySQL database is often caused by an improper shutdown of the node. To deal with this, MySQL checks itself for corrupted tables on startup, and tries to repair any that it finds.
Detected corruption causes an event notice to be sent to cmsh or Base View. When there is database corruption, InfoMessages in the /var/log/cmdaemon log may mention:
Unexpected eof found in association with a table in the database.
Cannot find file when referring to an entire missing table.
Locked tables.
Error numbers from table handlers.
Error while executing a command.
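Before attempting a repair, the affected tables can first be checked read-only with myisamchk. The following is only an illustrative sketch: it assumes the cmdaemon database tables are under /var/lib/mysql/cmdaemon, and CMDaemon (and ideally MySQL itself) should not be writing to the tables while the check runs.

[root@headnode ~]# service cmd stop
[root@headnode ~]# myisamchk /var/lib/mysql/cmdaemon/*.MYI    # check only, no changes are made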
If a basic repair is to be conducted on a database, CMDaemon should first be stopped.
[root@headnode ~]# service cmd stop
[root@headnode ~]# myisamchk --recover /var/lib/mysql/mysql/user.MYI
[root@headnode ~]# service cmd start
If the basic repair fails, more extreme repair options can be tried; the myisamchk(1) man page suggests what can be attempted next.
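As an illustration of such options, the sketch below uses myisamchk's slower safe-recovery mode and forces a rebuild of the same table as in the basic repair above. These are standard myisamchk options, but which combination is appropriate depends on the kind of damage, so the man page should be consulted first.

[root@headnode ~]# service cmd stop
[root@headnode ~]# myisamchk --safe-recover --force /var/lib/mysql/mysql/user.MYI
[root@headnode ~]# service cmd start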
If CMDaemon is unable to start up due to a corrupted database, messages in the /var/log/cmdaemon file might show something like:
Oct 11 15:48:19 headnode CMDaemon: Info: Initialize cmdaemon database
Oct 11 15:48:19 headnode CMDaemon: Info: Attempt to set provisioning Network (280374976710700) not an element of networks
Oct 11 15:48:19 headnode CMDaemon: Fatal: Database corruption! Load Master Node with key: 280374976782569
Oct 11 15:48:20 headnode CMDaemon: Info: Sending reconnect command to all nodes which were up before master went down ...
Oct 11 15:48:26 headnode CMDaemon: Info: Reconnect command processed.
The above is an example of a CMDaemon database corruption message that the administrator should be aware of. It indicates that repairs are required for the CMDaemon database.
The severity of the corruption, in this case not even allowing CMDaemon to start up, may mean that a restoration from backup is needed. How to restore from backup is covered next.
Restoring from Local Backup#
If the MySQL database repair tools of the previous section do not fix the problem, then for a failover configuration, the dbreclone option should normally provide a CMDaemon and Slurm database that is current.
The dbreclone option does not clone the monitoring database.
Cloning Databases#
The cm-clone-monitoring-db.sh helper script that comes with CMDaemon can be used to clone the monitoring database.
Cloning Extra Databases#
The /cm/local/apps/cluster-tools/ha/conf/extradbclone.xml.template file can be used as a template to create a file extradbclone.xml in the same directory. The extradbclone.xml file can then be used to define additional databases to be cloned.
Running the /cm/local/apps/cmd/scripts/cm-update-mycnf script updates the /etc/my.cnf file. The database can then be cloned with this new MySQL configuration by running cmha dbreclone <PASSIVE_HOSTNAME> where <PASSIVE_HOSTNAME> is the hostname of the passive head node.
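Putting these steps together, the sequence might look like the following sketch, where headnode2 is an assumed hostname for the passive head node:

# after creating /cm/local/apps/cluster-tools/ha/conf/extradbclone.xml from the template
[root@headnode ~]# /cm/local/apps/cmd/scripts/cm-update-mycnf
[root@headnode ~]# cmha dbreclone headnode2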
If the head node is not part of a failover configuration, then a restoration from local backup can be done. The local backup directory is /var/spool/cmd/backup and contains files such as the following:
[root@headnode ~]# cd /var/spool/cmd/backup/
[root@headnode backup]# ls -l
total 280
...
-rw------- 1 root root 33804 Oct 10 04:02 backup-Mon.sql.gz
-rw------- 1 root root 33805 Oct 9 04:02 backup-Sun.sql.gz
...
The CMDaemon database snapshots are stored as backup-<day of week>.sql.gz. In the example, the latest backup available in the listing for CMDaemon turns out to be backup-Tue.sql.gz. The latest backup can then be un-gzipped and piped into the MySQL database for the user cmdaemon.
The password, <password>, can be retrieved from the configuration file /cm/local/apps/cmd/etc/cmd.conf, where it is set using the DBPass directive (Appendix C of the Bright Cluster Manager Administrator Manual).
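For example, the directive can be located with a simple grep (the output shown is only illustrative; the exact formatting of the line may differ):

[root@headnode ~]# grep DBPass /cm/local/apps/cmd/etc/cmd.conf
DBPass = "<password>"

The restore itself can then be carried out as follows: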
gunzip backup-Tue.sql.gz
service cmd stop #(just to make sure)
mysql -ucmdaemon -p<password> cmdaemon < backup-Tue.sql
Running service cmd start should have CMDaemon running again, this time with the database restored to its state at the time the snapshot was taken. This means that any changes made to the cluster manager after the snapshot was taken are no longer in effect.
Monitoring data values are not kept in a database, but in files.
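Since these files can grow large, it can be useful to check their size before deciding how they should be handled in the site backup, for example with du:

[root@headnode ~]# du -sh /var/spool/cmd/monitoring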