Cluster Management Daemon
The cluster management daemon, or CMDaemon, is a server process that runs on all nodes of the DGX SuperPOD (including the head node). CMDaemons work together to make the cluster manageable. When applications such as cmsh and Base View communicate with the cluster, they are interacting with the CMDaemon running on the head node. Cluster management applications never communicate directly with CMDaemons running on non-head nodes.
The CMDaemon application starts running on any node automatically when it boots, and the application continues running until the node shuts down. Should CMDaemon be stopped manually for whatever reason, its cluster management functionality becomes unavailable, making it hard for administrators to manage the cluster. However, even with the daemon stopped, the cluster remains fully usable for running computational jobs using a workload manager.
The only route of communication with the CMDaemon is through TCP port 8081. CMDaemon accepts only SSL connections, thereby ensuring all communications are encrypted. Authentication is also managed in the SSL layer using client-side X509v3 certificates (2.2).
On the head node, the CMDaemon uses a MySQL database server to store all its internal data. Raw monitoring data, on the other hand, is stored as binary data outside of the MySQL database.
Controlling CMDaemon
It may be useful to shut down or restart CMDaemon. For instance, a restart may be necessary to activate changes when the CMDaemon configuration file is modified. CMDaemon operation can be controlled through the following init script arguments to service cmd.
The cmdaemonctl command arguments are shown in Table 8.
Table 8. cmdaemonctl command arguments
Argument | Description |
---|---|
stop | Stop the CMDaemon |
start | Start the CMDaemon |
reload | Reload configuration of the CMDaemon |
force-reload | Force reload configuration of the CMDaemon |
restart | Restart the CMDaemon |
try-restart | Try to restart the CMDaemon, but only if it is running |
status | Report whether CMDaemon is running |
full-status∗ | Report detailed statistics about CMDaemon |
upgrade∗ | Update database schema after version upgrade (expert only) |
debugon∗ | Enable debug logging (expert only) |
debugoff∗ | Disable debug logging (expert only) |
logconf∗ | Reload log configuration |
Restarting the CMDaemon on the head node of a cluster:
[root@dgxsuperpod ~]# service cmd restart
Redirecting to /bin/systemctl restart cmd.service
[root@dgxsuperpod ~]#
Viewing the resources used by CMDaemon, and other useful information:
[root@headnode etc]# service cmd status
CMDaemon version 2.1 is running (active) Running locally
Current Time: Fri, 29 Jan 2021 01:48:28 CET
Startup Time: Thu, 28 Jan 2021 15:45:17 CET Uptime: 10h 3m
CPU Usage: 66.8112u 50.5393s (0.3%)
Memory Usage: 172MB
Sessions Since Startup: 29 Active Sessions: 7
Number of occupied worker-threads: 7 Number of free worker-threads: 14
Connections handled: 2397
Requests processed: 6850 Total read: 1.98MB
Total written: 170MB
Average request rate: 11.4requests/m Average bandwidth usage: 4KB/s
Restarting the CMDaemon on a sequence of compute nodes dgx001 to dgx040:
[root@dgxsuperpod ~]# pdsh -w dgx00[1-9],dgx0[1-3][0-9],dgx040 service cmd restart
This uses pdsh, the parallel shell command.
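Before restarting CMDaemon on a range of nodes, it can help to preview exactly which host names the range covers. The following is a minimal sketch, assuming the compute nodes follow the zero-padded dgxNNN naming used above. The helper name node_range is hypothetical and is not part of pdsh.

```shell
# Print zero-padded host names, one per line, for a numeric range.
# Usage: node_range PREFIX FIRST LAST
# (hypothetical helper for illustration; not part of pdsh or the cluster manager)
node_range() {
  nr_i="$2"
  while [ "$nr_i" -le "$3" ]; do
    printf '%s%03d\n' "$1" "$nr_i"
    nr_i=$((nr_i + 1))
  done
}

# Preview the first few names of the range.
node_range dgx 1 3
```

Calling node_range dgx 1 40 lists the same 40 nodes that the pdsh host list above is meant to select. pdsh host lists generally also accept a single zero-padded range such as dgx[001-040], which may be easier to read than three comma-separated patterns.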
Configuring CMDaemon
Many cluster configuration changes can be done by modifying the CMDaemon configuration file. For the head node, the file is located at:
/cm/local/apps/cmd/etc/cmd.conf
For compute nodes, it is located inside of the software image that the node uses.
Appendix C of the Bright Cluster Manager Administrator Manual describes the supported configuration file directives and how they can be used. Normally there is no need to modify the default settings.
After modifying the configuration file, the CMDaemon must be restarted to activate the changes.
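Because a change to cmd.conf only takes effect after a restart, keeping a copy of the previous file makes it easier to roll back. The following is a minimal sketch of that habit, not a documented procedure. The helper name backup_conf and the backup naming scheme are assumptions made for illustration.

```shell
# Copy a configuration file to a timestamped backup next to it
# and print the path of the backup.
# (hypothetical helper; the .bak naming scheme is an assumption)
backup_conf() {
  bc_dest="$1.$(date +%Y%m%d-%H%M%S).bak"
  cp -p "$1" "$bc_dest" && printf '%s\n' "$bc_dest"
}

# Possible sequence on the head node (commented out so the sketch has no side effects):
# backup_conf /cm/local/apps/cmd/etc/cmd.conf
# ${EDITOR:-vi} /cm/local/apps/cmd/etc/cmd.conf
# service cmd restart
```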
CMDaemon Versions
For debugging an issue, knowing the version of CMDaemon that is in use on the cluster can be helpful. The cmdaemonversions command runs within the device mode of cmsh. It lists the CMDaemon version running on the nodes of the cluster.
[headnode->device]% cmdaemonversions
Hostname         Version index Version hash
---------------- ------------- ------------
headnode         146,965       e6f593b676
dgx001           146,965       e6f593b676
dgx002           146,965       e6f593b676
A higher version index value indicates a more recent CMDaemon version.
The --join option is a formatting option that groups together the nodes that run the same version:
[headnode->device]% cmdaemonversions --join
Version index Version hash Count        Hostnames
------------- ------------ ------------ -------------------------
146,965       e6f593b676   3            headnode,dgx001..dgx002
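When debugging, the main question is often whether every node runs the same CMDaemon version. The following is a minimal sketch that counts the distinct version hashes in the per-node output. It assumes the three-column layout shown above, with the hash as the last field. The function name distinct_versions is hypothetical, and the non-interactive cmsh invocation in the comment is an assumption that should be checked against the installed version.

```shell
# Count distinct CMDaemon version hashes in cmdaemonversions output read from stdin.
# Only lines whose last field looks like a lowercase hex hash are counted,
# so the header and separator lines are ignored.
distinct_versions() {
  awk 'NF >= 3 && $NF ~ /^[0-9a-f]+$/ { seen[$NF] = 1 }
       END { n = 0; for (h in seen) n++; print n }'
}

# Assumed invocation on the head node (commented out; requires cmsh):
# cmsh -c "device; cmdaemonversions" | distinct_versions
```

A result greater than 1 suggests that some nodes are running a different CMDaemon build from the rest.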
Configuring CMDaemon Logging
CMDaemon generates log messages in /var/log/cmdaemon from specific internal subsystems, such as workload management, service management, monitoring, and certs. By default, none of those subsystems generate detailed (debug-level) messages, as that would make the log file grow rapidly.
CMDaemon Logging Configuration Global Debug Mode
A global debug mode can be enabled in CMDaemon using cmdaemonctl:
[root@headnode ~]# cmdaemonctl -h
cmdaemonctl [OPTIONS…] COMMAND ...
Query or send control commands to the cluster manager daemon.
-h --help Show this help
Commands:
debugon Turn on CMDaemon debug
debugoff Turn off CMDaemon debug
...
[root@headnode ~]# cmdaemonctl debugon
CMDaemon debug level on
Debug-level logging should not be left running for long, especially on production clusters. Run cmdaemonctl debugoff once the needed logs have been captured, so that the cluster is not swamped with excessively large logs.
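One way to avoid leaving global debug mode on by accident is to wrap the work being investigated so that debugoff always follows debugon. The following is a minimal sketch using only the debugon and debugoff commands shown above. The wrapper name with_debug is hypothetical, and the sketch does not handle interruption by signals such as Ctrl-C.

```shell
# Run a command with CMDaemon global debug logging enabled,
# then disable debug logging again whether or not the command succeeded.
# (hypothetical wrapper; does not cover the case where the shell is interrupted)
with_debug() {
  cmdaemonctl debugon || return 1
  "$@"
  wd_rc=$?
  cmdaemonctl debugoff
  return "$wd_rc"
}

# Example (commented out; requires cmdaemonctl on the head node):
# with_debug sleep 300
```

The wrapper returns the exit status of the wrapped command, so it can be used inside scripts that check for failure.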
CMDaemon Subsystem Logging Configuration Debug Mode
CMDaemon subsystems can generate debug logs separately per subsystem, including by severity level. This can be done by modifying the logging configuration file at:
/cm/local/apps/cmd/etc/logging.cmd.conf
Within this file, a section with a title of #Available Subsystems lists the available subsystems that can be monitored. These subsystems include MON (for monitoring), DB (for database), HA (for high availability), CERTS (for certificates), CEPH (for Ceph), and so on.
CMDaemon Subsystem Logging Configuration Severity Levels
In addition to the debug setting, other severity levels are info, warning, error, and all.
Further details on setting subsystem options are given within the logging.cmd.conf file.
For example, to set CMDaemon log output for Monitoring, at a severity level of warning, the file contents for the section severity might look like:
Severity {
  warning: MON
}
CMDaemon Subsystem Logging Configuration Deployment
The new logging configuration can be reloaded from the file by restarting CMDaemon:
[root@headnode etc]# service cmd restart
Or by reloading the logging configuration:
[root@headnode etc]# service cmd logconf
Configuration File Modification and the FrozenFile Directive
As part of its tasks, the CMDaemon modifies several system configuration files. Some configuration files are completely replaced, while other configuration files only have some sections modified. Appendix A of the Bright Cluster Manager Administrator Manual lists all system configuration files that are modified.
A file that has been generated entirely by the CMDaemon contains a header:
# This file was automatically generated by cmd. Do not edit manually!
Such a file will be entirely overwritten, unless the FrozenFile configuration file directive (https://support.brightcomputing.com/manuals/9.2/admin-manual.pdf#section*.929) is used to keep it frozen.
Sections of files that have been generated by the CMDaemon will read as follows:
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
...
# END AUTOGENERATED SECTION -- DO NOT REMOVE
Such a file has only the auto-generated sections entirely overwritten, unless the FrozenFile configuration file directive is used to keep these sections frozen. The FrozenFile configuration file directive in cmd.conf is set as in this example:
FrozenFile = { "/etc/dhcpd.conf", "/etc/postfix/main.cf" }
If a generated file, or a generated section of a file, has been modified manually and FrozenFile is not in use, then an event is generated when the file is overwritten, and the manually modified configuration file is backed up to:
/var/spool/cmd/saved-config-files
Using FrozenFile is one of several possible configuration techniques.
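Before editing a system configuration file by hand, it is worth checking whether CMDaemon manages it, since manual changes to managed content are overwritten unless the file is frozen. The following is a minimal sketch that looks for the header and section markers quoted above. The function name cmd_generated_kind is hypothetical, and checking only the first five lines for the whole-file header is an assumption.

```shell
# Report how CMDaemon manages a configuration file, based on the markers it writes:
#   full    - the whole file is generated by cmd
#   section - the file contains an autogenerated section
#   none    - no cmd markers were found
# (hypothetical helper; the five-line header window is an assumption)
cmd_generated_kind() {
  if head -n 5 "$1" | grep -q 'This file was automatically generated by cmd'; then
    echo full
  elif grep -q '^# BEGIN AUTOGENERATED SECTION' "$1"; then
    echo section
  else
    echo none
  fi
}

# Example (commented out; the path is taken from the FrozenFile example above):
# cmd_generated_kind /etc/dhcpd.conf
```

A result of full or section indicates that manual edits should either go through the cluster manager or be protected with the FrozenFile directive.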
Configuration File Precedence
While the cluster manager changes as little as possible of the standard distributions that it manages, some conflicts are unavoidable. Sometimes a standard distribution utility or service generates a configuration file that conflicts with what the configuration file generated by the cluster manager is meant to do. In such a case the configuration file generated by the cluster manager must be given precedence, and the generation of a configuration file from the standard distribution should be avoided. Sometimes using a fully or partially frozen configuration file (3.4) allows a workaround. Otherwise, the functionality of the cluster manager version usually allows the required configuration function to be implemented. Details on the configuration files installed and updated by the package management system are further discussed in Appendix A of the Bright Cluster Manager Administrator Manual.