Cluster Management Daemon#

The cluster management daemon, or CMDaemon, is a server process that runs on all nodes of the DGX SuperPOD (including the head node). CMDaemons work together to make the cluster manageable. When applications such as cmsh and Base View communicate with the cluster, they are interacting with the CMDaemon running on the head node. Cluster management applications never communicate directly with CMDaemons running on non-head nodes.

The CMDaemon application starts running on any node automatically when it boots, and the application continues running until the node shuts down. Should CMDaemon be stopped manually for whatever reason, its cluster management functionality becomes unavailable, making it hard for administrators to manage the cluster. However, even with the daemon stopped, the cluster remains fully usable for running computational jobs using a workload manager.

The only route of communication with the CMDaemon is through TCP port 8081. CMDaemon accepts only SSL connections, thereby ensuring all communications are encrypted. Authentication is also managed in the SSL layer using client-side X509v3 certificates (2.2).
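
Because CMDaemon listens only on TCP port 8081 and accepts only SSL connections, a basic reachability check can be sketched with openssl s_client. This is a minimal sketch, not part of the product: the helper names cmd_endpoint and cmd_tls_probe and the host name headnode are hypothetical. The probe presents no client certificate, so CMDaemon may close the session right after the handshake, which is still enough to show that the port answers with TLS.

```shell
# Port stated in the text above.
CMD_PORT=8081

# Hypothetical helper: print the host:port endpoint of a node's CMDaemon.
cmd_endpoint() {
    # $1: host name or IP address of the node
    printf '%s:%s\n' "$1" "$CMD_PORT"
}

# Hypothetical helper: attempt a TLS handshake and print the subject and
# validity dates of the certificate that the server presents.
cmd_tls_probe() {
    # </dev/null closes stdin so that s_client exits after the handshake
    openssl s_client -connect "$(cmd_endpoint "$1")" </dev/null 2>/dev/null |
        openssl x509 -noout -subject -dates
}

# Usage (not run here): cmd_tls_probe headnode
```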

On the head node, the CMDaemon uses a MySQL database server to store all its internal data. Raw monitoring data, on the other hand, is stored as binary data outside of the MySQL database.

Controlling CMDaemon#

It may be useful to shut down or restart CMDaemon. For instance, a restart may be necessary to activate changes when the CMDaemon configuration file is modified. CMDaemon operation can be controlled through the following init script arguments to service cmd.

cmdaemonctl command arguments are shown in Table 8.

Table 8. cmdaemonctl command arguments

Argument	Description
stop	Stop the CMDaemon
start	Start the CMDaemon
reload	Reload configuration of the CMDaemon
force-reload	Force reload configuration of the CMDaemon
restart	Restart the CMDaemon
try-restart	Try to restart the CMDaemon, but only if it is running
status	Report whether CMDaemon is running
full-status∗	Report detailed statistics about CMDaemon
upgrade∗	Update database schema after version upgrade (expert only)
debugon∗	Enable debug logging (expert only)
debugoff∗	Disable debug logging (expert only)
logconf∗	Reload log configuration
∗ Arguments that work with cmdaemonctl as well as with the service command
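
In scripts, it can help to reject mistyped arguments before they reach the service command. The following is a minimal sketch under that assumption; is_cmd_argument and cmd_control are hypothetical helper names, and the argument list is copied from Table 8.

```shell
# Arguments listed in Table 8.
CMD_ARGS="stop start reload force-reload restart try-restart status full-status upgrade debugon debugoff logconf"

# Return 0 if $1 is one of the Table 8 arguments, 1 otherwise.
is_cmd_argument() {
    for arg in $CMD_ARGS; do
        [ "$arg" = "$1" ] && return 0
    done
    return 1
}

# Hypothetical wrapper: refuse unknown arguments before calling service cmd.
cmd_control() {
    if is_cmd_argument "$1"; then
        service cmd "$1"
    else
        echo "unknown CMDaemon argument: $1" >&2
        return 2
    fi
}

# Usage (not run here): cmd_control try-restart
```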

Restarting the CMDaemon on the head node of a cluster:

[root@dgxsuperpod ~]# service cmd restart
Redirecting to /bin/systemctl restart cmd.service
[root@dgxsuperpod ~]#

Viewing the resources used by CMDaemon, and other useful information:

[root@headnode etc]# service cmd status
CMDaemon version 2.1 is running (active) Running locally
Current Time: Fri, 29 Jan 2021 01:48:28 CET
Startup Time: Thu, 28 Jan 2021 15:45:17 CET Uptime: 10h 3m
CPU Usage: 66.8112u 50.5393s (0.3%)
Memory Usage: 172MB
Sessions Since Startup: 29 Active Sessions: 7
Number of occupied worker-threads: 7 Number of free worker-threads: 14
Connections handled: 2397
Requests processed: 6850 Total read: 1.98MB
Total written: 170MB
Average request rate: 11.4requests/m Average bandwidth usage: 4KB/s

Restarting the CMDaemon on a sequence of compute nodes dgx001 to dgx040:

[root@dgxsuperpod ~]# pdsh -w dgx00[1-9],dgx0[1-3][0-9],dgx040 service cmd restart

This uses pdsh, the parallel shell command.
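
The node range can also be expanded in plain shell and passed to pdsh as a comma-separated list, which makes it easy to check which hosts will be targeted. This is a minimal sketch; dgx_hosts is a hypothetical helper, and it assumes the three-digit, zero-padded node names used in the example above.

```shell
# Hypothetical helper: print a comma-separated list of names dgxNNN,
# for NNN running from $1 to $2 (plain decimal integers, no leading zeros).
dgx_hosts() {
    i=$1
    list=""
    while [ "$i" -le "$2" ]; do
        # ${list:+,} adds a comma only when the list is already non-empty
        list="${list}${list:+,}$(printf 'dgx%03d' "$i")"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Usage (not run here): pdsh -w "$(dgx_hosts 1 40)" service cmd restart
```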

Configuring CMDaemon#

Many cluster configuration changes can be done by modifying the CMDaemon configuration file. For the head node, the file is located at:

/cm/local/apps/cmd/etc/cmd.conf

For compute nodes, it is located inside of the software image that the node uses.

Appendix C of the Bright Cluster Manager Administrator Manual describes the supported configuration file directives and how they can be used. Normally there is no need to modify the default settings.

After modifying the configuration file, the CMDaemon must be restarted to activate the changes.
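
Since a restart activates whatever the file contains, keeping a copy of the previous version before editing is a cheap safeguard. This is a minimal sketch of that habit, not a documented procedure; backup_conf is a hypothetical helper and the .bak.<timestamp> suffix is an arbitrary choice.

```shell
# Hypothetical helper: copy a configuration file to <file>.bak.<timestamp>
# and print the path of the backup.
backup_conf() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    # -p preserves mode, ownership, and timestamps where possible
    cp -p "$1" "$backup" && printf '%s\n' "$backup"
}

# Usage on the head node (not run here):
#   backup_conf /cm/local/apps/cmd/etc/cmd.conf
#   ... edit the file ...
#   service cmd restart
```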

CMDaemon Versions#

For debugging an issue, knowing the version of CMDaemon that is in use on the cluster can be helpful. The cmdaemonversions command runs within the device mode of cmsh. It lists the CMDaemon version running on the nodes of the cluster.

[headnode->device]% cmdaemonversions
Hostname         Version index Version hash
---------------- ------------- ------------
headnode         146,965       e6f593b676
dgx001           146,965       e6f593b676
dgx002           146,965       e6f593b676

A higher version index value indicates a more recent CMDaemon version.

The --join option is a formatting option that groups nodes running the same version onto one line:

[headnode->device]% cmdaemonversions --join
Version index Version hash Count        Hostnames
------------- ------------ ------------ -------------------------
146,965       e6f593b676   3            headnode,dgx001..dgx002
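
For a quick scripted check that all nodes run the same CMDaemon version, the plain (non-joined) listing can be reduced to a count of distinct version hashes. This is a minimal sketch; count_cmd_versions is a hypothetical helper, and it assumes the input has the layout shown above: two header lines followed by one row per node, with the version hash in the last column.

```shell
# Hypothetical helper: read cmdaemonversions output on stdin and print
# the number of distinct version hashes found in the node rows.
count_cmd_versions() {
    # NR > 2 skips the column header and the dashed separator line
    awk 'NR > 2 && NF >= 3 { seen[$NF] = 1 }
         END { n = 0; for (h in seen) n++; print n }'
}

# Usage (not run here; assumes cmsh -c prints the same layout):
#   cmsh -c "device; cmdaemonversions" | count_cmd_versions
# A result greater than 1 means that the nodes run mixed versions.
```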

Configuring CMDaemon Logging#

CMDaemon generates log messages in /var/log/cmdaemon from specific internal subsystems, such as workload management, service management, monitoring, and certs. By default, none of those subsystems generate detailed (debug-level) messages, as that would make the log file grow rapidly.

CMDaemon Logging Configuration Global Debug Mode#

A global debug mode can be enabled in CMDaemon using cmdaemonctl:

[root@headnode ~]# cmdaemonctl -h
cmdaemonctl [OPTIONS...] COMMAND ...
Query or send control commands to the cluster manager daemon.
  -h --help   Show this help
Commands:
  debugon     Turn on CMDaemon debug
  debugoff    Turn off CMDaemon debug
...
[root@headnode ~]# cmdaemonctl debugon
CMDaemon debug level on

It is a good idea to turn debug logging off again with cmdaemonctl debugoff as soon as it is no longer needed, especially on production clusters. This prevents the cluster from being swamped with unmanageably large logs.
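
One way to avoid leaving debug mode on by accident is to enable it only for a fixed time window. This is a minimal sketch of that idea; debug_window is a hypothetical helper, the default of 60 seconds is arbitrary, and it assumes that cmdaemonctl returns a zero exit status on success.

```shell
# Hypothetical helper: enable global debug logging for $1 seconds
# (default 60), then disable it again.
debug_window() {
    cmdaemonctl debugon || return 1
    sleep "${1:-60}"
    # Runs whenever sleep returns, whether or not it slept the full time
    cmdaemonctl debugoff
}

# Usage (not run here): debug_window 300
```

The helper does not protect against the calling shell itself being killed during the wait, so debug mode should still be checked afterwards in that case.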

CMDaemon Subsystem Logging Configuration Debug Mode#

CMDaemon subsystems can generate debug logs separately per subsystem, including by severity level. This can be done by modifying the logging configuration file at:

/cm/local/apps/cmd/etc/logging.cmd.conf

Within this file, a section with a title of #Available Subsystems lists the available subsystems that can be monitored. These subsystems include MON (for monitoring), DB (for database), HA (for high availability), CERTS (for certificates), CEPH (for Ceph), and so on.

CMDaemon Subsystem Logging Configuration Severity Levels#

In addition to the debug setting, other severity levels are info, warning, error, and all. Further details on setting subsystem options are given within the logging.cmd.conf file. For example, to set CMDaemon log output for Monitoring, at a severity level of warning, the file contents for the section severity might look like:

Severity {
    warning: MON
}

CMDaemon Subsystem Logging Configuration Deployment#

The new logging configuration can be reloaded from the file by restarting CMDaemon:

[root@headnode etc]# service cmd restart

Or by reloading the logging configuration:

[root@headnode etc]# service cmd logconf

Configuration File Modification and the FrozenFile Directive#

As part of its tasks, the CMDaemon modifies several system configuration files. Some configuration files are completely replaced, while other configuration files only have some sections modified. Appendix A of the Bright Cluster Manager Administrator Manual lists all system configuration files that are modified.

A file that has been generated entirely by the CMDaemon contains a header:

# This file was automatically generated by cmd. Do not edit manually!

Such a file will be entirely overwritten, unless the FrozenFile configuration file directive (https://support.brightcomputing.com/manuals/9.2/admin-manual.pdf#section*.929) is used to keep it frozen.

Sections of files that have been generated by the CMDaemon will read as follows:

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
...
# END AUTOGENERATED SECTION -- DO NOT REMOVE

Such a file has only the auto-generated sections entirely overwritten, unless the FrozenFile configuration file directive is used to keep these sections frozen. The FrozenFile configuration file directive in cmd.conf is set as in this example:

FrozenFile  =  {  "/etc/dhcpd.conf",  "/etc/postfix/main.cf"  }

If FrozenFile is not used and the generated file, or generated section of a file, contains manual modifications, then an event is generated when it is overwritten, and the manually modified configuration file is backed up to:

/var/spool/cmd/saved-config-files

Using FrozenFile is one of several possible configuration techniques.
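
Before editing a system configuration file by hand, it is worth checking whether CMDaemon manages it. The header and section markers described above can be detected with standard tools. This is a minimal sketch; is_cmd_generated and print_autogen_section are hypothetical helpers, and they assume that the header and markers appear at the start of a line exactly as shown above.

```shell
# Return 0 if the file carries the header of a file generated entirely by cmd.
is_cmd_generated() {
    grep -q '^# This file was automatically generated by cmd\.' "$1"
}

# Print the autogenerated section(s) of a file, markers included.
print_autogen_section() {
    sed -n '/^# BEGIN AUTOGENERATED SECTION/,/^# END AUTOGENERATED SECTION/p' "$1"
}

# Usage (not run here):
#   is_cmd_generated /etc/dhcpd.conf && echo "managed entirely by CMDaemon"
#   print_autogen_section /etc/postfix/main.cf
```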

Configuration File Precedence#

While the cluster manager changes as little as possible of the standard distributions that it manages, some conflicts are unavoidable. Sometimes a standard distribution utility or service generates a configuration file that conflicts with the configuration file generated by the cluster manager. In such a case, the configuration file generated by the cluster manager must take precedence, and generation of the configuration file by the standard distribution should be avoided. Sometimes a fully or partially frozen configuration file (3.4) provides a workaround. Otherwise, the cluster manager's own functionality usually allows the required configuration to be implemented. The configuration files installed and updated by the package management system are discussed further in Appendix A of the Bright Cluster Manager Administrator Manual.