System Health Checks and Debugging

The key to successfully operating and managing a cluster is that all nodes are configured identically for their function, and they operate consistently. When issues arise, it becomes necessary to test systems to see if they are operating correctly.

If an issue is found on a node, remove that node from the batch partition for initial triage.

Unless the issue is obvious, follow the GPU system debugging guidelines at https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html

In addition, run tools specific to the DGX A100 systems. For health checks, this is the NVIDIA System Management tool (nvsm).
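As a rough illustration, a health check could be wrapped in a small script so that it can be run the same way on every node. The sketch below assumes that the nvsm command is on the PATH and that `nvsm show health` is the desired invocation; the function name, the timeout, and the decision to return None when the tool is missing are illustrative choices, not part of nvsm. The command typically needs root privileges.

```python
import shutil
import subprocess


def run_health_check(binary="nvsm", args=("show", "health"), timeout_s=900):
    """Run a health-check command and return (exit_code, stdout).

    Returns None when the binary is not installed on this node.
    The default command line is an assumption; adjust it for your site.
    """
    if shutil.which(binary) is None:
        return None
    proc = subprocess.run(
        [binary, *args],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    outcome = run_health_check()
    if outcome is None:
        print("nvsm not found on this node")
    else:
        code, report = outcome
        print(f"exit code: {code}")
        print(report)
```

Keeping the raw report alongside the exit code makes it easier to attach the output to a support case later.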

It is also useful to develop a set of single-node and multi-node tests to help validate the operation and performance of the DGX SuperPOD (Table 13). Often it is best to use your own key applications for this purpose as those exercise the system in the way that is most important to the users.

Table 13. DGX SuperPOD validation tools

Software | Purpose | Link
NCCL | Fabric | https://github.com/NVIDIA/nccl-tests
HPL | Math intensive applications with network communications | https://github.com/NVIDIA/deepops/tree/master/workloads/burn-in

In addition, there are standard applications that can be used to validate both single- and multi-node performance. When running the following tests, expect runs of the same configuration on distinct parts of the system to finish in a similar time or at a similar performance level. Performance can vary from run to run because of system configuration and existing job load. However, if a consistent difference is found over multiple runs on the same set of hardware, it can indicate an issue with some component of that system.
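The comparison described above can be sketched as a small script that averages several runs per node and flags nodes whose average deviates from the group median. The host names, the sample numbers, and the 5% tolerance are hypothetical; choose a threshold that matches the normal run-to-run variation of your own benchmark.

```python
from statistics import median


def flag_outliers(results, tolerance=0.05):
    """Return {host: fractional_deviation} for hosts outside the tolerance.

    results maps a host name to a list of measurements from repeated runs
    (for example, bandwidth or GFLOPS). Each host's mean is compared with
    the median of all host means.
    """
    means = {host: sum(vals) / len(vals) for host, vals in results.items() if vals}
    if not means:
        return {}
    reference = median(means.values())
    if reference == 0:
        return {}
    flagged = {}
    for host, mean in means.items():
        deviation = (mean - reference) / reference
        if abs(deviation) > tolerance:
            flagged[host] = deviation
    return flagged


if __name__ == "__main__":
    # Hypothetical measurements from three runs per node.
    sample = {
        "node01": [100, 101, 99],
        "node02": [100, 100, 100],
        "node03": [90, 89, 91],
    }
    for host, dev in flag_outliers(sample).items():
        print(f"{host}: {dev:+.1%} relative to the median")
```

Averaging over several runs before comparing keeps a single noisy run from flagging a healthy node.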

Collecting Log Files

Important log files include:

  1. /var/log/cmdaemon (the most important one—CMDaemon log file).

  2. /var/log/node-installer (node-installer log file).

  3. /var/spool/cmd/<slave-node-name>.rsync (provisioning logs).
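When opening a support case it can help to bundle these files into one archive. The following is a minimal sketch, not an official collection tool: it assumes the provisioning logs can be matched with a `*.rsync` glob (one file per node), and the function name and archive path are hypothetical. Reading these files usually requires root privileges.

```python
import glob
import os
import tarfile

# Paths taken from the list above; the glob for the per-node
# provisioning logs is an assumption.
LOG_PATTERNS = [
    "/var/log/cmdaemon",
    "/var/log/node-installer",
    "/var/spool/cmd/*.rsync",
]


def collect_logs(archive_path, patterns=LOG_PATTERNS):
    """Add every readable file matching the patterns to a gzip tar archive.

    Returns the list of files that were added. Missing or unreadable
    files are skipped silently.
    """
    collected = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for pattern in patterns:
            for path in sorted(glob.glob(pattern)):
                if os.path.isfile(path) and os.access(path, os.R_OK):
                    tar.add(path)
                    collected.append(path)
    return collected


if __name__ == "__main__":
    # Hypothetical output location.
    added = collect_logs("/tmp/cluster-logs.tar.gz")
    print(f"archived {len(added)} file(s)")
```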

Log Subsystems

Each logging message is emitted as part of a subsystem, which is defined on a per-translation-unit basis. Example subsystems include CONFIG, MIC, GPU, CLOUD, PROV, SERVICE, WLM, DB, USER, JSON, HADOOP, and CMD (the full list can be found in logger.h). Example log output from /var/log/cmdaemon:

Mar 30 03:38:02 headnodeName cmd: [ CLOUD ]  DevDbg:
Mar 30 03:38:02 headnodeName cmd: [  CMD  ]   Debug: [programrunner.cpp:797           ] ProgramRunner: /cm/local/apps/cmd/scripts/cloudproviders/openstack/openstackcommands.py  [DONE] 0 0
Mar 30 03:38:02 headnodeName cmd: [  CMD  ] Warning: [magicmanager.cpp:1797           ] This is a warning.

Each line shows the subsystem enclosed in [ ], followed by the log type (debug, warning, info, error), the location in the source code (present only in -D _DEBUG compiles), and the log message.
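For filtering or counting messages, the line layout described above can be split into fields with a regular expression. This is a sketch based only on the three sample lines shown here; the function name and field names are hypothetical, and real log files may contain lines that this pattern does not match.

```python
import re

# Pattern derived from the sample lines; the source location is optional
# because it only appears in debug builds.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+cmd:\s+"
    r"\[\s*(?P<subsystem>\w+)\s*\]\s+"
    r"(?P<severity>\w+):"
    r"(?:\s+\[(?P<location>[^\]\s]+:\d+)\s*\])?"
    r"\s*(?P<message>.*)$"
)


def parse_cmdaemon_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Mar 30 03:38:02 headnodeName cmd: [  CMD  ] Warning: "
        "[magicmanager.cpp:1797           ] This is a warning."
    )
    record = parse_cmdaemon_line(sample)
    print(record["subsystem"], record["severity"], record["message"])
```

Lines without a source location simply yield `None` for the `location` field, so the same parser works for both debug and non-debug builds.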

Increasing Log Verbosity

Verbosity of individual subsystems can be changed using the /cm/local/apps/cmd/etc/logging.cmd.conf config file. After modifying logging.cmd.conf one must either restart CMDaemon or run service cmd logconf to reload the logging config file. The default settings in this file are as follows:

Severity {
    info: *
 warning: *
   debug:
   error: *
}

This means that messages from all subsystems at all verbosity levels (except "debug") are always logged. Modifying logging.cmd.conf is useful when developing features for specific subsystems only, because it can quiet the logs from the remaining subsystems so that only the essentials remain.

For example, all log messages from the CLOUD subsystem can be enabled, while only WARNING and ERROR messages are allowed from all remaining subsystems:

Severity {
    info: CLOUD
 warning: *
   debug: CLOUD
   error: *
}
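To reason about which messages a given Severity block lets through, the rules described above can be modeled in a few lines. This is a simplified model of the semantics in this section (`*` enables all subsystems, an empty value enables none), not CMDaemon's own parser; the function names are hypothetical, and treating commas or spaces as separators between several subsystem names is an assumption.

```python
def parse_severity(text):
    """Parse a Severity { ... } block into {level: set_of_subsystems}."""
    levels = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Severity"):
            inside = True
            continue
        if line.startswith("}"):
            inside = False
            continue
        if inside and ":" in line:
            level, _, value = line.partition(":")
            # Assumption: multiple names are separated by commas or spaces.
            levels[level.strip()] = set(value.replace(",", " ").split())
    return levels


def is_logged(levels, level, subsystem):
    """Return True if a message at this level from this subsystem is logged."""
    enabled = levels.get(level, set())
    return "*" in enabled or subsystem in enabled


if __name__ == "__main__":
    config = """
    Severity {
        info: CLOUD
     warning: *
       debug: CLOUD
       error: *
    }
    """
    levels = parse_severity(config)
    for level, subsystem in [("debug", "CLOUD"), ("debug", "WLM"), ("warning", "WLM")]:
        print(level, subsystem, is_logged(levels, level, subsystem))
```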

One can also use logging.cmd.conf to optionally enable logging of thread IDs, subsystem names, and microsecond-resolution timestamps.

Global Debug Mode

One can toggle the so-called global debug mode with service cmd debug{on|off} or by starting cmd with the -d flag. In this mode, the custom settings from logging.cmd.conf are ignored and all log messages from all subsystems are logged at maximum verbosity. Global debug mode is equivalent to using the following logging.cmd.conf:

Severity {
    info: *
 warning: *
   debug: *
   error: *
}

LOGPREFIX

Use the LOGPREFIX("DeviceManager") macro to prepend additional text to all subsequent log{i,d,dd,e,w}() calls, for example:

Manager::someFun() {
  LOGPREFIX("SomeFunction:");
  logdd("entered function");
  ...
  logdd("left function");
}

This results in the following log messages:

"SomeFunction: entered function"
"SomeFunction: left function"