DCGM Multi-Node Diagnostics
Overview
The DCGM multi-node diagnostics (mndiag) feature enables users to validate the performance and stress resilience of GPUs across an entire cluster by actively stressing the GPUs and their interconnections in a coordinated and automated fashion.
Terminology
Term | Description
---|---
DCGM | Data Center GPU Manager
Mnubergemm | Multi-node ubergemm: an MPI application that stresses GPUs across hosts simultaneously.
Headnode | The node from which the multi-node diagnostics are launched.
MPI | Message Passing Interface: a popular interface for launching work that spans hosts, implemented by OpenMPI and other programs. Mnubergemm relies on MPI to run.
DCGM Multi-Node Diagnostics Goals
DCGM multi-node diagnostics are designed to run automated multi-node stress testing by orchestrating and executing the stress test (currently only mnubergemm) simultaneously on all GPUs across multiple nodes to validate interconnects, memory, and compute. The diagnostics use centralized orchestration and monitoring with the help of a head node running the DCGM hostengine to coordinate test execution, monitor progress, and collect results from all participating nodes also running DCGM hostengines. The diagnostics also allow for flexible selection of nodes to participate in the test, thereby supporting complex cluster topologies.
Supported Versions
DCGM 4.3.0 onwards
Supported Products
GB200 NVL (SKU - 0x2941)
Prerequisites
All nodes in the cluster that are targeted by the test require the following to be installed and configured (a verification sketch follows the note below):
OpenMPI
Passwordless ssh access between all nodes in the cluster
Note
Each node must be running the same NVIDIA driver version.
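The following is a minimal sketch for spot-checking these prerequisites from the intended head node; node1 and node2 are placeholder hostnames:
# Confirm OpenMPI is installed and on the PATH
mpirun --version
# Confirm passwordless SSH works to each targeted node
ssh node1 hostname
ssh node2 hostname
# Confirm every node reports the same NVIDIA driver version
ssh node1 nvidia-smi --query-gpu=driver_version --format=csv,noheader
ssh node2 nvidia-smi --query-gpu=driver_version --format=csv,noheader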
Supported Tests for Multi-Node Diagnostics
Mnubergemm
Getting Started with DCGM Multi-Node Diagnostics
Command-line Options
The command-line options control general execution parameters. The following table lists the options supported by DCGM multi-node diagnostics:
Short option | Long option | Parameter | Description
---|---|---|---
| --hostList | host_name:port or host_name:socket_address. Example: node1:5000;node2;node3:unix:///tmp/dcgm.sock | Required. List of hosts to run diagnostics on. Each entry can specify a port or Unix socket.
| --hostEngineAddress | IP/FQDN | Connects to the specified IP or FQDN for the head node host engine. Default: localhost.
-r | --run | test_name | Multi-node diagnostics test to run. Only mnubergemm is currently supported and is the default value.
-p | --parameters | test_name.variable_name=variable_value. Example: mnubergemm.time_to_run=300;mnubergemm.flag | Test parameters to set for this run. Multiple parameters can be separated by semicolons. Currently only mnubergemm.time_to_run is supported.
-j | --json | | Print the output in JSON format.
-h | --help | | Displays usage information and exits.
Usage Examples
Basic Multi-Node Diagnostics
dcgmi should be launched on any one of the nodes in the cluster (the head node) with a list of hostnames or IP addresses of all the nodes to be targeted by the test, including the head node itself. The head node orchestrates the test run on all the nodes in the list.
dcgmi mndiag --hostList "node1;node2;node3"
Specify Port
If nv-hostengine is running on a non-default port or a Unix socket on any node, specify it alongside the host name in the host list. The port of the head node's host engine can likewise be given with --hostEngineAddress.
dcgmi mndiag --hostEngineAddress 10.0.0.1:5555 --hostList "node1:5000;node2;node3:unix:///tmp/dcgm.sock"
Set Test Duration and Pass Additional Parameters
This command sets the test duration to 600 seconds (10 minutes) for the mnubergemm test on node1 and node2. Additional parameters can be passed using the -p option.
dcgmi mndiag --hostList "node1;node2" -p "mnubergemm.time_to_run=600"
Output Results in JSON Format
This command runs the diagnostic on node1 and node2 and outputs the results in JSON format.
dcgmi mndiag --hostList "node1;node2" -j
Custom Head Node Address
If the head node's host engine is not listening on localhost, its address can be provided separately with --hostEngineAddress.
dcgmi mndiag --hostList "node1;node2" --hostEngineAddress 192.168.1.100
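The options above can also be combined. The following sketch (hostnames and address are placeholders) runs a shortened test against three nodes with a custom head node address and JSON output:
dcgmi mndiag --hostList "node1;node2;node3" --hostEngineAddress 192.168.1.100 -p "mnubergemm.time_to_run=600" -j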
Mnubergemm Test Overview
Mnubergemm is a high-performance benchmarking and testing framework targeting matrix multiplication (GEMM) and related computational workloads, with a focus on distributed and heterogeneous computing environments. It is designed to evaluate the performance, correctness, and robustness of GEMM operations and other compute kernels (such as network, CPU, and error correction tests) across clusters of CPUs and GPUs. The framework supports configurable workloads, fine-tuning, and synchronization across multiple nodes using MPI, making it suitable for stress-testing, performance characterization, and validation of both hardware and software stacks in modern data centers.
Test Setup
An installation of OpenMPI (version 4.1.1 or any later ABI-compatible version) on each of the participating machines.
Note
This requirement should be satisfied automatically for systems in which the DCGM multi-node package has been installed using the system package manager.
The value of the DCGM_MNDIAG_MPIRUN_PATH environment variable for the nv-hostengine process should be set to the path of the mpirun command.
The value of the LD_LIBRARY_PATH environment variable for the nv-hostengine process should be set to the directory containing the OpenMPI libraries (a systemd-based sketch follows the note below).
Note
This requirement should be satisfied automatically for systems in which the DCGM multi-node package has been installed using the system package manager on a DCGM-supported Linux distribution and the nvidia-dcgm.service systemd unit is used to manage the nv-hostengine daemon.
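On systemd-managed installations, one way to set both variables for the nv-hostengine process is a drop-in override for nvidia-dcgm.service. This is a sketch only; the mpirun and library paths below are placeholders that must match the local OpenMPI installation:
# Open a drop-in override for the DCGM service
sudo systemctl edit nvidia-dcgm
# In the editor, add (example paths only):
#   [Service]
#   Environment="DCGM_MNDIAG_MPIRUN_PATH=/usr/bin/mpirun"
#   Environment="LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/openmpi/lib"
# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart nvidia-dcgm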
An account is available on all participating machines with the same name as the user invoking dcgmi to launch the multi-node diagnostics. Either the login shells for the user launching the multi-node diagnostic are configured to use OpenMPI, or the installation root directory of OpenMPI is the same on each participating machine and OpenMPI has been configured to propagate the installation root by default.
Non-interactive SSH must be configured between all the nodes. Both dcgmi and OpenMPI expect SSH keys to be configured such that the following works:
ssh <remote node>
Warning
SSH Key Security and Non-Interactive Access
DCGM Multi-Node Diagnostics require passwordless SSH access between the head node and all participating nodes. For this to work in a non-interactive context, the SSH private key used for authentication must be stored unencrypted on disk on the head node (a key-setup sketch follows this warning).
Using an ssh-agent to load encrypted keys into memory will not work in this scenario:
$ eval $(ssh-agent)
$ ssh-add <path to SSH key>
$ dcgmi mndiag <args>...
This is because the SSH sessions required by the diagnostic are spawned by the nv-hostengine process (managed by the nvidia-dcgm service), not as a child of the user’s shell or dcgmi process. As a result, the agent’s environment is not inherited by these SSH sessions, and the agent cannot provide the key. The SSH private key must be available unencrypted on disk for passwordless access to work in this context.
For more information, see the OpenMPI documentation on non-interactive SSH.
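One possible key setup under these constraints is sketched below: an unencrypted key pair is generated on the head node and its public key is copied to each participating node (node1 and node2 are placeholders; review local security policy before storing unencrypted keys):
# Generate a key with no passphrase (stored unencrypted on disk)
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
# Copy the public key to each participating node
ssh-copy-id node1
ssh-copy-id node2
# Verify non-interactive access
ssh node1 true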
The nv-hostengine daemon should be started on all the nodes (a start-up sketch follows this list).
Each node should have the same NVIDIA driver version.
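On distributions where DCGM is installed through the system package manager, starting the daemon on each node can look like the following sketch, using the nvidia-dcgm systemd unit mentioned above:
# Start the DCGM host engine on this node and enable it at boot
sudo systemctl enable --now nvidia-dcgm
# Confirm the daemon is running
systemctl status nvidia-dcgm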
Supported Parameters
Parameter | Description
---|---
mnubergemm.time_to_run | Run time of the mnubergemm test (in seconds). Default value is 3600 seconds.
Sample output
Executing: ./dcgmi mndiag --hostList "node1:5000;node2;node3;node4;node5;node6;node7:4000;node8"
Successfully ran multi-node diagnostics.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| MNUBERGEMM Test | Pass |
| Driver Version | 570.133 |
| DCGM Version | 4.3.0 |
| Hosts Found | 8 |
| Hosts With Errors | 0 |
| Host List | 0: node1 |
| | 1: node2 |
| | 2: node3 |
| | 3: node4 |
| | 4: node5 |
| | 5: node6 |
| | 6: node7 |
| | 7: node8 |
| Total GPUs | 32 |
|----- Host Details ------+------------------------------------------------|
| Host 0 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 1 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 2 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 3 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 4 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 5 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 6 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
| Host 7 | Pass |
| GPUs: 0, 1, 2, 3 | Pass |
+---------------------------+------------------------------------------------+
=========================================
Test completed at: Sat Jun 21 09:49:50 AM PDT 2025
Exit code: 0
Process monitoring will be stopped
=========================================
Failure Conditions
The multi-node diagnostic will fail in the following cases:
Uncorrectable GPU memory errors, persistent compute errors, or other GPU hardware faults are detected on any participating node.
NVLink, PCIe, or other interconnect errors prevent communication between GPUs or nodes, or cause high latency or dropped connections.
The observed compute or memory bandwidth is significantly lower than expected, or performance is inconsistent or abnormally low across GPUs or nodes.
Insufficient GPU or system memory is available for the test workload, or one or more required GPUs are not detected or unavailable on a node.
One or more nodes are unreachable, misconfigured, or have inconsistent software or driver versions.
The test exceeds the allowed runtime (default: 1 hour, plus a 60-second allowance for latency).
Required binaries (such as mnubergemm or mpirun) are missing or not executable on any node.
SSH or network issues prevent launching processes on remote nodes.
Malformed or unsupported test parameters, or invalid host list syntax, are provided.
The test process exits abnormally (crash, signal, or non-zero exit code).
Any error message is detected in the test output indicating a failure on any GPU or node.
If any of these conditions are met, the multi-node diagnostic will report a failure and provide details about the affected nodes and GPUs.
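For unattended runs, a simple wrapper can branch on the exit status and keep the JSON report for later inspection. This is a sketch that assumes, as the sample output above suggests, that a passing run exits with code 0 and a failing run exits with a non-zero code:
# Run the diagnostic with JSON output and keep the report
dcgmi mndiag --hostList "node1;node2" -j > mndiag_report.json
if [ $? -ne 0 ]; then
    echo "Multi-node diagnostic reported a failure; see mndiag_report.json for details"
fi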