NCCL Tests Plugin

Important

  • This plugin is unavailable unless the DCGM_NCCL_TESTS_BIN_PATH environment variable is set. See Configuration under Prerequisites for details.

Overview

NCCL (pronounced “Nickel”) is a library providing inter-GPU communication primitives that are topology-aware and can be easily integrated into applications. NCCL Tests check both the performance and the correctness of NCCL operations. For more information about NCCL and its test suite, see the NCCL documentation and the NCCL Tests repository.

The NCCL Tests plugin, part of the diag Level 3 tests, validates system readiness for NCCL-based workloads by testing communication correctness on single-node systems.

Note: This plugin only runs single-node NCCL tests and does not require MPI support. Multi-node NCCL tests are not currently supported.

Prerequisites

Before running the NCCL Tests plugin, ensure the following prerequisites are met:

Software Requirements:

  • NCCL library (libnccl.so): not bundled with DCGM; must be installed separately.

  • NCCL Tests project: not bundled with DCGM; must be built and installed separately from the NCCL Tests repository.

  • CUDA toolkit (required by NCCL).

Permissions:

  • NCCL Library: Read permission (the library is dynamically loaded, not directly executed by the plugin).

  • NCCL Test Binary: The requirements depend on how DCGM is running:

    • When running as non-root:

      • Must be a regular file (not a directory, device, pipe, etc.).

      • Must have read and execute permissions.

    • When running as root (additional security checks enforced):

      • Must be owned by root user and root group (uid 0, gid 0).

      • Must not be writable by group or other users.

      • Must be executable by the owner.

      • Must be a regular file (not a directory, device, pipe, etc.).

      • Symlinks are automatically resolved to their canonical path before validation.
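The permission rules above can be checked by hand before starting DCGM. The following is a minimal sketch, not DCGM's actual implementation: check_nccl_binary is a hypothetical helper name, and the script assumes GNU coreutils (realpath -e and stat -c). It resolves symlinks in both modes for simplicity, although the list above only documents that step for the root case.

```shell
#!/usr/bin/env bash
# Sketch of the documented binary checks; NOT DCGM's own code.
# Assumes GNU coreutils (realpath -e, stat -c).

# check_nccl_binary PATH [yes|no]
# Hypothetical helper. Second argument "yes" applies the stricter
# checks documented for DCGM running as root.
check_nccl_binary() {
    local path as_root resolved owner mode
    path="$1"
    as_root="${2:-no}"

    # Resolve symlinks to the canonical path before validating.
    resolved="$(realpath -e -- "$path" 2>/dev/null)" || {
        echo "cannot resolve: $path"; return 1; }

    # Must be a regular file (not a directory, device, pipe, etc.).
    [ -f "$resolved" ] || { echo "not a regular file: $resolved"; return 1; }

    if [ "$as_root" = "yes" ]; then
        # Must be owned by uid 0 and gid 0.
        owner="$(stat -c '%u:%g' -- "$resolved")"
        [ "$owner" = "0:0" ] || {
            echo "not owned by root:root: $resolved"; return 1; }

        # stat prints the mode in octal; the leading 0 makes the shell
        # parse it as octal. 0022 = group-write | other-write.
        mode="$(stat -c '%a' -- "$resolved")"
        [ $(( 0$mode & 0022 )) -eq 0 ] || {
            echo "writable by group or others: $resolved"; return 1; }

        # 0100 = owner-execute.
        [ $(( 0$mode & 0100 )) -ne 0 ] || {
            echo "not executable by owner: $resolved"; return 1; }
    else
        # Non-root: read and execute permission for the current user.
        if [ ! -r "$resolved" ] || [ ! -x "$resolved" ]; then
            echo "missing read or execute permission: $resolved"; return 1
        fi
    fi
    echo "ok: $resolved"
}
```

For example, check_nccl_binary /path/to/nccl-tests/all_reduce_perf yes would apply the root-mode rules; all_reduce_perf is one of the binaries built by the NCCL Tests project and is used here only as an illustration, since this section does not state which binaries the plugin runs.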

Configuration:

  • The DCGM_NCCL_TESTS_BIN_PATH environment variable must be set to a valid directory containing the NCCL test binaries before starting nv-hostengine or the nvidia-dcgm service.

  • The plugin will be gracefully disabled and skipped if this environment variable is not set or points to an invalid directory.
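A quick pre-flight check of the variable can catch a disabled plugin before the host engine is started. This is a minimal sketch under stated assumptions: nccl_plugin_enabled is a hypothetical helper name, and "invalid directory" is interpreted here only as "not an existing directory"; DCGM may apply further checks that this section does not describe.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the documented enable/disable rule in the
# current shell. It does not inspect the environment of an already
# running nv-hostengine or nvidia-dcgm service.

# nccl_plugin_enabled (hypothetical helper)
# Returns 0 if DCGM_NCCL_TESTS_BIN_PATH is set and names a directory.
nccl_plugin_enabled() {
    if [ -z "${DCGM_NCCL_TESTS_BIN_PATH:-}" ]; then
        echo "disabled: DCGM_NCCL_TESTS_BIN_PATH is not set"
        return 1
    fi
    if [ ! -d "$DCGM_NCCL_TESTS_BIN_PATH" ]; then
        echo "disabled: not a directory: $DCGM_NCCL_TESTS_BIN_PATH"
        return 1
    fi
    echo "enabled: $DCGM_NCCL_TESTS_BIN_PATH"
}
```

Run it in the same shell that will launch nv-hostengine, after exporting the variable, so that the check sees the environment the host engine will inherit.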

Test Description

The NCCL Tests plugin validates NCCL functionality on the system by executing test binaries from the NCCL Tests project. These tests verify that NCCL collective communication operations work correctly across the GPUs in the system.

Plugin Validation:

The plugin performs the following validation checks:

  1. Binary Execution: Verifies the test binary can be launched successfully.

  2. Exit Code Check: Ensures the binary exits with code 0. A non-zero exit code results in test failure.

  3. Output Parsing and Validation: If the binary executes successfully, the plugin parses the test output and validates that NCCL operations completed correctly without errors.
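The three checks above can be approximated in a shell session when debugging a failing binary. This is a rough sketch, not the plugin's logic: run_and_validate is a hypothetical helper, and the output rule is an assumption based on the "Out of bounds values : 0 OK" summary line that NCCL Tests binaries print on success. The plugin's real parsing rules are not documented in this section.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the three documented validation steps.

# run_and_validate BINARY [ARGS...]  (hypothetical helper)
run_and_validate() {
    local output status

    # Steps 1 and 2: launch the binary, capture output and exit code.
    if output="$("$@" 2>&1)"; then status=0; else status=$?; fi

    # The shell reports 126 (not executable) or 127 (not found) when
    # the launch itself fails. A binary could also return these codes
    # itself, so this distinction is only a heuristic.
    if [ "$status" -eq 126 ] || [ "$status" -eq 127 ]; then
        echo "FAIL: could not launch (exit code $status)"
        return 1
    fi
    if [ "$status" -ne 0 ]; then
        echo "FAIL: exit code $status"
        return 1
    fi

    # Step 3 (ASSUMED rule): require the summary line reporting zero
    # out-of-bounds values.
    if ! printf '%s\n' "$output" |
        grep -E 'Out of bounds values[[:space:]]*:[[:space:]]*0[[:space:]]+OK' >/dev/null; then
        echo "FAIL: output did not report 0 out-of-bounds values"
        return 1
    fi
    echo "PASS"
}
```

A possible invocation is run_and_validate "$DCGM_NCCL_TESTS_BIN_PATH/all_reduce_perf" -g 1, where the binary name and the -g (GPU count) flag come from the NCCL Tests project and are chosen purely for illustration.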

Supported Parameters

The following table lists the parameters for the NCCL Tests plugin:

Parameter Name    Type    Default    Description
is_allowed        Bool    True       Specifies whether this test is allowed to run. When set to False, the test will be skipped.

Environment Variables:

The plugin requires the following environment variable:

  • DCGM_NCCL_TESTS_BIN_PATH: Path to the directory containing NCCL test binaries (e.g., /opt/nccl-tests/bin or /usr/local/nccl-tests/build)

If this environment variable is not set, the plugin will be gracefully disabled and will not appear in the diagnostic output.

Sample Commands

Configure environment variables for DCGM running as a systemd service:

# Add environment variables via systemd override
$ sudo systemctl edit nvidia-dcgm

# Add the following lines in the editor that opens:
[Service]
Environment="DCGM_NCCL_TESTS_BIN_PATH=/path/to/nccl-tests/"

# Save and exit, then restart the service
$ sudo systemctl restart nvidia-dcgm

# Run the NCCL tests
$ dcgmi diag -r nccl_tests

Run the NCCL tests as a standalone diagnostic with a custom binary location:

# Set environment variables and run nv-hostengine
$ export DCGM_NCCL_TESTS_BIN_PATH=/path/to/nccl-tests/
$ nv-hostengine

# Run the NCCL tests
$ dcgmi diag -r nccl_tests

Run the level 3 diagnostic including the NCCL tests (assuming nv-hostengine is running with the environment variables set):

$ dcgmi diag -r 3

Failure Conditions

The plugin will report a failure if:

  • The test binary cannot be executed or exits with a non-zero status code.

  • When running DCGM as root, the binary fails security checks (incorrect ownership or permissions).

  • The test output cannot be parsed correctly.

  • Any of the validation criteria described in the Plugin Validation section are not met.

Runtime

The NCCL Tests plugin typically completes in a few minutes, depending on:

  • The number of GPUs in the system

  • The interconnect topology and bandwidth

Systems with more GPUs or slower interconnects may take longer to complete the test.