NVIDIA HPC-X Software Toolkit Rev 2.25.1

Spectrum-X NCCL Plugin

The Spectrum-X NCCL Plugin provides a set of plugins designed to optimize NVIDIA’s NCCL library for Spectrum-X and Quantum platforms. It enables more resilient and higher-performance communication across these platforms. In HPC-X, the plugin is located at: $HPCX_DIR/nccl_spectrum-x_plugin/lib.

For NCCL to detect the network plugin, the plugin directory must be on the library search path. Add it explicitly by prepending it to LD_LIBRARY_PATH:


$ export LD_LIBRARY_PATH=$HPCX_DIR/nccl_spectrum-x_plugin/lib:$LD_LIBRARY_PATH
$ <run command>

With HPC-X, the plugin can also be loaded by setting the NCCL environment variable NCCL_NET_PLUGIN=spcx.
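When a job is launched from Python, the two settings described above can be assembled into an environment for the child process. The following is a minimal sketch, not part of HPC-X: the helper name build_plugin_env and the /opt/hpcx fallback path are hypothetical, and the sketch assumes the plugin directory should remain on the loader path even when NCCL_NET_PLUGIN=spcx is set. Unlike the shell one-liner, it does not leave a trailing colon when LD_LIBRARY_PATH was previously empty.

```python
import os
from typing import Mapping, Optional


def build_plugin_env(hpcx_dir: str,
                     base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of base_env with the Spectrum-X plugin settings applied.

    build_plugin_env is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Plugin location as documented: $HPCX_DIR/nccl_spectrum-x_plugin/lib
    plugin_lib = os.path.join(hpcx_dir, "nccl_spectrum-x_plugin", "lib")

    # Prepend the plugin directory; skip the separator if nothing was set before.
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = plugin_lib + (":" + existing if existing else "")

    # Ask NCCL to load the plugin by name.
    env["NCCL_NET_PLUGIN"] = "spcx"
    return env


if __name__ == "__main__":
    # "/opt/hpcx" is a placeholder used only when HPCX_DIR is not set.
    env = build_plugin_env(os.environ.get("HPCX_DIR", "/opt/hpcx"))
    print(env["LD_LIBRARY_PATH"])
    # Pass env to the launcher, e.g. subprocess.run([...], env=env, check=True)
```

The environment is built for a child process because the dynamic loader reads LD_LIBRARY_PATH at process start; changing it inside an already running interpreter does not affect later library lookups.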

The plugin provides the following features:

  • Network Failure Recovery – Automatically detects and recovers from communication link failures, ensuring uninterrupted distributed training and system stability without user intervention.

  • Dynamic Network Load Balance – Dynamically redistributes data across multiple network interfaces based on available throughput and communication patterns, preventing bottlenecks and improving overall transfer efficiency.

  • Topology Awareness – Automatically detects long-haul connections using network topology information and applies optimized transport settings to enhance overall network performance.

  • NCCL Profiler Plugin – Enables detailed monitoring of NCCL’s internal network activity during communication.

Resiliency and Load Balancing

The Spectrum-X NCCL Plugin provides enhanced resiliency and load-balancing capabilities when multiple network devices (ports or NICs) are available per GPU in AI workloads. Starting with version 1.2, the plugin also considers the communication pattern that NCCL is executing, such as the number of peers and the traffic per peer, together with device load characteristics, to distribute traffic optimally across the available network devices.
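To make the idea of load-aware distribution concrete, the sketch below splits a transfer across devices in proportion to their currently available bandwidth and skips devices that report none. This is only an illustration of the general technique, not the plugin's actual algorithm; the function name split_bytes, the device names, and the bandwidth figures are all hypothetical.

```python
from typing import Dict


def split_bytes(total_bytes: int, available_gbps: Dict[str, float]) -> Dict[str, int]:
    """Split total_bytes across devices in proportion to available bandwidth.

    Illustrative only: the real plugin also weighs the communication pattern
    (number of peers, traffic per peer), which this sketch ignores.
    """
    # Devices with no available bandwidth (for example, a failed link) get nothing.
    usable = {dev: bw for dev, bw in available_gbps.items() if bw > 0}
    if not usable:
        raise ValueError("no usable network device")

    total_bw = sum(usable.values())
    shares = {dev: int(total_bytes * bw / total_bw) for dev, bw in usable.items()}

    # Integer truncation can leave a few bytes unassigned; give them to the
    # device with the most available bandwidth.
    remainder = total_bytes - sum(shares.values())
    fastest = max(usable, key=usable.get)
    shares[fastest] += remainder
    return shares


if __name__ == "__main__":
    # Hypothetical device names and bandwidth values.
    print(split_bytes(1000, {"mlx5_0": 300.0, "mlx5_1": 100.0, "mlx5_2": 0.0}))
```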

Topology Injection and Awareness

Starting with version 1.2, the Spectrum-X NCCL Plugin introduces a mechanism to specify network topology and transport parameters. The plugin supports specifying topology through a file format similar to Slurm’s topology format, where each entry defines connections between switches and hosts with relative latency or cable-length values.

Based on this topology, the plugin automatically identifies long-haul connections and takes advantage of the NVIDIA XGS configuration for them. It also improves the NCCL algorithm selection model by taking the latencies of the long-haul connections into account.

For more information, refer to NVIDIA XGS.
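As an illustration of how long-haul connections might be picked out from relative latency values, the sketch below flags any link whose latency is much larger than the typical link in the topology. It works on an in-memory list rather than on the plugin's topology file, whose exact syntax is not reproduced here. The Link type, the find_long_haul function, the median-based rule, and the factor of 10 are all assumptions made for this example and do not describe the plugin's actual detection logic.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass(frozen=True)
class Link:
    """One connection between two switches or between a switch and a host."""
    a: str
    b: str
    relative_latency: float  # relative latency or cable-length value


def find_long_haul(links: List[Link], factor: float = 10.0) -> List[Link]:
    """Return links whose latency is at least `factor` times the median latency.

    Hypothetical rule for illustration; the threshold is an assumption.
    """
    if not links:
        return []
    if any(link.relative_latency <= 0 for link in links):
        raise ValueError("relative latencies must be positive")

    typical = median(link.relative_latency for link in links)
    return [link for link in links if link.relative_latency >= factor * typical]


if __name__ == "__main__":
    # Hypothetical topology: three short links and one inter-site link.
    topology = [
        Link("leaf1", "host1", 1.0),
        Link("leaf1", "host2", 1.0),
        Link("leaf1", "spine1", 2.0),
        Link("spine1", "spine2", 100.0),
    ]
    for link in find_long_haul(topology):
        print(link.a, link.b, link.relative_latency)
```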

NetInspector Profiler

The Spectrum-X NCCL Plugin extends the NCCL Inspector profiler with NetInspector, which adds network-level instrumentation and telemetry.

NetInspector provides:

  • Transaction-level metrics – Per-transaction send/receive data including chunk sizes, resent bytes, and peer addressing.

  • Burst bandwidth telemetry – Real-time bandwidth measurements per InfiniBand device using exponential moving averages.

  • Queue Pair (QP) lifecycle – QP creation, weight updates, and load-balancing metrics.

  • Device resilience – Device failure and recovery events with bandwidth delta tracking.

  • Link error monitoring – Cumulative link-error counts per device.
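The burst bandwidth item above mentions exponential moving averages. For readers unfamiliar with the technique, here is a minimal sketch of the standard formula, in which each new sample is blended with the previous average. The smoothing factor alpha and the sample values are assumptions; the source does not state which parameters NetInspector uses.

```python
from typing import Iterable, List


def ema_series(samples: Iterable[float], alpha: float = 0.25) -> List[float]:
    """Return the exponential moving average after each sample.

    ema_t = alpha * sample_t + (1 - alpha) * ema_{t-1}, seeded with the
    first sample. The default alpha is an arbitrary illustrative choice.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    smoothed: List[float] = []
    current = None
    for sample in samples:
        if current is None:
            current = sample  # seed with the first measurement
        else:
            current = alpha * sample + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


if __name__ == "__main__":
    # Hypothetical per-device bandwidth samples in Gb/s.
    print(ema_series([100.0, 200.0, 200.0], alpha=0.5))
```

A larger alpha tracks bursts more quickly, while a smaller alpha gives a smoother but slower-moving estimate.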

NetInspector operates at configurable verbosity levels and exports telemetry in JSON Lines (.jsonl) format with microsecond-level timestamps for performance analysis.
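Because the output is JSON Lines, each line of a dump file can be parsed independently as one JSON object. The sketch below reads such a file and counts records by the value of one field. The field name is supplied by the caller because the NetInspector record schema is not described here; the "event_type" field and the file name used in the example are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def count_by(path: str, field: str) -> Counter:
    """Count records by the value of `field`; records without it are grouped."""
    return Counter(
        str(record.get(field, "<missing>")) for record in read_jsonl(path)
    )


if __name__ == "__main__":
    # Hypothetical file name and field name; adjust to the actual dump output.
    for value, count in count_by("netinspector.jsonl", "event_type").most_common():
        print(value, count)
```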

| Variable | Default | Description |
|---|---|---|
| NCCL_INSPECTOR_ENABLE | 0 | Enables the Inspector plugin. Set to 1 to enable profiling. |
| NCCL_INSPECTOR_NET_EVENTS | 0 | Controls network-event verbosity: 0 = disabled, 1 = collective and kernel-channel events only, 2 = full per-transaction logging. |
| NCCL_PROFILER_PLUGIN | - | Specifies the path to the profiler plugin library. Example: /path/to/libnccl-profiler-inspector.so, or inspector to search in LD_LIBRARY_PATH. |
| NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS | 500 | Interval (in microseconds) for the internal dump thread to write output to disk. Lower values increase frequency but may affect performance. |
| NCCL_INSPECTOR_DUMP_VERBOSE | 0 | Enables detailed event trace output (sequence numbers, timestamps). Set to 1 for verbose mode. |
| NCCL_INSPECTOR_DUMP_DIR | - | Sets output directory for profiler logs. Defaults to nccl-inspector-unknown-jobid or nccl-inspector-<slurm_job_id> when running under SLURM. |
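The variables above can be combined into a launch environment. The following is a minimal sketch using only the variable names and values documented in this section; the helper name netinspector_env and the example directory are hypothetical, and the remaining variables are left at their defaults.

```python
import os
from typing import Mapping, Optional


def netinspector_env(dump_dir: str,
                     net_events: int = 1,
                     base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of base_env with NetInspector profiling enabled.

    netinspector_env is a hypothetical helper for illustration only.
    """
    # Documented levels: 0 = disabled, 1 = collective and kernel-channel
    # events only, 2 = full per-transaction logging.
    if net_events not in (0, 1, 2):
        raise ValueError("net_events must be 0, 1, or 2")

    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "NCCL_INSPECTOR_ENABLE": "1",
        "NCCL_INSPECTOR_NET_EVENTS": str(net_events),
        # "inspector" makes NCCL search for the library in LD_LIBRARY_PATH.
        "NCCL_PROFILER_PLUGIN": "inspector",
        "NCCL_INSPECTOR_DUMP_DIR": dump_dir,
    })
    return env


if __name__ == "__main__":
    # "/tmp/nccl-inspector" is a placeholder output directory.
    env = netinspector_env("/tmp/nccl-inspector", net_events=2)
    for name in sorted(k for k in env if k.startswith("NCCL_")):
        print(name + "=" + env[name])
```

Level 2 logs every transaction, so it is probably best reserved for short diagnostic runs rather than long training jobs.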

© Copyright 2025, NVIDIA. Last updated on Nov 24, 2025