Spectrum-X NCCL Plugin
The Spectrum-X NCCL Plugin provides a set of plugins designed to optimize NVIDIA’s NCCL library for Spectrum-X and Quantum platforms. It enables more resilient and higher-performance communication across these platforms. In HPC-X, the plugin is located at: $HPCX_DIR/nccl_spectrum-x_plugin/lib.
For NCCL to detect the network plugin, the plugin directory must be on the library search path. Add it explicitly by prepending it to LD_LIBRARY_PATH:
$ export LD_LIBRARY_PATH=$HPCX_DIR/nccl_spectrum-x_plugin/lib:$LD_LIBRARY_PATH
$ <run command>
With HPC-X, the plugin can also be selected through NCCL's NCCL_NET_PLUGIN environment variable by setting NCCL_NET_PLUGIN=spcx.
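The two loading methods described above can be combined in a small setup script. The following is a minimal sketch, not an official procedure: the fallback path /opt/hpcx is a placeholder, and turning on NCCL_DEBUG=INFO (a standard NCCL variable) is only a suggestion for checking in the NCCL log which network plugin was picked up.

```shell
# Assumption: HPCX_DIR points at the HPC-X installation root.
# /opt/hpcx is only a placeholder fallback for this sketch.
HPCX_DIR="${HPCX_DIR:-/opt/hpcx}"
PLUGIN_DIR="$HPCX_DIR/nccl_spectrum-x_plugin/lib"

# Warn early if the directory is missing; NCCL would not find the plugin there.
if [ ! -d "$PLUGIN_DIR" ]; then
    echo "warning: $PLUGIN_DIR not found" >&2
fi

# Prepend the plugin directory, avoiding a trailing ':' when
# LD_LIBRARY_PATH is unset or empty.
export LD_LIBRARY_PATH="$PLUGIN_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Select the Spectrum-X plugin by name.
export NCCL_NET_PLUGIN=spcx

# Optional: verbose NCCL logging, useful for confirming plugin loading.
export NCCL_DEBUG=INFO

# <run command> goes here.
```

Run the application from the same shell (or source this script first) so that the exported variables are inherited by the NCCL processes.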
The plugin provides the following capabilities:
Network Failure Recovery – Automatically detects and recovers from communication link failures, ensuring uninterrupted distributed training and system stability without user intervention.
Dynamic Network Load Balance – Dynamically redistributes data across multiple network interfaces based on available throughput and communication patterns, preventing bottlenecks and improving overall transfer efficiency.
Topology Awareness – Automatically detects long-haul connections using network topology information and applies optimized transport settings to enhance overall network performance.
NCCL Profiler Plugin – Enables detailed monitoring of NCCL’s internal network activity during communication.
Resiliency and Load Balancing
The Spectrum-X NCCL Plugin provides enhanced resiliency and load-balancing capabilities when multiple network devices (ports or NICs) are available per GPU in AI workloads. Starting with version 1.2, the plugin also considers the communication pattern that NCCL is executing, such as the number of peers and the traffic per peer, along with device load characteristics, to optimize the distribution of traffic across the network devices.
Topology Injection and Awareness
Starting with version 1.2, the Spectrum-X NCCL Plugin introduces a mechanism to specify network topology and transport parameters. The plugin supports specifying topology through a file format similar to Slurm’s topology format, where each entry defines connections between switches and hosts with relative latency or cable-length values.
Based on this topology, the plugin automatically identifies long-haul connections and applies the NVIDIA XGS configuration to them. It also improves the NCCL algorithm selection model by taking the latencies across the long-haul connections into account.
For more information, refer to NVIDIA XGS.
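For orientation, the snippet below writes a small file in Slurm's own topology.conf syntax (the tree topology format), which the plugin's format is described as resembling. This is a sketch of the Slurm format only: the switch and node names are hypothetical, and the plugin's syntax for latency or cable-length values is not given here, so none is shown. Consult the plugin documentation for the actual file format.

```shell
# Write a sample file in Slurm's topology.conf (tree) syntax.
# All switch and node names are hypothetical.
topo_file=$(mktemp)
cat > "$topo_file" <<'EOF'
# Leaf switches and the hosts attached to them
SwitchName=leaf0 Nodes=node[0-3]
SwitchName=leaf1 Nodes=node[4-7]
# Spine switch connecting the two leaf switches
SwitchName=spine0 Switches=leaf[0-1]
EOF

# Show the result.
cat "$topo_file"
```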
NetInspector Profiler
The Spectrum-X NCCL Plugin extends the NCCL Inspector profiler with NetInspector, which adds network-level instrumentation and telemetry.
NetInspector provides:
Transaction-level metrics – Per-transaction send/receive data including chunk sizes, resent bytes, and peer addressing.
Burst bandwidth telemetry – Real-time bandwidth measurements per InfiniBand device using exponential moving averages.
Queue Pair (QP) lifecycle – QP creation, weight updates, and load-balancing metrics.
Device resilience – Device failure and recovery events with bandwidth delta tracking.
Link error monitoring – Cumulative link-error counts per device.
NetInspector operates at configurable verbosity levels and exports telemetry in JSON Lines (.jsonl) format with microsecond-level timestamps for performance analysis.
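The burst bandwidth telemetry above is described as using exponential moving averages. As a generic illustration of that smoothing technique, and not of NetInspector's actual implementation, the sketch below applies avg = alpha * sample + (1 - alpha) * avg to a short series. The smoothing factor 0.5 and the sample values are hypothetical.

```shell
# Generic exponential moving average over samples read from stdin.
# $1 is the smoothing factor alpha (hypothetical value, chosen by the caller).
ema() {
    awk -v alpha="$1" '
        NR == 1 { avg = $1 }                              # seed with the first sample
        NR > 1  { avg = alpha * $1 + (1 - alpha) * avg }  # blend new sample into average
        { printf "%.2f\n", avg }                          # emit the running average
    '
}

# Hypothetical bandwidth samples; the dip to 20 is damped in the average.
printf '%s\n' 40 48 20 44 | ema 0.5

# Keep only the final smoothed value.
result=$(printf '%s\n' 40 48 20 44 | ema 0.5 | tail -n 1)
echo "final: $result"
```

A larger alpha tracks recent samples more closely, while a smaller alpha smooths short bursts more aggressively.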
Variable | Default | Description |
NCCL_INSPECTOR_ENABLE | 0 | Enables the Inspector plugin. Set to |
NCCL_INSPECTOR_NET_EVENTS | 0 | Controls network-event verbosity: |
NCCL_PROFILER_PLUGIN | - | Specifies the path to the profiler plugin library. Example: |
NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS | 500 | Interval (in microseconds) for the internal dump thread to write output to disk. Lower values increase frequency but may affect performance. |
NCCL_INSPECTOR_DUMP_VERBOSE | 0 | Enables detailed event trace output (sequence numbers, timestamps). Set to |
NCCL_INSPECTOR_DUMP_DIR | - | Sets output directory for profiler logs. Defaults to |
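The variables in the table can be set together before launching a job. The following is a minimal sketch under stated assumptions: the value 1 for NCCL_INSPECTOR_ENABLE is assumed to be the enabling value (the default is 0), the profiler library path is a placeholder to be replaced with the real one, and the output directory name is hypothetical. NCCL_INSPECTOR_NET_EVENTS is left at its default because the verbosity levels are not listed here.

```shell
# Assumption: 1 enables the Inspector plugin (the documented default is 0).
export NCCL_INSPECTOR_ENABLE=1

# Placeholder path: set PROFILER_LIB (a hypothetical helper variable)
# to the actual profiler plugin library before using this.
export NCCL_PROFILER_PLUGIN="${PROFILER_LIB:-/path/to/profiler-plugin.so}"

# Hypothetical output directory for the profiler logs.
export NCCL_INSPECTOR_DUMP_DIR="${TMPDIR:-/tmp}/nccl_inspector_logs"
mkdir -p "$NCCL_INSPECTOR_DUMP_DIR"

# Keep the documented default dump interval, stated explicitly.
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500

# <run command> goes here.

# After the run, list any JSON Lines files written to the directory.
ls "$NCCL_INSPECTOR_DUMP_DIR"/*.jsonl 2>/dev/null || echo "no .jsonl output yet"
```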