NVIDIA HPC-X Software Toolkit Rev 2.26

Spectrum-X NCCL Plugin

The Spectrum-X NCCL Plugin provides a set of plugins designed to optimize NVIDIA’s NCCL library for Spectrum-X and Quantum platforms. It enables more resilient and higher-performance communication across these platforms. In HPC-X, the plugin is located at: $HPCX_DIR/nccl_spectrum-x_plugin/lib.

For NCCL to detect the network plugin, the plugin directory must be on the library search path. One way to do this is to prepend it to LD_LIBRARY_PATH before running the application:

$ export LD_LIBRARY_PATH=$HPCX_DIR/nccl_spectrum-x_plugin/lib:$LD_LIBRARY_PATH
$ <run command>

With HPC-X, the plugin can also be selected through NCCL's NCCL_NET_PLUGIN environment variable by setting NCCL_NET_PLUGIN=spcx.
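The two loading options can be combined in a small shell helper. The following is a minimal sketch, not part of HPC-X: use_spcx_plugin is a hypothetical function name, and setting both variables together is an assumption made for illustration. It uses the ${VAR:+...} expansion so that an empty LD_LIBRARY_PATH does not leave a trailing colon, which the dynamic loader would interpret as the current working directory.

```shell
# use_spcx_plugin HPCX_DIR
# Hypothetical helper (not shipped with HPC-X): puts the plugin directory at
# the front of LD_LIBRARY_PATH and selects the plugin by name.
use_spcx_plugin() {
    # Warn, but do not fail, if the expected directory is missing.
    if [ ! -d "$1/nccl_spectrum-x_plugin/lib" ]; then
        echo "warning: $1/nccl_spectrum-x_plugin/lib not found" >&2
    fi
    # Append the old value only when it is non-empty, so no trailing ':' remains.
    LD_LIBRARY_PATH="$1/nccl_spectrum-x_plugin/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    NCCL_NET_PLUGIN=spcx
    export LD_LIBRARY_PATH NCCL_NET_PLUGIN
}
```

A typical call would be use_spcx_plugin "$HPCX_DIR" followed by the run command. Setting NCCL_DEBUG=INFO for the run should make NCCL log which network plugin it picked up, which is a quick way to confirm the setup.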

The plugin provides the following features:

  • Network Failure Recovery – Automatically detects and recovers from communication link failures, ensuring uninterrupted distributed training and system stability without user intervention.

  • Dynamic Network Load Balance – Dynamically redistributes data across multiple network interfaces based on available throughput and communication patterns, preventing bottlenecks and improving overall transfer efficiency.

  • Topology Awareness – Automatically detects long-haul connections using network topology information and applies optimized transport settings to enhance overall network performance.

  • NCCL Profiler Plugin – Enables detailed monitoring of NCCL’s internal network activity during communication.

Resiliency and Load Balancing

The Spectrum-X NCCL Plugin provides enhanced resiliency and load-balancing capabilities when multiple network devices (ports or NICs) are available per GPU in AI workloads. Starting from version 1.2, the plugin also considers the communication pattern that NCCL is executing, such as the number of peers and the traffic per peer, together with device load characteristics, when it distributes traffic across the network devices.

Topology Injection and Awareness

Starting with version 1.2, the Spectrum-X NCCL Plugin introduces a mechanism to specify network topology and transport parameters. The plugin supports specifying topology through a file format similar to Slurm’s topology format, where each entry defines connections between switches and hosts with relative latency or cable-length values.

Based on this topology, the plugin automatically identifies long-haul connections and uses the NVIDIA XGS configuration for them. It also improves the NCCL algorithm selection model by taking the latencies of the long-haul connections into account.

For more information, refer to NVIDIA XGS.

Configurable Bandwidth Loss Limits

Starting with version 1.3, the Spectrum-X NCCL Plugin introduces configurable thresholds for acceptable bandwidth degradation during transparent failover. Applications can specify multi-tier limits where higher bandwidth loss triggers shorter tolerance windows before a fatal error is raised. Configuration is done through the NCCL_IB_NIC_BW_LOSS_LIMITS environment variable, which accepts comma-separated pairs of loss percentage and duration in minutes:

$ export NCCL_IB_NIC_BW_LOSS_LIMITS=25:2,50:1

This example configuration raises a fatal error if bandwidth loss reaches 25% for 2 minutes or longer, or reaches 50% for 1 minute or longer. Brief performance drops during recovery are tolerated, while long-running jobs are prevented from continuing with severely degraded performance.
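To make the pair syntax concrete, the sketch below splits a limits string into its tiers and prints one line per tier. It is an illustration only: describe_bw_loss_limits is a hypothetical helper, and its validation rules (whole numbers, loss of at most 100) are assumptions rather than the plugin's documented parsing behavior.

```shell
# describe_bw_loss_limits "LOSS:MINUTES[,LOSS:MINUTES...]"
# Hypothetical helper: explains a value intended for NCCL_IB_NIC_BW_LOSS_LIMITS.
# The body is a subshell, so the IFS and set -f changes do not leak out.
describe_bw_loss_limits() (
    set -f      # disable filename expansion while splitting on commas
    IFS=','
    for pair in $1; do
        case "$pair" in
            *:*) ;;
            *) echo "invalid pair '$pair' (expected loss:minutes)" >&2; exit 1 ;;
        esac
        loss="${pair%%:*}"      # text before the first ':'
        minutes="${pair#*:}"    # text after the first ':'
        for field in "$loss" "$minutes"; do
            case "$field" in
                ''|*[!0-9]*) echo "invalid number in '$pair'" >&2; exit 1 ;;
            esac
        done
        if [ "$loss" -gt 100 ]; then
            echo "loss percentage above 100 in '$pair'" >&2
            exit 1
        fi
        printf 'loss >= %s%% for %s+ min -> fatal error\n' "$loss" "$minutes"
    done
)
```

For the example value above, describe_bw_loss_limits "25:2,50:1" prints one line for the 25% tier and one for the 50% tier, and a malformed value such as "25-2" is rejected with a non-zero exit status.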

SHARP

The plugin supports the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), an in-network computing technology on InfiniBand Quantum switches.

The following environment variable enables SHARP aggregation with NCCL when using the plugin:

NCCL_COLLNET_ENABLE=1

Note

NVIDIA switches allow a limited number of streaming aggregation flows (maximum: 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one streaming aggregation flow (NCCL Ring/Channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch is connected to the same HCA rail on every server.

The following environment variable enables SHARP Allgather overlap when using the plugin. This is useful when SHARP-based Reduce-Scatter is enabled, because the Reduce-Scatter and Allgather phases can then overlap:

SHARP_COLLNET_OVERLAP_AG=1
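As a sketch of how the two SHARP settings might be applied to a multi-node run: the variables must reach every rank, not only the launching shell. Enabling both together, and launching through Open MPI's mpirun (whose -x option forwards an exported variable to the ranks), are assumptions for illustration.

```shell
# Enable SHARP aggregation and, as an assumed pairing, Allgather overlap.
export NCCL_COLLNET_ENABLE=1
export SHARP_COLLNET_OVERLAP_AG=1

# With Open MPI's mpirun, -x forwards an exported variable to every rank:
#   mpirun -x NCCL_COLLNET_ENABLE -x SHARP_COLLNET_OVERLAP_AG <run command>
```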


NetInspector Profiler

The Spectrum-X NCCL Plugin extends the NCCL Inspector profiler with NetInspector, which adds network-level instrumentation and telemetry.

NetInspector provides:

  • Transaction-level metrics – Per-transaction send/receive data including chunk sizes, resent bytes, div sizes (for app-aware load balancing), and peer addressing.

  • Burst bandwidth telemetry – Real-time bandwidth measurements per InfiniBand device using exponential moving averages (EMA with configurable beta).

  • Burst slowdown detection – Configurable threshold-based alerts when burst bandwidth falls below expected levels, with deficit tracking.

  • Queue Pair (QP) lifecycle – QP creation, weight updates, and cumulative load-balancing weight metrics per device.

  • Device resilience – Device failure and recovery events with total bandwidth, active device count, and bandwidth delta tracking.

  • Link error monitoring – Link error events with IB work completion status codes and cumulative link-error counts per device.

  • Async thread events – InfiniBand asynchronous event monitoring (port state changes, errors, etc.) captured from the IB async thread.

  • Collective performance – Per-collective metrics including algorithm bandwidth, bus bandwidth, message sizes, execution times, and timing source (GPU/CPU). Supports optional verbose mode with detailed event sequence numbers and timestamps per kernel channel.
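The burst bandwidth and slowdown items above can be illustrated with a short calculation. This is a sketch under stated assumptions, not NetInspector's implementation: ema_slowdown is a hypothetical helper, the smoothing rule is assumed to be ema = beta * ema + (1 - beta) * sample, seeded with the first sample, and the threshold check is assumed to apply to the smoothed value, with the deficit defined as the threshold minus that value.

```shell
# ema_slowdown BETA THRESHOLD_MBPS
# Hypothetical helper: reads one bandwidth sample (MB/s) per line on stdin and
# prints the smoothed value, flagging samples where it drops below the threshold.
ema_slowdown() {
    awk -v beta="$1" -v threshold="$2" '
        NR == 1 { ema = $1 }                             # seed with the first sample
        NR > 1  { ema = beta * ema + (1 - beta) * $1 }   # assumed EMA update rule
        {
            if (threshold > 0 && ema < threshold)        # threshold 0 disables the check
                printf "%.1f slowdown deficit=%.1f\n", ema, threshold - ema
            else
                printf "%.1f ok\n", ema
        }'
}
```

With beta = 0.5 and a threshold of 900 MB/s, the samples 1000, 1000, 600, 600 smooth to 1000, 1000, 800, 700, so the last two are flagged with deficits of 100 and 200 MB/s. A larger beta reacts more slowly to a drop and filters out short dips.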

Export Formats

NetInspector supports two export backends:

  • JSON Lines (.jsonl) – Microsecond-level timestamps in JSON Lines format for performance analysis

  • DTS (DOCA Telemetry Service) – OTLP-compatible metrics and logs export for integration with observability platforms (Prometheus, Grafana, etc.)

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| NCCL_INSPECTOR_ENABLE | 0 | Enables the Inspector plugin. Set to 1 to enable profiling. |
| NCCL_INSPECTOR_NET_EVENTS | 1 | Controls network-event verbosity: 0 = disabled, 1 = enabled (collective and kernel-channel events), 2 = enabled with per-transaction logging. |
| NCCL_PROFILER_PLUGIN | | Specifies the path to the profiler plugin library, for example /path/to/libnccl-profiler-inspector.so, or inspector to search in LD_LIBRARY_PATH. |
| NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS | 0 | Interval (in microseconds) for the internal dump thread. 0 disables the dump thread. Lower values increase export frequency but may affect performance. |
| NCCL_INSPECTOR_DUMP_TYPE | 0 | Export backend type: 0 = disabled (none), 1 = JSON, 2 = DTS (DOCA Telemetry Service). |
| NCCL_INSPECTOR_DUMP_VERBOSE | 0 | Enables detailed event trace output (sequence numbers, timestamps per kernel channel). Set to 1 for verbose mode. |
| NCCL_INSPECTOR_DUMP_DIR | (auto-generated) | Sets the output directory for profiler logs. Defaults to nccl-inspector-<SLURM_JOBID> when running under SLURM, or nccl-inspector-unknown-jobid otherwise. |
| NCCL_INSPECTOR_BURST_SLOWDOWN_THRESHOLD_MBPS | 0 | Burst slowdown detection threshold in MB/s. When burst bandwidth falls below this threshold, a burst_slowdown event is generated. 0 disables detection. |
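As an illustration of how these variables fit together, the following sketch enables NetInspector with JSON output. The specific values (a one-second dump interval, a 10000 MB/s slowdown threshold, and the output directory name) are placeholders chosen for the example, not recommended settings.

```shell
# Load the Inspector profiler plugin by name (searched in LD_LIBRARY_PATH).
export NCCL_PROFILER_PLUGIN=inspector
export NCCL_INSPECTOR_ENABLE=1

# Per-transaction network events, exported as JSON once per second.
export NCCL_INSPECTOR_NET_EVENTS=2
export NCCL_INSPECTOR_DUMP_TYPE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=1000000

# Placeholder values: output location and slowdown threshold in MB/s.
export NCCL_INSPECTOR_DUMP_DIR="$PWD/nccl-inspector-logs"
export NCCL_INSPECTOR_BURST_SLOWDOWN_THRESHOLD_MBPS=10000
```

Leaving NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS or NCCL_INSPECTOR_DUMP_TYPE at 0 disables the dump thread or the export backend, so both are set to non-zero values here.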


© Copyright 2026, NVIDIA. Last updated on Mar 1, 2026