NCCL Release 2.30.3
These are the release notes for NCCL 2.30.3. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
-
Deep learning framework containers. Refer to the Support Matrix for the supported container versions.
-
This NCCL release supports CUDA 12.x and CUDA 13.x.
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
-
Device API and GIN Enhancements: GIN contexts are no longer shared between device communicators backed by the same host communicator. Added per-context resource sharing modes for GIN, allowing GPU-scoped or CTA-scoped resources. Added TrafficClass support to the device API. Added versioning to ncclDevComm. Added timeout support to the device APIs. Added max_rd_atomic and max_dest_rd_atomic support in GIN. Upgraded doca-gpunetio to v2.0.2-rc1.
-
Elastic Buffers (LSA): Introduced a feature supporting new use cases where large tensors are split into multi-segment windows, with the active region in GPU memory and the remainder in host memory. This enables larger effective models and reduces memory pressure during spilling. Elastic buffers will support GIN in a future release.
-
gin.get with Nonblocking Flush: Added support for GPU‑initiated gets and for checking their completion without stalling. This currently works only with GDAKI (not with the CPU proxy) and is not supported on directNIC or Ampere.
-
Symmetric Memory: Added AVG operator to ReduceScatter Symmetric kernels. Enabled dynamic memory offload with group support for single-process, multi-GPU scenarios. Added support for GPU-only multi-segment registration for symmetric windows. Added CUDA graph capture and replay support for ncclPutSignal and ncclWaitSignal APIs. One-sided RMA can now use an external network plugin.
-
Tensor Memory Accelerator (TMA): Added TMA support in select built-in symmetric kernels to offload bulk peer‑to‑peer copies and reductions, improving NVLink bandwidth and latency. Can be enabled with NCCL_SYM_TMA_ENABLE=1.
-
DDP Support: Enables Dynamic Direct Path (DDP) so that NCCL can take advantage of hardware multipath and out‑of‑order receive for higher network performance on supported systems. Can be enabled with NCCL_IB_OOO_RQ=1.
-
Port Recovery: Added support for IB port recovery in NCCL. Improved NCCL's ability to recover from transient network issues so communicators can continue operating without full re‑initialization. Can be enabled with NCCL_IB_RESILIENCY_PORT_RECOVERY=1.
-
Cross Clique Support: Added support for treating multiple cliques as the same NVLink domain. Can be enabled with NCCL_MNNVL_CROSS_CLIQUE=1.
-
NCCL Parameter Infrastructure: Added new C APIs for querying NCCL parameters. Introduced the ncclParamGetAllParameterKeys, ncclParamDumpAll, ncclParamGet, and ncclParamGetParameter APIs.
-
NCCL4PY v0.2.0: Added new APIs from the NCCL 2.29 release. Added devcomm create/destroy APIs to prepare for the device API. Enabled free-threading support. Added NCCL Inspector P2P event support.
-
ncclGinBarrierSession can now be created directly for the world team without manual resource allocation.
-
GIN proxy GFD size increased to 128 bytes with version field added.
-
GIN proxy CQ polling (ginProgress) moved to per-context to improve performance.
-
ncclBarrierSession no longer shares resources with ncclLsaBarrierSession or ncclGinBarrierSession.
-
Redundant NCCL_DEBUG=INFO log volume reduced significantly.
-
Added NVLSTree tuning that improves performance for various Blackwell systems.
-
Added p2pMaxPeers to the communicator for better tuning of send/recv vs. all2all.
-
Enabled LL128 protocol in heterogeneous scenarios for Hopper and later GPUs.
-
Added checks for mismatched Net and CollNet counts across communicators.
-
Added a Grafana template for NCCL Inspector dashboard rendering using Prometheus data.
-
Removed unused members nccl_id, comm, nccl_unique_id, and thread_ranks in the examples (GitHub PR #1989).
-
Added the NCCL_LIBIBVERBS_SO environment variable to specify an absolute path to libibverbs (GitHub PR #2043).
-
Extended suspend memory offload to channel device allocations (GitHub PR #2060).
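Several of the features above are opt-in and gated by the environment variables named in their entries. A minimal launch sketch follows; the launcher line is a placeholder, and enabling all of these at once assumes a system that supports each feature (IB NICs for DDP and port recovery, multiple NVLink cliques for cross-clique support):

```shell
# Opt-in switches for new NCCL 2.30.3 features, as named in the release notes.
export NCCL_SYM_TMA_ENABLE=1                # TMA offload in select symmetric kernels
export NCCL_IB_OOO_RQ=1                     # Dynamic Direct Path (DDP)
export NCCL_IB_RESILIENCY_PORT_RECOVERY=1   # IB port recovery
export NCCL_MNNVL_CROSS_CLIQUE=1            # treat multiple cliques as one NVLink domain

# Placeholder launch command; substitute your usual launcher and binary.
# mpirun -np 8 ./my_app

# Print the NCCL-related settings that will be inherited by the job.
env | grep '^NCCL_' | sort
```

Setting the variables in the launch environment (rather than in code) keeps the behavior easy to toggle per job.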
Fixed Issues
The following issues have been resolved in NCCL 2.30.3:
-
Fixed implicit CUDA synchronization in putSignal and CE collectives caused by a pageable CPU stack memcpy.
-
Fixed a hang in an edge case when using CE collectives with CUDA graphs.
-
Fixed a NULL access issue during finalize when the RMA and GIN plugins are both initialized.
-
Fixed race conditions in all2all GIN/Hybrid examples with more than one CTA.
-
Fixed an ncclGinType_t uint8_t enum compatibility issue in nccl4py.
-
Fixed several memory leaks in communicator create/destroy code paths.
-
Fixed a bug in plugin compat layer for v11 related to lazy initialization.
-
Fixed data corruption in symmetric LL kernels with unaligned buffers.
-
Fixed plugin name being cleared after communicator destroy (GitHub Issue #1978).
-
Fixed a deadlock and a use-after-free in the inspector plugin (GitHub Issue #2000).
-
Fixed incorrect network interface selection caused by inverted Boolean logic in matchSubnet (GitHub PR #2047).
-
Fixed a regression from 2.29.2 where the CPU affinity mask was not restored in initTransportsRank (GitHub Issue #2033).
Known Issues
-
Applications that use the GIN APIs must be recompiled against 2.30.3 to work with the 2.30.3 runtime.
-
gin.get requires GDAKI and is not supported on Ampere or directNIC platforms.
Updating the GPG Repository Key
To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To ensure continued access to the latest NCCL release, please follow the updated NCCL installation guide.
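As a sketch of the key update on Debian-based systems, the rotated signing key ships in the cuda-keyring package; the repository path below (ubuntu2204/x86_64) is an example only, so substitute your distribution and architecture per the installation guide. These commands require network access and root privileges:

```shell
# Example for Ubuntu 22.04 on x86_64; adjust the repo path for your system.
distro=ubuntu2204
arch=x86_64

# Download and install the cuda-keyring package, which installs the new
# repository signing key, then refresh the package index.
wget https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
```

For dnf/yum- and zypper-based distributions, the installation guide describes the equivalent repository setup steps.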