NCCL Release 2.27.3
These are the release notes for NCCL 2.27.3. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
- Deep learning framework containers. Refer to the Support Matrix for the supported container versions.
- This NCCL release supports CUDA 12.2, CUDA 12.4, and CUDA 12.9. The provided prebuilt binaries should work with other CUDA 12.x versions as well.
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
- Added symmetric memory API (ncclCommWindowRegister with the NCCL_WIN_COLL_SYMMETRIC flag, ncclCommWindowDeregister). This capability is on by default (NCCL_WIN_ENABLE=0 can be used to disable) and depends on CUMEM being operational. A usage sketch follows this list.
- Implemented specialized kernels taking advantage of symmetrically registered memory. Initial support includes P2P and NVLS connectivity, floating point types up to 32 bits, sum as the reduction operator, and one collective operation per group.
- Added support for DGX Spark.
- Added support for DirectNIC (CX8) to the internal IB plugin.
- Added support for communicator shrinking (ncclCommShrink); see the ncclCommShrink sketch after this list.
- Added support for loading multiple network plugins (NCCL_NET_PLUGIN now accepts a comma-separated list of plugins to load).
- Added NVLS+IB SHARP support for AllGather and ReduceScatter with user buffer registration.
- Decreased the NVLS channel count to 24 on Blackwell systems with multiple NVLink domains per communicator.
- Enabled fine-tuning of NCCL behavior per communicator using new ncclConfig_t members collnetEnable, CTAPolicy, and nvlsCTAs (the environment variable NCCL_CTA_POLICY can also be used to force a particular CTA policy); see the ncclConfig_t sketch after this list.
- Implemented several enhancements to the profiler: expanded the information provided at initialization, added GPU-generated timestamps for GPU kernel events, added support for network-defined event updates, and more.
- Optimized the performance of collective operations on GB200 systems by increasing the number of NICs used to communicate between MNNVL domains to 16.
- Added support for more complex MNNVL topologies.
- Disabled the implicit fallback to a potentially much slower network transport when MNNVL fabric initialization is unsuccessful. Such failures typically indicate misconfigured IMEX support on the system; NCCL_MNNVL_ENABLE=0 can be used to continue without MNNVL.
- Disabled the creation of fused NICs for physical devices that haven't been merged.
- Improved support for platforms with C2C connectivity by enabling GPUDirect RDMA for the NICs by default (NCCL_NET_GDR_C2C=0 can be used to disable) and by adding support for P2C (PXN over C2C) and the LL128 protocol.
- Extended NCCL fault tolerance to support more complex multithreaded scenarios during initialization.
- Enabled ncclImplicitOrderLaunch for CUDA 12.9+.
- Improved the netSocket transport latency and provided finer control over settings such as the send/receive buffer sizes.
- Improved the readability of the CPU affinity in the debug output.
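The following is a minimal sketch of the symmetric memory API described above, assuming a communicator already initialized on every rank and a buffer allocated with ncclMemAlloc (symmetric registration depends on CUMEM-backed memory). The NCCLCHECK macro is a local convenience, not part of NCCL.

    #include <nccl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Local error-checking convenience macro; not part of NCCL. */
    #define NCCLCHECK(cmd) do {                                  \
        ncclResult_t res = (cmd);                                \
        if (res != ncclSuccess) {                                \
          fprintf(stderr, "NCCL error %s:%d '%s'\n",             \
                  __FILE__, __LINE__, ncclGetErrorString(res));  \
          exit(EXIT_FAILURE);                                    \
        }                                                        \
      } while (0)

    void useSymmetricWindow(ncclComm_t comm, size_t bytes) {
      void* buf = NULL;
      ncclWindow_t win;

      /* ncclMemAlloc returns CUMEM-backed memory, which symmetric
         registration depends on. */
      NCCLCHECK(ncclMemAlloc(&buf, bytes));

      /* Collective call: every rank registers a buffer of the same
         size with the same flag. */
      NCCLCHECK(ncclCommWindowRegister(comm, buf, bytes, &win,
                                       NCCL_WIN_COLL_SYMMETRIC));

      /* ... use buf as the send/receive buffer of collective operations
         so the specialized symmetric kernels can be selected ... */

      NCCLCHECK(ncclCommWindowDeregister(comm, win));
      NCCLCHECK(ncclMemFree(buf));
    }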
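A sketch of communicator shrinking with ncclCommShrink, reusing the NCCLCHECK macro from the previous sketch. The parameter order shown (ranks to exclude, their count, the new communicator, an optional config, and flags) reflects our reading of the nccl.h declaration, and the excluded rank is hypothetical.

    /* Shrink `comm` by excluding a failed rank. Surviving ranks call
       this collectively and obtain a smaller communicator. */
    void shrinkAfterFailure(ncclComm_t comm, ncclComm_t* newcomm) {
      int excludeRanks[] = { 3 };  /* hypothetical: rank 3 has failed */
      /* NCCL_SHRINK_DEFAULT assumes the parent communicator is still
         usable; NCCL_SHRINK_ABORT also aborts outstanding operations. */
      NCCLCHECK(ncclCommShrink(comm, excludeRanks, 1, newcomm,
                               /*config=*/NULL, NCCL_SHRINK_DEFAULT));
    }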
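A sketch of the new per-communicator tuning knobs. The member names collnetEnable, CTAPolicy, and nvlsCTAs come from these release notes; the NCCL_CTA_POLICY_EFFICIENCY enumerator and the CTA count below are assumptions, so consult nccl.h for the exact values.

    /* Create a communicator with per-communicator tuning applied. */
    ncclComm_t initTunedComm(int nranks, int rank, ncclUniqueId id) {
      ncclComm_t comm;
      ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
      config.collnetEnable = 1;                       /* opt this communicator into CollNet */
      config.nvlsCTAs = 16;                           /* hypothetical NVLS CTA count */
      config.CTAPolicy = NCCL_CTA_POLICY_EFFICIENCY;  /* assumed enumerator; see nccl.h */
      NCCLCHECK(ncclCommInitRankConfig(&comm, nranks, id, rank, &config));
      return comm;
    }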
Fixed Issues
The following issues have been resolved in NCCL 2.27.3:
- Enabled graceful fallback if NVLS initialization fails (NCCL_NVLS_ENABLE=1 can be used to preserve the old behavior of returning an error).
- Fixed several issues in the profiler support: reporting the correct number of channels used, keeping track of an event identifier, improving backward compatibility, and more.
- Fixed multiple issues related to MNNVL support: a hang in alltoall-like communication patterns at a scale of over 80 ranks, NCCL_P2P_DISABLE=1 leaving MNNVL enabled (it now implies NCCL_MNNVL_ENABLE=0), an initialization failure when NCCL_TOPO_FILE was being used, failure to exclude non-local NICs in the graph search, and an incompatibility of the SHM transport with MNNVL.
- Fixed a potential race condition when mixing graph and non-graph execution.
- Improved PXN connection code to avoid duplicate and unused connections.
- Fixed several issues in RAS related to race conditions and memory corruption.
- Fixed a potential memory corruption in ncclCommSplit with resource sharing.
- Fixed a small memory leak and over-synchronization in asynchronous graph upload.
- Added a check for out-of-memory conditions in ncclMemAlloc; see the sketch after this list.
- Cleaned up the NCCL socket code: made it more willing to retry in case of errors during connection establishment, added error checking in a few instances, and improved the debug output.
- Switched NCCL_DEBUG_FILE to line buffering.
- Fixed other minor issues, primarily in the graph search code and the internal IB plugin.
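With the ncclMemAlloc fix above, out-of-memory conditions surface through the function's ncclResult_t return value. The helper below is a hypothetical caller-side sketch (it reuses the includes from the earlier sketches):

    /* Check ncclMemAlloc's result instead of assuming success. */
    void* allocDeviceBuffer(size_t bytes) {
      void* ptr = NULL;
      ncclResult_t res = ncclMemAlloc(&ptr, bytes);
      if (res != ncclSuccess) {
        fprintf(stderr, "ncclMemAlloc(%zu bytes) failed: %s\n",
                bytes, ncclGetErrorString(res));
        return NULL;  /* let the caller decide how to recover */
      }
      return ptr;
    }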
Updating the GPG Repository Key
To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To ensure continued access to the latest NCCL release, please follow the updated NCCL installation guide.