NCCL Release 2.28.7
These are the release notes for NCCL 2.28.7. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
-
Deep learning framework containers. Refer to the Support Matrix for the supported container versions.
-
This NCCL release supports CUDA 12.x and CUDA 13.x.
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
-
Introduced GPU-Initiated Networking (GIN), an experimental device-side API that extends the Device API to support communication over the network. The signatures and functionality may evolve in future releases, and applications using the device APIs must be recompiled with new NCCL releases. Support for ABI compatibility will be added in a future release.
-
Improved the Device API with a new ncclBarrierSession feature, removed a restriction on multimem so that it is now available with as few as two ranks, and removed distance (NCCL_P2P_LEVEL) considerations from determining the availability of symmetric memory.
-
Added the ncclCommRevoke API to enhance fault tolerance by enabling NCCL to quiesce ongoing work on a communicator without freeing its resources (see the sketch after this list).
-
Introduced a new NCCL Environment Plugin that allows users to set NCCL parameters without having to set environment variables for the process.
-
Released a set of examples that highlight NCCL's core features, including communicator initialization, point-to-point communication, and collective operations. Advanced features such as user buffer registration, symmetric memory, and the device API are also covered. A minimal initialization and all-reduce sketch follows this list.
-
Enhanced NCCL RAS output with a JSON format to support machine-parsable metrics collection.
-
Integrated CPU optimizations for NCCL initialization at large scale.
-
Improved bootstrap AllGather performance by 2x at large scale by sending bootstrap information bidirectionally.
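The following is a minimal sketch of how a revoke-based recovery path might look. The single-argument ncclCommRevoke signature and the quiesceOnError helper are assumptions made for illustration only; ncclCommGetAsyncError and ncclCommAbort are existing NCCL APIs. Consult nccl.h in this release for the actual prototype.

    // Hypothetical recovery helper: quiesce a communicator after an async error.
    #include <nccl.h>

    ncclResult_t quiesceOnError(ncclComm_t comm) {
      ncclResult_t asyncErr, res;

      // Existing API: poll the communicator for asynchronous failures.
      res = ncclCommGetAsyncError(comm, &asyncErr);
      if (res != ncclSuccess) return res;
      if (asyncErr == ncclSuccess) return ncclSuccess;  // nothing to recover from

      // ASSUMED signature: revoke quiesces ongoing work on the communicator
      // without freeing its resources, leaving recovery policy to the caller.
      res = ncclCommRevoke(comm);
      if (res != ncclSuccess) {
        // Fall back to a full abort if the revoke itself fails.
        return ncclCommAbort(comm);
      }
      return ncclSuccess;
    }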
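As a companion to the released examples, here is a minimal single-process sketch of the basic path they cover: creating one communicator per local GPU with ncclCommInitAll, running an in-place all-reduce, and cleaning up. The CHECK macro, the eight-GPU cap, and the buffer size are illustrative choices, not part of the released examples.

    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
      fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

    int main(void) {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      if (ndev > 8) ndev = 8;               // this sketch caps at 8 GPUs

      ncclComm_t comms[8];
      cudaStream_t streams[8];
      float* buf[8];
      size_t count = 1 << 20;

      // One communicator per local GPU, all managed by a single process.
      CHECK(ncclCommInitAll(comms, ndev, NULL));

      for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&buf[i], count * sizeof(float));
      }

      // In-place all-reduce across the local GPUs; grouping lets NCCL launch
      // the per-device operations together.
      CHECK(ncclGroupStart());
      for (int i = 0; i < ndev; i++)
        CHECK(ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]));
      CHECK(ncclGroupEnd());

      for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
      }
      return 0;
    }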
Fixed Issues
The following issues have been resolved in NCCL 2.28.7:
-
Fixed multicast object leaks in the case of failed NVLS user buffer registrations (the registration flow is sketched after this list).
-
Fixed potential data corruption in built-in symmetric kernels for small messages whose size granularity is under 8 bytes, or when multiple symmetric operations were aggregated in a group.
-
Generalized the existing point-to-point scheduling to handle uneven GPU counts per node.
-
Fixed a crash that occurred when network plugin assignment failed.
-
Fixed a large performance issue with NCCL_CROSS_NIC=0 and certain split mask settings, where NCCL could not find a viable ring.
-
Fixed a crash that occurred when NCCL was compiled with recent CUDA versions but run on hosts with certain older CUDA drivers.
-
Fixed spurious failures when PyTorch was statically linked with NCCL 2.28.3.
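For context on the NVLS fix above, user buffer registration pairs ncclMemAlloc with ncclCommRegister and ncclCommDeregister. The helper below is an illustrative sketch of that allocate, register, use, deregister sequence; the function name, buffer reuse pattern, and sizes are assumptions for the example.

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Illustrative sketch: allocate an NCCL-friendly buffer, register it with
    // the communicator (enabling NVLS/user-buffer paths where supported),
    // reuse it, then deregister and free.
    ncclResult_t runWithRegisteredBuffer(ncclComm_t comm, cudaStream_t stream, size_t count) {
      void* buff = NULL;
      void* handle = NULL;
      ncclResult_t res;

      res = ncclMemAlloc(&buff, count * sizeof(float));
      if (res != ncclSuccess) return res;

      res = ncclCommRegister(comm, buff, count * sizeof(float), &handle);
      if (res != ncclSuccess) { ncclMemFree(buff); return res; }

      // A registered buffer can be reused by many collectives without re-registering.
      res = ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);

      ncclCommDeregister(comm, handle);
      ncclMemFree(buff);
      return res;
    }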
Updating the GPG Repository Key
To ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by the apt, dnf/yum, and zypper package managers, beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To maintain access to the latest NCCL release, follow the updated NCCL installation guide.