NCCL Release 2.24.3
These are the NCCL 2.24.3 release notes. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
- Deep learning framework containers. Refer to the Support Matrix for the supported container version.
- This NCCL release supports CUDA 12.2, CUDA 12.4, and CUDA 12.6.
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
- Leverage user buffer registration for inter-node copy operations using the ring or tree algorithm.
- Added the RAS subsystem and the ncclras tool, which report the NCCL state in case of a hang.
- Added support for 8-bit floating point data types (e5m2 and e4m3).
- Added NIC fusion: fuse multiple NICs together as a single, larger one.
- Retry in case of socket connection failure (unreachable host).
- Retry in case of IB QP connection failure.
- Improved support for external network plugins: plugins can now force a flush and indicate when completion is not needed, and allgather operations can be fully offloaded when using one GPU per node.
- Extended the NCCL_ALGO/NCCL_PROTO syntax and made enforcement of their values strict.
- Made host memory allocation use cumem functions by default.
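The two 8-bit floating point formats named in the list above trade precision for range differently. The following is a minimal sketch of how each byte maps to a value, assuming NCCL's e4m3 and e5m2 types follow the same bit layouts as the CUDA `__nv_fp8_e4m3` and `__nv_fp8_e5m2` types (e4m3 without infinities, e5m2 IEEE-like). The helper names `scale_pow2`, `fp8_e4m3_to_float`, and `fp8_e5m2_to_float` are hypothetical and are not part of the NCCL API.

```c
#include <math.h>    /* NAN and INFINITY macros only */
#include <stdint.h>

/* Hypothetical helper: multiply x by 2^e using exact float operations. */
static float scale_pow2(float x, int e) {
    while (e > 0) { x *= 2.0f; e--; }
    while (e < 0) { x *= 0.5f; e++; }
    return x;
}

/* e4m3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
   Assumption: no infinities, and only S.1111.111 encodes NaN. */
static float fp8_e4m3_to_float(uint8_t bits) {
    int sign = bits >> 7;
    int exp  = (bits >> 3) & 0xF;
    int man  = bits & 0x7;
    float value;
    if (exp == 0xF && man == 0x7) {
        return NAN;
    }
    if (exp == 0) {
        /* Subnormal: (man / 8) * 2^-6 = man * 2^-9 */
        value = scale_pow2((float)man, -9);
    } else {
        /* Normal: (1 + man / 8) * 2^(exp - 7) = (8 + man) * 2^(exp - 10) */
        value = scale_pow2((float)(8 + man), exp - 10);
    }
    return sign ? -value : value;
}

/* e5m2: 1 sign bit, 5 exponent bits (bias 15), 2 mantissa bits.
   Assumption: IEEE-like, exponent 31 encodes infinity or NaN. */
static float fp8_e5m2_to_float(uint8_t bits) {
    int sign = bits >> 7;
    int exp  = (bits >> 2) & 0x1F;
    int man  = bits & 0x3;
    float value;
    if (exp == 0x1F) {
        if (man != 0) {
            return NAN;
        }
        value = INFINITY;
    } else if (exp == 0) {
        /* Subnormal: (man / 4) * 2^-14 = man * 2^-16 */
        value = scale_pow2((float)man, -16);
    } else {
        /* Normal: (1 + man / 4) * 2^(exp - 15) = (4 + man) * 2^(exp - 17) */
        value = scale_pow2((float)(4 + man), exp - 17);
    }
    return sign ? -value : value;
}
```

Under these assumptions, the byte 0x38 decodes to 1.0 as e4m3 and 0x3C decodes to 1.0 as e5m2. The largest finite values are 448 for e4m3 (0x7E) and 57344 for e5m2 (0x7B), which shows the range-versus-precision trade-off between the two formats.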
Fixed Issues
The following issues have been resolved in NCCL 2.24.3:
- Return ncclInvalidUsage when NCCL_SOCKET_IFNAME is set to an incorrect value.
- Fixed PAT tuning.
- Fixed hangs when running with different CPU architectures.
- Fixed FD leak in UDS.
- Fixed crash when mixing buffer registration and graph buffer registration.
- Made ncclSend/ncclRecv communication with buffer registration functional on network plugins relying on dmabuf for buffer registration.
- Fixed crash in IB code caused by uninitialized fields.
- Fixed a case where ncclSend/ncclRecv would return ncclSuccess in non-blocking mode even though the operation was not enqueued onto the stream (GitHub issue 1495).
Updating the GPG Repository Key
To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To ensure continued access to the latest NCCL release, please follow the updated NCCL installation guide.