NCCL Release 2.30.7
These are the NCCL 2.30.7 release notes. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
-
Deep learning framework containers. Refer to the Support Matrix for the supported container version.
-
This NCCL release supports CUDA 12.x and CUDA 13.x.
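When confirming that a job has actually picked up this release, it can help to know that NCCL reports its version as a single integer (for example, through ncclGetVersion). For releases from 2.9 onward the integer is major*10000 + minor*100 + patch, so 2.30.7 corresponds to 23007; earlier releases used major*1000 instead. The following is a minimal sketch of decoding that integer. The helper name decode_nccl_version is hypothetical and is not part of NCCL.

```python
from typing import Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Split an NCCL version integer into (major, minor, patch).

    decode_nccl_version is a hypothetical helper name, not an NCCL API.
    """
    if code < 0:
        raise ValueError("version code must be non-negative")
    if code < 10000:
        # Releases up to 2.8 were encoded as major*1000 + minor*100 + patch.
        major, rest = divmod(code, 1000)
    else:
        # Releases from 2.9 onward are encoded as major*10000 + minor*100 + patch.
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


if __name__ == "__main__":
    # 23007 is the integer form of 2.30.7.
    print(decode_nccl_version(23007))
```

The integer passed in would normally come from whatever binding the application uses to query NCCL; that part is left out of the sketch.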
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
-
Zero-SM Collectives: Added hierarchical zero-SM collectives, AllGather and All2all, that use the RMA CPU proxy for inter-node communication and Copy Engines for intra-node communication. This enables better overlap of compute and communication. Enable hierarchical zero-SM collectives with the NCCL_CTA_POLICY_ZERO flag.
-
GIN Enhancements: Added a new experimental GPU Push Interface (GPI) backend for GIN. Added explicit signal semantics with Strong and Weak signals. Added proper ncclGinFenceLevel semantics for barriers. Added a separate NCCL_GIN_IB_TC toggle to control the traffic class used by GIN. Added NCCL_GIN_RESOURCE_SHARING_THREAD to enable more optimizations. Optimized QP overhead, including in GDAKI mode when counters are not used. Ensured that GIN is usable when NIC fusion is enabled. Added a GIN plugin example in plugins/gin/example.
-
Symmetric Memory Improvements: Restructured the RMA plugin architecture. Added support for asymmetric buffer sizes during window registration. Optimized ReduceScatter symmetric kernel performance. Optimized performance for RMA operations using CE. Added batched CE operations to improve performance in the RMA CE put/wait path. Added support for window registration during CUDA graph capture.
-
MPS with MLOPart Support (Experimental): Leveraged the CUDA Memory Locality Optimized Partition (MLOPart) feature. Supports up to two ranks per physical GPU with MPS and MLOPart.
-
Added support for IB ports that require global route headers (GRH).
-
Added logic to gin.flush to ensure all prior gets are visible.
-
Added Makefile support to compile Python wheels from source.
-
Added the NCCL_RMA_DISABLE environment variable to enable or disable RMA (GitHub PR #2151).
-
Implemented reset-without-zeroing for signals and counters in GIN (GitHub PR #2155).
-
Pinned the GIN proxy thread to the NUMA-local CPU set (GitHub PR #2182).
-
Added optimized weight transfer APIs in contrib/nccl_xfer.
-
Added custom kernels in contrib/custom_algos for alltoall and allreduce using the NCCL Device API.
-
Added examples of Root Mean Square Normalization (RMSNorm) that demonstrate the fusion of computation and communication using the Device API.
-
Unified the coding style using clang-format. See docs/dev_guide/nccl_coding_style.md for more details.
-
Dropped support for v11 and v12 GIN plugin APIs.
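Several of the items above are controlled through environment variables (NCCL_GIN_IB_TC, NCCL_GIN_RESOURCE_SHARING_THREAD, NCCL_RMA_DISABLE). NCCL environment variables generally need to be present in the process environment before communicators are created, so one common approach is to build the environment in the launcher. The following is a minimal sketch of that approach. The helper name nccl_env is hypothetical, and the values are assumptions: these notes do not state the accepted values, so "1" to disable RMA and an integer traffic class in the 0 to 255 range are illustrative only. Check the NCCL environment variable documentation before relying on them.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env(
    gin_ib_tc: Optional[int] = None,
    disable_rma: bool = False,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with selected NCCL variables set.

    nccl_env is a hypothetical helper, not part of NCCL. The value formats
    below are assumptions made for illustration.
    """
    env = dict(os.environ if base is None else base)
    if gin_ib_tc is not None:
        # Assumption: the traffic class is an 8-bit integer.
        if not 0 <= gin_ib_tc <= 255:
            raise ValueError("traffic class must be in the range 0..255")
        env["NCCL_GIN_IB_TC"] = str(gin_ib_tc)
    if disable_rma:
        # Assumption: "1" disables RMA.
        env["NCCL_RMA_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_env(gin_ib_tc=106, disable_rma=True, base={})
    for name in sorted(env):
        print(name, "=", env[name])
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the training or benchmark process, which keeps the settings out of the parent shell.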
Fixed Issues
The following issues have been resolved in NCCL 2.30.7:
-
Fixed a deadlock caused by CUDA stream allocation under PXN when memsetting a buffer at runtime.
-
Reintroduced cudaGridDependencySynchronize in built-in symmetric kernels, ensuring that newly launched kernels cannot access memory modified by prior kernels before it reaches the point of coherency.
-
Ignored system headers in include/header processing, thereby avoiding excessive realpath calls in some builds (GitHub PR #1806).
-
Improved QP load balancing on systems configured with RoCE LAG with the round-robin queue affinity policy (GitHub PR #2150).
-
Fixed an issue where receiving an external TCP request caused the proxy thread's ncclProxyService to hang (GitHub PR #1834).
-
Fixed the rma_proxy MR registration type for host-NUMA cpuAccessSignals, which ensures that the net plugin does not reject the registration due to a wrong memory type (GitHub PR #2187).
-
Fixed a GIN init context leak (GitHub PR #2179).
-
Fixed an issue with the one-sided host APIs when a custom GIN plugin is used.
-
Fixed a one-sided host API issue where requests were dropped at a high message rate (GitHub Issue #2119).
Known Issues
-
NCCL one-sided host RMA APIs, such as ncclPutSignal, require every rank to call the API once as an initialization warm-up. This will be fixed in an upcoming release.
Updating the GPG Repository Key
To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To ensure continued access to the latest NCCL release, please follow the updated NCCL installation guide.