NCCL Release 2.28.3
These are the NCCL 2.28.3 release notes. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
- Deep learning framework containers. Refer to the Support Matrix for the supported container version.
- This NCCL release supports CUDA 12.x and CUDA 13.x.
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
- Added Device API, an experimental device-side API to integrate NCCL communication directly into application kernels. The signatures and functionality may evolve in future releases. No ABI compatibility is currently guaranteed for this API. Applications must be recompiled with each new NCCL release.
- Reimplemented symmetric kernels using the Device API. They now support aggregating symmetric operations with ncclGroupStart/ncclGroupEnd.
- Added new host collective APIs: ncclAlltoAll, ncclScatter, and ncclGather.
- Added experimental support for CMake as an alternative to building with Makefiles. Known issues to be fixed in an upcoming release: pkg.build and Device API are not supported.
- Reduced SM utilization within a single (MN)NVL domain by using CE (Copy Engine) collectives. This frees up more SM capacity for the application when using ncclAlltoAll, ncclGather, or ncclScatter. To enable the feature, register buffers into symmetric windows and set the NCCL_CTA_POLICY_ZERO flag in the communicator configuration (ncclConfig_t).
- Decreased the maximum CTA count from 32 to 16 on Blackwell. This change reduces SM overhead by 50%, but it may cause a performance drop on Blackwell. It can be overridden by setting NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32, or by setting the same values in the communicator config. Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.
- Network Plugin added version 11, which supports per-communicator init/finalize. It now passes information about communication operations to be executed on the network endpoint. Added a multi-request net API to help anticipate and optimize multiple send/recv requests.
- Profiler Plugin added support for API events (group, collective, and p2p) and for tracking kernel launches. Added the Inspector Profiler example plugin for always-on performance monitoring and a hook to Google’s CoMMA profiler on GitHub.
- Tuner Plugin exposed NCCL tuning constants with ncclTunerConstants_v5_t and added NVL Domain Information API.
- Plugin system added support for multiple plugin types from a single shared object.
- Added a new option, NCCL_MNNVL_CLIQUE_ID=-2, which uses the rack serial number to partition the MNNVL clique. This limits NVLink domains to GPUs within a single rack.
- Added NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
- Added the NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the "one process, multiple ranks" case as an opt-in.
- Moved nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
- Enabled PXN over C2C by default, which can improve performance on Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLink, C2C, and PCIe. It can be disabled by setting NCCL_PXN_C2C=0.
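Several of the items above are controlled through environment variables. As a minimal sketch, assuming the job is started from a Python launcher, the overrides could be collected in one place before launch. The helper name `nccl_env_overrides` and its parameters are hypothetical and not part of NCCL; only the variable names and values are taken from these notes.

```python
import os


def nccl_env_overrides(restore_32_ctas=False, rack_local_mnnvl=False,
                       disable_pxn_c2c=False):
    """Return NCCL environment overrides for options described in these notes.

    The function and parameter names are illustrative only.
    """
    env = {}
    if restore_32_ctas:
        # Override the new Blackwell maximum of 16 CTAs.
        env["NCCL_MIN_CTAS"] = "32"
        env["NCCL_MAX_CTAS"] = "32"
    if rack_local_mnnvl:
        # Partition the MNNVL clique by rack serial number.
        env["NCCL_MNNVL_CLIQUE_ID"] = "-2"
    if disable_pxn_c2c:
        # Turn off PXN over C2C, which is now enabled by default.
        env["NCCL_PXN_C2C"] = "0"
    return env


# Merge the overrides into a copy of the current environment.
overrides = nccl_env_overrides(restore_32_ctas=True, disable_pxn_c2c=True)
job_env = {**os.environ, **overrides}
```

The resulting `job_env` dictionary can be passed as the `env` argument of `subprocess.run` when launching the application, so the overrides apply to that job only rather than to the whole shell session.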
Fixed Issues
The following issues have been resolved in NCCL 2.28.3:
- Allowed FP8 support for non-reductive operations on pre-sm90 devices.
- Fixed NVLS+CollNet and temporarily disabled COLLNET_CHAIN for >8 GPUs.
- Fixed an issue where non-running interfaces were considered for socket traffic. NCCL no longer attempts to use interfaces that do not have the IFF_RUNNING bit set.
- Improved response to RoCE link flaps. Instead of reporting an "unknown event", NCCL now reports "GID table changed."
- Moved the libvirt bridge interface to the end of the list of possible interfaces so that it is considered last. These interfaces are usually virtual bridges that relay traffic to containers running on the host. They cannot carry traffic to a remote node and are therefore unsuitable.
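The two interface-selection fixes above can be illustrated with a short sketch. This is not NCCL's implementation. It assumes the Linux value of IFF_RUNNING (0x40 in `<net/if.h>`) and assumes that libvirt bridges follow the common `virbr` naming prefix; the function name `order_candidates` and the sample interface list are hypothetical.

```python
IFF_RUNNING = 0x40  # Linux value of IFF_RUNNING from <net/if.h>


def order_candidates(interfaces):
    """Take (name, flags) pairs and return usable names, libvirt bridges last."""
    # Drop interfaces that do not have the IFF_RUNNING bit set.
    running = [name for name, flags in interfaces if flags & IFF_RUNNING]
    # sorted() is stable and False sorts before True, so non-bridge
    # interfaces keep their original order and come first.
    return sorted(running, key=lambda name: name.startswith("virbr"))


candidates = order_candidates([
    ("virbr0", 0x1043),  # running, but assumed to be a libvirt bridge
    ("eth0", 0x1043),    # up and running
    ("eth1", 0x1003),    # up but not running
    ("ib0", 0x1043),     # up and running
])
```

With this sample input, `eth1` is dropped and `virbr0` is moved behind `eth0` and `ib0`, so it is only tried when no other running interface is available.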
Updating the GPG Repository Key
To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To ensure continued access to the latest NCCL release, please follow the updated NCCL installation guide.