NCCL Release 2.8.3
These are the release notes for NCCL 2.8.3. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
Key Features and Enhancements
- Optimized Tree performance on A100
- Improved performance for aggregated operations (see the sketch after this list)
- Improved performance for all-to-all operations at scale
- Reduced memory usage for all-to-all operations at scale
- Optimized all-to-all performance on DGX-1
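In NCCL, aggregated operations are multiple collective calls grouped between ncclGroupStart() and ncclGroupEnd() so they can be launched together. The following minimal sketch only illustrates that grouping pattern; the function and variable names (aggregated_allreduce, comms, sendbuf, recvbuf, streams, ngpus) are illustrative placeholders and not part of this release.

    #include <nccl.h>

    /* Queue one all-reduce per GPU and let NCCL launch them as a single
       aggregated group. Communicators, device buffers and streams are
       assumed to have been created elsewhere. */
    void aggregated_allreduce(ncclComm_t* comms, float** sendbuf, float** recvbuf,
                              size_t count, cudaStream_t* streams, int ngpus) {
      ncclGroupStart();
      for (int i = 0; i < ngpus; ++i) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      }
      ncclGroupEnd();
    }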
Known Issues
Send/receive operations have a number of limitations:
- Combining send/receive operations to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL (as shown in the sketch below) can work around the issue, but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.
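A minimal sketch of the workaround, assuming the environment variable is set from inside the process; setting it in the launching shell (export NCCL_LAUNCH_MODE=PARALLEL) is equivalent.

    #include <stdlib.h>
    #include <nccl.h>

    int main(void) {
      /* NCCL_LAUNCH_MODE is read when communicators are created, so set it
         before calling ncclCommInitAll/ncclCommInitRank. */
      setenv("NCCL_LAUNCH_MODE", "PARALLEL", 1);
      /* ... create CUDA streams and NCCL communicators, then issue
         send/receive operations as usual ... */
      return 0;
    }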
Fixed Issues
- Hang in LL128 protocol after 2^31 steps.
- Topology injection error when using fewer GPUs than described (GitHub issue #379).
- Protocol mismatch causing hangs or crashes when using one GPU per node (GitHub issue #394).