NVIDIA Deep Learning NCCL Documentation
Release Notes (PDF) - v2.20.5 - Last updated March 6, 2024

NCCL Release 2.8.3

These are the release notes for NCCL 2.8.3. For previous NCCL release notes, refer to the NCCL Archives.

Compatibility

NCCL 2.8.3 has been tested with the following:

Key Features and Enhancements

This NCCL release includes the following key features and enhancements.
  • Optimized Tree performance on A100

  • Improved performance for aggregated operations

  • Improved performance for all-to-all operations at scale

  • Reduced memory usage for all-to-all operations at scale

  • Optimized all-to-all performance on DGX-1
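The "aggregated operations" mentioned above refers to NCCL's group mechanism, where multiple collective calls issued between `ncclGroupStart()` and `ncclGroupEnd()` are fused into a single launch. A minimal sketch follows; the function name, buffer arrays, and the assumption that a communicator and stream are already initialized are illustrative, not part of this release.

```c
/* Sketch: aggregating several collectives into one launch with
 * ncclGroupStart/ncclGroupEnd. Communicator, stream, and buffer
 * setup are assumed to exist elsewhere. */
#include <stddef.h>
#include <nccl.h>
#include <cuda_runtime.h>

void aggregated_allreduce(ncclComm_t comm, cudaStream_t stream,
                          float *sendbufs[], float *recvbufs[],
                          size_t counts[], int nops) {
    /* Every operation issued inside the group is aggregated and
     * launched together when ncclGroupEnd() returns, reducing
     * per-call launch overhead. */
    ncclGroupStart();
    for (int i = 0; i < nops; i++) {
        ncclAllReduce(sendbufs[i], recvbufs[i], counts[i],
                      ncclFloat, ncclSum, comm, stream);
    }
    ncclGroupEnd();
}
```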

Known Issues

Send/receive operations have a number of limitations:

  • Using send/receive operations to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue, but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.
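The workaround above is applied through an environment variable set before the application starts. A minimal sketch (the launch command shown in the comment is a placeholder):

```shell
# Work around the multi-GPU send/receive hang by forcing parallel
# kernel launches. The release notes warn this mode can cause other
# problems; see the NCCL User Guide (Troubleshooting > Known Issues)
# before relying on it.
export NCCL_LAUNCH_MODE=PARALLEL

# Then launch the application as usual, e.g. (placeholder command):
# mpirun -np 8 ./my_nccl_app
```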

Fixed Issues

The following issues have been resolved in NCCL 2.8.3:
  • Hang in LL128 protocol after 2^31 steps.

  • Topology injection error when using fewer GPUs than described. (GitHub issue #379)

  • Protocol mismatch causing hangs or crashes when using one GPU per node. (GitHub issue #394)