NCCL Release 2.8.3

These are the release notes for NCCL 2.8.3. For previous NCCL release notes, refer to the NCCL Archives.


Key Features and Enhancements

This NCCL release includes the following key features and enhancements.
  • Optimized Tree performance on A100

  • Improved performance for aggregated operations

  • Improved performance for all-to-all operations at scale (see the sketch after this list)

  • Reduced memory usage for all-to-all operations at scale

  • Optimized all-to-all performance on DGX-1
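
NCCL does not expose a dedicated all-to-all primitive; per the NCCL documentation, an all-to-all exchange is composed of send/receive operations aggregated within a single group call, which is also where the aggregation improvements above apply. The following is a minimal sketch of that pattern, not code from this release; the helper name allToAll, the float element type, and the contiguous per-peer buffer layout are illustrative assumptions.

    #include <cuda_runtime.h>
    #include <nccl.h>

    /* Hypothetical helper: all-to-all composed of send/receive operations
       aggregated in one group. sendbuf and recvbuf each hold count elements
       per peer, laid out contiguously by peer rank. */
    ncclResult_t allToAll(float *sendbuf, float *recvbuf, size_t count,
                          int nranks, ncclComm_t comm, cudaStream_t stream) {
      ncclGroupStart();                 /* aggregate all 2*nranks operations */
      for (int peer = 0; peer < nranks; peer++) {
        ncclSend(sendbuf + (size_t)peer * count, count, ncclFloat,
                 peer, comm, stream);
        ncclRecv(recvbuf + (size_t)peer * count, count, ncclFloat,
                 peer, comm, stream);
      }
      return ncclGroupEnd();            /* operations launch together here */
    }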

Known Issues

Send/receive operations have a number of limitations:

  • Combining send/receive operations to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue, but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls. A sketch of the workaround follows this list.
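
The following is a minimal sketch of the workaround under stated assumptions (two GPUs, one process, no error checking); the buffer sizes and the particular send/receive pairing are illustrative, not taken from this release.

    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main(void) {
      /* The workaround: NCCL reads NCCL_LAUNCH_MODE when communicators are
         created, so it must be set before ncclCommInitAll (or exported in
         the shell before launching the program). */
      setenv("NCCL_LAUNCH_MODE", "PARALLEL", 1);

      int devs[2] = {0, 1};             /* assumption: two visible GPUs */
      ncclComm_t comms[2];
      ncclCommInitAll(comms, 2, devs);  /* one process drives both GPUs */

      cudaStream_t streams[2];
      float *buf[2];
      size_t count = 1024;              /* illustrative transfer size */
      for (int i = 0; i < 2; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
      }

      /* Issue the matching send and receive inside one group so NCCL can
         launch them together rather than serializing them from one thread. */
      ncclGroupStart();
      ncclSend(buf[0], count, ncclFloat, 1, comms[0], streams[0]);
      ncclRecv(buf[1], count, ncclFloat, 0, comms[1], streams[1]);
      ncclGroupEnd();

      for (int i = 0; i < 2; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
      }
      return 0;
    }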

Fixed Issues

The following issues have been resolved in NCCL 2.8.3:
  • Hang in LL128 protocol after 2^31 steps.

  • Topology injection error when using fewer GPUs than described. (GitHub issue #379)

  • Protocol mismatch causing hangs or crashes when using one GPU per node. (GitHub issue #394)