NCCL Release 2.8.3
These are the release notes for NCCL 2.8.3. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
Key Features and Enhancements
- Optimized Tree performance on A100
- Improved performance for aggregated operations (see the sketch after this list)
- Improved performance for all-to-all operations at scale
- Reduced memory usage for all-to-all operations at scale
- Optimized all-to-all performance on DGX-1
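In NCCL, aggregated operations are multiple collective calls grouped between ncclGroupStart() and ncclGroupEnd() so they can be launched together. The following minimal sketch only illustrates that grouping pattern; the function and variable names (aggregated_allreduce, comms, sendbuf, recvbuf, streams, ngpus) are illustrative placeholders and not part of this release.

    #include <nccl.h>

    /* Queue one all-reduce per GPU and let NCCL launch them as a single
       aggregated group. Communicators, device buffers and streams are
       assumed to have been created elsewhere. */
    void aggregated_allreduce(ncclComm_t* comms, float** sendbuf, float** recvbuf,
                              size_t count, cudaStream_t* streams, int ngpus) {
      ncclGroupStart();
      for (int i = 0; i < ngpus; ++i) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      }
      ncclGroupEnd();
    }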
Known Issues
Send/receive operations have a number of limitations:
- Combining send/receive operations to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL (as shown in the sketch below) can work around the issue, but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.
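A minimal sketch of the workaround, assuming the environment variable is set from inside the process; setting it in the launching shell (export NCCL_LAUNCH_MODE=PARALLEL) is equivalent.

    #include <stdlib.h>
    #include <nccl.h>

    int main(void) {
      /* NCCL_LAUNCH_MODE is read when communicators are created, so set it
         before calling ncclCommInitAll/ncclCommInitRank. */
      setenv("NCCL_LAUNCH_MODE", "PARALLEL", 1);
      /* ... create CUDA streams and NCCL communicators, then issue
         send/receive operations as usual ... */
      return 0;
    }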
Fixed Issues
- Hang in LL128 protocol after 2^31 steps.
- Topology injection error when using fewer GPUs than described (GitHub issue #379).
- Protocol mismatch causing hangs or crashes when using one GPU per node (GitHub issue #394).