NVIDIA Deep Learning NCCL Documentation
Release Notes (PDF) - v2.21.5 - Last updated June 18, 2024

NCCL Release 2.7.5

This is the NCCL 2.7.5 release notes. For previous NCCL release notes, refer to the NCCL Archives.


NCCL 2.7.5 has been tested with the following:

Known Issues

Send/receive operations have a number of limitations:

  • Using send/receive operations while launching work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue, but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.

  • Aggregation is not supported: there can be only one outstanding send/receive operation per source-destination pair.

  • Each source-destination pair allocates a dedicated 4 MB FIFO (see the NCCL_BUFFSIZE variable), which can consume a significant amount of GPU memory at scale.

  • When using GPU Direct RDMA, each point-to-point connection also uses resources in the GPU PCI address space, which can be exhausted on some GPU models. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).

  • Send/receive operations are not yet optimized for the DGX-1 NVLink topology.

  • Running inside a virtual machine on an NVSwitch platform can cause a crash if the NVSwitch is not visible inside the virtual machine.
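The environment-variable workarounds mentioned above are typically applied in the job's launch script. A minimal sketch, assuming a single-process multi-GPU application (`my_app` is a placeholder name; NCCL_LAUNCH_MODE, NCCL_BUFFSIZE, and NCCL_NET_GDR_LEVEL are the variables named in the list above):

```shell
# Work around hangs when one process drives multiple GPUs with send/receive
# (may expose other concurrency issues; see the User Guide section cited above).
export NCCL_LAUNCH_MODE=PARALLEL

# Shrink the per-peer FIFO from the 4 MB default to 2 MB to save GPU memory
# at scale (value is in bytes).
export NCCL_BUFFSIZE=2097152

# Disable GPU Direct RDMA if the GPU PCI address space is exhausted.
export NCCL_NET_GDR_LEVEL=0

# Launch the application with the settings above (placeholder):
# ./my_app
```

Lowering NCCL_BUFFSIZE trades per-peer memory for bandwidth, so it is worth measuring before applying it cluster-wide.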

Fixed Issues

The following issues have been resolved in NCCL 2.7.5:
  • Minor fixes for A100 platforms.

  • Added a proper error message for invalid GroupEnd calls.