NCCL Release 2.7.5

These are the NCCL 2.7.5 release notes. For previous NCCL release notes, refer to the NCCL Archives.

Compatibility

NCCL 2.7.5 has been tested with the following:

Known Issues

Send/receive operations have a number of limitations:

  • Using send/receive operations in combination to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue (see the first sketch after this list), but it can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.

  • Aggregation is not supported for a given source-destination pair, meaning that there can be only one send/receive operation per source-destination pair (see the second sketch after this list).

  • Each source-destination pair allocates a dedicated 4 MB FIFO buffer (see the NCCL_BUFFSIZE variable), which can consume a significant amount of GPU memory at scale.

  • When using GPU Direct RDMA, each point-to-point connection also uses resources in the GPU PCI address space, which can be exhausted on some models. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE), as in the first sketch after this list.

  • Send/receive operations are not yet optimized for the DGX-1 NVLink topology.

  • Running inside a virtual machine on an NVSwitch platform can cause a crash if the NVSwitch is not visible inside the virtual machine.
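
Several of the workarounds above rely on NCCL environment variables, which are read when the communicator is created. The following C sketch (the first sketch) shows one way to apply them before calling ncclCommInitRank; the helper name and the 1 MB buffer value are illustrative assumptions, not values mandated by NCCL.

    /* Sketch: applying the environment-variable workarounds described
     * above. NCCL reads these variables at initialization time, so they
     * must be set before the communicator is created. */
    #include <stdlib.h>
    #include <nccl.h>

    ncclComm_t init_comm_with_workarounds(int nranks, ncclUniqueId id,
                                          int rank) {
        /* Avoid hangs when one process launches work on several GPUs. */
        setenv("NCCL_LAUNCH_MODE", "PARALLEL", 1);

        /* Shrink each per-peer FIFO from the default 4 MB to 1 MB
         * (value in bytes) to reduce GPU memory use at scale. */
        setenv("NCCL_BUFFSIZE", "1048576", 1);

        /* Disable GPU Direct RDMA if the GPU PCI address space runs out. */
        setenv("NCCL_NET_GDR_LEVEL", "0", 1);

        ncclComm_t comm;
        ncclCommInitRank(&comm, nranks, id, rank);
        return comm;
    }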
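
The one-send/receive-per-pair limitation constrains how point-to-point calls can be grouped. The second sketch stays within the limit by issuing exactly one send and one receive per peer inside a group call; the ring-style peer choice, buffers, and counts are illustrative assumptions.

    /* Sketch: one send and one receive per source-destination pair
     * inside a single group call. A second ncclSend to the same peer
     * in this group would not be aggregated. */
    #include <cuda_runtime.h>
    #include <nccl.h>

    void exchange_with_neighbors(const float* sendbuf, float* recvbuf,
                                 size_t count, int rank, int nranks,
                                 ncclComm_t comm, cudaStream_t stream) {
        int next = (rank + 1) % nranks;          /* send to next rank */
        int prev = (rank - 1 + nranks) % nranks; /* receive from previous */

        ncclGroupStart();
        ncclSend(sendbuf, count, ncclFloat, next, comm, stream);
        ncclRecv(recvbuf, count, ncclFloat, prev, comm, stream);
        ncclGroupEnd();
    }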

Fixed Issues

The following issues have been resolved in NCCL 2.7.5:
  • Minor fixes for A100 platforms.

  • Added a proper error message for an invalid GroupEnd call (see the sketch after this list).
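
As an illustration, the sketch below shows the call pattern the new diagnostic covers. The exact error code returned is not specified in these notes, so the sketch only checks for a non-success result; ncclGetErrorString is the standard way to print NCCL errors.

    /* Sketch: calling ncclGroupEnd without a matching ncclGroupStart
     * is invalid; NCCL 2.7.5 reports a proper error message for it. */
    #include <stdio.h>
    #include <nccl.h>

    int main(void) {
        ncclResult_t res = ncclGroupEnd(); /* no preceding ncclGroupStart */
        if (res != ncclSuccess)
            printf("ncclGroupEnd failed: %s\n", ncclGetErrorString(res));
        return 0;
    }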