NCCL Release 2.7.8

These are the release notes for NCCL 2.7.8. For previous NCCL release notes, refer to the NCCL Archives.


NCCL 2.7.8 has been tested with the following:

Known Issues

Send/receive operations have a number of limitations:

  • Using send/receive operations to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue, but it can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.

  • Aggregation is not supported: there can be only one outstanding send/receive operation per source-destination pair.

  • Each source-destination pair allocates a dedicated 4 MB FIFO (see the NCCL_BUFFSIZE variable), which can consume a significant amount of GPU memory at scale.

  • When using GPU Direct RDMA, each point-to-point connection also uses resources in the GPU PCI address space, which can be exhausted on some GPU models. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).

  • Send/receive operations are not yet optimized for the DGX-1 NVLink topology.
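The workarounds mentioned above are all controlled through environment variables set before launching the application. A minimal sketch follows; the NCCL_BUFFSIZE value shown is an illustrative example, not a recommendation.

```shell
# Work around failures/hangs when one process launches work on multiple
# GPUs with send/receive (may itself cause other concurrency problems;
# see Troubleshooting > Known Issues in the NCCL User Guide).
export NCCL_LAUNCH_MODE=PARALLEL

# Disable GPU Direct RDMA if point-to-point connections exhaust the
# GPU PCI address space.
export NCCL_NET_GDR_LEVEL=0

# Shrink the per-peer FIFO (4 MB by default) to save GPU memory at
# scale. Value is in bytes; 1 MB is shown here purely as an example.
export NCCL_BUFFSIZE=1048576
```

These variables only need to be set in the environment of the process that creates the NCCL communicator.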

Fixed Issues

The following issues have been resolved in NCCL 2.7.8:
  • Fixed "Collective Mismatch" errors that were erroneously reported when using send/receive operations.