NCCL Release 2.7.8

These are the release notes for NCCL 2.7.8. For previous NCCL release notes, refer to the NCCL Archives.


NCCL 2.7.8 has been tested with the following:

Known Issues

Send/receive operations have a number of limitations:

  • Using send/receive operations to launch work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue, but it can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.

  • Aggregation is not supported: there can be only one outstanding send/receive operation per source-destination pair.

  • Each source-destination pair allocates a dedicated 4 MB FIFO (see the NCCL_BUFFSIZE variable), which can consume a significant amount of GPU memory at scale.

  • When using GPU Direct RDMA, each point-to-point connection also uses resources in the GPU PCI address space, which can be exhausted on some GPU models. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).

  • Send/receive operations are not yet optimized for the DGX-1 NVLink topology.
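The workarounds mentioned above are all controlled through environment variables set before launching the application. A minimal sketch follows; the NCCL_BUFFSIZE value shown is an illustrative example, not a recommendation.

```shell
# Work around failures/hangs when one process launches work on multiple
# GPUs with send/receive (may itself cause other concurrency problems;
# see Troubleshooting > Known Issues in the NCCL User Guide).
export NCCL_LAUNCH_MODE=PARALLEL

# Disable GPU Direct RDMA if point-to-point connections exhaust the
# GPU PCI address space.
export NCCL_NET_GDR_LEVEL=0

# Shrink the per-peer FIFO (4 MB by default) to save GPU memory at
# scale. Value is in bytes; 1 MB is shown here purely as an example.
export NCCL_BUFFSIZE=1048576
```

These variables only need to be set in the environment of the process that creates the NCCL communicator.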

Fixed Issues

The following issues have been resolved in NCCL 2.7.8:
  • Fixed "Collective Mismatch" errors that were erroneously reported when using send/receive operations.