NCCL Release 2.7.6
These are the NCCL 2.7.6 release notes. For previous NCCL release notes, refer to the NCCL Archives.
Compatibility
- Deep learning framework containers. Refer to the Support Matrix for the supported container version.
- This NCCL release supports CUDA 10.1, CUDA 10.2, and CUDA 11.0.
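As an illustration only (not part of this release's contents), the following minimal C sketch queries the NCCL and CUDA runtime versions at startup to confirm that an application is running one of the combinations listed above. ncclGetVersion() and cudaRuntimeGetVersion() are standard NCCL and CUDA runtime calls; the integer encodings shown in the comments are assumptions based on the usual version macros.

    /* Version check sketch; error handling omitted for brevity. */
    #include <stdio.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(void) {
        int ncclVersion = 0, cudaVersion = 0;
        ncclGetVersion(&ncclVersion);        /* e.g. 2706 for NCCL 2.7.6 */
        cudaRuntimeGetVersion(&cudaVersion); /* e.g. 11000 for CUDA 11.0 */
        printf("NCCL %d, CUDA runtime %d\n", ncclVersion, cudaVersion);
        return 0;
    }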
Known Issues
Send/receive operations have a number of limitations:
- Using send/receive operations in combination with launching work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue (see the sketch after this list), but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.
- Aggregation is not supported on a given source-destination pair, meaning that there can only be one send/receive per source-destination pair.
- Each source-destination pair allocates a dedicated 4 MB FIFO (see the NCCL_BUFFSIZE variable), which can use a significant amount of GPU memory at scale.
- When using GPU Direct RDMA, each point-to-point connection also uses resources in the GPU PCI address space, which can run out on some GPU models. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).
- Send/receive operations are not yet optimized for the DGX-1 NVLink topology.
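To make the limitations above concrete, here is a minimal C sketch (illustrative only, not taken from this release) of the single-process, multi-GPU send/receive pattern they refer to: exactly one ncclSend/ncclRecv per source-destination pair inside a ncclGroupStart/ncclGroupEnd section, with the NCCL_LAUNCH_MODE and NCCL_BUFFSIZE workarounds applied via setenv before communicator creation. The two-GPU ring, message size, and environment variable values are assumptions chosen for the example, and error checking is omitted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    #define NGPUS 2
    #define COUNT (1 << 20)

    int main(void) {
        /* Workarounds discussed above, set before creating communicators:
         * PARALLEL launch mode for single-process multi-GPU send/receive,
         * and a smaller per-peer FIFO (1 MB instead of the 4 MB default).
         * Values are examples only. */
        setenv("NCCL_LAUNCH_MODE", "PARALLEL", 1);
        setenv("NCCL_BUFFSIZE", "1048576", 1);

        ncclComm_t comms[NGPUS];
        cudaStream_t streams[NGPUS];
        float *sendbuf[NGPUS], *recvbuf[NGPUS];

        for (int i = 0; i < NGPUS; i++) {
            cudaSetDevice(i);
            cudaMalloc((void **)&sendbuf[i], COUNT * sizeof(float));
            cudaMalloc((void **)&recvbuf[i], COUNT * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }
        ncclCommInitAll(comms, NGPUS, NULL); /* one communicator per local GPU */

        /* Ring exchange: one send and one receive per source-destination
         * pair, since aggregation on a given pair is not supported. */
        ncclGroupStart();
        for (int i = 0; i < NGPUS; i++) {
            int next = (i + 1) % NGPUS;
            int prev = (i - 1 + NGPUS) % NGPUS;
            ncclSend(sendbuf[i], COUNT, ncclFloat, next, comms[i], streams[i]);
            ncclRecv(recvbuf[i], COUNT, ncclFloat, prev, comms[i], streams[i]);
        }
        ncclGroupEnd();

        for (int i = 0; i < NGPUS; i++) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
            cudaFree(sendbuf[i]);
            cudaFree(recvbuf[i]);
            cudaStreamDestroy(streams[i]);
        }
        printf("done\n");
        return 0;
    }

Whether PARALLEL launch mode or a smaller NCCL_BUFFSIZE is appropriate depends on the workload; consult the NCCL User Guide sections referenced above before applying either setting.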
Fixed Issues
- Fixed a crash when NVSwitch is not visible inside a virtual machine.