NCCL Release 2.7.3
These are the NCCL 2.7.3 release notes. For previous NCCL release notes, refer to the NCCL Archives.
Key Features and Enhancements
- Added support for A100 GPU and related platforms
- Added support for CUDA 11
- Added support for send/receive operations (beta)
Compatibility
- Deep learning framework containers. Refer to the Support Matrix for the supported container version.
- This NCCL release supports CUDA 10.1, CUDA 10.2, and CUDA 11.0.
Known Issues
Send/receive operations have a number of limitations:
- Using send/receive operations while launching work on multiple GPUs from a single process can fail or hang if the GPUs process different amounts of data. Setting NCCL_LAUNCH_MODE=PARALLEL can work around the issue, but can also cause other problems. For more information, see the NCCL User Guide section Troubleshooting > Known Issues > Concurrency Between NCCL and CUDA calls.
- Aggregation is not supported on a given source-destination pair, meaning that there can only be one send/receive per source-destination pair.
- Each source-destination pair allocates a dedicated FIFO of 4 MB (see the NCCL_BUFFSIZE variable), which can use a significant amount of GPU memory at scale.
- When using GPU Direct RDMA, each point-to-point connection also uses resources in the GPU PCI address space, which can be exhausted on some GPU models. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).
- Send/receive operations are not yet optimized for the DGX-1 NVLink topology.
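To get a feel for the FIFO memory limitation above, the following is a rough back-of-the-envelope sketch, not an NCCL tool. It assumes one dedicated buffer per directed source-destination pair, sized at the 4 MB quoted above, and an all-to-all pattern in which every rank sends to every other rank. The function names (p2p_fifo_bytes, all_to_all_pairs, workaround_env) are hypothetical helpers for illustration only. The last helper simply collects the two environment variables named in the notes; NCCL_BUFFSIZE is assumed here to be given in bytes.

```python
# Rough estimate of point-to-point FIFO memory.
# Assumption for this sketch: one dedicated buffer per directed
# source-destination pair, as described in the limitation above.

DEFAULT_BUFFSIZE = 4 * 1024 * 1024  # 4 MB per pair, the size quoted in the notes


def p2p_fifo_bytes(num_pairs, buffsize=DEFAULT_BUFFSIZE):
    """Total FIFO bytes for `num_pairs` source-destination pairs."""
    if num_pairs < 0 or buffsize <= 0:
        raise ValueError("num_pairs must be >= 0 and buffsize must be > 0")
    return num_pairs * buffsize


def all_to_all_pairs(num_ranks):
    """Directed pairs when every rank sends to every other rank."""
    return num_ranks * (num_ranks - 1)


def workaround_env(buffsize=None, disable_gdr=False):
    """Collect the environment overrides named in the notes.

    They would need to be set before NCCL initializes.
    """
    env = {}
    if buffsize is not None:
        env["NCCL_BUFFSIZE"] = str(buffsize)  # assumed to be in bytes
    if disable_gdr:
        env["NCCL_NET_GDR_LEVEL"] = "0"
    return env


if __name__ == "__main__":
    ranks = 64
    total = p2p_fifo_bytes(all_to_all_pairs(ranks))
    print(f"{ranks} ranks, all-to-all: {total / 2**20:.0f} MiB of FIFO in total")
    print(workaround_env(buffsize=1024 * 1024, disable_gdr=True))
```

Under these assumptions the total grows quadratically with the number of ranks, which is why reducing NCCL_BUFFSIZE can matter at scale. How the buffers are actually distributed across GPUs depends on NCCL internals that this sketch does not model.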
Fixed Issues
- Fixed crash when only a subset of GPUs is visible within a container (#326).