NVIDIA Deep Learning NCCL Documentation
Release Notes (PDF) - v2.21.5 - Last updated April 5, 2024

NCCL Release 2.2.12

Key Features and Enhancements

This NCCL release includes the following key features and enhancements.
  • Added support for collective operations aggregation.
  • Added ncclBroadcast function.

Using NCCL 2.2.12

Ensure you are familiar with the following notes when using this release.
  • If NCCL returns an error code, set the environment variable NCCL_DEBUG to WARN to receive an explicit error message.
  • The NCCL 2.x API is different from NCCL 1.x. Some porting may be needed for NCCL 1.x applications to work correctly. Refer to the migration documentation in the NCCL Developer Guide.
  • Starting in 2.2, NCCL supports collective aggregation. To enable this feature, place multiple NCCL collective operations between ncclGroupStart() and ncclGroupEnd().
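
As an illustration of the NCCL_DEBUG note above, the following is a minimal Python sketch of a launcher that starts an application with NCCL_DEBUG set to WARN. The helper names nccl_debug_env and run_with_nccl_warnings are hypothetical and are not part of NCCL. The only detail taken from these notes is the variable name and its WARN value.

```python
import os
import subprocess


def nccl_debug_env(base_env=None, level="WARN"):
    """Return a copy of the environment with NCCL_DEBUG set.

    Hypothetical helper: copies the given mapping (or os.environ when
    none is given) so the caller's environment is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


def run_with_nccl_warnings(cmd):
    """Launch cmd (a list of arguments) with NCCL_DEBUG=WARN.

    Returns the subprocess.CompletedProcess so the caller can inspect
    the return code of the NCCL application.
    """
    return subprocess.run(cmd, env=nccl_debug_env(), check=False)
```

For example, run_with_nccl_warnings(["./my_nccl_app"]) would start a hypothetical binary named my_nccl_app with the variable set, leaving the parent process environment unchanged.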

Known Issues

  • Using multiple processes in conjunction with multiple threads to manage the different GPUs may in some cases cause ncclCommInitRank to fail while establishing IPCs (cudaIpcOpenMemHandle). This problem does not appear when using only processes or only threads. The issue is fixed in recent driver versions, so consider updating to the latest driver.
  • NCCL uses CUDA 9 cooperative group launch by default, which may increase latency in multi-threaded programs. See the NCCL_LAUNCH_MODE environment variable to restore the original behavior.
  • Driver version 390 and later can cause data corruption when used together with GPUDirect RDMA. Disabling GPUDirect RDMA by setting NCCL_IB_CUDA_SUPPORT=0 or reverting to driver 387 should resolve the issue.
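
The environment-variable workarounds listed above can be sketched as a small Python helper that prepares the environment for an NCCL job. This is only an illustration: the function name and its driver_major and multithreaded parameters are hypothetical, and the value PARALLEL for NCCL_LAUNCH_MODE is an assumption that is not stated in these notes, so check it against the NCCL environment variable documentation before relying on it.

```python
def apply_known_issue_workarounds(env, driver_major, multithreaded=False):
    """Return a copy of env with workarounds for the known issues applied.

    env           -- mapping of environment variables for the NCCL job
    driver_major  -- major driver version, supplied by the caller (hypothetical)
    multithreaded -- True if the program drives its GPUs from several threads
    """
    out = dict(env)  # never mutate the caller's mapping

    # Driver 390 and later: disable GPUDirect RDMA to avoid data corruption.
    if driver_major >= 390:
        out["NCCL_IB_CUDA_SUPPORT"] = "0"

    # Multi-threaded programs: opt out of cooperative group launch.
    # ASSUMPTION: "PARALLEL" is the value that restores the original behavior.
    if multithreaded:
        out["NCCL_LAUNCH_MODE"] = "PARALLEL"

    return out
```

The resulting mapping can be passed as the env argument when launching the job, which keeps the workarounds scoped to that job rather than to the whole shell session.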

Fixed Issues

  • NCCL no longer clears the CPU affinity during initialization functions.
  • Fixed various large-scale issues.
  • Reduced the size of the library.
  • Fixed a crash or hang with PyTorch related to the use of fork calls.