NVIDIA Deep Learning NCCL Documentation
Release Notes (PDF) - v2.23.4 - Last updated September 16, 2024

NCCL Release 2.18.1

These are the release notes for NCCL 2.18.1. For release notes of previous NCCL versions, refer to the NCCL Archives.

Compatibility

NCCL 2.18.1 has been tested with the following:

Key Features and Enhancements

This NCCL release includes the following key features and enhancements.

  • Add inter-node algorithms for NVLink SHARP: NVLink SHARP + IB SHARP (NVLS), NVLink SHARP + Tree (NVLSTREE).

  • Add ncclCommSplit primitive, with optional resource sharing.

  • Add support for memory management using cuMem functions (disabled by default).

  • Add option to use multiple QPs in round-robin mode.
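
The grouping performed by a communicator split can be sketched in plain Python. This is only a model, not NCCL code. It assumes ncclCommSplit follows MPI_Comm_split-style semantics: ranks passing the same color join the same new communicator, the key orders ranks within it (ties fall back to the parent rank), and a "no color" value opts a rank out. The names `model_comm_split` and `NO_COLOR` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Stand-in for NCCL's "no color" constant; the value is used only by this model.
NO_COLOR = -1


def model_comm_split(requests):
    """Model how a color/key split groups ranks.

    `requests` maps parent rank -> (color, key).
    Returns {parent_rank: (color, new_rank)}; ranks that passed NO_COLOR map to None.
    """
    groups = defaultdict(list)
    for parent_rank, (color, key) in requests.items():
        if color != NO_COLOR:
            groups[color].append((key, parent_rank))

    result = {rank: None for rank in requests}
    for color, members in groups.items():
        # Sort by key first; equal keys are ordered by parent rank.
        for new_rank, (_key, parent_rank) in enumerate(sorted(members)):
            result[parent_rank] = (color, new_rank)
    return result


if __name__ == "__main__":
    # Eight ranks split into two groups of four, one group per node (rank // 4).
    requests = {rank: (rank // 4, rank) for rank in range(8)}
    for rank, placement in sorted(model_comm_split(requests).items()):
        print(rank, placement)
```

In a real application every rank of the parent communicator would make the split call itself; the model above only shows which ranks would end up together and in what order.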

Known Issues

  • On systems where a NIC shares a PCI switch with only one GPU (as on HGX H100), the Tree algorithm makes data transit through the CPU, which makes the LL128 protocol unsafe and could result in data corruption. You can work around this issue by setting the following:

    NCCL_IB_PCI_RELAXED_ORDERING=0

    Another solution is to disable the LL128 protocol with the following:

    NCCL_PROTO=^LL128

  • On systems with fewer than one NIC per GPU, running bfloat16 reductions with IB SHARP will cause a hang.

  • Running allreduce on H100 platforms with IB SHARP and multiple GPUs per process can result in data corruption if not all GPUs on the node are part of the same process, or if they cannot directly access other GPUs' buffers.
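
The two workarounds for the LL128 issue above can be applied from a launcher script. The following is a minimal sketch, assuming the job is started as a child process and that the variables must be in the environment before NCCL initializes. The helper name `ll128_workaround_env` and the binary `./my_nccl_app` are hypothetical.

```python
import os


def ll128_workaround_env(base_env=None, disable_ll128=False):
    """Return a copy of the environment with one of the two workarounds applied.

    By default, sets NCCL_IB_PCI_RELAXED_ORDERING=0.
    With disable_ll128=True, sets NCCL_PROTO=^LL128 instead.
    """
    env = dict(os.environ if base_env is None else base_env)
    if disable_ll128:
        env["NCCL_PROTO"] = "^LL128"
    else:
        env["NCCL_IB_PCI_RELAXED_ORDERING"] = "0"
    return env


if __name__ == "__main__":
    env = ll128_workaround_env(base_env={})
    print(env)
    # To launch a job with this environment (hypothetical binary name):
    # import subprocess
    # subprocess.run(["./my_nccl_app"], env=env, check=True)
```

Building a separate environment dictionary leaves the launcher's own environment untouched, so the workaround applies only to the job being started.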

Fixed Issues

The following issues have been resolved in NCCL 2.18.1:

  • Fixed hangs with irregular send/receive patterns (e.g., alltoallv).

  • Send/Receive operations now use all NICs on systems with more than one NIC per GPU.

  • Increased number of channels on H100 for network communication when bandwidth is not limited by NVLink bandwidth.

  • Improved error reporting in case of IB Verbs errors.

  • Fixed context creation for progress thread.

  • Fixed hang in commReclaim.

  • Fixed performance issue when NVB was disabled.

Updating the GPG Repository Key

To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning on April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install NCCL packages. To ensure continued access to the latest NCCL release, please follow the updated NCCL installation guide.