NVIDIA Deep Learning NCCL Documentation
Release Notes (PDF) - v2.21.5 - Last updated June 18, 2024

NCCL Release 2.18.3

These are the release notes for NCCL 2.18.3. For previous NCCL release notes, refer to the NCCL Archives.

Compatibility

NCCL 2.18.3 has been tested with the following:

Fixed Issues

The following issues have been resolved in NCCL 2.18.3:

  • Fixed data corruption on DGX/HGX H100 systems when using LL128 protocol.

  • Fixed hang with IB SHARP and bfloat16 on systems with less than one NIC per GPU.

  • Fixed regression in initialization time.

  • Fixed data corruption with IB SHARP on H100 platforms when combining multiple GPUs per process and multiple processes per node.

  • Fixed crash when shared memory creation fails.

  • Fixed Avg operation with IB SHARP when using Collnet/Chain algorithm.

  • Fixed performance for all-to-all operations at large scale on systems with more than one NIC per GPU.

  • Fixed performance on DGX H800.

  • Fixed race condition in connection progress that caused a crash.

  • Fixed network flush with IB SHARP.

  • Fixed PXN operation when CUDA_VISIBLE_DEVICES is set.

  • Fixed performance of aggregated reduceScatter/allGather operations.

Known Issues

  • Send/receive communication using CUDA_VISIBLE_DEVICES and PXN only works if the mapping of GPUs to local ranks is the same across nodes. Disabling PXN for send/receive communication works around the issue (NCCL_P2P_PXN_LEVEL=0).
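As a minimal sketch of the workaround above, the PXN level can be set in the environment before launching the job. The launcher name below (`mpirun` and its arguments) is illustrative; use whatever launcher your job already uses.

```shell
# Work around the known issue: disable PXN for send/receive communication
# when GPU-to-local-rank mappings differ across nodes.
export NCCL_P2P_PXN_LEVEL=0

# Illustrative launch; substitute your actual launcher and application.
# mpirun -np 8 ./my_nccl_app
echo "NCCL_P2P_PXN_LEVEL=${NCCL_P2P_PXN_LEVEL}"
```

Because NCCL reads its `NCCL_*` variables at initialization time, the variable must be exported before the first communicator is created.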

Updating the GPG Repository Key

To ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by the apt, dnf/yum, and zypper package managers, beginning on April 27, 2022. If you do not update your repository signing keys, package management errors will occur when you attempt to access or install NCCL packages. To ensure continued access to the latest NCCL release, follow the updated NCCL installation guide.
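As a hedged sketch of the key rotation on a Debian-based system (a config fragment, not a runnable script): the placeholder `<distro>` and `<arch>` values must be filled in for your platform, and the exact keyring package name and repository path should be taken from the NCCL installation guide rather than from this example.

```shell
# Remove the outdated signing key from apt's trusted set.
sudo apt-key del 7fa2af80

# Install NVIDIA's keyring package for your distro/arch
# (placeholders; consult the installation guide for exact values).
wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
```

On RPM-based systems the equivalent step is refreshing the repository metadata (e.g. `dnf clean expire-cache` for dnf/yum) after the new key is in place.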