NCCL Release 2.4.8

These are the NCCL 2.4.8 release notes. This release includes the fixes from previous NCCL 2.4.x releases as well as the additional changes listed below. For earlier releases, see the archived NCCL Release Notes.

Key Features and Enhancements

This NCCL release includes the following key features and enhancements.
  • Improved socket transport performance by splitting each transfer across multiple sockets.
    Note: This feature adds two new environment variables, NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD, which let users tune NCCL performance on socket-based networks. See the NCCL documentation for more details.
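As a minimal sketch of how the two variables might be set from a Python launcher before NCCL is initialized: the helper name socket_tuning_env, the example values (4 and 4), and the cap of 64 on the product of the two settings are assumptions for illustration. The cap follows the current NCCL documentation and may not apply to every release, so check the documentation for the values valid on your system.

```python
import os

# Assumption: current NCCL documentation caps the product of the two settings at 64.
MAX_TOTAL_SOCKETS = 64


def socket_tuning_env(nthreads, nsocks_per_thread):
    """Return the NCCL socket tuning variables as a dict of strings.

    Hypothetical helper, not part of NCCL. It only validates the values
    and formats them for the process environment.
    """
    if nthreads < 1 or nsocks_per_thread < 1:
        raise ValueError("both settings must be positive integers")
    if nthreads * nsocks_per_thread > MAX_TOTAL_SOCKETS:
        raise ValueError("thread count times sockets per thread exceeds the assumed cap")
    return {
        "NCCL_SOCKET_NTHREADS": str(nthreads),
        "NCCL_NSOCKS_PERTHREAD": str(nsocks_per_thread),
    }


# Export the variables before any library that initializes NCCL is started,
# since NCCL reads its environment at initialization time.
os.environ.update(socket_tuning_env(4, 4))
```

The same effect can be had by exporting the two variables in the shell that launches the job; the best values depend on the network and are best found by benchmarking.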

Compatibility

NCCL 2.4.8 has been tested with the following:

Fixed Issues

The following issues have been resolved in NCCL 2.4.8:
  • Suboptimal performance with TCP over high-bandwidth networks. (GitHub issue #209)

Known Issues

  • On single-node Power systems with 4 GPUs, some performance regressions have been observed compared to NCCL 2.4.2. These will be addressed in future NCCL releases.
  • By default, NCCL does not enable direct P2P communication across different PCIe root ports on Intel Skylake and later CPUs, because of a known performance issue when using P2P on these CPUs. Intel and its OEM vendors now provide a BIOS performance tuning option (PCIe Peer-to-Peer Serialization) that resolves this P2P bandwidth issue. If this BIOS option has been enabled, NCCL direct P2P connections can be re-enabled by setting NCCL_P2P_LEVEL=5.
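The P2P workaround above could be applied from a Python launcher roughly as follows. This is a sketch only: the function name enable_direct_p2p and its bios_option_enabled flag are hypothetical, and the flag has to be supplied by the operator, because the code has no way to detect the BIOS setting.

```python
import os


def enable_direct_p2p(bios_option_enabled, env=None):
    """Set NCCL_P2P_LEVEL=5 only when the operator confirms the BIOS option.

    Hypothetical helper, not part of NCCL. `env` defaults to the process
    environment; pass a dict to build an environment for a child process.
    Returns the resulting NCCL_P2P_LEVEL value, or None if it is unset.
    """
    env = os.environ if env is None else env
    if bios_option_enabled:
        env["NCCL_P2P_LEVEL"] = "5"
    return env.get("NCCL_P2P_LEVEL")


# Example: build a child-process environment with direct P2P re-enabled.
child_env = dict(os.environ)
enable_direct_p2p(True, child_env)
```

Leaving the variable unset keeps the NCCL default described above, which is the safer choice when the BIOS option has not been verified.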