NCCL Release 2.4.2

Key Features and Enhancements

This NCCL release includes the following key features and enhancements.
  • Added tree-based algorithms that improve AllReduce performance at scale and for small and medium-sized messages.
  • Added support for external network plugins (e.g., libfabric).
  • Added the ncclCommGetAsyncError() function to report errors that occur during collective operations.
  • Added the ncclCommAbort() function to destroy a communicator, aborting any outstanding operations.
  • Added support for ranks that have different CUDA_VISIBLE_DEVICES settings.
  • Added a best-effort mechanism to check for size mismatches among collective calls.
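
As a rough illustration of why tree-based algorithms help at scale, the sketch below compares the number of sequential communication steps in a textbook ring AllReduce with those of a reduce-then-broadcast over a single balanced binary tree. It is a simplified latency model, not NCCL's implementation: the function names are hypothetical, and bandwidth effects and the actual tree layout NCCL uses are ignored.

```c
#include <assert.h>

/* Sequential steps in a textbook ring allreduce over n ranks:
 * (n - 1) reduce-scatter steps followed by (n - 1) allgather steps. */
int ring_allreduce_steps(int n) {
    assert(n >= 1);
    return 2 * (n - 1);
}

/* Depth of a balanced (complete) binary tree holding n ranks,
 * i.e. floor(log2(n)). */
int tree_depth(int n) {
    assert(n >= 1);
    int depth = 0;
    while (n > 1) {
        n /= 2;
        depth++;
    }
    return depth;
}

/* Sequential steps in a tree allreduce under this model:
 * reduce up to the root, then broadcast back down. */
int tree_allreduce_steps(int n) {
    return 2 * tree_depth(n);
}
```

Under this model, 1024 ranks need 2046 sequential steps on a ring but only 20 on a tree, so the difference matters most for small and medium messages, where per-step latency dominates transfer time.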

Using NCCL 2.4.2

Ensure you are familiar with the following notes when using this release.
  • No notes for this release.

Known Issues

  • None.

Fixed Issues

  • Support communication between Mesos containers (GitHub issue #155).
  • Fixed the case where posix_fallocate() returns EINTR (GitHub issue #137).
  • NCCL threads no longer escape the CPU affinity set by the user or job scheduler.
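
For readers unfamiliar with the posix_fallocate() issue, the sketch below shows the general pattern of retrying the call when it is interrupted by a signal. It is an illustration only, not the actual NCCL patch; fallocate_retry() and reserve_temp_file() are hypothetical helper names, and the temporary file path is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Retry posix_fallocate() while it is interrupted by a signal.
 * Note: posix_fallocate() returns the error number directly and
 * does not set errno. Returns 0 on success or an error number. */
int fallocate_retry(int fd, off_t offset, off_t len) {
    int err;
    do {
        err = posix_fallocate(fd, offset, len);
    } while (err == EINTR);
    return err;
}

/* Hypothetical demo helper: create an unlinked temporary file,
 * reserve len bytes for it, and return the resulting file size
 * (or -1 on failure). */
long reserve_temp_file(long len) {
    char path[] = "/tmp/fallocate_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path); /* the file disappears once fd is closed */

    long size = -1;
    struct stat st;
    if (fallocate_retry(fd, 0, (off_t)len) == 0 && fstat(fd, &st) == 0) {
        size = (long)st.st_size;
    }
    close(fd);
    return size;
}
```

Because the error is reported through the return value, the loop compares that value with EINTR rather than inspecting errno.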