NCCL Release 2.4.8
These are the release notes for NCCL 2.4.8. This release
includes fixes from the previous NCCL 2.4.x releases as well as the following additional
changes. For previous NCCL release notes, see the archived NCCL Release Notes.
Key Features and Enhancements
This NCCL release includes the following key features and enhancements.
- Improved socket transport performance by splitting transfers over multiple
sockets.
Note: This feature adds two new environment variables, NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD, which let users tune NCCL performance on socket-based networks. See the NCCL documentation for more details.
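As an illustration only, the two variables could be set from a Python launcher before NCCL is initialized. This is a minimal sketch, not a recommended configuration: the helper name socket_tuning_env, the example values, and the limits it checks (each value between 1 and 16, product at most 64) are assumptions taken from our reading of the NCCL environment variable documentation and should be verified for the NCCL version in use.

```python
import os

# Assumed limits; confirm them in the NCCL documentation for your version.
MAX_PER_SETTING = 16
MAX_TOTAL_SOCKETS = 64


def socket_tuning_env(nthreads: int, nsocks_per_thread: int) -> dict:
    """Hypothetical helper: build the two socket tuning variables as a dict."""
    for name, value in (("nthreads", nthreads),
                        ("nsocks_per_thread", nsocks_per_thread)):
        if not 1 <= value <= MAX_PER_SETTING:
            raise ValueError(f"{name} must be in 1..{MAX_PER_SETTING}, got {value}")
    if nthreads * nsocks_per_thread > MAX_TOTAL_SOCKETS:
        raise ValueError(
            f"nthreads * nsocks_per_thread must not exceed {MAX_TOTAL_SOCKETS}"
        )
    return {
        "NCCL_SOCKET_NTHREADS": str(nthreads),
        "NCCL_NSOCKS_PERTHREAD": str(nsocks_per_thread),
    }


# Example values only; the best setting depends on the network.
# The variables must be in the environment before NCCL is initialized.
os.environ.update(socket_tuning_env(4, 4))
```

The same variables can equally be exported in the shell or job script that launches the application; the Python form is used here only to keep the checks explicit.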
Compatibility
NCCL 2.4.8 has been tested with the following:
- Deep learning framework 19.05 containers
- This NCCL release supports CUDA 9.0, CUDA 9.2, CUDA 10.0, and CUDA 10.1.
Fixed Issues
The following issues have been resolved in NCCL 2.4.8:
- Suboptimal performance with TCP over high-bandwidth networks. (GitHub issue #209)
Known Issues
- On single-node Power systems with 4 GPUs, some performance regressions have been observed compared to NCCL 2.4.2. These will be addressed in future NCCL releases.
- By default, NCCL does not enable direct P2P communication through different PCIe root ports on Intel Skylake and later CPUs. This is due to a known performance issue when using P2P on these CPUs. A new BIOS performance tuning option (PCIe Peer-to-Peer Serialization) is now available from Intel and their OEM vendors that resolves this P2P bandwidth issue. If this BIOS option has been enabled, NCCL direct P2P connections can be re-enabled by setting NCCL_P2P_LEVEL=5.
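
For the P2P workaround above, a launcher might set NCCL_P2P_LEVEL only on systems where the BIOS option is known to be enabled. The sketch below assumes the operator supplies that fact; the function name p2p_env and the flag bios_p2p_serialization_enabled are hypothetical, and nothing here detects the BIOS setting.

```python
import os


def p2p_env(bios_p2p_serialization_enabled: bool, base_env=None) -> dict:
    """Hypothetical helper: return a copy of the environment for an NCCL job.

    NCCL_P2P_LEVEL=5 is added only when the operator confirms that the
    PCIe Peer-to-Peer Serialization BIOS option is enabled. The flag is
    supplied by the caller; this code does not inspect the BIOS.
    """
    env = dict(os.environ if base_env is None else base_env)
    if bios_p2p_serialization_enabled:
        env["NCCL_P2P_LEVEL"] = "5"
    return env


# Environment for a node where the BIOS option has been turned on.
job_env = p2p_env(bios_p2p_serialization_enabled=True)
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting the training process, so that the setting applies to that job only.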