RDMA and SHARP collectives are enabled in the NVIDIA NCCL (‘nickel’) collective communication library through the NCCL-SHARP plugin.
The NCCL-SHARP plugin is distributed through the following channels:
- Binary distribution with HPC-X. The plugin is added to the environment when the HPC-X modules are loaded, and NCCL loads it automatically. For other CUDA versions, the plugin can be built from source.
- Source distribution: https://github.com/Mellanox/nccl-rdma-sharp-plugins
Users can build the plugin from source and set LD_LIBRARY_PATH so that NCCL can find and load it.
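The source-build route can be sketched as follows. This is a minimal sketch rather than the project's official procedure: the install prefix `PLUGIN_PREFIX` and the helper `prepend_plugin_path` are hypothetical names, and the build commands shown in the comments assume the repository's autotools flow (`autogen.sh`, `configure`, `make`), which should be checked against its README.

```shell
# Hypothetical install prefix; adjust to your site.
PLUGIN_PREFIX="${PLUGIN_PREFIX:-/opt/nccl-rdma-sharp-plugins}"

# Assumed build steps (run manually inside a checkout of the repository):
#   git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins
#   cd nccl-rdma-sharp-plugins
#   ./autogen.sh
#   ./configure --prefix="$PLUGIN_PREFIX"
#   make && make install

# Prepend <prefix>/lib to LD_LIBRARY_PATH, unless it is already listed.
prepend_plugin_path() {
    libdir="$1/lib"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$libdir:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}

prepend_plugin_path "$PLUGIN_PREFIX"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Prepending rather than appending means a locally built plugin takes precedence over any copy that is already on the library path.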
- NVIDIA ConnectX-6 HDR
- NVIDIA Quantum HDR Switch
To check whether the GPUDirect RDMA module is loaded, run:
To run this verification on other Linux flavors:
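One distribution-independent way to perform this check is to look for the module in `lsmod` output. The following is a minimal sketch, assuming the GPUDirect RDMA kernel module is named `nv_peer_mem` (the legacy Mellanox module) or `nvidia_peermem` (the name shipped with newer NVIDIA drivers); `has_peer_mem_module` is a hypothetical helper, and the module name on a given system may differ.

```shell
# Reads lsmod-style text on stdin; returns 0 if a GPUDirect RDMA
# peer-memory module is listed. Module names are assumptions.
has_peer_mem_module() {
    grep -Eq '^(nv_peer_mem|nvidia_peermem)[[:space:]]'
}

if command -v lsmod >/dev/null 2>&1 && lsmod | has_peer_mem_module; then
    echo "GPUDirect RDMA module is loaded"
else
    echo "GPUDirect RDMA module not found"
fi
```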
NCCL version 2.7.3 or higher
Please refer to NVIDIA’s Developer Guide for more details: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/index.html
The following environment variables enable SHARP aggregation in NCCL when the NCCL-SHARP plugin is used.
- NCCL variables:
NCCL_ALGO=CollNet (required to overcome a bug in NCCL <= 2.7.8)
- SHARP variables: (for guaranteed SAT resources on initialization)
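As a hedged illustration of the NCCL side of this configuration, the sketch below exports `NCCL_ALGO=CollNet` only when the installed NCCL is 2.7.8 or older. Several names are assumptions: `MY_NCCL_VERSION` is a hypothetical variable you would set yourself, `needs_algo_workaround` is a hypothetical helper that relies on GNU `sort -V`, and `NCCL_COLLNET_ENABLE` is an NCCL variable that is not listed in the text above. The SHARP variables are not reproduced here.

```shell
# Returns 0 when the given NCCL version is <= 2.7.8, i.e. when the
# NCCL_ALGO workaround applies. Requires GNU sort for -V.
needs_algo_workaround() {
    [ "$(printf '%s\n%s\n' "$1" "2.7.8" | sort -V | tail -n 1)" = "2.7.8" ]
}

# Hypothetical: set this to the NCCL version installed on your system.
MY_NCCL_VERSION="${MY_NCCL_VERSION:-2.7.8}"

# Assumption: NCCL_COLLNET_ENABLE is not mentioned in the text above.
export NCCL_COLLNET_ENABLE=1

if needs_algo_workaround "$MY_NCCL_VERSION"; then
    export NCCL_ALGO=CollNet
fi

echo "NCCL_COLLNET_ENABLE=$NCCL_COLLNET_ENABLE NCCL_ALGO=${NCCL_ALGO:-unset}"
```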
Cluster Topology for Using NVIDIA SHARP SAT with NCCL
NVIDIA switches support a limited number of streaming aggregation flows (maximum: 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one streaming aggregation flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch is connected to the same HCA rail on every server.
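The rail requirement can be checked mechanically once the cabling is written down. The sketch below is illustrative only: it assumes a hypothetical plain-text cabling map with one `server rail leaf_switch` triple per line (this is not a format produced by any NVIDIA tool) and reports any leaf switch that is wired to more than one HCA rail. `check_rail_topology` and the node, rail, and switch names are made up for the example.

```shell
# Reads "server rail leaf_switch" lines on stdin. Returns non-zero if
# any leaf switch is connected to more than one distinct HCA rail.
check_rail_topology() {
    awk '
        NF < 3 || /^#/ { next }   # skip blank, short, and comment lines
        {
            rail = $2; leaf = $3
            if ((leaf in seen) && seen[leaf] != rail) {
                print "leaf " leaf " mixes rails " seen[leaf] " and " rail
                bad = 1
            } else {
                seen[leaf] = rail
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Hypothetical two-server, two-rail cluster wired rail by rail.
if check_rail_topology <<'EOF'
node1 mlx5_0 leaf1
node2 mlx5_0 leaf1
node1 mlx5_1 leaf2
node2 mlx5_1 leaf2
EOF
then
    echo "topology is rail-aligned"
else
    echo "topology is not rail-aligned"
fi
```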
NCCL Benchmark Example
Basic performance of the setup can be sanity-checked with the NCCL tests: https://github.com/NVIDIA/nccl-tests
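A run of the all-reduce benchmark might be assembled as in the sketch below. It only prints the command (a dry run) and makes several assumptions: `NCCL_TESTS_DIR` is a hypothetical path to a built nccl-tests checkout, `allreduce_cmd` is a hypothetical helper, the rank count of 16 is arbitrary, and `-x` is the Open MPI option for forwarding an environment variable, so other MPI launchers need a different flag. The `all_reduce_perf` options `-b`, `-e`, `-f`, and `-g` set the minimum size, maximum size, size multiplication factor, and GPUs per process.

```shell
# Hypothetical location of a built nccl-tests checkout.
NCCL_TESTS_DIR="${NCCL_TESTS_DIR:-/opt/nccl-tests}"

# Print an mpirun command line for all_reduce_perf.
#   $1 = number of MPI ranks (one GPU per rank)
#   $2 = path to the nccl-tests checkout
allreduce_cmd() {
    echo "mpirun -np $1 -x NCCL_ALGO $2/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1"
}

# NCCL_ALGO must be exported for -x to forward it to the ranks.
export NCCL_ALGO=CollNet

# Dry run: show the command instead of executing it.
allreduce_cmd 16 "$NCCL_TESTS_DIR"
```

Comparing the reported bus bandwidth with and without the plugin on the library path gives a quick indication of whether SHARP aggregation is actually in use.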