NVIDIA NCCL#
This section provides information about NVIDIA Collective Communications Library (NCCL).
NCCL#
NVIDIA Collective Communications Library (NCCL) is a high-performance library designed for efficient and scalable communication primitives in multi-GPU and multi-node environments. It provides optimized collective communication routines such as all-reduce, reduce-scatter, all-gather, broadcast, and reduce, and point-to-point send and recv operations that can be used to implement collectives such as all-to-all.
These operations are crucial for synchronizing and distributing data across multiple GPUs, particularly in distributed deep learning training. NCCL leverages GPUDirect RDMA, NVLink, and other hardware acceleration technologies to achieve low-latency and high-bandwidth communication. Its API is designed to be straightforward and integrates seamlessly with popular deep learning frameworks like PyTorch, TensorFlow, and MXNet, making it a key component to efficiently scale deep learning workloads.
Refer to the NCCL Official Documentation and the NVIDIA Collective Communication Library (NCCL) Documentation for more information about NCCL. This guide identifies and documents NCCL performance measurements and considerations on the GB200 platform with multi-node NVLink (MNNVL).
The open source NCCL library code can be found here: NVIDIA/nccl. GB200 MNNVL is supported in NCCL version 2.25.2 and later.
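To confirm that the installed NCCL meets this requirement, the version can be checked on a node. The commands below are a minimal sketch; they assume a Debian-based distribution for the package query and, for the second option, an existing PyTorch installation.

# Check the NCCL packages installed on a node (package names vary by distribution).
dpkg -l | grep nccl

# If PyTorch is installed, it reports the NCCL version it was built with.
python -c "import torch; print(torch.cuda.nccl.version())"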
#!/bin/bash

# SLURM environment information
GPUS_PER_NODE=${SLURM_NTASKS_PER_NODE}
MAX_NODES=${SLURM_JOB_NUM_NODES}
MIN_NODES=1

# Default NCCL environment variables
export NCCL_DEBUG=WARN

# Location of the nccl-tests binaries
TEST_DIR=./build

# Default nccl-test arguments
TEST_ARGS="-dfloat -b8 -e32G -f2"

LOGDIR=log_SWEEP_N${MAX_NODES}-N${MIN_NODES}_$(hostname)_$(date --date="today" +%Y%m%d-%H%M)
mkdir ${LOGDIR}

NODES=${MAX_NODES}
while [ ${NODES} -ge ${MIN_NODES} ]
do
    GPUS=$((NODES * GPUS_PER_NODE))

    for TEST in all_reduce_perf reduce_scatter_perf all_gather_perf alltoall_perf
    do
        echo RUNNING $TEST on ${NODES} nodes ${GPUS} GPUs

        srun --nnodes ${NODES} --ntasks ${GPUS} \
            --ntasks-per-node ${GPUS_PER_NODE} --cpu-bind=none --mpi=pmix \
            ${TEST_DIR}/${TEST} ${TEST_ARGS} 2>&1 | tee ${LOGDIR}/LOG_${TEST}_N${NODES}n${GPUS}.txt
    done

    echo TEST COMPLETE on ${NODES} nodes ${GPUS} GPUs

    NODES=$((NODES / 2))
done

echo COMPLETED all tests on ${MAX_NODES}-${MIN_NODES} nodes
An example SLURM batch script for benchmarking NCCL on GB200 systems can be seen in Listing 6. The script launches the open source nccl-tests benchmarks to evaluate performance across a range of collective operations and scales. It queries the number of nodes and tasks per node from the SLURM environment and then launches several NCCL tests, assuming one task per GPU per node. The script halves the number of nodes in each iteration to sweep over a range of job sizes.
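The sweep script is intended to be submitted as a SLURM batch job. The command below is a sketch; the script file name (nccl_sweep.sh), the node count, and the tasks-per-node value are assumptions to adjust for your cluster, with one task per GPU on each node.

# Submit the sweep starting at 8 nodes with 4 tasks (one per GPU) per node.
sbatch --nodes=8 --ntasks-per-node=4 nccl_sweep.sh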
The nccl-tests source code and build instructions can be found here: NVIDIA/nccl-tests.
Note
For testing on GB200 systems, the nccl-tests binaries should be built with MPI=1.
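A sketch of such a build follows; the MPI_HOME, CUDA_HOME, and NCCL_HOME paths are placeholders that depend on your installation.

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

# MPI=1 builds the tests against MPI so that one rank per GPU can be launched across nodes.
# Point the paths below at your MPI, CUDA, and NCCL installations.
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr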
Tuning Collective Communication#
In general, users should not need to tune or set NCCL environment variables to achieve peak performance. NCCL features advanced topology detection and internal tuning models that automatically select optimal parameters based on various factors, including the number of GPUs, NVLink domain size, NVLink speed, HCA speeds, and PCIe topology.
Specifically, on MNNVL systems such as GB200, NCCL will automatically detect the NVLink domains and identify which GPUs belong to them. It will then select the optimal settings and algorithms to maximize performance both within and between NVLink domains.
If users wish to disable MNNVL support in NCCL, they can set NCCL_MNNVL_ENABLE=0. This will cause NCCL to fall back to the available network configurations on the system, such as InfiniBand or Ethernet (RoCE).
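For example, a sketch of a single benchmark run with MNNVL disabled (the node and GPU counts are illustrative):

# Run one all-reduce benchmark with MNNVL disabled so traffic uses InfiniBand/RoCE instead.
NCCL_MNNVL_ENABLE=0 srun --nnodes 2 --ntasks 8 --ntasks-per-node 4 --mpi=pmix \
    ./build/all_reduce_perf -d float -b 8 -e 32G -f 2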
However, for developers and for system performance analysis and debugging, there are numerous environment variables that can be set; refer to the NCCL documentation for further details.
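A sketch of commonly used debug-oriented variables is shown below; the values are illustrative, and the full list is in the NCCL environment variables documentation.

# Verbose logging of NCCL initialization, topology detection, and algorithm selection.
export NCCL_DEBUG=INFO

# Restrict logging to selected subsystems to keep the output manageable.
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING

# Write the detected system topology to a file for offline inspection.
export NCCL_TOPO_DUMP_FILE=./nccl_topo.xml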
Adjusting NCCL CTAs#
The number of CTAs can significantly impact the performance of NCCL operations. NCCL calculates the best default number of CTAs for each system based on the topology and speed of all communication paths. Increasing the number of CTAs that NCCL uses can improve peak throughput, but it must be balanced with the available GPU resources to avoid oversubscription. Similarly, reducing the number of CTAs that NCCL uses during a collective call leaves more CTAs available for computation, and the reduced communication bandwidth can be hidden if the communication and computation are fully overlapped.
Using User Buffer Registration also allows NCCL to consume fewer CTAs during certain collective operations.
The number of CTAs used by NCCL can be set with the NCCL_MIN_CTAS and NCCL_MAX_CTAS environment variables. To find the optimal setting for your specific workload, test different CTA configurations, which might involve running benchmarks and monitoring performance metrics. Different GPU architectures can have different optimal CTA configurations, so ensure that your tuning is specific to the architecture you are using.
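A sketch of such a sweep using nccl-tests is shown below; the CTA values and job geometry are illustrative assumptions, not recommendations.

# Sweep a few CTA counts with the all-reduce benchmark and keep one log per setting.
for CTAS in 8 16 32
do
    NCCL_MIN_CTAS=${CTAS} NCCL_MAX_CTAS=${CTAS} \
    srun --nnodes 2 --ntasks 8 --ntasks-per-node 4 --mpi=pmix \
        ./build/all_reduce_perf -d float -b 8 -e 32G -f 2 2>&1 | tee LOG_ctas_${CTAS}.txt
done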
NVLink SHARP#
NVLink SHARP (NVLS) can be toggled off or on (NCCL_NVLS_ENABLE=[0|1]) to optimize performance based on your use case. By default, it is enabled on NVLink-based systems where it is supported (for example, third-generation NVLink switches and later). Enabling NVLS can improve performance in scenarios where high bandwidth and low latency are critical. This is particularly beneficial for applications that make use of the NCCL all-reduce collective call.
Disabling NVLS might be advantageous in environments where NCCL memory usage is a concern or where the additional bandwidth provided by NVLS is not necessary. It can also simplify the system configuration and reduce potential points of failure.
Note
To determine the actual impact on performance, always benchmark your application with NVLS enabled and disabled, and use monitoring tools to gather data on throughput, latency, and resource usage.
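As a sketch, the same nccl-tests benchmark can be run back to back with NVLS disabled and enabled (the job geometry is illustrative):

# Compare all-reduce bus bandwidth with NVLS off (0) and on (1).
for NVLS in 0 1
do
    NCCL_NVLS_ENABLE=${NVLS} \
    srun --nnodes 2 --ntasks 8 --ntasks-per-node 4 --mpi=pmix \
        ./build/all_reduce_perf -d float -b 8 -e 32G -f 2 2>&1 | tee LOG_nvls_${NVLS}.txt
done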