RDMA and SHARP collectives are enabled in the NVIDIA NCCL (pronounced ‘nickel’) collective communication library through the NCCL-SHARP plugin.

The NCCL-SHARP plugin is distributed through the following channels:

  • Binary distribution with HPC-X. The plugin is added to the environment when the HPC-X modules are loaded, and NCCL loads it automatically. The plugin can also be built from source for other CUDA versions.
  • Source distribution: https://github.com/Mellanox/nccl-rdma-sharp-plugins
    Users can build the plugin from source and add its library directory to LD_LIBRARY_PATH so that NCCL can load it.
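
As a rough sketch of the source-distribution path, the helper below prepends a plugin install directory to LD_LIBRARY_PATH after checking that the directory contains the network plugin library. The function name use_nccl_plugin is hypothetical, and the library name libnccl-net.so under <prefix>/lib is an assumption about what the build installs; adjust both to your setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make a locally built NCCL-SHARP plugin visible to NCCL.
# Assumes the plugin was installed under <prefix>/lib as libnccl-net.so.
use_nccl_plugin() {
    local prefix="$1"
    local libdir="${prefix}/lib"

    if [ ! -e "${libdir}/libnccl-net.so" ]; then
        echo "libnccl-net.so not found in ${libdir}" >&2
        return 1
    fi

    # Prepend so this build takes precedence over any other plugin copy.
    # The ${VAR:+...} form avoids leaving an empty path entry when
    # LD_LIBRARY_PATH was previously unset.
    export LD_LIBRARY_PATH="${libdir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}
```

For example, use_nccl_plugin /sw/nccl-rdma-sharp-plugins/install would register the install prefix used in the benchmark example on this page.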

Requirements

  • NVIDIA ConnectX-6 HDR
  • NVIDIA Quantum HDR Switch
  • MLNX_OFED
  • GPUDirect RDMA
    It is important to verify that the GPUDirect RDMA kernel module is properly loaded on every compute node where you plan to run a job that requires GPUDirect RDMA.

    To check whether the GPUDirect RDMA module is loaded, run:

    # service nv_peer_mem status

    To run this verification on other Linux flavors:

    # lsmod | grep nv_peer_mem
  • NCCL version 2.7.3 or higher
    Please refer to NVIDIA’s Developer Guide for more details: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/index.html
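
The software requirements above can be checked before launching a job. The following is a minimal sketch, not an official tool: the function names gdr_module_loaded and nccl_version_ok are hypothetical, the module check reads /proc/modules (the same data lsmod prints), and the version comparison relies on sort -V from GNU coreutils being available.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers for the requirements listed above.

# Return 0 if the nv_peer_mem module appears in a module list.
# The list is read from the file given as $1 (default: /proc/modules),
# where each line starts with the module name followed by a space.
gdr_module_loaded() {
    local modules_file="${1:-/proc/modules}"
    grep -q '^nv_peer_mem ' "$modules_file"
}

# Return 0 if NCCL version $1 is at least $2 (default minimum: 2.7.3).
nccl_version_ok() {
    local have="$1" min="${2:-2.7.3}"
    # sort -V orders version strings; the minimum must sort first (or be equal).
    [ "$(printf '%s\n%s\n' "$min" "$have" | sort -V | head -n 1)" = "$min" ]
}
```

A node could then be screened with something like gdr_module_loaded && nccl_version_ok 2.7.8, where the NCCL version string is supplied by the caller.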

Control Flags

The following environment variables enable the SHARP aggregation with NCCL when using the NCCL-SHARP plugin.

  • NCCL variables:
    • NCCL_COLLNET_ENABLE=1
    • NCCL_ALGO=CollNet (required to work around a bug in NCCL 2.7.8 and earlier)
  • SHARP variables (to guarantee SAT resources on initialization):
    • SHARP_COLL_LOCK_ON_COMM_INIT=1
    • SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0
    • [Optional] SHARP_COLL_LOG_LEVEL=3
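
One way to apply the flags listed above is to export them in the launching shell. The sketch below assumes a bash-compatible shell; the function name enable_nccl_sharp and its --verbose switch are hypothetical conveniences, while the variable names and values are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: export the control flags listed above in the current shell.
# The function name enable_nccl_sharp is hypothetical.
enable_nccl_sharp() {
    # NCCL side: enable CollNet and force the CollNet algorithm
    # (the latter is the workaround for NCCL <= 2.7.8).
    export NCCL_COLLNET_ENABLE=1
    export NCCL_ALGO=CollNet

    # SHARP side: guarantee SAT resources on initialization.
    export SHARP_COLL_LOCK_ON_COMM_INIT=1
    export SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0

    # Optional: set the SHARP log level only if the caller asks for it.
    if [ "${1:-}" = "--verbose" ]; then
        export SHARP_COLL_LOG_LEVEL=3
    fi
}
```

Exporting the variables in the shell is not always enough for remote ranks. With an Open MPI style mpirun, each variable typically also has to be forwarded with -x NAME, as the benchmark example on this page does for NCCL_COLLNET_ENABLE.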

Cluster Topology for Using NVIDIA SHARP SAT with NCCL

NVIDIA switches allow only a limited number of streaming aggregation flows (maximum: 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one streaming aggregation flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch is connected to the same HCA rail on every server.

NCCL Benchmark Example

A basic performance sanity check of the setup can be run with the NCCL tests, available here: https://github.com/NVIDIA/nccl-tests

Example

$ mpirun -np 1024 -map-by ppr:8:node -x UCX_TLS=dc,shm,self -x LD_LIBRARY_PATH=/sw/nccl/build/lib:/sw/nccl-rdma-sharp-plugins/install/lib:$LD_LIBRARY_PATH -x NCCL_COLLNET_ENABLE=1 all_reduce_perf -b 4 -e 2G -f 2 -g 1 -w 50 -n 50

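#                                                 out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           4             1   float     sum    44.53    0.00    0.00  3e-05    44.21    0.00    0.00  3e-05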
           8             2   float     sum    45.42    0.00    0.00  3e-05    45.85    0.00    0.00  3e-05
          16             4   float     sum    46.34    0.00    0.00  3e-05    45.84    0.00    0.00  2e-05
          32             8   float     sum    46.20    0.00    0.00  2e-05    46.56    0.00    0.00  2e-05
          64            16   float     sum    46.00    0.00    0.00  2e-05    48.33    0.00    0.00  2e-05
         128            32   float     sum    48.77    0.00    0.01  2e-05    47.23    0.00    0.01  2e-05
         256            64   float     sum    47.88    0.01    0.01  2e-05    47.85    0.01    0.01  2e-05
         512           128   float     sum    51.44    0.01    0.02  3e-05    48.66    0.01    0.02  3e-05
        1024           256   float     sum    51.27    0.02    0.04  4e-05    51.78    0.02    0.04  4e-05
        2048           512   float     sum    57.93    0.04    0.07  4e-05    56.45    0.04    0.07  4e-05
        4096          1024   float     sum    57.32    0.07    0.14  4e-05    93.51    0.04    0.09  4e-05
        8192          2048   float     sum    106.4    0.08    0.15  4e-05    59.70    0.14    0.27  4e-05
       16384          4096   float     sum    103.0    0.16    0.32  4e-05    58.23    0.28    0.56  4e-05
       32768          8192   float     sum    74.85    0.44    0.87  4e-05    137.8    0.24    0.48  4e-05
       65536         16384   float     sum    96.71    0.68    1.35  4e-05    92.89    0.71    1.41  4e-05
      131072         32768   float     sum    115.6    1.13    2.27  4e-05    120.7    1.09    2.17  4e-05
      262144         65536   float     sum    197.7    1.33    2.65  4e-05    167.6    1.56    3.13  4e-05
      524288        131072   float     sum    222.7    2.35    4.70  4e-05    239.2    2.19    4.38  4e-05
     1048576        262144   float     sum    280.9    3.73    7.46  4e-05    197.7    5.30   10.60  4e-05
     2097152        524288   float     sum    218.0    9.62   19.22  4e-05    213.9    9.81   19.59  4e-05
     4194304       1048576   float     sum    257.6   16.28   32.53  4e-05    254.7   16.47   32.90  4e-05
     8388608       2097152   float     sum    354.3   23.68   47.31  4e-05    523.5   16.02   32.02  4e-05
    16777216       4194304   float     sum    505.9   33.16   66.26  4e-05    484.1   34.66   69.24  4e-05
    33554432       8388608   float     sum    639.2   52.50  104.89  4e-05    678.6   49.45   98.80  4e-05
    67108864      16777216   float     sum   1358.2   49.41   98.72  4e-05   1048.6   64.00  127.87  4e-05
   134217728      33554432   float     sum   1737.2   77.26  154.37  4e-05   1777.6   75.51  150.86  4e-05
   268435456      67108864   float     sum   4359.5   61.58  123.03  4e-05   4262.3   62.98  125.83  4e-05
   536870912     134217728   float     sum   5619.7   95.53  190.88  4e-05   5699.0   94.20  188.22  4e-05
  1073741824     268435456   float     sum    12169   88.23  176.30  4e-05    11508   93.30  186.42  4e-05
  2147483648     536870912   float     sum    22618   94.94  189.70  4e-05    21814   98.44  196.70  4e-05
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 41.2497
#
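
To compare runs, the bus bandwidth columns of a saved log can be summarized with a short script. The sketch below assumes the nccl-tests row layout of 12 whitespace-separated fields (size, count, type, redop, then time, algbw, busbw and error for the out-of-place run and again for the in-place run), so busbw sits in fields 7 and 11. The function name avg_busbw and the log file argument are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average the bus bandwidth columns of an nccl-tests log.
# Assumes 12 whitespace-separated fields per data row, with out-of-place busbw
# in field 7 and in-place busbw in field 11.
avg_busbw() {
    awk '
        /^#/ { next }                  # skip comment and summary lines
        NF == 12 && $1 ~ /^[0-9]+$/ {  # keep only numeric data rows
            sum += $7 + $11
            n   += 2
        }
        END {
            if (n == 0) exit 1         # no data rows: report failure
            printf "%.4f\n", sum / n
        }
    ' "$1"
}
```

Applied to the rows above, the result should come out close to the reported "Avg bus bandwidth" value; small differences are expected because the printed per-row values are rounded to two decimals.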