NCCL-RDMA-SHARP Plugins
NCCL-RDMA-SHARP plugins enable RDMA and switch-based collectives (SHARP) with NVIDIA's NCCL library.
The plugin replaces NCCL's default internal inter-node communication with RDMA-based transports. It implements both the point-to-point transport (Net), over IB verbs (the default) or UCX, and the collective transport (CollNet), which includes the SHARP collective transport.
The NCCL UCX plugin, if enabled, replaces the default NCCL verbs-based inter-node communication routines with UCX-based communication routines.
Running NCCL UCX Plugin
To use the NCCL UCX plugin:
- For NCCL to detect the network plugin, add <plugin_install_dir>/lib to the library search path environment variable, as shown below.

            # libnccl_net.so is in <plugin_install_dir>/lib
            $ export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
            $ <run command>

- Enable the UCX plugin by defining the NCCL_PLUGIN_P2P=ucx environment variable.

            $ export NCCL_PLUGIN_P2P=ucx
            $ <run command>
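The two steps above can be combined into a small shell helper. The following is a minimal sketch and not part of the plugin: the function name enable_nccl_ucx_plugin is hypothetical, and it assumes the plugin library is installed as <plugin_install_dir>/lib/libnccl_net.so, as described above.

```shell
# Hypothetical helper (not shipped with NCCL or the plugin).
# Usage: enable_nccl_ucx_plugin <plugin_install_dir>
enable_nccl_ucx_plugin() {
    local plugin_dir="$1"
    # Refuse to change the environment if the plugin library is missing.
    if [ ! -f "${plugin_dir}/lib/libnccl_net.so" ]; then
        echo "libnccl_net.so not found in ${plugin_dir}/lib" >&2
        return 1
    fi
    # Prepend the plugin directory; keep any existing search path.
    export LD_LIBRARY_PATH="${plugin_dir}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    # Select the UCX point-to-point backend.
    export NCCL_PLUGIN_P2P=ucx
}
```

Calling the function in the shell that launches the job, before <run command>, has the same effect as the two export lines, with the added check that the library is actually present.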
Performance Tuning
To achieve the best performance, tune the UCX parameters to match the server's hardware configuration.
Example
The following is an example of a hardware configuration in which the GPU and the NIC share the same PCIe switch. In this scenario, GPU Direct RDMA gives the best performance.
To use GPU Direct RDMA for all message sizes in UCX:
Define the following environment variables as shown.
            
            $ export NCCL_UCX_RNDV_THRESH=0
            $ export NCCL_UCX_RNDV_SCHEME=get_zcopy
            $ <run command>
Note that for servers with multiple NICs available, you need to define the following additional variable.

            $ export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
            $ <run command>
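The tuning settings above can be applied conditionally. The sketch below is illustrative only: set_gdr_tuning is a hypothetical name, and counting the entries under /sys/class/infiniband (the Linux sysfs directory that lists RDMA devices) is an assumed stand-in for "multiple NICs available". Adjust the detection to match your system.

```shell
# Hypothetical helper (not part of NCCL or UCX).
# Usage: set_gdr_tuning [rdma_sysfs_dir]
set_gdr_tuning() {
    # Assumption: one directory entry per RDMA device.
    local ib_dir="${1:-/sys/class/infiniband}"
    local nic_count=0
    if [ -d "$ib_dir" ]; then
        nic_count=$(( $(ls -1 "$ib_dir" | wc -l) ))
    fi
    # Use GPU Direct RDMA for all message sizes.
    export NCCL_UCX_RNDV_THRESH=0
    export NCCL_UCX_RNDV_SCHEME=get_zcopy
    # Additional variable for servers with multiple NICs.
    if [ "$nic_count" -gt 1 ]; then
        export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
    fi
}
```

The optional argument exists only so the device count can be pointed at a different directory, for example when trying the function on a machine without RDMA hardware.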
By default, NCCL is built as a static library for portability. In that case, the plugin may detect memory types incorrectly, which can cause the program to fail. To avoid this, explicitly disable the memory type cache feature in UCX by setting the UCX_MEMTYPE_CACHE environment variable as follows.

            $ export UCX_MEMTYPE_CACHE=n
            $ <run command>
NCCL Tests Benchmark Example
NCCL tests can be used to benchmark NCCL-UCX performance (see https://github.com/nvidia/nccl-tests for instructions on running the benchmark).
Example:
            
            mpirun \
    -np 2 \
    --bind-to socket \
    -x LD_LIBRARY_PATH \
    -x NCCL_UCX_TLS=rc_x,cuda_copy \
    -x NCCL_UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n \
    -x NCCL_COLLNET_ENABLE=0 \
    -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_DEBUG=info \
    -x NCCL_DEBUG_SUBSYS=NET \
    -x NCCL_IB_HCA=mlx5_0:1 \
    $NCCL_TEST_HOME/build/all_reduce_perf -b 128 -e 128M -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1
 
 
# nThread 1 nGpus 1 minBytes 128 maxBytes 134217728 step: 2(factor) warmup iters: 100 iters: 50 validation: 1
#
# Using devices
#   Rank  0 Pid   7198 on  host1 device  0 [0x06] Tesla V100-SXM2-32GB
#   Rank  1 Pid   4890 on  host2 device  0 [0x06] Tesla V100-SXM2-32GB
host1:7198:7198 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.3<0>
NCCL version 2.6.0a0+cuda10.1
host2:4890:4890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.4<0>
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Worker address length: 55
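In the output above, each "Ring" line ending in "via NET/UCX/0/GDRDMA" shows that the connection uses the UCX transport with GPU Direct RDMA. A check along these lines can be scripted. The sketch below assumes the log format shown above, and check_ucx_gdrdma is a hypothetical name.

```shell
# Hypothetical helper: reads NCCL_DEBUG=info output on stdin and succeeds only
# if at least one ring connection is reported and every ring connection goes
# through NET/UCX with GPU Direct RDMA (GDRDMA).
check_ucx_gdrdma() {
    awk '
        / Ring [0-9]+ : / {
            rings++
            if ($0 !~ /via NET\/UCX\/[0-9]+\/GDRDMA/) bad++
        }
        END { exit (rings > 0 && bad == 0) ? 0 : 1 }
    '
}
```

For example, save the benchmark output to a file and run check_ucx_gdrdma < run.log. A non-zero exit status suggests that NCCL did not select UCX or that GPU Direct RDMA is not in use on some connection.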
The following environment variables enable SHARP aggregation with NCCL when using the plugin.

            NCCL_COLLNET_ENABLE=1
            NCCL_ALGO=CollNet
Mellanox switches allow a limited number of streaming aggregation flows (maximum: 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one aggregation streaming flow (NCCL Ring/Channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch is connected to the same HCA rail on every server.
NCCL Tests Benchmark Example
The setup's baseline performance can be verified with NCCL tests, available at https://github.com/NVIDIA/nccl-tests.
            
            mpirun -np 1024 -map-by ppr:8:node -x NCCL_COLLNET_ENABLE=1 -x NCCL_ALGO=CollNet ./nccl-tests/build/all_reduce_perf -b 4 -e 2G -f 2 -g 1 -w 50 -n 50
 
           4             1   float     sum    44.53    0.00    0.00  3e-05    44.21    0.00    0.00  3e-05
           8             2   float     sum    45.42    0.00    0.00  3e-05    45.85    0.00    0.00  3e-05
          16             4   float     sum    46.34    0.00    0.00  3e-05    45.84    0.00    0.00  2e-05
          32             8   float     sum    46.20    0.00    0.00  2e-05    46.56    0.00    0.00  2e-05
          64            16   float     sum    46.00    0.00    0.00  2e-05    48.33    0.00    0.00  2e-05
         128            32   float     sum    48.77    0.00    0.01  2e-05    47.23    0.00    0.01  2e-05
         256            64   float     sum    47.88    0.01    0.01  2e-05    47.85    0.01    0.01  2e-05
         512           128   float     sum    51.44    0.01    0.02  3e-05    48.66    0.01    0.02  3e-05
        1024           256   float     sum    51.27    0.02    0.04  4e-05    51.78    0.02    0.04  4e-05
        2048           512   float     sum    57.93    0.04    0.07  4e-05    56.45    0.04    0.07  4e-05
        4096          1024   float     sum    57.32    0.07    0.14  4e-05    93.51    0.04    0.09  4e-05
        8192          2048   float     sum    106.4    0.08    0.15  4e-05    59.70    0.14    0.27  4e-05
       16384          4096   float     sum    103.0    0.16    0.32  4e-05    58.23    0.28    0.56  4e-05
       32768          8192   float     sum    74.85    0.44    0.87  4e-05    137.8    0.24    0.48  4e-05
       65536         16384   float     sum    96.71    0.68    1.35  4e-05    92.89    0.71    1.41  4e-05
      131072         32768   float     sum    115.6    1.13    2.27  4e-05    120.7    1.09    2.17  4e-05
      262144         65536   float     sum    197.7    1.33    2.65  4e-05    167.6    1.56    3.13  4e-05
      524288        131072   float     sum    222.7    2.35    4.70  4e-05    239.2    2.19    4.38  4e-05
     1048576        262144   float     sum    280.9    3.73    7.46  4e-05    197.7    5.30   10.60  4e-05
     2097152        524288   float     sum    218.0    9.62   19.22  4e-05    213.9    9.81   19.59  4e-05
     4194304       1048576   float     sum    257.6   16.28   32.53  4e-05    254.7   16.47   32.90  4e-05
     8388608       2097152   float     sum    354.3   23.68   47.31  4e-05    523.5   16.02   32.02  4e-05
    16777216       4194304   float     sum    505.9   33.16   66.26  4e-05    484.1   34.66   69.24  4e-05
    33554432       8388608   float     sum    639.2   52.50  104.89  4e-05    678.6   49.45   98.80  4e-05
    67108864      16777216   float     sum   1358.2   49.41   98.72  4e-05   1048.6   64.00  127.87  4e-05
   134217728      33554432   float     sum   1737.2   77.26  154.37  4e-05   1777.6   75.51  150.86  4e-05
   268435456      67108864   float     sum   4359.5   61.58  123.03  4e-05   4262.3   62.98  125.83  4e-05
   536870912     134217728   float     sum   5619.7   95.53  190.88  4e-05   5699.0   94.20  188.22  4e-05
  1073741824     268435456   float     sum    12169   88.23  176.30  4e-05    11508   93.30  186.42  4e-05
  2147483648     536870912   float     sum    22618   94.94  189.70  4e-05    21814   98.44  196.70  4e-05
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 41.2497
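The header rows are missing from the output above. In all_reduce_perf output of this form, the columns are size (bytes), count (elements), type, reduction operation, and then time (us), algorithm bandwidth (GB/s), bus bandwidth (GB/s) and error, first for the out-of-place run and then for the in-place run. As described in the nccl-tests performance notes, the all-reduce bus bandwidth is the algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks. The sketch below reproduces that arithmetic. allreduce_bw is a hypothetical name, not part of nccl-tests.

```shell
# Hypothetical helper: prints "<algbw> <busbw>" in GB/s for an all-reduce.
# Usage: allreduce_bw <bytes> <time_us> <ranks>
allreduce_bw() {
    awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
        # bytes per microsecond equals MB/s; divide by 1000 to get GB/s
        algbw = bytes / us / 1000
        # all-reduce correction factor: 2(n-1)/n
        busbw = algbw * 2 * (n - 1) / n
        printf "%.2f %.2f\n", algbw, busbw
    }'
}
```

For the last row above, allreduce_bw 2147483648 22618 1024 gives about 94.9 and 189.7 GB/s. These agree with the reported out-of-place values to within rounding, because the time column is itself rounded.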