NCCL-RDMA-SHARP Plugins

NCCL-RDMA-SHARP plugins enable RDMA and switch-based collectives (SHARP) with NVIDIA's NCCL library.

This plugin replaces the default NCCL internal inter-node communication with RDMA-based transports. It implements both the point-to-point transport (Net), with IB verbs (default) and UCX back ends, and the collective transport (CollNet), including the SHARP collective transport.
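
For example, both transports can be selected at run time through NCCL environment variables. The following is a minimal sketch; the UCX and SHARP settings are described in detail in the sections below.

$ export NCCL_PLUGIN_P2P=ucx       # point-to-point (Net) transport: IB verbs by default, or UCX
$ export NCCL_COLLNET_ENABLE=1     # enable the collective (CollNet) transport
$ <run command>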

NCCL UCX plugin (if enabled) replaces the default NCCL verbs-based inter-node communication routines with UCX-based communication routines.

Running NCCL UCX Plugin

To use the NCCL UCX plugin:

  1. For NCCL to detect the network plugin, make sure that <plugin_install_dir>/lib is added to the library search path environment variable, as shown below.

    # libnccl_net.so is in <plugin_install_dir>/lib
    $ export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
    $ <run command>

  2. Enable the UCX plugin by setting the NCCL_PLUGIN_P2P=ucx environment variable.

    $ export NCCL_PLUGIN_P2P=ucx
    $ <run command>
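
To verify that NCCL has picked up the plugin, you can additionally enable NCCL's network debug output; the startup log should then show the UCX transport being used, as in the benchmark output later in this document.

$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=NET
$ <run command>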

Performance Tuning

To achieve the best possible performance, various UCX parameters can be tuned to match the server's hardware configuration.
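
For instance, one way to check whether a GPU and a NIC share the same PCIe switch (the configuration assumed in the example below) is the topology matrix printed by nvidia-smi, assuming the NVIDIA driver utilities are installed:

$ nvidia-smi topo -m

In this matrix, a PIX entry between a GPU and a NIC means the two devices are connected through at most a single PCIe bridge.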

Example

Below is an example for a hardware configuration in which the GPU and the NIC share the same PCIe switch. In such a scenario, GPU Direct RDMA gives the best possible performance.

To use GPU Direct RDMA for all message sizes in UCX, define the following environment variables as shown.

$ export NCCL_UCX_RNDV_THRESH=0
$ export NCCL_UCX_RNDV_SCHEME=get_zcopy
$ <run command>

Note that on servers with multiple NICs available, you must also define the following variable.

$ export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
$ <run command>
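
If you also need to limit which HCAs NCCL uses on such servers, the NCCL_IB_HCA variable can be set in addition, as in the benchmark example below; the device names here are only illustrative:

$ export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1
$ <run command>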

Warning

By default, NCCL is built as a static library to improve portability. In this case, the plugin may detect memory types incorrectly and fail. To avoid this, explicitly disable the UCX memory type cache by setting the UCX_MEMTYPE_CACHE environment variable as follows.

$ export UCX_MEMTYPE_CACHE=n
$ <run command>

NCCL Tests Benchmark Example

NCCL tests can be used for NCCL-UCX performance benchmarking (visit https://github.com/nvidia/nccl-tests to run the benchmark).

Example:

mpirun \
    -np 2 \
    --bind-to socket \
    -x LD_LIBRARY_PATH \
    -x NCCL_UCX_TLS=rc_x,cuda_copy \
    -x NCCL_UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n \
    -x NCCL_COLLNET_ENABLE=0 \
    -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_DEBUG=info \
    -x NCCL_DEBUG_SUBSYS=NET \
    -x NCCL_IB_HCA=mlx5_0:1 \
    $NCCL_TEST_HOME/build/all_reduce_perf -b 128 -e 128M -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1

# nThread 1 nGpus 1 minBytes 128 maxBytes 134217728 step: 2(factor) warmup iters: 100 iters: 50 validation: 1
#
# Using devices
#   Rank  0 Pid  7198 on host1 device  0 [0x06] Tesla V100-SXM2-32GB
#   Rank  1 Pid  4890 on host2 device  0 [0x06] Tesla V100-SXM2-32GB
host1:7198:7198 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.3<0>
NCCL version 2.6.0a0+cuda10.1
host2:4890:4890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.4<0>
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Worker address length: 55

The following environment variables enable the SHARP aggregation with NCCL when using the plugin.

NCCL_COLLNET_ENABLE=1
NCCL_ALGO=CollNet

Warning

Mellanox switches allow a limited number of streaming aggregation flows (maximum: 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one aggregation streaming flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch is connected to the same HCA rail from every server.

NCCL Test Benchmark Example

The sanity and performance of the setup can be verified with NCCL tests; see https://github.com/NVIDIA/nccl-tests.

mpirun -np 1024 -map-by ppr:8:node -x NCCL_COLLNET_ENABLE=1 -x NCCL_ALGO=CollNet ./nccl-tests/build/all_reduce_perf -b 4 -e 2G -f 2 -g 1 -w 50 -n 50

#                                                        out-of-place                        in-place
#        size         count      type    redop     time    algbw    busbw   error     time    algbw    busbw   error
#         (B)    (elements)                         (us)   (GB/s)   (GB/s)            (us)   (GB/s)   (GB/s)
            4             1     float      sum    44.53     0.00     0.00   3e-05    44.21     0.00     0.00   3e-05
            8             2     float      sum    45.42     0.00     0.00   3e-05    45.85     0.00     0.00   3e-05
           16             4     float      sum    46.34     0.00     0.00   3e-05    45.84     0.00     0.00   2e-05
           32             8     float      sum    46.20     0.00     0.00   2e-05    46.56     0.00     0.00   2e-05
           64            16     float      sum    46.00     0.00     0.00   2e-05    48.33     0.00     0.00   2e-05
          128            32     float      sum    48.77     0.00     0.01   2e-05    47.23     0.00     0.01   2e-05
          256            64     float      sum    47.88     0.01     0.01   2e-05    47.85     0.01     0.01   2e-05
          512           128     float      sum    51.44     0.01     0.02   3e-05    48.66     0.01     0.02   3e-05
         1024           256     float      sum    51.27     0.02     0.04   4e-05    51.78     0.02     0.04   4e-05
         2048           512     float      sum    57.93     0.04     0.07   4e-05    56.45     0.04     0.07   4e-05
         4096          1024     float      sum    57.32     0.07     0.14   4e-05    93.51     0.04     0.09   4e-05
         8192          2048     float      sum    106.4     0.08     0.15   4e-05    59.70     0.14     0.27   4e-05
        16384          4096     float      sum    103.0     0.16     0.32   4e-05    58.23     0.28     0.56   4e-05
        32768          8192     float      sum    74.85     0.44     0.87   4e-05    137.8     0.24     0.48   4e-05
        65536         16384     float      sum    96.71     0.68     1.35   4e-05    92.89     0.71     1.41   4e-05
       131072         32768     float      sum    115.6     1.13     2.27   4e-05    120.7     1.09     2.17   4e-05
       262144         65536     float      sum    197.7     1.33     2.65   4e-05    167.6     1.56     3.13   4e-05
       524288        131072     float      sum    222.7     2.35     4.70   4e-05    239.2     2.19     4.38   4e-05
      1048576        262144     float      sum    280.9     3.73     7.46   4e-05    197.7     5.30    10.60   4e-05
      2097152        524288     float      sum    218.0     9.62    19.22   4e-05    213.9     9.81    19.59   4e-05
      4194304       1048576     float      sum    257.6    16.28    32.53   4e-05    254.7    16.47    32.90   4e-05
      8388608       2097152     float      sum    354.3    23.68    47.31   4e-05    523.5    16.02    32.02   4e-05
     16777216       4194304     float      sum    505.9    33.16    66.26   4e-05    484.1    34.66    69.24   4e-05
     33554432       8388608     float      sum    639.2    52.50   104.89   4e-05    678.6    49.45    98.80   4e-05
     67108864      16777216     float      sum   1358.2    49.41    98.72   4e-05   1048.6    64.00   127.87   4e-05
    134217728      33554432     float      sum   1737.2    77.26   154.37   4e-05   1777.6    75.51   150.86   4e-05
    268435456      67108864     float      sum   4359.5    61.58   123.03   4e-05   4262.3    62.98   125.83   4e-05
    536870912     134217728     float      sum   5619.7    95.53   190.88   4e-05   5699.0    94.20   188.22   4e-05
   1073741824     268435456     float      sum    12169    88.23   176.30   4e-05    11508    93.30   186.42   4e-05
   2147483648     536870912     float      sum    22618    94.94   189.70   4e-05    21814    98.44   196.70   4e-05
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 41.2497
