NCCL-RDMA-SHARP plugins enable RDMA and switch-based collectives (SHARP) with NVIDIA's NCCL library.

Overview

This plugin replaces the default NCCL internal inter-node communication with RDMA-based transports. It implements both the point-to-point transport (Net), with IB verbs (default) and UCX back ends, and the collective transport (CollNet), including the SHARP collective transport.
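
For orientation, each transport is selected at run time with environment variables that are covered in the sections below; a minimal sketch (all variables shown appear later on this page):

# Default: point-to-point over IB verbs (no extra variables required)

# Point-to-point over UCX (see "NCCL UCX Plugin"):
$ export NCCL_PLUGIN_P2P=ucx

# Collective offload via SHARP (see "NCCL SHARP Plugin"):
$ export NCCL_COLLNET_ENABLE=1
$ export NCCL_ALGO=CollNet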

NCCL UCX Plugin

The NCCL UCX plugin, if enabled, replaces the default verbs-based inter-node communication routines in NCCL with UCX-based communication routines.

Running NCCL UCX Plugin

To use the NCCL UCX plugin:

  1. For NCCL to detect the network plugin, make sure to add <plugin_install_dir>/lib to the library search path environment variable, as shown below. 

    # libnccl-net.so is in <plugin_install_dir>/lib
    $ export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
    $ <run command>
  2. Enable the UCX plugin by defining the NCCL_PLUGIN_P2P=ucx environment variable, then verify that the plugin is loaded as shown in the sketch after this list. 

    $ export NCCL_PLUGIN_P2P=ucx
    $ <run command>
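
To verify that the plugin is actually being used, enable NCCL's network debug output; the exact log wording varies across NCCL versions, but the ring setup lines should mention NET/UCX (see the benchmark output below for an example):

$ export NCCL_DEBUG=info
$ export NCCL_DEBUG_SUBSYS=NET
$ <run command>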

Performance Tuning

Depending on the server's hardware configuration, various UCX parameters can be tuned to achieve the best performance.

Example

Below is an example for a hardware configuration in which the GPU and the NIC share the same PCIe switch. In such a scenario, GPUDirect RDMA gives the best possible performance.
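
To check whether a GPU and a NIC actually share a PCIe switch, one option (not part of this plugin) is the nvidia-smi topology matrix, where a PIX entry between a GPU and an mlx5 device indicates a path traversing at most a single PCIe bridge:

$ nvidia-smi topo -m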

To use GPUDirect RDMA for all message sizes in UCX, define the following environment variables:

$ export NCCL_UCX_RNDV_THRESH=0 
$ export NCCL_UCX_RNDV_SCHEME=get_zcopy
$ <run command>

Note that on servers with multiple NICs available, you must also define the following variable. 

$ export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
$ <run command>


By default, NCCL is built as a static library to enable portability. In this case, the plugin may detect memory types incorrectly, causing program failures. To avoid this, explicitly disable the memory type cache feature in UCX by setting the UCX_MEMTYPE_CACHE environment variable as follows. 

$ export UCX_MEMTYPE_CACHE=n
$ <run command>


NCCL Tests Benchmark Example

NCCL tests can be used for NCCL-UCX performance benchmarking (see https://github.com/nvidia/nccl-tests).
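
If the benchmark binaries are not built yet, a typical build with MPI support looks as follows; the installation paths are placeholders for your environment:

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=<mpi_install_dir> CUDA_HOME=<cuda_install_dir> NCCL_HOME=<nccl_install_dir>
# binaries such as all_reduce_perf end up in ./build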

Example:

mpirun \
    -np 2 \
    --bind-to socket \
    -x LD_LIBRARY_PATH \
    -x NCCL_UCX_TLS=rc_x,cuda_copy \
    -x NCCL_UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n \
    -x NCCL_COLLNET_ENABLE=0 \
    -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_DEBUG=info \
    -x NCCL_DEBUG_SUBSYS=NET \
    -x NCCL_IB_HCA=mlx5_0:1 \
    $NCCL_TEST_HOME/build/all_reduce_perf -b 128 -e 128M -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1


# nThread 1 nGpus 1 minBytes 128 maxBytes 134217728 step: 2(factor) warmup iters: 100 iters: 50 validation: 1
#
# Using devices
#   Rank  0 Pid   7198 on  host1 device  0 [0x06] Tesla V100-SXM2-32GB
#   Rank  1 Pid   4890 on  host2 device  0 [0x06] Tesla V100-SXM2-32GB
host1:7198:7198 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.3<0>
NCCL version 2.6.0a0+cuda10.1
host2:4890:4890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.4<0>
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Worker address length: 55


NCCL SHARP Plugin

The following environment variables enable SHARP aggregation with NCCL when using the plugin.

NCCL_COLLNET_ENABLE=1
NCCL_ALGO=CollNet

Mellanox switches support a limited number of streaming aggregation flows (two at most). On systems with multiple GPUs and multiple HCAs, NCCL creates one aggregation streaming flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch connects to the same HCA rail on every server.
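
As an illustration, the HCAs that NCCL uses can be restricted per job with NCCL_IB_HCA so that the channels map onto the intended rail; the device names below are placeholders for a hypothetical two-rail server:

$ export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1
$ <run command>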

NCCL Test Benchmark Example

The sanity performance of the setup can be verified with the NCCL tests (see https://github.com/NVIDIA/nccl-tests).

mpirun -np 1024 -map-by ppr:8:node -x NCCL_COLLNET_ENABLE=1 -x NCCL_ALGO=CollNet ./nccl-tests/build/all_reduce_perf -b 4 -e 2G -f 2 -g 1 -w 50 -n 50
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           4             1   float     sum    44.53    0.00    0.00  3e-05    44.21    0.00    0.00  3e-05
           8             2   float     sum    45.42    0.00    0.00  3e-05    45.85    0.00    0.00  3e-05
          16             4   float     sum    46.34    0.00    0.00  3e-05    45.84    0.00    0.00  2e-05
          32             8   float     sum    46.20    0.00    0.00  2e-05    46.56    0.00    0.00  2e-05
          64            16   float     sum    46.00    0.00    0.00  2e-05    48.33    0.00    0.00  2e-05
         128            32   float     sum    48.77    0.00    0.01  2e-05    47.23    0.00    0.01  2e-05
         256            64   float     sum    47.88    0.01    0.01  2e-05    47.85    0.01    0.01  2e-05
         512           128   float     sum    51.44    0.01    0.02  3e-05    48.66    0.01    0.02  3e-05
        1024           256   float     sum    51.27    0.02    0.04  4e-05    51.78    0.02    0.04  4e-05
        2048           512   float     sum    57.93    0.04    0.07  4e-05    56.45    0.04    0.07  4e-05
        4096          1024   float     sum    57.32    0.07    0.14  4e-05    93.51    0.04    0.09  4e-05
        8192          2048   float     sum    106.4    0.08    0.15  4e-05    59.70    0.14    0.27  4e-05
       16384          4096   float     sum    103.0    0.16    0.32  4e-05    58.23    0.28    0.56  4e-05
       32768          8192   float     sum    74.85    0.44    0.87  4e-05    137.8    0.24    0.48  4e-05
       65536         16384   float     sum    96.71    0.68    1.35  4e-05    92.89    0.71    1.41  4e-05
      131072         32768   float     sum    115.6    1.13    2.27  4e-05    120.7    1.09    2.17  4e-05
      262144         65536   float     sum    197.7    1.33    2.65  4e-05    167.6    1.56    3.13  4e-05
      524288        131072   float     sum    222.7    2.35    4.70  4e-05    239.2    2.19    4.38  4e-05
     1048576        262144   float     sum    280.9    3.73    7.46  4e-05    197.7    5.30   10.60  4e-05
     2097152        524288   float     sum    218.0    9.62   19.22  4e-05    213.9    9.81   19.59  4e-05
     4194304       1048576   float     sum    257.6   16.28   32.53  4e-05    254.7   16.47   32.90  4e-05
     8388608       2097152   float     sum    354.3   23.68   47.31  4e-05    523.5   16.02   32.02  4e-05
    16777216       4194304   float     sum    505.9   33.16   66.26  4e-05    484.1   34.66   69.24  4e-05
    33554432       8388608   float     sum    639.2   52.50  104.89  4e-05    678.6   49.45   98.80  4e-05
    67108864      16777216   float     sum   1358.2   49.41   98.72  4e-05   1048.6   64.00  127.87  4e-05
   134217728      33554432   float     sum   1737.2   77.26  154.37  4e-05   1777.6   75.51  150.86  4e-05
   268435456      67108864   float     sum   4359.5   61.58  123.03  4e-05   4262.3   62.98  125.83  4e-05
   536870912     134217728   float     sum   5619.7   95.53  190.88  4e-05   5699.0   94.20  188.22  4e-05
  1073741824     268435456   float     sum    12169   88.23  176.30  4e-05    11508   93.30  186.42  4e-05
  2147483648     536870912   float     sum    22618   94.94  189.70  4e-05    21814   98.44  196.70  4e-05
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 41.2497