NCCL-RDMA-SHARP Plugins
NCCL-RDMA-SHARP plugins enable RDMA and switch-based collectives (SHARP) with NVIDIA's NCCL library.
This plugin replaces the default NCCL internal inter-node communication with RDMA-based transports. It implements both the point-to-point transport (Net), using IB Verbs (default) or UCX, and the collective transport (CollNet), including the SHARP collective transport.
The NCCL UCX plugin (if enabled) replaces the default NCCL Verbs-based inter-node communication routines with UCX-based communication routines.
Running NCCL UCX Plugin
To use the NCCL UCX plugin:
For NCCL to detect the network plugin, add <plugin_install_dir>/lib to the library search path environment variable, as shown below.
# libnccl_net.so is in <plugin_install_dir>/lib
$ export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
$ <run command>
Enable the UCX plugin by setting the NCCL_PLUGIN_P2P environment variable to ucx.
$ export NCCL_PLUGIN_P2P=ucx
$ <run command>
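To confirm that the UCX transport was actually selected, you can enable NCCL's debug logging and inspect the network-related lines in the output (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables, also used in the benchmark example below; the grep pattern is only a convenience):
$ export NCCL_DEBUG=info
$ export NCCL_DEBUG_SUBSYS=NET
$ <run command> 2>&1 | grep 'NET/'
# Rings set up through the plugin show up as "via NET/UCX/..." lines,
# as in the log excerpt further below.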
Performance Tuning
To achieve the best possible performance, various UCX parameters can be tuned, depending on the server's hardware configuration.
Example
Below is an example for a hardware configuration in which the GPU and the NIC share the same PCIe switch. In such a topology, GPU Direct RDMA gives the best possible performance.
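To check whether a GPU and a NIC share a PCIe switch, the PCIe topology can be inspected with nvidia-smi (a standard NVIDIA tool; in its matrix, a PIX entry means the two devices traverse at most a single PCIe bridge, i.e. they sit behind the same switch):
$ nvidia-smi topo -m
# Look for "PIX" at the intersection of the GPU row and the NIC column.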
To use GPU Direct RDMA for all message sizes in UCX:
Define the following environment variables as shown.
$ export NCCL_UCX_RNDV_THRESH=0
$ export NCCL_UCX_RNDV_SCHEME=get_zcopy
$ <run command>
Note that on servers with multiple NICs available, you also need to define the following variable.
$ export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
$ <run command>
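If you are unsure which RDMA devices are present on the node, they can be listed first (ibv_devinfo ships with libibverbs; ibdev2netdev is an MLNX_OFED utility, so its availability depends on your installation):
$ ibv_devinfo | grep hca_id    # list HCAs, e.g. mlx5_0, mlx5_1, ...
$ ibdev2netdev                 # map each HCA port to its network interface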
By default, NCCL is built as a static library to enable portability. In this case, you may experience incorrect memory type detection in the plugin, leading to program failures. To avoid this, explicitly disable the memory type cache feature in UCX by setting the UCX_MEMTYPE_CACHE environment variable as follows.
$ export UCX_MEMTYPE_CACHE=n
$ <run command>
NCCL Tests Benchmark Example
NCCL tests can be used for NCCL-UCX performance benchmarking (visit https://github.com/nvidia/nccl-tests to run the benchmark).
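A minimal sketch of fetching and building the tests with MPI support (the MPI_HOME, CUDA_HOME, and NCCL_HOME values are placeholders for your local installation paths):
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=<path/to/mpi> CUDA_HOME=<path/to/cuda> NCCL_HOME=<path/to/nccl>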
Example:
mpirun \
    -np 2 \
    --bind-to socket \
    -x LD_LIBRARY_PATH \
    -x NCCL_UCX_TLS=rc_x,cuda_copy \
    -x NCCL_UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n \
    -x NCCL_COLLNET_ENABLE=0 \
    -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_DEBUG=info \
    -x NCCL_DEBUG_SUBSYS=NET \
    -x NCCL_IB_HCA=mlx5_0:1 \
    $NCCL_TEST_HOME/build/all_reduce_perf -b 128 -e 128M -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1
# nThread 1 nGpus 1 minBytes 128 maxBytes 134217728 step: 2(factor) warmup iters: 100 iters: 50 validation: 1
#
# Using devices
#   Rank  0 Pid   7198 on      host1 device  0 [0x06] Tesla V100-SXM2-32GB
#   Rank  1 Pid   4890 on      host2 device  0 [0x06] Tesla V100-SXM2-32GB
host1:7198:7198 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.3<0>
NCCL version 2.6.0a0+cuda10.1
host2:4890:4890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.4<0>
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Worker address length: 55
Running NCCL-SHARP Plugin
The following environment variables enable SHARP aggregation with NCCL when using the plugin:
NCCL_COLLNET_ENABLE=1
NCCL_ALGO=CollNet
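As with the UCX variables above, these can be exported before launching the application (the <run command> placeholder follows the same convention as the earlier examples):
$ export NCCL_COLLNET_ENABLE=1
$ export NCCL_ALGO=CollNet
$ <run command>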
Mellanox switches allow a limited number of streaming aggregation flows (maximum: 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one aggregation streaming flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built in such a way that leaf-level switches are connected to the same HCA rail from each server, as illustrated after this paragraph.
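As an illustration only, NCCL can be pinned to the HCAs forming a given rail with NCCL_IB_HCA (a standard NCCL variable, also used in the benchmark above); the device names below are hypothetical placeholders for the ports that are wired to the same leaf switches on every server:
# Hypothetical rail assignment: mlx5_0 port 1 and mlx5_2 port 1 are the rails
# whose leaf switches are shared across all servers.
$ export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1
$ <run command>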
NCCL Test Benchmark Example
A performance sanity check of the setup can be run with the NCCL tests. Please refer to the NCCL tests here: https://github.com/NVIDIA/nccl-tests.
mpirun -np 1024 \
    -map-by ppr:8:node \
    -x NCCL_COLLNET_ENABLE=1 \
    -x NCCL_ALGO=CollNet \
    ./nccl-tests/build/all_reduce_perf -b 4 -e 2G -f 2 -g 1 -w 50 -n 50
#                                                      out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           4             1   float     sum    44.53    0.00    0.00  3e-05    44.21    0.00    0.00  3e-05
           8             2   float     sum    45.42    0.00    0.00  3e-05    45.85    0.00    0.00  3e-05
          16             4   float     sum    46.34    0.00    0.00  3e-05    45.84    0.00    0.00  2e-05
          32             8   float     sum    46.20    0.00    0.00  2e-05    46.56    0.00    0.00  2e-05
          64            16   float     sum    46.00    0.00    0.00  2e-05    48.33    0.00    0.00  2e-05
         128            32   float     sum    48.77    0.00    0.01  2e-05    47.23    0.00    0.01  2e-05
         256            64   float     sum    47.88    0.01    0.01  2e-05    47.85    0.01    0.01  2e-05
         512           128   float     sum    51.44    0.01    0.02  3e-05    48.66    0.01    0.02  3e-05
        1024           256   float     sum    51.27    0.02    0.04  4e-05    51.78    0.02    0.04  4e-05
        2048           512   float     sum    57.93    0.04    0.07  4e-05    56.45    0.04    0.07  4e-05
        4096          1024   float     sum    57.32    0.07    0.14  4e-05    93.51    0.04    0.09  4e-05
        8192          2048   float     sum    106.4    0.08    0.15  4e-05    59.70    0.14    0.27  4e-05
       16384          4096   float     sum    103.0    0.16    0.32  4e-05    58.23    0.28    0.56  4e-05
       32768          8192   float     sum    74.85    0.44    0.87  4e-05    137.8    0.24    0.48  4e-05
       65536         16384   float     sum    96.71    0.68    1.35  4e-05    92.89    0.71    1.41  4e-05
      131072         32768   float     sum    115.6    1.13    2.27  4e-05    120.7    1.09    2.17  4e-05
      262144         65536   float     sum    197.7    1.33    2.65  4e-05    167.6    1.56    3.13  4e-05
      524288        131072   float     sum    222.7    2.35    4.70  4e-05    239.2    2.19    4.38  4e-05
     1048576        262144   float     sum    280.9    3.73    7.46  4e-05    197.7    5.30   10.60  4e-05
     2097152        524288   float     sum    218.0    9.62   19.22  4e-05    213.9    9.81   19.59  4e-05
     4194304       1048576   float     sum    257.6   16.28   32.53  4e-05    254.7   16.47   32.90  4e-05
     8388608       2097152   float     sum    354.3   23.68   47.31  4e-05    523.5   16.02   32.02  4e-05
    16777216       4194304   float     sum    505.9   33.16   66.26  4e-05    484.1   34.66   69.24  4e-05
    33554432       8388608   float     sum    639.2   52.50  104.89  4e-05    678.6   49.45   98.80  4e-05
    67108864      16777216   float     sum   1358.2   49.41   98.72  4e-05   1048.6   64.00  127.87  4e-05
   134217728      33554432   float     sum   1737.2   77.26  154.37  4e-05   1777.6   75.51  150.86  4e-05
   268435456      67108864   float     sum   4359.5   61.58  123.03  4e-05   4262.3   62.98  125.83  4e-05
   536870912     134217728   float     sum   5619.7   95.53  190.88  4e-05   5699.0   94.20  188.22  4e-05
  1073741824     268435456   float     sum    12169   88.23  176.30  4e-05    11508   93.30  186.42  4e-05
  2147483648     536870912   float     sum    22618   94.94  189.70  4e-05    21814   98.44  196.70  4e-05
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 41.2497