Using NVIDIA SHARP with NVIDIA NCCL
RDMA and SHARP collectives are enabled in the NVIDIA NCCL (‘nickel’) collective communication library through the NCCL-SHARP plugin.
The NCCL-SHARP plugin is distributed through the following channels:
Binary distribution with HPC-X. Loading the HPC-X modules adds the plugin to the environment, and NCCL loads it automatically. For other CUDA versions, the plugin can be built from source.
Source distribution: https://github.com/Mellanox/nccl-rdma-sharp-plugins
Users can build the plugin from source and add its library directory to LD_LIBRARY_PATH so that NCCL loads it.
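As a minimal sketch of the LD_LIBRARY_PATH step, the helper below prepends a plugin directory to the library search path only if the plugin library is actually there. The function name use_nccl_plugin is hypothetical, and the library file name libnccl-net.so is an assumption about what the build installs; adjust both to your installation.

```shell
#!/usr/bin/env bash
# use_nccl_plugin DIR: prepend DIR to LD_LIBRARY_PATH if it holds the plugin.
# Assumption: the plugin library is installed as libnccl-net.so in DIR.
use_nccl_plugin() {
    local plugin_dir="$1"
    if [ ! -e "$plugin_dir/libnccl-net.so" ]; then
        echo "libnccl-net.so not found in $plugin_dir" >&2
        return 1
    fi
    # Avoid a trailing ':' (which would add the current directory)
    # when LD_LIBRARY_PATH is empty or unset.
    export LD_LIBRARY_PATH="$plugin_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

For example, `use_nccl_plugin /sw/nccl-rdma-sharp-plugins/install/lib` would match the install path used in the mpirun example in this section.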
NVIDIA ConnectX-6 HDR and above
NVIDIA Quantum HDR switch and above
MLNX_OFED
GPUDirect RDMA
Verify that the GPUDirect RDMA kernel module is loaded on every compute node that will run a job requiring GPUDirect RDMA.
To check whether the GPUDirect RDMA module is loaded, run:
# service nv_peer_mem status
To run this verification on other Linux flavors:
# lsmod | grep nv_peer_mem
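The check above can be scripted so that it returns a proper exit status on each node. The sketch below reads /proc/modules (the table that lsmod prints) and matches the module name exactly rather than as a substring. The function name module_loaded and its optional file argument are hypothetical conveniences, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# module_loaded NAME [FILE]: succeed if NAME is listed as a loaded module.
# FILE defaults to /proc/modules; the first field of each line is the
# module name. The FILE argument exists only to make the check testable.
module_loaded() {
    local name="$1" modules_file="${2:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Example use on a node (prints a warning when the module is missing):
# module_loaded nv_peer_mem || echo "nv_peer_mem is not loaded on $(hostname)" >&2
```

Running the same check on every node of the job, with whatever remote launcher the cluster provides, covers the "each computing system" requirement.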
NCCL version 2.7.3 or higher
Please refer to NVIDIA’s Developer Guide for more details: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/index.html
The following environment variables enable SHARP aggregation with NCCL when using the NCCL-SHARP plugin.
NCCL variables:
NCCL_COLLNET_ENABLE=1
NCCL_ALGO=CollNet (required to work around a bug in NCCL <= 2.7.8)
SHARP variables:
To guarantee SAT (streaming aggregation tree) resources at initialization, set the following variables. They are enabled by default with NCCL SHARP plugin version 2.1.x and later, but can also be set explicitly:
SHARP_COLL_LOCK_ON_COMM_INIT=1
SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0
[Optional] SHARP_COLL_LOG_LEVEL=3
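The NCCL and SHARP variables listed above can be gathered into one helper. The sketch below prints the export lines and adds NCCL_ALGO=CollNet only for NCCL versions up to 2.7.8, as the note above describes. The function names version_le and nccl_sharp_env are hypothetical, the NCCL version is passed in by the caller rather than detected, and the comparison assumes GNU sort with the -V option.

```shell
#!/usr/bin/env bash
# version_le A B: succeed when dotted version A <= B (needs GNU sort -V).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# nccl_sharp_env NCCL_VERSION: print export lines for the variables above.
nccl_sharp_env() {
    local nccl_version="$1"
    echo "export NCCL_COLLNET_ENABLE=1"
    # NCCL_ALGO=CollNet is only needed as a workaround for NCCL <= 2.7.8.
    if version_le "$nccl_version" "2.7.8"; then
        echo "export NCCL_ALGO=CollNet"
    fi
    echo "export SHARP_COLL_LOCK_ON_COMM_INIT=1"
    echo "export SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0"
    # Optional: SHARP logging level.
    echo "export SHARP_COLL_LOG_LEVEL=3"
}
```

A job script could apply the result with `eval "$(nccl_sharp_env 2.7.8)"`. Printing the lines instead of exporting directly makes it easy to inspect what will be set.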
NCCL SHARP Plugin variables:
NCCL_SHARP_DISABLE
NCCL SHARP streaming aggregation is supported on a single NCCL communicator/process group (PG). An application can selectively enable SHARP on a specific PG by setting this variable in the application before creating the PG.
NCCL_SHARP_GROUP_SIZE_THRESH
An application can set this option to selectively enable SHARP on a PG based on the group size.
NCCL_IBEXT_DISABLE
Disables the NCCL plugin; NCCL's native communication transports are used instead.
On systems with multiple GPUs and multiple HCAs, NCCL creates a streaming aggregation flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built so that each leaf-level switch is connected to the same HCA rail on every server.
Basic performance of the setup can be verified with the NCCL tests, available here: https://github.com/NVIDIA/nccl-tests
Example:
$ mpirun -np 1024 -map-by ppr:8:node \
    -x UCX_TLS=dc,shm,self \
    -x LD_LIBRARY_PATH=/sw/nccl/build/lib:/sw/nccl-rdma-sharp-plugins/install/lib:$LD_LIBRARY_PATH \
    -x NCCL_COLLNET_ENABLE=1 \
    all_reduce_perf -b 4 -e 2G -f 2 -g 1 -w 50 -n 50
#                                                   out-of-place                    in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           4             1   float     sum    44.53    0.00    0.00  3e-05    44.21    0.00    0.00  3e-05
           8             2   float     sum    45.42    0.00    0.00  3e-05    45.85    0.00    0.00  3e-05
          16             4   float     sum    46.34    0.00    0.00  3e-05    45.84    0.00    0.00  2e-05
          32             8   float     sum    46.20    0.00    0.00  2e-05    46.56    0.00    0.00  2e-05
          64            16   float     sum    46.00    0.00    0.00  2e-05    48.33    0.00    0.00  2e-05
         128            32   float     sum    48.77    0.00    0.01  2e-05    47.23    0.00    0.01  2e-05
         256            64   float     sum    47.88    0.01    0.01  2e-05    47.85    0.01    0.01  2e-05
         512           128   float     sum    51.44    0.01    0.02  3e-05    48.66    0.01    0.02  3e-05
        1024           256   float     sum    51.27    0.02    0.04  4e-05    51.78    0.02    0.04  4e-05
        2048           512   float     sum    57.93    0.04    0.07  4e-05    56.45    0.04    0.07  4e-05
        4096          1024   float     sum    57.32    0.07    0.14  4e-05    93.51    0.04    0.09  4e-05
        8192          2048   float     sum    106.4    0.08    0.15  4e-05    59.70    0.14    0.27  4e-05
       16384          4096   float     sum    103.0    0.16    0.32  4e-05    58.23    0.28    0.56  4e-05
       32768          8192   float     sum    74.85    0.44    0.87  4e-05    137.8    0.24    0.48  4e-05
       65536         16384   float     sum    96.71    0.68    1.35  4e-05    92.89    0.71    1.41  4e-05
      131072         32768   float     sum    115.6    1.13    2.27  4e-05    120.7    1.09    2.17  4e-05
      262144         65536   float     sum    197.7    1.33    2.65  4e-05    167.6    1.56    3.13  4e-05
      524288        131072   float     sum    222.7    2.35    4.70  4e-05    239.2    2.19    4.38  4e-05
     1048576        262144   float     sum    280.9    3.73    7.46  4e-05    197.7    5.30   10.60  4e-05
     2097152        524288   float     sum    218.0    9.62   19.22  4e-05    213.9    9.81   19.59  4e-05
     4194304       1048576   float     sum    257.6   16.28   32.53  4e-05    254.7   16.47   32.90  4e-05
     8388608       2097152   float     sum    354.3   23.68   47.31  4e-05    523.5   16.02   32.02  4e-05
    16777216       4194304   float     sum    505.9   33.16   66.26  4e-05    484.1   34.66   69.24  4e-05
    33554432       8388608   float     sum    639.2   52.50  104.89  4e-05    678.6   49.45   98.80  4e-05
    67108864      16777216   float     sum   1358.2   49.41   98.72  4e-05   1048.6   64.00  127.87  4e-05
   134217728      33554432   float     sum   1737.2   77.26  154.37  4e-05   1777.6   75.51  150.86  4e-05
   268435456      67108864   float     sum   4359.5   61.58  123.03  4e-05   4262.3   62.98  125.83  4e-05
   536870912     134217728   float     sum   5619.7   95.53  190.88  4e-05   5699.0   94.20  188.22  4e-05
  1073741824     268435456   float     sum    12169   88.23  176.30  4e-05    11508   93.30  186.42  4e-05
  2147483648     536870912   float     sum    22618   94.94  189.70  4e-05    21814   98.44  196.70  4e-05
# Out of bounds values : 0 OK
# Avg bus bandwidth : 41.2497
#
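To compare runs (for example with and without SHARP), it can help to recompute the average bus bandwidth from a saved all_reduce_perf log. The sketch below is a hypothetical helper, not part of nccl-tests. It assumes data rows have 12 whitespace-separated fields (size, count, type, redop, then time, algbw, busbw, error for out-of-place and again for in-place) and that every other line starts with '#'.

```shell
#!/usr/bin/env bash
# avg_busbw FILE: average the out-of-place and in-place busbw columns.
# Assumption: data rows have 12 fields; busbw is field 7 (out-of-place)
# and field 11 (in-place). Lines starting with '#' are ignored.
avg_busbw() {
    awk '
        /^[[:space:]]*#/ { next }      # skip header and summary lines
        NF == 12 {
            sum += $7 + $11            # out-of-place + in-place busbw
            n += 2
        }
        END {
            if (n == 0) exit 1         # no data rows found
            printf "%.4f\n", sum / n
        }
    ' "$1"
}
```

Applied to the 30 rows of the example output, this should come out close to the reported average of 41.2497; small differences are expected because the table values are rounded to two decimals.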