Benchmark Tests
MVAPICH2 takes advantage of GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with NVIDIA® InfiniBand interconnect.
MVAPICH2-GDR v2.1 can be downloaded from:
http://mvapich.cse.ohio-state.edu/download/
GPUDirect RDMA can be tested by running the micro-benchmarks from Ohio State University (OSU). Below is an example of running one of the OSU benchmarks, which are already bundled with MVAPICH2-GDR v2.1, with GPUDirect RDMA.
mpirun -np 2 host1 host2 -genv MV2_CPU_MAPPING=0 -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
...
2097152       6372.60
4194304       6388.63
Please note that MV2_CPU_MAPPING=<core number> must be set to a core number on the socket that shares the same PCIe slot with the GPU.
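A hedged way to identify suitable cores is to inspect the GPU/HCA topology reported by nvidia-smi (the output format varies with the driver version); the "CPU Affinity" column lists the cores local to each GPU and is a reasonable starting point for MV2_CPU_MAPPING:
nvidia-smi topo -m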
The MV2_GPUDIRECT_LIMIT parameter is used to tune the hybrid design that uses pipelining and GPUDirect RDMA for maximum performance while overcoming P2P bandwidth bottlenecks seen on modern systems. GPUDirect RDMA is used only for messages with a size less than or equal to this limit.
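As an illustration only, the limit can be passed on the command line like any other MV2_* runtime parameter; the value below is an assumption and should be tuned per system:
mpirun -np 2 host1 host2 -genv MV2_CPU_MAPPING=0 -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 -genv MV2_GPUDIRECT_LIMIT=8192 /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D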
Following is a list of runtime parameters that can be used for process-to-rail binding in case the system has a multi-rail configuration (see the device-name check sketched after the list):
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1
export MV2_RAIL_SHARING_POLICY=FIXED_MAPPING
export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G
export MV2_CPU_BINDING_LEVEL=SOCKET
export MV2_CPU_BINDING_POLICY=SCATTER
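The device names used in MV2_PROCESS_TO_RAIL_MAPPING (mlx5_0 and mlx5_1 above) must match the HCAs present on the system; a quick, hedged way to list them is:
ibv_devinfo -l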
Additional tuning parameters related to CUDA and GPUDirect RDMA (such as MV2_CUDA_BLOCK_SIZE) can be found in the MVAPICH2 user guide. Below is an example of enabling RoCE communication.
mpirun -np 2 host1 host2 -genv MV2_USE_RoCE=1 -genv MV2_DEFAULT_GID_INDEX=2 -genv MV2_DEFAULT_SERVICE_LEVEL=3 -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D
Where:
Parameter | Description
MV2_USE_RoCE=1 | Enables RoCE communication.
MV2_DEFAULT_GID_INDEX=<gid index> | Selects a non-default GID index. All VLAN interfaces appear as additional GID indexes (starting from 1) on the InfiniBand HCA side of the RoCE adapter, so the desired interface is chosen by specifying its GID index.
MV2_DEFAULT_SERVICE_LEVEL=<service level> | Selects the RoCE priority service level.
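The available GID indexes can be inspected through sysfs; the device name and port number below are assumptions for illustration, and each file name under the gids directory corresponds to a GID index:
ls /sys/class/infiniband/mlx5_0/ports/1/gids/
cat /sys/class/infiniband/mlx5_0/ports/1/gids/2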
HPC-X
To download the HPC-X toolkit, go to https://developer.nvidia.com/networking/hpc-x.
HPC-X contains precompiled OpenMPI, UCX, and HCOLL packages built with CUDA support. HPC-X also includes an OSU micro-benchmarks build with CUDA support.
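After sourcing hpcx-init.sh and running hpcx_load as shown below, the paths exported by HPC-X, including the HPCX_OSU_CUDA_DIR directory used in the run command, can be listed with a simple environment query (variable names may differ slightly between HPC-X versions):
$ env | grep ^HPCX_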
To run osu_bw with CUDA buffers using HPC-X:
$ source <path/to/hpcx>/hpcx-init.sh
$ hpcx_load
$ export LD_LIBRARY_PATH=<cuda/install/path>/lib64:$LD_LIBRARY_PATH
$ mpirun -np 2 -H host1,host2 -x LD_LIBRARY_PATH -x UCX_NET_DEVICES=mlx5_0:1 -x CUDA_VISIBLE_DEVICES=0 -x UCX_RNDV_SCHEME=get_zcopy $HPCX_OSU_CUDA_DIR/osu_bw D D
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1             1.27
2             2.45
4             5.06
8             10.14
16            20.32
32            40.18
64            74.90
128           147.88
256           293.52
512           517.28
1024          1083.35
2048          2030.86
4096          3660.12
8192          5709.13
16384         11954.09
32768         16159.68
65536         21851.70
131072        23506.87
262144        24169.77
524288        24451.87
1048576       24577.48
2097152       24638.73
4194304       24669.10
OpenMPI/UCX from Sources
Users can also build OpenMPI and UCX from source with CUDA and GPUDirect RDMA support.
To build UCX with CUDA support, download UCX from https://github.com/openucx/ucx, and run:
% ./configure --prefix=<path/to/ucx> \
    --with-cuda=<cuda/runtime/install/path> \
    --with-gdrcopy=<gdr_copy/install/path>
% make; make install
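A hedged way to confirm that the resulting UCX build detected CUDA and gdrcopy is to list the available transports; the exact transport set depends on the installed drivers and libraries:
% <path/to/ucx>/bin/ucx_info -d | grep -i cuda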
To build OpenMPI with UCX/CUDA, run:
% ./configure --prefix=/path/to/openmpi \
    --with-cuda=<cuda/runtime/install/path> \
    --with-ucx=<path/to/ucx/install>
% make; make install
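To verify that the resulting OpenMPI build reports CUDA awareness, ompi_info can be queried; this check is a common convention rather than part of the build steps above:
% /path/to/openmpi/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value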
To build OSU benchmarks with CUDA, download the benchmarks from http://mvapich.cse.ohio-state.edu/benchmarks.
When building the OSU benchmarks, you must verify that the proper flags are set to enable the CUDA part of the tests; otherwise, the tests will only run using host memory, which is the default setting.
Additionally, make sure that the MPI library (OpenMPI) is installed prior to compiling the benchmarks.
export PATH=/path/to/openmpi/bin:$PATH
./configure CC=mpicc CXX=mpicxx --prefix=/path/to/osu-benchmarks \
    --enable-cuda --with-cuda=<cuda/runtime/install/path>
make
make install
To run osu_bw with CUDA buffers:
% mpirun -np 2 -npernode 1 -mca pml ucx -bind-to core -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_RNDV_SCHEME=get_zcopy /path/to/osu-benchmarks/osu_bw -d cuda D D
Tuning
For UCX version 1.9 or earlier, in GPUDirect RDMA optimized system configurations where the GPU and HCA are connected to the same PCIe switch fabric and the MPI processes are bound to the HCA and GPU under the same PCIe switch, use the following rendezvous protocol for optimal GPUDirect RDMA performance:
-x UCX_RNDV_SCHEME=get_zcopy
Using UCX CUDA memory hooks may not work with statically built CUDA applications. As a workaround, extend the configuration with the following option:
-x UCX_MEMTYPE_CACHE=0
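For illustration, both UCX options can be combined into the earlier osu_bw invocation; this is a sketch and the paths and device names remain system-dependent:
% mpirun -np 2 -npernode 1 -mca pml ucx -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_RNDV_SCHEME=get_zcopy -x UCX_MEMTYPE_CACHE=0 /path/to/osu-benchmarks/osu_bw -d cuda D D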