Running GPUDirect RDMA with MVAPICH-GDR 2.1
MVAPICH2 takes advantage of the GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with an NVIDIA® InfiniBand interconnect.
MVAPICH-GDR v2.1 can be downloaded from:
http://mvapich.cse.ohio-state.edu/download/
GPUDirect RDMA can be tested by running the micro-benchmarks from Ohio State University (OSU). Below is an example of running one of the OSU benchmarks, which are bundled with MVAPICH2-GDR v2.1, with GPUDirect RDMA.
mpirun -np 2 host1 host2 -genv MV2_CPU_MAPPING=0 -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 \
    /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D

# OSU MPI-CUDA Bandwidth Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
...
2097152       6372.60
4194304       6388.63
Please note that MV2_CPU_MAPPING=<core number> must be set to a core number on the socket that shares the PCIe slot with the GPU.
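A quick way to identify which cores are local to the GPU and the HCA (not part of the MVAPICH2 documentation, just a common sanity check) is the NVIDIA topology matrix:

# Print the GPU/HCA/CPU affinity matrix; the "CPU Affinity" column lists the
# cores that are local to each GPU and are suitable values for MV2_CPU_MAPPING.
nvidia-smi topo -m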
The MV2_GPUDIRECT_LIMIT parameter is used to tune the hybrid design that uses pipelining and GPUDirect RDMA for maximum performance while overcoming P2P bandwidth bottlenecks seen on modern systems. GPUDirect RDMA is used only for messages with a size less than or equal to this limit.
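As a sketch, the limit can be passed on the command line like the other MV2_* parameters; the value below is purely illustrative, and the optimal limit is system dependent:

mpirun -np 2 host1 host2 -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 \
    -genv MV2_GPUDIRECT_LIMIT=8192 \
    /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D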
The following is a list of runtime parameters that can be used for process-to-rail binding when the system has a multi-rail configuration:
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1
export MV2_RAIL_SHARING_POLICY=FIXED_MAPPING
export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G
export MV2_CPU_BINDING_LEVEL=SOCKET
export MV2_CPU_BINDING_POLICY=SCATTER
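As a sketch, assuming the Hydra launcher used in the earlier -genv examples, the exported variables can be forwarded to the remote ranks with -genvall:

# -genvall forwards all exported environment variables (including the MV2_* rail
# mapping settings above) to both ranks
mpirun -np 2 host1 host2 -genvall \
    /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D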
Additional tuning parameters related to CUDA and GPUDirect RDMA (such as MV2_CUDA_BLOCK_SIZE) can be found in the MVAPICH2 user guide. Below is an example of enabling RoCE communication.
mpirun -np 2 host1 host2 -genv MV2_USE_RoCE=1 -genv MV2_DEFAULT_GID_INDEX=2 -genv MV2_DEFAULT_SERVICE_LEVEL=3 \
    -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 \
    /opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D
Where:
Parameter | Description |
---|---|
MV2_USE_RoCE=1 | Enables RoCE communication. |
MV2_DEFAULT_GID_INDEX=<gid index> | Selects a non-default GID index. All VLAN interfaces appear as additional GID indexes (starting from 1) on the InfiniBand HCA side of the RoCE adapter, so the index matching the desired VLAN must be selected explicitly (see the example after this table). |
MV2_DEFAULT_SERVICE_LEVEL=<service level> | Selects the RoCE priority service level. |
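To pick a value for MV2_DEFAULT_GID_INDEX, the GID table of the adapter can be inspected through sysfs (a sketch; the device name mlx5_0 and port 1 are assumptions, and recent kernels are required for the RoCE version attributes):

# List the GIDs of port 1 on mlx5_0; VLAN interfaces appear at index 1 and above
grep . /sys/class/infiniband/mlx5_0/ports/1/gids/* 2>/dev/null
# Show whether each GID index corresponds to RoCE v1 or RoCE v2
grep . /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/* 2>/dev/null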
Running GPUDirect RDMA with OpenMPI
HPC-X
To download the HPC-X toolkit, go to https://developer.nvidia.com/networking/hpc-x.
HPC-X provides precompiled OpenMPI, UCX, and HCOLL packages built with CUDA support. HPC-X also includes an OSU micro-benchmarks build with CUDA support.
To run osu_bw with CUDA buffers using HPC-X:
$ source <path/to/hpcx>/hpcx-init.sh
$ hpcx_load
$ export LD_LIBRARY_PATH=<cuda/install/path>/lib64:$LD_LIBRARY_PATH
$ mpirun -np 2 -H host1,host2 -x LD_LIBRARY_PATH -x UCX_NET_DEVICES=mlx5_0:1 \
    -x CUDA_VISIBLE_DEVICES=0 -x UCX_RNDV_SCHEME=get_zcopy $HPCX_OSU_CUDA_DIR/osu_bw D D

# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
1                        1.27
2                        2.45
4                        5.06
8                       10.14
16                      20.32
32                      40.18
64                      74.90
128                    147.88
256                    293.52
512                    517.28
1024                  1083.35
2048                  2030.86
4096                  3660.12
8192                  5709.13
16384                11954.09
32768                16159.68
65536                21851.70
131072               23506.87
262144               24169.77
524288               24451.87
1048576              24577.48
2097152              24638.73
4194304              24669.10
OpenMPI/UCX from Sources
Users can also build OpenMPI and UCX from source with CUDA and GPUDirect RDMA support.
To build UCX with CUDA support, download UCX from https://github.com/openucx/ucx, and run:
% ./configure --prefix=<path/to/ucx> --with-cuda=<cuda/runtime/install/path> --with-gdrcopy=<gdr_copy/install/path>
% make; make install
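After the install, CUDA support in the resulting UCX can be checked with the bundled ucx_info tool (a sketch; the transport names may vary slightly between UCX releases):

% <path/to/ucx>/bin/ucx_info -v                          # prints the version and the configure flags used for the build
% <path/to/ucx>/bin/ucx_info -d | grep -iE 'cuda|gdr'    # cuda_copy, cuda_ipc and gdr_copy should be listed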
To build OpenMPI with UCX/CUDA, run:
% ./configure --prefix=/path/to/openmpi \
    --with-cuda=<cuda/runtime/install/path> --with-ucx=<path/to/ucx/install>
% make; make install
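To confirm that the resulting OpenMPI was built with CUDA support, ompi_info can be queried (a sketch):

% /path/to/openmpi/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value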
To build OSU benchmarks with CUDA, download the benchmarks from http://mvapich.cse.ohio-state.edu/benchmarks.
When building the OSU benchmarks, you must verify that the proper flags are set to enable the CUDA part of the tests. Otherwise, the tests will only run using host memory, which is the default setting.
Additionally, make sure that the MPI library (OpenMPI) is installed prior to compiling the benchmarks.
export PATH=/path/to/openmpi/bin:$PATH
./configure CC=mpicc CXX=mpicxx --prefix=/path/to/osu-benchmarks \
    --enable-cuda --with-cuda=<cuda/runtime/install/path>
make
make install
To run osu_bw with CUDA buffers:
% mpirun -np 2 -npernode 1 -mca pml ucx -bind-to-core -x CUDA_VISIBLE_DEVICES=0 \
    -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_RNDV_SCHEME=get_zcopy \
    /path/to/osu-benchmarks/osu_bw -d cuda D D
Tuning
For UCX version 1.9 or earlier, in GPUDirect RDMA optimized system configurations where the GPU and HCA are connected to the same PCIe switch fabric, and the MPI processes are bound to the HCA and GPU under the same PCIe switch, use the following rendezvous protocol for optimal GPUDirect RDMA performance:
-x UCX_RNDV_SCHEME=get_zcopy
UCX CUDA memory hooks may not work with statically built CUDA applications. As a workaround, extend the configuration with the following option:
-x UCX_MEMTYPE_CACHE=0
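Putting the two tuning options together with the earlier osu_bw launch line gives a command of the following shape (a sketch; paths and device names are placeholders):

% mpirun -np 2 -npernode 1 -mca pml ucx -x CUDA_VISIBLE_DEVICES=0 \
    -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_RNDV_SCHEME=get_zcopy -x UCX_MEMTYPE_CACHE=0 \
    /path/to/osu-benchmarks/osu_bw -d cuda D D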