MVAPICH2 takes advantage of the new GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with NVIDIA® InfiniBand interconnects.
MVAPICH2-GDR v2.1 can be downloaded from the MVAPICH site: http://mvapich.cse.ohio-state.edu/downloads/
GPUDirect RDMA can be tested by running the micro-benchmarks from Ohio State University (OSU). Below is an example of running one of the OSU benchmarks, which are already bundled with MVAPICH2-GDR v2.1, with GPUDirect RDMA.
$ mpirun -np 2 host1 host2 -genv MV2_CPU_MAPPING=1 -genv MV2_USE_GPUDIRECT=1 <mvapich2-gdr-install-path>/gnu/libexec/mvapich2/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
...
Please note that MV2_CPU_MAPPING=<core number> has to be set to a core number on the socket that shares the PCIe slot with the GPU.
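To identify a suitable core, you can check the CPU affinity that the system reports for the HCA and the GPU; for example (mlx5_0 is an assumed device name, and the exact paths vary by system):
$ nvidia-smi topo -m                                       # GPU/HCA/CPU affinity matrix
$ cat /sys/class/infiniband/mlx5_0/device/local_cpulist    # cores local to the HCA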
The MV2_GPUDIRECT_LIMIT is used to tune the hybrid design that uses pipelining and GPUDirect RDMA for maximum performance while overcoming P2P bandwidth bottlenecks seen on modern systems. GPUDirect RDMA is used only for messages with size less than or equal to this limit.
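For example, the limit can be adjusted on the command line; the value below is only illustrative, as the optimal limit is platform dependent:
$ mpirun -np 2 host1 host2 -genv MV2_USE_GPUDIRECT=1 -genv MV2_GPUDIRECT_LIMIT=16384 <mvapich2-gdr-install-path>/gnu/libexec/mvapich2/osu_bw -d cuda D D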
Following is a list of runtime parameters that can be used for process-to-rail binding in case the system has a multi-rail configuration:
export MV2_RAIL_SHARING_POLICY=FIXED_MAPPING
export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1
export MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=1G
export MV2_CPU_BINDING_LEVEL=SOCKET
export MV2_CPU_BINDING_POLICY=SCATTER
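The HCA names used in MV2_PROCESS_TO_RAIL_MAPPING (mlx5_0 and mlx5_1 above) can be listed with, for example:
$ ibstat -l        # list HCA device names
$ ibv_devinfo -l   # alternative listing of RDMA devices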
Additional tuning parameters related to CUDA and GPUDirect RDMA (such as MV2_CUDA_BLOCK_SIZE) can be found in the MVAPICH2 user guide. Below is an example of enabling RoCE communication.
$ mpirun -np 2 host1 host2 -genv MV2_USE_RoCE=1 -genv MV2_DEFAULT_SERVICE_LEVEL=<service level> <mvapich2-gdr-install-path>/gnu/libexec/mvapich2/osu_bw -d cuda D D
Enables RoCE communication (MV2_USE_RoCE=1).
Selects a non-default GID index using MV2_DEFAULT_GID_INDEX. All VLAN interfaces appear as additional GID indexes (starting from 1) on the InfiniBand HCA side of the RoCE adapter, so a non-default GID index may have to be selected with this run-time parameter.
Selects the RoCE priority service level using MV2_DEFAULT_SERVICE_LEVEL.
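Before choosing a GID index, the available GID entries can be inspected; for example (mlx5_0 and port 1 are assumed device and port names):
$ show_gids                                          # MLNX_OFED helper script, if installed
$ cat /sys/class/infiniband/mlx5_0/ports/1/gids/1    # or read a specific GID entry from sysfs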
To download the HPC-X toolkit, go to https://developer.nvidia.com/networking/hpc-x.
HPC-X is a precompiled package of OpenMPI, UCX, and HCOLL built with CUDA support. HPC-X also includes OSU micro-benchmarks built with CUDA support.
To run osu_bw with CUDA buffers using HPC-X:
$ source <path/to/hpcx>/hpcx-init.sh
$ hpcx_load
$ export LD_LIBRARY_PATH=<cuda/install/path>/lib64:$LD_LIBRARY_PATH
$ mpirun -np 2 -H host1,host2 -x LD_LIBRARY_PATH -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_RNDV_SCHEME=get_zcopy $HPCX_OSU_CUDA_DIR/osu_bw D D
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
OpenMPI/UCX from Sources
Users can also build OpenMPI and UCX from source with CUDA and GPUDirect RDMA support.
To build UCX with CUDA support, download UCX from https://github.com/openucx/ucx, and run:
% ./configure --prefix=<path/to/ucx> --with-cuda=<cuda/runtime/install/path> --with-gdrcopy=<gdr_copy/install/path>
% make; make install
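After installation, CUDA support in the UCX build can be verified with the ucx_info utility, for example:
% <path/to/ucx>/bin/ucx_info -v                  # shows the configure options UCX was built with
% <path/to/ucx>/bin/ucx_info -d | grep -i cuda   # lists CUDA-related transports (e.g. cuda_copy) when enabled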
To build OpenMPI with UCX/CUDA, run:
% ./configure --prefix=<path/to/openmpi> --with-cuda=<cuda/runtime/install/path> --with-ucx=<path/to/ucx/install>
% make; make install
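CUDA awareness of the resulting OpenMPI build can be confirmed with ompi_info, for example:
% <path/to/openmpi>/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value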
To build OSU benchmarks with CUDA, download the benchmarks from http://mvapich.cse.ohio-state.edu/benchmarks.
When building the OSU benchmarks, you must verify that the proper flags are set to enable the CUDA part of the tests. Otherwise, the tests will only run using host memory, which is the default setting.
Additionally, make sure that the MPI library (OpenMPI) is installed prior to compiling the benchmarks.
export PATH=/path/to/openmpi/bin:$PATH
./configure CC=mpicc CXX=mpicxx --prefix=/path/to/osu-benchmarks \
    --enable-cuda --with-cuda=<cuda/runtime/install/path>
make
make install
To run osu_bw with CUDA buffers:
% mpirun -np 2 -mca pml ucx -bind-to-core -x CUDA_VISIBLE_DEVICES=1 -x UCX_RNDV_SCHEME=get_zcopy /path/to/osu-benchmarks/osu_bw -d cuda D D
For UCX version 1.9 or earlier, in GPUDirect RDMA optimized system configurations where the GPU and the HCA are connected to the same PCIe switch fabric, and the MPI processes are bound to the HCA and GPU under the same PCIe switch, please use the get_zcopy rendezvous protocol (UCX_RNDV_SCHEME=get_zcopy, as in the run commands above) for optimal GPUDirect RDMA performance.
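For example, it can be exported for the job or passed on the mpirun command line:
% export UCX_RNDV_SCHEME=get_zcopy
% mpirun -np 2 -mca pml ucx -x UCX_RNDV_SCHEME=get_zcopy /path/to/osu-benchmarks/osu_bw -d cuda D D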
UCX CUDA memory hooks may not work with statically built CUDA applications. As a workaround, extend the configuration with the following options: