
Running GPUDirect RDMA with MVAPICH2-GDR 2.1

MVAPICH2 takes advantage of GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with an NVIDIA® InfiniBand interconnect.

MVAPICH2-GDR v2.1 can be downloaded from: 

GPUDirect RDMA can be tested by running the micro-benchmarks from Ohio State University (OSU). Below is an example of running one of the OSU benchmarks, which are bundled with MVAPICH2-GDR v2.1, with GPUDirect RDMA enabled.

mpirun -np 2 host1 host2 -genv MV2_CPU_MAPPING=0 -genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 \
/opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
#Size             Bandwidth (MB/s) 
2097152           6372.60
4194304           6388.63

Note that MV2_CPU_MAPPING=<core number> must be set to a core on the CPU socket to which the GPU's PCI slot is attached.
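One way to find a candidate core is to read the CPU list that Linux reports as local to the GPU's PCI device. The following is a minimal sketch, assuming a Linux host that exposes `local_cpulist` in sysfs; the function names `gpu_local_cpus` and `first_core`, the example PCI address, and the optional sysfs-root argument are illustrative and not part of MVAPICH2.

```shell
# Print the CPU cores that Linux reports as local to a PCI device (for example, a GPU).
# Usage: gpu_local_cpus <pci_address> [sysfs_root]
# The sysfs_root argument is a hypothetical convenience for testing; it defaults to /sys.
gpu_local_cpus() {
    pci_addr=$1
    sysfs_root=${2:-/sys}
    cpulist_file="$sysfs_root/bus/pci/devices/$pci_addr/local_cpulist"
    if [ ! -r "$cpulist_file" ]; then
        echo "no local_cpulist for $pci_addr" >&2
        return 1
    fi
    cat "$cpulist_file"
}

# Print the first core of a CPU list such as "0-13,28-41".
first_core() {
    printf '%s\n' "$1" | sed 's/[-,].*//'
}
```

For example, `first_core "$(gpu_local_cpus 0000:3b:00.0)"` prints one core number that could be passed as `-genv MV2_CPU_MAPPING=<core number>`. The address `0000:3b:00.0` is a placeholder; `lspci -D` lists the real PCI addresses on a given host.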

The MV2_GPUDIRECT_LIMIT parameter tunes the hybrid design, which combines pipelining with GPUDirect RDMA to maximize performance while working around the P2P bandwidth bottlenecks seen on modern systems. GPUDirect RDMA is used only for messages whose size is less than or equal to this limit.
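The size rule can be illustrated with a small sketch. It only restates the comparison described above and does not reproduce MVAPICH2 internals; the function name `gdr_path`, the path labels, and the example limit of 8192 bytes are assumptions made for illustration.

```shell
# Report which transfer path a message of a given size would take under the
# rule stated above: GPUDirect RDMA for sizes up to and including the limit,
# the pipelined path for anything larger.
# Usage: gdr_path <message_size_bytes> <gpudirect_limit_bytes>
gdr_path() {
    if [ "$1" -le "$2" ]; then
        echo "gpudirect-rdma"
    else
        echo "pipelined"
    fi
}
```

With a hypothetical limit of 8192 bytes (for example, `-genv MV2_GPUDIRECT_LIMIT=8192`), `gdr_path 8192 8192` reports the GPUDirect RDMA path and `gdr_path 8193 8192` reports the pipelined path.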

The following runtime parameters can be used for process-to-rail binding when the system has a multi-rail configuration:

export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1
export MV2_PROCESS_TO_RAIL_MAPPING=mlx5_0:mlx5_1

Additional tuning parameters related to CUDA and GPUDirect RDMA (such as MV2_CUDA_BLOCK_SIZE) can be found in the MVAPICH2 user guide.

Below is an example of enabling RoCE communication.

mpirun -np 2 host1 host2 -genv MV2_USE_RoCE=1 -genv MV2_DEFAULT_GID_INDEX=2 -genv MV2_DEFAULT_SERVICE_LEVEL=3 \
-genv MV2_USE_CUDA=1 -genv MV2_USE_GPUDIRECT=1 \
/opt/mvapich2/gdr/2.1/cuda7.0/gnu/libexec/mvapich2/osu_bw -d cuda D D





MV2_USE_RoCE=1: Enables RoCE communication.

MV2_DEFAULT_GID_INDEX=<gid_index>: Selects a non-default GID index. This is needed because all VLAN interfaces appear as additional GID indexes (starting from 1) on the InfiniBand HCA side of the RoCE adapter.

MV2_DEFAULT_SERVICE_LEVEL=<service_level>: Selects the RoCE priority service level.
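To decide which index to pass as MV2_DEFAULT_GID_INDEX, it can help to list the GID table entries that are actually populated on a port. The sketch below assumes a Linux host where the RDMA stack exposes the table under `/sys/class/infiniband/<device>/ports/<port>/gids/`; the function name `list_gids` and the optional sysfs-root argument are hypothetical helpers, not MVAPICH2 tooling.

```shell
# List the populated GID table entries of one port as "<index> <gid>".
# Usage: list_gids <device> <port> [sysfs_root]
# The sysfs_root argument is a hypothetical convenience for testing; it defaults to /sys.
list_gids() {
    dev=$1
    port=$2
    sysfs_root=${3:-/sys}
    gid_dir="$sysfs_root/class/infiniband/$dev/ports/$port/gids"
    if [ ! -d "$gid_dir" ]; then
        echo "no GID table at $gid_dir" >&2
        return 1
    fi
    for f in "$gid_dir"/*; do
        [ -r "$f" ] || continue
        gid=$(cat "$f" 2>/dev/null)
        # Unused slots read as an all-zero GID (or not at all); skip them.
        case $gid in
            0000:0000:0000:0000:0000:0000:0000:0000|'') continue ;;
        esac
        echo "$(basename "$f") $gid"
    done | sort -n
}
```

For example, `list_gids mlx5_0 1` prints one line per populated index on port 1 of `mlx5_0`; the index in the first column is the value that would be passed as `<gid_index>`.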

Running GPUDirect RDMA with OpenMPI


To download the HPC-X toolkit, go to

HPC-X is a precompiled package of OpenMPI, UCX, and HCOLL built with CUDA support. HPC-X also includes a build of the OSU micro-benchmarks with CUDA support.

To run osu_bw with CUDA buffers using HPC-X:

$ source <path/to/hpcx>/

$ hpcx_load

$ export LD_LIBRARY_PATH=<cuda/install/path>/lib64:$LD_LIBRARY_PATH

$ mpirun -np 2 -H host1,host2 -x LD_LIBRARY_PATH -x UCX_NET_DEVICES=mlx5_0:1 -x CUDA_VISIBLE_DEVICES=0 -x UCX_RNDV_SCHEME=get_zcopy $HPCX_OSU_CUDA_DIR/osu_bw D D

# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       1.27
2                       2.45
4                       5.06
8                      10.14
16                     20.32
32                     40.18
64                     74.90
128                   147.88
256                   293.52
512                   517.28
1024                 1083.35
2048                 2030.86
4096                 3660.12
8192                 5709.13
16384               11954.09
32768               16159.68
65536               21851.70
131072              23506.87
262144              24169.77
524288              24451.87
1048576             24577.48
2097152             24638.73
4194304             24669.10
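When comparing runs, it is often enough to look at the peak bandwidth row. The following sketch extracts it from osu_bw output in the two-column format shown above; the function name `osu_peak_bw` is an illustrative helper and is not shipped with HPC-X or the OSU benchmarks.

```shell
# Print "<size> <bandwidth>" for the row with the highest bandwidth.
# Reads osu_bw output on standard input; lines starting with '#' are headers
# and are ignored, as is any line that does not have exactly two fields.
osu_peak_bw() {
    awk '!/^#/ && NF == 2 && $2 + 0 > max { max = $2 + 0; size = $1; bw = $2 }
         END { if (size != "") print size, bw }'
}
```

For example, piping the mpirun command above into `osu_peak_bw` would print the message size and bandwidth of the best row only.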

OpenMPI/UCX from Sources

Users can also build OpenMPI and UCX from source with CUDA and GPUDirect RDMA support.

To build UCX with CUDA support, download UCX from, and run:

% ./configure --prefix=<path/to/ucx>



% make; make install

To build OpenMPI with UCX/CUDA, run:

% ./configure --prefix=/path/to/openmpi\




% make; make install

To build OSU benchmarks with CUDA, download the benchmarks from

When building the OSU benchmarks, verify that the proper flags are set to enable the CUDA part of the tests. Otherwise, the tests will run using host memory only, which is the default setting.

Additionally, make sure that the MPI library (OpenMPI) is installed before compiling the benchmarks.

export PATH=/path/to/openmpi/bin:$PATH
./configure CC=mpicc CXX=mpicxx --prefix=/path/to/osu-benchmarks \
--enable-cuda --with-cuda=<cudaruntime/install/path>
make install

To run osu_bw with CUDA buffers: 

% mpirun -np 2 -npernode 1 -mca pml ucx -bind-to-core -x UCX_RNDV_SCHEME=get_zcopy \
/path/to/osu-benchmarks/osu_bw -d cuda D D


For UCX version 1.9 or earlier, in GPUDirect RDMA-optimized system configurations where the GPU and HCA are connected to the same PCIe switch fabric and the MPI processes are bound to the HCA and GPU under the same PCIe switch, use the following rendezvous protocol for optimal GPUDirect RDMA performance: 

-x UCX_RNDV_SCHEME=get_zcopy 
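In a launch script, the version condition above can be checked before the option is added. This is a minimal sketch that takes the UCX version as a string argument; the function names `ucx_needs_get_zcopy` and `rndv_args` are hypothetical, and the sketch does not check the PCIe topology condition, which still has to be verified separately.

```shell
# Succeed (exit status 0) if the given UCX version is 1.9 or earlier.
# Usage: ucx_needs_get_zcopy <major.minor[.patch]>
ucx_needs_get_zcopy() {
    printf '%s\n' "$1" | awk -F. '{ exit !($1 < 1 || ($1 == 1 && $2 <= 9)) }'
}

# Print the extra mpirun arguments for a given UCX version (nothing if none apply).
# Usage: rndv_args <major.minor[.patch]>
rndv_args() {
    if ucx_needs_get_zcopy "$1"; then
        echo "-x UCX_RNDV_SCHEME=get_zcopy"
    fi
}
```

For example, `rndv_args 1.9.0` prints `-x UCX_RNDV_SCHEME=get_zcopy`, while `rndv_args 1.10.0` prints nothing. The installed version can be read from the output of `ucx_info -v`.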

UCX CUDA memory hooks may not work with statically built CUDA applications. As a workaround, extend the configuration with the following options: