Installing GPUDirect RDMA

Warning

Please ensure that you have installed MLNX_OFED before trying to install GPUDirect RDMA. MLNX_OFED can be downloaded from:
www.mellanox.com -> Products -> Software -> InfiniBand/VPI Drivers -> Linux SW/Drivers.

Warning

As of nv_peer_mem v1.1, GPUDirect RDMA can also work with the inbox drivers on the supported distribution packages.

Warning

Memory registration with nv_peer_mem is not supported over DevX umem. As a workaround, the regular ibv_reg_mr() verb should be used.

Warning

nv_peer_mem is deprecated starting with CUDA 11.5; it will receive only critical bug fixes until support is dropped in a future release.

Warning

GPUDirect RDMA kernel mode support is now provided in the form of a fully open-source nvidia-peermem kernel module, which is installed as part of the NVIDIA driver. The nvidia_peermem module is a drop-in replacement for nv_peer_mem.

This simplifies the installation workflow for our customers, so that there is no longer a need to retrieve and build code from a separate site. Now, simply installing the driver will suffice.

Please refer to the nvidia_peermem documentation for more information.
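The transition described above can be verified directly on a node. The following is a minimal sketch (assuming a Linux host and the module names nvidia_peermem and nv_peer_mem as used above) that reports whether a GPUDirect RDMA peer-memory module is currently loaded:

```shell
# Check /proc/modules for either the new nvidia_peermem module or the
# legacy nv_peer_mem module; record the result in peermem_status.
if grep -qE '^(nvidia_peermem|nv_peer_mem) ' /proc/modules 2>/dev/null; then
    peermem_status="loaded"
else
    peermem_status="not loaded"
fi
echo "GPUDirect RDMA peer-memory module: ${peermem_status}"
```

On systems where the NVIDIA driver ships nvidia_peermem, running modprobe nvidia-peermem loads the module if it is installed but not yet loaded.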

To install GPUDirect RDMA:

  1. Extract the package:


    tar xzf nvidia_peer_memory-1.1.tar.gz

  2. Change the working directory to be nvidia_peer_memory:


    cd nvidia_peer_memory-1.1       

  3. Build the source packages (src.rpm for RPM-based OSs and a tarball for DEB-based OSs) using the build_module.sh script:


    $ ./build_module.sh
    Building source rpm for nvidia_peer_memory...
    Building debian tarball for nvidia-peer-memory...
    Built: /tmp/nvidia_peer_memory-1.1.0.src.rpm
    Built: /tmp/nvidia-peer-memory_1.1.orig.tar.gz

    Note: On SLES OSes, add “--nodeps”.

To install on RPM based OS:


# rpmbuild --rebuild /tmp/nvidia_peer_memory-1.1-0.src.rpm
# rpm -ivh <path to generated binary rpm file>

To install on DEB based OS:


# cd /tmp
# tar xzf /tmp/nvidia-peer-memory_1.1.orig.tar.gz
# cd nvidia-peer-memory-1.1
# dpkg-buildpackage -us -uc
# dpkg -i <path to generated deb files>

Warning

Please make sure this kernel module is installed and loaded on each of the GPU InfiniBand compute nodes.
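A hedged sketch of that per-node check follows; NODES is a hypothetical space-separated host list, and ssh access plus lsmod on each node are assumed:

```shell
# Iterate over GPU compute nodes and report whether nv_peer_mem is loaded.
# NODES is a placeholder; set it to your actual host list, e.g. "gpu01 gpu02".
NODES="${NODES:-}"
checked=0
for node in $NODES; do
    if ssh "$node" 'lsmod | grep -q nv_peer_mem'; then
        echo "$node: nv_peer_mem loaded"
    else
        echo "$node: nv_peer_mem NOT loaded"
    fi
    checked=$((checked + 1))
done
echo "Checked ${checked} node(s)"
```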

To install GPUDirect RDMA for MVAPICH2:

  1. Download the gdrcopy library from https://github.com/NVIDIA/gdrcopy/archive/master.zip and build it:


    cd /opt/mvapich2/gdr/2.1/cuda7.0/gnu
    unzip master.zip
    cd /opt/mvapich2/gdr/2.1/cuda7.0/gnu/gdrcopy-master
    make CUDA=/usr/local/cuda-7.0 all

  2. Make sure nv_peer_mem is installed on all compute nodes:


    service nv_peer_mem status

  3. Make sure gdrcopy is installed on all compute nodes and load the module on each GPU node:


    cd /opt/mvapich2/gdr/2.1/cuda7.0/gnu/gdrcopy-master
    ./insmod.sh

To install GPUDirect RDMA using GDR COPY:

GDR COPY is a fast copy library from NVIDIA, used to transfer data between host and GPU memory. Communication libraries (MVAPICH2, UCX) can take advantage of GDRCopy if it is available on the system.

For information on how to install GDR COPY, please refer to https://github.com/NVIDIA/gdrcopy.

To check whether the GDR COPY module is loaded, run:


# lsmod | grep gdrdrv

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.