Installing GPUDirect RDMA
Please ensure that you have installed MLNX_OFED before trying to install GPUDirect RDMA. MLNX_OFED can be downloaded from:
www.mellanox.com -> Products -> Software -> InfiniBand/VPI Drivers -> Linux SW/Drivers.
As of nv_peer_mem v1.1, GPUDirect RDMA can also work with the inbox drivers on the supported distribution packages.
Memory registration with nv_peer_mem is not supported over DevX umem. As a workaround, the regular ibv_reg_mr() verb should be used.
nv_peer_mem is deprecated starting with CUDA 11.5 and will receive only critical bug fixes until support is dropped in a future release.
GPUDirect RDMA kernel mode support is now provided in the form of a fully open source nvidia-peermem kernel module, which is installed as part of the NVIDIA driver. The nvidia_peermem module is a drop-in replacement for nv_peer_mem.
This simplifies the installation workflow for our customers: there is no longer a need to retrieve and build code from a separate site; simply installing the driver suffices.
Please refer to the nvidia_peermem documentation for more information.
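As a quick sanity check that the driver-provided module is present and loaded (a minimal sketch; note that the module file is named nvidia-peermem, while lsmod reports it as nvidia_peermem):
# modprobe nvidia-peermem
# lsmod | grep nvidia_peermem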
To install GPUDirect RDMA:
Unzip the package:
tar xzf nvidia_peer_memory-1.1.tar.gz
Change the working directory to nvidia_peer_memory:
cd nvidia_peer_memory-1.1
Build the source packages (src.rpm for RPM-based OSes and a tarball for DEB-based OSes) using the build_module.sh script:
$ ./build_module.sh
Building source rpm for nvidia_peer_memory...
Building debian tarball for nvidia-peer-memory...
Built: /tmp/nvidia_peer_memory-1.1-0.src.rpm
Built: /tmp/nvidia-peer-memory_1.1.orig.tar.gz
Note: On SLES OSes, add "--nodeps" to the rpmbuild command below.
To install on an RPM-based OS:
# rpmbuild --rebuild /tmp/nvidia_peer_memory-1.1-0.src.rpm
# rpm -ivh <path to generated binary rpm file>
To install on a DEB-based OS:
# cd /tmp
# tar xzf /tmp/nvidia-peer-memory_1.1.orig.tar.gz
# cd nvidia-peer-memory-1.1
# dpkg-buildpackage -us -uc
# dpkg -i <path to generated deb files>
Please make sure this kernel module is installed and loaded on each of the GPU InfiniBand compute nodes.
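For example, the following can be used on a node to load the module and verify it is running (a minimal sketch; the nv_peer_mem service script is installed by the packages built above):
# modprobe nv_peer_mem
# service nv_peer_mem status
# lsmod | grep nv_peer_mem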
To install GPUDirect RDMA for MVAPICH2:
Download the gdrcopy library from https://github.com/NVIDIA/gdrcopy/archive/master.zip, and build it.
cd /opt/mvapich2/gdr/2.1/cuda7.0/gnu
unzip master.zip
cd /opt/mvapich2/gdr/2.1/cuda7.0/gnu/gdrcopy-master
make CUDA=/usr/local/cuda-7.0 all
Make sure nv_peer_mem is installed on all compute nodes:
service nv_peer_mem status
Make sure gdrcopy is installed on all compute nodes and load the module on each GPU node:
cd /opt/mvapich2/gdr/2.1/cuda7.0/gnu/gdrcopy-master
./insmod.sh
To install GPUDirect RDMA using GDRCopy:
GDRCopy is a fast copy library from NVIDIA, used to transfer data between host memory and GPU memory. Communication libraries (MVAPICH2, UCX) can take advantage of GDRCopy if it is available on the system.
For information on how to install GDRCopy, please refer to https://github.com/NVIDIA/gdrcopy.
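As a minimal build-and-load sketch, reusing the commands from the MVAPICH2 steps above (the paths and CUDA version are examples; adjust them to your system and gdrcopy version):
unzip master.zip
cd gdrcopy-master
make CUDA=/usr/local/cuda-7.0 all
./insmod.sh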
To check whether the GDRCopy kernel module (gdrdrv) is loaded, run:
# lsmod | grep gdrdrv
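When the module is loaded, the gdrdrv device node should also exist. The copybw test program (built by make all; its name and location vary across gdrcopy versions, so treat this as an assumption) offers a quick bandwidth sanity check:
# ls -l /dev/gdrdrv
# ./copybw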