The platform and server requirements for GPUDirect RDMA are detailed in the following table:
| Platform | Type and Version |
|---|---|
| HCAs | |
| GPUs | |
| Software/Plugins | |
Once the NVIDIA software components are installed, verify that the GPUDirect kernel module is loaded on each compute system where you plan to run jobs that require GPUDirect RDMA. To do so, run:
service nv_peer_mem status
For other Linux flavors, run:
lsmod | grep nv_peer_mem
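The check above can be scripted so it works across Linux flavors without the service wrapper. This is a minimal sketch, assuming `lsmod` is available; it only reports the module state and changes nothing:

```shell
#!/bin/sh
# Sketch: report whether the nv_peer_mem kernel module is loaded.
# Uses lsmod directly, so it does not depend on the service wrapper.
if lsmod 2>/dev/null | grep -q '^nv_peer_mem'; then
    echo "nv_peer_mem is loaded"
else
    echo "nv_peer_mem is NOT loaded"
fi
```

A script like this is convenient to run via your job launcher across all compute nodes before submitting a GPUDirect RDMA job.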
Usually, this kernel module is loaded by default by the system startup service. If it is not loaded, GPUDirect RDMA will not work, resulting in much higher latency for message communications.
In this case, to start the module, run:
service nv_peer_mem start
Or for other Linux flavors, run:
modprobe nv_peer_mem
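The check-and-load steps above can be combined into one idempotent script. This is a sketch, assuming the NVIDIA peer-memory package is installed and the script is run as root; it skips the load when the module is already present:

```shell
#!/bin/sh
# Sketch: load nv_peer_mem if it is not already loaded.
# Requires root and an installed nv_peer_mem package.
if lsmod 2>/dev/null | grep -q '^nv_peer_mem'; then
    echo "nv_peer_mem already loaded"
elif [ "$(id -u)" -ne 0 ]; then
    echo "run as root to load nv_peer_mem"
else
    modprobe nv_peer_mem && echo "nv_peer_mem loaded" \
        || echo "failed to load nv_peer_mem"
fi
```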
To achieve the best performance for GPUDirect RDMA, both the HCA and the GPU must be physically located on the same PCIe I/O root complex.
For additional information on the system's architecture, either review the system manual, or run:
lspci -tv | grep NVIDIA
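To match the GPU against the HCA in the PCIe tree, it helps to list both device types with their bus addresses. This is a sketch, assuming `lspci` (from pciutils) is installed; the bus addresses it prints can then be located in the tree output of `lspci -tv` to confirm both devices sit under the same root complex:

```shell
#!/bin/sh
# Sketch: list NVIDIA GPUs and Mellanox HCAs with their PCIe bus addresses.
# Prints a notice instead of failing when no such devices are present.
lspci 2>/dev/null | grep -Ei 'nvidia|mellanox' \
    || echo "No NVIDIA/Mellanox devices found"
```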