Learn more about RDMA in the technology overview section.
There are two parts to enabling RDMA for Holoscan:
Skip to the next section if you do not plan to leverage a ConnectX SmartNIC.
The NVIDIA IGX Orin developer kit comes with an embedded ConnectX Ethernet adapter to offer advanced hardware offloads and accelerations. You can also purchase an individual ConnectX adapter and install it on other systems, such as x86_64 workstations.
The following steps are required to ensure your ConnectX can be used for RDMA over Converged Ethernet (RoCE):
Ensure the Mellanox OFED drivers version 23.10 or above are installed:
If not installed, or an older version is installed, you should install the appropriate version from the MLNX_OFED download page, or use the script below:
Ensure the drivers are loaded:
If nothing appears, run the following command:
The ConnectX SmartNIC can function in two separate modes (called link layer):
Holoscan does not support IB at this time (as it is not tested), so the ConnectX will need to use the ETH link layer.
To identify the current mode, run ibstat or ibv_devinfo and look for the Link Layer value. In the example below, the mlx5_0 interface is in Ethernet mode, while the mlx5_1 interface is in InfiniBand mode. Ignore the Transport value, which is always InfiniBand.
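The Link Layer check can also be scripted. The following is a minimal sketch, assuming ibstat prints a `CA '<name>'` header per device followed by a `Link layer:` line per port. The sample text is illustrative and trimmed, not captured from a real system, and the function names are hypothetical.

```python
import re

# Illustrative, trimmed ibstat-style text (an assumption, not real captured output).
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    Number of ports: 1
    Port 1:
        State: Down
        Link layer: Ethernet
CA 'mlx5_1'
    Number of ports: 1
    Port 1:
        State: Down
        Link layer: InfiniBand
"""


def link_layers(ibstat_text):
    """Map each device name to the last 'Link layer' value listed under it."""
    layers = {}
    device = None
    for line in ibstat_text.splitlines():
        ca = re.match(r"\s*CA '([^']+)'", line)
        if ca:
            device = ca.group(1)
            continue
        layer = re.match(r"\s*Link layer:\s*(\S+)", line)
        if layer and device is not None:
            layers[device] = layer.group(1)
    return layers


def devices_not_in_ethernet(ibstat_text):
    """Return the devices whose link layer is anything other than Ethernet."""
    layers = link_layers(ibstat_text)
    return sorted(name for name, layer in layers.items() if layer != "Ethernet")


if __name__ == "__main__":
    print(link_layers(SAMPLE_IBSTAT))
    print(devices_not_in_ethernet(SAMPLE_IBSTAT))
```

In practice the text would come from running ibstat on the target system. Any device reported by `devices_not_in_ethernet` would need its link layer switched before use with Holoscan.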
If no results appear after ibstat and sudo lsmod | grep ib_core returns a result like this:
Consider running the following command or rebooting:
To switch the link layer mode, there are two possible options:
On IGX Orin developer kits, you can switch that setting through the BIOS: see IGX Orin documentation.
On any system with a ConnectX (including IGX Orin developer kits), you can run the command below from a terminal (this will require a reboot). sudo ibdev2netdev -v is used to identify the PCI address of the ConnectX (either of the two interfaces is fine to use), and mlxconfig is used to apply the changes.
Note: LINK_TYPE_P1 and LINK_TYPE_P2 are for mlx5_0 and mlx5_1 respectively. You can choose to only set one of them. You can pass ETH or 2 for Ethernet mode, and IB or 1 for InfiniBand.
This is the output of the command above:
Device type:    ConnectX7
Name:           P3740-B0-QSFP_Ax
Description:    NVIDIA Prometheus P3740 ConnectX-7 VPI PCIe Switch Motherboard; 400Gb/s; dual-port QSFP; PCIe switch5.0 X8 SLOT0 ;X16 SLOT2; secure boot;
Device:         0005:03:00.0

Configurations:                 Next Boot       New
        LINK_TYPE_P1            ETH(2)          ETH(2)
        LINK_TYPE_P2            IB(1)           ETH(2)

Apply new Configuration? (y/n) [n] :
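The accepted LINK_TYPE values (ETH or 2, IB or 1) can be normalized before assembling the command. The sketch below only builds an argument list and does not run anything. It assumes the `mlxconfig -d <device> set NAME=VALUE` invocation form, and the helper names are hypothetical.

```python
# Accepted spellings from the note above: ETH or 2 for Ethernet, IB or 1 for InfiniBand.
_LINK_TYPES = {"ETH": "ETH", "2": "ETH", "IB": "IB", "1": "IB"}


def normalize_link_type(value):
    """Return 'ETH' or 'IB' for any accepted spelling, else raise ValueError."""
    key = str(value).strip().upper()
    if key not in _LINK_TYPES:
        raise ValueError(f"unsupported link type: {value!r}")
    return _LINK_TYPES[key]


def mlxconfig_set_args(pci_address, p1=None, p2=None):
    """Build (but do not run) an argv list; the invocation form is an assumption."""
    settings = []
    if p1 is not None:
        settings.append("LINK_TYPE_P1=" + normalize_link_type(p1))
    if p2 is not None:
        settings.append("LINK_TYPE_P2=" + normalize_link_type(p2))
    if not settings:
        raise ValueError("set at least one of p1 or p2")
    return ["sudo", "mlxconfig", "-d", pci_address, "set", *settings]


if __name__ == "__main__":
    # PCI address taken from the example output above.
    print(" ".join(mlxconfig_set_args("0005:03:00.0", p1="ETH", p2=2)))
```

Only one port can be passed if the other should be left unchanged, which matches the note that setting a single LINK_TYPE is allowed.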
First, identify the logical names of your ConnectX interfaces. Connecting a cable in just one of the interfaces on the ConnectX will help you identify which port is which (in the example below, only mlx5_1 – i.e., eth3 – is connected):
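For scripting the same lookup, here is a minimal sketch that parses ibdev2netdev-style text. It assumes the non-verbose line format `<device> port <n> ==> <interface> (<state>)`. The sample mirrors the case described above (only mlx5_1, i.e., eth3, connected), but the eth2 name and the text itself are illustrative, and the function names are hypothetical.

```python
import re

# Illustrative sample mirroring the case described above (not captured output).
SAMPLE_IBDEV2NETDEV = """\
mlx5_0 port 1 ==> eth2 (Down)
mlx5_1 port 1 ==> eth3 (Up)
"""

_LINE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)\s*$")


def parse_ibdev2netdev(text):
    """Return one dict per line that matches the assumed non-verbose format."""
    rows = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            device, port, interface, state = match.groups()
            rows.append(
                {"device": device, "port": int(port),
                 "interface": interface, "state": state}
            )
    return rows


def connected_interfaces(text):
    """Map device name to logical interface name for links reported as Up."""
    return {
        row["device"]: row["interface"]
        for row in parse_ibdev2netdev(text)
        if row["state"] == "Up"
    }


if __name__ == "__main__":
    print(connected_interfaces(SAMPLE_IBDEV2NETDEV))
```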
For IGX Orin Developer Kits with no live source to connect to the ConnectX QSFP ports, adding -v can show you which logical name is mapped to each specific port:
0005:03:00.0 is the QSFP port closer to the PCI slots
0005:03:00.1 is the QSFP port closer to the RJ45 Ethernet ports
If you have a cable connected but it does not show Up/Down in the output of ibdev2netdev, you can try to parse the output of dmesg instead. The example below shows that 0005:03:00.1 is plugged, and that it is associated with eth3:
The next step is to set a static IP on the interface you’d like to use so you can refer to it in your Holoscan applications (e.g., Emergent cameras, distributed applications…).
First, check if you already have an address set up. We’ll use the eth3 interface in this example for mlx5_1:
If nothing appears, or you’d like to change the address, you can set an IP and MTU (Maximum Transmission Unit) through the Network Manager user interface, CLI (nmcli), or other IP configuration tools. In the example below, we use ip (ifconfig is legacy) to configure the eth3 interface with an address of 192.168.1.1/24 and an MTU of 9000 (i.e., “jumbo frames”) to send Ethernet frames with a payload greater than the standard size of 1500 bytes:
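When the same settings need to be applied on several systems, the commands can be generated rather than typed. This is a minimal sketch that builds (but does not run) iproute2 argument lists for the values described above. The `ip link set dev ... mtu ...` and `ip addr add ... dev ...` forms are standard iproute2 usage. The helper name is hypothetical, and a static address set this way is not assumed to persist across reboots.

```python
import shlex


def ip_config_commands(interface, cidr, mtu):
    """Build (but do not run) the commands that set an MTU and a static address."""
    if not isinstance(mtu, int) or mtu <= 0:
        raise ValueError("mtu must be a positive integer")
    return [
        ["sudo", "ip", "link", "set", "dev", interface, "mtu", str(mtu)],
        ["sudo", "ip", "addr", "add", cidr, "dev", interface],
    ]


if __name__ == "__main__":
    # Values from the text above: eth3, 192.168.1.1/24, MTU 9000.
    for argv in ip_config_commands("eth3", "192.168.1.1/24", 9000):
        print(shlex.join(argv))
```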
If you are connecting the ConnectX to another ConnectX with a LinkX interconnect, do the same on the other system with an IP address on the same network segment.
For example, to communicate with 192.168.1.1/24 above (/24 -> 255.255.255.0 subnet mask), set up your other system with an IP address between 192.168.1.2 and 192.168.1.254, and the same /24 subnet mask.
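The subnet arithmetic above can be checked with Python's standard ipaddress module. This is a small sketch, and the `same_subnet` helper is a hypothetical name introduced for illustration.

```python
import ipaddress


def same_subnet(local_cidr, peer_cidr):
    """True if both interfaces share a network and use different addresses."""
    local = ipaddress.ip_interface(local_cidr)
    peer = ipaddress.ip_interface(peer_cidr)
    return local.network == peer.network and local.ip != peer.ip


if __name__ == "__main__":
    local = ipaddress.ip_interface("192.168.1.1/24")
    hosts = list(local.network.hosts())
    print(local.netmask)         # subnet mask for /24
    print(hosts[0], hosts[-1])   # first and last usable host addresses
    print(same_subnet("192.168.1.1/24", "192.168.1.2/24"))
    print(same_subnet("192.168.1.1/24", "192.168.2.2/24"))
```

Any peer address for which `same_subnet` returns True can reach 192.168.1.1 directly over the link without a router.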
GPUDirect RDMA is only supported on NVIDIA’s Quadro/workstation GPUs (not GeForce).
Follow the instructions below to enable GPUDirect RDMA:
On dGPU, the GPUDirect RDMA drivers are named nvidia-peermem, and are installed with the rest of the NVIDIA dGPU drivers.
To enable the use of GPUDirect RDMA with a ConnectX SmartNIC (section above), the following steps are required if the MOFED drivers were installed after the peermem drivers:
Load the peermem kernel module manually:
Run the following to load it automatically during boot:
The instructions below describe the steps to test GPUDirect using the
Rivermax SDK. The test applications used by these instructions, generic_sender and generic_receiver, can
then be used as samples in order to develop custom applications that use the
Rivermax SDK to optimize data transfers.
The Linux default path where Rivermax expects to find the license file is /opt/mellanox/rivermax/rivermax.lic, or you can specify the full path and file name for the environment variable RIVERMAX_LICENSE_PATH.
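The lookup described above can be mirrored in a script that checks for the license before launching an application. This is a minimal sketch, assuming the RIVERMAX_LICENSE_PATH variable takes precedence over the default path when it is set. The function name is hypothetical and not part of the Rivermax SDK.

```python
import os

DEFAULT_LICENSE_PATH = "/opt/mellanox/rivermax/rivermax.lic"


def rivermax_license_path(environ=None):
    """Return the license path a script should check (precedence is assumed)."""
    env = os.environ if environ is None else environ
    return env.get("RIVERMAX_LICENSE_PATH") or DEFAULT_LICENSE_PATH


if __name__ == "__main__":
    path = rivermax_license_path()
    status = "found" if os.path.isfile(path) else "missing"
    print(f"{path}: {status}")
```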
If manually installing the Rivermax SDK from the link above, please note there is no need to follow the steps for installing MLNX_OFED/MLNX_EN in the Rivermax documentation.
Running the Rivermax sample applications requires two systems, a sender and a receiver, connected via ConnectX network adapters. If two Developer Kits are used, then the onboard ConnectX can be used on each system. However, if only one Developer Kit is available, another system with an add-in ConnectX network adapter is needed. Rivermax supports a wide array of platforms, including both Linux and Windows, but these instructions assume that another Linux-based platform will be used as the sender device while the Developer Kit is used as the receiver.
The $rivermax_sdk variable referenced below corresponds to the path where the Rivermax SDK
package is installed. If the Rivermax SDK was installed via SDK Manager, this path will be:
If the Rivermax SDK was installed via a manual download, make sure to export your path to the SDK:
The install path might differ in future versions of Rivermax.
Determine the logical name for the ConnectX devices that are used by each
system. This can be done by using the lshw -class network command,
finding the product: entry for the ConnectX device, and making note
of the logical name: that corresponds to that device. For example,
this output on a Developer Kit shows the onboard ConnectX device using the
enp9s0f01 logical name (lshw output shortened for
demonstration purposes).
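The same lookup can be automated by parsing the lshw text. This is a minimal sketch, assuming each device entry starts with a `*-` line and lists `key: value` fields. The sample text, including the product strings, is illustrative rather than real output, and the function names are hypothetical.

```python
# Illustrative, shortened lshw-style text (product strings are assumptions).
SAMPLE_LSHW = """\
  *-network:0
       description: Ethernet interface
       product: Example Family [ConnectX-7]
       vendor: Mellanox Technologies
       logical name: enp9s0f01
  *-network:1
       description: Ethernet interface
       product: Example Onboard Controller
       logical name: eth0
"""


def parse_entries(lshw_text):
    """Split lshw-style text into one dict of fields per '*-' entry."""
    entries = []
    for raw in lshw_text.splitlines():
        line = raw.strip()
        if line.startswith("*-"):
            entries.append({})
        elif ": " in line and entries:
            key, value = line.split(": ", 1)
            entries[-1][key] = value
    return entries


def connectx_logical_names(lshw_text):
    """Return logical names of entries whose product mentions ConnectX."""
    return [
        entry["logical name"]
        for entry in parse_entries(lshw_text)
        if "ConnectX" in entry.get("product", "") and "logical name" in entry
    ]


if __name__ == "__main__":
    print(connectx_logical_names(SAMPLE_LSHW))
```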
b. Build the sample apps:
e. Launch the generic_sender application:
which gives
Run the generic_receiver application on the receiving system.
a. Bring up the network:
b. Build the generic_receiver app with GPUDirect support from the Rivermax GitHub Repo. Before following the instructions to build with CUDA Toolkit support, apply the changes to the file generic_receiver/generic_receiver.cpp in this PR. This was tested on the NVIDIA IGX Orin Developer Kit with Rivermax 1.31.10.
c. Launch the generic_receiver application from the build directory:
With both the generic_sender and generic_receiver processes
active, the receiver will continue to print out received packet statistics
every second. Both processes can then be terminated with <ctrl-c>.