Rivermax SDK
The Clara AGX Developer Kit can be used along with the NVIDIA Rivermax SDK to provide an extremely efficient network connection using the onboard ConnectX-6 network adapter that is further optimized for GPU workloads by using GPUDirect. This technology avoids unnecessary memory copies and CPU overhead by copying data directly to or from pinned GPU memory, and supports both the integrated GPU and the RTX6000 add-in dGPU.
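For illustration, the minimal CUDA sketch below (not taken from the Rivermax SDK) allocates the kind of device buffer that a GPUDirect-capable NIC can DMA received packets into directly; without GPUDirect, the data would first land in host memory and require an explicit cudaMemcpy into such a buffer. The buffer sizing shown is only an example.

// Minimal sketch: allocate a device buffer that, with GPUDirect, the NIC
// could write received packets into directly (no host staging copy).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Example sizing only: room for 8192 packets of 1462 bytes each.
    const size_t buffer_size = 8192ULL * 1462ULL;
    void *gpu_buffer = nullptr;

    cudaError_t err = cudaMalloc(&gpu_buffer, buffer_size);
    if (err != cudaSuccess) {
        std::printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Allocated %zu bytes of device memory at %p\n", buffer_size, gpu_buffer);

    // Without GPUDirect, received data would arrive in a host buffer and need
    // an explicit cudaMemcpy(gpu_buffer, host_buffer, ..., cudaMemcpyHostToDevice).
    cudaFree(gpu_buffer);
    return 0;
}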
The instructions below describe the steps required to install and test the Rivermax SDK with the Clara AGX Developer Kit. The test applications used by these instructions, generic_sender and generic_receiver, can then be used as samples in order to develop custom applications that use the Rivermax SDK to optimize data transfers using GPUDirect.
The Rivermax SDK may also be installed onto the Clara AGX Developer Kit via SDK Manager by selecting it as an additional SDK during the JetPack installation. If Rivermax SDK was previously installed by SDK Manager, many of these instructions can be skipped (see additional notes in the steps below).
Access to the Rivermax SDK Developer Program as well as a valid Rivermax software license is required to use the Rivermax SDK.
Installing OFED

The Mellanox OpenFabrics Enterprise Distribution for Linux (OFED) drivers must be installed in order to use the ConnectX-6 network adapter that is onboard the Clara AGX Developer Kit.
If Rivermax SDK was previously installed via SDK Manager, OFED will already be installed and these steps can be skipped.
Download OFED version 5.4-1.0.3.0:
MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu18.04-aarch64.tgz
If the above link does not work, navigate to the Downloads section on the main OFED page, select either Current Versions or Archive Versions to find version 5.4-1.0.3.0, select Ubuntu, Ubuntu 18.04, aarch64, then download the tgz file.
Note: Newer versions of OFED have not been tested and may not work.
Install OFED:
$ sudo apt install -y apt-utils
$ tar -xvf MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu18.04-aarch64.tgz
$ cd MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu18.04-aarch64
$ sudo ./mlnxofedinstall --force --force-fw-update --vma --add-kernel-support
$ sudo /etc/init.d/openibd restart
Installing GPUDirect

The GPUDirect drivers must be installed to enable the use of GPUDirect with the RTX6000 add-in dGPU. When using the iGPU, the CPU and GPU share unified memory and the GPUDirect drivers are not required, so this step may be skipped.
The GPUDirect drivers are not installed by SDK Manager, even when Rivermax SDK is installed, so these steps must always be followed to enable GPUDirect support when using the dGPU.
Download GPUDirect Drivers for OFED:
If the above link does not work, navigate to the Downloads section on the GPUDirect page.
Install GPUDirect:
$ mv nvidia-peer-memory_1.1.tar.gz nvidia-peer-memory_1.1.orig.tar.gz
$ tar -xvf nvidia-peer-memory_1.1.orig.tar.gz
$ cd nvidia-peer-memory-1.1
$ dpkg-buildpackage -us -uc
$ sudo dpkg -i ../nvidia-peer-memory_1.1-0_all.deb
$ sudo dpkg -i ../nvidia-peer-memory-dkms_1.1-0_all.deb
$ sudo service nv_peer_mem start
Verify the nv_peer_mem service is running:

$ sudo service nv_peer_mem status
Enable the nv_peer_mem service at boot time:

$ sudo systemctl enable nv_peer_mem
$ sudo /lib/systemd/systemd-sysv-install enable nv_peer_mem
Installing Rivermax SDK

If Rivermax SDK was previously installed via SDK Manager, the download and install steps (1 and 2) can be skipped. The Rivermax license must still be installed, however, so step 3 must still be followed.
Download version 1.8.21 or newer of the Rivermax SDK from the NVIDIA Rivermax SDK developer page.
Click Get Started and log in using your NVIDIA developer account.
Scroll down to Downloads and click I Agree To the Terms of the NVIDIA Rivermax Software License Agreement.
Select Rivermax SDK 1.8.21, Linux, then download rivermax_ubuntu1804_1.8.21.tar.gz. If a newer version is available, replace 1.8.21 in this and all following steps with the newer version that is available.
Install Rivermax SDK:
$ tar -xvf rivermax_ubuntu1804_1.8.21.tar.gz
$ sudo dpkg -i 1.8.21/Ubuntu.18.04/deb-dist/aarch64/rivermax_11.3.9.21_arm64.deb
Install Rivermax License
Using Rivermax requires a valid license, which can be purchased from the Rivermax Licenses page. Once the license file has been obtained, it must be placed onto the system using the following path:
/opt/mellanox/rivermax/rivermax.lic
Testing Rivermax and GPUDirect

Running the Rivermax sample applications requires two systems, a sender and a receiver, connected via ConnectX network adapters. If two Clara AGX Developer Kits are used then the onboard ConnectX-6 can be used on each system, but if only one Clara AGX is available then another system with an add-in ConnectX network adapter must be used. Rivermax supports a wide array of platforms, including both Linux and Windows, but these instructions assume that another Linux-based platform will be used as the sender device while the Clara AGX is used as the receiver.
Determine the logical name for the ConnectX devices that are used by each system. This can be done by using the lshw -class network command, finding the product: entry for the ConnectX device, and making note of the logical name: that corresponds to that device. For example, this output on a Clara AGX shows the onboard ConnectX-6 device using the enp9s0f0 logical name (lshw output shortened for demonstration purposes).

$ sudo lshw -class network
  *-network:0
       description: Ethernet interface
       product: MT28908 Family [ConnectX-6]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:09:00.0
       logical name: enp9s0f0
       version: 00
       serial: 48:b0:2d:13:9b:6b
       capacity: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list ethernet physical 1000bt-fd 10000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.4-1.0.3 duplex=full firmware=20.27.4006 (NVD0000000001) ip=10.0.0.2 latency=0 link=yes multicast=yes
       resources: iomemory:180-17f irq:33 memory:1818000000-1819ffffff
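If you prefer to discover interface names programmatically rather than reading lshw output, a small host-side helper such as the hypothetical sketch below (not part of the Rivermax samples) can enumerate the interface names; the ConnectX logical name (e.g. enp9s0f0) will appear among them.

// Hypothetical helper, not part of the Rivermax SDK: list network interface
// names so the ConnectX logical name can be found programmatically.
#include <cstdio>
#include <ifaddrs.h>
#include <sys/socket.h>

int main() {
    struct ifaddrs *ifaddr = nullptr;
    if (getifaddrs(&ifaddr) == -1) {
        std::perror("getifaddrs");
        return 1;
    }
    for (struct ifaddrs *ifa = ifaddr; ifa != nullptr; ifa = ifa->ifa_next) {
        // AF_PACKET entries describe the link-layer interfaces themselves.
        if (ifa->ifa_addr != nullptr && ifa->ifa_addr->sa_family == AF_PACKET) {
            std::printf("interface: %s\n", ifa->ifa_name);
        }
    }
    freeifaddrs(ifaddr);
    return 0;
}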
The instructions that follow will use the enp9s0f0 logical name for ifconfig commands, but these names should be replaced with the corresponding logical names as determined by this step.

Run the generic_sender application on the sending system.

Bring up the network:

$ sudo ifconfig enp9s0f0 up 10.0.0.1
Build the sample apps:

$ cd 1.8.21/apps
$ make
Note: The 1.8.21 path above corresponds to the path where the Rivermax SDK package was extracted in step 2 of the Installing Rivermax SDK section, above. If the Rivermax SDK was installed via SDK Manager, this path will be $HOME/Documents/Rivermax/1.8.21.
Launch the generic_sender application:

$ sudo ./generic_sender -l 10.0.0.1 -d 10.0.0.2 -p 5001 -y 1462 -k 8192 -z 500 -v
...
+#############################################
| Sender index: 0
| Thread ID: 0x7fa1ffb1c0
| CPU core affinity: -1
| Number of streams in this thread: 1
| Memory address: 0x7f986e3010
| Memory length: 59883520[B]
| Memory key: 40308
+#############################################
| Stream index: 0
| Source IP: 10.0.0.1
| Destination IP: 10.0.0.2
| Destination port: 5001
| Number of flows: 1
| Rate limit bps: 0
| Rate limit max burst in packets: 0
| Memory address: 0x7f986e3010
| Memory length: 59883520[B]
| Memory key: 40308
| Number of user requested chunks: 1
| Number of application chunks: 5
| Number of packets in chunk: 8192
| Packet's payload size: 1462
+**********************************************
Run the generic_receiver application on the receiving system.

Bring up the network:

$ sudo ifconfig enp9s0f0 up 10.0.0.2
Build the sample apps with GPUDirect support (CUDA=y):

$ cd 1.8.21/apps
$ make CUDA=y
Note: The 1.8.21 path above corresponds to the path where the Rivermax SDK package was extracted in step 2 of the Installing Rivermax SDK section, above. If the Rivermax SDK was installed via SDK Manager, this path will be $HOME/Documents/Rivermax/1.8.21.

Launch the generic_receiver application:

$ sudo ./generic_receiver -i 10.0.0.2 -m 10.0.0.2 -s 10.0.0.1 -p 5001 -g 0
...
Attached flow 1 to stream.
Running main receive loop...
Got 5877704 GPU packets | 68.75 Gbps during 1.00 sec
Got 5878240 GPU packets | 68.75 Gbps during 1.00 sec
Got 5878240 GPU packets | 68.75 Gbps during 1.00 sec
Got 5877704 GPU packets | 68.75 Gbps during 1.00 sec
Got 5878240 GPU packets | 68.75 Gbps during 1.00 sec
...
With both the generic_sender and generic_receiver processes active, the receiver will continue to print out received packet statistics every second. Both processes can then be terminated with <ctrl-c>.
GPUDirect is ideal for applications which receive data from the network adapter and then use the GPU to process the received data directly in GPU memory. The generic_sender and generic_receiver demo applications include a simple demonstration of the use of CUDA with received packets by using a CUDA kernel to compute and then compare a checksum of each packet against an expected checksum provided by the sender. The checksum packet included by the sender also contains a packet sequence number that is used by the receiver to detect when any packets are lost during transmission.
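The sketch below is only illustrative and is not the generic_receiver implementation: one CUDA thread per packet computes a simple byte-sum over a packet that already resides in GPU memory and compares it with an expected value supplied alongside the data. The buffer layout, checksum formula, and all names here are assumptions made for the example; consult the sample sources for the actual format.

// Illustrative only: per-packet checksum verification over packets that are
// already resident in GPU memory (e.g. delivered there via GPUDirect).
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void verify_checksums(const uint8_t *packets, size_t packet_size,
                                 const uint32_t *expected, int num_packets,
                                 unsigned int *error_count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_packets) return;

    // Simple byte-wise sum over this packet's payload (assumed checksum).
    const uint8_t *payload = packets + (size_t)idx * packet_size;
    uint32_t sum = 0;
    for (size_t i = 0; i < packet_size; ++i) {
        sum += payload[i];
    }

    // Count mismatches so the host can report checksum errors.
    if (sum != expected[idx]) {
        atomicAdd(error_count, 1u);
    }
}

int main() {
    const int num_packets = 1024;      // example values only
    const size_t packet_size = 1462;

    uint8_t *d_packets = nullptr;
    uint32_t *d_expected = nullptr;
    unsigned int *d_errors = nullptr;
    cudaMalloc(&d_packets, num_packets * packet_size);
    cudaMalloc(&d_expected, num_packets * sizeof(uint32_t));
    cudaMalloc(&d_errors, sizeof(unsigned int));

    // In a real receiver the packet buffer would be filled by the NIC via
    // GPUDirect; here it is just zero-initialized dummy data.
    cudaMemset(d_packets, 0, num_packets * packet_size);
    cudaMemset(d_expected, 0, num_packets * sizeof(uint32_t));
    cudaMemset(d_errors, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (num_packets + threads - 1) / threads;
    verify_checksums<<<blocks, threads>>>(d_packets, packet_size, d_expected,
                                          num_packets, d_errors);

    unsigned int errors = 0;
    cudaMemcpy(&errors, d_errors, sizeof(errors), cudaMemcpyDeviceToHost);
    std::printf("checksum errors: %u\n", errors);

    cudaFree(d_packets);
    cudaFree(d_expected);
    cudaFree(d_errors);
    return 0;
}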
To enable the CUDA checksum sample, append the -x parameter to the generic_sender and generic_receiver commands that are run above.
Due to the increased workload on the receiver when the checksum calculation is enabled, you will begin to see dropped packets and/or checksum errors if you try to maintain the same data rate from the sender as when the checksum was disabled (i.e. when all received packet data was simply discarded). Because of this, the sleep parameter used by the sender, -z, should be increased until there are no more dropped packets or checksum errors. In this example, the sleep parameter was increased from 500 to 40000 in order to ensure the receiver can receive and process the sent packets without any errors or loss:
[Sender]
$ sudo ./generic_sender -l 10.0.0.1 -d 10.0.0.2 -p 5001 -y 1462 -k 8192 -z 40000 -v -x
[Receiver]
$ sudo ./generic_receiver -i 10.0.0.2 -m 10.0.0.2 -s 10.0.0.1 -p 5001 -g 0 -x
...
Got 203968 GPU packets | 2.40 Gbps during 1.02 sec | 0 dropped packets | 0 checksum errors
Got 200632 GPU packets | 2.36 Gbps during 1.00 sec | 0 dropped packets | 0 checksum errors
Got 203968 GPU packets | 2.40 Gbps during 1.01 sec | 0 dropped packets | 0 checksum errors
Got 201608 GPU packets | 2.37 Gbps during 1.01 sec | 0 dropped packets | 0 checksum errors
If you would like to write an application that uses Rivermax and GPUDirect for CUDA data processing, refer to the source code for the generic_sender and generic_receiver applications included with the Rivermax SDK in generic_sender.cpp and generic_receiver.cpp, respectively.
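As a rough structural sketch, assuming the Rivermax receive path fills a GPU buffer as generic_receiver does, such an application typically alternates between receiving a chunk of packets into GPU memory and launching a CUDA kernel over that chunk. The receive_chunk_into_gpu_buffer() function below is a hypothetical placeholder, not a real SDK call; see generic_receiver.cpp for the actual Rivermax API usage.

// Structural sketch only: alternate between receiving a chunk of packets into
// a GPU buffer and processing that chunk in place with a CUDA kernel.
#include <cstdint>
#include <cuda_runtime.h>

__global__ void process_packets(const uint8_t *packets, size_t packet_size,
                                int num_packets) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_packets) {
        // Application-specific per-packet work on data already in GPU memory.
        volatile uint8_t first_byte = packets[(size_t)idx * packet_size];
        (void)first_byte;
    }
}

// Hypothetical placeholder: in a real application this would wrap the Rivermax
// calls (see generic_receiver.cpp) that deliver packets into the GPU buffer.
static int receive_chunk_into_gpu_buffer(uint8_t * /*gpu_buffer*/, int max_packets) {
    return max_packets;
}

int main() {
    const int packets_per_chunk = 8192;   // example values only
    const size_t packet_size = 1462;

    uint8_t *d_chunk = nullptr;
    cudaMalloc(&d_chunk, packets_per_chunk * packet_size);

    for (int i = 0; i < 10; ++i) {
        // 1. Packets arrive directly in d_chunk via GPUDirect (placeholder here).
        int received = receive_chunk_into_gpu_buffer(d_chunk, packets_per_chunk);

        // 2. Process the received packets in place with a CUDA kernel.
        int threads = 256;
        int blocks = (received + threads - 1) / threads;
        process_packets<<<blocks, threads>>>(d_chunk, packet_size, received);
        cudaDeviceSynchronize();
    }

    cudaFree(d_chunk);
    return 0;
}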
The CUDA checksum calculation in the generic_receiver is included only to show how the data received through GPUDirect can be processed through CUDA. This example is not optimized in any way, and should not be used as an example of how to write a high-performance CUDA application. Please refer to the CUDA Best Practices Guide for an introduction to optimizing CUDA applications.
Troubleshooting

If the driver installation or the sample applications do not work, check the following.
The ConnectX network adapter is recognized by the system. For example, on a Linux system using a ConnectX-6 Dx add-in PCI card:
$ lspci
...
0000:05:00.0 Ethernet controller: Mellanox Technologies MT28841
0000:05:00.1 Ethernet controller: Mellanox Technologies MT28841
...
If the network adapter is not recognized, try rebooting the system and/or reseating the card in the PCI slot.
The ConnectX network adapter is recognized by the OFED driver. For example, on a Linux system using a ConnectX-6 Dx add-in PCI card:
$ sudo mlxfwmanager
...
Device Type:      ConnectX6DX
Part Number:      MCX623106AC-CDA_Ax
Description:      ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Crypto and Secure Boot
PSID:             MT_0000000436
PCI Device Name:  /dev/mst/mt4125_pciconf0
Base GUID:        0c42a1030024053a
Base MAC:         0c42a124053a
Versions:         Current        Available
   FW             22.31.1014     N/A
   FW (Running)   22.30.1004     N/A
   PXE            3.6.0301       N/A
   UEFI           14.23.0017     N/A

If the device does not appear, first try rebooting and then reinstalling OFED as described above.
The sender and receiver systems can ping each other:
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.205 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.206 ms
...
If the systems cannot ping each other, try bringing up the network interfaces again using the ifconfig commands.

The nv_peer_mem service is running:

$ sudo service nv_peer_mem status
 * nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
   Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
   Active: active (exited) since Mon 2021-01-25 16:45:08 MST; 9min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 6847 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)

Jan 25 16:45:08 mccoy systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jan 25 16:45:08 mccoy nv_peer_mem[6847]: starting... OK
Jan 25 16:45:08 mccoy systemd[1]: Started LSB: Activates/Deactivates nv_peer_mem to \ start at boot time..
If the service is not running, try starting it again using sudo service nv_peer_mem start.