GPUNetIO Sample Guide
This section contains two samples that show how to enable simple GPUNetIO features. Be sure to correctly set the following environment variables:
Environment variables
export PATH=${PATH}:/usr/local/cuda/bin
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"
export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/usr/lib/pkgconfig:/opt/mellanox/grpc/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu/pkgconfig
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64:/opt/mellanox/gdrcopy/src:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu
All the DOCA samples described in this section are governed under the BSD-3 software license agreement.
Please ensure the architecture of your GPU is included in the meson.build file before building the samples (e.g., sm_80 for Ampere, sm_89 for L40, sm_90 for H100, etc.).
The sample shows how to enable the Accurate Send Scheduling (or wait-on-time) feature in the context of a GPUNetIO application. Accurate Send Scheduling is the ability of an NVIDIA NIC to send packets at a future time according to application-provided timestamps.
This feature is supported on ConnectX-7 and later.
This sample demonstrates how to send packets from the GPU using Accurate Send Scheduling by calling the high-level doca_gpu_dev_eth_txq_wait_send function with a BLOCK execution scope.
This NVIDIA blog post offers an example of how this feature has been used in 5G networks.
Synchronizing Clocks
Before starting the sample, it is important to properly synchronize the CPU clock with the NIC clock so that timestamps provided by the system clock match the time on the NIC.
For this purpose, at least the phc2sys service must be used. To install it on an Ubuntu system:
phc2sys
sudo apt install linuxptp
To start the phc2sys service properly, a config file must be created in /lib/systemd/system/phc2sys.service. Assuming the network interface is ens6f0:
phc2sys
[Unit]
Description=Synchronize system clock or PTP hardware clock (PHC)
Documentation=man:phc2sys
[Service]
Restart=always
RestartSec=5s
Type=simple
ExecStart=/bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp$(ethtool -T ens6f0 | grep PTP | awk '{print $4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
[Install]
WantedBy=multi-user.target
Now the phc2sys service can be started:
phc2sys
sudo systemctl stop systemd-timesyncd
sudo systemctl disable systemd-timesyncd
sudo systemctl daemon-reload
sudo systemctl start phc2sys.service
To check the status of phc2sys:
phc2sys
$ sudo systemctl status phc2sys.service
Output:
phc2sys
● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
Loaded: loaded (/lib/systemd/system/phc2sys.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 10:59:13 UTC; 2 days ago
Docs: man:phc2sys
Main PID: 337824 (sh)
Tasks: 2 (limit: 303788)
Memory: 560.0K
CPU: 52min 8.199s
CGroup: /system.slice/phc2sys.service
├─337824 /bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T enp23s0f1np1 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R >
└─337829 /usr/sbin/phc2sys -s /dev/ptp3 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256
Apr 05 16:35:52 doca-vr-045 phc2sys[337829]: [457395.040] CLOCK_REALTIME rms 8 max 18 freq +110532 +/- 27 delay 770 +/- 3
Apr 05 16:35:53 doca-vr-045 phc2sys[337829]: [457396.071] CLOCK_REALTIME rms 8 max 20 freq +110513 +/- 30 delay 769 +/- 3
Apr 05 16:35:54 doca-vr-045 phc2sys[337829]: [457397.102] CLOCK_REALTIME rms 8 max 18 freq +110527 +/- 30 delay 769 +/- 3
Apr 05 16:35:55 doca-vr-045 phc2sys[337829]: [457398.130] CLOCK_REALTIME rms 8 max 18 freq +110517 +/- 31 delay 769 +/- 3
Apr 05 16:35:56 doca-vr-045 phc2sys[337829]: [457399.159] CLOCK_REALTIME rms 8 max 19 freq +110523 +/- 32 delay 770 +/- 3
Apr 05 16:35:57 doca-vr-045 phc2sys[337829]: [457400.191] CLOCK_REALTIME rms 8 max 20 freq +110528 +/- 33 delay 770 +/- 3
Apr 05 16:35:58 doca-vr-045 phc2sys[337829]: [457401.221] CLOCK_REALTIME rms 8 max 19 freq +110512 +/- 38 delay 770 +/- 3
Apr 05 16:35:59 doca-vr-045 phc2sys[337829]: [457402.253] CLOCK_REALTIME rms 9 max 20 freq +110538 +/- 47 delay 770 +/- 4
Apr 05 16:36:00 doca-vr-045 phc2sys[337829]: [457403.281] CLOCK_REALTIME rms 8 max 21 freq +110517 +/- 38 delay 769 +/- 3
Apr 05 16:36:01 doca-vr-045 phc2sys[337829]: [457404.311] CLOCK_REALTIME rms 8 max 17 freq +110526 +/- 26 delay 769 +/- 3
...
At this point, the system and NIC clocks are synchronized so timestamps provided by the CPU are correctly interpreted by the NIC.
The timestamps you get may not reflect the actual time of day. To get that, you must properly configure the ptp4l service with an external grandmaster on the system, which is beyond the scope of this sample.
Running the Sample
To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:
Build the sample
# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_send_wait_time
meson build
ninja -C build
The sample sends 8 bursts of 32 raw Ethernet packets of 1 kB each to a dummy Ethernet address, 10:11:12:13:14:15, in a timed way: the NIC is programmed to send a burst every t nanoseconds (command-line option -t).
The following example programs a system with GPU PCIe address ca:00.0 and NIC PCIe address 17:00.0 to send 32 packets every 5 milliseconds:
Run
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ sudo ./build/doca_gpunetio_send_wait_time -n 17:00.0 -g ca:00.0 -t 5000000
[09:22:54:165778][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:195][main] Starting the sample
[09:22:54:438260][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:224][main] Sample configuration:
GPU ca:00.0
NIC 17:00.0
Timeout 5000000ns
EAL: Detected CPU lcores: 128
...
EAL: Probe PCI driver: mlx5_pci (15b3:a2d6) device: 0000:17:00.0 (socket 0)
[09:22:54:819996][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:607][gpunetio_send_wait_time] Wait on time supported mode: DPDK
EAL: Probe PCI driver: gpu_cuda (10de:20b5) device: 0000:ca:00.0 (socket 1)
[09:22:54:830212][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:252][create_tx_buf] Mapping send queue buffer (0x0x7f48e32a0000 size 262144B) with legacy nvidia-peermem mode
[09:22:54:832462][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:657][gpunetio_send_wait_time] Launching CUDA kernel to send packets
[09:22:54:842945][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:664][gpunetio_send_wait_time] Waiting 10 sec for 256 packets to be sent
[09:23:04:883309][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:684][gpunetio_send_wait_time] Sample finished successfully
[09:23:04:883339][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:239][main] Sample finished successfully
To verify that packets are actually sent at the right time, use a packet sniffer on the other side (e.g., tcpdump):
tcpdump
$ sudo tcpdump -i enp23s0f1np1 -A -s 64
17:12:23.480318 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
....
17:12:23.480368 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
# end of first burst of 32 packets, bump to +5ms
17:12:23.485321 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
...
17:12:23.485369 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
# end of second burst of 32 packets, bump to +5ms
17:12:23.490278 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
...
The output should show a jump of approximately 5 milliseconds every 32 packets.
tcpdump may increase latency in sniffing packets and reporting the receive timestamp, so the difference between bursts of 32 packets reported may be less than expected, especially with small interval times like 500 microseconds (-t 500000).
This sample application demonstrates the fundamental steps to build a DOCA GPUNetIO receiver application. It creates one queue for UDP packets and uses a single CUDA kernel to receive those packets from the GPU.
The sample uses the high-level doca_gpu_dev_eth_rxq_recv function, and its execution scope (thread, warp, or block) can be set at runtime using the -e command-line parameter.
Invoking printf from a CUDA kernel is not good practice for release software as it slows down the kernel's overall execution. It should only be used for debugging. To enable packet info printing in this sample, the DOCA_GPUNETIO_SIMPLE_RECEIVE_DEBUG macro must be set to 1.
Build Instructions
Build the sample
# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_receive
meson build
ninja -C build
Execution and Testing
This guide assumes a two-machine setup:
Receiver Machine: Runs the doca_gpunetio_simple_receive application.
Packet Generator Machine: Uses an application like nping to send UDP packets.
Packet Generator Machine Example
On the packet generator machine, use nping to send 10 UDP packets to the receiver's IP address (assumed to be 192.168.1.1 in this example).
Command:
nping generator
$ nping --udp -c 10 -p 2090 192.168.1.1 --data-length 1024 --delay 500ms
Output:
Starting Nping 0.7.80 ( https://nmap.org/nping ) at 2023-11-20 11:05 UTC
SENT (0.0018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (0.5018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.0025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.5025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.0032s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.5033s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.0040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.5040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.0047s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.5048s) UDP packet with 1024 bytes to 192.168.1.1:2090
Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
UDP packets sent: 10 | Rcvd: 0 | Lost: 10 (100.00%)
Nping done: 1 IP address pinged in 5.50 seconds
Receiver Machine Example
On the receiver machine, run the sample with the appropriate PCI addresses for the NIC and GPU. This example uses BLOCK scope (-e 2).
Command:
DOCA Simple Receive
# Ensure DOCA is in the LD_LIBRARY_PATH environment variable
$ sudo ./doca_gpunetio_simple_receive -n 9f:00.0 -g 8a:00.0 -e 2
Output:
DOCA Simple Receive
[2025-10-27 00:52:32:387590][3382972416][DOCA][INF][doca_log.cpp:633] DOCA version 3.2.0111
[2025-10-27 00:52:32:387627][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:198][main] Starting the sample
[2025-10-27 00:52:32:681807][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:240][main] Sample configuration:
GPU 8a:00.0
NIC 9f:00.0
Shared QP exec scope Block
[2025-10-27 00:52:32:687128][3382972416][DOCA][WRN][engine_model.c:90] adapting queue depth to 128.
[2025-10-27 00:52:32:753758][3382972416][DOCA][WRN][hws_port.c:864] ARGUMENT_256B resource doens't exist, skip creating NAT64 actions
[2025-10-27 00:52:32:755527][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:479][create_rxq] Creating Sample Eth Rxq
[2025-10-27 00:52:32:755713][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:544][create_rxq] Mapping receive queue buffer (0x0x7eef8e000000 size 33554432B dmabuf fd 43) with dmabuf mode
[2025-10-27 00:52:32:795823][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:717][gpunetio_simple_receive] Launching CUDA kernel to receive packets
[2025-10-27 00:52:32:799403][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:721][gpunetio_simple_receive] Waiting for termination
# Type Ctrl+C to kill the sample
[2025-10-27 00:53:35:034046][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:61][signal_handler] Signal 2 received, preparing to exit!
Exiting from simple receive sample. Total number of received packets: 10
[2025-10-27 00:53:35:034322][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:395][destroy_rxq] Destroying Rxq
[2025-10-27 00:53:35:061065][3382972416][DOCA][WRN][hws_group_pool.c:85] group_pool has 1 used groups
[2025-10-27 00:53:35:845568][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:738][gpunetio_simple_receive] Sample finished successfully
[2025-10-27 00:53:35:845583][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:259][main] Sample finished successfully
This sample implements a simple GPU Ethernet packet generator that constantly sends a flow of raw Ethernet packets. It demonstrates two different implementation "flavors" for sending packets: one using the low-level API and one using the high-level API.
The behavior is controlled by the -q command-line option:
-q 0 (Low-level API): Disables the shared queue feature and executes using the low-level send functions.
-q 1 (High-level API): Enables the shared queue feature. When this is set, the execution scope (thread, warp, or block) can also be chosen using the -e option.
Build Instructions
Build the sample
# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_send
meson build
ninja -C build
Execution Example
Assuming a system with a NIC at PCIe address 9f:00.0 and a GPU at 8a:00.0, the following command launches the sample to send 1024-byte packets (-s 1024) using 1024 CUDA threads (-t 1024). It enables the high-level API (-q 1) with a BLOCK execution scope (-e 2).
Command:
DOCA Simple Send
# Ensure DOCA is in the LD_LIBRARY_PATH environment variable
$ sudo ./doca_gpunetio_simple_send -n 9f:00.0 -g 8a:00.0 -s 1024 -t 1024 -q 1 -e 2
Output:
[2025-10-28 04:38:10:548387][961593344][DOCA][INF][doca_log.cpp:645] DOCA version 3.3.0010
[2025-10-28 04:38:10:548420][961593344][DOCA][INF][gpunetio_simple_send_main.c:363][main] Starting the sample
[2025-10-28 04:38:10:839382][961593344][DOCA][INF][gpunetio_simple_send_main.c:433][main] Sample configuration:
GPU 8a:00.0
NIC 9f:00.0
Packet size 1024
CUDA threads 1024
CPU Proxy No
Shared QP Yes
Shared QP exec scope Block
[2025-10-28 04:38:10:911219][961593344][DOCA][INF][gpunetio_simple_send_sample.c:291][create_txq] Creating Sample Eth Txq
[2025-10-28 04:38:10:916434][961593344][DOCA][INF][gpunetio_simple_send_sample.c:429][create_txq] Mapping send queue buffer (0x0x7fd718800000 size 1048576B dmabuf fd 45) with dmabuf mode
[2025-10-28 04:38:10:917479][961593344][DOCA][INF][gpunetio_simple_send_sample.c:580][gpunetio_simple_send] Launching CUDA kernel to send packets.
[2025-10-28 04:38:10:920972][961593344][DOCA][INF][gpunetio_simple_send_sample.c:584][gpunetio_simple_send] Waiting for ctrl+c termination
# Type Ctrl+C to kill the sample
[2025-10-28 04:38:22:681160][961593344][DOCA][INF][gpunetio_simple_send_sample.c:67][signal_handler] Signal 2 received, preparing to exit!
[2025-10-28 04:38:22:681176][961593344][DOCA][INF][gpunetio_simple_send_sample.c:590][gpunetio_simple_send] Exiting from sample
[2025-10-28 04:38:22:681387][961593344][DOCA][INF][gpunetio_simple_send_sample.c:204][destroy_txq] Destroying Txq
[2025-10-28 04:38:22:990964][961593344][DOCA][INF][gpunetio_simple_send_sample.c:612][gpunetio_simple_send] Sample finished successfully
[2025-10-28 04:38:22:990980][961593344][DOCA][INF][gpunetio_simple_send_main.c:457][main] Sample finished successfully
Performance Verification
To verify that packets are being sent, you can check the traffic throughput on the same machine using the mlnx_perf command. (Assuming the NIC at 9f:00.0 has the interface name ens6f0np0):
mlnx_perf
$ mlnx_perf -i ens6f0np0
tx_vport_unicast_packets: 21,033,677
tx_vport_unicast_bytes: 21,538,485,248 Bps = 172,307.88 Mbps
tx_packets_phy: 21,033,286
tx_bytes_phy: 21,622,256,208 Bps = 172,978.4 Mbps
tx_prio0_bytes: 21,617,084,940 Bps = 172,936.67 Mbps
tx_prio0_packets: 21,028,292
UP 0: 172,936.67 Mbps = 100.00%
UP 0: 21,028,292 Tran/sec = 100.00%
Throughput will vary between different systems based on factors like the PCIe connection between the GPU and NIC, the NIC model, and the GPU model. This sample can be used to evaluate your system's send performance by varying the packet size and number of CUDA threads.
This sample demonstrates how to use the GPUNetIO RDMA API to receive and send/write with immediate using a single RDMA queue.
The server has a GPU buffer array A composed of GPU_BUF_NUM doca_gpu_buf elements, each 1 kB in size. The client has two GPU buffer arrays, B and C, each composed of GPU_BUF_NUM doca_gpu_buf elements, each 512 B in size.
The goal is for the client to fill a single server buffer of 1kB with two GPU buffers of 512B as illustrated in the following figure:
To show how to use RDMA write and send, even buffers are sent from the client with write immediate, while odd buffers are sent with send immediate. In both cases, the server must pre-post the RDMA receive operations.
For each buffer, the CUDA kernel code repeats the handshake:
Once all buffers are filled, the server double-checks that all values are valid. The server output should be as follows:
DOCA RDMA Server side
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0
[14:11:43:000930][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:250][main] Starting the sample
...
[14:11:43:686610][1173110][DOCA][INF][rdma_common.c:91][oob_connection_server_setup] Listening for incoming connections
[14:11:45:681523][1173110][DOCA][INF][rdma_common.c:105][oob_connection_server_setup] Client connected at IP: 192.168.2.28 and port: 46274
...
[14:11:45:771807][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:644][rdma_write_server] Before launching CUDA kernel, buffer array A is:
[14:11:45:771822][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 0 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771837][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 1 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771851][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 2 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771864][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 3 -> offset 0: 1111 | offset 128: 1111
RDMA Recv 2 ops completed with immediate values 0 and 1!
RDMA Recv 2 ops completed with immediate values 1 and 2!
RDMA Recv 2 ops completed with immediate values 2 and 3!
RDMA Recv 2 ops completed with immediate values 3 and 4!
[14:11:45:781561][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:671][rdma_write_server] After launching CUDA kernel, buffer array A is:
[14:11:45:781574][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 0 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781583][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 1 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781593][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 2 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781602][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 3 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781640][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:294][main] Sample finished successfully
On the other side, assuming the server is at IP address 192.168.2.28, the client output should be as follows:
DOCA RDMA Client side
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0 -c 192.168.2.28
[16:08:22:335744][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:197][main] Starting the sample
...
[16:08:25:753316][160913][DOCA][INF][rdma_common.c:147][oob_connection_client_setup] Connected with server successfully
......
Client waiting on flag 7f6596735000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 0
Client waiting on flag 7f6596735001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 1
Client waiting on flag 7f6596735002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 2
Client waiting on flag 7f6596735003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 3
[16:08:25:853454][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:241][main] Sample finished successfully
With RDMA, the network device must be specified by name (e.g., mlx5_0) instead of by PCIe address (as is the case for Ethernet).
It is also possible to enable the RDMA CM mode, establishing two connections with the same RDMA GPU handler. An example on the client side:
DOCA RDMA Client side with CM
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/samples/doca_gpunetio_rdma_client_server_write -d mlx5_0 -gpu 17:00.0 -gid 3 -c 10.137.189.28 -cm --server-addr-type ipv4 --server-addr 192.168.2.28
[11:30:34:489781][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:461][main] Starting the sample
...
[11:30:35:038828][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:950][rdma_write_client] Client is waiting for a connection establishment
[11:30:35:082039][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:963][rdma_write_client] Client - Connection 1 is established
...
[11:30:35:095282][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1006][rdma_write_client] Establishing connection 2..
[11:30:35:097521][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1016][rdma_write_client] Client is waiting for a connection establishment
[11:30:35:102718][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1029][rdma_write_client] Client - Connection 2 is established
[11:30:35:102783][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1046][rdma_write_client] Client, terminate kernels
Client waiting on flag 7f16067b5000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 1
Client waiting on flag 7f16067b5001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 2
Client waiting on flag 7f16067b5002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 3
Client waiting on flag 7f16067b5003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 4
Client posted and completed 4 RDMA commits on connection 0. Waiting on the exit flag.
Client waiting on flag 7f16067b5000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 1
Client waiting on flag 7f16067b5001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 2
Client waiting on flag 7f16067b5002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 3
Client waiting on flag 7f16067b5003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 4
Client posted and completed 4 RDMA commits on connection 1. Waiting on the exit flag.
[11:30:35:122448][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:512][main] Sample finished successfully
When using RDMA CM, the command-line option -cm must also be specified on the server side.
Printing from a CUDA kernel is not recommended for performance reasons. It may make sense for debugging purposes and for simple samples like this one.
The doca_gpunetio_verbs_* examples demonstrate how to use the GPUNetIO Verbs API in various scenarios. These samples require a client-server setup, with the client needing the -c <server IP> parameter.
The following parameters are supported by all Verbs samples:
-n: Network card handler type (0: AUTO, 1: CPU Proxy, 2: GPU DB). Default is 0 (AUTO).
-e: Execution mode for shared QP (0: per-thread, 1: per-warp). Default is 0 (per-thread).
-d: Network card device name.
-g: GPU device PCIe address.
-gid: GID index for DOCA RDMA (optional).
-i: Number of iterations (optional).
-t: Number of CUDA threads (optional).
Please read the specific sample section below to check for extra parameters.
For the examples in this guide, it is assumed that the GPU PCIe address is 8A:00.0 and the network card PCIe address is specified accordingly.
If the samples are running over a RoCE connection (e.g., the ConnectX/BlueField is set to Ethernet mode instead of InfiniBand mode), you may get this (or a similar) error: FW failed to modify object, status=BAD_PARAM_ERR (0x3), syndrome=0x1f3b5d.
To fix it, please remove the doca_verbs_ah_attr_set_dlid function from the connect_verbs_qp function in the samples/doca_gpunetio/verbs_common.c file.
Bandwidth Samples
The samples ending with _bw measure the bandwidth of specific GPUNetIO Verbs API functions. In these samples, the client prepares and sends data from a CUDA kernel, while the server waits on the CPU to receive the data; upon a Ctrl+C signal, it validates the data and reports the outcome.
Key Characteristics
The client outputs the MB/s achieved for preparing and sending messages of various sizes.
The server outputs a message indicating the execution outcome.
All bandwidth samples support the previously listed command-line parameters.
Simplifying QP/CQ/UAR Creation
To simplify the integration of DOCA RDMA Verbs with GPUNetIO, high-level functions like doca_gpu_verbs_create_qp_hl() and doca_gpu_verbs_create_qp_group_hl() are provided in samples/doca_gpunetio/verbs_high_level.cpp. These functions encapsulate the necessary steps for creating QP/CQ/UAR, making it easier for developers to combine DOCA RDMA Verbs and GPUNetIO in their applications.
doca_gpunetio_verbs_write_bw
The doca_gpunetio_verbs_write_bw test measures the bandwidth of RDMA Write operations using the GPUNetIO Verbs API. It launches a CUDA kernel with 1 block and 512 threads by default, where all threads post RDMA Write WQEs in different positions and the last thread submits them. The test then polls the CQE corresponding to the last WQE to ensure all previous WQEs have been executed correctly.
Example command lines:
Server:
doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_bw
The doca_gpunetio_verbs_put_bw test measures the bandwidth of RDMA Write operations (referred to as "Put") using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads. The test can also measure individual function latencies by setting the KERNEL_DEBUG_TIMES macro to 1.
Example command lines:
Server:
doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_signal_bw
The doca_gpunetio_verbs_put_signal_bw test measures the bandwidth of Put + Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.
Example command lines:
Server:
doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:
do {
final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));
doca_gpunetio_verbs_put_counter_bw
The doca_gpunetio_verbs_put_counter_bw test measures the bandwidth of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.
Example command lines:
Server:
doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:
do {
final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));
doca_gpunetio_verbs_twosided_bw
The doca_gpunetio_verbs_twosided_bw test measures the bandwidth of client-server data exchange via Send/Recv operations using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads.
Example command lines:
Server:
doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_get_bw
This sample application measures the bandwidth of one-sided RDMA Read ("Get") operations. It uses the GPUNetIO Verbs API with the shared QP feature.
The application launches a CUDA kernel configured with 2 blocks, each containing 256 threads.
Example usage:
Server:
doca_gpunetio_verbs_get_bw -g 8A:00.0 -d mlx5_0
Client (connects to the server's IP, e.g., 192.168.1.63; additional CLI options can be added):
doca_gpunetio_verbs_get_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
Latency Samples
The samples ending with _lat measure the latency of specific GPUNetIO Verbs API functions by performing a ping-pong exchange between client and server. These tests launch a CUDA kernel with a single CUDA thread and do not support the -e and -t command-line options.
Both client and server output the round-trip time (RTT) latency (half and full) in microseconds for preparing and exchanging messages of different sizes. The server also outputs a message indicating the execution outcome.
doca_gpunetio_verbs_write_lat
The doca_gpunetio_verbs_write_lat test measures the latency of RDMA Write operations using the GPUNetIO Verbs API without the shared QP feature. It is similar to perftest.
This sample supports the reliable doorbell feature with the command-line option -r <0|1>.
Example command lines:
Server:
doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_signal_lat
The doca_gpunetio_verbs_put_signal_lat test measures the latency of Put and Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It uses high-level API functions from doca_gpunetio_dev_verbs_onesided.cuh and launches a CUDA kernel with a single CUDA thread.
Example command lines:
Server:
doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_counter_lat
The doca_gpunetio_verbs_put_counter_lat test measures the latency of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It utilizes high-level API functions from doca_gpunetio_dev_verbs_counter.cuh and the Core Direct counter feature, even with a single CUDA thread.
This sample supports the reliable doorbell feature with the command-line option -r <0|1>.
Example command lines:
Server:
doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
This sample demonstrates how to use the DOCA DMA and DOCA GPUNetIO libraries to DMA-copy a memory buffer from the CPU to the GPU (with DOCA DMA CPU functions) and from the GPU to the CPU (with DOCA GPUNetIO DMA device functions) from a CUDA kernel. This sample requires a DPU, as it uses the DPU's DMA engine.
DOCA DMA Memcpy
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_dma_memcpy
# Build the sample and then execute
$ ./build/doca_gpunetio_dma_memcpy -g 17:00.0 -n ca:00.0
[15:44:04:189462][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:164][main] Starting the sample
EAL: Detected CPU lcores: 64
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Selected IOVA mode 'VA'
EAL: No free 2048 kB hugepages reported on node 0
EAL: No free 2048 kB hugepages reported on node 1
EAL: VFIO support initialized
TELEMETRY: No legacy callbacks, legacy socket not created
EAL: Probe PCI driver: gpu_cuda (10de:2331) device: 0000:17:00.0 (socket 0)
[15:44:04:857251][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:211][init_sample_mem_objs] The CPU source buffer value to be copied to GPU memory: This is a sample piece of text from CPU
[15:44:04:857359][862197][DOCA][WRN][doca_mmap.cpp:1743][doca_mmap_set_memrange] Mmap 0x55aec6206140: Memory range isn't cache-line aligned - addr=0x55aec52ceb10. For best performance align address to 64B
[15:44:04:858839][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:158][init_sample_mem_objs] The GPU source buffer value to be copied to CPU memory: This is a sample piece of text from GPU
[15:44:04:921702][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:570][submit_dma_memcpy_task] Success, DMA memcpy job done successfully
CUDA KERNEL INFO: The GPU destination buffer value after the memcpy: This is a sample piece of text from CPU
CPU received message from GPU: This is a sample piece of text from GPU
[15:44:04:930087][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:364][gpu_dma_cleanup] Cleanup DMA ctx with GPU data path
[15:44:04:932658][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:404][gpu_dma_cleanup] Cleanup DMA ctx with CPU data path
[15:44:04:954156][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:197][main] Sample finished successfully