GPUNetIO Sample Guide
This section contains two samples that show how to enable simple GPUNetIO features. Be sure to correctly set the following environment variables:
Environment variables
export PATH=${PATH}:/usr/local/cuda/bin
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"
export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/usr/lib/pkgconfig:/opt/mellanox/grpc/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu/pkgconfig
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64:/opt/mellanox/gdrcopy/src:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu
All the DOCA samples described in this section are governed under the BSD-3 software license agreement.
Please ensure the architecture of your GPU is included in the meson.build file before building the samples (e.g., sm_80 for Ampere, sm_89 for L40, sm_90 for H100, etc.).
The sample shows how to enable the Accurate Send Scheduling (or wait-on-time) feature in the context of a GPUNetIO application. Accurate Send Scheduling is the ability of an NVIDIA NIC to send packets at a future time according to application-provided timestamps.
This feature is supported on ConnectX-7 and later.
This sample demonstrates how to send packets from the GPU using Accurate Send Scheduling by calling the high-level doca_gpu_dev_eth_txq_wait_send function with a BLOCK execution scope.
This NVIDIA blog post offers an example of how this feature has been used in 5G networks.
Synchronizing Clocks
Before starting the sample, it is important to properly synchronize the CPU clock with the NIC clock so that timestamps provided by the system clock match the time on the NIC.
For this purpose, at least the phc2sys service must be used. To install it on an Ubuntu system:
phc2sys
sudo apt install linuxptp
To start the phc2sys service properly, a config file must be created in /lib/systemd/system/phc2sys.service. Assuming the network interface is ens6f0:
phc2sys
[Unit]
Description=Synchronize system clock or PTP hardware clock (PHC)
Documentation=man:phc2sys
[Service]
Restart=always
RestartSec=5s
Type=simple
ExecStart=/bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp$(ethtool -T ens6f0 | grep PTP | awk '{print $4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
[Install]
WantedBy=multi-user.target
Now the phc2sys service can be started:
phc2sys
sudo systemctl stop systemd-timesyncd
sudo systemctl disable systemd-timesyncd
sudo systemctl daemon-reload
sudo systemctl start phc2sys.service
To check the status of phc2sys:
phc2sys
$ sudo systemctl status phc2sys.service
Output:
phc2sys
● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
Loaded: loaded (/lib/systemd/system/phc2sys.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2023-04-03 10:59:13 UTC; 2 days ago
Docs: man:phc2sys
Main PID: 337824 (sh)
Tasks: 2 (limit: 303788)
Memory: 560.0K
CPU: 52min 8.199s
CGroup: /system.slice/phc2sys.service
├─337824 /bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T enp23s0f1np1 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R >
└─337829 /usr/sbin/phc2sys -s /dev/ptp3 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256
Apr 05 16:35:52 doca-vr-045 phc2sys[337829]: [457395.040] CLOCK_REALTIME rms 8 max 18 freq +110532 +/- 27 delay 770 +/- 3
Apr 05 16:35:53 doca-vr-045 phc2sys[337829]: [457396.071] CLOCK_REALTIME rms 8 max 20 freq +110513 +/- 30 delay 769 +/- 3
Apr 05 16:35:54 doca-vr-045 phc2sys[337829]: [457397.102] CLOCK_REALTIME rms 8 max 18 freq +110527 +/- 30 delay 769 +/- 3
Apr 05 16:35:55 doca-vr-045 phc2sys[337829]: [457398.130] CLOCK_REALTIME rms 8 max 18 freq +110517 +/- 31 delay 769 +/- 3
Apr 05 16:35:56 doca-vr-045 phc2sys[337829]: [457399.159] CLOCK_REALTIME rms 8 max 19 freq +110523 +/- 32 delay 770 +/- 3
Apr 05 16:35:57 doca-vr-045 phc2sys[337829]: [457400.191] CLOCK_REALTIME rms 8 max 20 freq +110528 +/- 33 delay 770 +/- 3
Apr 05 16:35:58 doca-vr-045 phc2sys[337829]: [457401.221] CLOCK_REALTIME rms 8 max 19 freq +110512 +/- 38 delay 770 +/- 3
Apr 05 16:35:59 doca-vr-045 phc2sys[337829]: [457402.253] CLOCK_REALTIME rms 9 max 20 freq +110538 +/- 47 delay 770 +/- 4
Apr 05 16:36:00 doca-vr-045 phc2sys[337829]: [457403.281] CLOCK_REALTIME rms 8 max 21 freq +110517 +/- 38 delay 769 +/- 3
Apr 05 16:36:01 doca-vr-045 phc2sys[337829]: [457404.311] CLOCK_REALTIME rms 8 max 17 freq +110526 +/- 26 delay 769 +/- 3
...
At this point, the system and NIC clocks are synchronized so timestamps provided by the CPU are correctly interpreted by the NIC.
The timestamps you get may not reflect the actual time of day. To get that, you must properly configure the ptp4l service with an external grandmaster on the system, which is beyond the scope of this sample.
Running the Sample
To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:
Build the sample
# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_send_wait_time
meson build
ninja -C build
The sample sends 8 bursts of 32 raw Ethernet packets of 1 kB each to a dummy Ethernet address, 10:11:12:13:14:15, in a timed way: the NIC is programmed to send a burst every t nanoseconds (command-line option -t).
The following example programs a system with GPU PCIe address ca:00.0 and NIC PCIe address 17:00.0 to send 32 packets every 5 milliseconds:
Run
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ sudo ./build/doca_gpunetio_send_wait_time -n 17:00.0 -g ca:00.0 -t 5000000
[09:22:54:165778][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:195][main] Starting the sample
[09:22:54:438260][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:224][main] Sample configuration:
GPU ca:00.0
NIC 17:00.0
Timeout 5000000ns
EAL: Detected CPU lcores: 128
...
EAL: Probe PCI driver: mlx5_pci (15b3:a2d6) device: 0000:17:00.0 (socket 0)
[09:22:54:819996][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:607][gpunetio_send_wait_time] Wait on time supported mode: DPDK
EAL: Probe PCI driver: gpu_cuda (10de:20b5) device: 0000:ca:00.0 (socket 1)
[09:22:54:830212][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:252][create_tx_buf] Mapping send queue buffer (0x0x7f48e32a0000 size 262144B) with legacy nvidia-peermem mode
[09:22:54:832462][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:657][gpunetio_send_wait_time] Launching CUDA kernel to send packets
[09:22:54:842945][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:664][gpunetio_send_wait_time] Waiting 10 sec for 256 packets to be sent
[09:23:04:883309][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:684][gpunetio_send_wait_time] Sample finished successfully
[09:23:04:883339][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:239][main] Sample finished successfully
To verify that packets are actually sent at the right time, use a packet sniffer on the other side (e.g., tcpdump):
tcpdump
$ sudo tcpdump -i enp23s0f1np1 -A -s 64
17:12:23.480318 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
....
17:12:23.480368 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
# end of first burst of 32 packets, bump to +5ms
17:12:23.485321 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
...
17:12:23.485369 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
# end of second burst of 32 packets, bump to +5ms
17:12:23.490278 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
...
The output should show a jump of approximately 5 milliseconds every 32 packets.
tcpdump may increase latency in sniffing packets and reporting the receive timestamp, so the difference between bursts of 32 packets reported may be less than expected, especially with small interval times like 500 microseconds (-t 500000).
This sample application demonstrates the fundamental steps to build a DOCA GPUNetIO receiver application. It creates one queue for UDP packets and uses a single CUDA kernel to receive those packets from the GPU.
The sample uses the high-level doca_gpu_dev_eth_rxq_recv function, and its execution scope (thread, warp, or block) can be set at runtime using the -e command-line parameter.
Invoking printf from a CUDA kernel is not good practice for release software as it slows down the kernel's overall execution. It should only be used for debugging. To enable packet info printing in this sample, the DOCA_GPUNETIO_SIMPLE_RECEIVE_DEBUG macro must be set to 1.
Build Instructions
Build the sample
# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_receive
meson build
ninja -C build
Execution and Testing
This guide assumes a two-machine setup:
Receiver Machine: Runs the doca_gpunetio_simple_receive application.
Packet Generator Machine: Uses an application like nping to send UDP packets.
Packet Generator Machine Example
On the packet generator machine, use nping to send 10 UDP packets to the receiver's IP address (assumed to be 192.168.1.1 in this example).
Command:
nping generator
$ nping --udp -c 10 -p 2090 192.168.1.1 --data-length 1024 --delay 500ms
Output:
Starting Nping 0.7.80 ( https://nmap.org/nping ) at 2023-11-20 11:05 UTC
SENT (0.0018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (0.5018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.0025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.5025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.0032s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.5033s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.0040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.5040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.0047s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.5048s) UDP packet with 1024 bytes to 192.168.1.1:2090
Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
UDP packets sent: 10 | Rcvd: 0 | Lost: 10 (100.00%)
Nping done: 1 IP address pinged in 5.50 seconds
Receiver Machine Example
On the receiver machine, run the sample with the appropriate PCI addresses for the NIC and GPU. This example uses BLOCK scope (-e 2).
Command:
DOCA Simple Receive
# Ensure DOCA is in the LD_LIBRARY_PATH environment variable
$ sudo ./doca_gpunetio_simple_receive -n 9f:00.0 -g 8a:00.0 -e 2
Output:
DOCA Simple Receive
[2025-10-27 00:52:32:387590][3382972416][DOCA][INF][doca_log.cpp:633] DOCA version 3.2.0111
[2025-10-27 00:52:32:387627][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:198][main] Starting the sample
[2025-10-27 00:52:32:681807][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:240][main] Sample configuration:
GPU 8a:00.0
NIC 9f:00.0
Shared QP exec scope Block
[2025-10-27 00:52:32:687128][3382972416][DOCA][WRN][engine_model.c:90] adapting queue depth to 128.
[2025-10-27 00:52:32:753758][3382972416][DOCA][WRN][hws_port.c:864] ARGUMENT_256B resource doens't exist, skip creating NAT64 actions
[2025-10-27 00:52:32:755527][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:479][create_rxq] Creating Sample Eth Rxq
[2025-10-27 00:52:32:755713][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:544][create_rxq] Mapping receive queue buffer (0x0x7eef8e000000 size 33554432B dmabuf fd 43) with dmabuf mode
[2025-10-27 00:52:32:795823][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:717][gpunetio_simple_receive] Launching CUDA kernel to receive packets
[2025-10-27 00:52:32:799403][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:721][gpunetio_simple_receive] Waiting for termination
# Type Ctrl+C to kill the sample
[2025-10-27 00:53:35:034046][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:61][signal_handler] Signal 2 received, preparing to exit!
Exiting from simple receive sample. Total number of received packets: 10
[2025-10-27 00:53:35:034322][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:395][destroy_rxq] Destroying Rxq
[2025-10-27 00:53:35:061065][3382972416][DOCA][WRN][hws_group_pool.c:85] group_pool has 1 used groups
[2025-10-27 00:53:35:845568][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:738][gpunetio_simple_receive] Sample finished successfully
[2025-10-27 00:53:35:845583][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:259][main] Sample finished successfully
This sample implements a simple GPU Ethernet packet generator that constantly sends a flow of raw Ethernet packets. It demonstrates two different implementation "flavors" for sending packets: one using the low-level API and one using the high-level API.
The behavior is controlled by the -q command-line option:
-q 0 (Low-level API): Disables the shared queue feature and executes using the low-level send functions.
-q 1 (High-level API): Enables the shared queue feature. When this is set, the execution scope (thread, warp, or block) can also be chosen using the -e option.
Build Instructions
Build the sample
# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_send
meson build
ninja -C build
Execution Example
Assuming a system with a NIC at PCIe address 9f:00.0 and a GPU at 8a:00.0, the following command launches the sample to send 1024-byte packets (-s 1024) using 1024 CUDA threads (-t 1024). It enables the high-level API (-q 1) with a BLOCK execution scope (-e 2).
Command:
DOCA Simple Send
# Ensure DOCA is in the LD_LIBRARY_PATH environment variable
$ sudo ./doca_gpunetio_simple_send -n 9f:00.0 -g 8a:00.0 -s 1024 -t 1024 -q 1 -e 2
Output:
[2025-10-28 04:38:10:548387][961593344][DOCA][INF][doca_log.cpp:645] DOCA version 3.3.0010
[2025-10-28 04:38:10:548420][961593344][DOCA][INF][gpunetio_simple_send_main.c:363][main] Starting the sample
[2025-10-28 04:38:10:839382][961593344][DOCA][INF][gpunetio_simple_send_main.c:433][main] Sample configuration:
GPU 8a:00.0
NIC 9f:00.0
Packet size 1024
CUDA threads 1024
CPU Proxy No
Shared QP Yes
Shared QP exec scope Block
[2025-10-28 04:38:10:911219][961593344][DOCA][INF][gpunetio_simple_send_sample.c:291][create_txq] Creating Sample Eth Txq
[2025-10-28 04:38:10:916434][961593344][DOCA][INF][gpunetio_simple_send_sample.c:429][create_txq] Mapping send queue buffer (0x0x7fd718800000 size 1048576B dmabuf fd 45) with dmabuf mode
[2025-10-28 04:38:10:917479][961593344][DOCA][INF][gpunetio_simple_send_sample.c:580][gpunetio_simple_send] Launching CUDA kernel to send packets.
[2025-10-28 04:38:10:920972][961593344][DOCA][INF][gpunetio_simple_send_sample.c:584][gpunetio_simple_send] Waiting for ctrl+c termination
# Type Ctrl+C to kill the sample
[2025-10-28 04:38:22:681160][961593344][DOCA][INF][gpunetio_simple_send_sample.c:67][signal_handler] Signal 2 received, preparing to exit!
[2025-10-28 04:38:22:681176][961593344][DOCA][INF][gpunetio_simple_send_sample.c:590][gpunetio_simple_send] Exiting from sample
[2025-10-28 04:38:22:681387][961593344][DOCA][INF][gpunetio_simple_send_sample.c:204][destroy_txq] Destroying Txq
[2025-10-28 04:38:22:990964][961593344][DOCA][INF][gpunetio_simple_send_sample.c:612][gpunetio_simple_send] Sample finished successfully
[2025-10-28 04:38:22:990980][961593344][DOCA][INF][gpunetio_simple_send_main.c:457][main] Sample finished successfully
Performance Verification
To verify that packets are being sent, you can check the traffic throughput on the same machine using the mlnx_perf command. (Assuming the NIC at 9f:00.0 has the interface name ens6f0np0):
mlnx_perf
$ mlnx_perf -i ens6f0np0
tx_vport_unicast_packets: 21,033,677
tx_vport_unicast_bytes: 21,538,485,248 Bps = 172,307.88 Mbps
tx_packets_phy: 21,033,286
tx_bytes_phy: 21,622,256,208 Bps = 172,978.4 Mbps
tx_prio0_bytes: 21,617,084,940 Bps = 172,936.67 Mbps
tx_prio0_packets: 21,028,292
UP 0: 172,936.67 Mbps = 100.00%
UP 0: 21,028,292 Tran/sec = 100.00%
Throughput will vary between different systems based on factors like the PCIe connection between the GPU and NIC, the NIC model, and the GPU model. This sample can be used to evaluate your system's send performance by varying the packet size and number of CUDA threads.
This sample demonstrates how to use the GPUNetIO RDMA API to receive and send/write with immediate using a single RDMA queue.
The server has a GPU buffer array A composed of GPU_BUF_NUM doca_gpu_buf elements, each 1 kB in size. The client has two GPU buffer arrays, B and C, each composed of GPU_BUF_NUM doca_gpu_buf elements, each 512 B in size.
The goal is for the client to fill a single server buffer of 1kB with two GPU buffers of 512B as illustrated in the following figure:
To show how to use RDMA write and send, even buffers are sent from the client with write immediate, while odd buffers are sent with send immediate. In both cases, the server must pre-post the RDMA receive operations.
For each buffer, the CUDA kernel code repeats the handshake:
Once all buffers are filled, the server double-checks that all values are valid. The server output should be as follows:
DOCA RDMA Server side
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0
[14:11:43:000930][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:250][main] Starting the sample
...
[14:11:43:686610][1173110][DOCA][INF][rdma_common.c:91][oob_connection_server_setup] Listening for incoming connections
[14:11:45:681523][1173110][DOCA][INF][rdma_common.c:105][oob_connection_server_setup] Client connected at IP: 192.168.2.28 and port: 46274
...
[14:11:45:771807][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:644][rdma_write_server] Before launching CUDA kernel, buffer array A is:
[14:11:45:771822][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 0 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771837][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 1 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771851][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 2 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771864][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 3 -> offset 0: 1111 | offset 128: 1111
RDMA Recv 2 ops completed with immediate values 0 and 1!
RDMA Recv 2 ops completed with immediate values 1 and 2!
RDMA Recv 2 ops completed with immediate values 2 and 3!
RDMA Recv 2 ops completed with immediate values 3 and 4!
[14:11:45:781561][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:671][rdma_write_server] After launching CUDA kernel, buffer array A is:
[14:11:45:781574][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 0 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781583][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 1 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781593][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 2 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781602][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 3 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781640][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:294][main] Sample finished successfully
On the other side, assuming the server is at IP address 192.168.2.28, the client output should be as follows:
DOCA RDMA Client side
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0 -c 192.168.2.28
[16:08:22:335744][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:197][main] Starting the sample
...
[16:08:25:753316][160913][DOCA][INF][rdma_common.c:147][oob_connection_client_setup] Connected with server successfully
......
Client waiting on flag 7f6596735000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 0
Client waiting on flag 7f6596735001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 1
Client waiting on flag 7f6596735002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 2
Client waiting on flag 7f6596735003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 3
[16:08:25:853454][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:241][main] Sample finished successfully
With RDMA, the network device must be specified by name (e.g., mlx5_0) instead of by PCIe address (as is the case for Ethernet).
It is also possible to enable the RDMA CM mode, establishing two connections with the same RDMA GPU handler. An example on the client side:
DOCA RDMA Client side with CM
# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/samples/doca_gpunetio_rdma_client_server_write -d mlx5_0 -gpu 17:00.0 -gid 3 -c 10.137.189.28 -cm --server-addr-type ipv4 --server-addr 192.168.2.28
[11:30:34:489781][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:461][main] Starting the sample
...
[11:30:35:038828][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:950][rdma_write_client] Client is waiting for a connection establishment
[11:30:35:082039][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:963][rdma_write_client] Client - Connection 1 is established
...
[11:30:35:095282][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1006][rdma_write_client] Establishing connection 2..
[11:30:35:097521][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1016][rdma_write_client] Client is waiting for a connection establishment
[11:30:35:102718][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1029][rdma_write_client] Client - Connection 2 is established
[11:30:35:102783][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1046][rdma_write_client] Client, terminate kernels
Client waiting on flag 7f16067b5000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 1
Client waiting on flag 7f16067b5001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 2
Client waiting on flag 7f16067b5002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 3
Client waiting on flag 7f16067b5003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 4
Client posted and completed 4 RDMA commits on connection 0. Waiting on the exit flag.
Client waiting on flag 7f16067b5000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 1
Client waiting on flag 7f16067b5001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 2
Client waiting on flag 7f16067b5002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 3
Client waiting on flag 7f16067b5003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 4
Client posted and completed 4 RDMA commits on connection 1. Waiting on the exit flag.
[11:30:35:122448][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:512][main] Sample finished successfully
When using RDMA CM, the command-line option -cm must also be specified on the server side.
Printing from a CUDA kernel is not recommended for performance reasons. It may make sense for debugging purposes and for simple samples like this one.
The doca_gpunetio_verbs_* examples demonstrate how to use the GPUNetIO Verbs API in various scenarios. These samples require a client-server setup, with the client needing the -c <server IP> parameter.
The following parameters are supported by all Verbs samples:
-n: Network card handler type (0: AUTO, 1: CPU Proxy, 2: GPU DB). Default is 0 (AUTO).
-e: Execution mode for shared QP (0: per-thread, 1: per-warp). Default is 0 (per-thread).
-d: Network card device name.
-g: GPU device PCIe address.
-gid: GID index for DOCA RDMA (optional).
-i: Number of iterations (optional).
-t: Number of CUDA threads (optional).
Please read the specific sample section below to check for extra parameters.
For the examples in this guide, it is assumed that the GPU PCIe address is 8A:00.0 and the network card PCIe address is specified accordingly.
If the samples are running over a RoCE connection (e.g., the ConnectX/BlueField is set to Ethernet mode instead of InfiniBand mode), you may get this (or a similar) error: FW failed to modify object, status=BAD_PARAM_ERR (0x3), syndrome=0x1f3b5d.
To fix it, please remove the doca_verbs_ah_attr_set_dlid function from the connect_verbs_qp function in the samples/doca_gpunetio/verbs_common.c file.
Bandwidth Samples
The samples ending with _bw measure the bandwidth of specific GPUNetIO Verbs API functions. In these samples, the client prepares and sends data from a CUDA kernel, while the server waits on the CPU to receive the data; upon a Ctrl+C signal, it validates the data and reports the outcome.
Key Characteristics
The client outputs the MB/s achieved for preparing and sending messages of various sizes.
The server outputs a message indicating the execution outcome.
All bandwidth samples support the previously listed command-line parameters.
Simplifying QP/CQ/UAR Creation
To simplify the integration of DOCA RDMA Verbs with GPUNetIO, high-level functions like doca_gpu_verbs_create_qp_hl() and doca_gpu_verbs_create_qp_group_hl() are provided in samples/doca_gpunetio/verbs_high_level.cpp. These functions encapsulate the necessary steps for creating QP/CQ/UAR, making it easier for developers to combine DOCA RDMA Verbs and GPUNetIO in their applications.
doca_gpunetio_verbs_write_bw
The doca_gpunetio_verbs_write_bw test measures the bandwidth of RDMA Write operations using the GPUNetIO Verbs API. It launches a CUDA kernel with 1 block and 512 threads by default, where all threads post RDMA Write WQEs in different positions and the last thread submits them. The test then polls the CQE corresponding to the last WQE to ensure all previous WQEs have been executed correctly.
Example command lines:
Server:
doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_bw
The doca_gpunetio_verbs_put_bw test measures the bandwidth of RDMA Write operations (referred to as "Put") using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads. The test can also measure individual function latencies by setting the KERNEL_DEBUG_TIMES macro to 1.
Example command lines:
Server:
doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_signal_bw
The doca_gpunetio_verbs_put_signal_bw test measures the bandwidth of Put + Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.
Example command lines:
Server:
doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:
do {
final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));
doca_gpunetio_verbs_put_counter_bw
The doca_gpunetio_verbs_put_counter_bw test measures the bandwidth of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.
Example command lines:
Server:
doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:
do {
final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));
doca_gpunetio_verbs_twosided_bw
The doca_gpunetio_verbs_twosided_bw test measures the bandwidth of client-server data exchange via Send/Recv operations using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads.
Example command lines:
Server:
doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_get_bw
This sample application measures the bandwidth of one-sided RDMA Read ("Get") operations. It uses the GPUNetIO Verbs API with the shared QP feature.
The application launches a CUDA kernel configured with 2 blocks, each containing 256 threads.
Example usage:
Server:
doca_gpunetio_verbs_get_bw -g 8A:00.0 -d mlx5_0
Client (connects to the server's IP, e.g., 192.168.1.63; additional CLI options can be added):
doca_gpunetio_verbs_get_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
Latency Samples
The samples ending with _lat measure the latency of specific GPUNetIO Verbs API functions by performing a ping-pong exchange between client and server. These tests launch a CUDA kernel with a single CUDA thread and do not support the -e and -t command-line options.
Both client and server output the round-trip time (RTT) latency (half and full) in microseconds for preparing and exchanging messages of different sizes. The server also outputs a message indicating the execution outcome.
doca_gpunetio_verbs_write_lat
The doca_gpunetio_verbs_write_lat test measures the latency of RDMA Write operations using the GPUNetIO Verbs API without the shared QP feature. It is similar to perftest.
This sample supports the reliable doorbell feature with the command-line option -r <0|1>.
Example command lines:
Server:
doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_signal_lat
The doca_gpunetio_verbs_put_signal_lat test measures the latency of Put and Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It uses high-level API functions from doca_gpunetio_dev_verbs_onesided.cuh and launches a CUDA kernel with a single CUDA thread.
Example command lines:
Server:
doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
doca_gpunetio_verbs_put_counter_lat
The doca_gpunetio_verbs_put_counter_lat test measures the latency of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It utilizes high-level API functions from doca_gpunetio_dev_verbs_counter.cuh and the Core Direct counter feature, even with a single CUDA thread.
This sample supports the reliable doorbell feature with the command-line option -r <0|1>.
Example command lines:
Server:
doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0
Client (additional command-line options can be added):
doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63
This sample demonstrates how to use the DOCA DMA and DOCA GPUNetIO libraries to DMA-copy a memory buffer from the CPU to the GPU (with DOCA DMA CPU functions) and from the GPU to the CPU (with DOCA GPUNetIO DMA device functions) from a CUDA kernel. This sample requires a DPU, as it uses the DPU's DMA engine.
DOCA DMA Memcpy
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_dma_memcpy
# Build the sample and then execute
$ ./build/doca_gpunetio_dma_memcpy -g 17:00.0 -n ca:00.0
[15:44:04:189462][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:164][main] Starting the sample
EAL: Detected CPU lcores: 64
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Selected IOVA mode 'VA'
EAL: No free 2048 kB hugepages reported on node 0
EAL: No free 2048 kB hugepages reported on node 1
EAL: VFIO support initialized
TELEMETRY: No legacy callbacks, legacy socket not created
EAL: Probe PCI driver: gpu_cuda (10de:2331) device: 0000:17:00.0 (socket 0)
[15:44:04:857251][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:211][init_sample_mem_objs] The CPU source buffer value to be copied to GPU memory: This is a sample piece of text from CPU
[15:44:04:857359][862197][DOCA][WRN][doca_mmap.cpp:1743][doca_mmap_set_memrange] Mmap 0x55aec6206140: Memory range isn't cache-line aligned - addr=0x55aec52ceb10. For best performance align address to 64B
[15:44:04:858839][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:158][init_sample_mem_objs] The GPU source buffer value to be copied to CPU memory: This is a sample piece of text from GPU
[15:44:04:921702][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:570][submit_dma_memcpy_task] Success, DMA memcpy job done successfully
CUDA KERNEL INFO: The GPU destination buffer value after the memcpy: This is a sample piece of text from CPU
CPU received message from GPU: This is a sample piece of text from GPU
[15:44:04:930087][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:364][gpu_dma_cleanup] Cleanup DMA ctx with GPU data path
[15:44:04:932658][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:404][gpu_dma_cleanup] Cleanup DMA ctx with CPU data path
[15:44:04:954156][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:197][main] Sample finished successfully