DOCA GPUNetIO - NVIDIA Docs

This document provides an overview and configuration instructions for DOCA GPUNetIO API.

Introduction

Note

The quality status of DOCA libraries is listed here.

DOCA GPUNetIO enables real-time GPU processing for network packets, making it ideal for application domains such as:

Signal processing
Network security
Information gathering
Input reconstruction

Traditional approaches often rely on a CPU-centric model, where the CPU coordinates with the NIC to receive packets in GPU memory using GPUDirect RDMA. Afterward, the CPU notifies a CUDA kernel on the GPU to process the packets. However, on low-power platforms, this CPU dependency can become a bottleneck, limiting GPU performance and increasing latency.

DOCA GPUNetIO addresses this challenge by offering a GPU-centric solution that removes the CPU from the critical path. By combining multiple NVIDIA technologies, it provides a highly efficient and scalable method for network packet processing.

Technologies integrated with DOCA GPUNetIO:

GPUDirect RDMA – Enables direct packet transfer between the NIC and GPU memory, eliminating unnecessary memory copies
GPUDirect Async Kernel-Initiated (GDAKI) – Allows CUDA kernels to control network operations without CPU intervention
- In GPUNetIO, GDAKI is enabled for Ethernet, InfiniBand, and RoCE
- GDAKI is also named IBGDA when used with InfiniBand
GDRCopy Library – Allows the CPU to access GPU memory directly
NVIDIA BlueField DMA Engine – Supports GPU-triggered memory copies

The following is an example diagram of a CPU-centric approach:

image2023-3-17_17-16-6-version-1-modificationdate-1763548710497-api-v2.png

The following is an example diagram of a GPU-centric approach:

image2023-4-19_11-47-9-version-1-modificationdate-1763548714953-api-v2.png

Key features of DOCA GPUNetIO include :

GPUDirect Async Kernel-Initiated (GDAKI)
- GDAKI network communications – a GPU CUDA kernel can control network communications to send or receive data
  - GPU can control Ethernet communications (Ethernet/IP/UDP/TCP/ICMP)
  - GPU can control RDMA communications (InfiniBand or RoCE are supported)
  - CPU intervention is unnecessary in the application critical path
GPUDirect RDMA
- Enables direct data transfers to and from GPU memory without CPU staging copies.
DMA Engine Control
- CUDA kernels can initiate memory copies using BlueField's DMA engine
Semaphores for Low-Latency Communication
- Supports efficient message passing between CUDA kernels or between CUDA kernels and CPU threads
Smart Memory Allocation
- Allocates aligned GPU memory buffers, optimizing memory access
- GDRCopy library to allocate a GPU memory buffer accessible from the CPU
Accurate Send Scheduling
- Provides precise control over Ethernet packet transmission based on user-defined timestamps.

NVIDIA applications that use DOCA GPUNetIO include:

Aerial 5G SDK – For ultra-low latency 5G network operations
NIXL – NVIDIA Inference Xfer Library (NIXL) is targeted for accelerating point to point communications in AI inference frameworks (e.g., NVIDIA Dynamo)
Morpheus – Enhancing cybersecurity through accelerated packet inspection
Holoscan Advanced Network Operator – Powering real-time data processing in edge AI environments
UCX – new GDAKI module
– GIN enabled via GPUNetIO GPU communications

For more information about DOCA GPUNetIO, refer to the following NVIDIA blog posts:

Changes From Previous Releases

Changes in 3.2.0

DOCA GPUNetIO CPU shared library doesn't depend on CUDA RT anymore
- libdoca_gpunetio.so only links to libcuda.so now
Removed any DPDK dependency from GPUNetIO samples
DOCA GPUNetIO Ethernet functions reworked and released as inline functions in open source CUDA header files. Features added:
- Share QP for the send side functions on 3 different scopes: block, warp and block
- CPU proxy execution mode
- Memory consistency algorithm for pre-Hopper GPUs can be explicitly enabled on the receive side
DOCA GPUNetIO Verbs functions extended:
- Memory consistency algorithm for pre-Hopper GPUs can be explicitly enabled with RDMA Recv or RDMA Read
DOCA GPUNetIO Comm Channel new API to enable the communication between DPU and GPU via DOCA Comm Channel
dmabuf for GPU memory enabled in all GPUNetIO Verbs and Ethernet samples
- nvidia-peermem is used as fallback mechanism
Enabled GPUNetIO Verbs two-sided receive functions to handle up to 32B of data in-lined within the CQE, eliminating the need to allocate and register receive buffers for small messages.

Installation

DOCA GPUNetIO is included in the doca-all package, which is available from the DOCA downloads portal for all supported operating systems.

To install the required DOCA GPUNetIO components, use the package manager for your OS.

For Ubuntu/Debian:

Copy
Copied!

            
            apt install doca-all doca-sdk-gpunetio libdoca-sdk-gpunetio-dev

For RHEL:

Copy
Copied!

            
            yum install doca-all doca-sdk-gpunetio doca-sdk-gpunetio-devel

Note

To achieve the best performance when building any DOCA GPUNetIO sample or application, you must set the buildtype to release in the meson.build file (e.g., buildtype = 'release'). Building in the default debug mode will result in significantly lower performance.

System Configuration

To run a DOCA GPUNetIO application, the system must be configured with both a GPU and a NIC (either ConnectX or BlueField), connected to the system via PCIe.

image-2025-7-10_11-18-26-version-1-modificationdate-1763548705560-api-v2.png

The system's internal hardware topology should be GPUDirect-RDMA-friendly to maximize the internal throughput between the GPU and the NIC. To verify the type of connection between the GPU and NIC:

Copy
Copied!

            
            $ nvidia-smi topo -m
	    GPU0	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	    NODE	NODE	12-23,36-47	    1				N/A
NIC0	NODE	 X 	    PIX
NIC1	NODE	PIX	     X
 
Legend:
 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
 
NIC Legend:
 
  NIC0: mlx5_0
  NIC1: mlx5_1

To maximize throughput between the GPU and NIC, the system should have a PIX (or PXB) topology with a dedicated PCIe connection. A PHB topology is still acceptable if the GPU and NIC are on the same PCIe Host Bridge and NUMA node, although performance may vary depending on the platform. For optimal performance, it's recommended to avoid NODE and SYS topologies, as they may negatively impact performance despite the application remaining functional.

Note

DOCA GPUNetIO has been fully tested on bare-metal systems and within Docker containers. Support for virtualized environments is currently considered experimental.

NIC Configuration

ConnectX NIC

Note

Ensure the NVIDIA® ConnectX® firmware is compatible with the current DOCA release. NVIDIA recommends using ConnectX-7 or later adapters. ConnectX6-Dx still works but some features (e.g., Ethernet Accurate Send Scheduling) are not supported.

Start MST:

Copy
Copied!

            
            $ sudo mst start

Check MST status to get the MST device identifier:

Copy
Copied!

            
            $ sudo mst status -v

Example output:

Copy
Copied!

            
            MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA
ConnectX6DX(rev:0)      /dev/mst/mt4125_pciconf0.1    b5:00.1   mlx5_1          net-ens6f1                0
ConnectX6DX(rev:0)      /dev/mst/mt4125_pciconf0      b5:00.0   mlx5_0          net-ens6f0                0

Configure ConnectX NIC:

For Ethernet transport, r un the following commands, replacing <mst_device> with the actual MST device name (e.g., /dev/mst/mt4125_pciconf0 ):

Copy
Copied!

            
            mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=1 KEEP_ETH_LINK_UP_P2=1 KEEP_IB_LINK_UP_P1=0 KEEP_IB_LINK_UP_P2=0
# This is required only if application uses the Accurate Send Scheduling feature
mlxconfig -d <mst_device> --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

Info

The following example assumes that the adapter is dual port. If single port, only P1 options apply.

For InfiniBand transport, run:

Copy
Copied!

            
            mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=0 KEEP_ETH_LINK_UP_P2=0 KEEP_IB_LINK_UP_P1=1 KEEP_IB_LINK_UP_P2=1
# Accurate Send Scheduling feature can't be used with InfiniBand

Info

The following example assumes that the adapter is dual port. If single port, only P1 options apply.

Perform a cold reboot to apply the changes :

Copy
Copied!

            
            ipmitool power cycle

BlueField NIC

To use NVIDIA BlueField-2 or BlueField-3 with DOCA GPUNetIO, the DPU must be in NIC mode to expose the internal ConnectX to the host application.

Start MST:

Copy
Copied!

            
            $ sudo mst start

Check MST status to get the mst device identifier:

Copy
Copied!

            
            $ sudo mst status -v

Example output:

Copy
Copied!

            
            MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE                         MST                           PCI       RDMA            NET                                     NUMA  
BlueField3(rev:1)                   /dev/mst/mt41692_pciconf0.1   9f:00.1   mlx5_1          net-ens6f1np1                           1
BlueField3(rev:1)                   /dev/mst/mt41692_pciconf0     9f:00.0   mlx5_0          net-ens6f0np0                           1

Configure BlueField NIC:

For Ethernet transport:

Copy
Copied!

            
            sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=2 LINK_TYPE_P2=2 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED
# This is required only if application uses the Accurate Send Scheduling feature
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

For InfiniBand transport:

Copy
Copied!

            
            sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=1 LINK_TYPE_P2=1 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED
# Accurate Send Scheduling feature can't be used with Infiniband

Do a cold reboot to apply the changes :

Copy
Copied!

            
            ipmitool power cycle

Example verification command for Ethernet.

Copy
Copied!

            
            sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q LINK_TYPE_P1 LINK_TYPE_P2 INTERNAL_CPU_MODEL INTERNAL_CPU_PAGE_SUPPLIER INTERNAL_CPU_ESWITCH_MANAGER INTERNAL_CPU_IB_VPORT0 INTERNAL_CPU_OFFLOAD_ENGINE ACCURATE_TX_SCHEDULER REAL_TIME_CLOCK_ENABLE

Example output (Ethernet):

Copy
Copied!

            
                     LINK_TYPE_P1                                ETH(2)
         LINK_TYPE_P2                                ETH(2)
         INTERNAL_CPU_MODEL                          EMBEDDED_CPU(1)
         INTERNAL_CPU_PAGE_SUPPLIER                  EXT_HOST_PF(1)
         INTERNAL_CPU_ESWITCH_MANAGER                EXT_HOST_PF(1)
         INTERNAL_CPU_IB_VPORT0                      EXT_HOST_PF(1)
         INTERNAL_CPU_OFFLOAD_ENGINE                 DISABLED(1)
         ACCURATE_TX_SCHEDULER                       True(1)
         REAL_TIME_CLOCK_ENABLE                      True(1)

PCIe Configuration

On some x86 systems, the Access Control Services (ACS) must be disabled to ensure direct communication between the NIC and GPU, whether they reside on the same converged accelerator DPU or on different PCIe slots in the system. The recommended solution is to disable ACS control via BIOS (e.g., Supermicro or HPE) on your PCIe bridge. Alternatively, it is also possible to disable it via command line, but it may not be as effective as the BIOS option.

The following lspci -tvvv output illustrates a typical system topology:

Copy
Copied!

            
            $ lspci -tvvv...+-[0000:b0]-+-00.0  Intel Corporation Device 09a2
 |           +-00.1  Intel Corporation Device 09a4
 |           +-00.2  Intel Corporation Device 09a3
 |           +-00.4  Intel Corporation Device 0998
 |           \-02.0-[b1-b6]----00.0-[b2-b6]--+-00.0-[b3]--+-00.0  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                           |            +-00.1  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                           |            \-00.2  Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
 |                                           \-01.0-[b4-b6]----00.0-[b5-b6]----08.0-[b6]----00.0  NVIDIA Corporation Device 20b8

The PCIe switch address to consider is b2:00.0 (entry point of the DPU). ACSCtl must have all negative values:

PCIe set

Copy
Copied!

            
            setpci -s b2:00.0 ECAP_ACS+0x6.w=0000

To verify that the setting has been applied correctly:

PCIe check

Copy
Copied!

            
            $ sudo lspci -s b2:00.0 -vvvv | grep -i ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

Please refer to this page and this page for more information.

If the application still does not report any received packets, try to disable IOMMU. On some systems, it can be done from the BIOS looking for the the VT-d or IOMMU from the NorthBridge configuration and change that setting to Disable and save it. The system may also require adding intel_iommu=off or amd_iommu=off to the kernel options. That can be done through the grub command line as follows:

IOMMU

Copy
Copied!

            
            $ sudo vim /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="iommu=off intel_iommu=off <more options>"
$ sudo update-grub
$ sudo reboot

GPU Configuration

CUDA Dependency

DOCA GPUNetIO components have a dependency on CUDA. These dependencies differ for the CPU-side shared library versus the GPU-side datapath components.

CPU Shared Library (libdoca_gpunetio.so) This library has a dependency on libcuda.so (CUDA Driver API). Because it does not use the CUDA Runtime API, it is not subject to potential versioning issues associated with the runtime.
GPU Datapath Components The data path functions are delivered as both header files and a static library, which have different requirements:
- Header-only APIs (GPUNetIO Ethernet, GPUNetIO Verbs): These are inlined functions. Since they are compiled with your application, they are flexible and can be used with any recent CUDA version (e.g., CUDA 12.x or 13.x).
- Static Library APIs (GPUNetIO DMA, CommCh, RDMA): This library is pre-built with CUDA 13.0. Therefore, any application using functions from this static library must be built with CUDA 13.0 or newer.

It is generally recommended to use CUDA 12.6 or newer wherever possible to take advantage of new features.

To decrease initial application startup latency, it is highly recommended to enable NVIDIA driver persistence mode:

Copy
Copied!

            
            nvidia-smi -pm 1

GDRCopy Installation

To enable direct CPU access to GPU memory without using CUDA APIs, DOCA requires the GDRCopy kernel module and library.

Install necessary packages:

Copy
Copied!

            
            sudo apt install -y check kmod

Clone the GDRCopy repository:

Copy
Copied!

            
            git clone https://github.com/NVIDIA/gdrcopy.git /opt/mellanox/gdrcopy

Build GDRCopy:

Copy
Copied!

            
            cd /opt/mellanox/gdrcopy && make

Load the GDRCopy kernel module:

Copy
Copied!

            
            ./insmod.sh

Check if the gdrdrv and nvidia-peermem modules are loaded:

Copy
Copied!

            
            lsmod | egrep gdrdrv

Example output:

Copy
Copied!

            
            gdrdrv                 24576  0
nvidia              55726080  4 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset

Export the GDRCopy library path:

Copy
Copied!

            
            export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mellanox/gdrcopy/src

Ensure CUDA library paths are in the environment variables:

Copy
Copied!

            
            export PATH="/usr/local/cuda/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"

Note

GDRCopy is optional. If not installed, DOCA GPUNetIO cannot allocate memory using the DOCA_GPU_MEM_TYPE_GPU_CPU flag. If GDRCopy is not detected, DOCA GPUNetIO will log warning messages.

If GDRCopy is not required for your application, you can safely ignore the related warning messages. To use GDRCopy, ensure its installation path is included in the LD_LIBRARY_PATH environment variable or specified using the GDRCOPY_PATH_L environment variable.

GPU Memory Mapping

To enable the NIC to send and receive packets using GPU memory, a memory mapping mechanism must be used. DOCA supports two methods:

dmabuf (default method): The preferred, modern method for mapping GPU memory.
nvidia-peermem (fallback method): A legacy method used if dmabuf is not available or fails.

Using dmabuf

This is the primary method for mapping GPU memory. The prerequisites for this approach are:

Linux Kernel version 6.2 or later
libibverbs version 1.14.44 or later
CUDA Toolkit:
- Version 12.5 or older: Must be installed with the -m=kernel-open flag (implying open-source NVIDIA driver mode).
- Version 12.6 or newer: Open kernel mode is enabled by default.

Using nvidia-peermem

This method is used if dmabuf is unavailable. It requires the nvidia-peermem kernel module, which is installed with the CUDA Toolkit, to be loaded:

Launch nvidia-peermem

Copy
Copied!

            
            sudo modprobe nvidia-peermem

Implementation and Fallback Logic

The recommended implementation is to attempt to get a dmabuf file descriptor first. If that fails, the application should fall back to the nvidia-peermem method.

The following code snippet demonstrates how to use dmabuf for GPU memory mapping with DOCA mmap, including the fallback logic:

GPU Configuration

Copy
Copied!

            
            /* Get the dmabuf file-descriptor for the GPU memory buffer from CUDA */
result = doca_gpu_dmabuf_fd(gpu_dev, gpu_buffer_addr, gpu_buffer_size, &(dmabuf_fd));
 
if (result != DOCA_SUCCESS) {
    /* Fallback to nvidia-peermem legacy method if dmabuf fails */
    doca_mmap_set_memrange(gpu_buffer_mmap, gpu_buffer_addr, gpu_buffer_size);
} else {
    /* Create DOCA mmap using dmabuf */
    doca_mmap_set_dmabuf_memrange(gpu_buffer_mmap, dmabuf_fd, gpu_buffer_addr, 0, gpu_buffer_size);
}

Handling dmabuf Failure

A failure in doca_gpu_dmabuf_fd (the if block in the example) likely indicates that the NVIDIA driver is not in open-source mode.

When doca_mmap_start is subsequently called, DOCA will attempt to map the GPU memory. If dmabuf was not set, it will automatically fall back to the legacy nvidia-peermem method. In this case, the following warning message is logged:

GPU Configuration

Copy
Copied!

            
            [DOCA][WRN][linux_devx_adapter.cpp:374] devx adapter 0x5566a16018e0: Registration using dmabuf is not supported, falling back to legacy registration

Note

If your application can rely on nvidia-peermem and does not strictly require dmabuf, this warning message can be safely ignored.

Sample Implementations

GPUNetIO Ethernet samples use DOCA mmap with dmabuf and nvidia-peermem as the fallback (following the logic in the code example above).
GPUNetIO Verbs samples show an alternative verbs-based method, using ibv_reg_dmabuf_mr (for dmabuf) and ibv_reg_mr (as the fallback).

GPU BAR1 Size

Every time a GPU buffer is mapped to the NIC (e.g., buffers associated with send or receive queues), a portion of the GPU BAR1 mapping space is used. Therefore, it is important to check that the BAR1 mapping is large enough to hold all the bytes the DOCA GPUNetIO application is trying to map. To verify the BAR1 mapping space of a GPU you can use nvidia-smi:

BAR1 mapping

Copy
Copied!

            
            $ nvidia-smi -q
 
==============NVSMI LOG==============
.....
Attached GPUs                             : 1
GPU 00000000:CA:00.0
    Product Name                          : NVIDIA A100 80GB PCIe
    Product Architecture                  : Ampere
    Persistence Mode                      : Enabled
.....
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB

By default, some GPUs (e.g. RTX models) may have a very small BAR1 size:

BAR1 mapping

Copy
Copied!

            
            $ nvidia-smi -q | grep -i bar -A 3
	BAR1 Memory Usage
	Total : 256 MiB
	Used : 6 MiB
	Free : 250 MiB

If the BAR1 size is not enough, DOCA GPUNetIO applications may exit with errors because DOCA mmap fails to map the GPU memory buffers to the NIC (e.g., Failed to start mmap DOCA Driver call failure). To overcome this issue, the GPU BAR1 must be increased from the BIOS. The system should have "Resizable BAR" option enabled. For further information, refer to this NVIDIA forum post.

Running without Root Privileges

All DOCA GPUNetIO samples and applications using Ethernet rely on DOCA Flow. Therefore, they must be executed with sudo or root privileges.

However, Verbs, RDMA and DMA samples can be run without sudo privileges if a specific option is enabled in the NVIDIA driver:

Create a configuration file for the NVIDIA driver:

Copy
Copied!

            
            cat <<EOF | sudo tee /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
EOF

Perform a cold reboot to ensure the changes take effect.

Verify that the configuration has been applied using the following command:

Copy
Copied!

            
            $ grep RegistryDwords /proc/driver/nvidia/params

You should see the following output confirming the setting:

Copy
Copied!

            
            RegistryDwords: "PeerMappingOverride=1;"

Architecture

A DOCA GPUNetIO network application is split into two fundamental phases:

Configuration Phase (CPU): The CPU handles all initial setup, such as device configuration, memory allocation, and launching CUDA kernels.
Data Path Phase (GPU): The GPU and NIC interact directly to execute high-speed packet processing functions.

DOCA GPUNetIO provides the building blocks to create a full data path pipeline that runs entirely on the GPU, often in combination with other libraries like DOCA Ethernet, DOCA Verbs, DOCA RDMA or DOCA DMA.

Setup and Component Model

During the setup phase, the CPU-based application must:

Prepare all required objects (e.g., queues, contexts) on the CPU.
Export a GPU-specific handle for these objects.
Launch a CUDA kernel, passing the object's GPU handle to it so the kernel can work with the object during the data path phase.

This "CPU-setup, GPU-run" model is why DOCA GPUNetIO is composed of several distinct components:

libdoca_gpunetio.so (CPU Control Path) A shared library containing control-path functions. The CPU application uses these to prepare the GPU, allocate memory, and configure objects.
libdoca_gpunetio_device.a (GPU Data Path – Static Library) A static library containing data-path functions for GPUNetIO RDMA, GPUNetIO DMA, and GPUNetIO CommCh. These functions are invoked by the GPU from within a CUDA kernel.
doca_gpunetio_dev_*.cuh (GPU Data Path – Headers) A set of header files providing inline data-path functions for GPUNetIO Ethernet and GPUNetIO Verbs. These are compiled directly into the application's CUDA kernels.

The following diagram presents the typical flow:

image-2025-4-18_12-5-4-version-1-modificationdate-1763548717153-api-v2.png

Library Linking

The pkgconfig file for the CPU shared library is doca-gpunetio.pc.
There is no pkgconfig file for the GPU static library (libdoca_gpunetio_device.a). If your application requires these CUDA device functions, you must explicitly link this library.
- Default path: /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio_device.a

GDAKI Data Path for Different Protocols

DOCA GPUNetIO provides GPU GDAKI (GPU Direct Access Kernel Interface) functions to control objects for various transports and protocols that were created using other DOCA libraries. This section explains the correlation between DOCA GPUNetIO and these other libraries.

Ethernet GDAKI Communications

To enable GPU-accelerated communications over the Ethernet transport, an application must use a combination of three DOCA libraries:

DOCA GPUNetIO: For GPU-specific handles and data path functions.
DOCA Ethernet: To create and manage the underlying TX/RX queues.
DOCA Flow: To steer packets to the correct GPU-managed queues.

Control Path Phase: Initial CPU Configuration

Before any data path operations can occur on the GPU, the CPU must first configure all the necessary resources.

Create a DOCA Core device handler for the network card.
Create a DOCA GPUNetIO device handler for the GPU.
Use the DOCA Ethernet library to:
- Create the required Send Queues (TXQ) and/or Receive Queues (RXQ).
- Set the data path for these queue handlers to the GPU.
- Export a GPU-specific handle that represents these queues.
Use the DOCA Flow library to create and install flow steering rules that direct the desired types of packets to the newly created DOCA Ethernet receive queues.

Data Path Phase: GPU Kernel Execution

After the configuration phase is complete, the application can launch a CUDA Kernel, passing the GPU handles for the Ethernet queues as input arguments. This allows DOCA GPUNetIO CUDA device functions to operate directly on the queues from within the kernel.

All GPUNetIO Ethernet CUDA device functions are provided as inlined functions in the following header files:

doca_gpunetio_dev_eth_rxq.cuh
doca_gpunetio_dev_eth_txq.cuh

These functions are provided in two distinct APIs:

Low-level API: Provides fine-grained control over fundamental mlx5 elements, such as posting Work Queue Entries (WQEs), ringing the network card's doorbell, and polling for Completion Queue Entries (CQEs).
High-level API: Provides more complex, pre-packaged functions that implement advanced features:
- Shared Send QP: Allows a single Send Queue to be safely accessed concurrently by different CUDA threads, warps, or blocks.
- Cooperative Receive QP: Allows a single thread, all threads in a warp, or all threads in a block to cooperate for parallel packet reception from a single Receive Queue.
- Memory Consistency (MCST): A feature for pre-Hopper GPUs to manage memory mappings on the receive side.

Both APIs support CPU proxy mode, a fallback mechanism for systems where direct DoorBell ringing from the GPU is not possible.

Info

For examples of how to use both the high-level and low-level GPUNetIO Ethernet APIs, refer to the "GPUNetIO Samples" section.

Example Use Cases and Further Reading

Refer to the DOCA GPU Packet Processing Application Guide (doca_gpu_packet_processing) and samples (doca_gpunetio_simple_send, doca_gpunetio_simple_receive, doca_gpunetio_send_wait_time) for examples of Ethernet GPU communications.

Tip

For a deeper understanding of the underlying Ethernet send and receive structures, objects, and functions, refer to the DOCA Ethernet library documentation.

An example diagram when multiple queues and/or semaphores are used to receive Ethernet traffic:

image2023-4-3_18-18-20-version-1-modificationdate-1763548711390-api-v2.png

Receiving and dispatching packets to another CUDA kernel is not required. A simpler scenario can have a single CUDA kernel receiving and processing packets:

image2023-4-4_12-13-32-version-1-modificationdate-1763548711710-api-v2.png

RDMA Verbs GDAKI Communications (IBGDA)

DOCA GPUNetIO provides GPU data path functions for objects created with the DOCA RDMA and DOCA Verbs libraries. This enables GPU communications over RDMA transport protocols (IB or RoCE).

DOCA GPUNetIO and DOCA RDMA

This approach uses the high-level DOCA RDMA library, which abstracts most low-level mlx5 and IBVerbs details. The GPUNetIO CUDA data path functions follow a similarly high-level API.

Key characteristics:

Provides a high-level API for generic RDMA operations (Write, Send, Read, Recv).
Delivered as a closed-source CUDA static library (libdoca_gpunetio_device.a).
Does not include built-in shared queue management. Applications must manually manage simultaneous access to queues from different CUDA threads.
Optimized for basic GDAKI applications, requiring minimal knowledge of low-level IBVerbs or mlx5 architecture.

Weak vs. Strong Operation Modes

Some RDMA GPU functions offer two operation modes:

Weak Mode: The application is responsible for calculating the next available position in the queue.
- Helper functions (e.g., doca_gpu_rdma_get_info) provide the next available position and queue size mask (for index wrapping).
- The developer must specify the exact queue descriptor number, ensuring no descriptors are skipped.
- More complex, but offers better performance and allows developers to optimize for GPU memory coalescing.
Strong Mode: The GPU function automatically enqueues the RDMA operation in the next available position.
- Simpler to manage, as the developer does not need to track the position.
- May introduce extra latency due to atomic operations. It also does not guarantee that sequential operations use sequential memory locations.
  
  Note
  
  All strong mode functions operate at the CUDA block level. It is not possible to access the same RDMA queue from two different CUDA blocks simultaneously.

Configuration and Usage

Create a device handler for the network card using DOCA Core.
Create a GPU device handler for the GPU card using DOCA GPUNetIO.
Use DOCA RDMA to:
- Create send and/or receive queue handlers.
- Set the queue handlers' data path to the GPU.
- Export a GPU handler representing those queues.

After configuration, launch a CUDA Kernel, passing the GPU handlers for the RDMA queues as input arguments. Use the functions defined in doca_gpunetio_dev_rdma.cuh (starting with doca_gpu_dev_rdma_*) for RDMA communications in the kernel.

Example Use Cases

Refer to the doca_gpunetio_rdma_client_server_write sample for examples of GPUNetIO RDMA functions.

Tip

For a deeper understanding of RDMA operations, refer to the DOCA RDMA documentation.

DOCA GPUNetIO and DOCA Verbs

This approach uses the lower-level DOCA Verbs library. The GPUNetIO Verbs CUDA data path functions are provided as inlined functions in the doca_gpunetio_dev_verbs_*.cuh header files.

These functions are offered as two different APIs:

Low-level API: For direct manipulation of fundamental RDMA mlx5 elements, such as posting Work Queue Entries (WQEs), ringing doorbells, and polling Completion Queues (CQEs). This supports both one-sided (Read, Write, Atomic) and two-sided (Send, Recv) operations.
High-level API: More complex helper functions that implement common patterns:
- Shared QP: Allows a single QP to be safely accessed concurrently by different CUDA threads or warps.
- Combined Operations: Building blocks for concatenating multiple operations (e.g., put_signal, which combines an RDMA Write and an Atomic Fetch-and-Add).
- Memory Consistency (MCST): A feature for pre-Hopper GPUs to manage memory mappings on the RDMA Get or Receive side.

Both APIs support CPU proxy mode, a fallback mechanism for systems where direct DoorBell ringing from the GPU is not possible. The samples/doca_gpunetio/verbs_high_level.cpp file provides helper functions (e.g., doca_gpu_verbs_create_qp_hl()) that simplify the CPU-side setup for these Verbs QPs.

Warning

The GPUNetIO Verbs APIs are currently experimental. Please report any issues encountered to help improve code quality and robustness.

Configuration and Usage

Create a device handler for the network card using DOCA Core.
Create a GPU device handler for the GPU card using DOCA GPUNetIO.
Use DOCA Verbs to:
- Create send and/or receive queue handlers.
- Set the queue handlers' data path to the GPU.
- Export a GPU handler representing those queues.

After configuration, launch a CUDA Kernel, passing the GPU handlers for the Verbs queues as input arguments.

Example Use Cases

Refer to samples doca_gpunetio_verbs_* for examples of GPUNetIO Verbs functions.

Tip

For a deeper understanding of Verbs operations, refer to the DOCA Verbs documentation.

DMA GDAKI Memory Copies

To enable GPU-triggered memory copies using the DMA engine, an application requires DOCA GPUNetIO and DOCA DMA libraries.

Initial CPU Configuration Phase

Create a device handler for the network card using DOCA Core.
Create a GPU device handler for the GPU card using DOCA GPUNetIO.
Use DOCA DMA to:
- Create DMA queue handlers.
- Set queue handlers' data path on the GPU.
- Export a GPU handler representing those queues.

Data Path Phase on GPU

After completing the configuration phase, launch a CUDA Kernel, passing the GPU handlers for DMA queues as input arguments. This enables DOCA GPUNetIO CUDA device functions to operate within the CUDA Kernel.

For DMA memory copies, use functions defined in doca_gpunetio_dev_dma.cuh.

Example Use Case

Refer to the sample doca_gpunetio_dma_memcpy for an example of triggering DMA memory copies from a CUDA Kernel.

Tip

For a deeper understanding of DMA operations, refer to the DOCA DMA documentation.

API

This section details the specific structures and operations related to the main DOCA GPUNetIO API on CPU and GPU. GPUNetIO headers are:

doca_gpunetio.h – CPU functions to create GPU handlers, allocate GPU memory, etc..
doca_gpunetio_dev_eth_rxq.cuh – Open source GPU functions to manage a DOCA Ethernet receive queue
doca_gpunetio_dev_eth_txq.cuh – Open source GPU functions to manage a DOCA Ethernet send queue
doca_gpunetio_dev_verbs_*.cuh – Open source GPU functions to manage a DOCA Verbs object
doca_gpunetio_dev_buf.cuh – GPU functions to manage a DOCA buffer array
doca_gpunetio_dev_sem.cuh – GPU functions to manage a DOCA GPUNetIO semaphore
doca_gpunetio_dev_rdma.cuh – GPU functions to manage a DOCA RDMA queue
doca_gpunetio_dev_dma.cuh – GPU functions to manage a DOCA DMA queue

This section lists the main functions of DOCA GPUNetIO. All DOCA Core, Ethernet, Verbs, RDMA and DMA objects used in combination with GPUNetIO have a GPU export function to obtain a GPU handler for that object.

CPU Functions

In this section there is the list of DOCA GPUNetIO functions that can be used on the CPU only.

doca_gpu_mem_type

This enum lists all the possible memory types that can be allocated with GPUNetIO.

Copy
Copied!

            
            enum doca_gpu_mem_type {
    DOCA_GPU_MEM_TYPE_GPU        = 0,
    DOCA_GPU_MEM_TYPE_GPU_CPU    = 1,
    DOCA_GPU_MEM_TYPE_CPU_GPU    = 2,
};

Note

With regards to the syntax, the text string after the DOCA_GPU_MEM_TYPE_ prefix signifies <where-memory-resides>_<who-has-access>.

DOCA_GPU_MEM_TYPE_GPU – memory resides on the GPU and is accessible from the GPU only
DOCA_GPU_MEM_TYPE_GPU_CPU – memory resides on the GPU and is accessible also by the CPU
DOCA_GPU_MEM_TYPE_CPU_GPU – memory resides on the CPU and is accessible also by the GPU

Typical usage of the DOCA_GPU_MEM_TYPE_GPU_CPU memory type is to send a notification from the CPU to the GPU (e.g., a CUDA kernel periodically checking to see if the exit condition set by the CPU is met).

doca_gpu_create

This is the first function a GPUNetIO application must invoke to create an handler on a GPU device. The function initializes a pointer to a structure in memory with type struct doca_gpu *.

Copy
Copied!

            
            doca_error_t doca_gpu_create(const char *gpu_bus_id, struct doca_gpu **gpu_dev);

gpu_bus_id – <PCIe-bus>:<device>.<function> of the GPU device you want to use in your application
gpu_dev [out] – GPUNetIO handler to that GPU device

To get the PCIe address, users can use the commands lspci or nvidia-smi.

As GPUNetIO relies on CUDA Driver API only, it is mandatory to initialize a CUDA context before calling this function. This can be done with CUDA Driver API or with a simple CUDA Runtime call cudaFree(0).

doca_gpu_mem_alloc

This CPU function allocates different flavors of memory.

Copy
Copied!

            
            doca_error_t doca_gpu_mem_alloc(struct doca_gpu *gpu_dev, size_t size, size_t alignment, enum doca_gpu_mem_type mtype, void **memptr_gpu, void **memptr_cpu)

gpu_dev – GPUNetIO device handler
size – Size, in bytes, of the memory area to allocate
alignment – Memory address alignment to use. If 0, default one will be used
mtype – Type of memory to allocate
memptr_gpu [out] – GPU pointer to use to modify that memory from the GPU if memory is allocated on or is visible by the GPU
memptr_cpu[out] – CPU pointer to use to modify that memory from the CPU if memory is allocated on or is visible by the CPU. Can be NULL if memory is GPU-only

Warning

Make sure to use the right pointer on the right device! I f an application tries to access the memory using the memptr_gpu address from the CPU, a segmentation fault will result.

doca_gpu_semaphore_create

Creates a new instance of a DOCA GPUNetIO semaphore. A semaphore is composed by a list of items each having, by default, a status flag, number of packets, and the index of a doca_gpu_buf in a doca_gpu_buf_arr.

For example, a GPUNetIO semaphore can be used in applications where a CUDA kernel is responsible for receiving packets in a doca_gpu_buf_arr array associated with an Ethernet receive queue object, doca_gpu_eth_rxq (see section "doca_gpu_dev_eth_rxq_receive_*"), and dispatching packet info to a second CUDA kernel which processes them.

Another way to use a GPUNetIO semaphore is to exchange data across different entities like two CUDA kernels or a CUDA kernel and a CPU thread. The reason for this scenario may be that the CUDA kernel needs to provide the outcome of the packet processing to the CPU which would in turn compile a statistics report. Therefore, it is possible to associate a custom application-defined structure with each item in the semaphore. This way, the semaphore can be used as a message passing object.

image2023-4-18_12-6-25-version-1-modificationdate-1763548714663-api-v2.png

Entities communicating through a semaphore must adopt a poll/update mechanism according to the following logic:

Update:
1. Populate the next item of the semaphore (packets' info and/or custom application-defined info).
2. Set status flag to READY.
Poll:
1. Wait for the next item to have a status flag equal to READY.
2. Read and process info.
3. Set status flag to DONE.

Copy
Copied!

            
            doca_error_t doca_gpu_semaphore_create(struct doca_gpu *gpu_dev, struct doca_gpu_semaphore **semaphore)

gpu_dev – GPUNetIO handler
semaphore [out] – GPUNetIO semaphore handler associated to the GPU device

doca_gpu_semaphore_set_memory_type

This function defines the type of memory for the semaphore allocation.

Copy
Copied!

            
            doca_error_t doca_gpu_semaphore_set_memory_type(struct doca_gpu_semaphore *semaphore, enum doca_gpu_mem_type mtype)

semaphore – GPUNetIO semaphore handler
mtype – Type of memory to allocate the custom info structure
- If the application must share packet info only across CUDA kernels, then DOCA_GPU_MEM_GPU is the suggested memory type.
- If the application must share info from a CUDA kernel to a CPU (e.g., to report statistics or output of the pipeline computation), then DOCA_GPU_MEM_CPU_GPU is the suggested memory type

doca_gpu_semaphore_set_items_num

This function defines the number of items in a semaphore.

Copy
Copied!

            
            doca_error_t doca_gpu_semaphore_set_items_num(struct doca_gpu_semaphore *semaphore, uint32_t num_items)

semaphore – GPUNetIO semaphore handler
num_items – Number of items to allocate

doca_gpu_semaphore_set_custom_info

This function associates an application-specific structure to semaphore items as explained under "doca_gpu_semaphore_create".

Copy
Copied!

            
            doca_error_t doca_gpu_semaphore_set_custom_info(struct doca_gpu_semaphore *semaphore, uint32_t nbytes, enum doca_gpu_mem_type mtype)

semaphore – GPUNetIO semaphore handler
nbytes – Size of the custom info structure to associate
mtype – Type of memory to allocate the custom info structure
- If the application must share packet info only across CUDA kernels, then DOCA_GPU_MEM_GPU is the suggested memory type
- If the application must share info from a CUDA kernel to a CPU (e.g., to report statistics or output of the pipeline computation), then DOCA_GPU_MEM_CPU_GPU is the suggested memory type

doca_gpu_semaphore_get_status

From the CPU, query the status of a semaphore item. If the semaphore is allocated with DOCA_GPU_MEM_GPU, this function results in a segmentation fault.

Copy
Copied!

            
            doca_error_t doca_gpu_semaphore_get_status(struct doca_gpu_semaphore *semaphore_cpu, uint32_t idx, enum doca_gpu_semaphore_status *status)

semaphore_cpu – GPUNetIO semaphore CPU handler
idx – Semaphore item index
status [out] – Output semaphore status

doca_gpu_semaphore_get_custom_info_addr

From the CPU, retrieve the address of the custom info structure associated to a semaphore item. If the semaphore or the custom info is allocated with DOCA_GPU_MEM_GPU this function results in a segmentation fault.

Copy
Copied!

            
            doca_error_t doca_gpu_semaphore_get_custom_info_addr(struct doca_gpu_semaphore *semaphore_cpu, uint32_t idx, void **custom_info)

semaphore_cpu – GPUNetIO semaphore CPU handler
idx – Semaphore item index
custom_info [out] – Output semaphore custom info address

doca_gpu_verbs_export_qp

The doca_gpu_verbs_export_qp function creates a GPUNetIO handler from a DOCA Verbs QP object. It takes a DOCA Verbs QP as input and returns a DOCA GPUNetIO Verbs QP object ( struct doca_gpu_verbs_qp ) allocated on the CPU. To use this object in a CUDA kernel, the application must extract a GPU device handler ( struct doca_gpu_dev_verbs_qp ) using the doca_gpu_verbs_get_qp_dev function.

Copy
Copied!

            
            doca_error_t doca_gpu_verbs_export_qp(struct doca_gpu *gpu_dev, struct doca_dev *dev, struct doca_verbs_qp *qp, enum doca_gpu_dev_verbs_nic_handler nic_handler, void *gpu_qp_umem_dev_ptr, struct doca_verbs_cq *cq_sq, struct doca_verbs_cq *cq_rq, struct doca_gpu_verbs_qp **qp_out);

gpu_dev: GPUNetIO device handler.
dev: DOCA device handler.
qp: DOCA Verbs QP handler.
nic_handler: Type of NIC handler.
gpu_qp_umem_dev_ptr: GPU memory pointer to UMEM.
cq_sq and cq_rq: CQs associated with the Send and Receive Queues in the QP.

Note

While either cq_sq or cq_rq can be NULL, they cannot both be NULL simultaneously.
qp_out: DOCA GPUNetIO Verbs QP handler in CPU memory.

doca_gpu_verbs_get_qp_dev

Extracts a GPU device handler (struct doca_gpu_dev_verbs_qp) from a DOCA GPUNetIO Verbs QP object.

Copy
Copied!

            
            doca_error_t doca_gpu_verbs_get_qp_dev(struct doca_gpu_verbs_qp *qp, struct doca_gpu_dev_verbs_qp **qp_gpu);

qp: DOCA GPUNetIO Verbs QP handler.
qp_gpu: DOCA GPUNetIO Verbs QP GPU device handler in GPU memory.

doca_gpu_verbs_unexport_qp

Unexports a previously exported DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp).

Copy
Copied!

            
            doca_error_t doca_gpu_verbs_export_qp(ruct doca_gpu *gpu_dev, struct doca_gpu_verbs_qp *qp);

gpu_dev: GPUNetIO device handler.
qp: DOCA GPUNetIO Verbs QP handler in CPU memory.

doca_gpu_verbs_bridge_export_qp

The doca_gpu_verbs_bridge_export_qp function creates a DOCA GPUNetIO Verbs QP object from application-defined parameters, acting as a bridge between IBVerbs/mlx5 objects and DOCA GPUNetIO. This allows applications to create required objects like QP, CQ, and UAR using IBVerbs and mlx5 commands and then pass the relevant information to this DOCA GPUNetIO function.

The function returns a DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp) allocated on the CPU. To use this object in a CUDA kernel, the application must extract a GPU device handler (struct doca_gpu_dev_verbs_qp) using the doca_gpu_verbs_get_qp_dev function.

It is the application's responsibility to ensure that all passed parameters are correctly created and set.

Copy
Copied!

            
            doca_error_t doca_gpu_verbs_bridge_export_qp(struct doca_gpu *gpu_dev, uint32_t sq_qpn, void *sq_wqe_addr, uint16_t sq_wqe_num, uint32_t *sq_dbrec, uint64_t *sq_db, size_t uar_size, uint32_t sq_cqn, void *sq_cqe_addr, uint32_t sq_cqe_num, uint32_t *sq_cq_dbrec, uint32_t rq_qpn, void *rq_wqe_addr, uint16_t rq_wqe_num, uint32_t *rq_dbrec, uint32_t rcv_wqe_size, uint32_t rq_cqn, void *rq_cqe_addr, uint32_t rq_cqe_num, uint32_t *rq_cq_dbrec, enum doca_gpu_dev_verbs_nic_handler nic_handler, struct doca_gpu_verbs_qp **qp_out);

gpu_dev: GPUNetIO device handler.
sq_qpn: Send QP queue number.
sq_wqe_addr: Send QP WQE buffer memory address.
sq_wqe_num: Send QP number of WQEs.
sq_dbrec: Send QP Doorbell Record address.
sq_db: Send QP Doorbell address.
uar_size: Size of UAR.
sq_cqn: Send CQ number.
sq_cqe_addr: Send CQ CQE buffer memory address.
sq_cqe_num: Send CQ number of CQEs.
qp_out: DOCA GPUNetIO Verbs QP handler in CPU memory.

DOCA PE

A DOCA Ethernet Txq context, exported for GPUNetIO usage, can be tracked via DOCA PE on the CPU side to check if there are errors when sending packets or to retrieve notification info after sending a packet with any of the doca_gpu_dev_eth_txq_*_enqueue_* functions on the GPU. An example can be found in the DOCA GPU packet processing application with ICMP traffic.

A DOCA Comch Producer or Consumer context, exported for GPUNetIO usage must still be attached to a DOCA PE on the CPU to ensure that remote producer and consumer connections and disconnections are handled properly.

GPU Functions – Ethernet

This section provides a list of DOCA GPUNetIO functions that can be used for Ethernet GDAKI network operations within GPU CUDA kernels.

Header Files

doca_gpunetio_eth_def.h: Contains constants, enums, and structure definitions.
doca_gpunetio_dev_eth_common.cuh: Provides CUDA utility functions common to both Txq and Rxq header files.
doca_gpunetio_dev_eth_txq.cuh: Provides CUDA functions for Ethernet send operations. This includes both high-level (with shared QP feature) and low-level (post WQE, ring DB, poll CQ, etc.) functions.
doca_gpunetio_dev_eth_rxq.cuh: Provides CUDA functions for Ethernet receive operations.

Note

The new GPUNetIO Ethernet APIs are experimental. Please report any issues encountered during use, as your feedback is crucial for enhancing the code's quality and robustness.

Note

Refer to the header files for a complete list of GPUNetIO Ethernet functions. This guide covers only the most essential APIs.

Note

The doca_gpunetio_eth_def.h file contains the macro #define DOCA_GPUNETIO_ETH_ENABLE_DEBUG 0. This default setting disables debug prints, including CQE errors. To enable error messages, set this macro to 1.

Execution Scope and Shared Queue

The execution scope for high-level GPUNetIO Ethernet APIs (thread, warp, or block) is specified via the doca_gpu_dev_eth_exec_scope enum as a template argument. This tells the function how many threads are participating in the operation.

Send (Txq): The high-level send API supports the shared queue feature. This allows different threads, belonging to different scopes, to concurrently access the same Txq without race conditions.
Receive (Rxq): The high-level receive API does not support the shared queue feature. If a thread, warp, or block is using an Rxq, other threads, warps, or blocks cannot use the same Rxq in parallel.

The following subsections detail the send execution scopes.

Thread Scope (DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD)

Each thread acts as a standalone entity.
Each thread posts one or more WQEs and is responsible for its own submit (doorbell ring).
Performance: This is the slowest approach due to high contention on atomic operations.

image-2025-10-27_11-39-59-version-1-modificationdate-1763548703080-api-v2.png

Warp Scope (DOCA_GPUNETIO_ETH_EXEC_SCOPE_WARP)

All threads within the warp must call the function.
Each thread posts a WQE to a different location, but only the first thread (lane_idx 0) performs the submit operation.
Performance: Reduces contention to a per-warp basis.

image-2025-10-27_11-38-9-version-1-modificationdate-1763548703397-api-v2.png

Block Scope (DOCA_GPUNETIO_ETH_EXEC_SCOPE_BLOCK)

All threads within the block must call the function.
Each thread posts a WQE to a different location, but only the first thread (threadIdx 0) performs the submit operation.
Performance: Reduces contention to a per-block basis.

Ring DB and CPU Proxy

The doca_gpu_dev_eth_nic_handler enum, passed as a template argument, determines who is responsible for ringing the network card's doorbell.

The available modes are:

DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO: Automatically detects the best doorbell ringing option.
DOCA_GPUNETIO_ETH_NIC_HANDLER_CPU_PROXY: Enables CPU proxy mode. GPU submit functions provide information to the CPU, which then rings the doorbell.
DOCA_GPUNETIO_ETH_NIC_HANDLER_GPU_SM_DB: Enables regular GDAKI mode, where CUDA threads ring the doorbell directly.

Note

When CPU_PROXY mode is enabled, a CPU thread must invoke doca_eth_txq_gpu_cpu_proxy_progress in a loop to detect the GPU's information and ring the doorbell. The GPUNetIO Ethernet Simple Send sample demonstrates this feature.

Memory Consistency (MCST) Algorithm

During a receive operation, the network card writes packet data via PCIe to the application's mapped memory. For pre-Hopper GPUs (see CUGPUDirectRDMAWritesOrdering) , memory consistency (MCST) must be ensured when a CUDA kernel is receiving data into GPU memory.

DOCA GPUNetIO provides two ways to enable the MCST algorithm:

CPU configuration phase: When creating the DOCA Ethernet Rxq, use the function doca_eth_rxq_gpu_enable_mcst_qp to create an internal queue dedicated to the MCST algorithm.

Copy
Copied!

            
            cudaGetDeviceProperties(&prop, cuda_id);
// If pre-Hopper GPU with __CUDA_ARCH__ < 900
if (prop.major < 9)
    doca_eth_rxq_gpu_enable_mcst_qp(rxq->eth_rxq_cpu);

GPU data path phase: Pass the doca_gpu_dev_eth_mcst_mode enum as a template argument to the recv function.

Receive, High-level Function: doca_gpu_dev_eth_rxq_recv

This function receives Ethernet packets in a CUDA kernel and is available in different flavors for per-thread, per-warp, or per-block scopes.

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_exec_scope exec_scope = DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD,
      enum doca_gpu_dev_eth_mcst_mode mcst_mode = DOCA_GPUNETIO_ETH_MCST_DISABLED,
      enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
      enum doca_gpu_dev_eth_rxq_attr_type rx_attr = DOCA_GPUNETIO_ETH_RX_ATTR_NONE>
__device__ inline doca_error_t doca_gpu_dev_eth_rxq_recv(struct doca_gpu_eth_rxq *rxq,
                            uint32_t max_rx_pkts,
                            uint64_t timeout_ns,
                            volatile uint64_t *out_first_pkt_idx,
                            uint32_t *out_pkt_num,
                            struct doca_gpu_dev_eth_rxq_attr *out_attr);

Parameters:

rxq: The Ethernet receive queue GPU handler.
max_rx_pkts: Maximum number of packets to receive. It must be a multiple of number of threads in the selected scope. The function guarantees the returned packet count is <= this number.
timeout_ns: Nanoseconds to wait for packets before returning.
out_first_pkt_idx [out]: Index of the first packet received. For block or warp scopes, this variable must be in shared or global memory to be visible to all threads.
out_pkt_num [out]: The actual number of packets received. For block or warp scopes, this variable must be in shared or global memory.
out_attr [out]: If enable_attributes is true, this structure is filled with per-packet attributes. The application must ensure this array is large enough to hold attributes for max_rx_pkts.

Notes

If both max_rx_pkts and timeout_ns are set to 0, the function will hang.
For BLOCK or WARP scopes, the max_rx_pkts value should be a multiple of the number of threads in the scope (e.g., blockDim.x or warpSize), as every thread will try to receive at least once.
This function does not support the shared QP feature. Access to the rxq must be exclusive to the calling thread, warp, or block. No other scope can work on the same rxq in parallel.

Packet Indexing and Circular Buffer

When the Rxq is created on the CPU, its GPU memory is divided into fixed-size strides (based on max_packet_size). The doca_gpu_dev_eth_rxq_recv function guarantees that these strides are filled sequentially with new packets.

image-2025-10-27_15-43-45-version-1-modificationdate-1763548702767-api-v2.png

The function returns the following output parameters to identify the received packets:

out_first_packet_idx: The index of the first stride (packet) filled.
out_pkt_num: The total number of packets received.

For example, if out_first_packet_idx is X and out_pkt_num is Y, the function has received Y packets, filling the GPU memory buffer strides from index X to (X + Y - 1).

Note

The output parameters out_first_packet_idx and out_pkt_num must be visible to all threads in the scope (e.g., via CUDA shared memory for warp and block scopes).

The receive buffer is treated as a circular buffer. Once the last stride is filled, the queue wraps around to the first stride.

Note

It is the application's responsibility to consume packets before they are overwritten by the queue wrapping around. This requires proper dimensioning of the Rxq GPU buffer and scaling across multiple receive queues.

Receive: doca_gpu_dev_eth_rxq_get_pkt_addr

This utility function retrieves the memory address of a specific packet in a specific stride.

Copy
Copied!

            
            __device__ inline uint64_t doca_gpu_dev_eth_rxq_get_pkt_addr(struct doca_gpu_eth_rxq *rxq, uint64_t packet_idx);

Parameter:

rxq: Ethernet receive queue GPU handler.
packet_idx: Index of the stride where the packet resides.

Receive: doca_gpu_dev_eth_rxq_get_pkt_ts

This utility function retrieves the timestamp from the Completion Queue Entry (CQE) associated with a specific packet index.

Copy
Copied!

            
            __device__ inline uint64_t doca_gpu_dev_eth_rxq_get_pkt_ts(struct doca_gpu_eth_rxq *rxq, uint64_t packet_idx);

Parameters:

rxq: GPU handle for the Ethernet receive queue.
packet_idx: Index of the stride where the packet resides.

This function provides an alternative to obtaining timestamps directly via doca_gpu_dev_eth_rxq_recv (when using the DOCA_GPUNETIO_ETH_RX_ATTR_TS or DOCA_GPUNETIO_ETH_RX_ATTR_ALL template attributes).

Note

The receive queue operates as a circular buffer. If this function is called after a significant delay (specifically, after the queue has wrapped around), the CQE associated with packet_idx may be overwritten by a newer packet. In such cases, the returned timestamp will be incorrect.

Send, High-level Function: doca_gpu_dev_eth_txq_send

This function sends Ethernet packets from a CUDA kernel, with flavors for per-thread, per-warp, or per-block scopes. It enables the shared queue feature, allowing concurrent access to the same Txq.

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
	  enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
	  enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
	  enum doca_gpu_dev_eth_exec_scope exec_scope = DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD>
__device__ static inline void doca_gpu_dev_eth_txq_send(struct doca_gpu_eth_txq *txq,
							uint64_t addr,
							uint32_t mkey,
							size_t size,
							enum doca_gpu_eth_send_flags flags,
							doca_gpu_dev_eth_ticket_t *out_ticket)

Parameters:

txq: Ethernet send queue GPU handler.
addr: Memory address of the packet to send.
mkey: Memory key of the packet.
size: Size of the packet in bytes.
flags: Send flags from the doca_gpu_eth_send_flags enum.
out_ticket: Returns the WQE index (position) in the send queue where the packet was posted.

Send, High-level Function: doca_gpu_dev_eth_txq_wait_send

This function combines the "Accurate Send Scheduling" feature (wait on timestamp) with the send operation. The first thread in the scope posts a "wait on time" barrier, and then all threads in the scope post their send. This function also enables the shared queue feature.

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
      enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
      enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
      enum doca_gpu_dev_eth_exec_scope exec_scope = DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD>
__device__ static inline void doca_gpu_dev_eth_txq_wait_send(struct doca_gpu_eth_txq *txq,
                                const uint64_t wait_on_time_ts,
                                uint64_t addr,
                                uint32_t mkey,
                                size_t size,
                                enum doca_gpu_eth_send_flags flags,
                                doca_gpu_dev_eth_ticket_t *out_ticket);

Parameters:

txq: Ethernet send queue GPU handler.
wqe_ptr: Memory pointer to the WQE (can be obtained via doca_gpu_dev_eth_txq_get_wqe_ptr).
wqe_idx: Index of the WQE.
addr: Packet memory address.
mkey: Packet memory key.
nbytes: Packet size in bytes.
flags: Send flags from the doca_gpu_eth_send_flags enum.

Send, Low-level Function: doca_gpu_dev_eth_txq_wqe_prepare_send

This low-level function posts a send WQE. It is used by the high-level functions but can also be called directly by the application.

Copy
Copied!

            
            __device__ __inline__ static doca_error_t doca_gpu_dev_eth_txq_wqe_prepare_send(
    const struct doca_gpu_eth_txq *txq,
    struct doca_gpu_dev_eth_txq_wqe *wqe_ptr,
    const uint16_t wqe_idx,
    const uint64_t addr,
    const uint32_t mkey,
    const uint32_t nbytes,
    const enum doca_gpu_eth_send_flags flags);

Parameters:

txq: Ethernet send queue GPU handler.
wqe_ptr: Memory pointer to the WQE (can be obtained via doca_gpu_dev_eth_txq_get_wqe_ptr).
wqe_idx: Index of the WQE.
addr: Packet memory address.
mkey: Packet memory key.
nbytes: Packet size in bytes.
flags: Send flags from the doca_gpu_eth_send_flags enum.

Send, Low-level Function: doca_gpu_dev_eth_txq_wqe_prepare_wait_time

This low-level function posts a "wait on time" WQE to create a time barrier. It is used by doca_gpu_dev_eth_txq_wait_send but can also be called directly.

Copy
Copied!

            
            __device__ __inline__ static doca_error_t doca_gpu_dev_eth_txq_wqe_prepare_wait_time(
    const struct doca_gpu_eth_txq *txq,
    struct doca_gpu_dev_eth_txq_wqe *wqe_ptr,
    const uint16_t wqe_idx,
    const uint64_t wait_on_time_ts,
    const enum doca_gpu_eth_send_flags flags);

Parameters:

txq: Ethernet send queue GPU handler.
wqe_ptr: Memory pointer to the WQE (can be obtained via doca_gpu_dev_eth_txq_get_wqe_ptr).
wqe_idx: Index of the WQE.
wait_on_time_ts: Timestamp barrier. The timestamp can be calculated on the GPU (via doca_gpu_dev_eth_txq_calculate_timestamp) or CPU (via doca_eth_txq_calculate_timestamp).
flags: Send flags from the doca_gpu_eth_send_flags enum.

Send, Low-level Function: doca_gpu_dev_eth_txq_submit

This function "rings the doorbell" to notify the network card to execute all WQEs posted since the last submit. It supports the shared queue feature; internal atomics allow multiple threads to call it in parallel, and the doorbell register will be updated only by the thread with the greatest WQE index. It also allows choosing between GPU and CPU Proxy handlers via the template parameter.

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
      enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
      enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO>
__device__ static inline void doca_gpu_dev_eth_txq_submit(struct doca_gpu_eth_txq *txq, uint64_t prod_index);

Parameters:

txq: Ethernet send queue GPU handler.
prod_index: The Producer Index (WQE index) to write into the network card's doorbell register.

Send, Low-level Functions: ...update_dbr, ...ring_db, and ...submit_proxy

These functions can replace doca_gpu_dev_eth_txq_submit in special situations, such as when the application has a well-determined send pattern and does not need atomics for shared queue access. The typical sequence is doca_gpu_dev_eth_txq_update_dbr (updates DBREC) followed by doca_gpu_dev_eth_txq_ring_db (updates Doorbell Register).

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline void doca_gpu_dev_eth_txq_update_dbr(struct doca_gpu_eth_txq *txq, uint32_t prod_index);

Parameters:

txq: Ethernet send queue GPU handler.
prod_index: The WQE index to write into the network card DBREC.

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline void doca_gpu_dev_eth_txq_ring_db(struct doca_gpu_eth_txq *txq, uint64_t prod_index);

Parameters:

txq: Ethernet send queue GPU handler.
prod_index: The WQE index to write into the network card doorbell register.

doca_gpu_dev_eth_txq_ring_db works for direct GPU doorbell ringing. To ring the doorbell via CPU proxy, doca_gpu_dev_eth_txq_submit_proxy must be used instead:

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
      enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline void doca_gpu_dev_eth_txq_submit_proxy(struct doca_gpu_eth_txq *txq, uint64_t prod_index);

Parameter:

txq: Ethernet send queue GPU handler.
prod_index: The WQE index to write into the CPU proxy shared memory.

Send, Low-level Functions: Poll Completion

These functions allow waiting for send operations to complete. A Completion (CQE) is generated only if the DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY flag was used when posting the WQE.

doca_gpu_dev_eth_txq_poll_completion waits for a number of CQEs. It operates at thread scope but supports the shared queue feature (each thread locks num_cqe slots to poll).

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_cq_poll_mode cqe_poll = DOCA_GPUNETIO_ETH_CQ_POLL_ALL,
      enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
      enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline doca_error_t doca_gpu_dev_eth_txq_poll_completion(struct doca_gpu_eth_txq *txq,
                                         const uint32_t num_cqe,
                                         enum doca_gpu_dev_eth_wait_flags wait_mode,
                                         uint32_t *num_completed);

Parameters:

txq: Ethernet send queue GPU handler.
num_cqe: Number of CQEs to wait for.
wait_mode: Wait in blocking or non-blocking mode (flags from doca_gpu_dev_eth_wait_flags).
num_completed: Number of CQEs detected.

Note

When num_cqe > 1, setting the cqe_poll template to DOCA_GPUNETIO_ETH_CQ_POLL_ALL can reduce latency, as the function will only poll the latest CQE (num_cqe - 1) instead of polling sequentially.

doca_gpu_dev_eth_txq_poll_completion_at allows the application to poll for a specific CQE position. It does not support the shared queue feature.

Copy
Copied!

            
            template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
      enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline doca_error_t doca_gpu_dev_eth_txq_poll_completion_at(struct doca_gpu_eth_txq *txq,
                                             const uint64_t cqe_idx,
                                             enum doca_gpu_dev_eth_wait_flags wait_mode);

Parameters:

txq: Ethernet send queue GPU handler.
cqe_idx: Index of the CQE to poll.
wait_mode: Wait in blocking or non-blocking mode (flags from doca_gpu_dev_eth_wait_flags).

GPU Functions – Verbs

GPUNetIO Verbs data path functions are defined as inline implementations in header files named doca_gpunetio_dev_verbs_*.

Header Files

doca_gpunetio_verbs_def.h: Contains constants, enums, and structure definitions.
doca_gpunetio_dev_verbs_qp.cuh: Provides CUDA functions for interacting with a QP, including WQE creation and ringing the network card doorbell.
doca_gpunetio_dev_verbs_cq.cuh: Offers CUDA functions for interacting with a CQ, including CQE polling and validation.
doca_gpunetio_dev_verbs_onesided.cuh: Enables one-sided operations (e.g., put) with the shared QP feature at the CUDA thread and warp levels.
doca_gpunetio_dev_verbs_twosided.cuh: Enables two-sided operations (e.g., send/recv) with the shared QP feature at the CUDA thread and warp levels.
doca_gpunetio_dev_verbs_counter.cuh: Provides CUDA functions for counter-like operations via RDMA atomics with the shared QP feature. This requires "Core Direct" to be enabled as a QP attribute.

Note

The GPUNetIO Verbs APIs are currently experimental. Please report any issues encountered during use, as your feedback is crucial for enhancing the code's quality and robustness.

Note

When using the GPUNetIO Verbs header files, avoid calling functions that start with doca_priv_*. These are internal functions intended for use by the public API only.

Note

The doca_gpunetio_verbs_def.h file contains the macro #define DOCA_GPUNETIO_VERBS_ENABLE_DEBUG 0. This default setting disables debug prints, including CQE errors. To enable error messages, set this macro to 1.

Execution Scope and Shared Queue

The high-level GPUNetIO Verbs API can be executed at a thread or warp scope by specifying a value from the doca_gpu_dev_verbs_exec_scope enum as a template argument. This tells the function how many threads are participating in the operation.

The high-level API also supports the shared queue feature, allowing different threads from different scopes to concurrently access the same queue without race conditions.

Thread Scope (DOCA_GPUNETIO_VERBS_EXEC_SCOPE_THREAD)

Each thread acts as a standalone entity, posting WQEs and ringing the doorbell.
This is the slowest approach due to high contention on atomic operations.

image-2025-7-25_11-29-39-version-1-modificationdate-1763548705207-api-v2.png

Warp Scope (DOCA_GPUNETIO_VERBS_EXEC_SCOPE_WARP)

All threads within the warp must call the function.
Each thread posts a WQE to a different location, but only the first thread (lane_idx 0) performs the submit (doorbell ring).
This reduces contention to a per-warp basis.

image-2025-7-25_11-38-12-version-1-modificationdate-1763548704620-api-v2.png

Ring DB and CPU Proxy

The doca_gpu_dev_verbs_nic_handler enum, passed as a template argument, determines who is responsible for ringing the doorbell.

The available modes are:

DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO: Automatically detects the best doorbell ringing option.
DOCA_GPUNETIO_VERBS_NIC_HANDLER_CPU_PROXY: Enables CPU proxy mode. GPU submit functions provide information to the CPU, which then rings the doorbell.
DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_DB: Enables regular GDAKI mode, where CUDA threads ring the doorbell directly.
DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_BF: Experimental BlueFlame doorbell ringing mode (may fail in the current release).

Note

CPU Proxy Mode When CPU_PROXY mode is enabled, a CPU thread must invoke doca_gpu_verbs_cpu_proxy_progress in a loop to detect the GPU's information and ring the doorbell. This mode can be enabled via a command-line option in most GPUNetIO Verbs samples .

Memory Consistency (MCST) Algorithm

During an RDMA Read or Receive operation, the network card writes data via PCIe into the application's mapped memory. For pre-Hopper GPUs (see CUGPUDirectRDMAWritesOrdering) , memory consistency (MCST) must be ensured when a CUDA kernel is receiving data into GPU memory.

DOCA GPUNetIO Verbs applications can enable the MCST algorithm on the GPU by setting the doca_gpu_dev_verbs_mcst_mode mcst_mode template parameter. This parameter is available in functions such as doca_gpu_dev_verbs_get, doca_gpu_dev_verbs_get_wait, and doca_gpu_dev_verbs_recv_wait.

The gpunetio_verbs_twosided_bw sample demonstrates this by checking the CUDA architecture:

Copy
Copied!

            
            #if __CUDA_ARCH__ < 900
	doca_gpu_dev_verbs_recv_wait<DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU, DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO, DOCA_GPUNETIO_VERBS_MCST_ENABLED>(
								qp, doca_gpu_dev_verbs_addr{.addr = (uint64_t)dump_flag, .key = dump_flag_mkey});
#else
	doca_gpu_dev_verbs_recv_wait<DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU, DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO, DOCA_GPUNETIO_VERBS_MCST_DISABLED>(
								qp, doca_gpu_dev_verbs_addr{.addr = 0, .key = 0});
#endif

Queue Type

Many functions (e.g., submit doorbell, poll CQ) can apply to either the Send Queue or Receive Queue of a QP. A template argument from the doca_gpu_dev_verbs_qp_type enum is used to specify which queue to use.

DOCA_GPUNETIO_VERBS_QP_SQ: Uses the Send Queue from the input QP.
DOCA_GPUNETIO_VERBS_QP_RQ: Uses the Receive Queue from the input QP.

Receive CQE with Inline Data

When initializing a DOCA Verbs Queue Pair (QP), you can configure the Receive Queue (RQ) to store up to 32 bytes of payload directly within the CQE using the doca_verbs_qp_init_attr_set_cqe_inline function.

This feature eliminates the need to allocate and register a separate receive buffer for small messages (32 bytes or less). Instead, the data is retrieved directly from the CQE, reducing memory overhead and potentially improving latency.

The following GPU device functions are provided to handle inline data:

doca_gpu_dev_verbs_cqe_is_inline: Checks if the current CQE contains inline data.
doca_gpu_dev_verbs_cqe_get_inl_data: Retrieves a pointer to the inline data within the CQE.

A reference implementation is available in the doca_gpunetio_verbs_twosided_bw sample. You can enable or disable the CQE inline feature using the -x command-line option.

Header: doca_gpunetio_dev_verbs_qp.cuh

This header provides functions for QP manipulation, categorized into three groups:

doca_gpu_dev_verbs_wqe_prepare_*: Prepares and posts various RDMA WQEs (Write, Read, Send, Recv, Atomic, Wait, Dump) on QP memory.
doca_gpu_dev_verbs_submit: Updates network card registers for doorbell submission (regular, CPU proxy, and BlueFlame) for both SQ and RQ.
doca_gpu_dev_verbs_wait_*: Waits for CQEs to arrive on the SQ or RQ's completion queue.

For detailed explanations, refer to the header file. Examples can be found in the doca_gpunetio_verbs_write_bw and doca_gpunetio_verbs_write_lat samples.

Header: doca_gpunetio_dev_verbs_cq.cuh

This header provides functions for manipulating Completion Queues (CQs). Specifically, the doca_gpu_dev_verbs_poll_* functions are used to poll CQEs, waiting for the completion of WQEs in the connected QP.

Header: doca_gpunetio_dev_verbs_onesided.cuh

This header provides shared-QP-aware functions for one-sided RDMA operations at the thread and warp levels.

doca_gpu_dev_verbs_put: Posts an RDMA Write WQE and submits. In WARP scope, only the first thread submits.
doca_gpu_dev_verbs_put_signal: Similar to put, but adds an RDMA Atomic Fetch and Add. In THREAD scope, each thread posts a Write, an Atomic, and submits. In WARP scope, each thread posts a Write, but only the first thread posts the Atomic and submits.
doca_gpu_dev_verbs_get: Posts an RDMA Read WQE and submits. In WARP scope, only the first thread submits. Optionally posts a Dump WQE to ensure data consistency (e.g., ensures Read data is present in memory after CQE arrival, e.g., enum CUGPUDirectRDMAWritesOrdering).

For details, refer to the header file. Examples are in the doca_gpunetio_verbs_put_bw, doca_gpunetio_verbs_put_signal_bw, and doca_gpunetio_verbs_put_signal_lat samples.

Header: doca_gpunetio_dev_verbs_twosided.cuh

This header provides shared-QP-aware functions for two-sided RDMA operations at the thread and warp levels.

doca_gpu_dev_verbs_send: Posts an RDMA Send WQE and submits. In WARP scope, only the first thread submits.
doca_gpu_dev_verbs_recv: Posts an RDMA Receive WQE and submits. In WARP scope, only the first thread submits.

Requirement flow for two-sided communication:

Peer A posts one or more RDMA Receive WQEs.
Peer A notifies Peer B that it is ready to receive.
Peer B waits for this notification.
Peer B posts one or more RDMA Send WQEs.

The notification can be sent via RDMA Write or Atomic. An example using RDMA Write is in the doca_gpunetio_verbs_twosided_bw sample.

Header: doca_gpunetio_dev_verbs_counter.cuh

This header provides shared-QP functions that use the counter feature. This allows a CQE arrival on a main QP to trigger a WQE on a companion QP. This requires Core Direct to be enabled via doca_verbs_qp_init_attr_set_core_direct_master().

doca_gpu_dev_verbs_put_counter: Posts an RDMA Write on the main QP and an RDMA Atomic FetchAdd on the companion QP. The Atomic is executed only after the Write's CQE arrives, allowing for buffer reuse.
doca_gpu_dev_verbs_submit_multi_qps: Submits both the main and companion QP in a single call.

For details, refer to the header file. Examples are in the doca_gpunetio_verbs_put_counter_bw and doca_gpunetio_verbs_put_counter_lat samples.

GPU Functions – RDMA

This section provides a list of DOCA GPUNetIO functions that can be used on the GPU only within a CUDA kernel to execute RDMA operations. These functions offer a strong and a weak mode.

Copy
Copied!

            
            __device__ doca_error_t __device__ doca_error_t doca_gpu_dev_rdma_get_info(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, uint32_t *curr_position, uint32_t *mask_max_position)

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
curr_position – Next available position in the queue
mask_max_position – Mask of the total number of positions in the queue

Copy
Copied!

            
            __device__ doca_error_t __device__ doca_error_t doca_gpu_dev_rdma_recv_get_info(struct doca_gpu_dev_rdma_r *rdma_r, uint32_t *curr_position, uint32_t *mask_max_position)

rdma_r – RDMA receive queue GPU handler
curr_position – Next available position in the queue
mask_max_position – Mask of the total number of positions in the queue

doca_gpu_dev_rdma_write_*

To RDMA write data onto a remote memory location from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_write_strong(struct doca_gpu_dev_rdma *rdma,
														uint32_t connection_index,
														struct doca_gpu_buf *remote_buf, uint64_t remote_offset,
														struct doca_gpu_buf *local_buf, uint64_t local_offset,
														size_t length, uint32_t imm,
														const enum doca_gpu_dev_rdma_write_flags flags)

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to write data to
remote_offset – Offset, in bytes, to write data to in the remote buffer
local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to write
local_offset – Offset, in bytes, to fetch data from in the local buffer
length – Number of bytes to write
imm – Immediate value uint32_t
flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_write_weak(struct doca_gpu_dev_rdma *rdma,
													uint32_t connection_index,
													struct doca_gpu_buf *remote_buf, uint64_t remote_offset,
													struct doca_gpu_buf *local_buf, uint64_t local_offset,
													size_t length, uint32_t imm,
													const enum doca_gpu_dev_rdma_write_flags flags,
													uint32_t position);

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to write data to
remote_offset – Offset, in bytes, to write data to in the remote buffer
local_buf – Local DOCA buffer from a DOCA GPU buffer array where to fetch data to write
local_offset – Offset, in bytes, to fetch data in the local buffer
length – Number of bytes to write
imm – Immediate value uint32_t
flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum
position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_read_*

To RDMA read data onto a remote memory location from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes to enqueue operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_read_strong(struct doca_gpu_dev_rdma *rdma,
														uint32_t connection_index,
														struct doca_gpu_buf *remote_buf, uint64_t remote_offset,
														struct doca_gpu_buf *local_buf, uint64_t local_offset,
														size_t length,
														const uint32_t flags_bitmask)

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
remote_buf – Remote DOCA buffer from a DOCA GPU buffer array where to read data
remote_offset – Offset in bytes to read data to in the remote buffer
local_buf – Local DOCA buffer from a DOCA GPU buffer array where to store remote data
local_offset – Offset in bytes to store data in the local buffer
length – Number of bytes to be read
flags_bitmask – Must be 0; reserved for future use

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_read_weak(struct doca_gpu_dev_rdma *rdma,
													uint32_t connection_index, 
													struct doca_gpu_buf *remote_buf, uint64_t remote_offset,
													struct doca_gpu_buf *local_buf, uint64_t local_offset,
													size_t length,
													const uint32_t flags_bitmask,
													uint32_t position);

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
remote_buf – Remote DOCA buffer from a DOCA GPU buffer array where to read data
remote_offset – Offset in bytes to read data to in the remote buffer
local_buf – Local DOCA buffer from a DOCA GPU buffer array where to store remote data
local_offset – Offset in bytes to store data in the local buffer
length – Number of bytes to be read
flags_bitmask – Must be 0; reserved for future use
position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_send_*

To RDMA send data from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_send_strong(struct doca_gpu_dev_rdma *rdma,
 														uint32_t connection_index,
														struct doca_gpu_buf *local_buf, uint64_t local_offset,
														size_t length, uint32_t imm,
														const enum doca_gpu_dev_rdma_write_flags flags)

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send
local_offset – Offset in bytes to fetch data in the local buffer
length – Number of bytes to send
imm – Immediate value uint32_t
flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_send_weak(struct doca_gpu_dev_rdma *rdma,
  													uint32_t connection_index, 
													struct doca_gpu_buf *local_buf, uint64_t local_offset,
													size_t length, uint32_t imm,
													const enum doca_gpu_dev_rdma_write_flags flags,
													uint32_t position);

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send
local_offset – Offset in bytes to fetch data in the local buffer
length – Number of bytes to send
imm – Immediate value uint32_t
flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum
position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_commit_*

Once all RDMA write, send or read requests have been enqueue in the RDMA queue, a synchronization point must be reached to consolidate and execute those requests. Only 1 CUDA thread can invoke this function at a time.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_commit_strong(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index)

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_commit_weak(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, uint32_t num_ops)

rdma – RDMA queue GPU handler
connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.
num_ops – Number of RDMA requests enqueued since the last commit

doca_gpu_dev_rdma_wait_all

After a commit, RDMA requests are executed by the network card as applications move forward doing other operations. If the application needs to verify all RDMA operations have been done by the network card, this "wait all" function can be used to wait for all previous posted operations. Only 1 CUDA thread can invoke this function at a time.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_wait_all(struct doca_gpu_dev_rdma *rdma, uint32_t *num_commits)

rdma – RDMA queue GPU handler
num_commits – Output parameter; the number of commit operations completed

Info

This function is optional, and it can be used to ensure all the RDMA Send/Write/Read operations have actually been executed before moving forward with the application.

doca_gpu_dev_rdma_recv_*

To receive data from an RDMA send, send with immediate, or write with immediate, the destination peer should post a receive operation. DOCA GPUNetIO RDMA receive operations must be done with a doca_gpu_dev_rdma_r handler. This handler can be obtained with the function doca_gpu_dev_rdma_get_recv.

Note

All receive operations must use this object.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_get_recv(struct doca_gpu_dev_rdma *rdma, struct doca_gpu_dev_rdma_r **rdma_r)

rdma – RDMA queue GPU handler
rdma_r – RDMA receive queue GPU handler

Even for the receive side, in this case, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_recv_strong(struct doca_gpu_dev_rdma_r *rdma_r,
													  struct doca_gpu_buf *recv_buf,
													  size_t recv_length,
													  uint64_t recv_offset,
													  const uint32_t flags_bitmask)

rdma_r – RDMA receive queue GPU handler
recv_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send
recv_length – Number of bytes to send
recv_offset – Offset in bytes to fetch data in the local buffer
flags_bitmask – Must be 0; reserved for future use

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_recv_weak(struct doca_gpu_dev_rdma_r *rdma_r,
													struct doca_gpu_buf *recv_buf,
													size_t recv_length,
													uint64_t recv_offset,
													const uint32_t flags_bitmask,
													uint32_t position);

rdma_r – RDMA receive queue GPU handler
recv_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send
recv_length – Number of bytes to send
recv_offset – Offset in bytes to fetch data in the local buffer
flags_bitmask - Must be 0; reserved for future use
position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_recv_commit_*

After posting several RDMA receive operations, a commit function must be invoked to activate the receive in the queue. Only 1 CUDA thread can invoke this function at a time.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_recv_commit_strong(struct doca_gpu_dev_rdma_r *rdma_r)

rdma_r – RDMA receive queue GPU handler

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_recv_commit_weak(struct doca_gpu_dev_rdma_r *rdma_r, uint32_t num_ops)

rdma_r – RDMA receive queue GPU handler
num_ops – Number of RDMA receive requests enqueued since the last commit

doca_gpu_dev_rdma_recv_wait_all

This function waits for the completion of all previously posted RDMA receive operation. Only 1 CUDA thread can invoke this function at a time. It works in blocking or non-blocking mode.

Copy
Copied!

            
             enum doca_gpu_dev_rdma_recv_wait_flags {
	DOCA_GPU_RDMA_RECV_WAIT_FLAG_NB = 0,  /**< Non-Blocking mode: the wait receive function doca_gpu_dev_rdma_recv_wait
					      					*  checks if the receive operation happened (data has been received)
					      					*  and exit from the function. If nothing has been received,
					      					*  the function doesn't block the execution.
					      					*/
	DOCA_GPU_RDMA_RECV_WAIT_FLAG_B  = 1, /**< Blocking mode: the wait receive function doca_gpu_dev_rdma_recv_wait
					     				   *  blocks the execution waiting for the receive operations to be executed.
					      				   */
};

Function:

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_rdma_recv_wait_all(struct doca_gpu_dev_rdma_r *rdma_r, const enum doca_gpu_dev_rdma_recv_wait_flags flags, uint32_t *num_ops, uint32_t *imm_val)

rdma_r – RDMA receive queue GPU handler
flags – receive flags
num_ops – Output parameter. Function reports number of completed operations.
imm_val – Output parameter. Application-provided buffer where the function can store received immediate values, if any (or 0xFFFFFFFF if no immediate value is received). If nullptr, the function ignores this parameter.

GPU Functions – DMA

This section provides a list of DOCA GPUNetIO functions that can be used on the GPU only within a CUDA kernel to execute DMA operations.

doca_gpu_dev_dma_memcpy

This function allows a CUDA kernel to trigger a DMA memory copy operation through the DMA GPU engine. There is no strong/weak mode here, the DMA is assuming the strong behavior by default.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_dma_memcpy(struct doca_gpu_dma *dma, struct doca_gpu_buf *src_buf, uint64_t src_offset, struct doca_gpu_buf *dst_buf, uint64_t dst_offset, size_t length);

dma – DMA queue GPU handler
src_buf – memcpy source buffer
src_offset – fetch data starting from this source buffer offset
dst_buf – memcpy destination buffer
dst_offset – copy data starting from this destination buffer offset
lenght – number of bytes to copy

doca_gpu_dev_dma_commit

After posting several DMA memory copies, a commit function must be invoked to execute the operations enqueued in the DMA queue. Only 1 CUDA thread can invoke this function at a time.

Copy
Copied!

            
            __device__ doca_error_t doca_gpu_dev_dma_commit(struct doca_gpu_dma *dma);

dma – DMA queue GPU handler

GPU Functions – Comch

This section lists the DOCA GPUNetIO functions that can be used within a CUDA kernel to execute DOCA Comch Producer or Consumer operations. For more information on the Producer/Consumer model, refer to the DOCA Comch documentation.

Note

Only one CUDA thread can invoke these functions per Producer or Consumer object.

Consumer: doca_dev_gpu_comch_consumer_post_recv

This function allows a DOCA Comch Consumer to post a buffer to receive a message from a remote producer.

Copy
Copied!

            
            __device__ doca_error_t doca_dev_gpu_comch_consumer_post_recv(
    struct doca_comch_gpu_consumer *consumer,
    uint32_t mkey,
    uintptr_t addr,
    uint32_t recv_len
);

Parameters:

consumer: The Consumer GPU handle.
mkey: The mkey associated with the buffer to be posted.
addr: The address of the buffer to receive the message into.
recv_len: The maximum message size that can be written to the buffer.

Consumer: doca_dev_gpu_comch_consumer_recv_wait

This function allows a DOCA Comch Consumer to wait for a message to be received from a remote producer.

Copy
Copied!

            
            __device__ doca_error_t doca_dev_gpu_comch_consumer_recv_wait(
    struct doca_comch_gpu_consumer *consumer,
    uint32_t *mkey,
    uintptr_t *addr,
    uint32_t *recv_len,
    uint8_t **imm_data,
    uint32_t *imm_data_len,
    const enum doca_gpu_dev_comch_wait_flags flags
);

Parameters:

consumer: The Consumer GPU handle.
mkey [out]: The mkey of the buffer that the message has been received into.
addr [out]: The address of the buffer that the message has been received into.
recv_len [out]: The length of the message that has been received.
imm_data: Pointer to a buffer for any immediate data included with the message.
imm_data_len [in/out]:
- Input: The maximum size of immediate data that can be received.
- Output: The actual size of immediate data received.
flags: A doca_gpu_dev_comch_wait_flags value (e.g., blocking or non-blocking).

Producer: doca_dev_gpu_comch_producer_send

This function allows a DOCA Comch Producer to send a message to a remote consumer.

Copy
Copied!

            
            __device__ doca_error_t doca_dev_gpu_comch_producer_send(
    struct doca_comch_gpu_producer *producer,
    uint32_t mkey,
    uintptr_t addr,
    uint32_t send_len,
    uint8_t *imm_data,
    uint32_t imm_data_len,
    uint32_t consumer_id,
    uint64_t user_msg_id
);

Paramters:

producer: The Producer GPU handle.
mkey: The mkey associated with the buffer to be sent.
addr: The address of the buffer to be sent.
send_len: The number of bytes to send from the buffer.
imm_data: Immediate data to include with the message.
imm_data_len: Size of the immediate data to include.
consumer_id: The ID of the remote consumer to send the message to.
user_msg_id: A user-defined ID that will be returned by doca_dev_gpu_comch_producer_poll when this message has been successfully sent.

Producer: doca_dev_gpu_comch_producer_poll

This function allows a DOCA Comch Producer to poll for the completion of a send operation.

Copy
Copied!

            
            __device__ doca_error_t doca_dev_gpu_comch_producer_poll(
    struct doca_comch_gpu_producer *producer,
    uint64_t *user_msg_id,
    const enum doca_gpu_dev_comch_wait_flags flags
);

Parameters:

producer: The Producer GPU handle.
user_msg_id [out]: The user_msg_id that was set in the corresponding doca_dev_gpu_comch_producer_send function.
flags: A doca_gpu_dev_comch_wait_flags value (e.g., blocking or non-blocking).

doca_gpu_dev_comch_wait_flags

This enum is used as the flags parameter in both the doca_gpu_dev_comch_consumer_recv_wait and doca_gpu_dev_comch_producer_poll functions. It determines how the function behaves if no operation has completed.

Copy
Copied!

            
            enum doca_gpu_dev_comch_wait_flags {
	DOCA_GPU_COMCH_WAIT_FLAG_NB = 0, /**< Non-Blocking mode: the wait function  checks if any of the operations
					                  * have been completed (data has been sent/received) and exit from the
                 					  * function. If nothing has been sent/received, the function doesn't block
					                  * the execution.
                 					  */
	DOCA_GPU_COMCH_WAIT_FLAG_B = 1,	 /**< Blocking mode: the wait function blocks the execution waiting for a
					                  * send/receive operations to be completed.
                					  */
};

GPUNetIO Samples

This section contains two samples that show how to enable simple GPUNetIO features. Be sure to correctly set the following environment variables:

Build the sample

Copy
Copied!

            
            export PATH=${PATH}:/usr/local/cuda/bin
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"
export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/usr/lib/pkgconfig:/opt/mellanox/grpc/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu/pkgconfigexport LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64:/opt/mellanox/gdrcopy/src:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu

Info

All the DOCA samples described in this section are governed under the BSD-3 software license agreement.

Note

Please ensure the arch of your GPU is included in the meson.build file before building the samples (e.g., sm_80 for Ampere, sm_89 for L40, sm_90 for H100, etc).

Ethernet Send Wait Time

The sample shows how to enable Accurate Send Scheduling (or wait-on-time) feature in the context of a GPUNetIO application. Accurate Send Scheduling is the ability of an NVIDIA NIC to send packets in the future according to application-provided timestamps.

Note

This feature is supported on ConnectX-7 and later .

This sample demonstrates how to send packets from the GPU using Accurate Send Scheduling by calling the high-level doca_gpu_dev_eth_txq_wait_send function with a BLOCK execution scope.

Info

This NVIDIA blog post offers an example for how this feature has been used in 5G networks.

Synchronizing Clocks

Before starting the sample, it is important to properly synchronize the CPU clock with the NIC clock. This way, timestamps provided by the system clock are synchronized with the time in the NIC.

For this purpose, at least the phc2sys service must be used. To install it on an Ubuntu system:

phc2sys

Copy
Copied!

            
            sudo apt install linuxptp

To start the phc2sys service properly, a config file must be created in /lib/systemd/system/phc2sys.service. Assuming the network interface is ens6f0 :

phc2sys

Copy
Copied!

            
            [Unit]
Description=Synchronize system clock or PTP hardware clock (PHC)
Documentation=man:phc2sys
 
[Service]
Restart=always
RestartSec=5s
Type=simple
ExecStart=/bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp$(ethtool -T ens6f0 | grep PTP | awk '{print $4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
 
[Install]
WantedBy=multi-user.target

Now phc2sys service can be started:

phc2sys

Copy
Copied!

            
            sudo systemctl stop systemd-timesyncd
sudo systemctl disable systemd-timesyncd
sudo systemctl daemon-reload
sudo systemctl start phc2sys.service

To check the status of phc2sys:

phc2sys

Copy
Copied!

            
            $ sudo systemctl status phc2sys.service

Output:

phc2sys

Copy
Copied!

            
            ● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
     Loaded: loaded (/lib/systemd/system/phc2sys.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-04-03 10:59:13 UTC; 2 days ago
       Docs: man:phc2sys
   Main PID: 337824 (sh)
      Tasks: 2 (limit: 303788)
     Memory: 560.0K
        CPU: 52min 8.199s
     CGroup: /system.slice/phc2sys.service
             ├─337824 /bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T enp23s0f1np1 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R >
             └─337829 /usr/sbin/phc2sys -s /dev/ptp3 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256
 
Apr 05 16:35:52 doca-vr-045 phc2sys[337829]: [457395.040] CLOCK_REALTIME rms    8 max   18 freq +110532 +/-  27 delay   770 +/-   3
Apr 05 16:35:53 doca-vr-045 phc2sys[337829]: [457396.071] CLOCK_REALTIME rms    8 max   20 freq +110513 +/-  30 delay   769 +/-   3
Apr 05 16:35:54 doca-vr-045 phc2sys[337829]: [457397.102] CLOCK_REALTIME rms    8 max   18 freq +110527 +/-  30 delay   769 +/-   3
Apr 05 16:35:55 doca-vr-045 phc2sys[337829]: [457398.130] CLOCK_REALTIME rms    8 max   18 freq +110517 +/-  31 delay   769 +/-   3
Apr 05 16:35:56 doca-vr-045 phc2sys[337829]: [457399.159] CLOCK_REALTIME rms    8 max   19 freq +110523 +/-  32 delay   770 +/-   3
Apr 05 16:35:57 doca-vr-045 phc2sys[337829]: [457400.191] CLOCK_REALTIME rms    8 max   20 freq +110528 +/-  33 delay   770 +/-   3
Apr 05 16:35:58 doca-vr-045 phc2sys[337829]: [457401.221] CLOCK_REALTIME rms    8 max   19 freq +110512 +/-  38 delay   770 +/-   3
Apr 05 16:35:59 doca-vr-045 phc2sys[337829]: [457402.253] CLOCK_REALTIME rms    9 max   20 freq +110538 +/-  47 delay   770 +/-   4
Apr 05 16:36:00 doca-vr-045 phc2sys[337829]: [457403.281] CLOCK_REALTIME rms    8 max   21 freq +110517 +/-  38 delay   769 +/-   3
Apr 05 16:36:01 doca-vr-045 phc2sys[337829]: [457404.311] CLOCK_REALTIME rms    8 max   17 freq +110526 +/-  26 delay   769 +/-   3
...

At this point, the system and NIC clocks are synchronized so timestamps provided by the CPU are correctly interpreted by the NIC.

Warning

The timestamps you get may not reflect the real time and day. To get that, you must properly set the ptp4l service with an external grand master on the system. Doing that is out of the scope of this sample.

Running the Sample

To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:

phc2sys

Copy
Copied!

            
            # Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_send_wait_time
meson build
ninja -C build

The sample sends 8 bursts of 32 raw Ethernet packets or 1kB to a dummy Ethernet address, 10:11:12:13:14:15, in a timed way. Program the NIC to send every t nanoseconds (command line option -t).

The following example programs a system with GPU PCIe address ca:00.0 and NIC PCIe address 17:00.0 to send 32 packets every 5 milliseconds:

Run

Copy
Copied!

            
            # Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ sudo ./build/doca_gpunetio_send_wait_time -n 17:00.0 -g ca:00.0 -t 5000000[09:22:54:165778][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:195][main] Starting the sample
[09:22:54:438260][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:224][main] Sample configuration:
		GPU ca:00.0
		NIC 17:00.0
		Timeout 5000000ns
EAL: Detected CPU lcores: 128
...
EAL: Probe PCI driver: mlx5_pci (15b3:a2d6) device: 0000:17:00.0 (socket 0)
[09:22:54:819996][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:607][gpunetio_send_wait_time] Wait on time supported mode: DPDK
EAL: Probe PCI driver: gpu_cuda (10de:20b5) device: 0000:ca:00.0 (socket 1)
[09:22:54:830212][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:252][create_tx_buf] Mapping send queue buffer (0x0x7f48e32a0000 size 262144B) with legacy nvidia-peermem mode
[09:22:54:832462][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:657][gpunetio_send_wait_time] Launching CUDA kernel to send packets
[09:22:54:842945][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:664][gpunetio_send_wait_time] Waiting 10 sec for 256 packets to be sent
[09:23:04:883309][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:684][gpunetio_send_wait_time] Sample finished successfully
[09:23:04:883339][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:239][main] Sample finished successfully

To verify that packets are actually sent at the right time, use a packet sniffer on the other side (e.g., tcpdump):

phc2sys

Copy
Copied!

            
            $ sudo tcpdump -i enp23s0f1np1 -A -s 64
 
17:12:23.480318 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
....
17:12:23.480368 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
# end of first burst of 32 packets, bump to +5ms
17:12:23.485321 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
...
17:12:23.485369 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
# end of second burst of 32 packets, bump to +5ms
17:12:23.490278 IP5 (invalid)
Sent from DOCA GPUNetIO...........................
...

The output should show a jump of approximately 5 milliseconds every 32 packets.

Note

tcpdump may increase latency in sniffing packets and reporting the receive timestamp, so the difference between bursts of 32 packets reported may be less than expected, especially with small interval times like 500 microseconds (-t 500000).

Ethernet Simple Receive

This sample application demonstrates the fundamental steps to build a DOCA GPUNetIO receiver application. It creates one queue for UDP packets and uses a single CUDA kernel to receive those packets from the GPU.

The sample uses the high-level doca_gpu_dev_eth_rxq_recv function, and its execution scope (thread, warp, or block) can be set at runtime using the -e command-line parameter.

Note

Invoking printf from a CUDA kernel is not good practice for release software as it slows down the kernel's overall execution. It should only be used for debugging. To enable packet info printing in this sample, the DOCA_GPUNETIO_SIMPLE_RECEIVE_DEBUG macro must be set to 1.

Build Instructions

Build the sample

Copy
Copied!

            
            # Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_receive
meson build
ninja -C build

Execution and Testing

This guide assumes a two-machine setup:

Receiver Machine: Runs the doca_gpunetio_simple_receive application.
Packet Generator Machine: Uses an application like nping to send UDP packets.

Packet Generator Machine Example

On the packet generator machine, use nping to send 10 UDP packets to the receiver's IP address (assumed to be 192.168.1.1 in this example).

Command:

nping generator

Copy
Copied!

            
            $ nping --udp -c 10 -p 2090 192.168.1.1 --data-length 1024 --delay 500ms

Output:

Copy
Copied!

            
            Starting Nping 0.7.80 ( https://nmap.org/nping ) at 2023-11-20 11:05 UTC
SENT (0.0018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (0.5018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.0025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.5025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.0032s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.5033s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.0040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.5040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.0047s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.5048s) UDP packet with 1024 bytes to 192.168.1.1:2090
 
Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
UDP packets sent: 10 | Rcvd: 0 | Lost: 10 (100.00%)
Nping done: 1 IP address pinged in 5.50 seconds

Receiver Machine Example

On the receiver machine, run the sample with the appropriate PCI addresses for the NIC and GPU. This example uses BLOCK scope (-e 2).

Command:

nping generator

Copy
Copied!

            
            # Ensure DOCA is in the LD_LIBRARY_PATH environment variable
$ sudo ./doca_gpunetio_simple_receive -n 9f:00.0 -g 8a:00.0 -e 2

Output:

nping generator

Copy
Copied!

            
            [2025-10-27 00:52:32:387590][3382972416][DOCA][INF][doca_log.cpp:633] DOCA version 3.2.0111
[2025-10-27 00:52:32:387627][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:198][main] Starting the sample
[2025-10-27 00:52:32:681807][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:240][main] Sample configuration:
        GPU 8a:00.0
        NIC 9f:00.0
        Shared QP exec scope Block
[2025-10-27 00:52:32:687128][3382972416][DOCA][WRN][engine_model.c:90] adapting queue depth to 128.
[2025-10-27 00:52:32:753758][3382972416][DOCA][WRN][hws_port.c:864] ARGUMENT_256B resource doens't exist, skip creating NAT64 actions
[2025-10-27 00:52:32:755527][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:479][create_rxq] Creating Sample Eth Rxq
[2025-10-27 00:52:32:755713][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:544][create_rxq] Mapping receive queue buffer (0x0x7eef8e000000 size 33554432B dmabuf fd 43) with dmabuf mode
[2025-10-27 00:52:32:795823][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:717][gpunetio_simple_receive] Launching CUDA kernel to receive packets
[2025-10-27 00:52:32:799403][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:721][gpunetio_simple_receive] Waiting for termination
 
# Type Ctrl+C to kill the sample
 
[2025-10-27 00:53:35:034046][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:61][signal_handler] Signal 2 received, preparing to exit!
Exiting from simple receive sample. Total number of received packets: 10
[2025-10-27 00:53:35:034322][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:395][destroy_rxq] Destroying Rxq
[2AcC-10-27 00:53:35:061065][3382972416][DOCA][WRN][hws_group_pool.c:85] group_pool has 1 used groups
[2025-10-27 00:53:35:845568][3382972416][DOCA][INF][gpunetio_simple_receive_sample.c:738][gpunetio_simple_receive] Sample finished successfully
[2025-10-27 00:53:35:845583][3382972416][DOCA][INF][gpunetio_simple_receive_main.c:259][main] Sample finished successfully

Ethernet Simple Send

This sample implements a simple GPU Ethernet packet generator that constantly sends a flow of raw Ethernet packets. It demonstrates two different implementation "flavors" for sending packets: one using the low-level API and one using the high-level API.

The behavior is controlled by the -q command-line option:

-q 0 (Low-level API): Disables the shared queue feature and executes using the low-level send functions.
-q 1 (High-level API): Enables the shared queue feature. When this is set, the execution scope (thread, warp, or block) can also be chosen using the -e option.

Build Instructions

DOCA Simple Receive

Copy
Copied!

            
            # Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_send
meson build
ninja -C build

Execution Example

Assuming a system with a NIC at PCIe address 9f:00.0 and a GPU at 8a:00.0, the following command launches the sample to send 1024-byte packets (-s 1024) using 1024 CUDA threads (-t 1024). It enables the high-level API (-q 1) with a BLOCK execution scope (-e 2).

Command:

DOCA Simple Receive

Copy
Copied!

            
            # Ensure DOCA is in the LD_LIBRARY_PATH environment variable
$ sudo ./doca_gpunetio_simple_send -n 9f:00.0 -g 8a:00.0 -s 1024 -t 1024 -q 1 -e 2

Output:

Copy
Copied!

            
            [2025-10-28 04:38:10:548387][961593344][DOCA][INF][doca_log.cpp:645] DOCA version 3.3.0010
[2025-10-28 04:38:10:548420][961593344][DOCA][INF][gpunetio_simple_send_main.c:363][main] Starting the sample
[2025-10-28 04:38:10:839382][961593344][DOCA][INF][gpunetio_simple_send_main.c:433][main] Sample configuration:
        GPU 8a:00.0
        NIC 9f:00.0
        Packet size 1024
        CUDA threads 1024
        CPU Proxy No
        Shared QP Yes
        Shared QP exec scope Block
 
[2025-10-28 04:38:10:911219][961593344][DOCA][INF][gpunetio_simple_send_sample.c:291][create_txq] Creating Sample Eth Txq
[2025-10-28 04:38:10:916434][961593344][DOCA][INF][gpunetio_simple_send_sample.c:429][create_txq] Mapping send queue buffer (0x0x7fd718800000 size 1048576B dmabuf fd 45) with dmabuf mode
[2025-10-28 04:38:10:917479][961593344][DOCA][INF][gpunetio_simple_send_sample.c:580][gpunetio_simple_send] Launching CUDA kernel to send packets.
[2025-10-28 04:38:10:920972][961593344][DOCA][INF][gpunetio_simple_send_sample.c:584][gpunetio_simple_send] Waiting for ctrl+c termination
 
# Type Ctrl+C to kill the sample
 
[2025-10-28 04:38:22:681160][961593344][DOCA][INF][gpunetio_simple_send_sample.c:67][signal_handler] Signal 2 received, preparing to exit!
[2025-10-28 04:38:22:681176][961593344][DOCA][INF][gpunetio_simple_send_sample.c:590][gpunetio_simple_send] Exiting from sample
[2025-10-28 04:38:22:681387][961593344][DOCA][INF][gpunetio_simple_send_sample.c:204][destroy_txq] Destroying Txq
[2025-10-28 04:38:22:990964][961593344][DOCA][INF][gpunetio_simple_send_sample.c:612][gpunetio_simple_send] Sample finished successfully
[2025-10-28 04:38:22:990980][961593344][DOCA][INF][gpunetio_simple_send_main.c:457][main] Sample finished successfully

Performance Verification

To verify that packets are being sent, you can check the traffic throughput on the same machine using the mlnx_perf command. (Assuming the NIC at 9f:00.0 has the interface name ens6f0np0):

nping generator

Copy
Copied!

            
            $ mlnx_perf -i ens6f0np0
      tx_vport_unicast_packets: 21,033,677
        tx_vport_unicast_bytes: 21,538,485,248 Bps   = 172,307.88 Mbps
                tx_packets_phy: 21,033,286
                  tx_bytes_phy: 21,622,256,208 Bps   = 172,978.4 Mbps
                tx_prio0_bytes: 21,617,084,940 Bps   = 172,936.67 Mbps
              tx_prio0_packets: 21,028,292
                          UP 0: 172,936.67           Mbps = 100.00%
                          UP 0: 21,028,292           Tran/sec = 100.00%

Note

Throughput will vary between different systems based on factors like the PCIe connection between the GPU and NIC, the NIC model, and the GPU model. This sample can be used to evaluate your system's send performance by varying the packet size and number of CUDA threads.

RDMA Client Server

This sample exhibits how to use the GPUNetIO RDMA API to receive and send/write with immediate using a single RDMA queue.

The server has a GPU buffer array A composed by GPU_BUF_NUM doca_gpu_buf elements, each 1kB in size. The client has two GPU buffer arrays, B and C, each composed by GPU_BUF_NUM doca_gpu_buf elements, each 512B in size.

The goal is for the client to fill a single server buffer of 1kB with two GPU buffers of 512B as illustrated in the following figure:

image-2024-4-17_12-29-48-version-1-modificationdate-1763548715807-api-v2.png

To show how to use RDMA write and send, even buffers are sent from the client with write immediate, while odd buffers are sent with send immediate. In both cases, the server must pre-post the RDMA receive operations.

For each buffer, the CUDA kernel code repeats the handshake:

image-2024-6-26_16-43-24-version-1-modificationdate-1763548706887-api-v2.png

Once all buffers are filled, the server double checks that all values are valid. The server output should be as follows:

DOCA RDMA Server side

Copy
Copied!

            
            # Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0
 
[14:11:43:000930][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:250][main] Starting the sample
...
[14:11:43:686610][1173110][DOCA][INF][rdma_common.c:91][oob_connection_server_setup] Listening for incoming connections
[14:11:45:681523][1173110][DOCA][INF][rdma_common.c:105][oob_connection_server_setup] Client connected at IP: 192.168.2.28 and port: 46274
...
[14:11:45:771807][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:644][rdma_write_server] Before launching CUDA kernel, buffer array A is:
[14:11:45:771822][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 0 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771837][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 1 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771851][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 2 -> offset 0: 1111 | offset 128: 1111
[14:11:45:771864][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 3 -> offset 0: 1111 | offset 128: 1111
RDMA Recv 2 ops completed with immediate values 0 and 1!
RDMA Recv 2 ops completed with immediate values 1 and 2!
RDMA Recv 2 ops completed with immediate values 2 and 3!
RDMA Recv 2 ops completed with immediate values 3 and 4!
[14:11:45:781561][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:671][rdma_write_server] After launching CUDA kernel, buffer array A is:
[14:11:45:781574][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 0 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781583][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 1 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781593][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 2 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781602][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 3 -> offset 0: 2222 | offset 128: 3333
[14:11:45:781640][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:294][main] Sample finished successfully

On the other side, assuming the server is at IP address 192.168.2.28, the client output should be as follows:

DOCA RDMA Client side

Copy
Copied!

            
            # Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
 
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0 -c 192.168.2.28
 
[16:08:22:335744][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:197][main] Starting the sample
...
[16:08:25:753316][160913][DOCA][INF][rdma_common.c:147][oob_connection_client_setup] Connected with server successfully
......
Client waiting on flag 7f6596735000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 0
Client waiting on flag 7f6596735001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 1
Client waiting on flag 7f6596735002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 2
Client waiting on flag 7f6596735003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 3
[16:08:25:853454][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:241][main] Sample finished successfully

Note

With RDMA, the network device must be specified by name (e.g., mlx5_0 ) instead of the PCIe address (as is the case for Ethernet).

It is also possible to enable the RDMA CM mode, establishing two connections with the same RDMA GPU handler. An example on the client side:

DOCA RDMA Client side with CM

Copy
Copied!

            
            # Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable
 
$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write
$ ./build/samples/doca_gpunetio_rdma_client_server_write -d mlx5_0 -gpu 17:00.0 -gid 3 -c 10.137.189.28 -cm --server-addr-type ipv4 --server-addr 192.168.2.28
 
[11:30:34:489781][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:461][main] Starting the sample
...
[11:30:35:038828][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:950][rdma_write_client] Client is waiting for a connection establishment
[11:30:35:082039][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:963][rdma_write_client] Client - Connection 1 is established
...
[11:30:35:095282][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1006][rdma_write_client] Establishing connection 2..
[11:30:35:097521][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1016][rdma_write_client] Client is waiting for a connection establishment
[11:30:35:102718][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1029][rdma_write_client] Client - Connection 2 is established
[11:30:35:102783][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1046][rdma_write_client] Client, terminate kernels
Client waiting on flag 7f16067b5000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 1
Client waiting on flag 7f16067b5001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 2
Client waiting on flag 7f16067b5002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 3
Client waiting on flag 7f16067b5003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 4
Client posted and completed 4 RDMA commits on connection 0. Waiting on the exit flag.
Client waiting on flag 7f16067b5000 for server to post RDMA Recvs
Thread 0 post rdma write imm 0
Thread 1 post rdma write imm 1
Client waiting on flag 7f16067b5001 for server to post RDMA Recvs
Thread 0 post rdma send imm 1
Thread 1 post rdma send imm 2
Client waiting on flag 7f16067b5002 for server to post RDMA Recvs
Thread 0 post rdma write imm 2
Thread 1 post rdma write imm 3
Client waiting on flag 7f16067b5003 for server to post RDMA Recvs
Thread 0 post rdma send imm 3
Thread 1 post rdma send imm 4
Client posted and completed 4 RDMA commits on connection 1. Waiting on the exit flag.
[11:30:35:122448][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:512][main] Sample finished successfully

In case of RDMA CM, the command option -cm must be specified on the server side.

Warning

Printing from a CUDA kernel is not recommended for performance. It may make sense for debugging purposes and for simple samples like this one.

Verbs Samples

The doca_gpunetio_verbs_* examples demonstrate how to use the GPUNetIO Verbs API in various scenarios. These samples require a client-server setup, with the client needing the -c <server IP> parameter.

The following parameters are supported by all Verbs samples:

-n: Network card handler type (0: AUTO, 1: CPU Proxy, 2: GPU DB). Default is 0 (AUTO).
-e: Execution mode for shared QP (0: per-thread, 1: per-warp). Default is 0 (per-thread).
-d: Network card device name.
-g: GPU device PCIe address.
-gid: GID index interface (optional).
-i: Number of iterations (optional).
-t: Number of CUDA threads (optional).
-x : Enable Receive CQE Inline feature. Valid only in the doca_gpunetio_verbs_twosided_bw sample.

For the examples in this guide, it is assumed that the GPU PCIe address is 8A:00.0 and the network card PCIe address is specified accordingly.

Note

If the samples are running on a RoCE connection (e.g. the ConnectX/BlueField is set in Ethernet mode instead of Infiniband mode) you may get this (or similar) error: FW failed to modify object, status=BAD_PARAM_ERR (0x3), syndrome=0x1f3b5d .

To fix it, please remove the doca_verbs_ah_attr_set_dlid function from the connect_verbs_qp function in the samples/doca_gpunetio/verbs_common.c file.

Bandwidth Samples

The samples ending with _bw measure the bandwidth of specific GPUNetIO Verbs API functions. In these samples, the client prepares and sends data from a CUDA kernel, while the server waits on the CPU to receive the data, validates it upon receiving a Ctrl+C signal, and reports the outcome.

Key Characteristics

The client outputs the MB/s achieved for preparing and sending messages of various sizes.
The server outputs a message indicating the execution outcome.
All bandwidth samples support the previously listed command-line parameters.

Simplifying QP/CQ/UAR Creation

To simplify the integration of DOCA Verbs with GPUNetIO, high-level functions like doca_gpu_verbs_create_qp_hl() and doca_gpu_verbs_create_qp_group_hl() are provided in samples/doca_gpunetio/verbs_high_level.cpp. These functions encapsulate the necessary steps for creating QP/CQ/UAR, making it easier for developers to combine DOCA Verbs and GPUNetIO in their applications.

doca_gpunetio_verbs_write_bw

The doca_gpunetio_verbs_write_bw test measures the bandwidth of RDMA Write operations using the GPUNetIO Verbs API. It launches a CUDA kernel with 1 block and 512 threads by default, where all threads post RDMA Write WQEs in different positions and the last thread submits them. The test then polls the CQE corresponding to the last WQE to ensure all previous WQEs have been executed correctly.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_bw

The doca_gpunetio_verbs_put_bw test measures the bandwidth of RDMA Write operations (referred to as "Put") using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads. The test can also measure individual function latencies by setting the KERNEL_DEBUG_TIMES macro to 1.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_signal_bw

The doca_gpunetio_verbs_put_signal_bw test measures the bandwidth of Put + Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

Note

The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:

Copy
Copied!

            
            do {
    final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
    doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));

doca_gpunetio_verbs_put_counter_bw

The doca_gpunetio_verbs_put_counter_bw test measures the bandwidth of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

Note

The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be :

Copy
Copied!

            
            do {
    final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
    doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));

doca_gpunetio_verbs_twosided_bw

The doca_gpunetio_verbs_twosided_bw test measures the bandwidth of client-server data exchange via Send/Recv operations using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_get_bw

This sample application measures the bandwidth of one-sided RDMA Read ("Get") operations. It uses the GPUNetIO Verbs API with the shared QP feature.

The application launches a CUDA kernel configured with 2 blocks, each containing 256 threads.

Example usage:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_get_bw -g 8A:00.0 -d mlx5_0

Client (connects to the server's IP, e.g., 192.168.1.63):

Copy
Copied!

            
            doca_gpunetio_verbs_get_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

Info

Additional CLI options can be added.

Latency Samples

The samples ending with _lat measure the latency of specific GPUNetIO Verbs API functions by performing a ping-pong exchange between client and server. These tests launch a CUDA kernel with a single CUDA thread and do not support the -e and -t command-line options.

Both client and server output the round-trip time (RTT) latency (half and full) in microseconds for preparing and exchanging messages of different sizes. The server also outputs a message indicating the execution outcome.

doca_gpunetio_verbs_write_lat

The doca_gpunetio_verbs_write_lat test measures the latency of RDMA Write operations using the GPUNetIO Verbs API without the shared QP feature. It is similar to perftest.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_signal_lat

The doca_gpunetio_verbs_put_signal_lat test measures the latency of Put and Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It uses high-level API functions from doca_gpunetio_dev_verbs_onesided.cuh and launches a CUDA kernel with a single CUDA thread.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0

Client (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_counter_lat

The doca_gpunetio_verbs_put_counter_lat test measures the latency of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It utilizes high-level API functions from doca_gpunetio_dev_verbs_counter.cuh and the Core Direct counter feature, even with a single CUDA thread.

Example command lines:

Server:

Copy
Copied!

            
            doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0

Client: (additional command-line options can be added):

Copy
Copied!

            
            doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

GPU DMA Copy

This sample exhibits how to use the DOCA DMA and DOCA GPUNetIO libraries to DMA copy a memory buffer from the CPU to the GPU (with DOCA DMA CPU functions) and from the GPU to the CPU (with DOCA GPUNetIO DMA device functions) from a CUDA kernel. This sample requires a DPU as it uses the DMA engine on it.

DOCA RDMA Client side

Copy
Copied!

            
            $ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_dma_memcpy
 
# Build the sample and then execute
 
$ ./build/doca_gpunetio_dma_memcpy -g 17:00.0 -n ca:00.0
[15:44:04:189462][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:164][main] Starting the sample
EAL: Detected CPU lcores: 64
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Selected IOVA mode 'VA'
EAL: No free 2048 kB hugepages reported on node 0
EAL: No free 2048 kB hugepages reported on node 1
EAL: VFIO support initialized
TELEMETRY: No legacy callbacks, legacy socket not created
EAL: Probe PCI driver: gpu_cuda (10de:2331) device: 0000:17:00.0 (socket 0)
[15:44:04:857251][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:211][init_sample_mem_objs] The CPU source buffer value to be copied to GPU memory: This is a sample piece of text from CPU
[15:44:04:857359][862197][DOCA][WRN][doca_mmap.cpp:1743][doca_mmap_set_memrange] Mmap 0x55aec6206140: Memory range isn't cache-line aligned - addr=0x55aec52ceb10. For best performance align address to 64B
[15:44:04:858839][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:158][init_sample_mem_objs] The GPU source buffer value to be copied to CPU memory: This is a sample piece of text from GPU
[15:44:04:921702][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:570][submit_dma_memcpy_task] Success, DMA memcpy job done successfully
CUDA KERNEL INFO: The GPU destination buffer value after the memcpy: This is a sample piece of text from CPU 
CPU received message from GPU: This is a sample piece of text from GPU
[15:44:04:930087][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:364][gpu_dma_cleanup] Cleanup DMA ctx with GPU data path
[15:44:04:932658][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:404][gpu_dma_cleanup] Cleanup DMA ctx with CPU data path
[15:44:04:954156][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:197][main] Sample finished successfully

On This Page

PCIe set

PCIe check

IOMMU

Launch nvidia-peermem

GPU Configuration

GPU Configuration

BAR1 mapping

BAR1 mapping

Build the sample

phc2sys

phc2sys

phc2sys

phc2sys

phc2sys

phc2sys

Run

phc2sys

Build the sample

nping generator

nping generator

nping generator

DOCA Simple Receive

DOCA Simple Receive

nping generator

DOCA RDMA Server side

DOCA RDMA Client side