DOCA Documentation v3.1.0


DOCA GPUNetIO

This document provides an overview and configuration instructions for the DOCA GPUNetIO API.

Note

The quality status of DOCA libraries is listed here.

DOCA GPUNetIO enables real-time GPU processing for network packets, making it ideal for application domains such as:

  • Signal processing

  • Network security

  • Information gathering

  • Input reconstruction

Traditional approaches often rely on a CPU-centric model, where the CPU coordinates with the NIC to receive packets in GPU memory using GPUDirect RDMA. Afterward, the CPU notifies a CUDA kernel on the GPU to process the packets. However, on low-power platforms, this CPU dependency can become a bottleneck, limiting GPU performance and increasing latency.

DOCA GPUNetIO addresses this challenge by offering a GPU-centric solution that removes the CPU from the critical path. By combining multiple NVIDIA technologies, it provides a highly efficient and scalable method for network packet processing.

Technologies integrated with DOCA GPUNetIO:

  • GPUDirect RDMA – Enables direct packet transfer between the NIC and GPU memory, eliminating unnecessary memory copies

  • GPUDirect Async Kernel-Initiated (GDAKI) – Allows CUDA kernels to control network operations without CPU intervention

    • IBGDA specifically for the RDMA protocol

  • GDRCopy Library – Allows the CPU to access GPU memory directly

  • NVIDIA BlueField DMA Engine – Supports GPU-triggered memory copies

The following is an example diagram of a CPU-centric approach:

[Figure: CPU-centric approach]

The following is an example diagram of a GPU-centric approach:

[Figure: GPU-centric approach]

Key features of DOCA GPUNetIO include:

  • GPUDirect Async Kernel-Initiated (GDAKI)

    • GDAKI network communications – a GPU CUDA kernel can control network communications to send or receive data

      • GPU can control Ethernet communications (Ethernet/IP/UDP/TCP/ICMP)

      • GPU can control RDMA communications (InfiniBand or RoCE are supported)

      • CPU intervention is unnecessary in the application critical path

  • GPUDirect RDMA

    • Enables direct data transfers to and from GPU memory without CPU staging copies.

  • DMA Engine Control

    • CUDA kernels can initiate memory copies using BlueField's DMA engine

  • Semaphores for Low-Latency Communication

    • Supports efficient message passing between CUDA kernels or between CUDA kernels and CPU threads

  • Smart Memory Allocation

    • Allocates aligned GPU memory buffers, optimizing memory access

    • GDRCopy library to allocate a GPU memory buffer accessible from the CPU

  • Accurate Send Scheduling

    • Provides precise control over Ethernet packet transmission based on user-defined timestamps.

NVIDIA applications that use DOCA GPUNetIO include:

  • Aerial 5G SDK – For ultra-low latency 5G network operations

  • NIXL – NVIDIA Inference Xfer Library (NIXL) is targeted for accelerating point-to-point communications in AI inference frameworks (e.g., NVIDIA Dynamo)

  • Morpheus – Enhancing cybersecurity through accelerated packet inspection

  • Holoscan Advanced Network Operator – Powering real-time data processing in edge AI environments

For more information about DOCA GPUNetIO, refer to the related NVIDIA blog posts.

Changes in 3.1

  • New set of DOCA GPUNetIO APIs on top of DOCA Verbs for GDAKI (IBGDA) over InfiniBand or RoCE.

  • The GPUNetIO Verbs CUDA functions are released as inline functions in open-source CUDA header files.

  • New samples have been introduced to test the new GPUNetIO Verbs functions and features.

  • Documentation has been reworked and improved. DPU converged card documentation has been removed.

DOCA GPUNetIO is available in the DOCA for Host "All" package for all DOCA-supported operating systems at the DOCA downloads portal. DOCA GPUNetIO components depend on CUDA: CUDA 12.8 is required if the application links against the GPUNetIO CUDA functions in the static library (Ethernet, RDMA, and DMA); otherwise, any CUDA 12.2 or newer is sufficient for applications that only use the CUDA functions in the GPUNetIO Verbs header files.

To install DOCA GPUNetIO components:

  • For Ubuntu/Debian:

    apt install doca-all doca-sdk-gpunetio libdoca-sdk-gpunetio-dev

  • For RHEL:

    yum install doca-all doca-sdk-gpunetio doca-sdk-gpunetio-devel

Note

To achieve the best performance, when building any DOCA GPUNetIO sample or application, set the buildtype to release instead of debug in the meson.build file.

To run a DOCA GPUNetIO application, the system must be configured with both a GPU and a NIC (either ConnectX or BlueField), connected to the system via PCIe.

[Figure: system with a GPU and a NIC connected via PCIe]

The system's internal hardware topology should be GPUDirect-RDMA-friendly to maximize the internal throughput between the GPU and the NIC.

To verify the type of connection between the GPU and NIC:

$ nvidia-smi topo -m
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    12-23,36-47     1               N/A
NIC0    NODE     X      PIX
NIC1    NODE    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

To maximize throughput between the GPU and NIC, the system should have a PIX (or PXB) topology with a dedicated PCIe connection. A PHB topology is still acceptable if the GPU and NIC are on the same PCIe Host Bridge and NUMA node, although performance may vary depending on the platform. For optimal performance, it's recommended to avoid NODE and SYS topologies, as they may negatively impact performance despite the application remaining functional.

Note

DOCA GPUNetIO has been tested with bare-metal systems and Docker containers. However, it is not supported in virtualized environments, such as those using KVM.

NIC Configuration

ConnectX NIC

Note

Ensure the NVIDIA ConnectX firmware is compatible with the current DOCA release. NVIDIA recommends using ConnectX-6 Dx or later adapters.

  1. Start MST:

    $ sudo mst start

  2. Check MST status:

    $ sudo mst status -v

    Example output:

    MST modules:
    ------------
        MST PCI module is not loaded
        MST PCI configuration module loaded

    PCI devices:
    ------------
    DEVICE_TYPE           MST                          PCI       RDMA     NET          NUMA
    ConnectX6DX(rev:0)    /dev/mst/mt4125_pciconf0.1   b5:00.1   mlx5_1   net-ens6f1   0
    ConnectX6DX(rev:0)    /dev/mst/mt4125_pciconf0     b5:00.0   mlx5_0   net-ens6f0   0

  3. Configure ConnectX NIC:

    • For Ethernet transport, run the following commands, replacing <mst_device> with the actual MST device name (e.g., /dev/mst/mt4125_pciconf0):

      mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=1 KEEP_ETH_LINK_UP_P2=1 KEEP_IB_LINK_UP_P1=0 KEEP_IB_LINK_UP_P2=0

      # This is required only if the application uses the Accurate Send Scheduling feature
      mlxconfig -d <mst_device> --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

      Info

      This example assumes a dual-port adapter. For single-port adapters, only the P1 options apply.

    • For InfiniBand transport, run:

      mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=0 KEEP_ETH_LINK_UP_P2=0 KEEP_IB_LINK_UP_P1=1 KEEP_IB_LINK_UP_P2=1

      # The Accurate Send Scheduling feature can't be used with InfiniBand

      Info

      This example assumes a dual-port adapter. For single-port adapters, only the P1 options apply.

  4. Perform a cold reboot to apply the changes:

    ipmitool power cycle

BlueField NIC

To use NVIDIA BlueField-2 or BlueField-3 with DOCA GPUNetIO, the DPU must be in NIC mode to expose the internal ConnectX to the host application.

  1. Start MST:

    $ sudo mst start

  2. Check MST status:

    $ sudo mst status -v

    Example output:

    MST modules:
    ------------
        MST PCI module is not loaded
        MST PCI configuration module loaded

    PCI devices:
    ------------
    DEVICE_TYPE          MST                           PCI       RDMA     NET             NUMA
    BlueField3(rev:1)    /dev/mst/mt41692_pciconf0.1   9f:00.1   mlx5_1   net-ens6f1np1   1
    BlueField3(rev:1)    /dev/mst/mt41692_pciconf0     9f:00.0   mlx5_0   net-ens6f0np0   1

  3. Configure BlueField NIC:

    • For Ethernet transport:

      sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=2 LINK_TYPE_P2=2 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED

      # This is required only if the application uses the Accurate Send Scheduling feature
      sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

    • For InfiniBand transport:

      sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=1 LINK_TYPE_P2=1 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED

      # The Accurate Send Scheduling feature can't be used with InfiniBand

  4. Perform a cold reboot:

    ipmitool power cycle

  5. Example verification command for Ethernet:

    sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q LINK_TYPE_P1 LINK_TYPE_P2 INTERNAL_CPU_MODEL INTERNAL_CPU_PAGE_SUPPLIER INTERNAL_CPU_ESWITCH_MANAGER INTERNAL_CPU_IB_VPORT0 INTERNAL_CPU_OFFLOAD_ENGINE ACCURATE_TX_SCHEDULER REAL_TIME_CLOCK_ENABLE

    Example output (Ethernet):

    LINK_TYPE_P1                    ETH(2)
    LINK_TYPE_P2                    ETH(2)
    INTERNAL_CPU_MODEL              EMBEDDED_CPU(1)
    INTERNAL_CPU_PAGE_SUPPLIER      EXT_HOST_PF(1)
    INTERNAL_CPU_ESWITCH_MANAGER    EXT_HOST_PF(1)
    INTERNAL_CPU_IB_VPORT0          EXT_HOST_PF(1)
    INTERNAL_CPU_OFFLOAD_ENGINE     DISABLED(1)
    ACCURATE_TX_SCHEDULER           True(1)
    REAL_TIME_CLOCK_ENABLE          True(1)

PCIe Configuration

On some x86 systems, the Access Control Services (ACS) must be disabled to ensure direct communication between the NIC and GPU, whether they reside on the same converged accelerator DPU or on different PCIe slots in the system. The recommended solution is to disable ACS control via BIOS (e.g., Supermicro or HPE) on your PCIe bridge. Alternatively, it is also possible to disable it via command line, but it may not be as effective as the BIOS option.

The following lspci -tvvv output illustrates a typical system topology:

$ lspci -tvvv
...
 +-[0000:b0]-+-00.0  Intel Corporation Device 09a2
 |           +-00.1  Intel Corporation Device 09a4
 |           +-00.2  Intel Corporation Device 09a3
 |           +-00.4  Intel Corporation Device 0998
 |           \-02.0-[b1-b6]----00.0-[b2-b6]--+-00.0-[b3]--+-00.0  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                           |            +-00.1  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
 |                                           |            \-00.2  Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
 |                                           \-01.0-[b4-b6]----00.0-[b5-b6]----08.0-[b6]----00.0  NVIDIA Corporation Device 20b8

The PCIe switch address to consider is b2:00.0 (the entry point of the DPU). All ACSCtl fields must show a negative value (a trailing "-"):

PCIe set

setpci -s b2:00.0 ECAP_ACS+0x6.w=0000

To verify that the setting has been applied correctly:

PCIe check

$ sudo lspci -s b2:00.0 -vvvv | grep -i ACSCtl
    ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

Please refer to this page and this page for more information.

If the application still does not report any received packets, try disabling the IOMMU. On some systems, this can be done from the BIOS by locating the VT-d or IOMMU setting in the NorthBridge configuration, changing it to Disabled, and saving. The system may also require adding intel_iommu=off or amd_iommu=off to the kernel options. This can be done through the GRUB command line as follows:

IOMMU

$ sudo vim /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="iommu=off intel_iommu=off <more options>"
$ sudo update-grub
$ sudo reboot


GPU Configuration

It is recommended to enable NVIDIA driver persistence mode to decrease the initial application latency: nvidia-smi -pm 1.

GDRCopy Installation

To enable direct CPU access to GPU memory without using CUDA APIs, DOCA requires the GDRCopy kernel module and library.

  1. Install necessary packages:

    sudo apt install -y check kmod

  2. Clone the GDRCopy repository:

    git clone https://github.com/NVIDIA/gdrcopy.git /opt/mellanox/gdrcopy

  3. Build GDRCopy:

    cd /opt/mellanox/gdrcopy && make

  4. Load the GDRCopy kernel module:

    ./insmod.sh

  5. Check if the gdrdrv and nvidia-peermem modules are loaded:

    lsmod | egrep gdrdrv

    Example output:

    gdrdrv                 24576  0
    nvidia              55726080  4 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset

  6. Export the GDRCopy library path:

    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mellanox/gdrcopy/src

  7. Ensure CUDA library paths are in the environment variables:

    export PATH="/usr/local/cuda/bin:${PATH}"
    export LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
    export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"

Note

GDRCopy is optional. If not installed, DOCA GPUNetIO cannot allocate memory using the DOCA_GPU_MEM_TYPE_GPU_CPU flag. If GDRCopy is not detected, DOCA GPUNetIO will log warning messages.

If GDRCopy is not required for your application, you can safely ignore the related warning messages. To use GDRCopy, ensure its installation path is included in the LD_LIBRARY_PATH environment variable or specified using the GDRCOPY_PATH_L environment variable.


Enabling NIC-GPU Memory Interaction

To enable the NIC to send and receive packets using GPU memory, load the NVIDIA kernel module nvidia-peermem, typically included with the CUDA Toolkit installation.

Loading nvidia-peermem

This method is considered the legacy mode for mapping buffers.

Launch nvidia-peermem

sudo modprobe nvidia-peermem


Using dmabuf

DOCA provides an alternative method to map GPU memory using dmabuf. The prerequisites for this approach are:

  • Linux Kernel version 6.2 or later

  • libibverbs version 1.14.44 or later

  • CUDA Toolkit:

    • Version 12.5 or older: installed with the -m=kernel-open flag (implying open-source NVIDIA driver mode)

    • Version 12.6 or newer: open kernel mode is enabled by default

Note

Using DOCA with kernel 6.2 to enable dmabuf is currently experimental.


Example Code

The following code snippet demonstrates how to use dmabuf for GPU memory mapping:

GPU Configuration

/* Get the dmabuf file descriptor for the GPU memory buffer from CUDA */
result = doca_gpu_dmabuf_fd(gpu_dev, gpu_buffer_addr, gpu_buffer_size, &(dmabuf_fd));
if (result != DOCA_SUCCESS) {
    /* Fallback to the nvidia-peermem legacy method if dmabuf fails */
    doca_mmap_set_memrange(gpu_buffer_mmap, gpu_buffer_addr, gpu_buffer_size);
} else {
    /* Create the DOCA mmap using dmabuf */
    doca_mmap_set_dmabuf_memrange(gpu_buffer_mmap, dmabuf_fd, gpu_buffer_addr, 0, gpu_buffer_size);
}


Handling dmabuf Mapping Failure

If doca_gpu_dmabuf_fd fails, it likely indicates that the NVIDIA driver is not in open-source mode. Upon calling doca_mmap_start, DOCA attempts to map the GPU memory buffer using dmabuf. If this fails, it falls back to the legacy nvidia-peermem method.

In case of dmabuf mapping failure, the following warning message is logged:

GPU Configuration

[DOCA][WRN][linux_devx_adapter.cpp:374] devx adapter 0x5566a16018e0: Registration using dmabuf is not supported, falling back to legacy registration

If your application can rely on nvidia-peermem and does not require dmabuf, you can ignore this warning message.

GPU BAR1 Size

Every time a GPU buffer is mapped to the NIC (e.g., buffers associated with send or receive queues), a portion of the GPU BAR1 mapping space is used. Therefore, it is important to check that the BAR1 mapping is large enough to hold all the bytes the DOCA GPUNetIO application is trying to map. To verify the BAR1 mapping space of a GPU you can use nvidia-smi:

BAR1 mapping

$ nvidia-smi -q

==============NVSMI LOG==============
.....
Attached GPUs                             : 1
GPU 00000000:CA:00.0
    Product Name                          : NVIDIA A100 80GB PCIe
    Product Architecture                  : Ampere
    Persistence Mode                      : Enabled
    .....
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB

By default, some GPUs (e.g. RTX models) may have a very small BAR1 size:

BAR1 mapping

$ nvidia-smi -q | grep -i bar -A 3
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 6 MiB
        Free                              : 250 MiB

If the BAR1 size is not sufficient, DOCA GPUNetIO applications may exit with errors because DOCA mmap fails to map the GPU memory buffers to the NIC (e.g., Failed to start mmap DOCA Driver call failure). To overcome this issue, the GPU BAR1 must be increased from the BIOS; the system must have the "Resizable BAR" option enabled. For further information, refer to this NVIDIA forum post.

Running without Root Privileges

All DOCA GPUNetIO samples and applications using Ethernet rely on DOCA Flow. Therefore, they must be executed with sudo or root privileges.

However, RDMA and DMA samples can be run without sudo privileges if a specific option is enabled in the NVIDIA driver. To do this, follow these steps:

  1. Create a configuration file for the NVIDIA driver:

    cat <<EOF | sudo tee /etc/modprobe.d/nvidia.conf
    options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
    EOF

  2. Perform a cold reboot to ensure the changes take effect.

  3. Verify that the configuration has been applied using the following command:

    $ grep RegistryDwords /proc/driver/nvidia/params

    You should see the following output confirming the setting:

    RegistryDwords: "PeerMappingOverride=1;"

A GPU packet processing network application can be split into two fundamental phases:

  • Configuration phase on the CPU (devices configuration, memory allocation, launch of CUDA kernels, etc.)

  • Data path phase where GPU and NIC interact to exercise their functions

DOCA GPUNetIO provides different building blocks, some of them in combination with the DOCA Ethernet, DOCA RDMA or DOCA DMA library, to create a full pipeline running entirely on the GPU.

During the setup phase on the CPU, applications must:

  1. Prepare all the objects on the CPU.

  2. Export a GPU handler for them.

  3. Launch a CUDA kernel passing the object's GPU handler to work with the object during the data path.

For this reason, DOCA GPUNetIO is composed of two libraries:

  • libdoca_gpunetio with functions invoked by CPU to prepare the GPU, allocate memory and objects

  • libdoca_gpunetio_device with functions invoked by GPU within CUDA kernels during the data path

Note

The pkgconfig file for the DOCA GPUNetIO shared library is doca-gpunetio.pc. However, there is no pkgconfig file for the DOCA GPUNetIO CUDA device's static library /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio_device.a, so it must be explicitly linked to the CUDA application if DOCA GPUNetIO CUDA device functions are required.

The following diagram presents the typical flow:

[Figure: typical DOCA GPUNetIO configuration and data path flow]

DOCA GPUNetIO provides GPU functions to control various objects for different transports/protocols created with other DOCA libraries. This section explains the correlation between DOCA GPUNetIO and other DOCA libraries.

Ethernet GDAKI Communications

To enable GPU communications on Ethernet transport, the application requires DOCA GPUNetIO, DOCA Ethernet, and DOCA Flow libraries.

Initial CPU Configuration Phase

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA Ethernet to:

    • Create send and/or receive queue handlers.

    • Set queue handlers' data path on the GPU.

    • Export a GPU handler representing those queues.

  4. Use DOCA Flow to assign flow steering rules to each DOCA Ethernet receive queue, defining the type of packets the NIC should place in that queue.

Data Path Phase

After completing the configuration phase, launch a CUDA Kernel, passing the GPU handlers for Ethernet queues as input arguments. This allows DOCA GPUNetIO CUDA device functions to operate within the CUDA Kernel.

Relevant CUDA Device Functions

For Ethernet communications, use functions defined in doca_gpunetio_dev_eth_rxq.cuh and doca_gpunetio_dev_eth_txq.cuh header files, starting with doca_gpu_dev_eth_*.

Example Use Cases

Refer to the DOCA GPU Packet Processing Application Guide (doca_gpu_packet_processing) and samples (doca_gpunetio_simple_send, doca_gpunetio_simple_receive, doca_gpunetio_send_wait_time) for examples of Ethernet GPU communications in different scenarios.

Tip

For a deeper understanding of Ethernet send and receive structures, objects, and functions, refer to DOCA Ethernet documentation.

An example diagram when multiple queues and/or semaphores are used to receive Ethernet traffic:

[Figure: multiple receive queues and semaphores used to receive Ethernet traffic]

Receiving and dispatching packets to another CUDA kernel is not required. A simpler scenario can have a single CUDA kernel receiving and processing packets:

[Figure: single CUDA kernel receiving and processing packets]

RDMA Verbs GDAKI Communications (IBGDA)

DOCA GPUNetIO provides GPU data path functions for objects created with DOCA RDMA and DOCA Verbs libraries, enabling GPU communications over RDMA transport protocols (IB or RoCE).

DOCA GPUNetIO and DOCA RDMA

DOCA RDMA is a high-level library that abstracts most low-level details of mlx5 and IBVerbs. The GPUNetIO CUDA data path functions follow a similar high-level API approach.

Key Characteristics

  • High-level API for generic RDMA operations (Write, Send, Read, Recv) posting and polling.

  • No shared queue management; applications must manage simultaneous access to queues from different CUDA threads.

  • Closed-source CUDA static library.

  • Suitable for simple GDAKI applications performing basic RDMA operations.

  • Requires less knowledge of IBVerbs, rdma-core, and mlx5 details.

Configuration and Usage

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA RDMA to:

    • Create send and/or receive queue handlers.

    • Set queue handlers' data path on the GPU.

    • Export a GPU handler representing those queues.

After configuration, launch a CUDA Kernel, passing the GPU handlers for RDMA queues as input arguments.

Relevant CUDA Device Functions

Use functions defined in doca_gpunetio_dev_rdma.cuh, starting with doca_gpu_dev_rdma_*, for RDMA communications in CUDA kernels.

Example Use Cases

Refer to the sample doca_gpunetio_rdma_client_server_write for examples of GPUNetIO RDMA functions.

Tip

For a deeper understanding of RDMA operations, refer to the DOCA RDMA documentation.

DOCA GPUNetIO and DOCA Verbs

DOCA Verbs provides a lower-level interface for RDMA communications, allowing the creation and manipulation of low-level mlx5 and IBVerbs objects.

Key Characteristics

  • Low-level API for manipulating IBVerbs fundamental elements.

  • Shared QP: a single QP can be safely accessed concurrently by different CUDA threads or warps.

  • Combined operations: the library provides building blocks for concatenating multiple operations.

  • CPU proxy mode: a fallback mechanism for systems where direct DoorBell ringing is not possible.

  • Open-source CUDA header library with inline functions.

  • Suitable for complex GDAKI applications requiring advanced RDMA operations.

Configuration and Usage

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA Verbs to:

    • Create send and/or receive queue handlers.

    • Set queue handlers' data path on the GPU.

    • Export a GPU handler representing those queues.

After configuration, launch a CUDA Kernel, passing the GPU handlers for Verbs queues as input arguments.

Relevant CUDA Device Functions

Use functions defined and implemented inline in various header files starting with doca_gpunetio_dev_verbs_* for Verbs communications in CUDA kernels.

Example Use Cases

Refer to samples doca_gpunetio_verbs_* for examples of GPUNetIO Verbs functions.

Tip

For a deeper understanding of Verbs operations, refer to the DOCA Verbs documentation.

The verbs_high_level.cpp file provides high-level functions like doca_gpu_verbs_create_qp_hl() and doca_gpu_verbs_create_qp_group_hl() to simplify the integration of DOCA Verbs and GPUNetIO.

Warning

The GPUNetIO Verbs APIs are experimental. Please report any issues encountered during use to help improve the code's quality and robustness.

DMA GDAKI Memory Copies

To enable GPU-triggered memory copies using the DMA engine, an application requires DOCA GPUNetIO and DOCA DMA libraries.

Initial CPU Configuration Phase

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA DMA to:

    • Create DMA queue handlers.

    • Set queue handlers' data path on the GPU.

    • Export a GPU handler representing those queues.

Data Path Phase

After completing the configuration phase, launch a CUDA Kernel, passing the GPU handlers for DMA queues as input arguments. This enables DOCA GPUNetIO CUDA device functions to operate within the CUDA Kernel.

Relevant CUDA Device Functions

For DMA memory copies, use functions defined in doca_gpunetio_dev_dma.cuh, starting with doca_gpu_dev_dma_*.

Example Use Case

Refer to the sample doca_gpunetio_dma_memcpy for an example of triggering DMA memory copies from a CUDA Kernel.

Tip

For a deeper understanding of DMA operations, refer to the DOCA DMA documentation.

This section details the specific structures and operations related to the main DOCA GPUNetIO API on CPU and GPU. GPUNetIO headers are:

  • doca_gpunetio.h – CPU functions to create GPU handlers, allocate GPU memory, etc.

  • doca_gpunetio_dev_buf.cuh – GPU functions to manage a DOCA buffer array

  • doca_gpunetio_dev_eth_rxq.cuh – GPU functions to manage a DOCA Ethernet receive queue

  • doca_gpunetio_dev_eth_txq.cuh – GPU functions to manage a DOCA Ethernet send queue

  • doca_gpunetio_dev_sem.cuh – GPU functions to manage a DOCA GPUNetIO semaphore

  • doca_gpunetio_dev_rdma.cuh – GPU functions to manage a DOCA RDMA queue

  • doca_gpunetio_dev_dma.cuh – GPU functions to manage a DOCA DMA queue

  • doca_gpunetio_dev_verbs_*.cuh – GPU functions to manage a DOCA Verbs object

This section lists the main functions of DOCA GPUNetIO.

Tip

To better understand DOCA core objects like doca_mmap or doca_buf_array, please refer to the DOCA Core documentation.

All DOCA Core, Ethernet, RDMA and DMA objects used in combination with GPUNetIO have a GPU export function to obtain a GPU handler for that object.

CPU Functions

This section lists the DOCA GPUNetIO functions that can be used on the CPU only.

doca_gpu_mem_type

This enum lists all the possible memory types that can be allocated with GPUNetIO.

enum doca_gpu_mem_type {
    DOCA_GPU_MEM_TYPE_GPU = 0,
    DOCA_GPU_MEM_TYPE_GPU_CPU = 1,
    DOCA_GPU_MEM_TYPE_CPU_GPU = 2,
};

Note

With regards to the syntax, the text string after the DOCA_GPU_MEM_TYPE_ prefix signifies <where-memory-resides>_<who-has-access>.

  • DOCA_GPU_MEM_TYPE_GPU – memory resides on the GPU and is accessible from the GPU only

  • DOCA_GPU_MEM_TYPE_GPU_CPU – memory resides on the GPU and is accessible also by the CPU

  • DOCA_GPU_MEM_TYPE_CPU_GPU – memory resides on the CPU and is accessible also by the GPU

Typical usage of the DOCA_GPU_MEM_TYPE_GPU_CPU memory type is to send a notification from the CPU to the GPU (e.g., a CUDA kernel periodically checking to see if the exit condition set by the CPU is met).
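
For example, the exit flag can be a 4-byte word allocated with doca_gpu_mem_alloc() (described below) using the DOCA_GPU_MEM_TYPE_GPU_CPU type: the CPU writes it through the returned memptr_cpu pointer while a persistent CUDA kernel polls it through the memptr_gpu pointer. A minimal sketch of the GPU side, with illustrative names:

/* Persistent CUDA kernel polling an exit flag that resides in GPU memory but is CPU-writable.
 * exit_flag_gpu is the memptr_gpu pointer returned by doca_gpu_mem_alloc();
 * the CPU requests termination by writing 1 through the matching memptr_cpu pointer. */
__global__ void polling_kernel(volatile uint32_t *exit_flag_gpu)
{
    while (*exit_flag_gpu == 0) {
        /* ... receive and process packets here ... */
    }
}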

doca_gpu_create

This is the first function a GPUNetIO application must invoke to create a handler on a GPU device. The function initializes a pointer to a structure in memory with type struct doca_gpu *.

doca_error_t doca_gpu_create(const char *gpu_bus_id, struct doca_gpu **gpu_dev);

  • gpu_bus_id – <PCIe-bus>:<device>.<function> of the GPU device you want to use in your application

  • gpu_dev [out] – GPUNetIO handler to that GPU device

To get the PCIe address, users can use the commands lspci or nvidia-smi.
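
A minimal sketch of the call (the PCIe address string is a placeholder for the GPU address reported on your system):

struct doca_gpu *gpu_dev = NULL;
doca_error_t result;

/* "ca:00.0" is a placeholder GPU PCIe address; obtain the real one with lspci or nvidia-smi */
result = doca_gpu_create("ca:00.0", &gpu_dev);
if (result != DOCA_SUCCESS) {
    /* handle the error */
}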

doca_gpu_mem_alloc

This CPU function allocates different flavors of memory.

doca_error_t doca_gpu_mem_alloc(struct doca_gpu *gpu_dev, size_t size, size_t alignment, enum doca_gpu_mem_type mtype, void **memptr_gpu, void **memptr_cpu)

  • gpu_dev – GPUNetIO device handler

  • size – Size, in bytes, of the memory area to allocate

  • alignment – Memory address alignment to use. If 0, default one will be used

  • mtype – Type of memory to allocate

  • memptr_gpu [out] – GPU pointer to use to modify that memory from the GPU if memory is allocated on or is visible by the GPU

  • memptr_cpu [out] – CPU pointer to use to modify that memory from the CPU if memory is allocated on or is visible by the CPU. Can be NULL if memory is GPU-only

Warning

Make sure to use the right pointer on the right device! If an application tries to access the memory using the memptr_gpu address from the CPU, a segmentation fault will result.
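
Continuing the sketch above (gpu_dev and result come from the doca_gpu_create example, polling_kernel is the CUDA kernel sketched earlier, and all names are illustrative):

uint32_t *exit_flag_gpu = NULL; /* pointer valid on the GPU */
uint32_t *exit_flag_cpu = NULL; /* pointer valid on the CPU (requires GDRCopy) */

result = doca_gpu_mem_alloc(gpu_dev, sizeof(uint32_t), 0 /* default alignment */,
                            DOCA_GPU_MEM_TYPE_GPU_CPU,
                            (void **)&exit_flag_gpu, (void **)&exit_flag_cpu);
if (result != DOCA_SUCCESS) {
    /* handle the error */
}

*exit_flag_cpu = 0;                      /* CPU writes through the CPU pointer */
polling_kernel<<<1, 1>>>(exit_flag_gpu); /* GPU polls through the GPU pointer  */
/* ... later, at shutdown ... */
*exit_flag_cpu = 1;                      /* ask the CUDA kernel to exit        */
cudaDeviceSynchronize();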


doca_gpu_semaphore_create

Creates a new instance of a DOCA GPUNetIO semaphore. A semaphore is composed of a list of items, each having, by default, a status flag, a number of packets, and the index of a doca_gpu_buf in a doca_gpu_buf_arr.

For example, a GPUNetIO semaphore can be used in applications where a CUDA kernel is responsible for receiving packets in a doca_gpu_buf_arr array associated with an Ethernet receive queue object, doca_gpu_eth_rxq (see section "doca_gpu_dev_eth_rxq_receive_*"), and dispatching packet info to a second CUDA kernel which processes them.

Another way to use a GPUNetIO semaphore is to exchange data across different entities like two CUDA kernels or a CUDA kernel and a CPU thread. The reason for this scenario may be that the CUDA kernel needs to provide the outcome of the packet processing to the CPU which would in turn compile a statistics report. Therefore, it is possible to associate a custom application-defined structure with each item in the semaphore. This way, the semaphore can be used as a message passing object.

[Figure: semaphore used to exchange packet info between CUDA kernels or between a CUDA kernel and a CPU thread]

Entities communicating through a semaphore must adopt a poll/update mechanism according to the following logic:

  • Update:

    1. Populate the next item of the semaphore (packets' info and/or custom application-defined info).

    2. Set status flag to READY.

  • Poll:

    1. Wait for the next item to have a status flag equal to READY.

    2. Read and process info.

    3. Set status flag to DONE.

doca_error_t doca_gpu_semaphore_create(struct doca_gpu *gpu_dev, struct doca_gpu_semaphore **semaphore)

  • gpu_dev – GPUNetIO handler

  • semaphore [out] – GPUNetIO semaphore handler associated to the GPU device

doca_gpu_semaphore_set_memory_type

This function defines the type of memory for the semaphore allocation.

doca_error_t doca_gpu_semaphore_set_memory_type(struct doca_gpu_semaphore *semaphore, enum doca_gpu_mem_type mtype)

  • semaphore – GPUNetIO semaphore handler

  • mtype – Type of memory to allocate the custom info structure

    • If the application must share packet info only across CUDA kernels, then DOCA_GPU_MEM_TYPE_GPU is the suggested memory type.

    • If the application must share info from a CUDA kernel to a CPU (e.g., to report statistics or output of the pipeline computation), then DOCA_GPU_MEM_TYPE_CPU_GPU is the suggested memory type.

doca_gpu_semaphore_set_items_num

This function defines the number of items in a semaphore.

doca_error_t doca_gpu_semaphore_set_items_num(struct doca_gpu_semaphore *semaphore, uint32_t num_items)

  • semaphore – GPUNetIO semaphore handler

  • num_items – Number of items to allocate

doca_gpu_semaphore_set_custom_info

This function associates an application-specific structure to semaphore items as explained under "doca_gpu_semaphore_create".

doca_error_t doca_gpu_semaphore_set_custom_info(struct doca_gpu_semaphore *semaphore, uint32_t nbytes, enum doca_gpu_mem_type mtype)

  • semaphore – GPUNetIO semaphore handler

  • nbytes – Size of the custom info structure to associate

  • mtype – Type of memory to allocate the custom info structure

    • If the application must share packet info only across CUDA kernels, then DOCA_GPU_MEM_TYPE_GPU is the suggested memory type.

    • If the application must share info from a CUDA kernel to a CPU (e.g., to report statistics or output of the pipeline computation), then DOCA_GPU_MEM_TYPE_CPU_GPU is the suggested memory type.

doca_gpu_semaphore_get_status

From the CPU, query the status of a semaphore item. If the semaphore is allocated with DOCA_GPU_MEM_TYPE_GPU, this function results in a segmentation fault.

doca_error_t doca_gpu_semaphore_get_status(struct doca_gpu_semaphore *semaphore_cpu, uint32_t idx, enum doca_gpu_semaphore_status *status)

  • semaphore_cpu – GPUNetIO semaphore CPU handler

  • idx – Semaphore item index

  • status [out] – Output semaphore status

doca_gpu_semaphore_get_custom_info_addr

From the CPU, retrieve the address of the custom info structure associated with a semaphore item. If the semaphore or the custom info is allocated with DOCA_GPU_MEM_TYPE_GPU, this function results in a segmentation fault.

doca_error_t doca_gpu_semaphore_get_custom_info_addr(struct doca_gpu_semaphore *semaphore_cpu, uint32_t idx, void **custom_info)

  • semaphore_cpu – GPUNetIO semaphore CPU handler

  • idx – Semaphore item index

  • custom_info [out] – Output semaphore custom info address
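
Putting the semaphore CPU functions above together, the following is a minimal sketch of a CPU thread that collects per-batch statistics produced by a CUDA kernel. The struct stats layout and the app_running flag are illustrative; the semaphore start/GPU-export calls, the DOCA_GPU_SEMAPHORE_STATUS_* values, and the CPU-side status setter are assumed from the doca_gpunetio.h header.

struct stats {
    uint32_t pkts;
    uint32_t bytes;
};

struct doca_gpu_semaphore *sem;
struct doca_gpu_semaphore_gpu *sem_gpu; /* GPU handler to pass to the CUDA kernel */

doca_gpu_semaphore_create(gpu_dev, &sem);
doca_gpu_semaphore_set_memory_type(sem, DOCA_GPU_MEM_TYPE_CPU_GPU);   /* the CPU must poll the items */
doca_gpu_semaphore_set_items_num(sem, 1024);
doca_gpu_semaphore_set_custom_info(sem, sizeof(struct stats), DOCA_GPU_MEM_TYPE_CPU_GPU);
doca_gpu_semaphore_start(sem);                     /* assumed start/export calls */
doca_gpu_semaphore_get_gpu_handle(sem, &sem_gpu);

/* CPU polling loop: wait for the next READY item, consume its info, mark it DONE */
uint32_t idx = 0;
enum doca_gpu_semaphore_status status;
struct stats *info;

while (app_running) {
    doca_gpu_semaphore_get_status(sem, idx, &status);
    if (status != DOCA_GPU_SEMAPHORE_STATUS_READY)
        continue;
    doca_gpu_semaphore_get_custom_info_addr(sem, idx, (void **)&info);
    /* ... consume info->pkts and info->bytes ... */
    doca_gpu_semaphore_set_status(sem, idx, DOCA_GPU_SEMAPHORE_STATUS_DONE); /* assumed CPU-side setter */
    idx = (idx + 1) % 1024;
}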

doca_gpu_verbs_export_qp

The doca_gpu_verbs_export_qp function creates a GPUNetIO handler from a DOCA Verbs QP object. It takes a DOCA Verbs QP as input and returns a DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp) allocated on the CPU. To use this object in a CUDA kernel, the application must extract a GPU device handler (struct doca_gpu_dev_verbs_qp) using the doca_gpu_verbs_get_qp_dev function.

doca_error_t doca_gpu_verbs_export_qp(struct doca_gpu *gpu_dev, struct doca_dev *dev, struct doca_verbs_qp *qp, enum doca_gpu_dev_verbs_nic_handler nic_handler, void *gpu_qp_umem_dev_ptr, struct doca_verbs_cq *cq_sq, struct doca_verbs_cq *cq_rq, struct doca_gpu_verbs_qp **qp_out);

  • gpu_dev: GPUNetIO device handler.

  • dev: DOCA device handler.

  • qp: DOCA Verbs QP handler.

  • nic_handler: Type of NIC handler.

  • gpu_qp_umem_dev_ptr: GPU memory pointer to UMEM.

  • cq_sq and cq_rq: CQs associated with the Send and Receive Queues in the QP.

    Note

    While either cq_sq or cq_rq can be NULL, they cannot both be NULL simultaneously.

  • qp_out: DOCA GPUNetIO Verbs QP handler in CPU memory.

doca_gpu_verbs_get_qp_dev

Extracts a GPU device handler (struct doca_gpu_dev_verbs_qp) from a DOCA GPUNetIO Verbs QP object.

doca_error_t doca_gpu_verbs_get_qp_dev(struct doca_gpu_verbs_qp *qp, struct doca_gpu_dev_verbs_qp **qp_gpu);

  • qp: DOCA GPUNetIO Verbs QP handler.

  • qp_gpu: DOCA GPUNetIO Verbs QP GPU device handler in GPU memory.

doca_gpu_verbs_unexport_qp

Unexports a previously exported DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp).

doca_error_t doca_gpu_verbs_unexport_qp(struct doca_gpu *gpu_dev, struct doca_gpu_verbs_qp *qp);

  • gpu_dev: GPUNetIO device handler.

  • qp: DOCA GPUNetIO Verbs QP handler in CPU memory.

doca_gpu_verbs_bridge_export_qp

The doca_gpu_verbs_bridge_export_qp function creates a DOCA GPUNetIO Verbs QP object from application-defined parameters, acting as a bridge between IBVerbs/mlx5 objects and DOCA GPUNetIO. This allows applications to create required objects like QP, CQ, and UAR using IBVerbs and mlx5 commands and then pass the relevant information to this DOCA GPUNetIO function.

The function returns a DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp) allocated on the CPU. To use this object in a CUDA kernel, the application must extract a GPU device handler (struct doca_gpu_dev_verbs_qp) using the doca_gpu_verbs_get_qp_dev function.

It is the application's responsibility to ensure that all passed parameters are correctly created and set.

doca_error_t doca_gpu_verbs_bridge_export_qp(struct doca_gpu *gpu_dev, uint32_t sq_qpn, void *sq_wqe_addr, uint16_t sq_wqe_num, uint32_t *sq_dbrec, uint64_t *sq_db, size_t uar_size, uint32_t sq_cqn, void *sq_cqe_addr, uint32_t sq_cqe_num, uint32_t *sq_cq_dbrec, uint32_t rq_qpn, void *rq_wqe_addr, uint16_t rq_wqe_num, uint32_t *rq_dbrec, uint32_t rcv_wqe_size, uint32_t rq_cqn, void *rq_cqe_addr, uint32_t rq_cqe_num, uint32_t *rq_cq_dbrec, enum doca_gpu_dev_verbs_nic_handler nic_handler, struct doca_gpu_verbs_qp **qp_out);

  • gpu_dev: GPUNetIO device handler.

  • sq_qpn: Send QP queue number.

  • sq_wqe_addr: Send QP WQE buffer memory address.

  • sq_wqe_num: Send QP number of WQEs.

  • sq_dbrec: Send QP Doorbell Record address.

  • sq_db: Send QP Doorbell address.

  • uar_size: Size of UAR.

  • sq_cqn: Send CQ number.

  • sq_cqe_addr: Send CQ CQE buffer memory address.

  • sq_cqe_num: Send CQ number of CQEs.

  • sq_cq_dbrec: Send CQ Doorbell Record address.

  • rq_qpn: Receive QP queue number.

  • rq_wqe_addr: Receive QP WQE buffer memory address.

  • rq_wqe_num: Receive QP number of WQEs.

  • rq_dbrec: Receive QP Doorbell Record address.

  • rcv_wqe_size: Size, in bytes, of a receive WQE.

  • rq_cqn: Receive CQ number.

  • rq_cqe_addr: Receive CQ CQE buffer memory address.

  • rq_cqe_num: Receive CQ number of CQEs.

  • rq_cq_dbrec: Receive CQ Doorbell Record address.

  • nic_handler: Type of NIC handler.

  • qp_out: DOCA GPUNetIO Verbs QP handler in CPU memory.

DOCA PE

A DOCA Ethernet Txq context, exported for GPUNetIO usage, can be tracked via DOCA PE on the CPU side to check if there are errors when sending packets or to retrieve notification info after sending a packet with any of the doca_gpu_dev_eth_txq_*_enqueue_* functions on the GPU. An example can be found in the DOCA GPU packet processing application with ICMP traffic.
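
A minimal sketch of such a CPU progress loop (pe is the doca_pe instance connected to the Ethernet Txq context at setup time; app_running is an illustrative flag):

/* CPU thread draining completion/error events of a GPU-datapath Ethernet Txq via DOCA PE */
while (app_running) {
    /* Invokes the success/error callbacks registered on the Txq context when an event is ready */
    (void)doca_pe_progress(pe);
}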

Strong Mode vs. Weak Mode

Some Ethernet and RDMA GPU functions offer two modes of operation: weak and strong.

  • In weak mode, the application calculates the next available position in the queue. With the help of functions like doca_gpu_dev_eth_txq_get_info, doca_gpu_dev_rdma_get_info, or doca_gpu_dev_rdma_recv_get_info, it is possible to obtain the next available position in the queue and the mask of the total number of entries in the queue (so the incremental descriptor index can be wrapped). In this mode, the developer must specify the queue descriptor number at which to enqueue the packet, ensuring that no descriptor in the queue is left empty. It is a bit more complex to manage, but it should result in better performance, and the developer can exploit GPU memory coalescing by enqueuing sequential operations at sequential memory locations.

  • In strong mode, the GPU function enqueues the Ethernet/RDMA operation in the next available position in the queue. It is simpler to manage, as the developer does not have to worry about the operation's position, but it may introduce extra latency to atomically guarantee access by multiple threads to the same queue. Moreover, it does not guarantee that sequential operations refer to sequential memory locations.

    Note

    All strong mode functions work at the CUDA block level. That is, it is not possible to access the same Eth/RDMA queue at the same time from two different CUDA blocks.

GPU Functions – Ethernet

This section provides a list of DOCA GPUNetIO functions that can be used for Ethernet network operations on the GPU only within a CUDA kernel.

doca_gpu_dev_eth_rxq_receive_*

To acquire packets in a CUDA kernel, DOCA GPUNetIO offers different flavors of the receive function for different scopes: per CUDA block, per CUDA warp, and per CUDA thread.

__device__ doca_error_t doca_gpu_dev_eth_rxq_receive_block(struct doca_gpu_eth_rxq *eth_rxq, uint32_t max_rx_pkts, uint64_t timeout_ns, uint32_t *num_rx_pkts, uint64_t *doca_gpu_buf_idx)

__device__ doca_error_t doca_gpu_dev_eth_rxq_receive_warp(struct doca_gpu_eth_rxq *eth_rxq, uint32_t max_rx_pkts, uint64_t timeout_ns, uint32_t *num_rx_pkts, uint64_t *doca_gpu_buf_idx)

__device__ doca_error_t doca_gpu_dev_eth_rxq_receive_thread(struct doca_gpu_eth_rxq *eth_rxq, uint32_t max_rx_pkts, uint64_t timeout_ns, uint32_t *num_rx_pkts, uint64_t *doca_gpu_buf_idx)

  • eth_rxq – Ethernet receive queue GPU handler

  • max_rx_pkts – Maximum number of packets to receive. It ensures the number of packets returned by the function is less than or equal to this number.

  • timeout_ns – Nanoseconds to wait for packets before returning

  • num_rx_pkts [out] – Effective number of received packets. With CUDA block or warp scopes, this variable should be visible in memory by all the other threads (shared or global memory).

  • doca_gpu_buf_idx [out] – DOCA buffer index of the first packet received in this function. With CUDA block or warp scopes, this variable should be visible in memory by all the other threads (shared or global memory).

Note

If both max_rx_pkts and timeout_ns are 0, the function never returns.

CUDA threads in the same scope (thread, warp, or block) must invoke the function on the same receive queue. The output parameters num_rx_pkts and doca_gpu_buf_idx must be visible by all threads in the scope (e.g., CUDA shared memory for warp and block).

Each packet received by this function goes to the doca_gpu_buf_arr internally created and associated with the Ethernet queues (see section "Building Blocks").

The function exits when timeout_ns is reached or when the maximum number of packets is received.

Note

For CUDA block scope, the block invoking the receive function must have at least 32 CUDA threads (i.e., one warp).

The output parameters indicate how many packets have been received (num_rx_pkts) and the index of the first received packet in the doca_gpu_buf_arr internally associated with the Ethernet receive queue. Packets are stored consecutively in the doca_gpu_buf_arr so if the function returns num_rx_pkts=N and doca_gpu_buf_idx=X, this means that all the doca_gpu_buf in the doca_gpu_buf_arr within the range [X, .. ,X + (N-1)] have been filled with packets.

[Figure: packets stored consecutively in the DOCA buffer array associated with the receive queue]

The DOCA buffer array is treated in a circular fashion so that once the last DOCA buffer is filled by a packet, the queue circles back to the first DOCA buffer. There is no need for the application to lock or free doca_gpu_buf_arr buffers.

Note

It is the application's responsibility to consume packets before they are overwritten when circling back, properly dimensioning the DOCA buffer array size and scaling across multiple receive queues.
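
Putting the receive semantics together, a minimal per-block receiver might look like the following sketch. The exit flag and the packet-processing body are application specific, and the buffer accessors (doca_gpu_dev_eth_rxq_get_buf, doca_gpu_dev_buf_get_addr) are assumed from the doca_gpunetio_dev_eth_rxq.cuh and doca_gpunetio_dev_buf.cuh headers.

__global__ void receive_kernel(struct doca_gpu_eth_rxq *rxq, volatile uint32_t *exit_cond)
{
    __shared__ uint32_t num_rx_pkts; /* must be visible to every thread in the block */
    __shared__ uint64_t buf_idx;
    struct doca_gpu_buf *buf;
    uintptr_t buf_addr;

    while (*exit_cond == 0) {
        /* Block-scope receive: return after up to 128 packets or 50 us, whichever comes first */
        doca_gpu_dev_eth_rxq_receive_block(rxq, 128, 50000, &num_rx_pkts, &buf_idx);
        __syncthreads();

        /* Each thread processes a subset of the packets placed in the internal buffer array */
        for (uint32_t i = threadIdx.x; i < num_rx_pkts; i += blockDim.x) {
            doca_gpu_dev_eth_rxq_get_buf(rxq, buf_idx + i, &buf);
            doca_gpu_dev_buf_get_addr(buf, &buf_addr);
            /* ... parse and process the packet bytes at buf_addr ... */
        }
        __syncthreads();
    }
}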


doca_gpu_send_flags

This enum lists all the possible flags for the txq functions. The usage of those flags makes sense if a DOCA PE has been attached to the DOCA Ethernet Txq context with GPU data path and a CPU thread, in a loop, keeps invoking doca_pe_progress.

Warning

If no DOCA PE has been attached to the DOCA Ethernet Txq context, it is mandatory to use the DOCA_GPU_SEND_FLAG_NONE flag.

enum doca_gpu_send_flags {
    DOCA_GPU_SEND_FLAG_NONE = 0,
    DOCA_GPU_SEND_FLAG_NOTIFY = 1 << 0,
};

  • DOCA_GPU_SEND_FLAG_NONE (default) – send is executed and no notification info is returned. If an error occurs, an event is generated. This error can be detected from the CPU side using DOCA PE.

  • DOCA_GPU_SEND_FLAG_NOTIFY – once the send (or wait) is executed, return a notification with packet info. This notification can be detected from the CPU side using DOCA PE.

doca_gpu_dev_eth_txq_send_*

To send packets from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing a packet in the Ethernet TXQ. For both modes, the scope is a single CUDA thread, each populating and enqueuing a different doca_gpu_buf from a doca_gpu_buf_arr in the send queue.

__device__ doca_error_t doca_gpu_dev_eth_txq_get_info(struct doca_gpu_eth_txq *eth_txq, uint32_t *curr_position, uint32_t *mask_max_position)

  • eth_txq – Ethernet send queue GPU handler

  • curr_position – Next available position in the queue

  • mask_max_position – Mask of the total number of positions in the queue

__device__ doca_error_t doca_gpu_dev_eth_txq_send_enqueue_strong(struct doca_gpu_eth_txq *eth_txq, const struct doca_gpu_buf *buf_ptr, const uint32_t nbytes, const uint32_t flags_bitmask)

  • eth_txq – Ethernet send queue GPU handler

  • buf_ptr – DOCA buffer from a DOCA GPU buffer array to be sent

  • nbytes – Number of bytes to be sent in the packet

  • flags_bitmask – One of the flags in the doca_gpu_send_flags enum

__device__ doca_error_t doca_gpu_dev_eth_txq_send_enqueue_weak(const struct doca_gpu_eth_txq *eth_txq, const struct doca_gpu_buf *buf_ptr, const uint32_t nbytes, const uint32_t ndescr, const uint32_t flags_bitmask)

  • eth_txq – Ethernet send queue GPU handler

  • buf_ptr – DOCA buffer from a DOCA GPU buffer array to be sent

  • nbytes – Number of bytes to be sent in the packet

  • ndescr – Position in the queue to place the packet. Range: 0 - mask_max_position.

  • flags_bitmask – One of the flags in the doca_gpu_send_flags enum
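
As a sketch of the weak-mode bookkeeping with a block scope (illustrative only; buf_arr is the GPU handler of a doca_gpu_buf_arr whose buffers have already been populated with packet payloads, doca_gpu_dev_buf_get_buf is assumed from doca_gpunetio_dev_buf.cuh, and the commit/push steps are described in the following sections):

__global__ void send_block_weak(struct doca_gpu_eth_txq *txq, struct doca_gpu_buf_arr *buf_arr, uint32_t pkt_nbytes)
{
    __shared__ uint32_t base_pos;
    __shared__ uint32_t mask;
    struct doca_gpu_buf *buf;

    if (threadIdx.x == 0)
        doca_gpu_dev_eth_txq_get_info(txq, &base_pos, &mask);
    __syncthreads();

    /* Each thread owns one descriptor slot; the mask wraps the index around the queue */
    uint32_t my_pos = (base_pos + threadIdx.x) & mask;
    doca_gpu_dev_buf_get_buf(buf_arr, threadIdx.x, &buf);
    doca_gpu_dev_eth_txq_send_enqueue_weak(txq, buf, pkt_nbytes, my_pos, DOCA_GPU_SEND_FLAG_NONE);

    __syncthreads();
    if (threadIdx.x == 0) {
        doca_gpu_dev_eth_txq_commit_weak(txq, blockDim.x); /* descriptors enqueued by this block */
        doca_gpu_dev_eth_txq_push(txq);
    }
}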

doca_gpu_dev_eth_txq_wait_*

To enable Accurate Send Scheduling, the "wait on time" barrier (based on timestamp) must be set in the send queue before enqueuing more packets. Like doca_gpu_dev_eth_txq_send_*, doca_gpu_dev_eth_txq_wait_* also has a strong and weak mode.

__device__ doca_error_t doca_gpu_dev_eth_txq_wait_time_enqueue_strong(struct doca_gpu_eth_txq *eth_txq, const uint64_t wait_on_time_value, const uint32_t flags_bitmask)

  • eth_txq – Ethernet send queue GPU handler

  • wait_on_time_value – Timestamp to specify when packets must be sent after this barrier

  • flags_bitmask – One of the flags in the doca_gpu_send_flags enum

__device__ doca_error_t doca_gpu_dev_eth_txq_wait_time_enqueue_weak(struct doca_gpu_eth_txq *eth_txq, const uint64_t wait_on_time_value, const uint32_t ndescr, const uint32_t flags_bitmask)

  • eth_txq – Ethernet send queue GPU handler

  • wait_on_time_value – Timestamp to specify when packets must be sent after this barrier

  • ndescr – Position in the queue to place the packet. Range: 0 - mask_max_position.

  • flags_bitmask – One of the flags in the doca_gpu_send_flags enum

Please refer to section "GPUNetIO Samples" to understand how to enable and use Accurate Send Scheduling.
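
As a hedged fragment of this pattern (strong mode, executed by a single thread of the sending scope; wait_ts is a timestamp already converted on the CPU side to the NIC clock format, and txq/buf/pkt_nbytes are illustrative):

/* Set the "wait on time" barrier, then enqueue the packets that must leave after wait_ts */
doca_gpu_dev_eth_txq_wait_time_enqueue_strong(txq, wait_ts, DOCA_GPU_SEND_FLAG_NONE);
doca_gpu_dev_eth_txq_send_enqueue_strong(txq, buf, pkt_nbytes, DOCA_GPU_SEND_FLAG_NONE);
/* ... more send enqueues belonging to the same time slot ... */
doca_gpu_dev_eth_txq_commit_strong(txq);
doca_gpu_dev_eth_txq_push(txq);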

doca_gpu_dev_eth_txq_commit_*

After enqueuing all the packets to be sent and time barriers, a commit function must be invoked on the txq queue. The right commit function must be used according to the type of enqueue mode (i.e., strong or weak) used in doca_gpu_dev_eth_txq_send_* and doca_gpu_dev_eth_txq_wait_*.

__device__ doca_error_t doca_gpu_dev_eth_txq_commit_strong(struct doca_gpu_eth_txq *eth_txq)

  • eth_txq – Ethernet send queue GPU handler

__device__ doca_error_t doca_gpu_dev_eth_txq_commit_weak(struct doca_gpu_eth_txq *eth_txq, const uint32_t descr_num)

  • eth_txq – Ethernet send queue GPU handler

  • descr_num – Number of queue items enqueued thus far

Only one CUDA thread in the scope (CUDA block or CUDA warp) can invoke this function on the send queue after several enqueue operations. Typical flow is as follows:

  1. All threads in the scope enqueue packets in the send queue.

  2. Synchronization point.

  3. Only one thread in the scope performs the send queue commit.

doca_gpu_dev_eth_txq_push

After committing, the items in the send queue must be actually pushed to the network card.

__device__ doca_error_t doca_gpu_dev_eth_txq_push(struct doca_gpu_eth_txq *eth_txq)

  • eth_txq – Ethernet send queue GPU handler

Only one CUDA thread in the scope (CUDA block or CUDA warp) can invoke this function on the send queue after several enqueue or commit operations. Typical flow is as follows:

  1. All threads in the scope enqueue packets in the send queue.

  2. Synchronization point.

  3. Only one thread in the scope does the send queue commit.

  4. Only one thread in the scope does the send queue push.

Section "Produce and Send" provides an example where the scope is a block (e.g., each CUDA block operates on a different Ethernet send queue).

doca_gpu_dev_eth_txq_wait_completion

After pushing a number of send operations to the txq, this function allows the application to wait for all of them to complete before moving forward. In this context, the number of send operations means the number of calls to the doca_gpu_dev_eth_txq_commit_* function.

As an example, a CUDA block composed of 512 CUDA threads posts 512 sends (one per thread), but only thread 0 executes the commit and push. In this case, doca_gpu_dev_eth_txq_wait_completion must be called with num_sends = 1.

__device__ doca_error_t doca_gpu_dev_eth_txq_wait_completion(struct doca_gpu_eth_txq *eth_txq, uint32_t num_sends, enum doca_gpu_dev_eth_txq_wait_flags wait_mode, uint32_t *num_completed);

  • eth_txq – Ethernet send queue GPU handler

  • num_sends – Number of committed send operations to wait for

  • wait_mode – Wait in blocking or non-blocking mode

  • num_completed – Number of completed send operations

Only one CUDA thread in the scope (CUDA block or CUDA warp) can invoke this function on the send queue after a push operation. The sample gpunetio_simple_send shows the usage of this function.

GPU Functions – Verbs

GPUNetIO Verbs data path functions are defined in header files named doca_gpunetio_dev_verbs_*, with inline implementations.

Header Files

  • doca_gpunetio_verbs_def.h: Contains constants, enums, and structure definitions.

  • doca_gpunetio_dev_verbs_qp.cuh: Provides CUDA functions for interacting with a QP, including WQE creation and ringing the network card doorbell.

  • doca_gpunetio_dev_verbs_cq.cuh: Offers CUDA functions for interacting with a CQ, including CQE polling and validation.

  • doca_gpunetio_dev_verbs_onesided.cuh: Enables one-sided operations (e.g., put) with the shared QP feature at the CUDA thread and warp levels.

  • doca_gpunetio_dev_verbs_twosided.cuh: Enables two-sided operations (e.g., send/recv) with the shared QP feature at the CUDA thread and warp levels.

  • doca_gpunetio_dev_verbs_counter.cuh: Provides CUDA functions for counter-like operations via RDMA atomics with the shared QP feature, requiring Core Direct to be enabled as a QP attribute.

Warning

The GPUNetIO Verbs APIs are experimental. Please report any issues encountered during use, as your feedback is crucial for enhancing the code's quality and robustness.

Note

When using the GPUNetIO Verbs header files, avoid calling functions that start with doca_priv_* as they are internal functions intended for use by the public API only.

Warning

The doca_gpunetio_verbs_def.h file contains a macro definition #define DOCA_GPUNETIO_VERBS_ENABLE_DEBUG 0 , which disables debug prints, including CQE errors. To enable error messages, set this macro to 1.


Shared QP

The shared QP feature is enabled by specifying a value from the doca_gpu_dev_verbs_exec_scope enum as a template argument in certain functions. This enum value determines the execution scope of the functions.

Thread Scope

The Thread scope is enabled using the DOCA_GPUNETIO_VERBS_EXEC_SCOPE_THREAD flag. In this scope, there is no assumption about the number of threads accessing the QP in parallel from the same or different CUDA kernels. Each thread acts as a standalone entity, posting one or more WQEs and submitting (ringing the doorbell). This approach may be the slowest and most expensive due to contention on atomic operations among potentially all threads in the CUDA kernel(s).

[Figure: shared QP, thread scope]


Warp Scope

The Warp scope is enabled using the DOCA_GPUNETIO_VERBS_EXEC_SCOPE_WARP flag. In this scope, all CUDA threads within a warp call the function, allowing for optimizations. Specifically, each thread in the warp posts a WQE to a different QP location, while only the first thread (lane_idx 0) performs the submit operation.

[Figure: shared QP, warp scope]

This reduces contention on atomic operations compared to the thread scope, as concurrency is limited to per-warp rather than per-thread.

Block Scope

The Block scope is not directly enabled by a specific flag. However, applications can implement block-level functionality by combining the building blocks used in the Thread and Warp scopes.

Ring DB and CPU Proxy

The CPU proxy feature is enabled by specifying a value from the doca_gpu_dev_verbs_nic_handler enum as a template argument in certain functions. This determines who is responsible for ringing the doorbell.

The available modes are:

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO: Automatically detects the best doorbell ringing option.

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_CPU_PROXY: Enables CPU proxy mode. In this mode, GPU submit DB functions provide information to the CPU for ringing the doorbell. A CPU thread must invoke doca_gpu_verbs_cpu_proxy_progress in a loop to detect the GPU information and ring the doorbell. This mode can be enabled via a command-line option in most GPUNetIO Verbs samples.

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_DB: Enables regular GDAKI mode, where CUDA threads ring the doorbell.

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_BF: Experimental BlueFlame doorbell ringing mode (may fail in the current release).

Queue Type

Many functions, such as those for submitting doorbells or polling completion queues, can be applied to either the Send Queue or Receive Queue of a QP. These functions use a template argument from the doca_gpu_dev_verbs_qp_type enum to specify the queue type.

The possible values for this enum are:

  • DOCA_GPUNETIO_VERBS_QP_SQ: Uses the Send Queue from the input QP

  • DOCA_GPUNETIO_VERBS_QP_RQ: Uses the Receive Queue from the input QP

doca_gpunetio_dev_verbs_qp.cuh

The doca_gpunetio_dev_verbs_qp.cuh header file provides functions for QP manipulation, categorized into three main groups:

  • doca_gpu_dev_verbs_wqe_prepare_*: Prepares and posts various types of RDMA WQEs (Write, Read, Send, Recv, Atomic, Wait, and Dump) on QP memory.

  • doca_gpu_dev_verbs_submit: Updates network card registers for doorbell submission, supporting regular GPU-mapped doorbells, CPU proxy doorbells, and BlueFlame doorbells for both Send and Receive Queues.

  • doca_gpu_dev_verbs_wait_*: Waits for CQEs to arrive on SQ CQ or RQ CQ.

For detailed explanations of each function, refer to the header file. Examples of using these functions can be found in the doca_gpunetio_verbs_write_bw and doca_gpunetio_verbs_write_lat samples.

doca_gpunetio_dev_verbs_cq.cuh

The doca_gpunetio_dev_verbs_cq.cuh header file provides functions for manipulating Completion Queues (CQs). Specifically, the doca_gpu_dev_verbs_poll_* functions are used to poll Completion Queue Entries (CQEs) in the CQ, waiting for the completion of Work Queue Entries (WQEs) in the connected Queue Pair (QP).

doca_gpunetio_dev_verbs_onesided.cuh

The doca_gpunetio_dev_verbs_onesided.cuh header file provides functions that support the shared QP feature for one-sided RDMA operations at both CUDA thread and warp levels. The main functions are:

  • doca_gpu_dev_verbs_put: Posts an RDMA Write WQE and submits it. In WARP scope, submission is done only once by the first thread in the warp.

  • doca_gpu_dev_verbs_put_signal: Similar to doca_gpu_dev_verbs_put but with an additional RDMA Atomic Fetch and Add operation. In THREAD scope, each thread posts an RDMA Write, RDMA Atomic, and submits. In WARP scope, each thread posts an RDMA Write, but only the first thread posts the RDMA Atomic and submits.

  • doca_gpu_dev_verbs_get: Posts an RDMA Read WQE and submits it. In WARP scope, submission is done only once by the first thread in the warp. Optionally posts a Dump WQE to ensure Read data is present in memory after CQE arrival (e.g., enum CUGPUDirectRDMAWritesOrdering).

For detailed explanations, refer to the header file. Examples can be found in the doca_gpunetio_verbs_put_bw, doca_gpunetio_verbs_put_signal_bw, and doca_gpunetio_verbs_put_signal_lat samples.

doca_gpunetio_dev_verbs_twosided.cuh

The doca_gpunetio_dev_verbs_twosided.cuh header file provides functions that support the shared QP feature for two-sided RDMA operations at both CUDA thread and warp levels. The main functions are:

  • doca_gpu_dev_verbs_send: Posts an RDMA Send WQE and submits it. In WARP scope, submission is done only once by the first thread in the warp.

  • doca_gpu_dev_verbs_recv: Posts an RDMA Receive WQE and submits it. In WARP scope, submission is done only once by the first thread in the warp.

For two-sided communications, the RDMA protocol requires the following flow:

  1. Peer A posts one or more RDMA Receive WQEs.

  2. Peer A notifies Peer B that RDMA Receive WQEs are posted.

  3. Peer B waits for the notification from Peer A.

  4. Peer B posts one or more RDMA Send WQEs.

The notification can be sent using various methods, such as RDMA Write or RDMA Atomic. An example using RDMA Write as the notification method is available in the doca_gpunetio_verbs_twosided_bw sample.

doca_gpunetio_dev_verbs_counter.cuh

The doca_gpunetio_dev_verbs_counter.cuh header file provides functions that support the shared QP feature with counter functionality enabled. This feature allows triggering a WQE on a companion QP when a CQE arrives at a specific index in the main QP's CQ. To enable the counter feature, Core Direct must be enabled as a QP attribute using doca_verbs_qp_init_attr_set_core_direct_master().

Example of functions in this header file:

  • doca_gpu_dev_verbs_put_counter: Enables posting an RDMA Write WQE on the main QP and an RDMA Atomic FetchAdd on the companion QP, with the RDMA Atomic executed only after the RDMA Write CQE arrives. This allows for buffer reuse after the RDMA Atomic is executed. In warp scope, the first thread in the warp handles posting the Wait WQE, RDMA Atomic, and submitting both QPs.

  • doca_gpu_dev_verbs_submit_multi_qps: Submits both the main QP and companion QP in a single call.

For detailed explanations, refer to the header file. Examples can be found in the doca_gpunetio_verbs_put_counter_bw and doca_gpunetio_verbs_put_counter_lat samples.

GPU Functions – RDMA

This section lists the DOCA GPUNetIO functions that can be used only on the GPU, within a CUDA kernel, to execute RDMA operations. These functions offer a strong and a weak mode: in strong mode, the library tracks queue positions internally, while in weak mode the application explicitly specifies the queue position of each operation (and, at commit time, the number of enqueued operations). The current position and the position mask can be queried with the following functions.


__device__ doca_error_t doca_gpu_dev_rdma_get_info(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, uint32_t *curr_position, uint32_t *mask_max_position)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • curr_position – Next available position in the queue

  • mask_max_position – Mask of the total number of positions in the queue
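
For illustration, the following minimal device-side sketch (assuming connection index 0 and one operation per CUDA thread; the helper name is hypothetical) shows how these two values are typically combined into a queue position for the weak functions described later:

/* Minimal sketch: derive a per-thread queue position for the *_weak functions
 * from the values returned by doca_gpu_dev_rdma_get_info (connection 0). */
__device__ uint32_t next_position_for_thread(struct doca_gpu_dev_rdma *rdma)
{
    uint32_t curr_position;
    uint32_t mask_max_position;

    doca_gpu_dev_rdma_get_info(rdma, 0, &curr_position, &mask_max_position);

    /* The mask wraps the index so it stays in the range 0 - mask_max_position */
    return (curr_position + threadIdx.x) & mask_max_position;
}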


__device__ doca_error_t doca_gpu_dev_rdma_recv_get_info(struct doca_gpu_dev_rdma_r *rdma_r, uint32_t *curr_position, uint32_t *mask_max_position)

  • rdma_r – RDMA receive queue GPU handler

  • curr_position – Next available position in the queue

  • mask_max_position – Mask of the total number of positions in the queue

doca_gpu_dev_rdma_write_*

To RDMA-write data to a remote memory location from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is a single CUDA thread.


__device__ doca_error_t doca_gpu_dev_rdma_write_strong(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, struct doca_gpu_buf *remote_buf, uint64_t remote_offset, struct doca_gpu_buf *local_buf, uint64_t local_offset, size_t length, uint32_t imm, const enum doca_gpu_dev_rdma_write_flags flags)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to write data to

  • remote_offset – Offset, in bytes, to write data to in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to write

  • local_offset – Offset, in bytes, to fetch data from in the local buffer

  • length – Number of bytes to write

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum


__device__ doca_error_t doca_gpu_dev_rdma_write_weak(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, struct doca_gpu_buf *remote_buf, uint64_t remote_offset, struct doca_gpu_buf *local_buf, uint64_t local_offset, size_t length, uint32_t imm, const enum doca_gpu_dev_rdma_write_flags flags, uint32_t position);

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to write data to

  • remote_offset – Offset, in bytes, to write data to in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to write

  • local_offset – Offset, in bytes, to fetch data from in the local buffer

  • length – Number of bytes to write

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.
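
As a minimal sketch of the strong mode (a single CUDA thread writing one buffer, error handling omitted; the flag value DOCA_GPU_RDMA_WRITE_FLAG_NONE is assumed here to indicate a write without immediate), the write can be combined with the commit and wait functions described below:

__global__ void write_one_strong(struct doca_gpu_dev_rdma *rdma,
                                 struct doca_gpu_buf *remote_buf,
                                 struct doca_gpu_buf *local_buf,
                                 size_t length)
{
    uint32_t num_commits = 0;

    /* Enqueue one RDMA Write of length bytes from offset 0 of local_buf
     * to offset 0 of remote_buf on connection 0, no immediate value */
    doca_gpu_dev_rdma_write_strong(rdma, 0, remote_buf, 0, local_buf, 0,
                                   length, 0, DOCA_GPU_RDMA_WRITE_FLAG_NONE);

    /* Consolidate and execute the enqueued request (single thread only) */
    doca_gpu_dev_rdma_commit_strong(rdma, 0);

    /* Optionally wait until the NIC has executed all committed operations */
    doca_gpu_dev_rdma_wait_all(rdma, &num_commits);
}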

doca_gpu_dev_rdma_read_*

To RDMA-read data from a remote memory location into local GPU memory from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes to enqueue operations on the RDMA queue. For both modes, the scope is a single CUDA thread.


__device__ doca_error_t doca_gpu_dev_rdma_read_strong(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, struct doca_gpu_buf *remote_buf, uint64_t remote_offset, struct doca_gpu_buf *local_buf, uint64_t local_offset, size_t length, const uint32_t flags_bitmask)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array from which to read data

  • remote_offset – Offset, in bytes, to read data from in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array in which to store the remote data

  • local_offset – Offset, in bytes, at which to store data in the local buffer

  • length – Number of bytes to be read

  • flags_bitmask – Must be 0; reserved for future use


__device__ doca_error_t doca_gpu_dev_rdma_read_weak(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, struct doca_gpu_buf *remote_buf, uint64_t remote_offset, struct doca_gpu_buf *local_buf, uint64_t local_offset, size_t length, const uint32_t flags_bitmask, uint32_t position);

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array from which to read data

  • remote_offset – Offset, in bytes, to read data from in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array in which to store the remote data

  • local_offset – Offset, in bytes, at which to store data in the local buffer

  • length – Number of bytes to be read

  • flags_bitmask – Must be 0; reserved for future use

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_send_*

To RDMA-send data from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is a single CUDA thread.


__device__ doca_error_t doca_gpu_dev_rdma_send_strong(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, struct doca_gpu_buf *local_buf, uint64_t local_offset, size_t length, uint32_t imm, const enum doca_gpu_dev_rdma_write_flags flags)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send

  • local_offset – Offset in bytes to fetch data in the local buffer

  • length – Number of bytes to send

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum


__device__ doca_error_t doca_gpu_dev_rdma_send_weak(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, struct doca_gpu_buf *local_buf, uint64_t local_offset, size_t length, uint32_t imm, const enum doca_gpu_dev_rdma_write_flags flags, uint32_t position);

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send

  • local_offset – Offset in bytes to fetch data in the local buffer

  • length – Number of bytes to send

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_commit_*

Once all RDMA write, send, or read requests have been enqueued in the RDMA queue, a commit must be issued to consolidate and execute those requests. Only one CUDA thread at a time can invoke this function.


__device__ doca_error_t doca_gpu_dev_rdma_commit_strong(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.


__device__ doca_error_t doca_gpu_dev_rdma_commit_weak(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, uint32_t num_ops)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • num_ops – Number of RDMA requests enqueued since the last commit
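
As a minimal sketch of the weak mode (one write per CUDA thread in a single block, error handling omitted; the flag value DOCA_GPU_RDMA_WRITE_FLAG_NONE is assumed), each thread derives its own queue position and only thread 0 commits:

__global__ void write_many_weak(struct doca_gpu_dev_rdma *rdma,
                                struct doca_gpu_buf *remote_buf,
                                struct doca_gpu_buf *local_buf,
                                size_t length)
{
    uint32_t curr_position;
    uint32_t mask_max_position;

    /* Query the next free position and the queue-size mask (connection 0) */
    doca_gpu_dev_rdma_get_info(rdma, 0, &curr_position, &mask_max_position);

    /* Each thread writes length bytes at its own offset and queue position */
    uint32_t position = (curr_position + threadIdx.x) & mask_max_position;
    doca_gpu_dev_rdma_write_weak(rdma, 0,
                                 remote_buf, threadIdx.x * length,
                                 local_buf, threadIdx.x * length,
                                 length, 0, DOCA_GPU_RDMA_WRITE_FLAG_NONE,
                                 position);

    /* Ensure all WQEs are posted before a single thread commits them */
    __syncthreads();
    if (threadIdx.x == 0)
        doca_gpu_dev_rdma_commit_weak(rdma, 0, blockDim.x);
}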

doca_gpu_dev_rdma_wait_all

After a commit, RDMA requests are executed asynchronously by the network card while the application moves forward with other operations. If the application needs to verify that all RDMA operations have been completed by the network card, this "wait all" function can be used to wait for all previously posted operations. Only one CUDA thread at a time can invoke this function.


__device__ doca_error_t doca_gpu_dev_rdma_wait_all(struct doca_gpu_dev_rdma *rdma, uint32_t *num_commits)

  • rdma – RDMA queue GPU handler

  • num_commits – Output parameter; the number of commit operations completed

Info

This function is optional, and it can be used to ensure all the RDMA Send/Write/Read operations have actually been executed before moving forward with the application.


doca_gpu_dev_rdma_recv_*

To receive data from an RDMA send, send with immediate, or write with immediate, the destination peer should post a receive operation. DOCA GPUNetIO RDMA receive operations must be done with a doca_gpu_dev_rdma_r handler. This handler can be obtained with the function doca_gpu_dev_rdma_get_recv.

Note

All receive operations must use this object.


__device__ doca_error_t doca_gpu_dev_rdma_get_recv(struct doca_gpu_dev_rdma *rdma, struct doca_gpu_dev_rdma_r **rdma_r)

  • rdma – RDMA queue GPU handler

  • rdma_r – RDMA receive queue GPU handler

For the receive side as well, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA receive queue. For both modes, the scope is a single CUDA thread.


__device__ doca_error_t doca_gpu_dev_rdma_recv_strong(struct doca_gpu_dev_rdma_r *rdma_r, struct doca_gpu_buf *recv_buf, size_t recv_length, uint64_t recv_offset, const uint32_t flags_bitmask)

  • rdma_r – RDMA receive queue GPU handler

  • recv_buf – Local DOCA buffer from a DOCA GPU buffer array in which to store the received data

  • recv_length – Number of bytes to receive

  • recv_offset – Offset, in bytes, at which to store received data in the local buffer

  • flags_bitmask – Must be 0; reserved for future use


__device__ doca_error_t doca_gpu_dev_rdma_recv_weak(struct doca_gpu_dev_rdma_r *rdma_r, struct doca_gpu_buf *recv_buf, size_t recv_length, uint64_t recv_offset, const uint32_t flags_bitmask, uint32_t position);

  • rdma_r – RDMA receive queue GPU handler

  • recv_buf – Local DOCA buffer from a DOCA GPU buffer array in which to store the received data

  • recv_length – Number of bytes to receive

  • recv_offset – Offset, in bytes, at which to store received data in the local buffer

  • flags_bitmask – Must be 0; reserved for future use

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_recv_commit_*

After posting several RDMA receive operations, a commit function must be invoked to activate the receive operations in the queue. Only one CUDA thread at a time can invoke this function.


__device__ doca_error_t doca_gpu_dev_rdma_recv_commit_strong(struct doca_gpu_dev_rdma_r *rdma_r)

  • rdma_r – RDMA receive queue GPU handler


__device__ doca_error_t doca_gpu_dev_rdma_recv_commit_weak(struct doca_gpu_dev_rdma_r *rdma_r, uint32_t num_ops)

  • rdma_r – RDMA receive queue GPU handler

  • num_ops – Number of RDMA receive requests enqueued since the last commit

doca_gpu_dev_rdma_recv_wait_all

This function waits for the completion of all previously posted RDMA receive operations. Only one CUDA thread at a time can invoke this function. It works in blocking or non-blocking mode, as defined by the following flags:


enum doca_gpu_dev_rdma_recv_wait_flags {
	DOCA_GPU_RDMA_RECV_WAIT_FLAG_NB = 0,	/**< Non-Blocking mode: the wait receive function doca_gpu_dev_rdma_recv_wait
						 *   checks if the receive operation happened (data has been received)
						 *   and exit from the function. If nothing has been received,
						 *   the function doesn't block the execution. */
	DOCA_GPU_RDMA_RECV_WAIT_FLAG_B = 1,	/**< Blocking mode: the wait receive function doca_gpu_dev_rdma_recv_wait
						 *   blocks the execution waiting for the receive operations to be executed. */
};

Function:


__device__ doca_error_t doca_gpu_dev_rdma_recv_wait_all(struct doca_gpu_dev_rdma_r *rdma_r, const enum doca_gpu_dev_rdma_recv_wait_flags flags, uint32_t *num_ops, uint32_t *imm_val)

  • rdma_r – RDMA receive queue GPU handler

  • flags – Receive wait flags (blocking or non-blocking mode)

  • num_ops – Output parameter; the function reports the number of completed receive operations.

  • imm_val – Output parameter. Application-provided buffer where the function can store received immediate values, if any (or 0xFFFFFFFF if no immediate value is received). If nullptr, the function ignores this parameter.
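
As a minimal receive-side sketch (single CUDA thread, blocking wait, one receive posted into a pre-allocated doca_gpu_buf; error handling omitted), the receive flow described above can be combined as follows:

__global__ void receive_one(struct doca_gpu_dev_rdma *rdma,
                            struct doca_gpu_buf *recv_buf,
                            size_t buf_size)
{
    struct doca_gpu_dev_rdma_r *rdma_r;
    uint32_t num_ops = 0;
    uint32_t imm_val = 0;

    /* Obtain the receive-queue handler from the RDMA GPU handler */
    doca_gpu_dev_rdma_get_recv(rdma, &rdma_r);

    /* Post one receive operation at offset 0 of the local buffer */
    doca_gpu_dev_rdma_recv_strong(rdma_r, recv_buf, buf_size, 0, 0);

    /* Activate the posted receive */
    doca_gpu_dev_rdma_recv_commit_strong(rdma_r);

    /* Block until the receive completes; imm_val reports a received
     * immediate value, if any (0xFFFFFFFF otherwise) */
    doca_gpu_dev_rdma_recv_wait_all(rdma_r, DOCA_GPU_RDMA_RECV_WAIT_FLAG_B,
                                    &num_ops, &imm_val);
}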

GPU Functions – DMA

This section provides a list of DOCA GPUNetIO functions that can be used on the GPU only within a CUDA kernel to execute DMA operations.

doca_gpu_dev_dma_memcpy

This function allows a CUDA kernel to trigger a DMA memory copy operation through the DMA engine. There is no strong/weak mode here; the DMA functions assume strong behavior by default.


__device__ doca_error_t doca_gpu_dev_dma_memcpy(struct doca_gpu_dma *dma, struct doca_gpu_buf *src_buf, uint64_t src_offset, struct doca_gpu_buf *dst_buf, uint64_t dst_offset, size_t length);

  • dma – DMA queue GPU handler

  • src_buf – memcpy source buffer

  • src_offset – fetch data starting from this source buffer offset

  • dst_buf – memcpy destination buffer

  • dst_offset – copy data starting from this destination buffer offset

  • length – number of bytes to copy

doca_gpu_dev_dma_commit

After posting several DMA memory copies, a commit function must be invoked to execute the operations enqueued in the DMA queue. Only one CUDA thread at a time can invoke this function.


__device__ doca_error_t doca_gpu_dev_dma_commit(struct doca_gpu_dma *dma);

  • dma – DMA queue GPU handler
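
A minimal device-side sketch (single CUDA thread, error handling omitted; the DMA queue handler and the source/destination buffers are assumed to have been prepared on the host with the DOCA DMA and GPUNetIO CPU APIs):

__global__ void dma_copy_one(struct doca_gpu_dma *dma,
                             struct doca_gpu_buf *src_buf,
                             struct doca_gpu_buf *dst_buf,
                             size_t length)
{
    /* Enqueue one DMA copy of length bytes from offset 0 of src_buf
     * to offset 0 of dst_buf */
    doca_gpu_dev_dma_memcpy(dma, src_buf, 0, dst_buf, 0, length);

    /* Execute the operations enqueued in the DMA queue */
    doca_gpu_dev_dma_commit(dma);
}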

This section contains samples that show how to enable simple GPUNetIO features. Before building and running them, be sure to correctly set the following environment variables:

Build the sample


export PATH=${PATH}:/usr/local/cuda/bin
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"
export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/usr/lib/pkgconfig:/opt/mellanox/grpc/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu/pkgconfig:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu/pkgconfig
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64:/opt/mellanox/gdrcopy/src:/opt/mellanox/dpdk/lib/{x86_64,aarch64}-linux-gnu:/opt/mellanox/doca/lib/{x86_64,aarch64}-linux-gnu

Info

All the DOCA samples described in this section are governed under the BSD-3 software license agreement.

Note

Please ensure the architecture of your GPU is included in the meson.build file before building the samples (e.g., sm_80 for Ampere, sm_89 for L40, sm_90 for H100, etc.).

Ethernet Send Wait Time

The sample shows how to enable Accurate Send Scheduling (or wait-on-time) in the context of a GPUNetIO application. Accurate Send Scheduling is the ability of an NVIDIA NIC to send packets in the future according to application-provided timestamps.

Note

This feature is supported on ConnectX-6 Dx and later.

Info

This NVIDIA blog post offers an example of how this feature has been used in 5G networks.

This DOCA GPUNetIO sample provides a simple application to send packets with Accurate Send Scheduling from the GPU.

Synchronizing Clocks

Before starting the sample, it is important to properly synchronize the CPU clock with the NIC clock. This way, timestamps provided by the system clock are synchronized with the time in the NIC.

For this purpose, at least the phc2sys service must be used. To install it on an Ubuntu system:

phc2sys


sudo apt install linuxptp

To start the phc2sys service properly, a config file must be created in /lib/systemd/system/phc2sys.service. Assuming the network interface is ens6f0:

phc2sys


[Unit]
Description=Synchronize system clock or PTP hardware clock (PHC)
Documentation=man:phc2sys

[Service]
Restart=always
RestartSec=5s
Type=simple
ExecStart=/bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp$(ethtool -T ens6f0 | grep PTP | awk '{print $4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"

[Install]
WantedBy=multi-user.target

Now the phc2sys service can be started:

phc2sys


sudo systemctl stop systemd-timesyncd
sudo systemctl disable systemd-timesyncd
sudo systemctl daemon-reload
sudo systemctl start phc2sys.service

To check the status of phc2sys:

phc2sys


$ sudo systemctl status phc2sys.service

● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
     Loaded: loaded (/lib/systemd/system/phc2sys.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-04-03 10:59:13 UTC; 2 days ago
       Docs: man:phc2sys
   Main PID: 337824 (sh)
      Tasks: 2 (limit: 303788)
     Memory: 560.0K
        CPU: 52min 8.199s
     CGroup: /system.slice/phc2sys.service
             ├─337824 /bin/sh -c "taskset -c 15 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T enp23s0f1np1 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R >
             └─337829 /usr/sbin/phc2sys -s /dev/ptp3 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256

Apr 05 16:35:52 doca-vr-045 phc2sys[337829]: [457395.040] CLOCK_REALTIME rms 8 max 18 freq +110532 +/- 27 delay 770 +/- 3
Apr 05 16:35:53 doca-vr-045 phc2sys[337829]: [457396.071] CLOCK_REALTIME rms 8 max 20 freq +110513 +/- 30 delay 769 +/- 3
Apr 05 16:35:54 doca-vr-045 phc2sys[337829]: [457397.102] CLOCK_REALTIME rms 8 max 18 freq +110527 +/- 30 delay 769 +/- 3
Apr 05 16:35:55 doca-vr-045 phc2sys[337829]: [457398.130] CLOCK_REALTIME rms 8 max 18 freq +110517 +/- 31 delay 769 +/- 3
Apr 05 16:35:56 doca-vr-045 phc2sys[337829]: [457399.159] CLOCK_REALTIME rms 8 max 19 freq +110523 +/- 32 delay 770 +/- 3
Apr 05 16:35:57 doca-vr-045 phc2sys[337829]: [457400.191] CLOCK_REALTIME rms 8 max 20 freq +110528 +/- 33 delay 770 +/- 3
Apr 05 16:35:58 doca-vr-045 phc2sys[337829]: [457401.221] CLOCK_REALTIME rms 8 max 19 freq +110512 +/- 38 delay 770 +/- 3
Apr 05 16:35:59 doca-vr-045 phc2sys[337829]: [457402.253] CLOCK_REALTIME rms 9 max 20 freq +110538 +/- 47 delay 770 +/- 4
Apr 05 16:36:00 doca-vr-045 phc2sys[337829]: [457403.281] CLOCK_REALTIME rms 8 max 21 freq +110517 +/- 38 delay 769 +/- 3
Apr 05 16:36:01 doca-vr-045 phc2sys[337829]: [457404.311] CLOCK_REALTIME rms 8 max 17 freq +110526 +/- 26 delay 769 +/- 3
...

At this point, the system and NIC clocks are synchronized so timestamps provided by the CPU are correctly interpreted by the NIC.

Warning

The timestamps you get may not reflect the real time and day. To get that, you must properly set the ptp4l service with an external grand master on the system. Doing that is out of the scope of this sample.


Running the Sample

To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:

Build the sample


# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_send_wait_time
meson build
ninja -C build

The sample sends 8 bursts of 32 raw Ethernet packets of 1 kB each to a dummy Ethernet address, 10:11:12:13:14:15, in a timed way, programming the NIC to send each burst t nanoseconds after the previous one (command-line option -t).

The following example programs a system with GPU PCIe address ca:00.0 and NIC PCIe address 17:00.0 to send 32 packets every 5 milliseconds:

Run


# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable $ sudo ./build/doca_gpunetio_send_wait_time -n 17:00.0 -g ca:00.0 -t 5000000[09:22:54:165778][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:195][main] Starting the sample [09:22:54:438260][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:224][main] Sample configuration: GPU ca:00.0 NIC 17:00.0 Timeout 5000000ns EAL: Detected CPU lcores: 128 ... EAL: Probe PCI driver: mlx5_pci (15b3:a2d6) device: 0000:17:00.0 (socket 0) [09:22:54:819996][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:607][gpunetio_send_wait_time] Wait on time supported mode: DPDK EAL: Probe PCI driver: gpu_cuda (10de:20b5) device: 0000:ca:00.0 (socket 1) [09:22:54:830212][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:252][create_tx_buf] Mapping send queue buffer (0x0x7f48e32a0000 size 262144B) with legacy nvidia-peermem mode [09:22:54:832462][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:657][gpunetio_send_wait_time] Launching CUDA kernel to send packets [09:22:54:842945][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:664][gpunetio_send_wait_time] Waiting 10 sec for 256 packets to be sent [09:23:04:883309][1316878][DOCA][INF][gpunetio_send_wait_time_sample.c:684][gpunetio_send_wait_time] Sample finished successfully [09:23:04:883339][1316878][DOCA][INF][gpunetio_send_wait_time_main.c:239][main] Sample finished successfully

To verify that packets are actually sent at the right time, use a packet sniffer on the other side (e.g., tcpdump):

tcpdump


$ sudo tcpdump -i enp23s0f1np1 -A -s 64

17:12:23.480318 IP5 (invalid) Sent from DOCA GPUNetIO...........................
....
17:12:23.480368 IP5 (invalid) Sent from DOCA GPUNetIO...........................
# end of first burst of 32 packets, bump to +5ms
17:12:23.485321 IP5 (invalid) Sent from DOCA GPUNetIO...........................
...
17:12:23.485369 IP5 (invalid) Sent from DOCA GPUNetIO...........................
# end of second burst of 32 packets, bump to +5ms
17:12:23.490278 IP5 (invalid) Sent from DOCA GPUNetIO...........................
...

The output should show a jump of approximately 5 milliseconds every 32 packets.

Note

tcpdump may increase latency in sniffing packets and reporting the receive timestamp, so the difference between bursts of 32 packets reported may be less than expected, especially with small interval times like 500 microseconds (-t 500000).

Ethernet Simple Receive

This simple application shows the fundamental steps to build a DOCA GPUNetIO receiver application with one queue for UDP packets and one CUDA kernel receiving those packets from the GPU, printing packet info to the console.

Warning

Invoking a printf from a CUDA kernel is not good practice for release software and should be used only to print debug information as it slows down the overall execution of the CUDA kernel.

To build and run the application:

Build the sample


# Ensure DOCA is in the pkgconfig environment variable
cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_simple_receive
meson build
ninja -C build

To test the application, this guide assumes the usual setup with two machines: one running the DOCA receiver application and the second acting as a packet generator. As the UDP packet generator, this example uses the nping application, which can be easily installed on any Linux machine.

The command to send 10 UDP packets via nping on the packet generator machine is:

nping generator


$ nping --udp -c 10 -p 2090 192.168.1.1 --data-length 1024 --delay 500ms

Starting Nping 0.7.80 ( https://nmap.org/nping ) at 2023-11-20 11:05 UTC
SENT (0.0018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (0.5018s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.0025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (1.5025s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.0032s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (2.5033s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.0040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (3.5040s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.0047s) UDP packet with 1024 bytes to 192.168.1.1:2090
SENT (4.5048s) UDP packet with 1024 bytes to 192.168.1.1:2090

Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
UDP packets sent: 10 | Rcvd: 0 | Lost: 10 (100.00%)
Nping done: 1 IP address pinged in 5.50 seconds

It is assumed that the DOCA Simple Receive sample is waiting on the other machine at IP address 192.168.1.1.

The DOCA Simple Receive sample is launched on a system with the NIC at PCIe address 9f:00.0 and the GPU at PCIe address 8a:00.0:

DOCA Simple Receive


# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable $ sudo ./build/doca_gpunetio_simple_receive -n 9f:00.0 -g 8a:00.0 [02:41:26:920936][177899][DOCA][INF][gpunetio_simple_receive_main.c:161][main] Starting the sample [02:41:27:271921][177899][DOCA][INF][gpunetio_simple_receive_main.c:191][main] Sample configuration: GPU 8a:00.0 NIC 9f:00.0   [02:41:27:277635][177899][DOCA][WRN][engine_model.c:90][adapt_queue_depth] adapting queue depth to 128. EAL: Detected CPU lcores: 48 EAL: Detected NUMA nodes: 2 EAL: Detected shared linkage of DPDK EAL: Multi-process socket /var/run/dpdk/177899/mp_socket EAL: Selected IOVA mode 'VA' EAL: VFIO support initialized TELEMETRY: No legacy callbacks, legacy socket not created EAL: Probe PCI driver: mlx5_pci (15b3:a2dc) device: 0000:9f:00.0 (socket 1) [02:41:28:101711][177899][DOCA][INF][gpunetio_simple_receive_sample.c:468][create_rxq] Creating Sample Eth Rxq [02:41:28:102054][177899][DOCA][INF][gpunetio_simple_receive_sample.c:533][create_rxq] Mapping receive queue buffer (0x0x7f7124000000 size 33554432B dmabuf fd 262) with dmabuf mode [02:41:28:102107][177899][DOCA][WRN][linux_devx_adapter.cpp:389][umem_reg] devx adapter 0x5648c8af3110: Registration using dmabuf is not supported, falling back to legacy registration [02:41:28:107539][177899][DOCA][WRN][linux_devx_adapter.cpp:389][umem_reg] devx adapter 0x5648c8af3110: Registration using dmabuf is not supported, falling back to legacy registration [02:41:28:108795][177899][DOCA][WRN][linux_devx_adapter.cpp:389][umem_reg] devx adapter 0x5648c8af3110: Registration using dmabuf is not supported, falling back to legacy registration [02:41:28:138254][177899][DOCA][INF][gpunetio_simple_receive_sample.c:682][gpunetio_simple_receive] Launching CUDA kernel to receive packets [02:41:28:142716][177899][DOCA][INF][gpunetio_simple_receive_sample.c:686][gpunetio_simple_receive] Waiting for termination Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9 Thread 0 received UDP packet with Eth src 10:70:fd:fa:77:f5 - Eth dst 10:70:fd:fa:77:e9   # Type Ctrl+C to kill the sample   [11:01:44:265141][2328673][DOCA][INF][gpunetio_simple_receive_sample.c:45][signal_handler] Signal 2 received, preparing to exit! [11:01:44:265189][2328673][DOCA][INF][gpunetio_simple_receive_sample.c:620][gpunetio_simple_receive] Exiting from sample [11:01:44:265533][2328673][DOCA][INF][gpunetio_simple_receive_sample.c:362][destroy_rxq] Destroying Rxq [11:01:44:307829][2328673][DOCA][INF][gpunetio_simple_receive_sample.c:631][gpunetio_simple_receive] Sample finished successfully [11:01:44:307861][2328673][DOCA][INF][gpunetio_simple_receive_main.c:204][main] Sample finished successfully


Ethernet Simple Send

This sample implements a simple GPU Ethernet packet generator that constantly sends a flow of raw Ethernet packets. Assuming a system with the NIC at PCIe address 9f:00.0 and the GPU at PCIe address 8a:00.0, the command line to send 1 kB packets (-s option) with a CUDA kernel having 512 CUDA threads (-t option) is:

DOCA Simple Send


# Ensure DOCA is in the LD_LIBRARY_PATH environment variable $ sudo ./build//tmp/try/samples/doca_gpunetio_simple_send -n 9f:00.0 -g 8a:00.0 -t 512 -s 1024 [02:45:17:140632][177981][DOCA][INF][gpunetio_simple_send_main.c:243][main] Starting the sample [02:45:17:484016][177981][DOCA][INF][gpunetio_simple_send_main.c:273][main] Sample configuration: GPU 8a:00.0 NIC 9f:00.0 Packet size 1024 CUDA threads 512   [02:45:17:489649][177981][DOCA][WRN][engine_model.c:90][adapt_queue_depth] adapting queue depth to 128. EAL: Detected CPU lcores: 48 EAL: Detected NUMA nodes: 2 EAL: Detected shared linkage of DPDK EAL: Multi-process socket /var/run/dpdk/177981/mp_socket EAL: Selected IOVA mode 'VA' EAL: VFIO support initialized TELEMETRY: No legacy callbacks, legacy socket not created EAL: Probe PCI driver: mlx5_pci (15b3:a2dc) device: 0000:9f:00.0 (socket 1) [02:45:18:132875][177981][DOCA][INF][gpunetio_simple_send_sample.c:273][create_txq] Creating Sample Eth Txq [02:45:18:134410][177981][DOCA][WRN][linux_devx_adapter.cpp:389][umem_reg] devx adapter 0x564bab1c2070: Registration using dmabuf is not supported, falling back to legacy registration [02:45:18:138997][177981][DOCA][INF][gpunetio_simple_send_sample.c:395][create_txq] Mapping receive queue buffer (0x0x7fd98e710000 size 524288B dmabuf fd 276) with dmabuf mode [02:45:18:139029][177981][DOCA][WRN][linux_devx_adapter.cpp:389][umem_reg] devx adapter 0x564bab1c2070: Registration using dmabuf is not supported, falling back to legacy registration [02:45:18:140278][177981][DOCA][INF][gpunetio_simple_send_sample.c:530][gpunetio_simple_send] Launching CUDA kernel to receive packets [02:45:18:144629][177981][DOCA][INF][gpunetio_simple_send_sample.c:534][gpunetio_simple_send] Waiting for termination   # Type Ctrl+C to kill the sample   [02:51:06:835730][177981][DOCA][INF][gpunetio_simple_send_sample.c:52][signal_handler] Signal 2 received, preparing to exit! [02:51:06:835760][177981][DOCA][INF][gpunetio_simple_send_sample.c:540][gpunetio_simple_send] Exiting from sample [02:51:06:835826][177981][DOCA][INF][gpunetio_simple_send_sample.c:189][destroy_txq] Destroying Txq [02:51:07:328585][177981][DOCA][INF][gpunetio_simple_send_sample.c:551][gpunetio_simple_send] Sample finished successfully [02:51:07:328620][177981][DOCA][INF][gpunetio_simple_send_main.c:291][main] Sample finished successfully

To verify that packets are actually sent, it is possible to check the traffic throughput on the same machine with the mlnx_perf command (assuming the NIC 9f:00.0 has interface name ens6f0np0):

mlnx_perf


$ mlnx_perf -i ens6f0np0

tx_vport_unicast_packets: 21,033,677
tx_vport_unicast_bytes: 21,538,485,248 Bps = 172,307.88 Mbps
tx_packets_phy: 21,033,286
tx_bytes_phy: 21,622,256,208 Bps = 172,978.4 Mbps
tx_prio0_bytes: 21,617,084,940 Bps = 172,936.67 Mbps
tx_prio0_packets: 21,028,292
UP 0: 172,936.67 Mbps = 100.00%
UP 0: 21,028,292 Tran/sec = 100.00%

Different systems may yield different throughput depending on the type of PCIe connection between GPU and NIC, the type of NIC, the type of GPU, and so on. This sample can be used to evaluate your system's send performance by varying the number of CUDA threads and the packet size.

RDMA Client Server

This sample exhibits how to use the GPUNetIO RDMA API to receive and send/write with immediate using a single RDMA queue.

The server has a GPU buffer array A composed of GPU_BUF_NUM doca_gpu_buf elements, each 1 kB in size. The client has two GPU buffer arrays, B and C, each composed of GPU_BUF_NUM doca_gpu_buf elements, each 512 B in size.

The goal is for the client to fill a single server buffer of 1kB with two GPU buffers of 512B as illustrated in the following figure:

image-2024-4-17_12-29-48-version-1-modificationdate-1750108049917-api-v2.png

To show how to use both RDMA write and send, even buffers are sent from the client with RDMA write with immediate, while odd buffers are sent with RDMA send with immediate. In both cases, the server must pre-post the RDMA receive operations.

For each buffer, the CUDA kernel code repeats the handshake:

image-2024-6-26_16-43-24-version-1-modificationdate-1750108061930-api-v2.png

Once all buffers are filled, the server double checks that all values are valid. The server output should be as follows:

DOCA RDMA Server side


# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable $ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write $ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0   [14:11:43:000930][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:250][main] Starting the sample ... [14:11:43:686610][1173110][DOCA][INF][rdma_common.c:91][oob_connection_server_setup] Listening for incoming connections [14:11:45:681523][1173110][DOCA][INF][rdma_common.c:105][oob_connection_server_setup] Client connected at IP: 192.168.2.28 and port: 46274 ... [14:11:45:771807][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:644][rdma_write_server] Before launching CUDA kernel, buffer array A is: [14:11:45:771822][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 0 -> offset 0: 1111 | offset 128: 1111 [14:11:45:771837][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 1 -> offset 0: 1111 | offset 128: 1111 [14:11:45:771851][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 2 -> offset 0: 1111 | offset 128: 1111 [14:11:45:771864][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:646][rdma_write_server] Buffer 3 -> offset 0: 1111 | offset 128: 1111 RDMA Recv 2 ops completed with immediate values 0 and 1! RDMA Recv 2 ops completed with immediate values 1 and 2! RDMA Recv 2 ops completed with immediate values 2 and 3! RDMA Recv 2 ops completed with immediate values 3 and 4! [14:11:45:781561][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:671][rdma_write_server] After launching CUDA kernel, buffer array A is: [14:11:45:781574][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 0 -> offset 0: 2222 | offset 128: 3333 [14:11:45:781583][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 1 -> offset 0: 2222 | offset 128: 3333 [14:11:45:781593][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 2 -> offset 0: 2222 | offset 128: 3333 [14:11:45:781602][1173110][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:673][rdma_write_server] Buffer 3 -> offset 0: 2222 | offset 128: 3333 [14:11:45:781640][1173110][DOCA][INF][gpunetio_rdma_client_server_write_main.c:294][main] Sample finished successfully

On the other side, assuming the server is at IP address 192.168.2.28, the client output should be as follows:

DOCA RDMA Client side


# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable   $ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write $ ./build/doca_gpunetio_rdma_client_server_write -gpu 17:00.0 -d mlx5_0 -c 192.168.2.28   [16:08:22:335744][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:197][main] Starting the sample ... [16:08:25:753316][160913][DOCA][INF][rdma_common.c:147][oob_connection_client_setup] Connected with server successfully ...... Client waiting on flag 7f6596735000 for server to post RDMA Recvs Thread 0 post rdma write imm 0 Thread 1 post rdma write imm 0 Client waiting on flag 7f6596735001 for server to post RDMA Recvs Thread 0 post rdma send imm 1 Thread 1 post rdma send imm 1 Client waiting on flag 7f6596735002 for server to post RDMA Recvs Thread 0 post rdma write imm 2 Thread 1 post rdma write imm 2 Client waiting on flag 7f6596735003 for server to post RDMA Recvs Thread 0 post rdma send imm 3 Thread 1 post rdma send imm 3 [16:08:25:853454][160913][DOCA][INF][gpunetio_rdma_client_server_write_main.c:241][main] Sample finished successfully

Note

With RDMA, the network device must be specified by name (e.g., mlx5_0) instead of the PCIe address (as is the case for Ethernet).

It is also possible to enable the RDMA CM mode, establishing two connections with the same RDMA GPU handler. An example on the client side:

DOCA RDMA Client side with CM


# Ensure DOCA and DPDK are in the LD_LIBRARY_PATH environment variable   $ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_rdma_client_server_write $ ./build/samples/doca_gpunetio_rdma_client_server_write -d mlx5_0 -gpu 17:00.0 -gid 3 -c 10.137.189.28 -cm --server-addr-type ipv4 --server-addr 192.168.2.28   [11:30:34:489781][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:461][main] Starting the sample ... [11:30:35:038828][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:950][rdma_write_client] Client is waiting for a connection establishment [11:30:35:082039][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:963][rdma_write_client] Client - Connection 1 is established ... [11:30:35:095282][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1006][rdma_write_client] Establishing connection 2.. [11:30:35:097521][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1016][rdma_write_client] Client is waiting for a connection establishment [11:30:35:102718][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1029][rdma_write_client] Client - Connection 2 is established [11:30:35:102783][3853018][DOCA][INF][gpunetio_rdma_client_server_write_sample.c:1046][rdma_write_client] Client, terminate kernels Client waiting on flag 7f16067b5000 for server to post RDMA Recvs Thread 0 post rdma write imm 0 Thread 1 post rdma write imm 1 Client waiting on flag 7f16067b5001 for server to post RDMA Recvs Thread 0 post rdma send imm 1 Thread 1 post rdma send imm 2 Client waiting on flag 7f16067b5002 for server to post RDMA Recvs Thread 0 post rdma write imm 2 Thread 1 post rdma write imm 3 Client waiting on flag 7f16067b5003 for server to post RDMA Recvs Thread 0 post rdma send imm 3 Thread 1 post rdma send imm 4 Client posted and completed 4 RDMA commits on connection 0. Waiting on the exit flag. Client waiting on flag 7f16067b5000 for server to post RDMA Recvs Thread 0 post rdma write imm 0 Thread 1 post rdma write imm 1 Client waiting on flag 7f16067b5001 for server to post RDMA Recvs Thread 0 post rdma send imm 1 Thread 1 post rdma send imm 2 Client waiting on flag 7f16067b5002 for server to post RDMA Recvs Thread 0 post rdma write imm 2 Thread 1 post rdma write imm 3 Client waiting on flag 7f16067b5003 for server to post RDMA Recvs Thread 0 post rdma send imm 3 Thread 1 post rdma send imm 4 Client posted and completed 4 RDMA commits on connection 1. Waiting on the exit flag. [11:30:35:122448][3853018][DOCA][INF][gpunetio_rdma_client_server_write_main.c:512][main] Sample finished successfully

In case of RDMA CM, the command option -cm must be specified on the server side.

Warning

Printing from a CUDA kernel is not recommended for performance. It may make sense for debugging purposes and for simple samples like this one.


Verbs Samples

The doca_gpunetio_verbs_* examples demonstrate how to use the GPUNetIO Verbs API in various scenarios. These samples require a client-server setup, with the client needing the -c <server IP> parameter.

The following parameters are supported by all Verbs samples:

  • -n: Network card handler type (0: AUTO, 1: CPU Proxy, 2: GPU DB). Default is 0 (AUTO).

  • -e: Execution mode for shared QP (0: per-thread, 1: per-warp). Default is 0 (per-thread).

  • -d: Network card device name.

  • -g: GPU device PCIe address.

  • -gid: GID index for DOCA RDMA (optional).

  • -i: Number of iterations (optional).

  • -t: Number of CUDA threads (optional).

For the examples in this guide, it is assumed that the GPU PCIe address is 8A:00.0 and the network card device name is mlx5_0.

Note

If the samples are run over a RoCE connection (e.g., the ConnectX/BlueField is set in Ethernet mode instead of InfiniBand mode), you may get this (or a similar) error: FW failed to modify object, status=BAD_PARAM_ERR (0x3), syndrome=0x1f3b5d.

To fix it, remove the call to the doca_verbs_ah_attr_set_dlid function from the connect_verbs_qp function in the samples/doca_gpunetio/verbs_common.c file.

Bandwidth Samples

The samples ending with _bw measure the bandwidth of specific GPUNetIO Verbs API functions. In these samples, the client prepares and sends data from a CUDA kernel, while the server waits on the CPU to receive the data, validates it upon receiving a Ctrl+C signal, and reports the outcome.

Key Characteristics

  • The client outputs the MB/s achieved for preparing and sending messages of various sizes.

  • The server outputs a message indicating the execution outcome.

  • All bandwidth samples support the previously listed command-line parameters.

Simplifying QP/CQ/UAR Creation

To simplify the integration of DOCA Verbs with GPUNetIO, high-level functions like doca_gpu_verbs_create_qp_hl() and doca_gpu_verbs_create_qp_group_hl() are provided in samples/doca_gpunetio/verbs_high_level.cpp. These functions encapsulate the necessary steps for creating QP/CQ/UAR, making it easier for developers to combine DOCA Verbs and GPUNetIO in their applications.

doca_gpunetio_verbs_write_bw

The doca_gpunetio_verbs_write_bw test measures the bandwidth of RDMA Write operations using the GPUNetIO Verbs API. It launches a CUDA kernel with 1 block and 512 threads by default, where all threads post RDMA Write WQEs in different positions and the last thread submits them. The test then polls the CQE corresponding to the last WQE to ensure all previous WQEs have been executed correctly.

Example command lines:

  • Server:


    doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_write_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_bw

The doca_gpunetio_verbs_put_bw test measures the bandwidth of RDMA Write operations (referred to as "Put") using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads. The test can also measure individual function latencies by setting the KERNEL_DEBUG_TIMES macro to 1.

Example command lines:

  • Server:


    doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_put_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63 

doca_gpunetio_verbs_put_signal_bw

The doca_gpunetio_verbs_put_signal_bw test measures the bandwidth of Put + Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.

Example command lines:

  • Server:


    doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_put_signal_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63 

Note

The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:


do {
	final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
	doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while ((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));


doca_gpunetio_verbs_put_counter_bw

The doca_gpunetio_verbs_put_counter_bw test measures the bandwidth of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It launches a CUDA kernel with 2 blocks, each containing 256 threads.

Example command lines:

  • Server:


    doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_put_counter_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63 

Note

The sample works correctly, but there is a minor mistake in the CUDA kernel loop. The corrected loop should be:


do {
	final_val = doca_gpu_dev_verbs_atomic_read<uint64_t, DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU>(&prev_flag_buf[tidx]);
	doca_gpu_dev_verbs_fence_acquire<DOCA_GPUNETIO_VERBS_SYNC_SCOPE_SYS>();
} while ((final_val != (iter_thread - 1)) && (final_val != ((iter_thread * 2) - 1)));


doca_gpunetio_verbs_twosided_bw

The doca_gpunetio_verbs_twosided_bw test measures the bandwidth of client-server data exchange via Send/Recv operations using the GPUNetIO Verbs API with the shared QP feature. It launches a CUDA kernel with 2 blocks, each containing 256 threads.

Example command lines:

  • Server:


    doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_twosided_bw -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

Latency Samples

The samples ending with _lat measure the latency of specific GPUNetIO Verbs API functions by performing a ping-pong exchange between client and server. These tests launch a CUDA kernel with a single CUDA thread and do not support the -e and -t command-line options.

Both client and server output the round-trip time (RTT) latency (half and full) in microseconds for preparing and exchanging messages of different sizes. The server also outputs a message indicating the execution outcome.

doca_gpunetio_verbs_write_lat

The doca_gpunetio_verbs_write_lat test measures the latency of RDMA Write operations using the GPUNetIO Verbs API without the shared QP feature. It is similar to perftest.

Example command lines:

  • Server:


    doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_write_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_signal_lat

The doca_gpunetio_verbs_put_signal_lat test measures the latency of Put and Signal operations (RDMA Write + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It uses high-level API functions from doca_gpunetio_dev_verbs_onesided.cuh and launches a CUDA kernel with a single CUDA thread.

Example command lines:

  • Server:


    doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_put_signal_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

doca_gpunetio_verbs_put_counter_lat

The doca_gpunetio_verbs_put_counter_lat test measures the latency of Put + Counter operations (RDMA Write + Wait WQE + RDMA Atomic with shared QP) using the GPUNetIO Verbs API. It utilizes high-level API functions from doca_gpunetio_dev_verbs_counter.cuh and the Core Direct counter feature, even with a single CUDA thread.

Example command lines:

  • Server:


    doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0

  • Client (additional command-line options can be added):


    doca_gpunetio_verbs_put_counter_lat -g 8A:00.0 -d mlx5_0 -c 192.168.1.63

GPU DMA Copy

This sample shows how to use the DOCA DMA and DOCA GPUNetIO libraries to DMA-copy a memory buffer from the CPU to the GPU (with DOCA DMA CPU functions) and from the GPU to the CPU (with DOCA GPUNetIO DMA device functions invoked from a CUDA kernel). This sample requires a DPU, as it uses the DPU's DMA engine.

DOCA GPU DMA memcpy


$ cd /opt/mellanox/doca/samples/doca_gpunetio/gpunetio_dma_memcpy   # Build the sample and then execute   $ ./build/doca_gpunetio_dma_memcpy -g 17:00.0 -n ca:00.0 [15:44:04:189462][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:164][main] Starting the sample EAL: Detected CPU lcores: 64 EAL: Detected NUMA nodes: 2 EAL: Detected shared linkage of DPDK EAL: Selected IOVA mode 'VA' EAL: No free 2048 kB hugepages reported on node 0 EAL: No free 2048 kB hugepages reported on node 1 EAL: VFIO support initialized TELEMETRY: No legacy callbacks, legacy socket not created EAL: Probe PCI driver: gpu_cuda (10de:2331) device: 0000:17:00.0 (socket 0) [15:44:04:857251][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:211][init_sample_mem_objs] The CPU source buffer value to be copied to GPU memory: This is a sample piece of text from CPU [15:44:04:857359][862197][DOCA][WRN][doca_mmap.cpp:1743][doca_mmap_set_memrange] Mmap 0x55aec6206140: Memory range isn't cache-line aligned - addr=0x55aec52ceb10. For best performance align address to 64B [15:44:04:858839][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:158][init_sample_mem_objs] The GPU source buffer value to be copied to CPU memory: This is a sample piece of text from GPU [15:44:04:921702][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:570][submit_dma_memcpy_task] Success, DMA memcpy job done successfully CUDA KERNEL INFO: The GPU destination buffer value after the memcpy: This is a sample piece of text from CPU CPU received message from GPU: This is a sample piece of text from GPU [15:44:04:930087][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:364][gpu_dma_cleanup] Cleanup DMA ctx with GPU data path [15:44:04:932658][862197][DOCA][INF][gpunetio_dma_memcpy_sample.c:404][gpu_dma_cleanup] Cleanup DMA ctx with CPU data path [15:44:04:954156][862197][DOCA][INF][gpunetio_dma_memcpy_main.c:197][main] Sample finished successfully


© Copyright 2025, NVIDIA. Last updated on Aug 25, 2025.