DOCA Documentation v3.3.0

GPUNetIO Architecture and Design

A DOCA GPUNetIO network application is split into two fundamental phases:

  • Configuration Phase (CPU): The CPU handles all initial setup, such as device configuration, memory allocation, and launching CUDA kernels.

  • Data Path Phase (GPU): The GPU and NIC interact directly to execute high-speed packet processing functions.

DOCA GPUNetIO provides the building blocks to create a full data path pipeline that runs entirely on the GPU, often in combination with other libraries such as DOCA Ethernet, DOCA RDMA Verbs, DOCA RDMA, or DOCA DMA.

Setup and Component Model

During the setup phase, the CPU-based application must:

  1. Prepare all required objects (e.g., queues, contexts) on the CPU.

  2. Export a GPU-specific handle for these objects.

  3. Launch a CUDA kernel, passing the object's GPU handle to it so the kernel can work with the object during the data path phase.

This "CPU-setup, GPU-run" model is why DOCA GPUNetIO is composed of several distinct components:

  • libdoca_gpunetio.so (CPU Control Path) A shared library containing control-path functions. The CPU application uses these to prepare the GPU, allocate memory, and configure objects.

  • libdoca_gpunetio_device.a (GPU Data Path – Static Library) A static library containing data-path functions for GPUNetIO RDMA, GPUNetIO DMA, and GPUNetIO CommCh. These functions are invoked by the GPU from within a CUDA kernel.

  • doca_gpunetio_dev_*.cuh (GPU Data Path – Headers) A set of header files providing inline data-path functions for GPUNetIO Ethernet and GPUNetIO Verbs. These are compiled directly into the application's CUDA kernels.

The following diagram presents the typical flow:

[Figure: typical CPU configuration and GPU data path flow]

Library Linking

  • The pkgconfig file for the CPU shared library is doca-gpunetio.pc.

  • There is no pkgconfig file for the GPU static library (libdoca_gpunetio_device.a). If your application requires these CUDA device functions, you must explicitly link this library.

    • Default path: /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio_device.a
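For a CUDA application built with nvcc, the linking described above could look like the following sketch. The source and output file names are hypothetical; the static library path is the default install location listed above:

```shell
# CPU control path: compile/link flags come from pkg-config
pkg-config --cflags --libs doca-gpunetio

# GPU data path: libdoca_gpunetio_device.a has no pkg-config file,
# so it must be linked explicitly (default install path shown)
nvcc app.cu $(pkg-config --cflags --libs doca-gpunetio) \
    /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio_device.a \
    -o app
```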

DOCA GPUNetIO provides GPU GDAKI (GPUDirect Async Kernel-Initiated) functions to control objects for various transports and protocols that were created using other DOCA libraries. This section explains the relationship between DOCA GPUNetIO and these other libraries.

Ethernet GDAKI Communications

To enable GPU-accelerated communications over the Ethernet transport, an application must use a combination of three DOCA libraries:

  • DOCA GPUNetIO: For GPU-specific handles and data path functions.

  • DOCA Ethernet: To create and manage the underlying TX/RX queues.

  • DOCA Flow: To steer packets to the correct GPU-managed queues.

Control Path Phase: Initial CPU Configuration

Before any data path operations can occur on the GPU, the CPU must first configure all the necessary resources.

  1. Create a DOCA Core device handler for the network card.

  2. Create a DOCA GPUNetIO device handler for the GPU.

  3. Use the DOCA Ethernet library to:

    • Create the required Send Queues (TXQ) and/or Receive Queues (RXQ).

    • Set the data path for these queue handlers to the GPU.

    • Export a GPU-specific handle that represents these queues.

  4. Use the DOCA Flow library to create and install flow steering rules that direct the desired types of packets to the newly created DOCA Ethernet receive queues.
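The configuration steps above can be sketched as follows. This is an illustrative outline only, not a complete program: error handling is elided, the helper open_doca_device_with_pci and the constants are placeholders, and the exact DOCA function signatures should be checked against the DOCA Ethernet and GPUNetIO API references.

```c
/* Illustrative CPU-side setup sketch; signatures are approximate. */
struct doca_dev *ddev;             /* 1. NIC device handler (DOCA Core)      */
struct doca_gpu *gdev;             /* 2. GPU device handler (DOCA GPUNetIO)  */
struct doca_eth_rxq *rxq;          /* 3. Ethernet receive queue              */
struct doca_gpu_eth_rxq *rxq_gpu;  /*    GPU-specific handle for the kernel  */

open_doca_device_with_pci(nic_pcie_addr, &ddev);   /* helper, assumed */
doca_gpu_create(gpu_pcie_addr, &gdev);

doca_eth_rxq_create(ddev, MAX_BURST, MAX_PKT_SIZE, &rxq);
doca_ctx_set_datapath_on_gpu(doca_eth_rxq_as_doca_ctx(rxq), gdev);
doca_ctx_start(doca_eth_rxq_as_doca_ctx(rxq));
doca_eth_rxq_get_gpu_handle(rxq, &rxq_gpu);

/* 4. DOCA Flow: steer matching packets to this receive queue (not shown) */
```

The rxq_gpu handle obtained at the end is the argument later passed to the CUDA kernel in the data path phase.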

Data Path Phase: GPU Kernel Execution

After the configuration phase is complete, the application can launch a CUDA Kernel, passing the GPU handles for the Ethernet queues as input arguments. This allows DOCA GPUNetIO CUDA device functions to operate directly on the queues from within the kernel.

All GPUNetIO Ethernet CUDA device functions are provided as inlined functions in the following header files:

  • doca_gpunetio_dev_eth_rxq.cuh

  • doca_gpunetio_dev_eth_txq.cuh

These functions are provided in two distinct APIs:

  • Low-level API: Provides fine-grained control over fundamental mlx5 elements, such as posting Work Queue Entries (WQEs), ringing the network card's doorbell, and polling for Completion Queue Entries (CQEs).

  • High-level API: Provides more complex, pre-packaged functions that implement advanced features:

    • Shared Send QP: Allows a single Send Queue to be safely accessed concurrently by different CUDA threads, warps, or blocks.

    • Cooperative Receive QP: Allows a single thread, all threads in a warp, or all threads in a block to cooperate for parallel packet reception from a single Receive Queue.

    • Memory Consistency (MCST): A feature for pre-Hopper GPUs to manage memory mappings on the receive side.

Both APIs support CPU proxy mode, a fallback mechanism for systems where direct doorbell ringing from the GPU is not possible.
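A cooperative receive using the high-level API might be sketched as below. This is illustrative only: the receive function name and signature are approximate, and keep_running, process_packet, and the constants are placeholders for application-specific logic.

```cpp
/* Illustrative receive loop inside a CUDA kernel (signatures approximate). */
__global__ void receive_kernel(struct doca_gpu_eth_rxq *rxq)
{
    __shared__ uint32_t rx_pkt_num;
    __shared__ uint64_t rx_buf_idx;

    while (keep_running) {   /* exit flag, assumed set elsewhere */
        /* All threads in the block cooperate to receive one burst */
        doca_gpu_dev_eth_rxq_receive_block(rxq, MAX_RX_PKTS, TIMEOUT_NS,
                                           &rx_pkt_num, &rx_buf_idx);
        __syncthreads();

        /* Each thread processes a subset of the received packets */
        for (uint32_t i = threadIdx.x; i < rx_pkt_num; i += blockDim.x)
            process_packet(rxq, rx_buf_idx + i);   /* app-specific, assumed */
        __syncthreads();
    }
}
```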

Info

For examples of how to use both the high-level and low-level GPUNetIO Ethernet APIs, refer to the "GPUNetIO Sample Guide".


Example Use Cases and Further Reading

Refer to the DOCA GPU Packet Processing Application Guide (doca_gpu_packet_processing) and samples (doca_gpunetio_simple_send, doca_gpunetio_simple_receive, doca_gpunetio_send_wait_time) for examples of Ethernet GPU communications.

Tip

For a deeper understanding of the underlying Ethernet send and receive structures, objects, and functions, refer to the DOCA Ethernet library documentation.

An example diagram when multiple queues and/or semaphores are used to receive Ethernet traffic:

[Figure: multiple receive queues and semaphores dispatching packets to a separate processing CUDA kernel]

Dispatching received packets to another CUDA kernel is not required. In a simpler scenario, a single CUDA kernel both receives and processes packets:

[Figure: a single CUDA kernel receiving and processing packets]

RDMA Verbs GDAKI Communications (IBGDA)

DOCA GPUNetIO provides GPU data path functions for objects created with the DOCA RDMA and DOCA RDMA Verbs libraries. This enables GPU communications over RDMA transport protocols (IB or RoCE).

DOCA GPUNetIO and DOCA RDMA

This approach uses the high-level DOCA RDMA library, which abstracts most low-level mlx5 and IBVerbs details. The GPUNetIO CUDA data path functions expose a similarly high-level API.

Key characteristics:

  • Provides a high-level API for generic RDMA operations (Write, Send, Read, Recv).

  • Delivered as a closed-source CUDA static library (libdoca_gpunetio_device.a).

  • Does not include built-in shared queue management. Applications must manually manage simultaneous access to queues from different CUDA threads.

  • Best suited for simpler GDAKI applications performing basic RDMA operations, as it requires less deep knowledge of IBVerbs or mlx5 details.

Weak vs. Strong Operation Modes

Some RDMA GPU functions offer two operation modes:

  • Weak Mode: The application is responsible for calculating the next available position in the queue.

    • Helper functions (e.g., doca_gpu_rdma_get_info) provide the next available position and queue size mask (for index wrapping).

    • The developer must specify the exact queue descriptor number, ensuring no descriptors are skipped.

    • More complex, but offers better performance and allows developers to optimize for GPU memory coalescing.

  • Strong Mode: The GPU function automatically enqueues the RDMA operation in the next available position.

    • Simpler to manage, as the developer does not need to track the position.

    • May introduce extra latency due to atomic operations. It also does not guarantee that sequential operations use sequential memory locations.

      Note

      All strong mode functions operate at the CUDA block level. It is not possible to access the same RDMA queue from two different CUDA blocks simultaneously.

Configuration and Usage

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA RDMA to:

    • Create send and/or receive queue handlers.

    • Set the queue handlers' data path to the GPU.

    • Export a GPU handler representing those queues.

After configuration, launch a CUDA kernel, passing the GPU handlers for the RDMA queues as input arguments. Use the functions defined in doca_gpunetio_dev_rdma.cuh (starting with doca_gpu_dev_rdma_*) for RDMA communications in the kernel.
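A strong-mode RDMA Write from within a kernel might be sketched as below. This is an illustrative fragment only: the function names follow the doca_gpu_dev_rdma_* convention, but exact signatures are approximate and should be checked against doca_gpunetio_dev_rdma.cuh.

```cpp
/* Illustrative strong-mode RDMA Write (signatures approximate). */
__global__ void rdma_write_kernel(struct doca_gpu_dev_rdma *rdma,
                                  struct doca_gpu_buf_arr *local_buf_arr,
                                  struct doca_gpu_buf_arr *remote_buf_arr,
                                  size_t msg_size)
{
    struct doca_gpu_buf *lbuf, *rbuf;

    doca_gpu_dev_buf_get_buf(local_buf_arr, threadIdx.x, &lbuf);
    doca_gpu_dev_buf_get_buf(remote_buf_arr, threadIdx.x, &rbuf);

    /* Strong mode: the library atomically picks the next queue slot,
     * so all threads must belong to the same CUDA block */
    doca_gpu_dev_rdma_write_strong(rdma, rbuf, 0, lbuf, 0, msg_size, 0);
    __syncthreads();

    /* One thread commits the posted operations and waits for completion */
    if (threadIdx.x == 0) {
        doca_gpu_dev_rdma_commit_strong(rdma);
        doca_gpu_dev_rdma_wait_all(rdma, NULL);
    }
}
```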

Example Use Cases

Refer to the doca_gpunetio_rdma_client_server_write sample for examples of GPUNetIO RDMA functions.

Tip

For a deeper understanding of RDMA operations, refer to the DOCA RDMA documentation.

DOCA GPUNetIO and DOCA Verbs

This approach uses the lower-level DOCA RDMA Verbs library. The GPUNetIO Verbs CUDA data path functions are provided as inlined functions in the doca_gpunetio_dev_verbs_*.cuh header files.

These functions are offered as two different APIs:

  • Low-level API: For direct manipulation of fundamental RDMA mlx5 elements, such as posting Work Queue Entries (WQEs), ringing doorbells, and polling Completion Queues (CQEs). This supports both one-sided (Read, Write, Atomic) and two-sided (Send, Recv) operations.

  • High-level API: More complex helper functions that implement common patterns:

    • Shared QP: Allows a single QP to be safely accessed concurrently by different CUDA threads or warps.

    • Combined Operations: Building blocks for concatenating multiple operations (e.g., put_signal, which combines an RDMA Write and an Atomic Fetch-and-Add).

    • Memory Consistency (MCST): A feature for pre-Hopper GPUs to manage memory mappings on the RDMA Get or Receive side.

    • ConnectX-8 reliable doorbell: removes the need to update the doorbell record (DBREC).

Both APIs support CPU proxy mode, a fallback mechanism for systems where direct doorbell ringing from the GPU is not possible. The samples/doca_gpunetio/verbs_high_level.cpp file provides helper functions (e.g., doca_gpu_verbs_create_qp_hl()) that simplify the CPU-side setup for these Verbs QPs.

Warning

The GPUNetIO Verbs APIs are currently experimental. Please report any issues encountered to help improve code quality and robustness.

Configuration and Usage

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA RDMA Verbs to:

    • Create send and/or receive queue handlers.

    • Set the queue handlers' data path to the GPU.

    • Export a GPU handler representing those queues.

After configuration, launch a CUDA Kernel, passing the GPU handlers for the Verbs queues as input arguments.

Example Use Cases

Refer to samples doca_gpunetio_verbs_* for examples of GPUNetIO Verbs functions.

Tip

For a deeper understanding of Verbs operations, refer to the DOCA RDMA Verbs documentation.

DMA GDAKI Memory Copies

To enable GPU-triggered memory copies using the DMA engine, an application requires the DOCA GPUNetIO and DOCA DMA libraries.

Initial CPU Configuration Phase

  1. Create a device handler for the network card using DOCA Core.

  2. Create a GPU device handler for the GPU card using DOCA GPUNetIO.

  3. Use DOCA DMA to:

    • Create DMA queue handlers.

    • Set the queue handlers' data path to the GPU.

    • Export a GPU handler representing those queues.

Data Path Phase on GPU

After completing the configuration phase, launch a CUDA kernel, passing the GPU handlers for the DMA queues as input arguments. This allows DOCA GPUNetIO CUDA device functions to operate on the queues from within the kernel.

For DMA memory copies, use functions defined in doca_gpunetio_dev_dma.cuh.
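A GPU-triggered copy might be sketched as below. This is illustrative only: the function names follow the doca_gpu_dev_dma_* convention from the header above, but exact signatures are approximate and should be verified against doca_gpunetio_dev_dma.cuh.

```cpp
/* Illustrative GPU-triggered DMA copy (signatures approximate). */
__global__ void dma_memcpy_kernel(struct doca_gpu_dma *dma,
                                  struct doca_gpu_buf *src,
                                  struct doca_gpu_buf *dst,
                                  size_t nbytes)
{
    if (threadIdx.x == 0) {
        /* Enqueue a copy of nbytes from src to dst (offsets 0) */
        doca_gpu_dev_dma_memcpy(dma, src, 0, dst, 0, nbytes);
        /* Ring the DMA engine doorbell to execute the posted copies */
        doca_gpu_dev_dma_commit(dma);
    }
}
```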

Example Use Case

Refer to the sample doca_gpunetio_dma_memcpy for an example of triggering DMA memory copies from a CUDA Kernel.

Tip

For a deeper understanding of DMA operations, refer to the DOCA DMA documentation.

© Copyright 2026, NVIDIA. Last updated on Feb 27, 2026