Date: 2026-03-02

DOCA GPUNetIO provides GPU GDAKI (GPUDirect Async Kernel-Initiated) functions to control objects for various transports and protocols that were created using other DOCA libraries. This section explains how DOCA GPUNetIO relates to these other libraries.

To enable GPU-accelerated communications over the Ethernet transport, an application must use a combination of three DOCA libraries:

DOCA GPUNetIO: For GPU-specific handles and data path functions.

DOCA Ethernet: To create and manage the underlying TX/RX queues.

DOCA Flow: To steer packets to the correct GPU-managed queues.

Before any data path operations can occur on the GPU, the CPU must first configure all the necessary resources.

Create a DOCA Core device handler for the network card.

Create a DOCA GPUNetIO device handler for the GPU.

Use the DOCA Ethernet library to:

Create the required send queues (TXQ) and/or receive queues (RXQ).

Set the data path for these queue handlers to the GPU.

Export a GPU-specific handle that represents these queues.

Use the DOCA Flow library to create and install flow steering rules that direct the desired types of packets to the newly created DOCA Ethernet receive queues.

After the configuration phase is complete, the application can launch a CUDA kernel, passing the GPU handles for the Ethernet queues as input arguments. The DOCA GPUNetIO CUDA device functions can then operate directly on the queues from within the kernel.
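The ordering of the CPU-side steps above can be illustrated with a small host-only model. This is a minimal sketch, not DOCA code: MockQueue, set_datapath_on_gpu, export_gpu_handle, and install_flow_rule are hypothetical names invented for illustration, and the rule that a flow rule targets an already exported queue simply mirrors the order in which the steps are listed.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an Ethernet queue; none of these names are DOCA APIs.
class MockQueue {
public:
    enum class State { Created, DatapathOnGpu, Exported };

    // CPU-side step: route the queue's data path to the GPU.
    void set_datapath_on_gpu() {
        if (state_ != State::Created)
            throw std::logic_error("data path already configured");
        state_ = State::DatapathOnGpu;
    }

    // CPU-side step: export the handle that a CUDA kernel would receive.
    // Modeled as an integer here; this is an assumption of the sketch.
    int export_gpu_handle() {
        if (state_ == State::Created)
            throw std::logic_error("set the GPU data path before exporting");
        state_ = State::Exported;
        return handle_;
    }

    State state() const { return state_; }

private:
    State state_ = State::Created;
    int handle_ = 1;
};

// Stand-in for installing a steering rule toward a receive queue.
// Succeeds only once the queue has been fully configured and exported.
bool install_flow_rule(const MockQueue& rxq) {
    return rxq.state() == MockQueue::State::Exported;
}
```

In this model, calling export_gpu_handle() on a freshly created queue throws, while the sequence set_datapath_on_gpu(), export_gpu_handle(), install_flow_rule() succeeds. Only after that point would the handle be passed to a kernel.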

All GPUNetIO Ethernet CUDA device functions are provided as inlined functions in the following header files:

doca_gpunetio_dev_eth_rxq.cuh

doca_gpunetio_dev_eth_txq.cuh

These functions are provided in two distinct APIs:

Low-level API: Provides fine-grained control over fundamental mlx5 elements, such as posting Work Queue Entries (WQEs), ringing the network card's doorbell, and polling for Completion Queue Entries (CQEs).

High-level API: Provides more complex, pre-packaged functions that implement advanced features:

Shared Send QP: Allows a single Send Queue to be safely accessed concurrently by different CUDA threads, warps, or blocks.

Cooperative Receive QP: Allows a single thread, all threads in a warp, or all threads in a block to cooperate for parallel packet reception from a single Receive Queue.

Memory Consistency (MCST): A feature for pre-Hopper GPUs to manage memory mappings on the receive side.



Both APIs support CPU proxy mode, a fallback mechanism for systems where the GPU cannot ring the network card's doorbell directly.
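The three low-level steps (post a WQE, ring the doorbell, poll for a CQE) can be illustrated with a host-only model of a send queue. This is a minimal sketch under stated assumptions: MockSendQueue and all of its members are hypothetical and do not correspond to GPUNetIO functions or mlx5 structures, and the simulated NIC is reduced to a function that consumes every WQE published by the doorbell.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor and completion types, for illustration only.
struct MockWqe { uint64_t addr; uint32_t len; };
struct MockCqe { uint32_t wqe_slot; };

class MockSendQueue {
public:
    // size_pow2 must be a power of two so that (index & mask_) wraps correctly.
    explicit MockSendQueue(uint32_t size_pow2)
        : ring_(size_pow2), mask_(size_pow2 - 1) {}

    // Step 1: write a descriptor into the ring. The NIC does not see it yet.
    bool post_wqe(const MockWqe& wqe) {
        if (posted_ - completed_ == ring_.size()) return false;  // ring is full
        ring_[posted_ & mask_] = wqe;
        ++posted_;
        return true;
    }

    // Step 2: publish the producer index so the NIC may start processing.
    void ring_doorbell() { doorbell_ = posted_; }

    // Simulated NIC: executes every WQE up to the doorbell and emits one CQE each.
    void nic_process() {
        while (hw_index_ != doorbell_) {
            cq_.push_back(MockCqe{hw_index_ & mask_});
            ++hw_index_;
        }
    }

    // Step 3: poll for a completion. Returns false if none is available.
    bool poll_cqe(MockCqe* out) {
        if (cq_head_ == cq_.size()) return false;
        *out = cq_[cq_head_++];
        ++completed_;  // frees one ring slot for reuse
        return true;
    }

private:
    std::vector<MockWqe> ring_;
    uint32_t mask_;
    uint32_t posted_ = 0;     // WQEs written by the application
    uint32_t doorbell_ = 0;   // producer index visible to the NIC
    uint32_t hw_index_ = 0;   // WQEs consumed by the simulated NIC
    uint32_t completed_ = 0;  // completions reaped by the application
    std::vector<MockCqe> cq_;
    std::size_t cq_head_ = 0;
};
```

The model makes one point visible: posted WQEs produce no completions until ring_doorbell() is called. In CPU proxy mode, that second step is the one the GPU cannot perform itself; how the request is handed to the CPU is not modeled here.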

Info: For examples of how to use both the high-level and low-level GPUNetIO Ethernet APIs, refer to the "GPUNetIO Sample Guide".





Refer to the DOCA GPU Packet Processing Application Guide (doca_gpu_packet_processing) and the samples (doca_gpunetio_simple_send, doca_gpunetio_simple_receive, doca_gpunetio_send_wait_time) for examples of Ethernet GPU communications.

Tip: For a deeper understanding of the underlying Ethernet send and receive structures, objects, and functions, refer to the DOCA Ethernet library documentation.

[Figure: Example layout in which multiple queues and/or semaphores are used to receive Ethernet traffic.]

Receiving packets and dispatching them to another CUDA kernel is not required. In a simpler scenario, a single CUDA kernel both receives and processes the packets.

DOCA GPUNetIO provides GPU data path functions for objects created with the DOCA RDMA and DOCA RDMA Verbs libraries. This enables GPU communications over RDMA transport protocols (IB or RoCE).

The first option uses the high-level DOCA RDMA library, which abstracts most low-level mlx5 and IBVerbs details. The corresponding GPUNetIO CUDA data path functions follow a similarly high-level API.

Key characteristics:

Provides a high-level API for generic RDMA operations (Write, Send, Read, Recv).

Delivered as a closed-source CUDA static library (libdoca_gpunetio_device.a).

Does not include built-in shared queue management. Applications must manually manage simultaneous access to queues from different CUDA threads.

Best suited for simpler GDAKI applications performing basic RDMA operations, as it requires less in-depth knowledge of IBVerbs or mlx5 details.

Some RDMA GPU functions offer two operation modes:

Weak Mode: The application is responsible for calculating the next available position in the queue.

Helper functions (e.g., doca_gpu_rdma_get_info) provide the next available position and the queue size mask (for index wrapping).

The developer must specify the exact queue descriptor number, ensuring no descriptors are skipped.

This mode is more complex, but it offers better performance and allows developers to optimize for GPU memory coalescing.

Strong Mode: The GPU function automatically enqueues the RDMA operation in the next available position.

This mode is simpler to manage, as the developer does not need to track the position.

It may introduce extra latency due to atomic operations, and it does not guarantee that sequential operations use sequential memory locations.

Note: All strong mode functions operate at the CUDA block level. It is not possible to access the same RDMA queue from two different CUDA blocks simultaneously.
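The difference between the two modes comes down to who computes the descriptor position. The following host-only model is a minimal sketch of that idea and assumes a power-of-two queue size. All names (MockRdmaQueue, mock_get_info, post_weak, post_strong, post_weak_batch) are hypothetical and are not GPUNetIO functions, and the loop in post_weak_batch only stands in for CUDA threads that would run in parallel.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical queue model: each slot records the id of the operation stored in it.
struct MockRdmaQueue {
    explicit MockRdmaQueue(uint32_t size_pow2)
        : slots(size_pow2, -1), mask(size_pow2 - 1), next(0) {}
    std::vector<int> slots;
    uint32_t mask;               // size - 1, used for index wrapping
    std::atomic<uint32_t> next;  // next available position
};

struct MockQueueInfo { uint32_t next_position; uint32_t mask; };

// Stand-in for a "get info" helper: reports the next position and the mask.
MockQueueInfo mock_get_info(const MockRdmaQueue& q) {
    return MockQueueInfo{q.next.load(), q.mask};
}

// Weak mode: the caller chooses the position; no atomic is needed per operation.
void post_weak(MockRdmaQueue& q, uint32_t position, int op_id) {
    q.slots[position & q.mask] = op_id;
}

// Strong mode: the function reserves the position with an atomic increment.
uint32_t post_strong(MockRdmaQueue& q, int op_id) {
    uint32_t position = q.next.fetch_add(1, std::memory_order_relaxed);
    q.slots[position & q.mask] = op_id;
    return position;
}

// Weak-mode pattern: "thread" tid uses position base + tid, so consecutive
// threads fill consecutive descriptors and none are skipped.
void post_weak_batch(MockRdmaQueue& q, uint32_t base, uint32_t num_threads) {
    for (uint32_t tid = 0; tid < num_threads; ++tid)
        post_weak(q, base + tid, static_cast<int>(tid));
    q.next.store(base + num_threads);  // the application advances the position itself
}
```

In weak mode the position arithmetic is plain integer math done by the caller, which is why adjacent threads can be made to touch adjacent descriptors. In strong mode every call pays for an atomic, and the order in which callers obtain positions is not under the developer's control.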



Create a device handler for the network card using DOCA Core.

Create a GPU device handler for the GPU card using DOCA GPUNetIO.

Use DOCA RDMA to:

Create send and/or receive queue handlers.

Set the queue handlers' data path to the GPU.

Export a GPU handler representing those queues.

After configuration, launch a CUDA kernel, passing the GPU handlers for the RDMA queues as input arguments. Use the functions defined in doca_gpunetio_dev_rdma.cuh (starting with doca_gpu_dev_rdma_*) for RDMA communications in the kernel.

Refer to the doca_gpunetio_rdma_client_server_write sample for examples of GPUNetIO RDMA functions.

Tip: For a deeper understanding of RDMA operations, refer to the DOCA RDMA documentation.

The second option uses the lower-level DOCA RDMA Verbs library. The GPUNetIO Verbs CUDA data path functions are provided as inlined functions in the doca_gpunetio_dev_verbs_*.cuh header files.

These functions are offered as two different APIs:

Low-level API: For direct manipulation of fundamental RDMA mlx5 elements, such as posting Work Queue Entries (WQEs), ringing doorbells, and polling for Completion Queue Entries (CQEs). This supports both one-sided (Read, Write, Atomic) and two-sided (Send, Recv) operations.

High-level API: More complex helper functions that implement common patterns:

Shared QP: Allows a single QP to be safely accessed concurrently by different CUDA threads or warps.

Combined Operations: Building blocks for concatenating multiple operations (e.g., put_signal, which combines an RDMA Write and an Atomic Fetch-and-Add).

Memory Consistency (MCST): A feature for pre-Hopper GPUs to manage memory mappings on the RDMA Get or Receive side.

ConnectX-8 reliable doorbell: Removes the need to update the doorbell record (DBREC).
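The put_signal combination can be illustrated with a host-only model in which the "remote" memory is a local buffer. This is a minimal sketch, not the GPUNetIO implementation: MockRemoteMemory, mock_put_signal, and mock_signal_reached are hypothetical names, and the assumption that the signal is used by the receiver to detect that the written data has arrived is the sketch's own reading of the pattern.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of a remote peer: a data buffer plus a signal counter.
struct MockRemoteMemory {
    explicit MockRemoteMemory(std::size_t size) : data(size, 0), signal(0) {}
    std::vector<uint8_t> data;
    std::atomic<uint64_t> signal;
};

// Models put_signal: a write of the payload followed by an atomic fetch-and-add
// on the signal. Returns the signal value observed before the add.
uint64_t mock_put_signal(MockRemoteMemory& remote, std::size_t offset,
                         const void* src, std::size_t len, uint64_t add) {
    assert(offset + len <= remote.data.size());
    std::memcpy(remote.data.data() + offset, src, len);  // stands in for the RDMA Write
    // Release ordering makes the payload visible before the new signal value.
    return remote.signal.fetch_add(add, std::memory_order_release);
}

// Receiver side: checks whether the signal has reached the expected value.
bool mock_signal_reached(const MockRemoteMemory& remote, uint64_t expected) {
    return remote.signal.load(std::memory_order_acquire) >= expected;
}
```

A receiver in this model would check mock_signal_reached() and read the buffer only after it returns true, so a single counter is enough to tell how many writes have completed.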



Both APIs support CPU proxy mode, a fallback mechanism for systems where the GPU cannot ring the doorbell directly. The samples/doca_gpunetio/verbs_high_level.cpp file provides helper functions (e.g., doca_gpu_verbs_create_qp_hl()) that simplify the CPU-side setup for these Verbs QPs.

Warning: The GPUNetIO Verbs APIs are currently experimental. Please report any issues encountered to help improve code quality and robustness.

Create a device handler for the network card using DOCA Core.

Create a GPU device handler for the GPU card using DOCA GPUNetIO.

Use DOCA RDMA Verbs to:

Create send and/or receive queue handlers.

Set the queue handlers' data path to the GPU.

Export a GPU handler representing those queues.

After configuration, launch a CUDA kernel, passing the GPU handlers for the Verbs queues as input arguments.

Refer to the doca_gpunetio_verbs_* samples for examples of GPUNetIO Verbs functions.

Tip: For a deeper understanding of Verbs operations, refer to the DOCA RDMA Verbs documentation.

To enable GPU-triggered memory copies using the DMA engine, an application requires the DOCA GPUNetIO and DOCA DMA libraries.

Create a device handler for the network card using DOCA Core.

Create a GPU device handler for the GPU card using DOCA GPUNetIO.

Use DOCA DMA to:

Create DMA queue handlers.

Set the queue handlers' data path to the GPU.

Export a GPU handler representing those queues.

After completing the configuration phase, launch a CUDA kernel, passing the GPU handlers for the DMA queues as input arguments. This enables DOCA GPUNetIO CUDA device functions to operate within the CUDA kernel.

For DMA memory copies, use the functions defined in doca_gpunetio_dev_dma.cuh.

Refer to the doca_gpunetio_dma_memcpy sample for an example of triggering DMA memory copies from a CUDA kernel.