DOCA Documentation v3.3.0

GPUNetIO API Reference

This section details the specific structures and operations related to the main DOCA GPUNetIO API on CPU and GPU. GPUNetIO headers are:

  • doca_gpunetio.h – CPU functions to create GPU handlers, allocate GPU memory, etc.

  • doca_gpunetio_dev_eth_rxq.cuh – Open source GPU functions to manage a DOCA Ethernet receive queue

  • doca_gpunetio_dev_eth_txq.cuh – Open source GPU functions to manage a DOCA Ethernet send queue

  • doca_gpunetio_dev_verbs_*.cuh – Open source GPU functions to manage a DOCA RDMA Verbs object

  • doca_gpunetio_dev_buf.cuh – GPU functions to manage a DOCA buffer array

  • doca_gpunetio_dev_sem.cuh – GPU functions to manage a DOCA GPUNetIO semaphore

  • doca_gpunetio_dev_rdma.cuh – GPU functions to manage a DOCA RDMA queue

  • doca_gpunetio_dev_dma.cuh – GPU functions to manage a DOCA DMA queue

This section lists the main functions of DOCA GPUNetIO. All DOCA Core, Ethernet, Verbs, RDMA and DMA objects used in combination with GPUNetIO have a GPU export function to obtain a GPU handler for that object.

The following subsections list the DOCA GPUNetIO functions that can be used on the CPU only.

doca_gpu_mem_type

This enum lists all the possible memory types that can be allocated with GPUNetIO.

enum doca_gpu_mem_type {
    DOCA_GPU_MEM_TYPE_GPU = 0,
    DOCA_GPU_MEM_TYPE_GPU_CPU = 1,
    DOCA_GPU_MEM_TYPE_CPU_GPU = 2,
};

Note

Regarding the syntax, the text after the DOCA_GPU_MEM_TYPE_ prefix signifies <where-memory-resides>_<who-has-access>.

  • DOCA_GPU_MEM_TYPE_GPU – memory resides on the GPU and is accessible from the GPU only

  • DOCA_GPU_MEM_TYPE_GPU_CPU – memory resides on the GPU and is accessible also by the CPU

  • DOCA_GPU_MEM_TYPE_CPU_GPU – memory resides on the CPU and is accessible also by the GPU

Typical usage of the DOCA_GPU_MEM_TYPE_GPU_CPU memory type is to send a notification from the CPU to the GPU (e.g., a CUDA kernel periodically checking to see if the exit condition set by the CPU is met).

doca_gpu_create

This is the first function a GPUNetIO application must invoke to create a handler on a GPU device. The function initializes a pointer to a structure in memory of type struct doca_gpu *.

doca_error_t doca_gpu_create(const char *gpu_bus_id, struct doca_gpu **gpu_dev);

  • gpu_bus_id – <PCIe-bus>:<device>.<function> of the GPU device you want to use in your application

  • gpu_dev [out] – GPUNetIO handler to that GPU device

To get the PCIe address, users can use the commands lspci or nvidia-smi.

doca_gpu_mem_alloc

This CPU function allocates different flavors of memory.

doca_error_t doca_gpu_mem_alloc(struct doca_gpu *gpu_dev,
                                size_t size,
                                size_t alignment,
                                enum doca_gpu_mem_type mtype,
                                void **memptr_gpu,
                                void **memptr_cpu);

  • gpu_dev – GPUNetIO device handler

  • size – Size, in bytes, of the memory area to allocate

  • alignment – Memory address alignment to use. If 0, default one will be used

  • mtype – Type of memory to allocate

  • memptr_gpu [out] – GPU pointer used to access that memory from the GPU, if the memory is allocated on or visible to the GPU

  • memptr_cpu [out] – CPU pointer used to access that memory from the CPU, if the memory is allocated on or visible to the CPU. Can be NULL if memory is GPU-only

Warning

Make sure to use the right pointer on the right device! If an application tries to access the memory using the memptr_gpu address from the CPU, a segmentation fault will result.
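Putting doca_gpu_create and doca_gpu_mem_alloc together, the following host-side sketch allocates the CPU-to-GPU exit flag described earlier. This is illustrative only: the PCIe address "ca:00.0" is a placeholder, the doca_error.h header name is an assumption, and error paths are shortened.

```cuda
#include <doca_error.h>     /* assumed header name for doca_error_t */
#include <doca_gpunetio.h>

/* Sketch (not a definitive implementation): create a GPUNetIO handler and
 * allocate a CPU-visible exit flag. "ca:00.0" is a placeholder PCIe address;
 * query the real one with lspci or nvidia-smi. */
static doca_error_t setup_exit_flag(struct doca_gpu **gpu_dev,
                                    uint32_t **flag_gpu, uint32_t **flag_cpu)
{
    doca_error_t status;

    status = doca_gpu_create("ca:00.0", gpu_dev);
    if (status != DOCA_SUCCESS)
        return status;

    /* DOCA_GPU_MEM_TYPE_GPU_CPU: resides on the GPU, also visible to the CPU */
    status = doca_gpu_mem_alloc(*gpu_dev, sizeof(uint32_t), 0 /* default alignment */,
                                DOCA_GPU_MEM_TYPE_GPU_CPU,
                                (void **)flag_gpu, (void **)flag_cpu);
    if (status != DOCA_SUCCESS)
        return status;

    /* CUDA kernels poll **flag_gpu; the CPU later sets **flag_cpu = 1 to stop them */
    **flag_cpu = 0;
    return DOCA_SUCCESS;
}
```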


doca_gpu_semaphore_create

Creates a new instance of a DOCA GPUNetIO semaphore. A semaphore is composed of a list of items, each having, by default, a status flag, a number of packets, and the index of a doca_gpu_buf in a doca_gpu_buf_arr.

For example, a GPUNetIO semaphore can be used in applications where a CUDA kernel is responsible for receiving packets in a doca_gpu_buf_arr array associated with an Ethernet receive queue object, doca_gpu_eth_rxq (see section "doca_gpu_dev_eth_rxq_receive_*"), and dispatching packet info to a second CUDA kernel which processes them.

Another way to use a GPUNetIO semaphore is to exchange data across different entities like two CUDA kernels or a CUDA kernel and a CPU thread. The reason for this scenario may be that the CUDA kernel needs to provide the outcome of the packet processing to the CPU which would in turn compile a statistics report. Therefore, it is possible to associate a custom application-defined structure with each item in the semaphore. This way, the semaphore can be used as a message passing object.


Entities communicating through a semaphore must adopt a poll/update mechanism according to the following logic:

  • Update:

    1. Populate the next item of the semaphore (packets' info and/or custom application-defined info).

    2. Set status flag to READY.

  • Poll:

    1. Wait for the next item to have a status flag equal to READY.

    2. Read and process info.

    3. Set status flag to DONE.

doca_error_t doca_gpu_semaphore_create(struct doca_gpu *gpu_dev, struct doca_gpu_semaphore **semaphore)

  • gpu_dev – GPUNetIO handler

  • semaphore [out] – GPUNetIO semaphore handler associated to the GPU device

doca_gpu_semaphore_set_memory_type

This function defines the type of memory for the semaphore allocation.

doca_error_t doca_gpu_semaphore_set_memory_type(struct doca_gpu_semaphore *semaphore, enum doca_gpu_mem_type mtype)

  • semaphore – GPUNetIO semaphore handler

  • mtype – Type of memory to allocate the custom info structure

    • If the application must share packet info only across CUDA kernels, then DOCA_GPU_MEM_TYPE_GPU is the suggested memory type.

    • If the application must share info from a CUDA kernel to a CPU (e.g., to report statistics or output of the pipeline computation), then DOCA_GPU_MEM_TYPE_CPU_GPU is the suggested memory type.

doca_gpu_semaphore_set_items_num

This function defines the number of items in a semaphore.

doca_error_t doca_gpu_semaphore_set_items_num(struct doca_gpu_semaphore *semaphore, uint32_t num_items)

  • semaphore – GPUNetIO semaphore handler

  • num_items – Number of items to allocate

doca_gpu_semaphore_set_custom_info

This function associates an application-specific structure to semaphore items as explained under "doca_gpu_semaphore_create".

doca_error_t doca_gpu_semaphore_set_custom_info(struct doca_gpu_semaphore *semaphore, uint32_t nbytes, enum doca_gpu_mem_type mtype)

  • semaphore – GPUNetIO semaphore handler

  • nbytes – Size of the custom info structure to associate

  • mtype – Type of memory to allocate the custom info structure

    • If the application must share packet info only across CUDA kernels, then DOCA_GPU_MEM_TYPE_GPU is the suggested memory type.

    • If the application must share info from a CUDA kernel to a CPU (e.g., to report statistics or output of the pipeline computation), then DOCA_GPU_MEM_TYPE_CPU_GPU is the suggested memory type.
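A typical CPU-side configuration sequence chains the functions above. The sketch below configures a 1024-item semaphore whose custom info is shared between a CUDA kernel and a CPU thread; the my_report structure is an illustrative application-defined type, and any subsequent start/export step is not covered in this section.

```cuda
#include <doca_gpunetio.h>

/* Illustrative application-defined custom info structure */
struct my_report { uint32_t pkts; uint64_t bytes; };

/* Sketch: configure a semaphore read by the CPU. Error handling elided. */
static doca_error_t setup_semaphore(struct doca_gpu *gpu_dev,
                                    struct doca_gpu_semaphore **sem)
{
    doca_gpu_semaphore_create(gpu_dev, sem);
    /* Status flags must be visible to both CPU and GPU */
    doca_gpu_semaphore_set_memory_type(*sem, DOCA_GPU_MEM_TYPE_CPU_GPU);
    doca_gpu_semaphore_set_items_num(*sem, 1024);
    /* Custom info consumed by the CPU: CPU-resident, GPU-visible memory */
    doca_gpu_semaphore_set_custom_info(*sem, sizeof(struct my_report),
                                       DOCA_GPU_MEM_TYPE_CPU_GPU);
    return DOCA_SUCCESS;
}
```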

doca_gpu_semaphore_get_status

From the CPU, query the status of a semaphore item. If the semaphore is allocated with DOCA_GPU_MEM_TYPE_GPU, this function results in a segmentation fault.

doca_error_t doca_gpu_semaphore_get_status(struct doca_gpu_semaphore *semaphore_cpu, uint32_t idx, enum doca_gpu_semaphore_status *status)

  • semaphore_cpu – GPUNetIO semaphore CPU handler

  • idx – Semaphore item index

  • status [out] – Output semaphore status

doca_gpu_semaphore_get_custom_info_addr

From the CPU, retrieve the address of the custom info structure associated with a semaphore item. If the semaphore or the custom info is allocated with DOCA_GPU_MEM_TYPE_GPU, this function results in a segmentation fault.

doca_error_t doca_gpu_semaphore_get_custom_info_addr(struct doca_gpu_semaphore *semaphore_cpu, uint32_t idx, void **custom_info)

  • semaphore_cpu – GPUNetIO semaphore CPU handler

  • idx – Semaphore item index

  • custom_info [out] – Output semaphore custom info address
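The two functions above implement the CPU side of the poll/update mechanism described under "doca_gpu_semaphore_create". The sketch below consumes items in order; it assumes the semaphore and its custom info were allocated with DOCA_GPU_MEM_TYPE_CPU_GPU, and the my_report type and the READY status enumerator name are illustrative.

```cuda
/* Illustrative application-defined custom info structure */
struct my_report { uint32_t pkts; uint64_t bytes; };

/* Sketch: CPU thread consuming semaphore items populated by a CUDA kernel */
static void consume_reports(struct doca_gpu_semaphore *sem_cpu, uint32_t num_items)
{
    uint32_t idx = 0;
    enum doca_gpu_semaphore_status status;
    struct my_report *report;

    for (;;) {
        if (doca_gpu_semaphore_get_status(sem_cpu, idx, &status) != DOCA_SUCCESS)
            break;
        if (status != DOCA_GPU_SEMAPHORE_STATUS_READY) /* assumed enumerator name */
            continue; /* next item not yet populated by the CUDA kernel */

        doca_gpu_semaphore_get_custom_info_addr(sem_cpu, idx, (void **)&report);
        /* ... read report->pkts / report->bytes, then mark the item DONE
         * (the status setter is not covered in this section) ... */

        idx = (idx + 1) % num_items; /* items form a circular list */
    }
}
```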

doca_gpu_verbs_export_qp

The doca_gpu_verbs_export_qp function creates a GPUNetIO handler from a DOCA RDMA Verbs QP object. It takes a DOCA RDMA Verbs QP as input and returns a DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp) allocated on the CPU. To use this object in a CUDA kernel, the application must extract a GPU device handler (struct doca_gpu_dev_verbs_qp) using the doca_gpu_verbs_get_qp_dev function.

doca_error_t doca_gpu_verbs_export_qp(struct doca_gpu *gpu_dev,
                                      struct doca_dev *dev,
                                      struct doca_verbs_qp *qp,
                                      enum doca_gpu_dev_verbs_nic_handler nic_handler,
                                      void *gpu_qp_umem_dev_ptr,
                                      struct doca_verbs_cq *cq_sq,
                                      struct doca_verbs_cq *cq_rq,
                                      struct doca_gpu_verbs_qp **qp_out);

  • gpu_dev: GPUNetIO device handler.

  • dev: DOCA device handler.

  • qp: DOCA RDMA Verbs QP handler.

  • nic_handler: Type of NIC handler.

  • gpu_qp_umem_dev_ptr: GPU memory pointer to UMEM.

  • cq_sq and cq_rq: CQs associated with the Send and Receive Queues in the QP.

    Note

    While either cq_sq or cq_rq can be NULL, they cannot both be NULL simultaneously.

  • qp_out: DOCA GPUNetIO Verbs QP handler in CPU memory.

doca_gpu_verbs_get_qp_dev

Extracts a GPU device handler (struct doca_gpu_dev_verbs_qp) from a DOCA GPUNetIO Verbs QP object.

doca_error_t doca_gpu_verbs_get_qp_dev(struct doca_gpu_verbs_qp *qp, struct doca_gpu_dev_verbs_qp **qp_gpu);

  • qp: DOCA GPUNetIO Verbs QP handler.

  • qp_gpu: DOCA GPUNetIO Verbs QP GPU device handler in GPU memory.
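A typical export sequence combines the two functions above. The sketch below assumes all input objects (device handlers, Verbs QP, Send CQ, UMEM pointer) were created beforehand and exports a QP for send-only use (cq_rq is NULL, which is allowed as long as cq_sq is set).

```cuda
#include <doca_gpunetio.h>

/* Sketch: export an existing DOCA RDMA Verbs QP for GPU use and extract the
 * device handler to pass to a CUDA kernel. */
static doca_error_t export_qp_for_gpu(struct doca_gpu *gpu_dev,
                                      struct doca_dev *dev,
                                      struct doca_verbs_qp *qp,
                                      void *qp_umem_gpu_ptr,
                                      struct doca_verbs_cq *cq_sq,
                                      struct doca_gpu_dev_verbs_qp **qp_gpu)
{
    struct doca_gpu_verbs_qp *qp_cpu;
    doca_error_t status;

    status = doca_gpu_verbs_export_qp(gpu_dev, dev, qp,
                                      DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO,
                                      qp_umem_gpu_ptr, cq_sq,
                                      NULL /* cq_rq: send-only usage */, &qp_cpu);
    if (status != DOCA_SUCCESS)
        return status;

    /* qp_gpu lives in GPU memory and is passed as a CUDA kernel argument */
    return doca_gpu_verbs_get_qp_dev(qp_cpu, qp_gpu);
}
```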

doca_gpu_verbs_unexport_qp

Unexports a previously exported DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp).

doca_error_t doca_gpu_verbs_unexport_qp(struct doca_gpu *gpu_dev, struct doca_gpu_verbs_qp *qp);

  • gpu_dev: GPUNetIO device handler.

  • qp: DOCA GPUNetIO Verbs QP handler in CPU memory.

doca_gpu_verbs_bridge_export_qp

The doca_gpu_verbs_bridge_export_qp function creates a DOCA GPUNetIO Verbs QP object from application-defined parameters, acting as a bridge between IBVerbs/mlx5 objects and DOCA GPUNetIO. This allows applications to create required objects like QP, CQ, and UAR using IBVerbs and mlx5 commands and then pass the relevant information to this DOCA GPUNetIO function.

The function returns a DOCA GPUNetIO Verbs QP object (struct doca_gpu_verbs_qp) allocated on the CPU. To use this object in a CUDA kernel, the application must extract a GPU device handler (struct doca_gpu_dev_verbs_qp) using the doca_gpu_verbs_get_qp_dev function.

It is the application's responsibility to ensure that all passed parameters are correctly created and set.

doca_error_t doca_gpu_verbs_bridge_export_qp(struct doca_gpu *gpu_dev,
                                             uint32_t sq_qpn,
                                             void *sq_wqe_addr,
                                             uint16_t sq_wqe_num,
                                             uint32_t *sq_dbrec,
                                             uint64_t *sq_db,
                                             size_t uar_size,
                                             uint32_t sq_cqn,
                                             void *sq_cqe_addr,
                                             uint32_t sq_cqe_num,
                                             uint32_t *sq_cq_dbrec,
                                             uint32_t rq_qpn,
                                             void *rq_wqe_addr,
                                             uint16_t rq_wqe_num,
                                             uint32_t *rq_dbrec,
                                             uint32_t rcv_wqe_size,
                                             uint32_t rq_cqn,
                                             void *rq_cqe_addr,
                                             uint32_t rq_cqe_num,
                                             uint32_t *rq_cq_dbrec,
                                             enum doca_gpu_dev_verbs_nic_handler nic_handler,
                                             struct doca_gpu_verbs_qp **qp_out);

  • gpu_dev: GPUNetIO device handler.

  • sq_qpn: Send QP queue number.

  • sq_wqe_addr: Send QP WQE buffer memory address.

  • sq_wqe_num: Send QP number of WQEs.

  • sq_dbrec: Send QP Doorbell Record address.

  • sq_db: Send QP Doorbell address.

  • uar_size: Size of UAR.

  • sq_cqn: Send CQ number.

  • sq_cqe_addr: Send CQ CQE buffer memory address.

  • sq_cqe_num: Send CQ number of CQEs.

  • sq_cq_dbrec: Send CQ Doorbell Record address.

  • rq_qpn: Receive QP queue number.

  • rq_wqe_addr: Receive QP WQE buffer memory address.

  • rq_wqe_num: Receive QP number of WQEs.

  • rq_dbrec: Receive QP Doorbell Record address.

  • rcv_wqe_size: Size of a receive WQE.

  • rq_cqn: Receive CQ number.

  • rq_cqe_addr: Receive CQ CQE buffer memory address.

  • rq_cqe_num: Receive CQ number of CQEs.

  • rq_cq_dbrec: Receive CQ Doorbell Record address.

  • nic_handler: Type of NIC handler.

  • qp_out: DOCA GPUNetIO Verbs QP handler in CPU memory.

A DOCA Ethernet Txq context, exported for GPUNetIO usage, can be tracked via DOCA PE on the CPU side to check if there are errors when sending packets or to retrieve notification info after sending a packet with any of the doca_gpu_dev_eth_txq_*_enqueue_* functions on the GPU. An example can be found in the DOCA GPU packet processing application with ICMP traffic.

A DOCA Comch Producer or Consumer context exported for GPUNetIO usage must still be attached to a DOCA PE on the CPU to ensure that remote producer and consumer connections and disconnections are handled properly.

This section provides a list of DOCA GPUNetIO functions that can be used for Ethernet GDAKI network operations within GPU CUDA kernels.

Header Files

  • doca_gpunetio_eth_def.h: Contains constants, enums, and structure definitions.

  • doca_gpunetio_dev_eth_common.cuh: Provides CUDA utility functions common to both Txq and Rxq header files.

  • doca_gpunetio_dev_eth_txq.cuh: Provides CUDA functions for Ethernet send operations. This includes both high-level (with shared QP feature) and low-level (post WQE, ring DB, poll CQ, etc.) functions.

  • doca_gpunetio_dev_eth_rxq.cuh: Provides CUDA functions for Ethernet receive operations.

Note

The new GPUNetIO Ethernet APIs are experimental. Please report any issues encountered during use, as your feedback is crucial for enhancing the code's quality and robustness.

Note

When using the GPUNetIO Ethernet header files, avoid calling functions that start with doca_priv_*. These are internal functions intended for use by the public API only.

Note

The doca_gpunetio_eth_def.h file contains the macro #define DOCA_GPUNETIO_ETH_ENABLE_DEBUG 0. This default setting disables debug prints, including CQE errors. To enable error messages, set this macro to 1.


Execution Scope and Shared Queue

The execution scope for high-level GPUNetIO Ethernet APIs (thread, warp, or block) is specified via the doca_gpu_dev_eth_exec_scope enum as a template argument. This tells the function how many threads are participating in the operation.

  • Send (Txq): The high-level send API supports the shared queue feature. This allows different threads, belonging to different scopes, to concurrently access the same Txq without race conditions.

  • Receive (Rxq): The high-level receive API does not support the shared queue feature. If a thread, warp, or block is using an Rxq, other threads, warps, or blocks cannot use the same Rxq in parallel.

The following subsections detail the send execution scopes.

Thread Scope (DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD)

  • Each thread acts as a standalone entity.

  • Each thread posts one or more WQEs and is responsible for its own submit (doorbell ring).

  • Performance: This is the slowest approach due to high contention on atomic operations.


Warp Scope (DOCA_GPUNETIO_ETH_EXEC_SCOPE_WARP)

  • All threads within the warp must call the function.

  • Each thread posts a WQE to a different location, but only the first thread (lane_idx 0) performs the submit operation.

  • Performance: Reduces contention to a per-warp basis.


Block Scope (DOCA_GPUNETIO_ETH_EXEC_SCOPE_BLOCK)

  • All threads within the block must call the function.

  • Each thread posts a WQE to a different location, but only the first thread (threadIdx 0) performs the submit operation.

  • Performance: Reduces contention to a per-block basis.


Ring DB and CPU Proxy

The doca_gpu_dev_eth_nic_handler enum, passed as a template argument, determines who is responsible for ringing the network card's doorbell.

The available modes are:

  • DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO: Automatically detects the best doorbell ringing option.

  • DOCA_GPUNETIO_ETH_NIC_HANDLER_CPU_PROXY: Enables CPU proxy mode. GPU submit functions provide information to the CPU, which then rings the doorbell.

  • DOCA_GPUNETIO_ETH_NIC_HANDLER_GPU_SM_DB: Enables regular GDAKI mode, where CUDA threads ring the doorbell directly.

Note

When CPU_PROXY mode is enabled, a CPU thread must invoke doca_eth_txq_gpu_cpu_proxy_progress in a loop to detect the GPU's information and ring the doorbell. The GPUNetIO Ethernet Simple Send sample demonstrates this feature.
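The CPU proxy loop can be sketched as follows. The exact signature of doca_eth_txq_gpu_cpu_proxy_progress is assumed here (the sketch passes the CPU-side Txq context), and exit_flag is an application-owned termination condition.

```cuda
/* Sketch: CPU proxy thread for a Txq exported with the CPU_PROXY NIC handler */
static void cpu_proxy_loop(struct doca_eth_txq *eth_txq_cpu, volatile bool *exit_flag)
{
    while (!*exit_flag) {
        /* Detect doorbell requests posted by GPU submit functions
         * and ring the NIC doorbell on the GPU's behalf */
        doca_eth_txq_gpu_cpu_proxy_progress(eth_txq_cpu); /* assumed signature */
    }
}
```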


Memory Consistency (MCST) Algorithm

During a receive operation, the network card writes packet data via PCIe to the application's mapped memory. For pre-Hopper GPUs (see CUGPUDirectRDMAWritesOrdering), memory consistency (MCST) must be ensured when a CUDA kernel is receiving data into GPU memory.

DOCA GPUNetIO provides two ways to enable the MCST algorithm:

  1. CPU configuration phase: When creating the DOCA Ethernet Rxq, use the function doca_eth_rxq_gpu_enable_mcst_qp to create an internal queue dedicated to the MCST algorithm.

    cudaGetDeviceProperties(&prop, cuda_id);
    /* If pre-Hopper GPU (__CUDA_ARCH__ < 900) */
    if (prop.major < 9)
        doca_eth_rxq_gpu_enable_mcst_qp(rxq->eth_rxq_cpu);

  2. GPU data path phase: Pass the doca_gpu_dev_eth_mcst_mode enum as a template argument to the recv function.

Receive, High-level Function: doca_gpu_dev_eth_rxq_recv

This function receives Ethernet packets in a CUDA kernel and is available in different flavors for per-thread, per-warp, or per-block scopes.

template <enum doca_gpu_dev_eth_exec_scope exec_scope = DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD,
          enum doca_gpu_dev_eth_mcst_mode mcst_mode = DOCA_GPUNETIO_ETH_MCST_DISABLED,
          enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
          bool enable_attributes = false>
__device__ inline doca_error_t doca_gpu_dev_eth_rxq_recv(struct doca_gpu_eth_rxq *rxq,
                                                         uint32_t max_rx_pkts,
                                                         uint64_t timeout_ns,
                                                         volatile uint64_t *out_first_pkt_idx,
                                                         uint32_t *out_pkt_num,
                                                         struct doca_gpu_dev_eth_rxq_attr *out_attr);

Parameters:

  • rxq: The Ethernet receive queue GPU handler.

  • max_rx_pkts: Maximum number of packets to receive. The function guarantees the returned packet count is <= this number.

  • timeout_ns: Nanoseconds to wait for packets before returning.

  • out_first_pkt_idx [out]: Index of the first packet received. For block or warp scopes, this variable must be in shared or global memory to be visible to all threads.

  • out_pkt_num [out]: The actual number of packets received. For block or warp scopes, this variable must be in shared or global memory.

  • out_attr [out]: If enable_attributes is true, this structure is filled with per-packet attributes. The application must ensure this array is large enough to hold attributes for max_rx_pkts.

Notes

  • If both max_rx_pkts and timeout_ns are set to 0, the function will hang.

  • For BLOCK or WARP scopes, the max_rx_pkts value should be at least the number of threads in the scope (e.g., blockDim.x or warpSize), as every thread will try to receive at least once.

  • This function does not support the shared QP feature. Access to the rxq must be exclusive to the calling thread, warp, or block. No other scope can work on the same rxq in parallel.

Packet Indexing and Circular Buffer

When the Rxq is created on the CPU, its GPU memory is divided into fixed-size strides (based on max_packet_size). The doca_gpu_dev_eth_rxq_recv function guarantees that these strides are filled sequentially with new packets.


The function returns the following output parameters to identify the received packets:

  • out_first_pkt_idx: The index of the first stride (packet) filled.

  • out_pkt_num: The total number of packets received.

For example, if out_first_pkt_idx is X and out_pkt_num is Y, the function has received Y packets, filling the GPU memory buffer strides from index X to (X + Y - 1).

Note

The output parameters out_first_pkt_idx and out_pkt_num must be visible to all threads in the scope (e.g., via CUDA shared memory for warp and block scopes).

The receive buffer is treated as a circular buffer. Once the last stride is filled, the queue wraps around to the first stride.

Note

It is the application's responsibility to consume packets before they are overwritten by the queue wrapping around. This requires proper dimensioning of the Rxq GPU buffer and scaling across multiple receive queues.

Receive: doca_gpu_dev_eth_rxq_get_pkt_addr

This utility function retrieves the memory address of a specific packet in a specific stride.

__device__ inline uint64_t doca_gpu_dev_eth_rxq_get_pkt_addr(struct doca_gpu_eth_rxq *rxq, uint64_t packet_idx);

Parameters:

  • rxq: Ethernet receive queue GPU handler.

  • packet_idx: Index of the stride where the packet resides.
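A minimal block-scope receive kernel combining doca_gpu_dev_eth_rxq_recv with doca_gpu_dev_eth_rxq_get_pkt_addr might look as follows. This is a sketch: the exit flag (allocated as DOCA_GPU_MEM_TYPE_GPU_CPU memory) and the timeout value are application choices, and max_pkts must be at least blockDim.x as noted above.

```cuda
/* Sketch: block-scope receive loop feeding per-thread packet processing */
__global__ void rx_kernel(struct doca_gpu_eth_rxq *rxq,
                          volatile uint32_t *exit_cond, uint32_t max_pkts)
{
    /* Output parameters must be visible to all threads in the block */
    __shared__ uint64_t first_pkt_idx;
    __shared__ uint32_t pkt_num;

    while (*exit_cond == 0) {
        doca_gpu_dev_eth_rxq_recv<DOCA_GPUNETIO_ETH_EXEC_SCOPE_BLOCK>(
            rxq, max_pkts, 500000 /* timeout_ns */,
            &first_pkt_idx, &pkt_num, NULL /* no per-packet attributes */);
        __syncthreads();

        /* Strides first_pkt_idx .. first_pkt_idx + pkt_num - 1 hold new packets */
        for (uint32_t i = threadIdx.x; i < pkt_num; i += blockDim.x) {
            uint64_t pkt = doca_gpu_dev_eth_rxq_get_pkt_addr(rxq, first_pkt_idx + i);
            /* ... parse packet bytes at pkt ... */
        }
        __syncthreads(); /* consume all packets before the circular queue wraps */
    }
}
```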

Send, High-level Function: doca_gpu_dev_eth_txq_send

This function sends Ethernet packets from a CUDA kernel, with flavors for per-thread, per-warp, or per-block scopes. It enables the shared queue feature, allowing concurrent access to the same Txq.

template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
          enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
          enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
          enum doca_gpu_dev_eth_exec_scope exec_scope = DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD>
__device__ static inline void doca_gpu_dev_eth_txq_send(struct doca_gpu_eth_txq *txq,
                                                        uint64_t addr,
                                                        uint32_t mkey,
                                                        size_t size,
                                                        enum doca_gpu_eth_send_flags flags,
                                                        doca_gpu_dev_eth_ticket_t *out_ticket);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • addr: Memory address of the packet to send.

  • mkey: Memory key of the packet.

  • size: Size of the packet in bytes.

  • flags: Send flags from the doca_gpu_eth_send_flags enum.

  • out_ticket: Returns the WQE index (position) in the send queue where the packet was posted.
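A warp-scope send can be sketched as below. Every thread in the warp participates; pkt_addrs and mkey are assumed to describe pre-registered packet buffers, and the DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY flag requests a completion (see the poll completion functions).

```cuda
/* Sketch: warp-scope send with the shared-queue feature. One warp per block
 * is assumed for simplicity (threadIdx.x == lane index). */
__global__ void tx_kernel(struct doca_gpu_eth_txq *txq, uint64_t *pkt_addrs,
                          uint32_t mkey, size_t pkt_size)
{
    doca_gpu_dev_eth_ticket_t ticket;

    /* All threads in the warp must call the function at warp scope */
    doca_gpu_dev_eth_txq_send<DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
                              DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
                              DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
                              DOCA_GPUNETIO_ETH_EXEC_SCOPE_WARP>(
        txq, pkt_addrs[threadIdx.x], mkey, pkt_size,
        DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY, &ticket);
    /* Only lane 0 rings the doorbell; ticket holds each thread's WQE index */
}
```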

Send, High-level Function: doca_gpu_dev_eth_txq_wait_send

This function combines the "Accurate Send Scheduling" feature (wait on timestamp) with the send operation. The first thread in the scope posts a "wait on time" barrier, and then all threads in the scope post their send. This function also enables the shared queue feature.

template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
          enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
          enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO,
          enum doca_gpu_dev_eth_exec_scope exec_scope = DOCA_GPUNETIO_ETH_EXEC_SCOPE_THREAD>
__device__ static inline void doca_gpu_dev_eth_txq_wait_send(struct doca_gpu_eth_txq *txq,
                                                             const uint64_t wait_on_time_ts,
                                                             uint64_t addr,
                                                             uint32_t mkey,
                                                             size_t size,
                                                             enum doca_gpu_eth_send_flags flags,
                                                             doca_gpu_dev_eth_ticket_t *out_ticket);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • wait_on_time_ts: Timestamp barrier. The timestamp can be calculated on the GPU (via doca_gpu_dev_eth_txq_calculate_timestamp) or on the CPU (via doca_eth_txq_calculate_timestamp).

  • addr: Memory address of the packet to send.

  • mkey: Memory key of the packet.

  • size: Size of the packet in bytes.

  • flags: Send flags from the doca_gpu_eth_send_flags enum.

  • out_ticket: Returns the WQE index (position) in the send queue where the packet was posted.
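Accurate send scheduling can then be sketched as below at the default thread scope. The timestamp ts_ns is assumed to have been computed beforehand (e.g., on the CPU via doca_eth_txq_calculate_timestamp) and passed as a kernel argument.

```cuda
/* Sketch: post a "wait on time" barrier followed by a send */
__global__ void timed_tx_kernel(struct doca_gpu_eth_txq *txq, uint64_t ts_ns,
                                uint64_t pkt_addr, uint32_t mkey, size_t size)
{
    doca_gpu_dev_eth_ticket_t ticket;

    /* The first thread in the scope posts the time barrier, then the send */
    doca_gpu_dev_eth_txq_wait_send(txq, ts_ns, pkt_addr, mkey, size,
                                   DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY, &ticket);
}
```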

Send, Low-level Function: doca_gpu_dev_eth_txq_wqe_prepare_send

This low-level function posts a send WQE. It is used by the high-level functions but can also be called directly by the application.

__device__ __inline__ static doca_error_t doca_gpu_dev_eth_txq_wqe_prepare_send(const struct doca_gpu_eth_txq *txq,
                                                                                struct doca_gpu_dev_eth_txq_wqe *wqe_ptr,
                                                                                const uint16_t wqe_idx,
                                                                                const uint64_t addr,
                                                                                const uint32_t mkey,
                                                                                const uint32_t nbytes,
                                                                                const enum doca_gpu_eth_send_flags flags);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • wqe_ptr: Memory pointer to the WQE (can be obtained via doca_gpu_dev_eth_txq_get_wqe_ptr).

  • wqe_idx: Index of the WQE.

  • addr: Packet memory address.

  • mkey: Packet memory key.

  • nbytes: Packet size in bytes.

  • flags: Send flags from the doca_gpu_eth_send_flags enum.

Send, Low-level Function: doca_gpu_dev_eth_txq_wqe_prepare_wait_time

This low-level function posts a "wait on time" WQE to create a time barrier. It is used by doca_gpu_dev_eth_txq_wait_send but can also be called directly.

__device__ __inline__ static doca_error_t doca_gpu_dev_eth_txq_wqe_prepare_wait_time(const struct doca_gpu_eth_txq *txq,
                                                                                     struct doca_gpu_dev_eth_txq_wqe *wqe_ptr,
                                                                                     const uint16_t wqe_idx,
                                                                                     const uint64_t wait_on_time_ts,
                                                                                     const enum doca_gpu_eth_send_flags flags);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • wqe_ptr: Memory pointer to the WQE (can be obtained via doca_gpu_dev_eth_txq_get_wqe_ptr).

  • wqe_idx: Index of the WQE.

  • wait_on_time_ts: Timestamp barrier. The timestamp can be calculated on the GPU (via doca_gpu_dev_eth_txq_calculate_timestamp) or CPU (via doca_eth_txq_calculate_timestamp).

  • flags: Send flags from the doca_gpu_eth_send_flags enum.

Send, Low-level Function: doca_gpu_dev_eth_txq_submit

This function "rings the doorbell" to notify the network card to execute all WQEs posted since the last submit. It supports the shared queue feature; internal atomics allow multiple threads to call it in parallel, and the doorbell register will be updated only by the thread with the greatest WQE index. It also allows choosing between GPU and CPU Proxy handlers via the template parameter.

template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
          enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU,
          enum doca_gpu_dev_eth_nic_handler nic_handler = DOCA_GPUNETIO_ETH_NIC_HANDLER_AUTO>
__device__ static inline void doca_gpu_dev_eth_txq_submit(struct doca_gpu_eth_txq *txq, uint64_t prod_index);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • prod_index: The Producer Index (WQE index) to write into the network card's doorbell register.
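The low-level post/submit sequence can be sketched as below for a single thread that has exclusive ownership of the Txq. Two details are assumptions: the signature of doca_gpu_dev_eth_txq_get_wqe_ptr (only referenced, not defined, in this section) and the use of wqe_idx + 1 as the producer index.

```cuda
/* Sketch: exclusive low-level send path (no shared-queue atomics needed) */
__device__ void post_one_and_submit(struct doca_gpu_eth_txq *txq, uint16_t wqe_idx,
                                    uint64_t addr, uint32_t mkey, uint32_t nbytes)
{
    /* Assumed helper signature */
    struct doca_gpu_dev_eth_txq_wqe *wqe_ptr =
        doca_gpu_dev_eth_txq_get_wqe_ptr(txq, wqe_idx);

    /* Post the send WQE with a completion request */
    doca_gpu_dev_eth_txq_wqe_prepare_send(txq, wqe_ptr, wqe_idx, addr, mkey,
                                          nbytes, DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY);

    /* Ring the doorbell for all WQEs posted up to and including wqe_idx */
    doca_gpu_dev_eth_txq_submit(txq, (uint64_t)wqe_idx + 1);
}
```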

Send, Low-level Functions: ...update_dbr, ...ring_db, and ...submit_proxy

These functions can replace doca_gpu_dev_eth_txq_submit in special situations, such as when the application has a well-determined send pattern and does not need atomics for shared queue access. The typical sequence is doca_gpu_dev_eth_txq_update_dbr (updates DBREC) followed by doca_gpu_dev_eth_txq_ring_db (updates Doorbell Register).

template <enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline void doca_gpu_dev_eth_txq_update_dbr(struct doca_gpu_eth_txq *txq, uint32_t prod_index);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • prod_index: The WQE index to write into the network card DBREC.

template <enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline void doca_gpu_dev_eth_txq_ring_db(struct doca_gpu_eth_txq *txq, uint64_t prod_index);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • prod_index: The WQE index to write into the network card doorbell register.

doca_gpu_dev_eth_txq_ring_db works for direct GPU doorbell ringing. To ring the doorbell via CPU proxy, doca_gpu_dev_eth_txq_submit_proxy must be used instead:

template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
          enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline void doca_gpu_dev_eth_txq_submit_proxy(struct doca_gpu_eth_txq *txq, uint64_t prod_index);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • prod_index: The WQE index to write into the CPU proxy shared memory.

Send, Low-level Functions: Poll Completion

These functions allow waiting for send operations to complete. A Completion (CQE) is generated only if the DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY flag was used when posting the WQE.

doca_gpu_dev_eth_txq_poll_completion waits for a number of CQEs. It operates at thread scope but supports the shared queue feature (each thread locks num_cqe slots to poll).

template <enum doca_gpu_dev_eth_cq_poll_mode cqe_poll = DOCA_GPUNETIO_ETH_CQ_POLL_ALL,
          enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
          enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline doca_error_t doca_gpu_dev_eth_txq_poll_completion(struct doca_gpu_eth_txq *txq,
                                                                           const uint32_t num_cqe,
                                                                           enum doca_gpu_dev_eth_wait_flags wait_mode,
                                                                           uint32_t *num_completed);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • num_cqe: Number of CQEs to wait for.

  • wait_mode: Wait in blocking or non-blocking mode (flags from doca_gpu_dev_eth_wait_flags).

  • num_completed: Number of CQEs detected.

Note

When num_cqe > 1, setting the cqe_poll template parameter to DOCA_GPUNETIO_ETH_CQ_POLL_ALL can reduce latency: the function polls only the latest CQE (index num_cqe - 1) instead of polling each CQE sequentially.

doca_gpu_dev_eth_txq_poll_completion_at allows the application to poll for a specific CQE position. It does not support the shared queue feature.

template <enum doca_gpu_dev_eth_resource_sharing_mode resource_sharing_mode = DOCA_GPUNETIO_ETH_RESOURCE_SHARING_MODE_GPU,
          enum doca_gpu_dev_eth_sync_scope sync_scope = DOCA_GPUNETIO_ETH_SYNC_SCOPE_GPU>
__device__ static inline doca_error_t doca_gpu_dev_eth_txq_poll_completion_at(struct doca_gpu_eth_txq *txq,
                                                                              const uint64_t cqe_idx,
                                                                              enum doca_gpu_dev_eth_wait_flags wait_mode);

Parameters:

  • txq: Ethernet send queue GPU handler.

  • cqe_idx: Index of the CQE to poll.

  • wait_mode: Wait in blocking or non-blocking mode (flags from doca_gpu_dev_eth_wait_flags).
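Waiting for completions can be sketched as below. The WQEs being waited on must have been posted with DOCA_GPUNETIO_ETH_SEND_FLAG_NOTIFY; the DOCA_GPUNETIO_ETH_WAIT_FLAG_BLOCKING enumerator name is an assumption (the blocking/non-blocking values of doca_gpu_dev_eth_wait_flags are not listed in this section).

```cuda
/* Sketch: block until 32 send completions are detected */
__device__ doca_error_t wait_for_sends(struct doca_gpu_eth_txq *txq)
{
    uint32_t completed = 0;
    /* A non-DOCA_SUCCESS return indicates a CQE error; completed holds the count */
    return doca_gpu_dev_eth_txq_poll_completion(
        txq, 32 /* num_cqe */,
        DOCA_GPUNETIO_ETH_WAIT_FLAG_BLOCKING /* assumed enumerator name */,
        &completed);
}
```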

GPUNetIO Verbs data path functions are defined as inline implementations in header files named doca_gpunetio_dev_verbs_*.cuh.

Header Files

  • doca_gpunetio_verbs_def.h: Contains constants, enums, and structure definitions.

  • doca_gpunetio_dev_verbs_qp.cuh: Provides CUDA functions for interacting with a QP, including WQE creation and ringing the network card doorbell.

  • doca_gpunetio_dev_verbs_cq.cuh: Offers CUDA functions for interacting with a CQ, including CQE polling and validation.

  • doca_gpunetio_dev_verbs_onesided.cuh: Enables one-sided operations (e.g., put) with the shared QP feature at the CUDA thread and warp levels.

  • doca_gpunetio_dev_verbs_twosided.cuh: Enables two-sided operations (e.g., send/recv) with the shared QP feature at the CUDA thread and warp levels.

  • doca_gpunetio_dev_verbs_counter.cuh: Provides CUDA functions for counter-like operations via RDMA atomics with the shared QP feature. This requires "Core Direct" to be enabled as a QP attribute.

Note

The GPUNetIO Verbs APIs are currently experimental. Please report any issues encountered during use, as your feedback is crucial for enhancing the code's quality and robustness.

Note

When using the GPUNetIO Verbs header files, avoid calling functions that start with doca_priv_*. These are internal functions intended for use by the public API only.

Note

The doca_gpunetio_verbs_def.h file contains the macro #define DOCA_GPUNETIO_VERBS_ENABLE_DEBUG 0. This default setting disables debug prints, including CQE errors. To enable error messages, set this macro to 1.


Execution Scope and Shared Queue

The high-level GPUNetIO Verbs API can be executed at a thread or warp scope by specifying a value from the doca_gpu_dev_verbs_exec_scope enum as a template argument. This tells the function how many threads are participating in the operation.

The high-level API also supports the shared queue feature, allowing different threads from different scopes to concurrently access the same queue without race conditions.

Thread Scope (DOCA_GPUNETIO_VERBS_EXEC_SCOPE_THREAD)

  • Each thread acts as a standalone entity, posting WQEs and ringing the doorbell.

  • This is the slowest approach due to high contention on atomic operations.


Warp Scope (DOCA_GPUNETIO_VERBS_EXEC_SCOPE_WARP)

  • All threads within the warp must call the function.

  • Each thread posts a WQE to a different location, but only the first thread (lane_idx 0) performs the submit (doorbell ring).

  • This reduces contention to a per-warp basis.


Ring DB and CPU Proxy

The doca_gpu_dev_verbs_nic_handler enum, passed as a template argument, determines who is responsible for ringing the doorbell.

The available modes are:

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO: Automatically detects the best doorbell ringing option.

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_CPU_PROXY: Enables CPU proxy mode. GPU submit functions provide information to the CPU, which then rings the doorbell.

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_DB: Enables regular GDAKI mode, where CUDA threads ring the doorbell directly.

  • DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_BF: Experimental BlueFlame doorbell ringing mode (may fail in the current release).

Note

When CPU_PROXY mode is enabled, a CPU thread must invoke doca_gpu_verbs_cpu_proxy_progress in a loop to detect the GPU's submissions and ring the doorbell. This mode can be enabled via a command-line option in most GPUNetIO Verbs samples.


Memory Consistency (MCST) Algorithm

During an RDMA Read or Receive operation, the network card writes data via PCIe into the application's mapped memory. On pre-Hopper GPUs (see the CUDA enum CUGPUDirectRDMAWritesOrdering), memory consistency (MCST) must be enforced when a CUDA kernel receives data into GPU memory.

DOCA GPUNetIO Verbs applications can enable the MCST algorithm on the GPU by setting the doca_gpu_dev_verbs_mcst_mode mcst_mode template parameter. This parameter is available in functions such as doca_gpu_dev_verbs_get, doca_gpu_dev_verbs_get_wait, and doca_gpu_dev_verbs_recv_wait.

The gpunetio_verbs_twosided_bw sample demonstrates this by checking the CUDA architecture:

#if __CUDA_ARCH__ < 900
    doca_gpu_dev_verbs_recv_wait<DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU,
                                 DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO,
                                 DOCA_GPUNETIO_VERBS_MCST_ENABLED>(
        qp, doca_gpu_dev_verbs_addr{.addr = (uint64_t)dump_flag, .key = dump_flag_mkey});
#else
    doca_gpu_dev_verbs_recv_wait<DOCA_GPUNETIO_VERBS_RESOURCE_SHARING_MODE_GPU,
                                 DOCA_GPUNETIO_VERBS_NIC_HANDLER_AUTO,
                                 DOCA_GPUNETIO_VERBS_MCST_DISABLED>(
        qp, doca_gpu_dev_verbs_addr{.addr = 0, .key = 0});
#endif


Queue Type

Many functions (e.g., submit doorbell, poll CQ) can apply to either the Send Queue or Receive Queue of a QP. A template argument from the doca_gpu_dev_verbs_qp_type enum is used to specify which queue to use.

  • DOCA_GPUNETIO_VERBS_QP_SQ: Uses the Send Queue from the input QP.

  • DOCA_GPUNETIO_VERBS_QP_RQ: Uses the Receive Queue from the input QP.

ConnectX-8 Reliable Doorbell

The ConnectX-8 Reliable Doorbell feature optimizes the send data path by allowing applications to skip the Queue Pair Doorbell Record (QP DBREC) update. This reduces the number of operations the GPU must execute to ring the doorbell for send WQEs (Write, Read, Send, etc.).

The implementation steps are as follows:

  1. Verify support: Check if the device supports the feature using doca_verbs_device_attr_get_is_send_dbr_mode_supported().

  2. Enable feature: Set the new QP attribute using doca_verbs_qp_init_attr_set_send_dbr_mode().

  3. Handle GPU data path: If enabled, use the GPUNetIO handler mode DOCA_GPUNETIO_VERBS_NIC_HANDLER_GPU_SM_NO_DBR.

The gpunetio_verbs_write_lat and gpunetio_verbs_put_counter_lat examples demonstrate this feature. Enable it via the command line option -r 1.

Header: doca_gpunetio_dev_verbs_qp.cuh

This header provides functions for QP manipulation, categorized into three groups:

  • doca_gpu_dev_verbs_wqe_prepare_*: Prepares and posts various RDMA WQEs (Write, Read, Send, Recv, Atomic, Wait, Dump) on QP memory.

  • doca_gpu_dev_verbs_submit: Updates network card registers for doorbell submission (regular, CPU proxy, and BlueFlame) for both SQ and RQ.

  • doca_gpu_dev_verbs_wait_*: Waits for CQEs to arrive on the SQ or RQ's completion queue.

For detailed explanations, refer to the header file. Examples can be found in the doca_gpunetio_verbs_write_bw and doca_gpunetio_verbs_write_lat samples.

Header: doca_gpunetio_dev_verbs_cq.cuh

This header provides functions for manipulating Completion Queues (CQs). Specifically, the doca_gpu_dev_verbs_poll_* functions are used to poll CQEs, waiting for the completion of WQEs in the connected QP.

Header: doca_gpunetio_dev_verbs_onesided.cuh

This header provides shared-QP-aware functions for one-sided RDMA operations at the thread and warp levels.

  • doca_gpu_dev_verbs_put: Posts an RDMA Write WQE and submits. In WARP scope, only the first thread submits.

  • doca_gpu_dev_verbs_put_signal: Similar to put, but adds an RDMA Atomic Fetch and Add. In THREAD scope, each thread posts a Write, an Atomic, and submits. In WARP scope, each thread posts a Write, but only the first thread posts the Atomic and submits.

  • doca_gpu_dev_verbs_get: Posts an RDMA Read WQE and submits. In WARP scope, only the first thread submits. Optionally posts a Dump WQE to ensure data consistency (i.e., that the Read data is actually present in memory once the CQE arrives; see the CUDA enum CUGPUDirectRDMAWritesOrdering).

For details, refer to the header file. Examples are in the doca_gpunetio_verbs_put_bw, doca_gpunetio_verbs_put_signal_bw, and doca_gpunetio_verbs_put_signal_lat samples.

Header: doca_gpunetio_dev_verbs_twosided.cuh

This header provides shared-QP-aware functions for two-sided RDMA operations at the thread and warp levels.

  • doca_gpu_dev_verbs_send: Posts an RDMA Send WQE and submits. In WARP scope, only the first thread submits.

  • doca_gpu_dev_verbs_recv: Posts an RDMA Receive WQE and submits. In WARP scope, only the first thread submits.

Requirement flow for two-sided communication:

  1. Peer A posts one or more RDMA Receive WQEs.

  2. Peer A notifies Peer B that it is ready to receive.

  3. Peer B waits for this notification.

  4. Peer B posts one or more RDMA Send WQEs.

The notification can be sent via RDMA Write or Atomic. An example using RDMA Write is in the doca_gpunetio_verbs_twosided_bw sample.

Header: doca_gpunetio_dev_verbs_counter.cuh

This header provides shared-QP functions that use the counter feature. This allows a CQE arrival on a main QP to trigger a WQE on a companion QP. This requires Core Direct to be enabled via doca_verbs_qp_init_attr_set_core_direct_master().

  • doca_gpu_dev_verbs_put_counter: Posts an RDMA Write on the main QP and an RDMA Atomic FetchAdd on the companion QP. The Atomic is executed only after the Write's CQE arrives, allowing for buffer reuse.

  • doca_gpu_dev_verbs_submit_multi_qps: Submits both the main and companion QP in a single call.

For details, refer to the header file. Examples are in the doca_gpunetio_verbs_put_counter_bw and doca_gpunetio_verbs_put_counter_lat samples.

This section provides a list of DOCA GPUNetIO functions that can be used only on the GPU, within a CUDA kernel, to execute RDMA operations. These functions offer a strong mode and a weak mode. The two get_info functions below return the next available position in a queue and the queue's position mask, which are needed to compute the position argument of the weak-mode functions.

__device__ doca_error_t
doca_gpu_dev_rdma_get_info(struct doca_gpu_dev_rdma *rdma,
                           uint32_t connection_index,
                           uint32_t *curr_position,
                           uint32_t *mask_max_position)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • curr_position – Next available position in the queue

  • mask_max_position – Mask of the total number of positions in the queue

__device__ doca_error_t
doca_gpu_dev_rdma_recv_get_info(struct doca_gpu_dev_rdma_r *rdma_r,
                                uint32_t *curr_position,
                                uint32_t *mask_max_position)

  • rdma_r – RDMA receive queue GPU handler

  • curr_position – Next available position in the queue

  • mask_max_position – Mask of the total number of positions in the queue

doca_gpu_dev_rdma_write_*

To RDMA write data onto a remote memory location from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

__device__ doca_error_t
doca_gpu_dev_rdma_write_strong(struct doca_gpu_dev_rdma *rdma,
                               uint32_t connection_index,
                               struct doca_gpu_buf *remote_buf,
                               uint64_t remote_offset,
                               struct doca_gpu_buf *local_buf,
                               uint64_t local_offset,
                               size_t length,
                               uint32_t imm,
                               const enum doca_gpu_dev_rdma_write_flags flags)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to write data to

  • remote_offset – Offset, in bytes, to write data to in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to write

  • local_offset – Offset, in bytes, to fetch data from in the local buffer

  • length – Number of bytes to write

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

__device__ doca_error_t
doca_gpu_dev_rdma_write_weak(struct doca_gpu_dev_rdma *rdma,
                             uint32_t connection_index,
                             struct doca_gpu_buf *remote_buf,
                             uint64_t remote_offset,
                             struct doca_gpu_buf *local_buf,
                             uint64_t local_offset,
                             size_t length,
                             uint32_t imm,
                             const enum doca_gpu_dev_rdma_write_flags flags,
                             uint32_t position);

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to write data to

  • remote_offset – Offset, in bytes, to write data to in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to write

  • local_offset – Offset, in bytes, to fetch data from in the local buffer

  • length – Number of bytes to write

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.
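The weak-mode flow combines doca_gpu_dev_rdma_get_info, doca_gpu_dev_rdma_write_weak, and doca_gpu_dev_rdma_commit_weak. The following is a minimal illustrative sketch, not taken from the DOCA samples: kernel and variable names are placeholders, the DOCA_GPU_RDMA_WRITE_FLAG_NONE enumerator is assumed to exist in the doca_gpu_dev_rdma_write_flags enum, and error checking is omitted.

```cuda
__global__ void rdma_write_weak_kernel(struct doca_gpu_dev_rdma *rdma,
                                       struct doca_gpu_buf *remote_buf,
                                       struct doca_gpu_buf *local_buf,
                                       size_t chunk_bytes)
{
    __shared__ uint32_t base_position;
    __shared__ uint32_t position_mask;

    /* One thread queries the queue state, then shares it with the block. */
    if (threadIdx.x == 0)
        doca_gpu_dev_rdma_get_info(rdma, 0, &base_position, &position_mask);
    __syncthreads();

    /* Each thread posts its own chunk at its own queue position. */
    doca_gpu_dev_rdma_write_weak(rdma, 0,
                                 remote_buf, threadIdx.x * chunk_bytes,
                                 local_buf, threadIdx.x * chunk_bytes,
                                 chunk_bytes, 0, DOCA_GPU_RDMA_WRITE_FLAG_NONE,
                                 (base_position + threadIdx.x) & position_mask);
    __syncthreads();

    /* Only one CUDA thread may commit; it reports how many ops were enqueued. */
    if (threadIdx.x == 0)
        doca_gpu_dev_rdma_commit_weak(rdma, 0, blockDim.x);
}
```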

doca_gpu_dev_rdma_read_*

To RDMA read data from a remote memory location from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

__device__ doca_error_t
doca_gpu_dev_rdma_read_strong(struct doca_gpu_dev_rdma *rdma,
                              uint32_t connection_index,
                              struct doca_gpu_buf *remote_buf,
                              uint64_t remote_offset,
                              struct doca_gpu_buf *local_buf,
                              uint64_t local_offset,
                              size_t length,
                              const uint32_t flags_bitmask)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to read data from

  • remote_offset – Offset, in bytes, to read data from in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array in which to store the data read

  • local_offset – Offset, in bytes, at which to store data in the local buffer

  • length – Number of bytes to be read

  • flags_bitmask – Must be 0; reserved for future use

__device__ doca_error_t
doca_gpu_dev_rdma_read_weak(struct doca_gpu_dev_rdma *rdma,
                            uint32_t connection_index,
                            struct doca_gpu_buf *remote_buf,
                            uint64_t remote_offset,
                            struct doca_gpu_buf *local_buf,
                            uint64_t local_offset,
                            size_t length,
                            const uint32_t flags_bitmask,
                            uint32_t position);

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • remote_buf – Remote DOCA buffer from a DOCA GPU buffer array to read data from

  • remote_offset – Offset, in bytes, to read data from in the remote buffer

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array in which to store the data read

  • local_offset – Offset, in bytes, at which to store data in the local buffer

  • length – Number of bytes to be read

  • flags_bitmask – Must be 0; reserved for future use

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_send_*

To RDMA send data from a CUDA kernel, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

__device__ doca_error_t
doca_gpu_dev_rdma_send_strong(struct doca_gpu_dev_rdma *rdma,
                              uint32_t connection_index,
                              struct doca_gpu_buf *local_buf,
                              uint64_t local_offset,
                              size_t length,
                              uint32_t imm,
                              const enum doca_gpu_dev_rdma_write_flags flags)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send

  • local_offset – Offset, in bytes, to fetch data from in the local buffer

  • length – Number of bytes to send

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

__device__ doca_error_t
doca_gpu_dev_rdma_send_weak(struct doca_gpu_dev_rdma *rdma,
                            uint32_t connection_index,
                            struct doca_gpu_buf *local_buf,
                            uint64_t local_offset,
                            size_t length,
                            uint32_t imm,
                            const enum doca_gpu_dev_rdma_write_flags flags,
                            uint32_t position);

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • local_buf – Local DOCA buffer from a DOCA GPU buffer array from which to fetch data to send

  • local_offset – Offset, in bytes, to fetch data from in the local buffer

  • length – Number of bytes to send

  • imm – Immediate value uint32_t

  • flags – One of the flags in the doca_gpu_dev_rdma_write_flags enum

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_commit_*

Once all RDMA write, send, or read requests have been enqueued in the RDMA queue, a synchronization point must be reached to consolidate and execute those requests. Only one CUDA thread at a time can invoke this function.

__device__ doca_error_t
doca_gpu_dev_rdma_commit_strong(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

__device__ doca_error_t
doca_gpu_dev_rdma_commit_weak(struct doca_gpu_dev_rdma *rdma, uint32_t connection_index, uint32_t num_ops)

  • rdma – RDMA queue GPU handler

  • connection_index – In case of RDMA CM, the connection index must be specified. By default, it is 0.

  • num_ops – Number of RDMA requests enqueued since the last commit

doca_gpu_dev_rdma_wait_all

After a commit, the RDMA requests are executed by the network card while the application moves forward with other operations. If the application needs to verify that all RDMA operations have been completed by the network card, this "wait all" function waits for all previously posted operations. Only one CUDA thread at a time can invoke this function.

__device__ doca_error_t
doca_gpu_dev_rdma_wait_all(struct doca_gpu_dev_rdma *rdma, uint32_t *num_commits)

  • rdma – RDMA queue GPU handler

  • num_commits – Output parameter; the number of commit operations completed

Info

This function is optional, and it can be used to ensure all the RDMA Send/Write/Read operations have actually been executed before moving forward with the application.
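Putting the strong-mode calls together, the write path can be sketched as follows. This is a minimal illustrative single-thread example, not taken from the DOCA samples: kernel and variable names are placeholders, the DOCA_GPU_RDMA_WRITE_FLAG_NONE enumerator is assumed to exist in the doca_gpu_dev_rdma_write_flags enum, and error checking is omitted.

```cuda
__global__ void rdma_write_strong_kernel(struct doca_gpu_dev_rdma *rdma,
                                         struct doca_gpu_buf *remote_buf,
                                         struct doca_gpu_buf *local_buf,
                                         size_t nbytes)
{
    uint32_t num_commits = 0;

    if (threadIdx.x == 0) {
        /* Strong mode: the library tracks the queue position internally. */
        doca_gpu_dev_rdma_write_strong(rdma, 0, remote_buf, 0, local_buf, 0,
                                       nbytes, 0, DOCA_GPU_RDMA_WRITE_FLAG_NONE);

        /* Consolidate and execute the enqueued request. */
        doca_gpu_dev_rdma_commit_strong(rdma, 0);

        /* Optional: block until the network card has completed all commits. */
        doca_gpu_dev_rdma_wait_all(rdma, &num_commits);
    }
}
```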


doca_gpu_dev_rdma_recv_*

To receive data from an RDMA send, send with immediate, or write with immediate, the destination peer should post a receive operation. DOCA GPUNetIO RDMA receive operations must be done with a doca_gpu_dev_rdma_r handler. This handler can be obtained with the function doca_gpu_dev_rdma_get_recv.

Note

All receive operations must use this object.

__device__ doca_error_t
doca_gpu_dev_rdma_get_recv(struct doca_gpu_dev_rdma *rdma, struct doca_gpu_dev_rdma_r **rdma_r)

  • rdma – RDMA queue GPU handler

  • rdma_r – RDMA receive queue GPU handler

For the receive side as well, DOCA GPUNetIO offers strong and weak modes for enqueuing operations on the RDMA queue. For both modes, the scope is the single CUDA thread.

__device__ doca_error_t
doca_gpu_dev_rdma_recv_strong(struct doca_gpu_dev_rdma_r *rdma_r,
                              struct doca_gpu_buf *recv_buf,
                              size_t recv_length,
                              uint64_t recv_offset,
                              const uint32_t flags_bitmask)

  • rdma_r – RDMA receive queue GPU handler

  • recv_buf – Local DOCA buffer from a DOCA GPU buffer array to receive data into

  • recv_length – Maximum number of bytes to receive

  • recv_offset – Offset, in bytes, in the local buffer at which to store received data

  • flags_bitmask – Must be 0; reserved for future use

__device__ doca_error_t
doca_gpu_dev_rdma_recv_weak(struct doca_gpu_dev_rdma_r *rdma_r,
                            struct doca_gpu_buf *recv_buf,
                            size_t recv_length,
                            uint64_t recv_offset,
                            const uint32_t flags_bitmask,
                            uint32_t position);

  • rdma_r – RDMA receive queue GPU handler

  • recv_buf – Local DOCA buffer from a DOCA GPU buffer array to receive data into

  • recv_length – Maximum number of bytes to receive

  • recv_offset – Offset, in bytes, in the local buffer at which to store received data

  • flags_bitmask - Must be 0; reserved for future use

  • position – Position in the queue to place the RDMA operation. Range: 0 - mask_max_position.

doca_gpu_dev_rdma_recv_commit_*

After posting several RDMA receive operations, a commit function must be invoked to activate the receives in the queue. Only one CUDA thread at a time can invoke this function.

__device__ doca_error_t
doca_gpu_dev_rdma_recv_commit_strong(struct doca_gpu_dev_rdma_r *rdma_r)

  • rdma_r – RDMA receive queue GPU handler

__device__ doca_error_t
doca_gpu_dev_rdma_recv_commit_weak(struct doca_gpu_dev_rdma_r *rdma_r, uint32_t num_ops)

  • rdma_r – RDMA receive queue GPU handler

  • num_ops – Number of RDMA receive requests enqueued since the last commit

doca_gpu_dev_rdma_recv_wait_all

This function waits for the completion of all previously posted RDMA receive operations. Only one CUDA thread at a time can invoke this function. It works in blocking or non-blocking mode.

enum doca_gpu_dev_rdma_recv_wait_flags {
    DOCA_GPU_RDMA_RECV_WAIT_FLAG_NB = 0, /**< Non-Blocking mode: the wait receive function
                                          * doca_gpu_dev_rdma_recv_wait checks if the receive
                                          * operation happened (data has been received) and exits
                                          * from the function. If nothing has been received,
                                          * the function doesn't block the execution. */
    DOCA_GPU_RDMA_RECV_WAIT_FLAG_B = 1,  /**< Blocking mode: the wait receive function
                                          * doca_gpu_dev_rdma_recv_wait blocks the execution
                                          * waiting for the receive operations to be executed. */
};

Function:

__device__ doca_error_t
doca_gpu_dev_rdma_recv_wait_all(struct doca_gpu_dev_rdma_r *rdma_r,
                                const enum doca_gpu_dev_rdma_recv_wait_flags flags,
                                uint32_t *num_ops,
                                uint32_t *imm_val)

  • rdma_r – RDMA receive queue GPU handler

  • flags – Wait mode from the doca_gpu_dev_rdma_recv_wait_flags enum (blocking or non-blocking)

  • num_ops – Output parameter. Function reports number of completed operations.

  • imm_val – Output parameter. Application-provided buffer where the function can store received immediate values, if any (or 0xFFFFFFFF if no immediate value is received). If nullptr, the function ignores this parameter.
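Combining the receive-side calls above, a minimal single-thread sketch of the flow could look as follows. This is illustrative only (kernel and variable names are placeholders; error checking is omitted); all function and enumerator names come from the signatures listed in this section.

```cuda
__global__ void rdma_recv_kernel(struct doca_gpu_dev_rdma *rdma,
                                 struct doca_gpu_buf *recv_buf,
                                 size_t buf_bytes)
{
    struct doca_gpu_dev_rdma_r *rdma_r;
    uint32_t num_ops = 0;
    uint32_t imm_val = 0;

    if (threadIdx.x == 0) {
        /* All receive operations go through the dedicated receive handler. */
        doca_gpu_dev_rdma_get_recv(rdma, &rdma_r);

        /* Post one receive buffer and activate it in the queue. */
        doca_gpu_dev_rdma_recv_strong(rdma_r, recv_buf, buf_bytes, 0, 0);
        doca_gpu_dev_rdma_recv_commit_strong(rdma_r);

        /* Block until data arrives; imm_val holds an immediate value, if any. */
        doca_gpu_dev_rdma_recv_wait_all(rdma_r, DOCA_GPU_RDMA_RECV_WAIT_FLAG_B,
                                        &num_ops, &imm_val);
    }
}
```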

This section provides a list of DOCA GPUNetIO functions that can be used on the GPU only within a CUDA kernel to execute DMA operations.

doca_gpu_dev_dma_memcpy

This function allows a CUDA kernel to trigger a DMA memory copy operation through the DMA GPU engine. There is no strong/weak distinction here; the DMA engine assumes strong behavior by default.

__device__ doca_error_t
doca_gpu_dev_dma_memcpy(struct doca_gpu_dma *dma,
                        struct doca_gpu_buf *src_buf,
                        uint64_t src_offset,
                        struct doca_gpu_buf *dst_buf,
                        uint64_t dst_offset,
                        size_t length);

  • dma – DMA queue GPU handler

  • src_buf – memcpy source buffer

  • src_offset – fetch data starting from this source buffer offset

  • dst_buf – memcpy destination buffer

  • dst_offset – copy data starting from this destination buffer offset

  • length – Number of bytes to copy

doca_gpu_dev_dma_commit

After posting several DMA memory copies, a commit function must be invoked to execute the operations enqueued in the DMA queue. Only one CUDA thread at a time can invoke this function.

__device__ doca_error_t
doca_gpu_dev_dma_commit(struct doca_gpu_dma *dma);

  • dma – DMA queue GPU handler
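The memcpy-then-commit flow can be sketched as follows. This is a minimal illustrative single-thread example (kernel and variable names are placeholders; error checking is omitted), using only the two signatures listed above.

```cuda
__global__ void dma_copy_kernel(struct doca_gpu_dma *dma,
                                struct doca_gpu_buf *src_buf,
                                struct doca_gpu_buf *dst_buf,
                                size_t nbytes)
{
    if (threadIdx.x == 0) {
        /* Enqueue a copy of nbytes from offset 0 of src_buf to offset 0 of dst_buf. */
        doca_gpu_dev_dma_memcpy(dma, src_buf, 0, dst_buf, 0, nbytes);

        /* Execute the operations enqueued in the DMA queue. */
        doca_gpu_dev_dma_commit(dma);
    }
}
```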

This section lists the DOCA GPUNetIO functions that can be used within a CUDA kernel to execute DOCA Comch Producer or Consumer operations. For more information on the Producer/Consumer model, refer to the DOCA Comch documentation.

Note

Only one CUDA thread can invoke these functions per Producer or Consumer object.

Consumer: doca_dev_gpu_comch_consumer_post_recv

This function allows a DOCA Comch Consumer to post a buffer to receive a message from a remote producer.

__device__ doca_error_t
doca_dev_gpu_comch_consumer_post_recv(struct doca_comch_gpu_consumer *consumer,
                                      uint32_t mkey,
                                      uintptr_t addr,
                                      uint32_t recv_len);

Parameters:

  • consumer: The Consumer GPU handle.

  • mkey: The mkey associated with the buffer to be posted.

  • addr: The address of the buffer to receive the message into.

  • recv_len: The maximum message size that can be written to the buffer.

Consumer: doca_dev_gpu_comch_consumer_recv_wait

This function allows a DOCA Comch Consumer to wait for a message to be received from a remote producer.

__device__ doca_error_t
doca_dev_gpu_comch_consumer_recv_wait(struct doca_comch_gpu_consumer *consumer,
                                      uint32_t *mkey,
                                      uintptr_t *addr,
                                      uint32_t *recv_len,
                                      uint8_t **imm_data,
                                      uint32_t *imm_data_len,
                                      const enum doca_gpu_dev_comch_wait_flags flags);

Parameters:

  • consumer: The Consumer GPU handle.

  • mkey [out]: The mkey of the buffer that the message has been received into.

  • addr [out]: The address of the buffer that the message has been received into.

  • recv_len [out]: The length of the message that has been received.

  • imm_data: Pointer to a buffer for any immediate data included with the message.

  • imm_data_len [in/out]:

    • Input: The maximum size of immediate data that can be received.

    • Output: The actual size of immediate data received.

  • flags: A doca_gpu_dev_comch_wait_flags value (e.g., blocking or non-blocking).
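The two Consumer calls above combine into a post-then-wait flow. The sketch below is illustrative only: kernel and variable names are placeholders, the choice of NULL/0 for the immediate-data arguments assumes no immediate data is expected, and error checking is omitted.

```cuda
__global__ void comch_consumer_kernel(struct doca_comch_gpu_consumer *consumer,
                                      uint32_t post_mkey, uintptr_t post_addr,
                                      uint32_t post_len)
{
    uint32_t mkey, recv_len;
    uintptr_t addr;
    uint8_t *imm_data = NULL;
    uint32_t imm_data_len = 0; /* assumption: no immediate data expected */

    /* Only one CUDA thread may drive a given Consumer object. */
    if (threadIdx.x == 0) {
        /* Expose a buffer to the remote producer... */
        doca_dev_gpu_comch_consumer_post_recv(consumer, post_mkey, post_addr, post_len);

        /* ...and block until a message lands in it. */
        doca_dev_gpu_comch_consumer_recv_wait(consumer, &mkey, &addr, &recv_len,
                                              &imm_data, &imm_data_len,
                                              DOCA_GPU_COMCH_WAIT_FLAG_B);
    }
}
```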

Producer: doca_dev_gpu_comch_producer_send

This function allows a DOCA Comch Producer to send a message to a remote consumer.

__device__ doca_error_t
doca_dev_gpu_comch_producer_send(struct doca_comch_gpu_producer *producer,
                                 uint32_t mkey,
                                 uintptr_t addr,
                                 uint32_t send_len,
                                 uint8_t *imm_data,
                                 uint32_t imm_data_len,
                                 uint32_t consumer_id,
                                 uint64_t user_msg_id);

Parameters:

  • producer: The Producer GPU handle.

  • mkey: The mkey associated with the buffer to be sent.

  • addr: The address of the buffer to be sent.

  • send_len: The number of bytes to send from the buffer.

  • imm_data: Immediate data to include with the message.

  • imm_data_len: Size of the immediate data to include.

  • consumer_id: The ID of the remote consumer to send the message to.

  • user_msg_id: A user-defined ID that will be returned by doca_dev_gpu_comch_producer_poll when this message has been successfully sent.

Producer: doca_dev_gpu_comch_producer_poll

This function allows a DOCA Comch Producer to poll for the completion of a send operation.

__device__ doca_error_t
doca_dev_gpu_comch_producer_poll(struct doca_comch_gpu_producer *producer,
                                 uint64_t *user_msg_id,
                                 const enum doca_gpu_dev_comch_wait_flags flags);

Parameters:

  • producer: The Producer GPU handle.

  • user_msg_id [out]: The user_msg_id that was set in the corresponding doca_dev_gpu_comch_producer_send function.

  • flags: A doca_gpu_dev_comch_wait_flags value (e.g., blocking or non-blocking).
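The Producer side pairs a send with a poll for its completion. The sketch below is illustrative only: kernel and variable names are placeholders, the message is sent without immediate data (NULL/0) as an assumption, and error checking is omitted.

```cuda
__global__ void comch_producer_kernel(struct doca_comch_gpu_producer *producer,
                                      uint32_t mkey, uintptr_t addr,
                                      uint32_t send_len, uint32_t consumer_id)
{
    uint64_t done_msg_id = 0;

    /* Only one CUDA thread may drive a given Producer object. */
    if (threadIdx.x == 0) {
        /* Send the buffer without immediate data, tagged with user ID 1. */
        doca_dev_gpu_comch_producer_send(producer, mkey, addr, send_len,
                                         NULL, 0, consumer_id, 1);

        /* Block until the send completes; done_msg_id returns the tag. */
        doca_dev_gpu_comch_producer_poll(producer, &done_msg_id,
                                         DOCA_GPU_COMCH_WAIT_FLAG_B);
    }
}
```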

doca_gpu_dev_comch_wait_flags

This enum is used as the flags parameter in both the doca_dev_gpu_comch_consumer_recv_wait and doca_dev_gpu_comch_producer_poll functions. It determines how the function behaves if no operation has completed.

enum doca_gpu_dev_comch_wait_flags {
    DOCA_GPU_COMCH_WAIT_FLAG_NB = 0, /**< Non-Blocking mode: the wait function checks if any of
                                      * the operations have been completed (data has been
                                      * sent/received) and exits from the function. If nothing
                                      * has been sent/received, the function doesn't block the
                                      * execution. */
    DOCA_GPU_COMCH_WAIT_FLAG_B = 1,  /**< Blocking mode: the wait function blocks the execution
                                      * waiting for a send/receive operation to be completed. */
};


© Copyright 2026, NVIDIA. Last updated on Feb 27, 2026