DOCA Documentation v3.1.0

DOCA DPA Verbs

The DOCA DPA Device Verbs library provides RDMA support for DPA applications, enabling high-performance RDMA operations (Send, Receive, Read, Write, Atomic) to run directly on the DPA device without CPU involvement.

The library follows an RDMA-core/ibverbs-like API design pattern, making it intuitive for developers experienced with traditional RDMA programming. It maintains the same conceptual model of work requests, queue pairs, and completion queues but is optimized for execution within DPA kernels.

Work Request Model

The library uses a work request (WR) based model for posting RDMA operations:

  • Send Work Requests (doca_dpa_dev_verbs_send_wr): For outbound RDMA operations, including Send, RDMA Write, RDMA Read, and Atomic operations.

  • Receive Work Requests (doca_dpa_dev_verbs_recv_wr): For posting receive buffers for incoming data.

Key Features

  • DPA-native RDMA operations: Execute RDMA operations directly on the DPA device.

  • RDMA-core compatible interface: Familiar API design following rdma-core/ibverbs patterns.

  • Work request based model: Post send and receive work requests similar to traditional RDMA programming.

  • Support for QP and SRQ: Post work requests to Queue Pairs (QPs) and Shared Receive Queues (SRQs).

  • Scatter-gather (SG) operations: Support complex memory layouts with multiple regions per operation.

The DOCA DPA Verbs library and header files are:

  • libdoca_dpa_dev_verbs.a (DOCA Libs)

  • doca_dpa_dev_verbs.h (DOCA Includes)

[Figure: DOCA DPA Verbs deployment]

Note

The DOCA DPA SDK does not implement multi-thread synchronization primitives, and all DOCA DPA objects are non-thread-safe. Developers must ensure that both the user program and kernels are designed to prevent race conditions.

[Figure: DOCA DPA Verbs architecture]

Component breakdown:

  1. Host-side Libraries:

    • DOCA Verbs Library: Handles QP/SRQ creation, configuration, and lifecycle management.

    • DOCA DPA Library: Manages DPA context creation, DPA completion contexts, threads, memory management, and overall DPA orchestration.

  2. DPA Device Libraries:

    • DPA Device Verbs Library: Device-side library for direct RDMA operations.

    • DPA Device Library: Device-side completion context processing and completion element handling.

  3. DPA Hardware: The underlying DPA device that executes operations.

DPA Device Handles

The DOCA DPA Verbs library uses specific handles to represent DPA device resources:

  • doca_dpa_dev_verbs_qp_t: Handle for a Queue Pair on the DPA device.

  • doca_dpa_dev_verbs_srq_t: Handle for a Shared Receive Queue on the DPA device.

  • doca_dpa_dev_completion_t: Handle for a Completion Queue on the DPA device.

Work Request Structures

The library defines structures for work requests:

  • doca_dpa_dev_verbs_send_wr: Structure for send work requests.

  • doca_dpa_dev_verbs_recv_wr: Structure for receive work requests.

  • doca_dpa_dev_verbs_sge: SG element for describing memory regions.

These structures are used to configure and post work requests to Queue Pairs or Shared Receive Queues.

Operation Types

The DOCA DPA Verbs library supports various RDMA operation types:

  • Send operations: SEND and SEND_WITH_IMM

  • RDMA operations: RDMA_WRITE, RDMA_WRITE_WITH_IMM, and RDMA_READ

  • Atomic operations: ATOMIC_FETCH_ADD

DPA Completion Context Management

The library provides a completion context management system that spans both host and device sides:

  • Host-side completion contexts (struct doca_dpa_completion): Managed by the DOCA DPA Library, with configurable queue size, thread attachment, and lifecycle control.

  • Device-side completion context processing: Utilizes the doca_dpa_dev_completion_t handle for processing completions on the DPA device.

  • DPA completion elements (doca_dpa_dev_completion_element_t): Represent individual completion events containing metadata about completed operations.

  • Supported completion types include send, receive (RDMA Write with Immediate, Send, Send with Immediate), and error completions.

Enumerations and Constants

The doca_dpa_dev_verbs.h header file contains complete enum definitions and values.

Send Work Request Opcodes

The enum doca_dpa_dev_verbs_send_wr_opcode defines RDMA operation types:

  • WRITE: One-sided RDMA write operation.

  • WRITE_WITH_IMM: RDMA write with immediate data, generating a completion on the receiver.

  • SEND: Two-sided send operation requiring a posted receive buffer on the remote side.

  • SEND_WITH_IMM: Send operation with immediate data attached.

  • READ: One-sided RDMA read operation retrieving data from remote memory.

  • ATOMIC_FETCH_ADD: Atomic fetch-and-add operation on a remote memory location.
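The opcodes above split into one-sided operations, which name a remote address and key, and operations that consume a receive posted by the peer. The following host-side C sketch models that split. The enum model_opcode and its helper functions are hypothetical stand-ins, not part of doca_dpa_dev_verbs.h; treating WRITE_WITH_IMM as consuming a remote receive follows the receive completion types listed under DPA Completion Context Management and standard RDMA semantics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the opcode names above; the real enumerator
 * names and values are defined in doca_dpa_dev_verbs.h. */
enum model_opcode {
    MODEL_OP_WRITE,
    MODEL_OP_WRITE_WITH_IMM,
    MODEL_OP_SEND,
    MODEL_OP_SEND_WITH_IMM,
    MODEL_OP_READ,
    MODEL_OP_ATOMIC_FETCH_ADD
};

/* One-sided opcodes target a remote address/rkey pair, so the
 * rdma_* or atomic_* setters must be called for them. */
static bool model_needs_remote_addr(enum model_opcode op)
{
    assert((unsigned)op <= MODEL_OP_ATOMIC_FETCH_ADD);
    return op == MODEL_OP_WRITE || op == MODEL_OP_WRITE_WITH_IMM ||
           op == MODEL_OP_READ || op == MODEL_OP_ATOMIC_FETCH_ADD;
}

/* Opcodes that consume a receive posted on the remote side and so
 * produce a receive completion there. */
static bool model_consumes_remote_recv(enum model_opcode op)
{
    return op == MODEL_OP_SEND || op == MODEL_OP_SEND_WITH_IMM ||
           op == MODEL_OP_WRITE_WITH_IMM;
}

/* Immediate data is only meaningful for the *_WITH_IMM opcodes. */
static bool model_carries_imm(enum model_opcode op)
{
    return op == MODEL_OP_SEND_WITH_IMM || op == MODEL_OP_WRITE_WITH_IMM;
}
```

In device code the same distinction shows up as which setters a kernel calls: the RDMA-specific or atomic setters for one-sided opcodes, and set_imm_data() only for the *_WITH_IMM variants.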

Send Work Request Flags

The enum doca_dpa_dev_verbs_send_wr_flags controls work request behavior:

  • SIGNALED: Generates a completion queue entry (CQE) when the operation completes.

  • SOLICITED: Requests a solicited event on the remote side (used with send operations).
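Because only work requests carrying SIGNALED generate a CQE, an application can signal a subset of its requests to reduce completion traffic. The sketch below shows one policy common in general RDMA practice, signaling every period-th request plus the last request of a batch; it is not a policy prescribed by this library. Whether unsignaled requests are retired by a later signaled completion, as in ibverbs, is an assumption to confirm in doca_dpa_dev_verbs.h. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Should work request number idx (0-based) in a batch of batch_len
 * requests carry SIGNALED, when every period-th request and the last
 * request of the batch are signaled? */
static bool model_should_signal(uint32_t idx, uint32_t batch_len, uint32_t period)
{
    assert(period > 0 && idx < batch_len);
    return ((idx + 1) % period == 0) || (idx + 1 == batch_len);
}

/* Number of CQEs such a batch would generate. */
static uint32_t model_cqes_for_batch(uint32_t batch_len, uint32_t period)
{
    uint32_t n = 0;

    for (uint32_t i = 0; i < batch_len; i++)
        n += model_should_signal(i, batch_len, period) ? 1u : 0u;
    return n;
}
```

Always signaling the last request keeps the end of every batch observable, even when the batch length is not a multiple of the period.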

Fence Modes

The enum doca_dpa_dev_verbs_send_wr_fm controls ordering and synchronization between work requests:

  • NO_FENCE: No ordering constraints for maximum performance.

  • INITIATOR_SMALL_FENCE: Light ordering constraint for local operations.

  • FENCE: Standard fence ensuring previous operations complete before this one.

  • STRONG_ORDERING_FENCE: Strongest ordering guarantee for critical operations.

  • FENCE_AND_INITIATOR_SMALL_FENCE: Combined fence modes for specific use cases.

SRQ Types

The enum doca_dpa_dev_verbs_srq_type defines Shared Receive Queue implementation types:

  • LINKED_LIST: SRQ implemented as a linked list structure.

  • CONTIGUOUS: SRQ implemented as a contiguous memory buffer.

SG Element (SGE) Structure

The struct doca_dpa_dev_verbs_sge represents a memory region for data transfer:

  • addr: Virtual address of the memory region (uint64_t).

  • length: Length of the memory region in bytes (uint32_t).

  • lkey: Local key for the memory region (uint32_t).

Note

For SG lists that are not fully occupied, set the lkey field of the last entry to DOCA_DPA_DEV_VERBS_SGE_TERMINATING_LKEY (0x100) to mark the end of the valid entries.
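The terminating-lkey rule can be modeled in plain C. This sketch assumes that the terminator occupies the slot immediately after the last valid entry and that a fully occupied list needs no terminator, as the note implies. struct model_sge mirrors the three documented fields of struct doca_dpa_dev_verbs_sge but is a hypothetical local stand-in; the constant uses the 0x100 value stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value stated above for DOCA_DPA_DEV_VERBS_SGE_TERMINATING_LKEY. */
#define MODEL_SGE_TERMINATING_LKEY 0x100u

/* Hypothetical stand-in with the same fields as struct doca_dpa_dev_verbs_sge. */
struct model_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Count the valid entries in an SG list of capacity max_sge: entries are
 * valid up to the first one carrying the terminating lkey, or up to
 * max_sge when the list is full. Also reports the total byte count. */
static size_t model_count_valid_sge(const struct model_sge *sg, size_t max_sge,
                                    uint64_t *total_bytes)
{
    size_t n = 0;
    uint64_t bytes = 0;

    assert(sg != NULL && total_bytes != NULL);
    while (n < max_sge && sg[n].lkey != MODEL_SGE_TERMINATING_LKEY) {
        bytes += sg[n].length;
        n++;
    }
    *total_bytes = bytes;
    return n;
}
```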

Send Work Request Configuration

Info

See doca_dpa_dev_verbs_send_wr_set_* and doca_dpa_dev_verbs_send_wr_get_* functions in doca_dpa_dev_verbs.h for complete API signatures.

These APIs configure send work requests before posting them to a queue pair. Configuration must be completed before posting.

Operation Type

doca_dpa_dev_verbs_send_wr_set/get_opcode(): Sets/gets the RDMA operation type (SEND, WRITE, READ, ATOMIC_FETCH_ADD), determining the work request's fundamental behavior

Memory Settings

  • doca_dpa_dev_verbs_send_wr_set/get_sg_list(): Sets/gets the SG list pointing to local memory regions containing the data to send

  • doca_dpa_dev_verbs_send_wr_set/get_sg_num_sge(): Sets/gets the number of SG elements in the list

Control and Synchronization

  • doca_dpa_dev_verbs_send_wr_set/get_send_flags(): Sets/gets completion generation and solicited event control flags

  • doca_dpa_dev_verbs_send_wr_set/get_fence_mode(): Sets/gets ordering constraints between work requests

Optional Data

  • doca_dpa_dev_verbs_send_wr_set/get_imm_data(): Sets/gets immediate data for SEND_WITH_IMM and WRITE_WITH_IMM operations

  • doca_dpa_dev_verbs_send_wr_set/get_invalidate_rkey(): Sets/gets remote key to invalidate for SEND_WITH_INV operations

RDMA-specific Configuration

Required for RDMA WRITE and READ operations:

  • doca_dpa_dev_verbs_send_wr_set/get_rdma_remote_addr(): Sets/gets the target memory address on the remote node

  • doca_dpa_dev_verbs_send_wr_set/get_rdma_rkey(): Sets/gets the remote memory key for accessing the target memory region

Atomic Operations Configuration

Required for atomic operations:

  • doca_dpa_dev_verbs_send_wr_set/get_atomic_remote_addr(): Sets/gets the target memory address for the atomic operation

  • doca_dpa_dev_verbs_send_wr_set/get_atomic_rkey(): Sets/gets the remote memory key for the atomic operation

  • doca_dpa_dev_verbs_send_wr_set/get_atomic_compare_add(): Sets/gets the value to add in fetch-and-add operations

  • doca_dpa_dev_verbs_send_wr_set/get_atomic_swap(): Sets/gets the swap value for compare-and-swap operations
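The fetch-and-add semantics can be sketched on the host. This model assumes standard InfiniBand-style behavior: the responder adds the value configured with doca_dpa_dev_verbs_send_wr_set_atomic_compare_add() to a 64-bit word at the remote address, and the word's original value is returned into the requester's local SG buffer. model_fetch_add is hypothetical and does not model atomicity between concurrent requesters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of one ATOMIC_FETCH_ADD: remote_word stands for the memory at
 * atomic_remote_addr, add for the compare_add value, and local_result_buf
 * for the buffer described by the send WR's SG list. */
static uint64_t model_fetch_add(uint64_t *remote_word, uint64_t add,
                                void *local_result_buf)
{
    uint64_t old;

    assert(remote_word != NULL && local_result_buf != NULL);
    old = *remote_word;
    *remote_word = old + add;                    /* wraps modulo 2^64 */
    memcpy(local_result_buf, &old, sizeof(old)); /* original value returned */
    return old;
}
```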

Receive Work Request Configuration

Info

See doca_dpa_dev_verbs_recv_wr_set_* and doca_dpa_dev_verbs_recv_wr_get_* functions in doca_dpa_dev_verbs.h for complete API signatures.

These APIs configure and query receive work requests before posting them to a queue pair or SRQ. Receive work requests prepare buffers to receive incoming data from remote nodes. All configuration must be completed before calling the posting APIs.

Configuration

  • doca_dpa_dev_verbs_recv_wr_set/get_sg_list(): Sets/gets the SG list pointing to local memory regions where incoming data will be stored

  • doca_dpa_dev_verbs_recv_wr_set/get_sg_num_sge(): Sets/gets the number of SG elements in the list

The receive work request configuration is simpler than send work request configuration because receive operations are passive: they only specify where incoming data is stored.
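The placement of incoming data can be modeled as an in-order scatter across the SG entries of one receive request. The sketch below uses plain pointers instead of addr/lkey pairs; model_recv_seg and model_scatter are hypothetical, and treating a message larger than the posted buffers as an error mirrors usual RDMA behavior but is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified receive segment: a pointer and a length instead of addr/lkey. */
struct model_recv_seg {
    uint8_t *buf;
    size_t length;
};

/* Scatter an incoming payload across the posted segments in order.
 * Returns the number of bytes placed, or -1 if the posted buffers are too
 * small for the message. */
static long model_scatter(const struct model_recv_seg *segs, size_t nsegs,
                          const uint8_t *payload, size_t len)
{
    size_t done = 0;

    assert(segs != NULL || nsegs == 0);
    for (size_t i = 0; i < nsegs && done < len; i++) {
        size_t chunk = len - done;

        if (chunk > segs[i].length)
            chunk = segs[i].length;
        memcpy(segs[i].buf, payload + done, chunk);
        done += chunk;
    }
    return done == len ? (long)done : -1;
}
```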

Work Request Posting

Info

See doca_dpa_dev_verbs_qp_post_* functions in doca_dpa_dev_verbs.h for complete API signatures.

This section provides APIs for posting send and receive work requests to queue pairs. All posting functions return a work request counter that can be matched with completion events.

Standard Work Request Posting

  • doca_dpa_dev_verbs_qp_post_send_wr(): Posts a configured send work request to the send queue. Returns the send work request counter for completion tracking.

  • doca_dpa_dev_verbs_qp_post_recv_wr(): Posts a receive work request to the receive queue, preparing the queue to receive incoming data. Returns the receive work request counter for completion tracking.
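The counter returned by the posting calls lets an application associate a completion with the request that produced it. This sketch keeps per-request context in a power-of-two table indexed by that counter. It assumes that each post advances the counter by exactly one, that the completion element reports the same counter (for example through doca_dpa_dev_completion_element_get_wqe_counter()), and that no more than MODEL_SQ_DEPTH requests are outstanding at once. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SQ_DEPTH 8u /* hypothetical queue depth; must be a power of two */

/* Hypothetical bookkeeping keyed by the counter a post call returns. */
struct model_tracker {
    uint32_t next_counter;           /* stands in for the library's counter */
    uint64_t cookie[MODEL_SQ_DEPTH]; /* application context per slot */
};

/* Record context for a newly posted request and return its counter. */
static uint32_t model_post(struct model_tracker *t, uint64_t cookie)
{
    uint32_t counter;

    assert((MODEL_SQ_DEPTH & (MODEL_SQ_DEPTH - 1)) == 0);
    counter = t->next_counter++;
    t->cookie[counter & (MODEL_SQ_DEPTH - 1)] = cookie;
    return counter;
}

/* Look up the context for a completion carrying wqe_counter. */
static uint64_t model_on_completion(const struct model_tracker *t, uint32_t wqe_counter)
{
    return t->cookie[wqe_counter & (MODEL_SQ_DEPTH - 1)];
}
```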

Raw WQE Posting

  • doca_dpa_dev_verbs_qp_post_send_raw_wqe(): Posts a custom-built send Work Queue Element (WQE) directly to hardware, bypassing high-level work request processing.

  • doca_dpa_dev_verbs_qp_post_recv_raw_wqe(): Posts a custom-built receive WQE with specified size.

SG List Usage

When using a SG list (sg_list) that is not fully occupied, set the last entry's lkey field to DOCA_DPA_DEV_VERBS_SGE_TERMINATING_LKEY to indicate the end of valid entries.

Commit Operations

Info

See doca_dpa_dev_verbs_qp_*commit* functions in doca_dpa_dev_verbs.h for complete API signatures.

This section provides APIs for committing posted work requests to hardware, making them available for processing by the DPA device. Work requests must be committed for the hardware to process them.

Standard Commit Operations

  • doca_dpa_dev_verbs_qp_commit_send(): Commits all pending send work requests to hardware with internal memory fence for ordering guarantees.

  • doca_dpa_dev_verbs_qp_commit_recv(): Commits all pending receive work requests to hardware with internal memory fence.

Lightweight Commit Operations

Note

The user must perform memory fence operations before calling these functions.

  • doca_dpa_dev_verbs_qp_lw_commit_send(): Lightweight send commit without internal memory fence. Higher performance but requires manual memory synchronization.

  • doca_dpa_dev_verbs_qp_lw_commit_recv(): Lightweight receive commit without internal memory fence. The user is responsible for proper memory ordering.

Shared Receive Queue (SRQ) Operations

Basic SRQ Operations

  • doca_dpa_dev_verbs_srq_post_recv_wr(): Posts a receive work request to the SRQ, specifying the SRQ type and receive work request structure.

  • doca_dpa_dev_verbs_srq_commit_recv(): Commits pending receive work requests to the SRQ with internal memory fence.

  • doca_dpa_dev_verbs_srq_lw_commit_recv(): Lightweight commit for SRQ receive work requests without internal memory fence. The user is responsible for memory fence operations.

SRQ Raw WQE Operations

doca_dpa_dev_verbs_srq_post_recv_raw_wqe(): Posts a custom-built receive WQE to the SRQ with specified type and size. Follows the same principles as QP raw WQE posting.

SG List Usage

When providing a SG list (sg_list) that is not fully occupied, the user must set the last entry's lkey field to DOCA_DPA_DEV_VERBS_SGE_TERMINATING_LKEY to indicate the end of the valid entries.

SRQ Management

Note

This API is relevant only for Linked-List SRQ.

doca_dpa_dev_verbs_srq_linked_list_ack_wr(): Acknowledges a processed receive work request in a linked-list SRQ. The caller specifies the receive work request counter to acknowledge, as returned by doca_dpa_dev_completion_element_get_wqe_counter().

Query APIs

Note

The following APIs are for debug and inspection purposes only. Do not modify the returned WQ buffers or DBR addresses, as this can cause undefined behavior and system instability.

Queue Pair Query APIs

  • doca_dpa_dev_verbs_qp_get_wq(): Retrieves work queue attributes, including SQ/RQ buffer addresses, entry counts, and receive WQE size. The returned buffers are read-only; modifying them can cause undefined behavior.

  • doca_dpa_dev_verbs_qp_get_dbr_addr(): Returns the doorbell record address for the queue pair. The returned address is read-only; modifying it can cause undefined behavior.

  • doca_dpa_dev_verbs_qp_get_qpn(): Gets the queue pair number.

  • doca_dpa_dev_verbs_qp_get_user_index(): Retrieves the user-assigned index for the queue pair.

SRQ Query APIs

  • doca_dpa_dev_verbs_srq_get_srqn(): Gets the SRQ number.

  • doca_dpa_dev_verbs_srq_get_wq(): Retrieves SRQ work queue attributes, including buffer address, entry count, and WQE size. The returned buffers are read-only; modifying them can cause undefined behavior.

The DPA Device Verbs library integrates with host-side DOCA Verbs to provide a complete RDMA solution.

QP/SRQ Configuration Modes

Two mutually exclusive modes are available for configuring Queue Pairs (QP) and Shared Receive Queues (SRQ) with DPA integration.

Mode 1: DPA Context Integration (Basic)

This mode provides full DPA integration, where the DPA context manages all DPA-related resources:

  • Set the DPA context using doca_verbs_qp_init_attr_set_dpa_ctx() for QPs or doca_verbs_srq_init_attr_set_dpa_ctx() for SRQs.

  • The DPA context automatically handles memory allocation, doorbell records, and user access regions.

  • This is the basic mode for DPA applications because it simplifies resource management.

Mode 2: External Resource Management (Advanced)

This mode provides fine-grained control over DPA resources:

  • Set external user memory (UMEM) for the work queue buffer using doca_verbs_qp_init_attr_set_external_umem() or doca_verbs_srq_init_attr_set_external_umem().

  • Set external doorbell record (DBR) using doca_verbs_qp_init_attr_set_external_dbr_umem() or doca_verbs_srq_init_attr_set_external_dbr_umem().

  • Set external user access region (UAR) using doca_verbs_qp_init_attr_set_external_uar().

  • The application is responsible for manual management of all DPA resources.

  • This mode requires a deep understanding of DPA resource management and memory allocation.

  • It is used for scenarios requiring custom resource allocation strategies and full user control.

Notes:

  • These two modes are mutually exclusive; you cannot mix DPA context mode with external resource mode.

  • Mode 1 is the basic mode for applications.

  • Mode 2 should only be used when custom resource management is required.

  • All resources in Mode 2 must be properly aligned and configured according to DPA hardware requirements.

Host-side Setup Requirements

For Mode 1 (DPA Context Integration)

  1. Create a DOCA Verbs context and protection domain.

  2. Retrieve the DOCA device from the Verbs PD.

  3. Create a DPA context using doca_dpa_create().

  4. Configure queue pair initialization attributes.

  5. Set the DPA context for the queue pair/SRQ.

  6. Create and configure DPA completion contexts.

  7. Associate completion contexts with queue pairs.

For Mode 2 (External Resource Management)

  1. Create a DOCA Verbs context and protection domain.

  2. Retrieve the DOCA device from the Verbs PD.

  3. Allocate and configure external WQ UMEM.

  4. Allocate and configure external DBR UMEM.

  5. Allocate and configure external UAR.

  6. Configure queue pair initialization attributes.

  7. Set external resources for the queue pair/SRQ.

  8. Create and configure DPA completion contexts.

  9. Associate completion contexts with queue pairs.

Basic DPA QP Setup Pattern


// Create DOCA Verbs context and PD
doca_verbs_context_create(devinfo, 0, &verbs_ctx);
doca_verbs_pd_create(verbs_ctx, &pd);

// Retrieve DOCA device from Verbs PD
doca_verbs_pd_as_doca_dev(pd, &dev);

// Create DPA context and completion contexts
doca_dpa_create(dev, &dpa_ctx);
doca_dpa_completion_create(dpa_ctx, 256, &send_completion);
doca_dpa_completion_create(dpa_ctx, 256, &recv_completion);

// Configure QP with DPA completions
doca_verbs_qp_init_attr_create(&qp_init_attr);
doca_verbs_qp_init_attr_set_pd(qp_init_attr, pd);
doca_verbs_qp_init_attr_set_external_datapath_en(qp_init_attr, 1);
doca_verbs_qp_init_attr_set_dpa_ctx(qp_init_attr, dpa_ctx);
doca_verbs_qp_init_attr_set_send_dpa_completion(qp_init_attr, send_completion);
doca_verbs_qp_init_attr_set_receive_dpa_completion(qp_init_attr, recv_completion);

// Create queue pair
doca_verbs_qp_create(verbs_ctx, qp_init_attr, &qp);

// Get DPA handles
doca_dpa_dev_verbs_qp_t dpa_qp;
doca_dpa_dev_completion_t send_comp, recv_comp;

doca_verbs_qp_get_dpa_handle(qp, dpa_ctx, &dpa_qp);
doca_dpa_completion_get_dpa_handle(send_completion, &send_comp);
doca_dpa_completion_get_dpa_handle(recv_completion, &recv_comp);


Advanced DPA QP Setup Pattern


// Create DOCA Verbs context and PD
doca_verbs_context_create(devinfo, 0, &verbs_ctx);
doca_verbs_pd_create(verbs_ctx, &pd);

// Retrieve DOCA device from Verbs PD
doca_verbs_pd_as_doca_dev(pd, &dev);

// Create DPA context and completion contexts
doca_dpa_create(dev, &dpa_ctx);
doca_dpa_completion_create(dpa_ctx, 256, &send_completion);
doca_dpa_completion_create(dpa_ctx, 256, &recv_completion);

// Allocate external UMEM for the QP work queue and DBR using DPA heap addresses
uint64_t qp_wq_dpa_addr = doca_dpa_mem_alloc(dpa_ctx, qp_umem_size);
doca_umem_dpa_create(dpa_ctx, qp_wq_dpa_addr, &qp_umem);

uint64_t dbr_dpa_addr = doca_dpa_mem_alloc(dpa_ctx, qp_dbr_umem_size);
doca_umem_dpa_create(dpa_ctx, dbr_dpa_addr, &qp_dbr_umem);

// Create UAR for doorbell access
doca_uar_dpa_create(dpa_ctx, &dpa_uar);

// Configure QP with external resources (no DPA context is set in this mode)
doca_verbs_qp_init_attr_create(&qp_init_attr);
doca_verbs_qp_init_attr_set_pd(qp_init_attr, pd);
doca_verbs_qp_init_attr_set_external_datapath_en(qp_init_attr, 1);
doca_verbs_qp_init_attr_set_external_umem(qp_init_attr, qp_umem, 0);
doca_verbs_qp_init_attr_set_external_dbr_umem(qp_init_attr, qp_dbr_umem, 0);
doca_verbs_qp_init_attr_set_external_uar(qp_init_attr, dpa_uar);
doca_verbs_qp_init_attr_set_send_dpa_completion(qp_init_attr, send_completion);
doca_verbs_qp_init_attr_set_receive_dpa_completion(qp_init_attr, recv_completion);

// Create queue pair
doca_verbs_qp_create(verbs_ctx, qp_init_attr, &qp);

// Get DPA handles (each handle is declared once)
doca_dpa_dev_verbs_qp_t dpa_qp;
doca_dpa_dev_completion_t send_comp, recv_comp;

doca_verbs_qp_get_dpa_handle(qp, dpa_ctx, &dpa_qp);
doca_dpa_completion_get_dpa_handle(send_completion, &send_comp);
doca_dpa_completion_get_dpa_handle(recv_completion, &recv_comp);


SRQ Setup Pattern


// Create SRQ with DPA context
doca_verbs_srq_init_attr_create(&srq_init_attr);
doca_verbs_srq_init_attr_set_pd(srq_init_attr, pd);
doca_verbs_srq_init_attr_set_dpa(srq_init_attr, dpa_ctx);
doca_verbs_srq_create(verbs_ctx, srq_init_attr, &srq);

// Get the SRQ DPA handle; receive work requests are then posted from DPA kernels
doca_dpa_dev_verbs_srq_t dpa_srq;
doca_verbs_srq_get_dpa_handle(srq, dpa_ctx, &dpa_srq);


RDMA Write Operation (Data path example)


#include <doca_dpa_dev.h>
#include <doca_dpa_dev_verbs.h>

__dpa_global__ void dpa_rdma_write_example(doca_dpa_dev_verbs_qp_t qp_handle,
                                           uint64_t local_addr,
                                           uint32_t length,
                                           uint32_t lkey,
                                           uint64_t remote_addr,
                                           uint32_t rkey)
{
    struct doca_dpa_dev_verbs_send_wr send_wr;
    struct doca_dpa_dev_verbs_sge sge;

    // Configure scatter-gather element (a single local buffer)
    sge.addr = local_addr;
    sge.length = length;
    sge.lkey = lkey;

    // Configure RDMA write work request
    doca_dpa_dev_verbs_send_wr_set_opcode(&send_wr, DOCA_DPA_DEV_VERBS_SEND_WR_OPCODE_WRITE);
    doca_dpa_dev_verbs_send_wr_set_sg_list(&send_wr, &sge);
    doca_dpa_dev_verbs_send_wr_set_sg_num_sge(&send_wr, 1);
    doca_dpa_dev_verbs_send_wr_set_rdma_remote_addr(&send_wr, remote_addr);
    doca_dpa_dev_verbs_send_wr_set_rdma_rkey(&send_wr, rkey);
    doca_dpa_dev_verbs_send_wr_set_fence_mode(&send_wr, DOCA_DPA_DEV_VERBS_SEND_WR_FM_NO_FENCE);
    doca_dpa_dev_verbs_send_wr_set_send_flags(&send_wr, DOCA_DPA_DEV_VERBS_SEND_WR_FLAGS_SIGNALED);

    // Post and commit; wr_count can be matched against the counter in the resulting completion
    uint32_t wr_count = doca_dpa_dev_verbs_qp_post_send_wr(qp_handle, &send_wr);
    (void)wr_count;
    doca_dpa_dev_verbs_qp_commit_send(qp_handle);
}


For better performance:

  • Use compile-time constant values for the opcode parameter in doca_dpa_dev_verbs_send_wr_set_opcode(). This allows the compiler to optimize the code path.

  • Use compile-time constant values for the srq_type parameter in doca_dpa_dev_verbs_srq_post_recv_wr(). This enables the compiler to generate more efficient code paths for SRQ operations.

Example: Use enum values like DOCA_DPA_DEV_VERBS_SEND_WR_OPCODE_WRITE, DOCA_DPA_DEV_VERBS_SRQ_TYPE_LINKED_LIST, or DOCA_DPA_DEV_VERBS_SRQ_TYPE_CONTIGUOUS as literal constants instead of passing them through variables.
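The benefit of literal constants can be seen with any static inline function that switches on its argument. In this sketch, model_ctrl_bits is a hypothetical stand-in for a setter such as doca_dpa_dev_verbs_send_wr_set_opcode(), and its return values are arbitrary placeholders. When the argument is a literal, an optimizing compiler can resolve the switch at compile time; when it comes from a variable, every case must be kept in the generated code.

```c
#include <assert.h>
#include <stdint.h>

enum model_op { MODEL_OP_SEND, MODEL_OP_WRITE, MODEL_OP_READ };

/* Hypothetical setter body that branches on the opcode.
 * The returned values are placeholders, not real WQE encodings. */
static inline uint32_t model_ctrl_bits(enum model_op op)
{
    switch (op) {
    case MODEL_OP_SEND:  return 1u;
    case MODEL_OP_WRITE: return 2u;
    case MODEL_OP_READ:  return 3u;
    }
    assert(!"unknown opcode");
    return 0;
}

/* Preferred: the opcode is a literal, so the switch can fold away. */
static uint32_t model_build_write(void)
{
    return model_ctrl_bits(MODEL_OP_WRITE);
}

/* Works, but the opcode is only known at run time, so all cases remain. */
static uint32_t model_build_any(enum model_op op)
{
    return model_ctrl_bits(op);
}
```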

© Copyright 2025, NVIDIA. Last updated on Aug 25, 2025.