Date: 2026-02-02
On This Page
- Introduction
- Prerequisites
- Library Changes From Previous Releases
- Environment
- Architecture
- Configuration Phase
- Execution Phase
- State Machine
- Alternative Datapath Options
- DOCA DMA Samples
DOCA DMA
This guide provides instructions on building and developing applications that require copying memory using Direct Memory Access (DMA).
Introduction
DOCA DMA provides an API to copy data between DOCA buffers using hardware acceleration, supporting both local and remote memory regions.
The library provides an API for executing DMA operations on DOCA buffers, where these buffers reside either in local memory (i.e., within the same host) or host memory accessible by the DPU. See DOCA Core for more information about the memory subsystem.
Using DOCA DMA, complex memory copy operations can be easily executed in an optimized, hardware-accelerated manner.
This document is intended for software developers wishing to accelerate their application's memory I/O operations and access memory that is not local to the host.
Prerequisites
This library follows the architecture of a DOCA Core Context. It is recommended to read the DOCA Core Context documentation before proceeding.
Environment
DOCA DMA-based applications can run either on the host machine or on the NVIDIA® BlueField® DPU target.
Copying from the host to the DPU, and vice versa, only works when the DPU is configured to run in DPU mode, as described in BlueField Modes of Operation.
Architecture
DOCA DMA is a DOCA Context as defined by DOCA Core. See DOCA Core Context for more information.
DOCA DMA leverages DOCA Core architecture to expose asynchronous tasks/events that are offloaded to hardware.
DMA can be used to copy data as follows:
Copying from local memory to local memory:
Using DPU to copy memory between host and DPU:
Using host to copy memory between host and DPU:
Objects
Device and Device Representor
The DMA library needs a DOCA device to operate. The device is used to access memory and perform the actual copy. See DOCA Core Device Discovery.
Within the same BlueField DPU, it does not matter which device is used (PF/VF/SF), as all of these devices use the same hardware component. If there are multiple DPUs, it is possible to create one DMA instance per DPU, providing each instance with a device from a different DPU.
To access memory that is not local (from the host to the DPU or vice versa), the DPU side of the application must select a device with an appropriate representor. See DOCA Core Device Representor Discovery.
The device must stay valid for as long as the DMA instance is not destroyed.
Memory Buffers
The memory copy task requires two DOCA buffers: one for the source and one for the destination. To choose an inventory that matches the allocation pattern of the buffers, refer to the table in the "Inventory Types" section. To find which kinds of memory are supported, refer to the table in section "Buffer Support".
Buffers must not be modified or read during the memory copy operation.
Configuration Phase
To start using the library, users must go through a configuration phase as described in DOCA Core Context Configuration Phase.
This section describes how to configure and start the context, to allow execution of tasks and retrieval of events.
Configurations
The context can be configured to match the application use case.
To find if a configuration is supported, or what the min/max value for it is, refer to section "Device Support".
Mandatory Configurations
These configurations are mandatory and must be set by the application before attempting to start the context:
At least one task/event type must be configured. See configuration of tasks and/or events in sections "Tasks" and "Events" respectively for information.
A device with appropriate support must be provided upon creation.
Device Support
DOCA DMA requires a device to operate. To pick a device, refer to "DOCA Core Device Discovery".
As device capabilities may change (see DOCA Core Device Support), it is recommended to select your device using the following method:
doca_dma_cap_task_memcpy_is_supported
Some devices can allow different capabilities as follows:
The maximum number of tasks
The maximum buffer size
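The selection logic described above can be sketched without the DOCA headers. The following is only an illustration under stated assumptions: struct sketch_dma_caps and both functions are hypothetical names, and the three fields stand for values an application would obtain from the capability queries (for example, doca_dma_cap_task_memcpy_is_supported for the first field). No DOCA call is made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of one device's DMA capabilities (not a DOCA type). */
struct sketch_dma_caps {
    bool memcpy_supported;  /* result of the memcpy support query */
    uint32_t max_num_tasks; /* maximum number of tasks */
    uint64_t max_buf_size;  /* maximum buffer size, in bytes */
};

/* True when the device can serve the requested task count and buffer size. */
bool sketch_caps_satisfy(const struct sketch_dma_caps *caps,
                         uint32_t num_tasks, uint64_t buf_size)
{
    assert(caps != NULL);
    return caps->memcpy_supported &&
           num_tasks <= caps->max_num_tasks &&
           buf_size <= caps->max_buf_size;
}

/* Returns the index of the first suitable device, or -1 if none qualifies. */
int sketch_pick_device(const struct sketch_dma_caps *devs, size_t count,
                       uint32_t num_tasks, uint64_t buf_size)
{
    for (size_t i = 0; i < count; i++) {
        if (sketch_caps_satisfy(&devs[i], num_tasks, buf_size))
            return (int)i;
    }
    return -1;
}
```

Checking every requirement before creating the context keeps a capability mismatch from surfacing later as a failure to start.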
Buffer Support
Tasks support buffers with the following features:
Buffer Type | Source Buffer | Destination Buffer
Local mmap buffer | Yes | Yes
mmap from PCIe export buffer | Yes | Yes
mmap from RDMA export buffer | No | No
Linked list buffer | Yes | No
Execution Phase
This section describes execution on the CPU using the DOCA Core Progress Engine.
Tasks
DOCA DMA exposes asynchronous tasks that leverage the DPU hardware according to the DOCA Core architecture. See DOCA Core Task.
Memory Copy Task
The memory copy task copies memory from one location to another, using buffers as described in Buffer Support.
Task Configuration
Description | API to set the configuration | API to query support
Enable the task | |
Number of tasks | |
Maximal buffer size | – |
Maximum buffer list size | – |
Task Input
Common input as described in DOCA Core Task.
Name | Description | Notes
Source buffer | Buffer that points to the memory to be copied | Only the data residing in the data segment is copied
Destination buffer | Buffer that points to where memory is copied | The data is copied to the tail segment, extending the data segment
Task Output
Common output as described in DOCA Core Task.
Task Completion Success
After the task is completed successfully:
The data is copied from source to destination
The destination buffer data segment is extended to include the copied data
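The input and completion semantics above (only the source data segment is copied, it lands in the destination tail segment, and the destination data segment grows to cover it) can be modeled in plain C. This is a sketch only: struct sketch_buf is a hypothetical stand-in, not the real doca_buf type, and sketch_memcpy_task performs an ordinary memcpy rather than a hardware DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DOCA buffer (not the real doca_buf type). */
struct sketch_buf {
    unsigned char *head; /* start of the underlying memory */
    size_t total_len;    /* size of the underlying memory */
    size_t data_off;     /* offset of the data segment from head */
    size_t data_len;     /* length of the data segment */
};

/* Bytes available in the tail segment, i.e. after the data segment. */
size_t sketch_tail_len(const struct sketch_buf *b)
{
    return b->total_len - (b->data_off + b->data_len);
}

/* Models a successful memory copy task: copies the source data segment
 * into the destination tail segment, then extends the destination data
 * segment. Returns 0 on success, or -1 (destination untouched) when the
 * tail segment is too small. */
int sketch_memcpy_task(const struct sketch_buf *src, struct sketch_buf *dst)
{
    assert(src->data_off + src->data_len <= src->total_len);
    assert(dst->data_off + dst->data_len <= dst->total_len);
    if (src->data_len > sketch_tail_len(dst))
        return -1;
    memcpy(dst->head + dst->data_off + dst->data_len,
           src->head + src->data_off, src->data_len);
    dst->data_len += src->data_len;
    return 0;
}
```

In this model, a destination that already holds 2 bytes of data and receives a 3-byte source ends with a 5-byte data segment, while the source is left unchanged.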
Task Completion Failure
If the task fails midway:
The context may enter stopping state, if a fatal error occurs
The source and destination doca_buf objects are not modified
The destination buffer contents may be modified
Task Limitations
The operation is not atomic
Once the task has been submitted, the source and destination must not be read or written until the task completes
Source and destination must not overlap
Other limitations are described in DOCA Core Task
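Since source and destination must not overlap, an application may want to check its address ranges before submitting a task. The following is a minimal sketch of such a check in plain C; both function names are hypothetical and nothing here is part of the DOCA API. It compares raw addresses, so it assumes both ranges are expressed in the same address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when [a, a + a_len) and [b, b + b_len) share a byte.
 * The subtraction form avoids overflow near the top of the address space. */
bool sketch_ranges_overlap(const void *a, size_t a_len,
                           const void *b, size_t b_len)
{
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return false; /* an empty range cannot overlap anything */
    if (a0 < b0)
        return b0 - a0 < a_len;
    return a0 - b0 < b_len;
}

/* Debug-build guard a caller might place before submitting a copy. */
void sketch_assert_disjoint(const void *src, const void *dst, size_t len)
{
    assert(!sketch_ranges_overlap(src, len, dst, len));
}
```

Adjacent ranges (one ending exactly where the other begins) are reported as non-overlapping, which matches the half-open interval convention used in the comments.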
Events
DOCA DMA exposes asynchronous events to notify on changes that happen unexpectedly, according to DOCA Core architecture.
DMA exposes only the common events described in DOCA Core Event.
State Machine
The DOCA DMA library follows the Context state machine as described in DOCA Core Context State Machine.
The following section describes how to move states and what is allowed in each state.
Idle
In this state, the application is expected to do one of the following:
Destroys the context
Starts the context
Allowed operations:
Configuring the context according to section "Configurations"
Starting the context
It is possible to reach this state as follows:
Previous State | Transition Action
None | Create the context
Running | Call stop after making sure all tasks have been freed
Stopping | Call progress until all tasks are completed and freed
Starting
This state cannot be reached.
Running
In this state, the application is expected to do the following:
Allocates and submits tasks
Calls progress to complete tasks and/or receive events
Allowed operations:
Allocating a previously configured task
Submitting a task
Calling stop
It is possible to reach this state as follows:
Previous State | Transition Action
Idle | Call start after configuration
Stopping
In this state, the application is expected to do the following:
Calls progress to complete all inflight tasks (tasks complete with failure)
Frees any completed tasks
Allowed operations:
Call progress
It is possible to reach this state as follows:
Previous State | Transition Action
Running | Call progress and fatal error occurs
Running | Call stop without freeing all tasks
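The transitions listed for the Idle, Running, and Stopping states can be summarized as a small model in plain C. This is a sketch for reasoning about the state machine only: every type and function name below is hypothetical, return values are simplified to 0 and -1, and no DOCA function is called.

```c
#include <assert.h>

enum sketch_state {
    SKETCH_IDLE,
    SKETCH_STARTING, /* listed for completeness; never reached by DOCA DMA */
    SKETCH_RUNNING,
    SKETCH_STOPPING
};

struct sketch_ctx {
    enum sketch_state state;
    unsigned inflight; /* tasks allocated but not yet freed */
};

/* Creating the context leaves it idle. */
void sketch_init(struct sketch_ctx *c)
{
    c->state = SKETCH_IDLE;
    c->inflight = 0;
}

/* Idle -> Running. Returns -1 from any other state. */
int sketch_start(struct sketch_ctx *c)
{
    if (c->state != SKETCH_IDLE)
        return -1;
    c->state = SKETCH_RUNNING;
    return 0;
}

/* Tasks may only be allocated while running. */
int sketch_task_alloc(struct sketch_ctx *c)
{
    if (c->state != SKETCH_RUNNING)
        return -1;
    c->inflight++;
    return 0;
}

void sketch_task_free(struct sketch_ctx *c)
{
    assert(c->inflight > 0);
    c->inflight--;
}

/* Running -> Idle when every task is freed, otherwise Running -> Stopping. */
int sketch_stop(struct sketch_ctx *c)
{
    if (c->state != SKETCH_RUNNING)
        return -1;
    c->state = (c->inflight == 0) ? SKETCH_IDLE : SKETCH_STOPPING;
    return 0;
}

/* Models a fatal error observed during progress: Running -> Stopping. */
void sketch_fatal_error(struct sketch_ctx *c)
{
    if (c->state == SKETCH_RUNNING)
        c->state = SKETCH_STOPPING;
}

/* Models progress: Stopping -> Idle once every task has been freed. */
void sketch_progress(struct sketch_ctx *c)
{
    if (c->state == SKETCH_STOPPING && c->inflight == 0)
        c->state = SKETCH_IDLE;
}
```

The model makes one property easy to see: once the context is stopping, the only way back to idle is to keep calling progress and free every outstanding task.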
Alternative Datapath Options
DOCA DMA allows the data path to run on the CPU or on the GPU.
For the CPU data path, see Execution Phase.
GPU Datapath
DOCA offers the DOCA GPUNetIO library which provides a programming model for offloading the orchestration of the communication to a GPU CUDA kernel.
The user may run a DMA operation on the GPU data path by configuring the DOCA DMA context used by the application in the following manner:
Obtain the DOCA CTX by calling doca_dma_as_ctx().
Set the datapath of the context to GPU by calling doca_ctx_set_datapath_on_gpu(). For additional information, refer to DOCA Core Alternative Data Path.
Finish context configuration and start the context by calling doca_ctx_start(). For additional information, refer to DOCA Core Context.
After configuring the datapath, the user can obtain a GPU handle for the DOCA DMA context by calling doca_dma_get_gpu_handle(). The GPU handle must be passed to a GPU CUDA kernel so the DOCA GPUNetIO CUDA device functions can execute datapath operations. For additional information, refer to section "GPU Functions – RDMA" under DOCA GPUNetIO library documentation.
DOCA DMA Samples
This section describes DOCA DMA samples based on the DOCA DMA library.
The samples illustrate how to use the DOCA DMA API to do the following:
Copy contents of a local buffer to another buffer
Use DPU to copy contents of buffer on the host to a local buffer
All the DOCA samples described in this section are governed under the BSD-3 software license agreement.
Running the Samples
Refer to the following documents:
DOCA Installation Guide for Linux for details on how to install BlueField-related software.
DOCA Troubleshooting for any issue you may encounter with the installation, compilation, or execution of DOCA samples.
To build a given sample:
cd /opt/mellanox/doca/samples/doca_dma/dma_local_copy
meson /tmp/build
ninja -C /tmp/build
The binary doca_dma_local_copy is created under /tmp/build/.
Sample (e.g., doca_dma_local_copy) usage:
Usage: doca_<sample_name> [DOCA Flags] [Program Flags]
DOCA Flags:
  -h, --help                    Print a help synopsis
  -v, --version                 Print program version information
  -l, --log-level               Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>             Parse all command flags from an input json file
Program Flags:
  -p, --pci_addr <PCI-ADDRESS>  PCI device address
  -t, --text                    Text to DMA copy
For additional information per sample, use the -h option:
/tmp/build/<sample_name> -h
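The numeric values accepted by the -l, --log-level flag in the usage text can be kept straight with a small lookup. The sketch below only restates the mapping printed in the usage text (10=DISABLE through 70=TRACE); both function names are hypothetical helpers for illustration, not part of DOCA.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Maps a numeric log level from the usage text to its name.
 * Returns NULL for any value the usage text does not list. */
const char *sketch_log_level_name(int level)
{
    switch (level) {
    case 10: return "DISABLE";
    case 20: return "CRITICAL";
    case 30: return "ERROR";
    case 40: return "WARNING";
    case 50: return "INFO";
    case 60: return "DEBUG";
    case 70: return "TRACE";
    default: return NULL;
    }
}

/* Reverse lookup: returns the numeric level, or -1 for an unknown name. */
int sketch_log_level_value(const char *name)
{
    assert(name != NULL);
    for (int level = 10; level <= 70; level += 10) {
        if (strcmp(name, sketch_log_level_name(level)) == 0)
            return level;
    }
    return -1;
}
```

For example, passing -l 60 to a sample corresponds to the DEBUG level in this mapping.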
Samples
DMA Local Copy
This sample illustrates how to locally copy memory with DMA from one buffer to another on the DPU. This sample should be run on the DPU.
The sample logic includes:
Locating DOCA device.
Initializing needed DOCA core structures.
Populating DOCA memory map with two relevant buffers.
Allocating element in DOCA buffer inventory for each buffer.
Initializing DOCA DMA memory copy task object.
Submitting DMA task.
Handling task completion once it is done.
Checking task result.
Destroying all DMA and DOCA core structures.
Reference:
/opt/mellanox/doca/samples/doca_dma/dma_local_copy/dma_local_copy_sample.c
/opt/mellanox/doca/samples/doca_dma/dma_local_copy/dma_local_copy_main.c
/opt/mellanox/doca/samples/doca_dma/dma_local_copy/meson.build
DMA Copy DPU
This sample should run only after DMA Copy Host is run and the required configuration files (descriptor and buffer) have been copied to the DPU.
This sample illustrates how to copy memory (which contains user defined text) with DMA from the x86 host into the DPU. This sample should be run on the DPU.
The sample logic includes:
Locating DOCA device.
Initializing needed DOCA core structures.
Reading configuration files and saving their content into local buffers.
Allocating the local destination buffer in which the host text is to be saved.
Populating DOCA memory map with destination buffer.
Creating the remote memory map with the export descriptor file.
Creating memory map to the remote buffer.
Allocating element in DOCA buffer inventory for each buffer.
Initializing DOCA DMA memory copy task object.
Submitting DMA task.
Handling task completion once it is done.
Checking DMA task result.
If the DMA task ends successfully, printing the text that has been copied to log.
Printing to log that the host-side sample can be closed.
Destroying all DMA and DOCA core structures.
Reference:
/opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/dma_copy_dpu_sample.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/dma_copy_dpu_main.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/meson.build
DMA Copy Host
This sample should be run first. It is the user's responsibility to transfer the two configuration files (descriptor and buffer) to the DPU and provide their paths to the DMA Copy DPU sample.
This sample illustrates how to allow memory copy with DMA from the x86 host into the DPU. This sample should be run on the host.
The sample logic includes:
Locating DOCA device.
Initializing needed DOCA core structures.
Populating DOCA memory map with source buffer.
Exporting memory map.
Saving export descriptor and local DMA buffer information into files. These files should be transferred to the DPU before running the DPU sample.
Waiting until DPU DMA sample has finished.
Destroying all DMA and DOCA core structures.
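The file hand-off between the host and DPU samples (saving the export descriptor and the buffer information on the host, then reading both back on the DPU) can be sketched with standard C file I/O. The layout used here is an assumption made for illustration: a raw byte dump for the descriptor and two decimal lines (address, length) for the buffer information. The actual samples may use a different format, and all names below are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record describing the host source buffer (not a DOCA type). */
struct sketch_buf_info {
    uint64_t addr; /* host address of the source buffer */
    uint64_t len;  /* length of the source buffer, in bytes */
};

/* Host side: writes address and length as two decimal lines. */
int sketch_write_buf_info(FILE *f, const struct sketch_buf_info *info)
{
    assert(f != NULL && info != NULL);
    if (fprintf(f, "%" PRIu64 "\n%" PRIu64 "\n", info->addr, info->len) < 0)
        return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* DPU side: parses the two lines back. Returns 0 on success, -1 otherwise. */
int sketch_read_buf_info(FILE *f, struct sketch_buf_info *info)
{
    assert(f != NULL && info != NULL);
    if (fscanf(f, "%" SCNu64 " %" SCNu64, &info->addr, &info->len) != 2)
        return -1;
    return 0;
}

/* Host side: writes the export descriptor as raw bytes. */
int sketch_write_desc(FILE *f, const void *desc, size_t len)
{
    assert(f != NULL && desc != NULL);
    if (fwrite(desc, 1, len, f) != len)
        return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* DPU side: reads at most cap descriptor bytes; returns the count read. */
size_t sketch_read_desc(FILE *f, void *out, size_t cap)
{
    assert(f != NULL && out != NULL);
    return fread(out, 1, cap, f);
}
```

A text format for the buffer information keeps the file readable and independent of struct padding, whereas the descriptor is opaque to the application and is therefore copied byte for byte.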
Reference:
/opt/mellanox/doca/samples/doca_dma/dma_copy_host/dma_copy_host_sample.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_host/dma_copy_host_main.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_host/meson.build