DOCA DMA
This guide provides instructions on building and developing applications that require copying memory using Direct Memory Access (DMA).
DOCA DMA provides an API to copy data between DOCA buffers using hardware acceleration, supporting both local and remote memory regions.
The library provides an API for executing DMA operations on DOCA buffers, where these buffers reside either in local memory (i.e., within the same host) or host memory accessible by the DPU. See DOCA Core for more information about the memory subsystem.
Using DOCA DMA, complex memory copy operations can be easily executed in an optimized, hardware-accelerated manner.
This document is intended for software developers wishing to accelerate their application's memory I/O operations and access memory that is not local to the host.
This library follows the architecture of a DOCA Core Context, so it is recommended to read the DOCA Core Context documentation first.
DOCA DMA-based applications can run either on the host machine or on the NVIDIA® BlueField® DPU target.
Copying from the host to the DPU and vice versa only works when the DPU is configured to run in DPU mode, as described in BlueField Modes of Operation.
DOCA DMA is a DOCA Context as defined by DOCA Core. See DOCA Core Context for more information.
DOCA DMA leverages DOCA Core architecture to expose asynchronous tasks/events that are offloaded to hardware.
DMA can be used to copy data as follows:
- Copying from local memory to local memory
- Using the DPU to copy memory between host and DPU
- Using the host to copy memory between host and DPU
Objects
Device and Device Representor
The DMA library needs a DOCA device to operate. The device is used to access memory and perform the actual copy. See DOCA Core Device Discovery.
Within the same BlueField DPU, it does not matter which device is used (PF/VF/SF), as all these devices utilize the same hardware component. If there are multiple DPUs, it is possible to create a DMA instance per DPU, providing each instance with a device from a different DPU.
To access memory that is not local (from the host to the DPU or vice versa), the DPU side of the application must select a device with an appropriate representor. See DOCA Core Device Representor Discovery.
The device must stay valid for as long as the DMA instance is not destroyed.
Memory Buffers
The memory copy task requires two DOCA buffers containing the destination and the source. Depending on the allocation pattern of the buffers, refer to the table in the "Inventory Types" section. To find what kind of memory is supported, refer to the table in section "Buffer Support".
Buffers must not be modified or read during the memory copy operation.
To start using the library, users must go through a configuration phase as described in DOCA Core Context Configuration Phase.
This section describes how to configure and start the context, to allow execution of tasks and retrieval of events.
Configurations
The context can be configured to match the application use case.
To find if a configuration is supported, or what the min/max value for it is, refer to section "Device Support".
Mandatory Configurations
These configurations are mandatory and must be set by the application before attempting to start the context:
- At least one task/event type must be configured. See configuration of tasks and/or events in sections "Tasks" and "Events" respectively for information. 
- A device with appropriate support must be provided upon creation 
Device Support
DOCA DMA requires a device to operate. To pick a device, refer to "DOCA Core Device Discovery".
As device capabilities may change (see DOCA Core Device Support), it is recommended to select your device using the following method:
- doca_dma_cap_task_memcpy_is_supported
Some devices can allow different capabilities as follows:
- The maximum number of tasks 
- The maximum buffer size 
Buffer Support
Tasks support buffers with the following features:
| Buffer Type | Source Buffer | Destination Buffer | 
| Local mmap buffer | Yes | Yes | 
| mmap from PCIe export buffer | Yes | Yes | 
| mmap From RDMA export buffer | No | No | 
| Linked list buffer | Yes | Yes | 
This section describes execution on CPU using DOCA Core Progress Engine.
Tasks
DOCA DMA exposes asynchronous tasks that leverage the DPU hardware according to the DOCA Core architecture. See DOCA Core Task.
Memory Copy Task
The memory copy task copies memory from one location to another, using buffers as described in "Buffer Support".
Task Configuration
| Description | API to set the configuration | API to query support |
| Enable the task | | |
| Number of tasks | | |
| Maximal buffer size | – | |
| Maximum buffer list size | – | |
Task Input
Common input as described in DOCA Core Task.
| Name | Description | Notes | 
| Source buffer | Buffer that points to the memory to be copied | Only the data residing in the data segment is copied | 
| Destination buffer | Buffer that points to where memory is copied | The data is copied to the tail segment extending the data segment | 
Task Output
Common output as described in DOCA Core Task.
Task Completion Success
After the task is completed successfully:
- The data is copied from source to destination
- The destination buffer data segment is extended to include the copied data 
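The segment behavior described above can be illustrated with a small toy model. This is only a sketch: struct toy_buf and toy_memcpy_task are hypothetical names invented for this example and are not the doca_buf type or any DOCA API. The model assumes a buffer made of headroom, a data segment, and a tail segment, and shows the source data segment being appended into the destination tail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a buffer descriptor (hypothetical, not the doca_buf type).
 * Layout: [head, data) is headroom, [data, data + data_len) is the data
 * segment, and the rest up to head + len is the tail segment. */
struct toy_buf {
    unsigned char *head; /* start of the underlying memory */
    size_t len;          /* total size of the underlying memory */
    unsigned char *data; /* start of the data segment */
    size_t data_len;     /* length of the data segment */
};

/* Number of bytes still free in the tail segment. */
static size_t toy_buf_tail_len(const struct toy_buf *b)
{
    assert(b != NULL);
    return (size_t)((b->head + b->len) - (b->data + b->data_len));
}

/* Copies only the source data segment into the destination tail segment,
 * then extends the destination data segment to cover the copied bytes.
 * Returns 0 on success, or -1 if the tail is too small, in which case the
 * destination descriptor and memory are left untouched. */
static int toy_memcpy_task(const struct toy_buf *src, struct toy_buf *dst)
{
    assert(src != NULL && dst != NULL);
    if (src->data_len > toy_buf_tail_len(dst))
        return -1;
    memcpy(dst->data + dst->data_len, src->data, src->data_len);
    dst->data_len += src->data_len;
    return 0;
}
```

In this model, a destination whose data segment already holds 4 bytes ends up with a 7-byte data segment after a 3-byte source is copied, with the new bytes placed directly after the existing ones.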
Task Completion Failure
If the task fails midway:
- The context may enter stopping state, if a fatal error occurs 
- The source and destination doca_buf objects are not modified
- The destination buffer contents may be modified 
Task Limitations
- The operation is not atomic 
- Once the task has been submitted, the source and destination must not be read or written until the copy completes
- Source and destination must not overlap 
- Other limitations are described in DOCA Core Task 
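Because source and destination must not overlap, an application may want to validate the address ranges itself before submitting a task. The helper below is a minimal sketch of such a check on plain pointers and lengths. The name ranges_overlap is hypothetical and not part of DOCA, and whether the library performs an equivalent check internally is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when [a, a + a_len) and [b, b + b_len) share at least one
 * byte. Addresses are compared as integers so that ranges from unrelated
 * allocations can be compared as well. */
static bool ranges_overlap(const void *a, size_t a_len,
                           const void *b, size_t b_len)
{
    uintptr_t a_start = (uintptr_t)a;
    uintptr_t b_start = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return false; /* an empty range cannot overlap anything */

    assert(a != NULL && b != NULL);

    /* Use subtraction instead of start + len to avoid integer overflow. */
    if (a_start <= b_start)
        return b_start - a_start < a_len;
    return a_start - b_start < b_len;
}
```

A caller would pass the start address and length of the source data and of the destination region, and skip or split the copy when the function returns true.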
Events
DOCA DMA exposes asynchronous events to notify on changes that happen unexpectedly, according to DOCA Core architecture.
The only events DMA exposes are the common events described in DOCA Core Event.
The DOCA DMA library follows the Context state machine as described in DOCA Core Context State Machine.
The following section describes how to move states and what is allowed in each state.
Idle
In this state, it is expected that the application:
- Destroys the context 
- Starts the context 
Allowed operations:
- Configuring the context according to section "Configurations" 
- Starting the context 
It is possible to reach this state as follows:
| Previous State | Transition Action | 
| None | Create the context | 
| Running | Call stop after making sure all tasks have been freed | 
| Stopping | Call progress until all tasks are completed and freed | 
Starting
This state cannot be reached.
Running
In this state, it is expected that the application:
- Allocates and submits tasks 
- Calls progress to complete tasks and/or receive events 
Allowed operations:
- Allocating a previously configured task 
- Submitting a task 
- Calling stop 
It is possible to reach this state as follows:
| Previous State | Transition Action | 
| Idle | Call start after configuration | 
Stopping
In this state, it is expected that the application:
- Calls progress to complete all inflight tasks (tasks complete with failure) 
- Frees any completed tasks 
Allowed operations:
- Call progress 
It is possible to reach this state as follows:
| Previous State | Transition Action | 
| Running | Call progress and a fatal error occurs | 
| Running | Call stop without freeing all tasks | 
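The transitions in the tables above can be summarized with a toy state machine. This is a simplified sketch using hypothetical names (toy_ctx, toy_start, and so on), not the DOCA Core types or functions. It assumes each progress call completes and frees exactly one task, and it leaves out the fatal-error transition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the context states (hypothetical, not the DOCA Core enum).
 * The Starting state is omitted because it cannot be reached. */
enum toy_state { TOY_IDLE, TOY_RUNNING, TOY_STOPPING };

struct toy_ctx {
    enum toy_state state;
    unsigned inflight; /* tasks submitted and not yet freed */
};

/* Idle -> Running. Fails in any other state. */
static bool toy_start(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->state != TOY_IDLE)
        return false;
    c->state = TOY_RUNNING;
    return true;
}

/* Submitting a task is allowed only while running. */
static bool toy_submit(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->state != TOY_RUNNING)
        return false;
    c->inflight++;
    return true;
}

/* Running -> Idle if every task has been freed, else Running -> Stopping. */
static bool toy_stop(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->state != TOY_RUNNING)
        return false;
    c->state = (c->inflight == 0) ? TOY_IDLE : TOY_STOPPING;
    return true;
}

/* Completes and frees one task per call (a simplification). While stopping,
 * the context becomes idle once the last task is gone. */
static void toy_progress(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->inflight > 0)
        c->inflight--;
    if (c->state == TOY_STOPPING && c->inflight == 0)
        c->state = TOY_IDLE;
}
```

In this model, stopping with two tasks outstanding requires two progress calls before the context returns to idle and can be started again.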
DOCA DMA allows the data path to run on the CPU or on the GPU.
For the CPU data path, see Execution Phase.
GPU Datapath
DOCA offers the DOCA GPUNetIO library which provides a programming model for offloading the orchestration of the communication to a GPU CUDA kernel.
The user may run a DMA operation on the GPU data path by configuring the DOCA DMA context used by the application in the following manner:
- Obtain the DOCA CTX by calling doca_dma_as_ctx().
- Set the datapath of the context to GPU by calling doca_ctx_set_datapath_on_gpu(). For additional information, refer to DOCA Core Alternative Data Path.
- Finish context configuration and start the context by calling doca_ctx_start(). For additional information, refer to DOCA Core Context.
After configuring the datapath, the user can obtain a GPU handle for the DOCA DMA context by calling doca_dma_get_gpu_handle(). The GPU handle must be passed to a GPU CUDA kernel so the DOCA GPUNetIO CUDA device functions can execute datapath operations. For additional information, refer to section "GPU Functions – RDMA" under DOCA GPUNetIO library documentation.
This section describes DOCA DMA samples based on the DOCA DMA library.
The samples illustrate how to use the DOCA DMA API to do the following:
- Copy contents of a local buffer to another buffer 
- Use DPU to copy contents of buffer on the host to a local buffer 
All the DOCA samples described in this section are governed under the BSD-3 software license agreement.
Running the Samples
- Refer to the following documents:
  - DOCA Installation Guide for Linux for details on how to install BlueField-related software.
  - DOCA Troubleshooting for any issue you may encounter with the installation, compilation, or execution of DOCA samples.
- To build a given sample, run the following commands. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:

      cd /opt/mellanox/doca/samples/doca_dma/dma_local_copy
      meson /tmp/build
      ninja -C /tmp/build

  The binary doca_dma_local_copy is created under /tmp/build/.
- Sample (e.g., doca_dma_local_copy) usage:

      Usage: doca_<sample_name> [DOCA Flags] [Program Flags]

      DOCA Flags:
        -h, --help                      Print a help synopsis
        -v, --version                   Print program version information
        -l, --log-level                 Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
        -j, --json <path>               Parse all command flags from an input json file

      Program Flags:
        -p, --pci_addr <PCI-ADDRESS>    PCI device address
        -t, --text                      Text to DMA copy
        -ns, --num-src-buf              Number of doca_buf for source buffer
        -nd, --num-dst-buf              Number of doca_buf for destination buffer

- For additional information per sample, use the -h option:

      /tmp/build/<sample_name> -h

  Info: The command line options "--num-src-buf" and "--num-dst-buf" are used to show linked-list usage of doca_buf.
  - The maximum supported number of elements is 64.
  - They are only available for doca_dma_local_copy and doca_dma_copy_dpu, because doca_dma_copy_host does not need to construct any doca_dma_task_memcpy.
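When a payload is spread across a linked list of buffers, each list element covers one slice of the data. The sketch below shows one way to compute those slices. It assumes an even split with any remainder given to the first elements, which is an illustrative choice and not necessarily how the samples divide the text. The names toy_chunk_range and TOY_MAX_LIST_LEN are hypothetical; only the limit of 64 elements comes from the note above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LIST_LEN 64 /* element limit quoted for the samples */

/* Computes the byte range covered by list element idx when total bytes are
 * split across n elements as evenly as possible. The first (total % n)
 * elements each carry one extra byte. Returns 0 on success, or -1 if n is
 * zero, n exceeds the list limit, or idx is out of range. */
static int toy_chunk_range(size_t total, size_t n, size_t idx,
                           size_t *offset, size_t *len)
{
    size_t base, extra;

    assert(offset != NULL && len != NULL);
    if (n == 0 || n > TOY_MAX_LIST_LEN || idx >= n)
        return -1;

    base = total / n;
    extra = total % n;
    *len = base + (idx < extra ? 1 : 0);
    *offset = idx * base + (idx < extra ? idx : extra);
    return 0;
}
```

For example, splitting 10 bytes across 3 elements yields slices of 4, 3, and 3 bytes starting at offsets 0, 4, and 7.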
 
Samples
These samples are also available on GitHub.
DMA Local Copy
This sample illustrates how to locally copy memory with DMA from one buffer to another on the DPU. This sample should be run on the DPU.
The sample logic includes:
- Locating DOCA device. 
- Initializing needed DOCA core structures. 
- Populating DOCA memory map with two relevant buffers. 
- Allocating element in DOCA buffer inventory for each buffer. 
- Initializing DOCA DMA memory copy task object. 
- Submitting DMA task. 
- Handling task completion once it is done. 
- Checking task result. 
- Destroying all DMA and DOCA core structures. 
Reference:
- /opt/mellanox/doca/samples/doca_dma/dma_local_copy/dma_local_copy_sample.c
- /opt/mellanox/doca/samples/doca_dma/dma_local_copy/dma_local_copy_main.c
- /opt/mellanox/doca/samples/doca_dma/dma_local_copy/meson.build
DMA Copy DPU
This sample should run only after DMA Copy Host is run and the required configuration files (descriptor and buffer) have been copied to the DPU.
This sample illustrates how to copy memory (which contains user-defined text) with DMA from the x86 host into the DPU. This sample should be run on the DPU.
The sample logic includes:
- Locating DOCA device. 
- Initializing needed DOCA core structures. 
- Reading configuration files and saving their content into local buffers. 
- Allocating the local destination buffer in which the host text is to be saved. 
- Populating DOCA memory map with destination buffer. 
- Creating the remote memory map with the export descriptor file. 
- Creating memory map to the remote buffer. 
- Allocating element in DOCA buffer inventory for each buffer. 
- Initializing DOCA DMA memory copy task object. 
- Submitting DMA task. 
- Handling task completion once it is done. 
- Checking DMA task result. 
- If the DMA task ends successfully, printing the text that has been copied to log. 
- Printing to log that the host-side sample can be closed. 
- Destroying all DMA and DOCA core structures. 
Reference:
- /opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/dma_copy_dpu_sample.c
- /opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/dma_copy_dpu_main.c
- /opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/meson.build
DMA Copy Host
This sample should be run first. It is the user's responsibility to transfer the two configuration files (descriptor and buffer) to the DPU and provide their paths to the DMA Copy DPU sample.
This sample illustrates how to allow memory copy with DMA from the x86 host into the DPU. This sample should be run on the host.
The sample logic includes:
- Locating DOCA device. 
- Initializing needed DOCA core structures. 
- Populating DOCA memory map with source buffer. 
- Exporting memory map. 
- Saving export descriptor and local DMA buffer information into files. These files should be transferred to the DPU before running the DPU sample. 
- Waiting until DPU DMA sample has finished. 
- Destroying all DMA and DOCA core structures. 
Reference:
- /opt/mellanox/doca/samples/doca_dma/dma_copy_host/dma_copy_host_sample.c
- /opt/mellanox/doca/samples/doca_dma/dma_copy_host/dma_copy_host_main.c
- /opt/mellanox/doca/samples/doca_dma/dma_copy_host/meson.build