DOCA DMA - NVIDIA Docs

This guide provides instructions on building and developing applications that require copying memory using Direct Memory Access (DMA).

Introduction

DOCA DMA provides an API to copy data between DOCA buffers using hardware acceleration, supporting both local and remote memory regions.

The library provides an API for executing DMA operations on DOCA buffers, where these buffers reside either in local memory (i.e., within the same host) or host memory accessible by the DPU. See DOCA Core for more information about the memory subsystem.

Using DOCA DMA, complex memory copy operations can be easily executed in an optimized, hardware-accelerated manner.

This document is intended for software developers wishing to accelerate their application's memory I/O operations and access memory that is not local to the host.

Prerequisites

This library follows the architecture of a DOCA Core Context. It is recommended to read the following sections first:

Library Changes From Previous Releases

Changes in 2.9.0

N/A

Environment

DOCA DMA-based applications can run either on the host machine or on the NVIDIA® BlueField® DPU target.

Copying from host to DPU and vice versa only works with a DPU configured to run in DPU mode, as described in BlueField Modes of Operation.

Architecture

DOCA DMA is a DOCA Context as defined by DOCA Core. See DOCA Core Context for more information.

DOCA DMA leverages DOCA Core architecture to expose asynchronous tasks/events that are offloaded to hardware.

DMA can be used to copy data as follows:

Copying from local memory to local memory
Using the DPU to copy memory between host and DPU
Using the host to copy memory between host and DPU

Objects

Device and Device Representor

The DMA library needs a DOCA device to operate. The device is used to access memory and perform the actual copy. See DOCA Core Device Discovery.

Within the same BlueField DPU, it does not matter which device is used (PF/VF/SF), as all these devices utilize the same hardware component. If there are multiple DPUs, it is possible to create a DMA instance per DPU, providing each instance with a device from a different DPU.

To access memory that is not local (from the host to the DPU or vice versa), the DPU side of the application must select a device with an appropriate representor. See DOCA Core Device Representor Discovery.

The device must stay valid for as long as the DMA instance is not destroyed.

Memory Buffers

The memory copy task requires two DOCA buffers containing the destination and the source. Depending on the allocation pattern of the buffers, refer to the table in the "Inventory Types" section. To find what kind of memory is supported, refer to the table in section "Buffer Support".

Buffers must not be modified or read during the memory copy operation.

Configuration Phase

To start using the library, users must go through a configuration phase as described in DOCA Core Context Configuration Phase.

This section describes how to configure and start the context, to allow execution of tasks and retrieval of events.

Configurations

The context can be configured to match the application use case.

To find if a configuration is supported, or what the min/max value for it is, refer to section "Device Support".

Mandatory Configurations

These configurations are mandatory and must be set by the application before attempting to start the context:

At least one task/event type must be configured. See configuration of tasks and/or events in sections "Tasks" and "Events" respectively for information.
A device with appropriate support must be provided upon creation

Device Support

DOCA DMA requires a device to operate. To pick a device, refer to "DOCA Core Device Discovery".

As device capabilities may change (see DOCA Core Device Support), it is recommended to select your device using the following method:

doca_dma_cap_task_memcpy_is_supported

Some devices can allow different capabilities as follows:

The maximum number of tasks
The maximum buffer size
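
The selection logic described above can be sketched as a filter over candidate devices. The code below is a plain-C model rather than DOCA code: `toy_devinfo` and `toy_pick_device` are hypothetical names, and the PCI addresses used with it are made up. In a real application the three capability fields would be filled from queries such as `doca_dma_cap_task_memcpy_is_supported`, `doca_dma_cap_get_max_num_tasks`, and `doca_dma_cap_task_memcpy_get_max_buf_size`, whose exact signatures are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the capabilities queried for one device. */
struct toy_devinfo {
    const char *pci_addr;   /* illustrative identifier only */
    int memcpy_supported;   /* non-zero if the memcpy task is supported */
    uint32_t max_num_tasks; /* maximum number of tasks */
    uint64_t max_buf_size;  /* maximum buffer size in bytes */
};

/* Returns the first device that supports the memcpy task and satisfies
 * both requirements, or NULL if no device qualifies. */
static const struct toy_devinfo *
toy_pick_device(const struct toy_devinfo *devs, size_t n,
                uint32_t need_tasks, uint64_t need_buf_size)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].memcpy_supported)
            continue;
        if (devs[i].max_num_tasks < need_tasks)
            continue;
        if (devs[i].max_buf_size < need_buf_size)
            continue;
        return &devs[i];
    }
    return NULL;
}
```

Checking support first and limits second mirrors the recommendation above: a device that does not support the task is skipped regardless of its other capabilities.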

Buffer Support

Tasks support buffers with the following features:

Buffer Type	Source Buffer	Destination Buffer
Local mmap buffer	Yes	Yes
mmap from PCIe export buffer	Yes	Yes
mmap from RDMA export buffer	No	No
Linked list buffer	Yes	Yes

Execution Phase

This section describes execution on CPU using DOCA Core Progress Engine.

Tasks

DOCA DMA exposes asynchronous tasks that leverage the DPU hardware according to the DOCA Core architecture. See DOCA Core Task.

Memory Copy Task

The memory copy task copies memory from one location to another, using buffers as described in section "Buffer Support".

Task Configuration

Description	API to set the configuration	API to query support
Enable the task	`doca_dma_task_memcpy_set_conf`	`doca_dma_cap_task_memcpy_is_supported`
Number of tasks	`doca_dma_task_memcpy_set_conf`	`doca_dma_cap_get_max_num_tasks`
Maximal buffer size	–	`doca_dma_cap_task_memcpy_get_max_buf_size`
Maximum buffer list size	–	`doca_dma_cap_task_memcpy_get_max_buf_list_len`

Task Input

Common input as described in DOCA Core Task.

Name	Description	Notes
Source buffer	Buffer that points to the memory to be copied	Only the data residing in the data segment is copied
Destination buffer	Buffer that points to where memory is copied	The data is copied to the tail segment extending the data segment

Task Output

Common output as described in DOCA Core Task.

Task Completion Success

After the task is completed successfully:

The data is copied from source to destination
The destination buffer data segment is extended to include the copied data
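
The input and completion semantics above can be pictured with a small plain-C model. This is an illustrative sketch only, not the DOCA API: `toy_buf` and `toy_memcpy` are hypothetical names, and the copy is performed by the CPU with `memcpy` rather than by the DMA engine. It assumes a buffer is a memory range holding a data segment, with the tail segment being the free space after the data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DOCA buffer: a memory range that holds a
 * data segment; the space after the data segment is the tail segment. */
struct toy_buf {
    unsigned char *head; /* start of the underlying memory range */
    size_t len;          /* total length of the memory range */
    size_t data_off;     /* offset of the data segment from head */
    size_t data_len;     /* length of the data segment */
};

/* Free bytes between the end of the data segment and the end of the range. */
static size_t toy_buf_tail_len(const struct toy_buf *b)
{
    return b->len - (b->data_off + b->data_len);
}

/* Model of the memory copy task: copy only the source data segment into
 * the destination tail segment, then extend the destination data segment.
 * Returns 0 on success, -1 if the tail segment is too small. On failure
 * neither buffer descriptor is modified. */
static int toy_memcpy(struct toy_buf *dst, const struct toy_buf *src)
{
    assert(dst != NULL && src != NULL);
    if (src->data_len > toy_buf_tail_len(dst))
        return -1;
    memcpy(dst->head + dst->data_off + dst->data_len,
           src->head + src->data_off, src->data_len);
    dst->data_len += src->data_len;
    return 0;
}
```

In this model, copying twice into the same destination appends the second copy after the first, because each successful copy grows the destination data segment.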

Task Completion Failure

If the task fails midway:

The context may enter stopping state, if a fatal error occurs
The source and destination doca_buf objects are not modified
The destination buffer contents may be modified

Task Limitations

The operation is not atomic
Once the task has been submitted, the source and destination must not be read from or written to until the task completes
Source and destination must not overlap
Other limitations are described in DOCA Core Task
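
Because overlapping source and destination ranges are not allowed, an application may want to validate its ranges before submitting a task. The helper below is a sketch in plain C; `ranges_overlap` is a hypothetical name and not part of the DOCA API. It assumes both ranges are addresses in the same address space, so it is not meaningful for comparing a local address with a remote one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the byte ranges [a, a + a_len) and
 * [b, b + b_len) share at least one byte. Empty ranges never overlap. */
static bool ranges_overlap(const void *a, size_t a_len,
                           const void *b, size_t b_len)
{
    uintptr_t a_start = (uintptr_t)a;
    uintptr_t b_start = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return false;
    /* Non-empty ranges are expected to have valid addresses. */
    assert(a != NULL && b != NULL);
    /* Written with subtraction so that start + len cannot wrap around. */
    if (a_start <= b_start)
        return b_start - a_start < a_len;
    return a_start - b_start < b_len;
}
```

Adjacent ranges, where one ends exactly where the other begins, are reported as non-overlapping.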

Events

DOCA DMA exposes asynchronous events to notify on changes that happen unexpectedly, according to DOCA Core architecture.

The only events DMA exposes are the common events described in DOCA Core Event.

State Machine

The DOCA DMA library follows the Context state machine as described in DOCA Core Context State Machine.

The following section describes how to move states and what is allowed in each state.

Idle

In this state, it is expected that the application:

Destroys the context
Starts the context

Allowed operations:

Configuring the context according to section "Configurations"
Starting the context

It is possible to reach this state as follows:

Previous State	Transition Action
None	Create the context
Running	Call stop after making sure all tasks have been freed
Stopping	Call progress until all tasks are completed and freed

Starting

This state cannot be reached.

Running

In this state, it is expected that the application:

Allocates and submits tasks
Calls progress to complete tasks and/or receive events

Allowed operations:

Allocating a previously configured task
Submitting a task
Calling stop

It is possible to reach this state as follows:

Previous State	Transition Action
Idle	Call start after configuration

Stopping

In this state, it is expected that the application:

Calls progress to complete all inflight tasks (tasks complete with failure)
Frees any completed tasks

Allowed operations:

Call progress

It is possible to reach this state as follows:

Previous State	Transition Action
Running	Call progress and fatal error occurs
Running	Call stop without freeing all tasks
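
The transitions in the tables above can be summarized with a small plain-C model. It is a sketch under simplifying assumptions, not the DOCA Core implementation: all `toy_` names are hypothetical, completing and freeing a task are merged into a single step, and the fatal-error path from Running to Stopping is not modeled.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_IDLE, TOY_RUNNING, TOY_STOPPING };

struct toy_ctx {
    enum toy_state state;
    unsigned int inflight; /* tasks submitted but not yet completed and freed */
};

/* Idle -> Running. Returns 0 on success, -1 if not allowed in this state. */
static int toy_start(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->state != TOY_IDLE)
        return -1;
    c->state = TOY_RUNNING;
    return 0;
}

/* Submitting a task is only allowed while Running. */
static int toy_submit(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->state != TOY_RUNNING)
        return -1;
    c->inflight++;
    return 0;
}

/* Running -> Idle if no task is outstanding, otherwise Running -> Stopping. */
static int toy_stop(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->state != TOY_RUNNING)
        return -1;
    c->state = (c->inflight == 0) ? TOY_IDLE : TOY_STOPPING;
    return 0;
}

/* Completes and frees one outstanding task, if any. In Stopping, the last
 * completion moves the context to Idle. Returns the number of tasks
 * completed by this call (0 or 1). */
static int toy_progress(struct toy_ctx *c)
{
    assert(c != NULL);
    if (c->inflight == 0)
        return 0;
    c->inflight--;
    if (c->state == TOY_STOPPING && c->inflight == 0)
        c->state = TOY_IDLE;
    return 1;
}
```

In this model, calling `toy_stop` with two tasks outstanding moves the context to Stopping, and two subsequent `toy_progress` calls return it to Idle.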

Alternative Datapath Options

DOCA DMA allows the data path to run on either the CPU or the GPU.

Info

For the CPU data path, see Execution Phase.

GPU Datapath

DOCA offers the DOCA GPUNetIO library which provides a programming model for offloading the orchestration of the communication to a GPU CUDA kernel.

The user may run a DMA operation on the GPU data path by configuring the DOCA DMA context used by the application in the following manner:

Obtain DOCA CTX by calling doca_dma_as_ctx().
Set the datapath of the context to GPU by calling doca_ctx_set_datapath_on_gpu(). For additional information, refer to DOCA Core Alternative Data Path.
Finish context configuration and start the context by calling doca_ctx_start(). For additional information, refer to DOCA Core Context.

After configuring the datapath, the user can obtain a GPU handle for the DOCA DMA context by calling doca_dma_get_gpu_handle(). The GPU handle must be passed to a GPU CUDA kernel so the DOCA GPUNetIO CUDA device functions can execute datapath operations. For additional information, refer to section "GPU Functions – RDMA" under DOCA GPUNetIO library documentation.

DOCA DMA Samples

This section describes DOCA DMA samples based on the DOCA DMA library.

The samples illustrate how to use the DOCA DMA API to do the following:

Copy contents of a local buffer to another buffer
Use DPU to copy contents of buffer on the host to a local buffer

Info

All the DOCA samples described in this section are governed under the BSD-3 software license agreement.

Running the Samples

Refer to the following documents:
- DOCA Installation Guide for Linux for details on how to install BlueField-related software.
- DOCA Troubleshooting for any issue you may encounter with the installation, compilation, or execution of DOCA samples.
To build a given sample, run the following command. If you downloaded the sample from GitHub, update the path in the first line to reflect the location of the sample file:
```
cd /opt/mellanox/doca/samples/doca_dma/dma_local_copy
meson /tmp/build
ninja -C /tmp/build
```
The binary doca_dma_local_copy is created under /tmp/build/.

Sample (e.g., doca_dma_local_copy ) usage:

```
Usage: doca_<sample_name> [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                              Print a help synopsis
  -v, --version                           Print program version information
  -l, --log-level                         Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                       Parse all command flags from an input json file

Program Flags:
  -p, --pci_addr <PCI-ADDRESS>            PCI device address
  -t, --text                              Text to DMA copy
  -ns, --num-src-buf                      Number of doca_buf for source buffer
  -nd, --num-dst-buf                      Number of doca_buf for destination buffer
```

For additional information per sample, use the -h option:
```
/tmp/build/<sample_name> -h
```
Info
- The command-line options "--num-src-buf" and "--num-dst-buf" demonstrate linked-list usage of doca_buf.
- The maximum supported number of elements is 64.
- These options are only available for doca_dma_local_copy and doca_dma_copy_dpu, because doca_dma_copy_host does not construct a doca_dma_task_memcpy.

Samples

Tip

These samples are also available on GitHub.

DMA Local Copy

This sample illustrates how to locally copy memory with DMA from one buffer to another on the DPU. This sample should be run on the DPU.

The sample logic includes:

Locating DOCA device.
Initializing needed DOCA core structures.
Populating DOCA memory map with two relevant buffers.
Allocating element in DOCA buffer inventory for each buffer.
Initializing DOCA DMA memory copy task object.
Submitting DMA task.
Handling task completion once it is done.
Checking task result.
Destroying all DMA and DOCA core structures.

Reference:

/opt/mellanox/doca/samples/doca_dma/dma_local_copy/dma_local_copy_sample.c
/opt/mellanox/doca/samples/doca_dma/dma_local_copy/dma_local_copy_main.c
/opt/mellanox/doca/samples/doca_dma/dma_local_copy/meson.build

DMA Copy DPU

Note

This sample should run only after DMA Copy Host is run and the required configuration files (descriptor and buffer) have been copied to the DPU.

This sample illustrates how to copy memory (which contains user defined text) with DMA from the x86 host into the DPU. This sample should be run on the DPU.

The sample logic includes:

Locating DOCA device.
Initializing needed DOCA core structures.
Reading configuration files and saving their content into local buffers.
Allocating the local destination buffer in which the host text is to be saved.
Populating DOCA memory map with destination buffer.
Creating the remote memory map with the export descriptor file.
Creating memory map to the remote buffer.
Allocating element in DOCA buffer inventory for each buffer.
Initializing DOCA DMA memory copy task object.
Submitting DMA task.
Handling task completion once it is done.
Checking DMA task result.
If the DMA task ends successfully, printing the text that has been copied to log.
Printing to log that the host-side sample can be closed.
Destroying all DMA and DOCA core structures.

Reference:

/opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/dma_copy_dpu_sample.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/dma_copy_dpu_main.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_dpu/meson.build

DMA Copy Host

Note

This sample should be run first. It is the user's responsibility to transfer the two configuration files (descriptor and buffer) to the DPU and provide their paths to the DMA Copy DPU sample.

This sample illustrates how to allow memory copy with DMA from the x86 host into the DPU. This sample should be run on the host.

The sample logic includes:

Locating DOCA device.
Initializing needed DOCA core structures.
Populating DOCA memory map with source buffer.
Exporting memory map.
Saving export descriptor and local DMA buffer information into files. These files should be transferred to the DPU before running the DPU sample.
Waiting until DPU DMA sample has finished.
Destroying all DMA and DOCA core structures.

Reference:

/opt/mellanox/doca/samples/doca_dma/dma_copy_host/dma_copy_host_sample.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_host/dma_copy_host_main.c
/opt/mellanox/doca/samples/doca_dma/dma_copy_host/meson.build
