DOCA STA
The DOCA STA library simplifies the integration and offloading of storage target applications, such as SPDK, onto NVIDIA® BlueField®-3 and newer platforms. A storage target is any application that manages data storage access, whether from hard drives, SSDs, network-attached storage (NAS), or cloud services, and serves requests from local or remote initiators.
To enable network-based access, the NVMe-oF protocol extends the capabilities of NVMe storage across network fabrics like RoCEv1/IB RDMA and NVMe/TCP. By allowing NVMe commands to execute over the network, NVMe-oF overcomes the limitations of direct-attached storage.
Applications that support NVMe-oF are CPU-intensive. To reduce CPU usage, improve latency, and conserve power, an offload solution is essential. With BlueField-3, storage target applications can leverage the data processing accelerator (DPA) to offload their operations.
DOCA STA provides a public API to manage both control-plane and data-plane operations, abstracting the complexity of communication between the offloaded target application and the hardware accelerator. This abstraction allows developers to streamline the integration process and focus on application logic.
Unlike traditional software-based storage targets that access NVMe drives over PCIe, DPA offload accelerators utilize PCIe peer-to-peer (P2P) topology to connect to NVMe drives. Special kernel configurations are required to enable this topology.
To enable DPA storage target offload, a dedicated patched P2P kernel is required:
- For target applications running on the host machine, install a patched kernel that supports P2P. The patches must be applied manually and a custom kernel must be built.
- For target applications running on a BlueField machine, ensure a P2P-supported BFB is installed when the BlueField is directly attached to NVMe drives. This requires the latest BFB with Ubuntu 22.04 (and DOCA v2.10.0).
In addition, the library makes use of existing DOCA libraries, so it is recommended to become familiar with them first (see Key DOCA Libraries).
DOCA STA-based applications can run either on the host machine or on the NVIDIA® BlueField®-3 networking platform (DPU or SuperNIC).
When running on the BlueField platform, NVMe drives must be directly PCIe-attached, and BlueField-3 or newer is required.
The DOCA STA library consists of two main components:
- Host side – The host-side component exposes an API for use by the target application. Using this API, the application can:
  - Initialize and configure accelerator parameters.
  - Register callbacks for data processing – Certain NVMe-oF command capsules that cannot be processed by the offload engine are forwarded to the library as "non-offloaded capsules." The application can process these capsules using registered callbacks.
  - Register callbacks for asynchronous events – The offload engine can generate notifications for events such as NVMe drive timeouts. The application can handle these events using registered callbacks.
  - Register tasks for various operations:
    - Datapath tasks – After processing non-offloaded capsules, the application can register tasks for RDMA operations, such as read, write, or send.
    - Control-path tasks – Some control operations, such as queue pair (QP) disconnection or backend destruction, may take time to complete asynchronously. The application can register tasks for these operations.
- Device side – The device-side component consists of low-level accelerator code that implements the target offload logic, such as DPA handler code.
DOCA STA Contexts
DOCA STA operates with two types of DOCA contexts:
- DOCA STA context:
  - Responsible for system-wide initialization and configuration.
  - Manages all target system-wide operations.
  - There should be only one instance of the DOCA STA context in the system (or virtual machine).
- DOCA STA IO context:
  - Responsible for RDMA IO operations.
  - Can exist for each system thread (e.g., per pthread).
Key DOCA Libraries
DOCA STA relies on several other DOCA libraries for its functionality. The key libraries are:
- DOCA RDMA – Used for connecting or disconnecting queue pairs (QPs) on the accelerator datapath (e.g., DPA). 
- DOCA Comch – Enables communication to and from the offload accelerator engine. For example, it facilitates posting RDMA tasks and receiving completions or handling QP disconnect tasks and their completions. 
High-Level Connectivity
The following diagram illustrates the high-level connectivity of DOCA STA:
 
Objects
Device and Device Representor
The DOCA STA library utilizes two types of devices: a control device (PF) and one or more network devices (SF).
- Control device (PF) – This device is used to configure the DPA, such as loading the DPA-related application image. 
- Network device (SF) – This device handles incoming and outgoing NVMe-oF traffic. 
Applications using the DOCA STA library must select both a control device and network devices with the appropriate representors. For more information, refer to the sections on DOCA Core Device Discovery and DOCA Core Device Representor Discovery.
To begin using the DOCA STA library, users must complete a configuration phase as outlined in the DOCA Core Context Configuration Phase.
This section details the steps required to configure and initialize the DOCA STA context.
Configurations
The DOCA STA context must be configured to align with the application's use case.
The configuration process involves the following steps:
- Create the DOCA STA object (main context). 
- Add network device(s). 
- Connect the DOCA STA context to the DOCA progress engine (PE). 
- Start the DOCA STA context. 
- Create the DOCA STA IO object(s) (IO context/IO threads). 
- Connect the DOCA STA IO context(s) to the DOCA PE. 
- Start the DOCA STA IO context(s). 
The IO context(s) can only be created after the main context has been started.
The following order must be followed during the start phase:
- Create the main context. 
- Start the main context. 
- Create IO context(s). 
- Start IO context(s). 
The following order must be followed during the stop phase:
- Stop IO context(s). 
- Stop the main context. 
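The start and stop ordering above can be sketched as a small self-contained model. This is only an illustration of the constraints, not DOCA code: `struct sta_model` and every `model_*` function are hypothetical names, and the real library may report ordering violations differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the start/stop ordering; none of these names are DOCA APIs. */
struct sta_model {
    bool main_started; /* main context is running */
    int io_created;    /* IO contexts created so far */
    int io_started;    /* IO contexts currently running */
};

/* Start the main context (allowed once, from the idle state). */
int model_start_main(struct sta_model *m)
{
    assert(m != NULL);
    if (m->main_started)
        return -1;
    m->main_started = true;
    return 0;
}

/* IO contexts can only be created after the main context has been started. */
int model_create_io(struct sta_model *m)
{
    assert(m != NULL);
    if (!m->main_started)
        return -1;
    m->io_created++;
    return 0;
}

/* Start one created-but-not-yet-started IO context. */
int model_start_io(struct sta_model *m)
{
    assert(m != NULL);
    if (m->io_started >= m->io_created)
        return -1;
    m->io_started++;
    return 0;
}

/* Stop one running IO context. */
int model_stop_io(struct sta_model *m)
{
    assert(m != NULL);
    if (m->io_started == 0)
        return -1;
    m->io_started--;
    return 0;
}

/* The main context may stop only after every IO context has stopped. */
int model_stop_main(struct sta_model *m)
{
    assert(m != NULL);
    if (!m->main_started || m->io_started > 0)
        return -1;
    m->main_started = false;
    return 0;
}
```

In this model, creating an IO context before the main context is started fails, and so does stopping the main context while an IO context is still running.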
Create Main Context
The library relies heavily on the DPA, so the control device (PF) with STA functionality must be used to create the doca-sta object.
Use the doca_sta_cap_is_supported API to verify if the PF device supports STA functionality before proceeding.
Add Network Device
The library utilizes network device(s) (SF) to manage incoming and outgoing NVMe-oF traffic.
Before starting the doca-sta object, network devices (resources) must be added to it. Use the doca_sta_add_dev API to add network devices (SF) as needed.
Progress Engine
The Progress Engine (PE) is responsible for progressing tasks and events. Each PE is associated with one or more DOCA contexts, but a DOCA context can only be associated with a single PE.
It is the responsibility of the application using the library to create a dedicated PE and connect the doca-sta context to it.
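The association rule (a PE may serve several contexts, but a context belongs to at most one PE) can be sketched as follows. This is a minimal model under stated assumptions: `struct pe_model`, `struct ctx_model`, and `pe_model_connect` are hypothetical stand-ins, not DOCA Core types or functions, and the fixed capacity is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define PE_MODEL_MAX_CTX 4 /* arbitrary capacity for the sketch */

struct pe_model;

/* Hypothetical context: remembers the single PE it is connected to. */
struct ctx_model {
    struct pe_model *pe; /* NULL while not connected */
};

/* Hypothetical progress engine: can drive one or more contexts. */
struct pe_model {
    struct ctx_model *ctxs[PE_MODEL_MAX_CTX];
    size_t num_ctx;
};

/* Connect ctx to pe; fails if ctx already belongs to a PE or pe is full. */
int pe_model_connect(struct pe_model *pe, struct ctx_model *ctx)
{
    assert(pe != NULL && ctx != NULL);
    if (ctx->pe != NULL || pe->num_ctx == PE_MODEL_MAX_CTX)
        return -1;
    pe->ctxs[pe->num_ctx++] = ctx;
    ctx->pe = pe;
    return 0;
}
```

Giving the main context and each IO context its own PE, as the text recommends, trivially satisfies this rule.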
    
    
        
Start Main Context
Once the doca-sta context is created and configured, it must be started to enable its functionality.
Create IO Context
The IO context represents the IO thread, which is responsible for initiating RDMA connectivity flows and handling "non-offload" commands.
Examples of such flows include:
- Connecting QPs 
- Disconnecting QPs 
- Handling non-offload commands 
Non-offload commands refer to commands that the doca-sta library does not handle internally. These commands must be forwarded to the application layer for processing.
The IO context should be created in a similar manner to the main context.
The application using the library is responsible for creating a dedicated PE for each IO context and connecting the doca-sta-io context to this PE.
    
    
        
Start IO Context
Once the doca-sta-io context is created and configured, it must be started to enable IO-related operations.
The DOCA STA library enables offloading of NVMe-oF traffic to the DPA, providing APIs to configure an NVMe-oF target application in compliance with the NVMe specification.
Before the application can begin receiving traffic from the initiator, the following steps must be completed:
- Add subsystems. 
- Add namespaces to the subsystems. 
- Bind NVMe PCI disks to the namespaces. 
The application is responsible for implementing the RDMA connection management service.
For additional details, refer to the doca_sta_subsystem.h and doca_sta_be.h header files.
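The relationship between subsystems, namespaces, and bound disks can be pictured with the following sketch. It is an illustrative model only: the `model_*` types and functions are hypothetical and are not the objects declared in the header files above. The readiness rule (at least one namespace, all namespaces bound) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_MAX_NS 8

/* Hypothetical namespace: an NVMe namespace ID plus the PCI address of its disk. */
struct model_ns {
    unsigned int nsid; /* valid namespace IDs start at 1 */
    char disk_bdf[16]; /* e.g. "0000:3b:00.0"; empty string while unbound */
};

/* Hypothetical subsystem: an NQN (at most 223 bytes) and its namespaces. */
struct model_subsystem {
    char nqn[224];
    struct model_ns ns[MODEL_MAX_NS];
    size_t num_ns;
};

/* Add a namespace; rejects nsid 0, duplicates, and a full table. */
int model_add_ns(struct model_subsystem *ss, unsigned int nsid)
{
    assert(ss != NULL);
    if (nsid == 0 || ss->num_ns == MODEL_MAX_NS)
        return -1;
    for (size_t i = 0; i < ss->num_ns; i++)
        if (ss->ns[i].nsid == nsid)
            return -1;
    ss->ns[ss->num_ns].nsid = nsid;
    ss->ns[ss->num_ns].disk_bdf[0] = '\0';
    ss->num_ns++;
    return 0;
}

/* Bind a PCI disk (by address string) to an existing namespace. */
int model_bind_disk(struct model_subsystem *ss, unsigned int nsid, const char *bdf)
{
    assert(ss != NULL && bdf != NULL);
    for (size_t i = 0; i < ss->num_ns; i++) {
        struct model_ns *ns = &ss->ns[i];
        if (ns->nsid != nsid)
            continue;
        int n = snprintf(ns->disk_bdf, sizeof(ns->disk_bdf), "%s", bdf);
        if (n <= 0 || (size_t)n >= sizeof(ns->disk_bdf)) {
            ns->disk_bdf[0] = '\0'; /* empty or truncated address: stay unbound */
            return -1;
        }
        return 0;
    }
    return -1; /* no such namespace */
}

/* Assumed rule: ready once there is at least one namespace and all are bound. */
bool model_ready_for_traffic(const struct model_subsystem *ss)
{
    assert(ss != NULL);
    if (ss->num_ns == 0)
        return false;
    for (size_t i = 0; i < ss->num_ns; i++)
        if (ss->ns[i].disk_bdf[0] == '\0')
            return false;
    return true;
}
```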
RDMA Connect
Upon receiving an RDMA connection request, the application should use one of the following APIs:
- doca_sta_io_qp_alloc + doca_sta_io_qp_accept
- doca_sta_io_qp_connect
If the operation completes successfully, an STA QP (Queue Pair) is created, represented by an STA QP handle.
All communication between the application layer and QP API (and vice versa) must occur within the same IO thread (IO context) from which the QP was created.
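One way an application might guard this same-IO-context rule is to record the owning IO context next to each QP handle and check it before every call. The sketch below assumes hypothetical names (`struct io_ctx_model`, `struct qp_handle_model`, and the `qp_model_*` functions); it does not describe any check performed by the library itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an IO context (one per IO thread). */
struct io_ctx_model {
    int id;
};

/* Hypothetical wrapper that remembers which IO context created the QP. */
struct qp_handle_model {
    const struct io_ctx_model *owner;
};

/* Record the creating IO context when the QP is established. */
void qp_model_create(struct qp_handle_model *qp, const struct io_ctx_model *io)
{
    assert(qp != NULL && io != NULL);
    qp->owner = io;
}

/* Allow a QP operation only from the owning IO context; reject any other caller. */
int qp_model_post(const struct qp_handle_model *qp, const struct io_ctx_model *caller)
{
    assert(qp != NULL && caller != NULL);
    return qp->owner == caller ? 0 : -1;
}
```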
Fabric Connect
After a connection is established, the initiator will immediately send a "Fabric Connect" (FC) command capsule.
The application is responsible for handling the FC command as part of the non-offload flow. This includes:
- Retrieving the data/payload for the FC command (if not received as inline data). 
- Verifying the parameters inside the FC command (e.g., host NQN, subsystem NQN).
- Completing FC processing by sending a response to the initiator. 
All actions initiated by the application layer are asynchronous and are part of the non-offload flow. Communication between the DPA and host (DPU) is handled in this context.
For further details, refer to the doca_sta_io_non_offload.h header file.
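The parameter verification step could look roughly like the sketch below. It assumes a simplified, hypothetical view of the FC payload (`struct fc_data_model`) that holds only the two NQN fields, and it covers verification only, not retrieving the payload or sending the response. The 256-byte field size and 223-byte NQN limit follow the NVMe specifications. The acceptance policy (exact match against one served subsystem NQN) is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NQN_FIELD_SIZE 256 /* size of each NQN field in the Connect data */
#define NQN_MAX_LEN    223 /* maximum NQN length in bytes */

/* Hypothetical, simplified view of the Fabric Connect payload; not a DOCA type. */
struct fc_data_model {
    char subnqn[NQN_FIELD_SIZE];
    char hostnqn[NQN_FIELD_SIZE];
};

/* field must point to a NQN_FIELD_SIZE-byte buffer. The NQN must be
 * NUL-terminated inside it, fit the length limit, and start with "nqn.". */
bool nqn_model_is_valid(const char *field)
{
    const char *end = memchr(field, '\0', NQN_FIELD_SIZE);
    if (end == NULL)
        return false;
    size_t len = (size_t)(end - field);
    return len > 4 && len <= NQN_MAX_LEN && strncmp(field, "nqn.", 4) == 0;
}

/* Accept only if both NQNs are well formed and the subsystem NQN is the one served. */
bool fc_model_verify(const struct fc_data_model *fc, const char *served_subnqn)
{
    assert(fc != NULL && served_subnqn != NULL);
    return nqn_model_is_valid(fc->subnqn) && nqn_model_is_valid(fc->hostnqn) &&
           strcmp(fc->subnqn, served_subnqn) == 0;
}
```

A real target would typically also apply its own host access policy before sending the response.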
Tasks
The DOCA STA library provides support for asynchronous tasks to handle various asynchronous flows, such as:
- Disconnecting a QP 
- Removing (detaching) a namespace 
- Destroying a back-end queue 
The asynchronous flow is initiated by the application through the allocation of a task using the appropriate API. Once the task is allocated, the application is responsible for submitting it to trigger the desired action.
After submission, the library manages the execution and monitors the completion of the asynchronous flow. When the operation is completed, the library notifies the application by invoking a callback function.
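The allocate, submit, and completion-callback pattern described above can be sketched with a tiny in-memory queue. All names here (`struct task_model`, `task_model_alloc`, `task_model_submit`, `task_model_progress`, `task_model_count_and_free`) are hypothetical and only mimic the shape of the flow; in the real library the completion is driven by the progress engine and the DPA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct task_model;
typedef void (*task_model_cb)(struct task_model *task, void *user_data);

/* Hypothetical task: a completion callback plus user data. */
struct task_model {
    task_model_cb on_done;
    void *user_data;
    struct task_model *next;
};

/* Hypothetical queue of submitted tasks that have not completed yet. */
struct task_queue_model {
    struct task_model *head, *tail;
};

/* Step 1: allocate a task and attach its completion callback. */
struct task_model *task_model_alloc(task_model_cb cb, void *user_data)
{
    struct task_model *t = calloc(1, sizeof(*t));
    if (t != NULL) {
        t->on_done = cb;
        t->user_data = user_data;
    }
    return t;
}

/* Step 2: submit the task; nothing completes until progress is called. */
void task_model_submit(struct task_queue_model *q, struct task_model *t)
{
    assert(q != NULL && t != NULL);
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Step 3: complete at most one task and invoke its callback; returns 1 on progress. */
int task_model_progress(struct task_queue_model *q)
{
    assert(q != NULL);
    struct task_model *t = q->head;
    if (t == NULL)
        return 0;
    q->head = t->next;
    if (q->head == NULL)
        q->tail = NULL;
    t->on_done(t, t->user_data);
    return 1;
}

/* Example callback: count the completion and free the task. */
void task_model_count_and_free(struct task_model *task, void *user_data)
{
    (*(int *)user_data)++;
    free(task);
}
```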
Task Input
Common input as described in DOCA Core Task.
Task Output
Common output as described in DOCA Core Task.
Task Limitations
- The operation is not atomic 
- Other limitations are described in DOCA Core Task
Non-offload Flow Task
The application layer is responsible for processing NVMe-oF capsules within the non-offload flow.
Initially, the NVMe-oF capsule is delivered from the DPA to the host and then delegated to the application layer for processing. During this process, the application may need to perform RDMA operations to handle the capsule effectively.
The QP is managed by the DOCA STA library and must only be accessed through the library's API.
The following non-offload tasks can be used by the application layer to initiate RDMA operations for processing non-offload capsules:
- RDMA-READ 
- RDMA-WRITE with RDMA-SEND 
- RDMA-SEND 
For additional details, refer to the doca_sta_io_non_offload.h file.
Non-offload RDMA-READ
The RDMA-READ task initiates an RDMA-READ operation. Upon completion of the operation, the library will invoke the appropriate callback function to notify the application layer of its completion.
Task Configuration
| Description | API to Set the Configuration |
| Enable the task | |
Task Allocation
| Description | API to Allocate the Task |
| Allocate the task | |
Task Completion Success
After the task is completed successfully:
- The RDMA-READ operation is completed successfully 
Task Completion Failure
If the task fails midway:
- The RDMA-READ operation fails 
Non-offload RDMA-WRITE + RDMA-SEND
This task is responsible for initiating an RDMA-WRITE operation followed by an RDMA-SEND operation.
Upon successful completion of the operation, the library will invoke the corresponding callback function to notify the application layer about its completion.
Task Configuration
| Description | API to Set the Configuration |
| Enable the task | |
Task Allocation
| Description | API to Allocate the Task |
| Allocate the task | |
Task Completion Success
After the task is completed successfully:
- The RDMA-WRITE and RDMA-SEND operations are completed successfully
Task Completion Failure
If the task fails midway:
- The RDMA-WRITE and RDMA-SEND operations fail
Non-offload RDMA-SEND
This task is responsible for initiating an RDMA-SEND operation.
Upon completion of the operation, the library will invoke the appropriate callback function to notify the application layer of its completion.
Task Configuration
| Description | API to Set the Configuration |
| Enable the task | |
Task Allocation
| Description | API to Allocate the Task |
| Allocate the task | |
Task Completion Success
After the task is completed successfully:
- The RDMA-SEND operation is completed successfully 
Task Completion Failure
If the task fails midway:
- The RDMA-SEND operation fails 
QP Disconnect Task
The QP disconnect task is responsible for initiating the asynchronous flow required to disconnect a QP. The process includes the following steps:
- Move the QP to the error state. 
- Send a notification message from the host (DPU) to the corresponding DPA, indicating the QP disconnect request. 
- The DPA verifies that there are no pending inbound or outbound IO operations. 
- Once verification is complete, the DPA sends a response back to the host (DPU). 
- The library notifies the application about the completion of the disconnect task through a callback function. 
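The disconnect steps above can be modelled as a small state machine. This sketch is a host-only simulation with hypothetical names (`struct conn_model`, `conn_disconnect_submit`, `conn_disconnect_progress`, `conn_set_flag`); the DPA side is reduced to a pending-IO counter, which is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum conn_state { CONN_CONNECTED, CONN_ERROR, CONN_DISCONNECTED };

/* Hypothetical model of a QP undergoing an asynchronous disconnect. */
struct conn_model {
    enum conn_state state;
    unsigned int pending_io;   /* inbound and outbound IOs still in flight */
    bool disconnect_requested; /* request has been sent to the modelled DPA */
    void (*on_disconnected)(void *user_data);
    void *user_data;
};

/* First two steps: move the QP to the error state and notify the DPA. */
int conn_disconnect_submit(struct conn_model *c)
{
    assert(c != NULL);
    if (c->state != CONN_CONNECTED)
        return -1;
    c->state = CONN_ERROR;
    c->disconnect_requested = true;
    return 0;
}

/* Remaining steps: the DPA answers only once no IO is pending, then the callback fires. */
bool conn_disconnect_progress(struct conn_model *c)
{
    assert(c != NULL);
    if (!c->disconnect_requested || c->pending_io > 0)
        return false;
    c->disconnect_requested = false;
    c->state = CONN_DISCONNECTED;
    if (c->on_disconnected != NULL)
        c->on_disconnected(c->user_data);
    return true;
}

/* Example completion callback: set a flag owned by the application. */
void conn_set_flag(void *user_data)
{
    *(bool *)user_data = true;
}
```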
Task Configuration
| Description | API to Set the Configuration |
| Enable the task | |
Task Allocation
| Description | API to Allocate the Task |
| Allocate the task | |
Task Completion Success
After the task is completed successfully:
- The QP is disconnected and might be destroyed 
Task Completion Failure
If the task fails midway:
- The QP was not disconnected 
Remove Namespace Task
The remove (detach) namespace task initiates the asynchronous process for detaching a namespace. The flow includes the following steps:
- Send a notification message from the host (DPU) to the corresponding DPA about the detach namespace request. 
- The DPA marks the namespace as no longer valid. 
- The DPA sends a response back to the host (DPU) confirming the detachment. 
- The library notifies the application about the task's completion through a callback function. 
Task Configuration
| Description | API to Set the Configuration |
| Enable the task | |
Task Allocation
| Description | API to Allocate the Task |
| Allocate the task | |
Task Completion Success
After the task is completed successfully:
- The namespace is removed 
- Any I/O sent to the removed namespace is completed with an error 
Task Completion Failure
If the task fails midway:
- The namespace is not removed 
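The effect of a successful detach on later I/O can be illustrated with a minimal model. The names (`struct ns_model`, `ns_model_detach_done`, `ns_model_io`) and the fixed table size are hypothetical; the sketch only shows that I/O addressed to a removed namespace completes with an error while other namespaces keep working.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NS_MODEL_MAX_NSID 4 /* arbitrary table size for the sketch */

/* Hypothetical model: one validity flag per namespace ID (index 0 is unused). */
struct ns_model {
    bool valid[NS_MODEL_MAX_NSID + 1];
};

/* Called when the detach task completes successfully: mark the namespace invalid. */
void ns_model_detach_done(struct ns_model *m, unsigned int nsid)
{
    assert(m != NULL && nsid >= 1 && nsid <= NS_MODEL_MAX_NSID);
    m->valid[nsid] = false;
}

/* Returns 0 for I/O to a valid namespace and -1 (error completion) otherwise. */
int ns_model_io(const struct ns_model *m, unsigned int nsid)
{
    assert(m != NULL);
    if (nsid < 1 || nsid > NS_MODEL_MAX_NSID || !m->valid[nsid])
        return -1;
    return 0;
}
```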
Destroy Back-end Queue Task
The destroy back-end (BE) queue task initiates the asynchronous process for destroying a back-end queue. The flow includes the following steps:
- Send a notification message from the host (DPU) to the appropriate DPA to request the destruction of the back-end queue. 
- The DPA marks the back-end queue as destroyed. 
- Any outstanding commands in the queue are completed with an error status. 
- The DPA sends a response back to the host (DPU) confirming the destruction. 
- The library notifies the application about the task's completion through a callback function. 
Task Configuration
| Description | API to Set the Configuration |
| Enable the task | |
Task Allocation
| Description | API to Allocate the Task |
| Allocate the task | |
Task Completion Success
After the task is completed successfully:
- The BE queue is removed 
Task Completion Failure
If the task fails midway:
- The BE queue is not removed 
The DOCA STA library follows the context state machine as described in DOCA Core Context State Machine.
The following section describes how to move states and what is allowed in each state.
Idle
In this state, it is expected that the application:
- Destroys the context 
- Starts the context 
Allowed operations:
- Configuring the context according to the Configurations section
- Starting the context 
It is possible to reach this state as follows:
| Previous State | Transition Action | 
| None | Create the context | 
| Running | Call stop after making sure all tasks have been freed | 
| Stopping | Call progress until all tasks are completed and freed | 
Starting
This state cannot be reached.
Running
In this state, it is expected that the application:
- Allocates and submits tasks 
- Calls progress to complete tasks and/or receive events 
Allowed operations:
- Allocating a previously configured task 
- Submitting a task 
- Calling stop 
It is possible to reach this state as follows:
| Previous State | Transition Action | 
| Idle | Call start after the configuration | 
Stopping
In this state, it is expected that the application:
- Calls progress to complete all in-flight tasks (these tasks complete with failure)
- Frees any completed tasks 
Allowed operations:
- Call progress 
It is possible to reach this state as follows:
| Previous State | Transition Action | 
| Running | Call progress and a fatal error occurs | 
| Running | Call stop without freeing all tasks | 
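The transitions listed in the tables above can be summarized in a compact model. The names (`struct sm_ctx`, `sm_start`, `sm_submit`, `sm_stop`, `sm_progress`) are hypothetical and do not correspond to DOCA Core functions. For brevity the sketch treats completing a task and freeing it as one step, and it leaves out the fatal-error transition from Running to Stopping.

```c
#include <assert.h>
#include <stddef.h>

enum sm_state { SM_IDLE, SM_RUNNING, SM_STOPPING };

/* Hypothetical context model: current state plus tasks not yet completed and freed. */
struct sm_ctx {
    enum sm_state state;
    unsigned int inflight;
};

/* Idle -> Running. */
int sm_start(struct sm_ctx *c)
{
    assert(c != NULL);
    if (c->state != SM_IDLE)
        return -1;
    c->state = SM_RUNNING;
    return 0;
}

/* Tasks can only be submitted while running. */
int sm_submit(struct sm_ctx *c)
{
    assert(c != NULL);
    if (c->state != SM_RUNNING)
        return -1;
    c->inflight++;
    return 0;
}

/* Running -> Idle when nothing is in flight, otherwise Running -> Stopping. */
int sm_stop(struct sm_ctx *c)
{
    assert(c != NULL);
    if (c->state != SM_RUNNING)
        return -1;
    c->state = (c->inflight == 0) ? SM_IDLE : SM_STOPPING;
    return 0;
}

/* Complete and free one task; a stopping context becomes idle once drained. */
int sm_progress(struct sm_ctx *c)
{
    assert(c != NULL);
    if (c->state == SM_IDLE || c->inflight == 0)
        return 0;
    c->inflight--;
    if (c->state == SM_STOPPING && c->inflight == 0)
        c->state = SM_IDLE;
    return 1;
}
```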
Offload Sample
This sample application demonstrates the basic usage of the DOCA STA API:

    cd /opt/mellanox/doca/applications
    meson setup /tmp/build
    ninja -C /tmp/build

    /tmp/build/sta_offload/doca_sta_offload --doca_lib_lvl 50 --doca_app_lvl 40 --sf_dev mlx5_0

For a reference application utilizing the DOCA STA library, please get in touch with NVIDIA Enterprise Support.