DOCA STA
A storage target is any application that can access data storage, including hard drives, SSDs, network-attached storage (NAS), and cloud storage services.
It serves storage access requests from local or remote initiators.
To allow the storage target to receive remote requests (over the network), the NVMe (Non-Volatile Memory Express) over Fabrics (NVMe-oF) protocol was introduced.
NVMe-oF is a protocol specification designed to extend the capabilities of NVMe storage across network fabrics (for example, RoCEv1/IB RDMA and NVMe/TCP).
NVMe-oF extends the concept of storage targets by enabling NVMe commands to be executed over a network, rather than being limited to direct-attached storage.
A storage target application that supports NVMe-oF is a heavy application that consumes a substantial number of CPU cores.
To improve latency, free CPU resources, and save power, an offload solution is required. Starting from BlueField-3 (BF3), the target application can be offloaded to the DPA.
To abstract and ease the integration between the target application (in our case SPDK) and the underlying offload accelerator,
a new DOCA library, DOCA STA, is introduced. The library exposes a public API that the target application (a DOCA application) uses for control-plane and data-plane handling.
While a software-based storage target accesses the NVMe drives directly over PCIe, the DPA storage target offload accelerator accesses the NVMe drives
over a PCIe P2P (peer-to-peer) topology. Such access requires a special kernel configuration.
To operate the DPA storage target offload, a dedicated patched P2P kernel is needed:
If the target application runs on the host machine, install a patched kernel that supports P2P (TBD: add patch link)
If the target application runs on a BlueField machine (attached directly to the NVMe drives), install a BFB that supports P2P (TBD: link to BFB)
In addition, the library makes use of existing DOCA libraries, so it is recommended to read the documentation of those libraries first.
DOCA STA-based applications can run either on the host machine or on the NVIDIA® BlueField®-3 networking platform (DPU or SuperNIC).
When running on the BlueField platform, the NVMe drives should be directly PCIe-attached to the platform. BlueField-3 or later is required, as
it is the first platform that contains the DPA accelerator.
DOCA STA library is composed of two parts:
Host side - exposes an API to be used by the target application. Using this API, the application can:
Initialize and configure the accelerator parameters
Register callbacks for data processing - some NVMe-oF command capsules cannot be processed by the offload engine and are forwarded to the library (non-offloaded capsules).
The application can process these capsules using a registered callback
Register callbacks for asynchronous event processing - the offload engine can notify the library about events that occurred, such as an NVMe drive timeout.
The application can process these events using a registered callback
Register tasks for various datapath and control-path operations:
Datapath tasks - after the application processes non-offloaded capsules, it can register tasks with the library for RDMA transmission (R/W/Send)
Control-path tasks - some control-path operations can take time and may complete asynchronously, for example, disconnecting a QP or destroying the back end.
The application can register tasks for these operations
Device side - low-level accelerator code that implements the target offload logic (e.g., DPA handler code).
DOCA STA has two types of DOCA contexts:
DOCA STA context - a DOCA context that is responsible for system initialization and configuration. It performs all target system-wide operations.
There should be only a single instance of the DOCA STA context in the system (VM)
DOCA STA IO context - a DOCA context that is responsible for RDMA IO operations. A DOCA STA IO context can exist per system thread (pthread)
DOCA STA makes use of other DOCA libraries. The key libraries are:
DOCA RDMA - used to connect/disconnect QPs on the accelerator datapath (e.g., DPA)
DOCA COMCH - used as the communication channel to and from the offload accelerator engine. For example, to post RDMA tasks and receive completions, or to post QP disconnect tasks and get completions
The following diagram illustrates the high-level connectivity of DOCA STA:

Objects
Device and Device Representor
The STA library uses a control device (PF) and one or more network devices (SF). The control device is used to set up
the DPA, e.g., to load the image (DPA-related application).
The network device is used to manage incoming and outgoing NVMe-oF traffic.
An application that uses the doca-sta library must select a control device and network devices with an appropriate representor.
For more details - see DOCA Core Device Discovery and DOCA Core Device Representor Discovery.
To start using the library, users must go through a configuration phase as described in the DOCA Core Context Configuration Phase.
This section describes how to configure and start the doca-sta context.
Configurations
The doca-sta context should be configured to match the application use case.
Such configuration includes:
create the doca_sta object (main context)
add network device(s)
connect the doca-sta context to the DOCA progress engine (PE)
start the doca-sta context
create the doca-sta-io object (io context/io threads)
connect the doca-sta-io context to the DOCA progress engine (PE)
start the doca-sta-io context(s)
Note: the io context may be created only after the start of the main context.
Only the following start configuration order is allowed:
create the main context
start the main context
create io context(s)
start io context(s)
Only the following stop configuration order is allowed:
stop io context(s)
stop the main context
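The ordering rules above can be pictured with a small model. This is only an illustrative sketch in Python, not the DOCA STA API: the class and method names (`MainContext`, `IoContext`, `OrderError`) are hypothetical and exist only to show which sequences the rules permit.

```python
class OrderError(RuntimeError):
    """Raised when a context is created, started, or stopped out of order."""


class MainContext:
    """Hypothetical stand-in for the main doca-sta context (not the DOCA API)."""

    def __init__(self):
        self.running = False
        self.io_contexts = []

    def start(self):
        self.running = True

    def create_io_context(self):
        # Rule from the text: io contexts may be created only after the main context starts.
        if not self.running:
            raise OrderError("start the main context before creating io contexts")
        io_ctx = IoContext(self)
        self.io_contexts.append(io_ctx)
        return io_ctx

    def stop(self):
        # Rule from the text: io contexts are stopped before the main context.
        if any(io_ctx.running for io_ctx in self.io_contexts):
            raise OrderError("stop all io contexts before stopping the main context")
        self.running = False


class IoContext:
    """Hypothetical stand-in for a doca-sta-io context."""

    def __init__(self, main):
        self.main = main
        self.running = False

    def start(self):
        if not self.main.running:
            raise OrderError("the main context must be running")
        self.running = True

    def stop(self):
        self.running = False


# Allowed start order: create main, start main, create io, start io.
main = MainContext()
main.start()
io = main.create_io_context()
io.start()

# Allowed stop order: stop io, then stop main.
io.stop()
main.stop()
```

Calling `create_io_context()` before `start()`, or `stop()` on the main context while an io context is still running, raises `OrderError` in this model.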
Create main context
The library extensively uses the DPA, so the control device (PF) that supports STA functionality should be used to create the doca-sta object.
Use the doca_sta_cap_is_supported API to check whether the PF device supports the STA functionality.
Add network device
The library uses the network device(s) (SF) to manage incoming and outgoing NVMe-oF traffic.
Before starting the doca-sta object, the network devices (resources) must be added to it.
Use the doca_sta_add_dev API to add a network device (SF) for this purpose.
Progress engine
The PE is used to progress tasks and events. A progress engine can be associated with one or
more DOCA contexts, but a DOCA context can be associated with only one progress engine.
Note: it is the responsibility of the application that uses the library to create a dedicated PE and
connect the doca-sta context to this PE.
Start main context
Once the doca-sta context is created and configured, it should be started.
Create io context
The io context represents the io thread. The io thread is used to initiate RDMA connectivity flows
as well as handle 'non-offload' commands.
The following are examples of such flows:
connect QP
disconnect QP
handle non-offload command
A non-offload command is a command that is not handled by the doca-sta library itself.
Such commands should be delivered to the application layer for further processing.
The io context should be created in the same way as the main context:
It is the responsibility of the application that uses the library to create a dedicated PE for each io context
and connect doca-sta-io context to this PE.
Start io context
Once the doca-sta-io context is created and configured, it should be started.
The library is used to offload NVMe-oF traffic to the DPA.
The library provides APIs for configuring an NVMe-oF target application in compliance with the NVMe specification.
The application must complete the following steps before it can receive traffic from the initiator side.
add subsystems
add namespaces into subsystems
bind NVMe PCI disks to the namespaces
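The relationship between subsystems, namespaces, and backing disks can be sketched as a small data model. This is a hedged illustration in Python rather than the library's API (which is declared in doca_sta_subsystem.h and doca_sta_be.h): the class names, the `ready_for_traffic` check, the NQN, and the PCI address are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Namespace:
    nsid: int
    disk_pci_address: Optional[str] = None  # backing NVMe PCI disk, once bound


@dataclass
class Subsystem:
    nqn: str
    namespaces: Dict[int, Namespace] = field(default_factory=dict)

    def add_namespace(self, nsid: int) -> Namespace:
        # NVMe namespace IDs start at 1; 0 is not a valid namespace ID.
        if nsid < 1:
            raise ValueError("namespace IDs start at 1")
        if nsid in self.namespaces:
            raise ValueError(f"namespace {nsid} already exists")
        ns = Namespace(nsid)
        self.namespaces[nsid] = ns
        return ns

    def bind_disk(self, nsid: int, pci_address: str) -> None:
        self.namespaces[nsid].disk_pci_address = pci_address

    def ready_for_traffic(self) -> bool:
        # In this model, a subsystem is ready when it has at least one
        # namespace and every namespace has a disk bound to it.
        return bool(self.namespaces) and all(
            ns.disk_pci_address is not None for ns in self.namespaces.values()
        )


# Step 1: add a subsystem (placeholder NQN).
subsystem = Subsystem("nqn.2024-01.com.example:target0")
# Step 2: add a namespace to the subsystem.
subsystem.add_namespace(1)
before = subsystem.ready_for_traffic()
# Step 3: bind an NVMe PCI disk (placeholder address) to the namespace.
subsystem.bind_disk(1, "0000:03:00.0")
after = subsystem.ready_for_traffic()
```

In this sketch, `before` is false because the namespace has no disk yet, and `after` is true once the disk is bound.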
Note: The application utilizing the library is responsible for implementing the RDMA connection management service.
Please refer to the content of the doca_sta_subsystem.h and doca_sta_be.h header files for more information.
RDMA connect
Upon receiving the RDMA connect request, the application should call an appropriate API:
doca_sta_io_qp_alloc + doca_sta_io_qp_accept
or
doca_sta_io_qp_connect
Upon successful completion, the sta QP is created. The sta QP handle represents the sta QP.
Note: all subsequent interaction between the application layer and the QP API, in either direction, should take place in the context of
the same io thread (io context) from which the QP was created.
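An application may want to guard against accidental cross-thread use of a QP handle. The following is a minimal sketch of such a guard in Python; `QpHandle` and `check_owner` are hypothetical names, and the library itself is not claimed to perform this check.

```python
import threading


class QpHandle:
    """Hypothetical wrapper that remembers which thread created the QP."""

    def __init__(self):
        self.owner_thread = threading.get_ident()

    def check_owner(self):
        # Reject any use from a thread other than the creating io thread.
        if threading.get_ident() != self.owner_thread:
            raise RuntimeError("QP used outside the io thread that created it")


qp = QpHandle()
qp.check_owner()  # same thread: passes silently

errors = []


def use_from_other_thread():
    try:
        qp.check_owner()
    except RuntimeError as exc:
        errors.append(str(exc))


worker = threading.Thread(target=use_from_other_thread)
worker.start()
worker.join()
```

After the worker thread finishes, `errors` holds one message, because the check ran on a thread other than the one that created the handle.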
Fabric connect
Immediately after the connection is established, the initiator sends the "fabric connect" (FC) command capsule.
The FC should be handled by the application (non-offload flow).
The application layer is responsible for:
get data/payload for the FC command (if it was not received as inline data)
verify parameters inside FC command (host nqn, subsystem nqn, etc.)
complete FC processing (send a response to the initiator)
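The parameter verification step can be sketched as follows. This is an assumption-laden illustration in Python, not the library's logic: the function names, the result strings, and the per-subsystem host allow list are hypothetical, and the NQN check is deliberately simplified to a prefix and length test (the NVMe specification limits an NQN to 223 bytes).

```python
MAX_NQN_LEN = 223  # maximum NQN length in bytes per the NVMe specification


def nqn_is_well_formed(nqn):
    """Simplified check: correct prefix and within the length limit."""
    return nqn.startswith("nqn.") and len(nqn.encode("utf-8")) <= MAX_NQN_LEN


def check_fabric_connect(host_nqn, sub_nqn, subsystems):
    """Validate the NQNs carried by a fabric connect command.

    subsystems maps a subsystem NQN to the set of host NQNs allowed to
    connect to it; an empty set means any host is allowed (an assumption
    made for this sketch).
    """
    if not (nqn_is_well_formed(host_nqn) and nqn_is_well_formed(sub_nqn)):
        return "invalid-nqn"
    if sub_nqn not in subsystems:
        return "unknown-subsystem"
    allowed_hosts = subsystems[sub_nqn]
    if allowed_hosts and host_nqn not in allowed_hosts:
        return "host-not-allowed"
    return "ok"


# Placeholder NQNs for illustration only.
subsystems = {
    "nqn.2024-01.com.example:target0": {"nqn.2024-01.com.example:host0"},
}
result = check_fabric_connect(
    "nqn.2024-01.com.example:host0",
    "nqn.2024-01.com.example:target0",
    subsystems,
)
```

Only after a check like this succeeds would the application send a successful response to the initiator; any other result would translate into an error response.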
Note: all actions initiated by the application layer are asynchronous. They are part of the
non-offload flow/communication between DPA ↔ host (DPU).
Please refer to the content of the doca_sta_io_non_offload.h header file for more information.
Tasks
DOCA STA exposes asynchronous tasks that should be used for asynchronous flows such as:
disconnect QP
remove (detach) namespace
destroy back-end queue
The asynchronous flow is initiated by allocating a task through the appropriate API.
The application is responsible for submitting the task to initiate the desired action.
The library is responsible for executing the asynchronous flow and monitoring its completion.
Once the flow completes, the library notifies the application by invoking the callback function.
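The allocate, submit, progress, callback cycle described above can be modeled with a toy progress engine. This Python sketch is illustrative only: `Task`, `ProgressEngine`, and their methods are hypothetical and do not correspond to DOCA signatures.

```python
from collections import deque


class Task:
    """Hypothetical task: a name plus a completion callback."""

    def __init__(self, name, on_done):
        self.name = name
        self.on_done = on_done
        self.status = "allocated"


class ProgressEngine:
    """Toy progress engine that completes one in-flight task per call."""

    def __init__(self):
        self._inflight = deque()

    def submit(self, task):
        task.status = "submitted"
        self._inflight.append(task)

    def progress(self):
        """Complete at most one task; return 1 if progress was made, else 0."""
        if not self._inflight:
            return 0
        task = self._inflight.popleft()
        task.status = "completed"
        task.on_done(task)  # notify the application through its callback
        return 1


completed = []
pe = ProgressEngine()

# The application allocates the task and submits it.
task = Task("qp_disconnect", completed.append)
pe.submit(task)

# The application drives the engine until nothing is left in flight.
while pe.progress():
    pass
```

The point of the model is that nothing completes until the application calls `progress()`: submission only queues the work, and the callback runs from inside the progress call.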
Task Input
Common input as described in DOCA Core Task.
Task Output
Common output as described in DOCA Core Task.
Task Limitations
The operation is not atomic
Other limitations are described in the DOCA Core Task
Non-offload flow task
The app layer is responsible for processing the NVMe-oF capsules of the non-offload flow.
First, the NVMe-oF capsule is delivered from the DPA → host and eventually delegated to the app layer.
It might be required to perform the RDMA operations during the processing of the capsule.
Note: the QP is owned by DOCA STA library and must be accessed by the library API only.
The following are the non-offload tasks that can be used for initiating the RDMA operation
(by the application layer, for processing non-offload capsules):
RDMA-READ
RDMA-WRITE with RDMA-SEND
RDMA-SEND
Please refer to the content of the doca_sta_io_non_offload.h file for more information.
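As a rough guide to when each task type is needed, the sketch below maps a capsule's data direction to the list of non-offload tasks an application might issue. The mapping reflects how NVMe-oF RDMA transports generally move data (RDMA read to fetch host data, RDMA write to return data, send for the response capsule). It is an assumption of this example, not documented library behavior, and the enum and function names are hypothetical.

```python
from enum import Enum


class DataDirection(Enum):
    NONE = "none"
    HOST_TO_CONTROLLER = "host_to_controller"
    CONTROLLER_TO_HOST = "controller_to_host"


def select_tasks(direction, data_is_inline):
    """Return the non-offload task types needed to finish one capsule."""
    tasks = []
    if direction is DataDirection.HOST_TO_CONTROLLER and not data_is_inline:
        # The payload was not carried in the capsule, so fetch it first.
        tasks.append("RDMA-READ")
    if direction is DataDirection.CONTROLLER_TO_HOST:
        # Return the data and the response together.
        tasks.append("RDMA-WRITE with RDMA-SEND")
    else:
        # Only a response capsule has to go back to the initiator.
        tasks.append("RDMA-SEND")
    return tasks


# A fabric connect command whose data was not received inline.
fc_tasks = select_tasks(DataDirection.HOST_TO_CONTROLLER, data_is_inline=False)
```

For the fabric connect case shown, the model yields an RDMA-READ to fetch the payload followed by an RDMA-SEND carrying the response.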
non-offload RDMA-READ
The task is responsible for initiating RDMA-READ operation.
Upon completion, an appropriate callback function will be invoked to notify the application layer about completion.
Task Configuration
Description | API to set the configuration |
Enable the task |
|
Task Allocation
Description | API to set the configuration |
Allocate the task |
|
Task Completion Success
After the task is completed successfully:
The RDMA-READ operation was completed successfully
Task Completion Failure
If the task fails midway:
The RDMA-READ operation failed
non-offload RDMA-WRITE + RDMA-SEND
The task is responsible for initiating the RDMA-WRITE operation in conjunction with the RDMA-SEND.
Upon completion, an appropriate callback function will be invoked to notify the application layer about completion.
Task Configuration
Description | API to set the configuration |
Enable the task |
|
Task Allocation
Description | API to set the configuration |
Allocate the task |
|
Task Completion Success
After the task is completed successfully:
The RDMA-WRITE and RDMA-SEND operations completed successfully
Task Completion Failure
If the task fails midway:
The RDMA-WRITE and RDMA-SEND operations failed
non-offload RDMA-SEND
The task is responsible for initiating the RDMA-SEND operation.
Upon completion, an appropriate callback function will be invoked to notify the application layer about completion.
Task Configuration
Description | API to set the configuration |
Enable the task |
|
Task Allocation
Description | API to set the configuration |
Allocate the task |
|
Task Completion Success
After the task is completed successfully:
The RDMA-SEND operation was completed successfully
Task Completion Failure
If the task fails midway:
The RDMA-SEND operation failed
QP disconnect task
The QP disconnect task is responsible for initiating the asynchronous flow of QP disconnect:
move QP to the error state
send a notification message about the QP disconnect from the host (DPU) to the appropriate DPA
DPA should verify that no inbound or outbound IO is in flight
DPA will send a response to the host/DPU
notify the application about task completion
Task Configuration
Description | API to set the configuration |
Enable the task |
|
Task Allocation
Description | API to set the configuration |
Allocate the task |
|
Task Completion Success
After the task is completed successfully:
The QP is disconnected and can be destroyed
Task Completion Failure
If the task fails midway:
The QP was not disconnected
Remove namespace task
The remove (detach) namespace task is responsible for initiating the asynchronous flow that detaches a namespace:
send a notification message about the namespace detach from the host (DPU) to the appropriate DPA
DPA should mark the namespace as no longer valid
DPA will send a response to the host/DPU
notify the application about task completion
Task Configuration
Description | API to set the configuration |
Enable the task |
|
Task Allocation
Description | API to set the configuration |
Allocate the task |
|
Task Completion Success
After the task is completed successfully:
The namespace is removed
Any I/O sent to the removed namespace will be completed with an error
Task Completion Failure
If the task fails midway:
The namespace is not removed
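The effect of a successful namespace removal on later I/O can be pictured with a small model. This Python sketch is illustrative only: `TargetModel` and its methods are hypothetical, and the generic "error" result stands in for whatever status the offload engine actually returns.

```python
class TargetModel:
    """Hypothetical model of the set of namespaces the DPA treats as valid."""

    def __init__(self, nsids):
        self.valid_nsids = set(nsids)

    def detach_namespace(self, nsid):
        """Mark a namespace as no longer valid; return False if it was not attached."""
        if nsid not in self.valid_nsids:
            return False  # models task failure: nothing was removed
        self.valid_nsids.discard(nsid)
        return True

    def submit_io(self, nsid):
        # I/O to a removed (or never attached) namespace completes with an error.
        return "success" if nsid in self.valid_nsids else "error"


target = TargetModel([1, 2])
io_before = target.submit_io(1)
removed = target.detach_namespace(1)
io_after = target.submit_io(1)
io_other = target.submit_io(2)
```

Here `io_before` succeeds, `io_after` fails because namespace 1 was detached, and `io_other` shows that the remaining namespace is unaffected.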
Destroy back-end queue task
The destroy back-end (be) queue task is responsible for initiating the asynchronous flow that destroys a back-end queue:
send a notification message about destroying the be queue from the host (DPU) to the appropriate DPA
DPA should mark the be queue as destroyed
Any outstanding commands should be completed with an error
DPA will send a response to the host/DPU
notify the application about task completion
Task Configuration
Description | API to set the configuration |
Enable the task |
|
Task Allocation
Description | API to set the configuration |
Allocate the task |
|
Task Completion Success
After the task is completed successfully:
The be queue is removed
Task Completion Failure
If the task fails midway:
The be queue is not removed
The DOCA STA library follows the context state machine as described in DOCA Core Context State Machine.
The following section describes how to move states and what is allowed in each state.
Idle
In this state, it is expected that the application:
Destroys the context
Starts the context
Allowed operations:
Configuring the context according to the Configurations section
Starting the context
It is possible to reach this state as follows:
Previous State | Transition Action |
None | Create the context |
Running | Call stop after making sure all tasks have been freed |
Stopping | Call progress until all tasks are completed and freed |
Starting
This state cannot be reached.
Running
In this state, it is expected that the application:
Allocates and submits tasks
Calls progress to complete tasks and/or receive events
Allowed operations:
Allocating a previously configured task
Submitting a task
Calling stop
It is possible to reach this state as follows:
Previous State | Transition Action |
Idle | Call start after the configuration |
Stopping
In this state, it is expected that the application:
Calls progress to complete all in-flight tasks (the tasks are completed with failure)
Frees any completed tasks
Allowed operations:
Call progress
It is possible to reach this state as follows:
Previous State | Transition Action |
Running | Call progress and a fatal error occurs |
Running | Call stop without freeing all tasks |
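The transitions and allowed operations listed in this section can be collected into a lookup table. The sketch below is a Python model of the tables above, not library code; the state names, action names, and the `next_state` helper are hypothetical labels chosen for this example.

```python
# (current state, action) -> next state, taken from the transition tables above.
TRANSITIONS = {
    ("idle", "start"): "running",
    ("running", "stop_all_tasks_freed"): "idle",
    ("running", "stop_with_tasks_inflight"): "stopping",
    ("running", "fatal_error"): "stopping",
    ("stopping", "all_tasks_completed_and_freed"): "idle",
}

# Operations the text lists as expected or allowed in each state.
ALLOWED_OPS = {
    "idle": {"configure", "start", "destroy"},
    "running": {"alloc_task", "submit_task", "progress", "stop"},
    "stopping": {"progress", "free_task"},
}


def next_state(state, action):
    """Return the state reached by taking an action, or raise if not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"action {action!r} is not valid in state {state!r}") from None


# Walk one full lifecycle: start, stop with tasks in flight, then drain.
state = "idle"
state = next_state(state, "start")
state = next_state(state, "stop_with_tasks_inflight")
state = next_state(state, "all_tasks_completed_and_freed")
```

The walk ends back in the idle state, where the context can be reconfigured, restarted, or destroyed. Note that no entry leads to a "starting" state, matching the statement that it cannot be reached.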
Offload sample
This sample application demonstrates the basic usage of the doca-sta API.
cd /opt/mellanox/doca/applications
meson setup /tmp/build
ninja -C /tmp/build
/tmp/build/sta_offload/doca_sta_offload --doca_lib_lvl 50 --doca_app_lvl 40 --sf_dev mlx5_0
For a reference application utilizing the DOCA STA library, please get in touch with NVIDIA Enterprise Support.