NVIDIA DOCA NVMe Emulation App Guide
This document describes an NVMe emulation implementation on top of the NVIDIA® BlueField® DPU.
The NVMe emulation application demonstrates how to use the DOCA DevEmu PCI Generic API along with SPDK to emulate an NVMe PCIe function, using hardware acceleration to fully emulate the storage device.
Brief Introduction to NVMe
NVMe (Non-Volatile Memory Express) is a high-performance storage protocol designed for accessing non-volatile storage media. It operates over the PCIe bus, providing a direct and high-speed connection between the CPU and storage devices. This results in significantly lower latency and higher input/output operations per second (IOPS) compared to older storage protocols. NVMe achieves these performance improvements through its scalable, multi-queue architecture, which allows thousands of I/O commands to be processed in parallel.
NVMe emulation is a mechanism that simulates NVMe device behavior in virtualized or development environments, eliminating the need for physical NVMe hardware.
Controller Registers
NVMe controllers feature several memory-mapped registers located within memory regions defined by the Base Address Registers (BARs). These registers convey the controller's status, enable the host to configure operational settings, and facilitate error reporting.
Key NVMe controller registers include:
CC – Controller Configuration – Configures the controller's operational parameters, including enabling or disabling it and specifying I/O command sets
CSTS – Controller Status – Reports the controller's status, including its readiness and any fatal errors
CAP – Capabilities – Details the capabilities of the controller
VS – Version – Indicates the supported NVMe version
AQA – Admin Queue Attributes – Specifies the sizes of the admin queues
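The BAR0 offsets of these registers are fixed by the NVMe specification. A minimal sketch of those offsets in C (the macro names are illustrative and not taken from any particular header):
/* NVMe controller register offsets within BAR0, per the NVMe specification. */
#define NVME_REG_CAP   0x00  /* Controller Capabilities (64-bit) */
#define NVME_REG_VS    0x08  /* Version */
#define NVME_REG_CC    0x14  /* Controller Configuration */
#define NVME_REG_CSTS  0x1C  /* Controller Status */
#define NVME_REG_AQA   0x24  /* Admin Queue Attributes */
#define NVME_REG_ASQ   0x28  /* Admin Submission Queue base address (64-bit) */
#define NVME_REG_ACQ   0x30  /* Admin Completion Queue base address (64-bit) */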
Initialization
Initializing an NVMe controller involves configuring the controller registers and preparing the system to communicate with the NVMe device. The process typically follows these steps:
The host clears the enable bit in the CC register (sets it to 0) to reset the controller.
The host sets up the admin submission and completion queue registers (AQA, ASQ, and ACQ) to handle administrative commands.
The host configures the controller by setting initial parameters in the CC register and enables it by setting the enable bit.
Administrative commands are issued to retrieve namespace information and perform other setup tasks.
The host creates I/O submission and completion queues, preparing the controller for I/O operations.
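The following sketch ties these steps to the register offsets listed earlier. It is an illustration only, assuming hypothetical reg_read32/reg_write32/reg_write64 helpers that access BAR0 rather than any particular driver API; the depth, address, and CC values are placeholders supplied by the host driver:
/* Hypothetical MMIO helpers: reg_read32/reg_write32/reg_write64 access BAR0. */
reg_write32(NVME_REG_CC, reg_read32(NVME_REG_CC) & ~0x1u);            /* clear CC.EN to reset */
while (reg_read32(NVME_REG_CSTS) & 0x1u)                              /* wait for CSTS.RDY = 0 */
    ;
reg_write32(NVME_REG_AQA, ((acq_depth - 1) << 16) | (asq_depth - 1)); /* admin queue sizes (zero-based) */
reg_write64(NVME_REG_ASQ, asq_dma_addr);                              /* admin SQ base address */
reg_write64(NVME_REG_ACQ, acq_dma_addr);                              /* admin CQ base address */
reg_write32(NVME_REG_CC, cc_value | 0x1u);                            /* program CC and set CC.EN */
while (!(reg_read32(NVME_REG_CSTS) & 0x1u))                           /* wait for CSTS.RDY = 1 */
    ;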
Reset and Shutdown
Reset and shutdown operations ensure the proper handling of the controller and the maintenance of data integrity.
The reset process involves setting the enable bit in the CC register to 0, which stops all I/O operations and clears the controller's state, returning it to a known state.
Shutdown is initiated by setting the shutdown notification (SHN) field in the CC register. This allows the controller to halt operations, flush caches, and ensure that any data in flight is safely handled before powering off or resetting.
Completion Queue
The completion queue (CQ) in NVMe stores entries that the controller writes after processing commands. Each entry in the CQ corresponds to a command submitted through a submission queue (SQ), and the host checks the CQ to track the status of these commands.
CQs are implemented as circular buffers in host memory. The host can either poll the CQ or use interrupts to be notified when new entries are available.
Completion Queue Element
Each completion queue element (CQE) in an NVMe CQ is an individual entry that contains status information about a completed command, including:
CID – Identifier of the completed command
SQID – ID of the SQ from which the command was issued
SQHD – Marks the point in the SQ up to which commands have been completed
SF – Indicates the status of the completed command
P (phase tag) – Flags whether the completion entry is new
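Each CQE is 16 bytes. A sketch of its generic layout per the NVMe specification (the struct and field names are illustrative; SPDK provides an equivalent definition, struct spdk_nvme_cpl, in its nvme_spec.h header):
#include <stdint.h>

struct nvme_cqe {          /* 16 bytes total */
    uint32_t cdw0;         /* command-specific result */
    uint32_t reserved;
    uint16_t sqhd;         /* SQ head pointer at the time of completion */
    uint16_t sqid;         /* SQ on which the command was issued */
    uint16_t cid;          /* command identifier */
    uint16_t status;       /* bit 0: phase tag (P); bits 15:1: status field (SF) */
};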
Submission Queue
The submission queue (SQ) is where the host places admin and I/O commands for the controller to execute. It operates as a circular buffer, with each SQ paired with a corresponding CQ. NVMe supports multiple SQs, with each one assigned to a specific CQ.
The controller is notified of new commands via the doorbell mechanism.
Submission Queue Element
An SQ element (SQE) is an individual entry in an SQ that represents a command for the NVMe controller to process. Each SQE contains:
CID – A unique identifier for the command
OPC – The opcode that specifies the operation to be performed
PSDT – Indicates whether physical region pages (PRPs) or scatter-gather lists (SGLs) are used for data transfer associated with the command
NSID – The identifier of the target namespace for the command
Additionally, the SQE includes fields for command-specific details, such as logical block addresses and transfer sizes.
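Each SQE is 64 bytes. A simplified sketch of the common fields per the NVMe specification (the struct and field names are illustrative; SPDK provides an equivalent definition, struct spdk_nvme_cmd):
#include <stdint.h>

struct nvme_sqe {                  /* 64 bytes total */
    uint16_t opc_flags;            /* bits 7:0: opcode (OPC); bits 15:14: PSDT */
    uint16_t cid;                  /* command identifier */
    uint32_t nsid;                 /* namespace identifier */
    uint64_t reserved;
    uint64_t mptr;                 /* metadata pointer */
    uint64_t prp1;                 /* PRP entry 1 (or SGL segment) */
    uint64_t prp2;                 /* PRP entry 2 */
    uint32_t cdw10, cdw11, cdw12;  /* command-specific dwords, e.g., LBA and length */
    uint32_t cdw13, cdw14, cdw15;
};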
Admin Queue Pair
The admin queue pair (QP) consists of an admin SQ and an admin CQ, both assigned a queue identifier (QID) of 0. There is only one admin QP for each NVMe controller, and it is created asynchronously during the initialization phase. Unlike I/O queues, this QP is dedicated solely to controller management tasks, facilitating the processing of administrative commands.
Admin Commands
The following subsections detail the admin commands currently supported by the transport.
Identify Admin Command
The SPDK_NVME_OPC_IDENTIFY (0x06) command allows the host to query information from the NVMe controller. It retrieves a data buffer that describes the attributes of the NVMe subsystem, the controller, or the namespace. This information is essential for host software to configure and utilize the storage device effectively.
Create I/O SQ Admin Command
The SPDK_NVME_OPC_CREATE_IO_SQ (0x01) command allows the host to instruct the NVMe controller to establish a new queue for submitting I/O commands.
Key parameters:
SQID – Identifies the specific I/O SQ being created
Queue depth – Specifies the number of entries the queue can hold
CQID – Identifies the associated CQ
Once the host sends the command with the necessary parameters, the controller allocates the required resources and returns a status indicating success or failure.
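The queue attributes are packed into the command's dwords as defined by the NVMe specification. A sketch of how such a command could be filled, assuming SPDK's struct spdk_nvme_cmd; qsize, sqid, cqid, and sq_host_addr are placeholder variables (queue sizes in NVMe are zero-based, so a value of N means N+1 entries):
#include "spdk/nvme_spec.h"

struct spdk_nvme_cmd cmd = {0};
cmd.opc   = SPDK_NVME_OPC_CREATE_IO_SQ;   /* 0x01 */
cmd.cdw10 = ((qsize - 1) << 16) | sqid;   /* bits 31:16: queue size; bits 15:0: SQID */
cmd.cdw11 = (cqid << 16) | 0x1;           /* bits 31:16: associated CQID; bit 0: physically contiguous */
cmd.dptr.prp.prp1 = sq_host_addr;         /* host memory backing the SQ */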
Delete I/O SQ Admin Command
The SPDK_NVME_OPC_DELETE_IO_SQ (0x00) command is used to remove an I/O SQ when it is no longer required.
Key parameters:
SQID – The identifier of the I/O SQ to be deleted
Once the host issues the command, the controller releases all resources associated with the queue and returns a status confirming the deletion. After the queue is deleted, no additional I/O commands can be submitted to it.
Create I/O CQ Admin Command
The SPDK_NVME_OPC_CREATE_IO_CQ (0x05) command is issued by the host to set up an I/O CQ in the NVMe controller.
Key parameters:
CQID – The identifier of the I/O CQ to be created
Queue depth – The number of entries the CQ can hold
PRP1/PRP2 – Pointers to the memory location where the CQ entries are stored
MSIX – The interrupt vector associated with this CQ
Once the host issues the command, the controller allocates the necessary resources, links the CQ to the specified interrupt vector, and returns a status confirming the creation.
Delete I/O CQ Admin Command
The SPDK_NVME_OPC_DELETE_IO_CQ (0x04) command is issued by the host to remove an existing I/O CQ from the NVMe controller.
Key parameters:
CQID – The identifier of the I/O CQ to be deleted
Upon receiving this command, the NVMe controller removes the specified CQ and frees all associated resources. Before a CQ can be deleted, all SQs linked to it must be either deleted or reassigned to another CQ. The controller returns a status code indicating whether the deletion was successful or if an error occurred. Once deleted, the CQ no longer processes completion entries from any linked SQ.
Get Features Admin Command
The SPDK_NVME_OPC_GET_FEATURES (0x0A) command is issued by the host to query specific features supported by the NVMe controller.
Key parameters:
Feature ID (FID) – Specifies which feature the host wants to retrieve. The controller returns information based on the requested feature. Common features include:
Arbitration (FID 0x01)
Power Management (FID 0x02)
Temperature Threshold (FID 0x04)
Error Recovery (FID 0x05)
Volatile Write Cache (FID 0x06)
Number of Queues (FID 0x07)
Interrupt Coalescing (FID 0x08)
Depending on the FID, the feature information may be returned in the CQE or written to an output buffer in host memory. If an output buffer is used, the host provides a memory region that the controller accesses via physical region page (PRP) entries or SGLs.
Select – Determines which version of the feature value to return. There are four options:
Current (0x0) – Returns the active value of the feature
Default (0x1) – Returns the default value of the feature
Saved (0x2) – Returns the saved value from non-volatile memory
Supported Capabilities (0x3) – Returns the capabilities supported by the controller for the feature
After executing the command, the controller returns a status code in the CQE indicating whether the query was successful.
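On the SQE side, the FID and Select fields arrive in CDW10. A small decoding sketch following the NVMe specification, assuming cmd is the fetched SQE (e.g., a struct spdk_nvme_cmd); the variable names are illustrative:
uint32_t cdw10 = cmd.cdw10;
uint8_t  fid   = cdw10 & 0xFF;          /* Feature Identifier, bits 7:0 */
uint8_t  sel   = (cdw10 >> 8) & 0x7;    /* Select, bits 10:8: 0 current, 1 default, 2 saved, 3 supported capabilities */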
Set Features Admin Command
The SPDK_NVME_OPC_SET_FEATURES (0x09) command is issued by the host to modify specific features on the NVMe controller.
Key parameters:
Feature ID (FID) – specifies which feature the host intends to modify. Common features include:
Arbitration (FID 0x01)
Power Management (FID 0x02)
Temperature Threshold (FID 0x04)
Error Recovery (FID 0x05)
Volatile Write Cache (FID 0x06)
Number of Queues (FID 0x07)
Interrupt Coalescing (FID 0x08)
Data location – depending on the FID, the new value can be provided directly in the SQE or stored in an input buffer in host memory that is accessible by the controller via PRP or SGL
Save – allows the host to specify whether the modification should persist across a controller reset; if set, the modified value is saved in the controller's non-volatile memory
After the command is issued and the controller modifies the feature as requested, it returns a status code in the CQE indicating whether the modification was successful or an error occurred.
Log Page Admin Command
The SPDK_NVME_OPC_GET_LOG_PAGE (0x02) command is issued by the host to retrieve various types of log pages from the NVMe controller for monitoring and diagnosing the state of an NVMe device.
Key parameters:
LID – log page identifier that specifies the type of log page to retrieve. Some common log pages include:
SMART/Health Information (LID 0x02) – provides device health metrics, temperature, available spare, etc.
Error Information (LID 0x01) – contains details about errors encountered by the controller.
Firmware Slot Information (LID 0x03) – information on the firmware slots and active firmware.
Telemetry Host-Initiated (LID 0x07) – contains telemetry data about device performance.
NUMD – the number of DWORDs of log data to return, allowing for partial or full-page retrieval.
Log page data location – the retrieved log data is written into the output buffer provided by the host. The buffer is accessible by the controller via PRP or an SGL.
When the host issues the SPDK_NVME_OPC_GET_LOG_PAGE command, the controller retrieves the requested log information and writes it to the host-provided memory. The controller then returns a status code in the CQE, indicating the success or failure of the operation.
I/O Queue Pair
An I/O QP consists of one SQ and its corresponding CQ, which are used to perform data transfers (I/O operations). Multiple I/O QPs can be created to enable parallel I/O operations, allowing each QP to function independently and maximizing the use of multi-core processors.
I/O SQ – the queue where the host places read, write, and flush commands
I/O CQ – the queue where the controller posts completion entries after processing the commands
Supported NVMe Commands
The following subsections detail the NVMe commands currently supported by the transport.
NVMe Flush Command
The SPDK_NVME_OPC_FLUSH (0x00) command is issued by the host to ensure that any data residing in volatile memory is securely written to permanent storage. If no volatile write cache is present or enabled, the Flush command completes successfully without any effect. Once the flush operation is finished, the controller updates the associated I/O CQ with a completion entry.
NVMe Read Command
The SPDK_NVME_OPC_READ (0x02) command is one of the core I/O operations in NVMe. This command is issued by the host to retrieve data from a specified namespace and transfer it to the host memory.
Key parameters:
NSID – the identifier of the namespace from which the data is being read.
LBA (logical block address) – the starting address of the data to be read within the namespace.
NLB (number of LBAs) – specifies the size of the read operation in terms of the number of logical blocks to be read.
Destination buffer – The controller retrieves the read data from the namespace and sends it to the host memory, where the destination buffer is specified using PRP or SGL.
Upon completion, the controller posts a completion entry to the I/O CQ that includes a status code indicating success or failure.
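The starting LBA and transfer length are carried in the command dwords. A decoding sketch per the NVMe specification, assuming cmd is the fetched SQE; the variable names are illustrative:
uint64_t slba = ((uint64_t)cmd.cdw11 << 32) | cmd.cdw10;   /* starting LBA (CDW11:CDW10) */
uint32_t nlb  = (cmd.cdw12 & 0xFFFF) + 1;                  /* number of logical blocks (zero-based field) */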
NVMe Write Command
The SPDK_NVME_OPC_WRITE (0x01) command is one of the core I/O operations in NVMe. This command is issued by the host to write data to a specific namespace at a given LBA.
Key parameters:
NSID – identifier of the namespace where the data is being written
LBA – destination address within the namespace where data is written
NLB – specifies the size of the write operation in terms of the number of logical blocks to be written
Source buffer – the data to be written is located in host memory, and the controller reads this data from the source buffer, which is provided using PRP or SGL.
Upon completion, the controller posts a completion entry to the I/O CQ, which includes a status code indicating the success or failure of the write operation.
Brief Introduction to SPDK
The Storage Performance Development Kit (SPDK) is an open-source framework that provides tools and libraries for building high-performance, scalable storage solutions, particularly for NVMe devices. SPDK enables applications to implement transport protocols for NVMe over Fabrics (NVMe-oF) by bypassing the kernel and using user-space drivers, allowing direct interaction with storage hardware. This approach significantly reduces latency and overhead, making it ideal for demanding storage environments.
A key component of SPDK is its highly optimized NVMe driver, which operates entirely in user space. By allowing direct communication with NVMe devices without involving the kernel, this driver minimizes I/O latency and enhances performance, supporting both local NVMe storage and remote NVMe devices over NVMe-oF.
SPDK Threading Model
SPDK's threading model is designed for high concurrency, scalability, and low-latency I/O processing. It operates on a cooperative multitasking model where pollers are registered on SPDK threads and tasks are executed entirely in user space without kernel involvement. Each SPDK thread runs on a dedicated CPU core, ensuring minimal context switching and allowing tight control over workloads in a non-preemptive environment.
Reactor Threads
At the core of this model are reactor threads, which serve as SPDK's main execution threads responsible for handling I/O processing and application logic. Each reactor thread is bound to a specific core and operates in polling mode, continuously checking for tasks and I/O requests instead of relying on interrupts.
Example:
struct spdk_thread *thread = spdk_get_thread();
This function retrieves the current SPDK thread, which is mapped to a specific core.
SPDK allows the registration of pollers to reactor threads. These pollers are periodic functions designed to complete asynchronous I/O operations, manage RPC servers, or perform custom operations. Once a poller completes an operation, it can trigger a user-defined callback to finalize the task.
Example of registering a poller:
struct spdk_poller *my_poller = spdk_poller_register(my_poll_function, arg, poll_interval_us);
Here, my_poll_function is invoked repeatedly by the reactor thread.
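A poller is an ordinary C function that the reactor invokes on every iteration of its loop. A minimal sketch, assuming a recent SPDK release where pollers return SPDK_POLLER_BUSY or SPDK_POLLER_IDLE to report whether they did any work:
#include "spdk/thread.h"

static uint64_t g_events_handled;

/* Invoked repeatedly by the reactor that owns the registering SPDK thread. */
static int
my_poll_function(void *arg)
{
    (void)arg;
    /* A real poller would drain a completion ring or drive a progress engine here. */
    g_events_handled++;
    return SPDK_POLLER_BUSY;   /* return SPDK_POLLER_IDLE when nothing was processed */
}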
Poll Groups
Poll groups consist of multiple SPDK threads that work together to manage I/O across multiple devices or connections. A poll group enables a set of threads to coordinate and handle shared workloads, ensuring efficient distribution of tasks across available cores with minimal latency.
Thread Synchronization
In SPDK’s cooperative threading model, thread synchronization is designed to be efficient and minimal, as threads do not experience preemptive context switching like traditional kernel threads. This design allows for fine-grained control over when tasks are executed. However, coordination between threads becomes necessary in certain situations, such as when handling shared resources or passing tasks between cores.
Instead of relying on traditional locking mechanisms, which can introduce performance bottlenecks due to contention, SPDK employs message passing as the primary method for thread communication. This involves sending events or tasks between threads through a lockless event ring, enabling coordination without the overhead associated with locks.
Example of sending a message between threads:
void
send_message(struct spdk_thread *target_thread, spdk_msg_fn fn, void *arg)
{
	spdk_thread_send_msg(target_thread, fn, arg);
}
Where:
target_thread – the SPDK thread to send a message to
fn – the function to be executed in the context of the target thread
arg – the argument passed to the function
Block Device
SPDK provides a flexible system for working with different types of storage devices, such as NVMe SSDs, virtual block devices, asynchronous input/output (AIO) devices, and RAM disks. It uses high-performance, user-space APIs to enable applications to bypass the OS's kernel, reducing delays and improving performance.
SPDK's block device (bdev) layer offers applications a unified way to perform read and write operations on these devices. It supports popular devices out of the box and allows users to create custom block devices. Advanced features like Redundant Array of Independent Disks (RAID) and acceleration with technologies like Data Plane Development Kit (DPDK) are also supported.
Block devices can be easily created or destroyed using SPDK's remote procedure call (RPC) server. In NVMe-oF environments, a block device represents a namespace, which helps manage storage across different systems. This makes SPDK ideal for building fast, scalable storage solutions.
NVMe-oF Target
The NVMe-oF target is a user-space application within the SPDK framework that exposes block devices over network fabrics such as Ethernet, InfiniBand, or Fibre Channel (FC). It typically uses transport protocols like TCP, RDMA, or virtual function I/O (vfio-user) to enable clients to access remote storage.
To use the NVMe-oF target, it must be configured to work with one of these transport protocols:
TCP
RDMA (over InfiniBand or RoCE)
vfio-user (mainly for virtual machines)
FC-NVMe (less common in SPDK environments)
NVMe-oF Transport
Each NVMe-oF transport plays a crucial role in facilitating communication between the NVMe-oF initiator (the client) and the target. The transport handles how NVMe commands are transmitted over the network fabric. For example:
TCP – uses IP-based addressing over Ethernet
RDMA – leverages the low-latency, high-throughput characteristics of InfiniBand or RoCE
vfio-user – provides virtualization support, allowing virtual machines to access NVMe devices
FC-NVMe – uses Fibre Channel, often found in enterprise SAN environments
Additionally, SPDK is designed to be flexible, allowing developers to create custom transports and extend the functionality of the NVMe-oF target beyond the standard transports provided by SPDK.
The NVMe-oF transport is responsible for:
Establishing the connection between the initiator and the target
Translating network-layer commands into SPDK NVMe commands
Managing data transfer across the network fabric
Once a connection is established through the transport, the NVMe-oF target processes the NVMe commands sent by the initiator.
Application Layer
Above the transport layer is the SPDK application layer, which is transport-agnostic. This means that regardless of the transport being used (TCP, RDMA, etc.), the application layer handles NVMe commands uniformly. It is responsible for:
Managing subsystems, namespaces, and controllers
Processing NVMe commands received over the network
Mapping these commands to the appropriate storage devices (e.g., NVMe SSDs or virtual devices like SPDK's malloc or null devices)
This uniform application layer ensures that the transport layer interacts with the same logic for processing and responding to NVMe commands, regardless of the underlying network fabric.
NVMe Driver in SPDK
In the context of NVMe storage, the initiator is the host system that needs to access an NVMe storage device to send or retrieve data. The driver is responsible for generating and sending the appropriate commands, such as read and write operations. The driver uses the transport to communicate with the NVMe target. The transport sends those commands either over a network in the case of a remote target or directly to a local NVMe device through the PCIe bus. The target receives these requests from the initiator, processes them, and responds with the data or completion status.
RPC Server
SPDK's Remote Procedure Call (RPC) server provides a flexible interface for clients to interact with various SPDK services, such as NVMe-oF, block devices (bdevs), and other storage subsystems. The server is based on JSON-RPC and runs within the SPDK application, allowing external clients to dynamically configure, control, and manage SPDK components without the need for application restarts. Through RPC requests, users can create, delete, or query subsystems, configure network storage layers, and manage NVMe-oF targets. The main functionality of the RPC server is to process incoming RPC commands, which are usually in JSON format, execute the functions based on the RPC requests, and return the results to the client.
The RPC server runs within the SPDK application.
RPC Client
The RPC client in SPDK interacts with the RPC server to issue commands for configuring and managing various SPDK subsystems. It sends a command, typically in JSON format, asking the server to perform a certain task or retrieve data. The main functionality of the RPC client is to construct the request message that it wants the server to process, send the request to the RPC server, and receive the response from the server.
The RPC client runs on the user or application side.
Transport RPCs
SPDK provides several transport RPCs used to configure and manage the transport layer for NVMe-oF. The following are the primary transport RPCs:
nvmf_create_transport
nvmf_get_transports
nvmf_subsystem_add_listener
nvmf_get_subsystems
nvmf_delete_listener
nvmf_get_stats
nvmf_delete_transport
Block Device RPCs
SPDK provides several block device RPCs used to configure and manage block devices. The following are the primary block device RPCs:
bdev_malloc_create
bdev_malloc_delete
bdev_null_create
bdev_null_delete
bdev_nvme_attach_controller
bdev_nvme_detach_controller
bdev_get_bdevs
Namespace RPCs
Namespace RPCs are used to manage NVMe namespaces within NVMe-oF subsystems or NVMe controllers. Here is a list of namespace RPCs available:
nvmf_subsystem_add_ns
nvmf_subsystem_remove_ns
nvmf_subsystem_get_ns
nvmf_subsystem_get_ns_stats
nvmf_subsystem_get_namespaces
Solution Overview
Integration with SPDK
Using the DOCA Generic PCIe Emulation SDK, the BlueField DPU can emulate NVMe devices through PCIe endpoints. This allows the DPU to appear as a physical NVMe device to the host system, enabling the host to send NVMe commands. While the DPU hardware can handle data movement and basic I/O tasks, managing the complete NVMe protocol— including admin commands, queue management, and register operations—requires additional software support. This is where SPDK comes in.
DOCA Generic Device Emulation as NVMe-oF Transport
While NVMe-oF is designed for remote transports like TCP or RDMA, SPDK enables us to treat PCIe as another transport option by adding a memory-based transport. This allows the DPU to function as if it is communicating with a remote NVMe-oF target, even though it is local to the host system.
To implement this, NVIDIA uses a DOCA transport, which acts as a custom transport layer that serves as a connection tunnel and provides NVMe-oF with generic emulation capabilities. By leveraging SPDK’s RPCs for NVMe-oF, the DPU can effectively emulate an NVMe device. The DOCA transport ensures efficient routing and processing of NVMe commands, while SPDK handles the software-based emulation.
The application utilizes the NVMe-oF application transport layer to implement an NVMe emulation solution, inspired by an SPDK blog post.
Emulated Function as NVMe Controller
In the DOCA transport, the NVMe controller is mapped to a PCIe DOCA device known as the emulation manager. In this context, the emulation manager serves as the hardware interface that provides access to the NVMe controller, exposing its capabilities through specific PCIe registers and memory-mapped regions.
When connecting a device to a controller, the transport is responsible for providing the controller's unique ID through the Connect command, as specified by the NVMe-oF protocol.
To make the core NVMe-oF target logic work with our DOCA transport, we need to implement specific operations in the spdk_nvmf_transport_ops structure. These operations handle tasks like managing connections, transferring data, and processing NVMe commands for DOCA. This structure provides a standard way to connect different transports to SPDK, allowing the core NVMe-oF logic to work with any transport without needing to know its specific details.
const struct spdk_nvmf_transport_ops spdk_nvmf_transport_doca = {
	.name = "DOCA",
	.type = SPDK_NVME_TRANSPORT_CUSTOM,
	.opts_init = nvmf_doca_opts_init,
	.create = nvmf_doca_create,
	.dump_opts = nvmf_doca_dump_opts,
	.destroy = nvmf_doca_destroy,
	.listen = nvmf_doca_listen,
	.stop_listen = nvmf_doca_stop_listen,
	.listen_associate = nvmf_doca_listen_associate,
	.poll_group_create = nvmf_doca_poll_group_create,
	.get_optimal_poll_group = nvmf_doca_get_optimal_poll_group,
	.poll_group_destroy = nvmf_doca_poll_group_destroy,
	.poll_group_add = nvmf_doca_poll_group_add,
	.poll_group_remove = nvmf_doca_poll_group_remove,
	.poll_group_poll = nvmf_doca_poll_group_poll,
	.req_free = nvmf_doca_req_free,
	.req_complete = nvmf_doca_req_complete,
	.qpair_fini = nvmf_doca_close_qpair,
	.qpair_get_listen_trid = nvmf_doca_qpair_get_listen_trid,
};
New SPDK RPCs
Since the DOCA transport requires specific configurations that are not covered by the existing SPDK RPCs and differ from other transports (e.g., managing emulation managers), custom RPCs must be implemented to expose these options to users:
| RPC | Description | Details | Arguments | Output | Example |
| --- | --- | --- | --- | --- | --- |
|  | Provides the ability to list all emulation managers, which are equivalent to DOCA devices | Returns the names of all available local DOCA devices with management capabilities | None | If successful, the RPC returns a list of device names for the emulation managers. If it fails, it returns an error code. |  |
|  | Provides the ability to create an emulated function under a specified device name | Creates a new representor device, retrieves its VUID, and then closes the device |  | If successful, the RPC returns the VUID of the newly created function. In case of failure, it returns an error code. |  |
|  | Provides the ability to destroy an emulated function | Destroys a DOCA device representor |  | On success, the RPC returns nothing. On failure, it returns an error code. |  |
|  | Lists all the emulated functions under the specified device name | Lists all the available representor devices |  | If successful, the RPC returns a list containing the VUID and PCIe address of all the emulated functions under the specified device name. In case of failure, it returns an error code. |  |
/usr/bin/spdk_rpc.py is the Python script that sends RPC commands to SPDK. The spdk_rpc.py script is responsible for handling SPDK commands via the JSON-RPC interface.
Extended RPCs
Some existing SPDK RPCs must be modified because certain configurations or capabilities of the DOCA transport are not supported by the default RPCs:
| RPC | Description | Details | Arguments | Output | Example |
| --- | --- | --- | --- | --- | --- |
| nvmf_create_transport | Creates a new NVMe-oF transport by defining the transport type and configuration parameters, allowing the SPDK target to communicate with hosts using the specified transport | Creates the DOCA transport and its resources |  | None |  |
| nvmf_subsystem_add_listener | Adds a listener to an NVMe-oF subsystem, enabling it to accept connections over a specified transport | Hot plugs the device, allowing the host to interact with it as an NVMe device |  | None |  |
/usr/bin/spdk_rpc.py is the Python script that sends RPC commands to SPDK. The spdk_rpc.py script is responsible for handling SPDK commands via the JSON-RPC interface.
Data Structures
To implement the APIs of the spdk_nvmf_transport_ops structure mentioned above, we created transport-specific data structures that efficiently interact with these APIs. These structures are designed to manage the transport's state, connections, and operations.
The architecture is structured as follows: The upper layer represents the host, followed by the DPU, and further down is the DPA, which is part of the DPU. Within the DPU, the NVMe-oF application runs, divided into two parts:
The NVMe-oF library – which is used as a black box
The DOCA transport – Implemented by NVIDIA
In the DOCA transport, there are several poll groups, each representing a thread. In addition to the poll group list, the transport maintains an emulation managers list, consisting of all devices managed by this transport. There is also a special poll group instance dedicated to polling PCIe events.
Diving deeper into the poll group, there are two progress engines: one that polls the I/O queues and another that handles the admin queues. Additionally, there is a list of PCIe device poll groups, each of which manages the CQs, SQs, and host memory mappings for a specific device.
The following sections provide more information on each of the main structures.
Emulation Manager
During initialization, the DOCA transport scans for all available DOCA devices that can serve as emulation managers. For each of these devices, it creates a PCIe type, initializes a DPA instance, assigns a DPA application, and starts the DPA process. It also opens the DOCA device (emulation manager). All of these devices are stored within the emulation manager context, where they are tracked for the duration of the transport's activity.
PCIe Device Admin
This is the representor of the emulated device and it contains the following:
DOCA transport – pointer to the DOCA transport this device belongs to
Emulation manager – the emulation manager associated with this device
PCIe device – the emulated PCIe device
SPDK subsystem – pointer to the SPDK subsystem this device belongs to
Device representor – The PCIe device representor
Transport ID – the transport ID this device belongs to
DOCA listener state – used to track initialization and reset flows
Controller ID – the identifier of the SPDK NVMe-oF controller
Stateful region values – the stateful region values, updated after each query
SPDK NVMe-oF controller – pointer to the SPDK NVMe-oF controller associated with this device
DOCA admin QP – admin QP context
Admin QP poll group – manages the admin QP; it is selected using the round-robin method
FLR flag – indicates whether a function-level reset (FLR) event has occurred
Destroy flow flag – indicates whether the PCIe device should be destroyed
Once the user issues the add listener RPC, this context is established to facilitate the hot-plugging of the emulated device to the host, allowing it to start listening for interactions from the host.
DOCA Poll Group
Each poll group in SPDK is associated with a thread running on a specific CPU core. For each core, a reactor thread is responsible for executing pollers, and the poll group is one of those pollers. When an NVMe-oF transport is created, it is assigned to each poll group so that the transport can handle I/O and connection management for the devices across multiple CPU cores, ensuring that each transport has a representative within each poll group.
DOCA poll group fields:
SPDK NVMe-oF transport poll group (struct spdk_nvmf_transport_poll_group) – refers to a structure that is part of the NVMe-oF subsystem and is responsible for handling transport-specific I/O operations at the poll group level
Progress engine (struct doca_pe) – when each poll group runs its set of pollers, it also invokes the DOCA progress engine to manage transport operations. The doca_pe_progress function is called on the DOCA side to drive the progress engine within each poll group, integrating DOCA's PE into SPDK's poll group mechanism.
Admin QP progress engine (struct doca_pe) – another progress engine dedicated to handling admin queues, while the previous one manages the I/O queues. This separation allows for greater control over the polling rate of each queue type, optimizing performance.
Admin QP poll rate and admin QP rate limiter – these determine how often the system checks the admin queue for new commands
List of PCIe device poll groups – the list of devices whose queues this poll group polls
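Based on the fields above, the poll group can be pictured roughly as follows. This is an illustrative sketch with assumed field names, not the application's actual definition:
#include "spdk/nvmf_transport.h"
#include <doca_pe.h>
#include <sys/queue.h>

/* Illustrative sketch of the DOCA transport poll group; names are assumptions. */
struct nvmf_doca_poll_group {
	struct spdk_nvmf_transport_poll_group group;   /* base SPDK transport poll group */
	struct doca_pe *pe;                            /* progress engine for the I/O queues */
	struct doca_pe *admin_qp_pe;                   /* progress engine for the admin queues */
	uint64_t admin_qp_poll_rate;                   /* how often the admin queues are polled */
	uint64_t admin_qp_rate_limiter;                /* counts down to the next admin poll */
	TAILQ_HEAD(, nvmf_doca_pci_dev_poll_group) pci_dev_poll_groups; /* devices polled by this group */
};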
Admin Poll Group
This object is a per-transport entity that functions as a dedicated unit for polling PCIe events and managing admin queue activities.
Progress engine – the DOCA progress engine used by the poller
SPDK Poller – this poller operates on the SPDK application thread, continuously monitoring for PCIe events such as FLR, stateful region changes, and hot-plug events
SPDK thread – the application thread associated with the currently executing SPDK thread
PCIe admins – a list of all the PCIe device admin contexts
DOCA Transport
This structure holds the overall state and configuration of the transport and it includes:
SPDK NVMe-oF transport – defines the transport layer within the SPDK NVMe-oF framework. It holds essential data for managing the transport, including configuration parameters, operational states, and connections.
Emulation managers – contains all the devices managed by this transport
Poll groups – contains all the poll groups actively polling for this transport
Last selected poll group – used to assist with round-robin selection of poll groups, ensuring an even distribution of workload across poll groups during transport operations
Admin poll group – described previously
Number of listeners – the number of devices within this transport
PCIe Device Poll Group
The following diagram illustrates the relationship between PCIe devices and the poll groups within a transport. Each PCIe device contains I/O and admin queues, which are distributed across the poll groups.
This relationship is managed by a structure called the "PCIe device poll group" (struct nvmf_doca_pci_dev_poll_group), which holds the device's memory map (mmap), the PCIe device's admin details, admin QPs if applicable, a list of I/O queues, and the poll group responsible for polling those queues.
When creating a new I/O or admin queue for a specific device and poll group, the existence of a PCIe device poll group structure that links the two is first checked. If it does not exist, a new struct nvmf_doca_pci_dev_poll_group is created to combine them.
Admin QP
This structure manages the queues for a specific device. It holds an admin CQ and an admin SQ to handle admin command operations. Additionally, it includes lists for I/O CQs and I/O SQs, which manage data-related operations. The structure also contains a flag, stopping_all_io_cqs, to indicate whether all CQs should be stopped, facilitating a graceful halt of the device's queue processing when needed.
DOCA IO
The I/O structure is responsible for managing I/O operations, including receiving doorbells on the CQ and its associated SQs. It handles reading SQEs from the host, writing SQEs back to the host, and raising MSI-X interrupts. This structure contains a single CQ, along with an MSI-X vector index raised by the DPA, and a doorbell completion that is polled by the DPA thread.
It also holds several callbacks:
Post CQE – invoked once a CQE is posted to the host CQ, freeing resources afterward
Fetch CQE – invoked when an SQE is fetched from the host, parsing and executing the request
Copy data – invoked after data is copied to the host, completing the request and freeing resources
Stop SQ – invoked when an admin or I/O SQ is stopped to fully release resources
Stop IO – invoked when an admin or I/O CQ is stopped to complete resource cleanup
The function nvmf_doca_io_create synchronously creates an emulated I/O.
DOCA CQ
The CQ's main task is to post CQEs to the host. The main fields are:
CQ ID – if this is an admin CQ, the CQ ID is 0
DOCA queue – A shared structure that serves as both the CQ and SQ for admin and I/O operations, mirroring the host's queues with the same size. It is responsible for fetching SQEs from the host and posting CQEs back to the host. The NVMe driver issues commands through its SQs and receives their completions via the CQs.
To facilitate efficient processing by the DPU, the DOCA queue leverages DMA to handle data transfers between the host and DPU in both directions. Each queue is equipped with a pointer to a DMA structure that contains the necessary resources for these operations. During initialization, DMA resources and local buffers are allocated based on the queue's size. The DOCA queue also maintains an array of tasks, with each task at index idx corresponding to and synchronized with the task at the same index in the host's queue. When DMA operations are required, these resources are utilized for data transfer.
The following outlines its main fields:
Buffer inventory – to allocate the queue elements
DMA – DMA context handles data transfer to and from the host
Local queue mmap – represents the local memory where elements are stored for copying
Local queue address – local address for copying the elements to or from the host
Elements (DMA tasks) – the elements themselves are DMA tasks for copy/write
Number of elements – the maximum number of elements the queue can hold
DOCA Comch
A full-duplex communication channel is used to facilitate message passing between the DPA and DPU in both directions. This channel is contained within the DOCA IO and consists of two key components:
A send message queue (nvmf_doca_dpa_msgq) with a producer completion context (doca_dpa_completion)
A receive message queue (nvmf_doca_dpa_msgq) with a consumer completion context (doca_comch_consumer_completion)
DOCA DPA Thread
In addition to the communication channel, the I/O structure also includes a DPA thread context, with each thread dedicated to a single CQ. It consists of:
A pointer to the DPA
A pointer to the DPA thread
The necessary arguments for the DPA thread include:
Consumer Completion – continuously polled by the DPA thread to detect new messages from the host
Producer Completion – monitored to verify if messages have been successfully sent
DB Completion Context – provides the doorbell values and connects to both CQ and SQ doorbells
MSIX – allows the device to send interrupts to the host
The DPA thread performs three main operations:
nvmf_doca_dpa_thread_create – creates a DPA thread by providing the DOCA DPA, the DPA handle, and the size of the arguments to be passed to the DPA. It also allocates memory on the device.
nvmf_doca_dpa_thread_run – copies the arguments to the DPA and runs the thread
nvmf_doca_dpa_thread_destroy – deletes the thread and frees the allocated arguments
DOCA SQ
The main fields that construct the DOCA SQ (nvmf_doca_sq) include:
DOCA IO – a reference to the DOCA I/O to which this SQ belongs, and where its completion is posted. Multiple SQs can belong to a single I/O.
DOCA queue – the DOCA queue, previously described, which handles copying SQEs from the host
DMA pool – a pool of DMA data copy operations (nvmf_doca_dma_pool, defined shortly)
DB – the doorbell associated with this SQ
DB handle – the DPA handle of the doorbell
SQ ID – the SQ identifier (SQID)
State – the state of the SQ, used for monitoring purposes
Request pool – NVMe-oF request pool memory (defined shortly)
Request pool memory – a list of NVMe-oF DOCA empty requests, used whenever a new SQE is received and a request needs to be prepared
NVMe-oF QP – the QP created for this SQ by the NVMe-oF target, which is used to execute commands
The creation of the SQ is asynchronous because, when the CQ is created, the DPA is also initialized. However, since the DPA is already running when creating the SQ, it is necessary to update the DPA about the newly added SQ. Directly modifying its state could lead to synchronization issues, which is why a communication channel is used, making the process asynchronous.
DOCA DMA Pool
An SQ includes the nvmf_doca_dma_pool structure, which manages data transfer operations between the host and the DPU in both directions. It consists of the following elements:
Local data memory – memory allocated for local data buffers
Local data mmap – memory-mapped region for the local data buffers
Local data pool – a pool of local data buffers
Host data mmap – memory mapping that provides access to host data buffers
Host data inventory – an inventory for allocating host data buffers
DMA – a DMA context used for transferring data between the host and the DPU
This structure is initialized whenever the SQ is created. The size of the local data memory is determined by multiplying the maximum number of DMA copy operations by the maximum size in bytes for each DMA copy operation. All these local buffers are allocated during the creation of the SQ.
The key operations performed on the DMA are:
nvmf_doca_sq_get_dpu_buffer – retrieves a buffer in DPU memory, enabling data transfers between the host and the DPU
nvmf_doca_sq_get_host_buffer – retrieves a buffer pointing to host memory, also used for data transfers between the host and the DPU
nvmf_doca_sq_copy_data – copies data between the host and the DPU. This operation is asynchronous; upon completion, it invokes the nvmf_doca_io::copy_data_cb callback function.
DOCA NVMe-oF Request
The NVMe-oF target utilizes requests to handle incoming commands. When a new command is received by any transport (not limited to the DOCA transport), it creates an instance of struct spdk_nvmf_request. This structure contains various elements, including the NVMe command, the SQ to which the command belongs, the QP, the IO vector (IOV), and other relevant data. In our design, we introduce a new wrapper structure called nvmf_doca_request, which encapsulates the NVMe-oF request structure along with additional fields specific to the DOCA transport.
The main fields included in this structure:
SPDK NVMe-oF request – an instance of struct spdk_nvmf_request which represents the command being processed
SQ – the SQ to which this request belongs
SPDK cpl – used to track and indicate the completion status of NVMe commands
SPDK NVMe command – holds essential information about the operation to be performed (e.g., command type, data pointers, parameters)
Host and DPU DOCA buffers – pointers to data buffers located at the host or DPU, containing data associated with this command
SQE index – index of the submission queue element that holds this request
DOCA request callback – the function invoked upon request completion, receiving the appropriate DOCA request callback arguments
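Putting these fields together, the wrapper can be pictured roughly as below. This is an illustrative sketch with assumed field names and types, not the application's actual definition:
#include "spdk/nvmf_transport.h"
#include "spdk/nvme_spec.h"
#include <doca_buf.h>

/* Illustrative sketch of the DOCA request wrapper; names are assumptions. */
struct nvmf_doca_request {
	struct spdk_nvmf_request request;   /* embedded SPDK NVMe-oF request */
	struct nvmf_doca_sq *sq;            /* the DOCA SQ this request belongs to */
	struct spdk_nvme_cpl cpl;           /* completion status reported back to the host */
	struct spdk_nvme_cmd cmd;           /* the NVMe command being processed */
	struct doca_buf *host_buf;          /* host-side data buffer, if any */
	struct doca_buf *dpu_buf;           /* DPU-side data buffer, if any */
	uint16_t sqe_idx;                   /* index of the SQE that produced this request */
	void (*doca_cb)(struct nvmf_doca_request *req, void *cb_arg); /* completion callback */
	void *cb_arg;                       /* argument passed to the completion callback */
};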
The key operations performed on requests include:
nvmf_doca_request_pool_create – when the SQ is created, a request pool is allocated with a size matching the SQ depth
nvmf_doca_request_pool_destroy – destroys the request pool when the SQ is removed
nvmf_doca_request_get – retrieves an NVMe-oF request object from the pool associated with a specific SQ; it is called after fetching an SQE from the host
nvmf_doca_request_complete – completes an NVMe-oF request by invoking its callback and then releasing the request back to the pool
Control Path Flows
DOCA Transport Listen
Hotplug and Hotunplug
Start with the add_listener RPC trigger – the flow begins when the add_listener RPC is called. SPDK then pauses all poll groups.
Look up the representor by its VUID – the transport searches for a representor using a given VUID.
Create emulated PCIe device – once the representor is found, the transport creates an emulated PCIe device and assigns it to a poll group.
Initialize memory mapping (mmap) – for each poll group, the transport sets up a memory-mapped area representing the host memory.
At this stage, the NVMe driver detects the newly hot-plugged device.
Controller Register Events
Initialization
Controller initialization – The process begins by configuring the controller registers:
The NVMe driver writes to the ASQ and ACQ registers to set up the admin SQ and admin CQ.
The driver configures the controller by writing to the controller configuration (CC) register, setting parameters like memory page size, arbitration, and timeout values.
The driver sets the CC.EN bit in the CC register to 1, transitioning the controller from Disabled to Enabled.
The NVMe driver waits for CSTS.RDY (the Ready bit in the Controller Status register) to become 1. This indicates that the controller has successfully completed its initialization and is ready to process commands.
Stateful region callback trigger – At this point, the PCIe-emulated device triggers a callback to the stateful region. This callback detects the host's initialization process by checking for changes in the enable bit. The callback may occur multiple times but proceeds only if the enable bit has been altered.
CQ and DPU setup – The callback proceeds to create the CQ resources and sets up the DPA thread on the DPU. The DPA thread is equipped with two message queues: one for sending and one for receiving.
Binding CQ doorbell to DB completion – An RPC is sent to bind the CQ doorbell to the DB completion context. This is done while the DPA is not yet active to prevent synchronization issues.
SQ resource creation – The SQ resources are created, including the SQE pools and the local buffer size needed for copying SQEs from the host to the DPU. DOCA DMA is used for the data transfer operations.
SQ bind DB message – The SQ sends a "bind DB" message to the DPA.
DPA receives bind DB message – the DPA processes the "bind DB" message and sends the SQ's doorbell information to the DB completion context.
SQ Sends Bind Done Message – the SQ sends a "bind done" message to the DPU.
Start SQ DB – the DPU receives the "bind done" message and starts the SQ DB.
NVMe-oF QP and request pool creation – an NVMe-oF QP and an NVMe-oF request pool are created.
Asynchronous QP creation – the NVMe-oF library starts an asynchronous operation to create the QP.
NVMe-oF library calls –
The library creates the QP and calls the transport to get the optimal poll group.
The library calls poll_group_add on the selected thread.
NVMe-oF connect request – the transport sends an NVMe-oF connect request.
Set property request – after the connect request is complete, a set_property request is sent to update the controller configuration as provided by the host during initialization.
Callback triggered – once the set_property request is finished, the NVMe-oF library triggers callbacks.
Update stateful region – the transport updates the stateful region, setting the CSTS.RDY bit to 1.
Host polling unblocked – with CSTS.RDY set to 1, host polling is now unblocked, completing the initialization process.
Reset and Shutdown Flow
The reset flow in NVMe using SPDK is crucial for maintaining the integrity and stability of the storage subsystem, allowing the system to recover gracefully afterward.
The reset process can be initiated by:
The host – the host can initiate a shutdown or reset of the NVMe controller by configuring specific registers in the controller's register space. In this case, the handle_controller_register_events() function is triggered.
Configuring shutdown notification (SHN) – the host can write to the CC (Controller Configuration) register, specifically the SHN field, to specify how the controller should handle a shutdown:
Normal shutdown (SPDK_NVME_SHN_NORMAL) – allows for a graceful shutdown where the controller can complete outstanding commands
Abrupt shutdown (SPDK_NVME_SHN_ABRUPT) – forces an immediate shutdown without completing outstanding commands
Resetting the NVMe controller – the host resets the NVMe controller by setting the CC.enable bit to zero. The handle_controller_register_events() function is also triggered in this case.
The host can also initiate a reset of the NVMe controller by using FLR. In this case, flr_event_handler_cb() is triggered.
The DOCA transport – the DOCA transport can initiate the reset flow through nvmf_doca_on_initialization_error() when it detects an internal error condition during initialization
Once a shutdown or reset is requested, the transport proceeds to destroy all resources associated with the controller. If it is a shutdown request, the shutdown status is updated accordingly. When the host performs a FLR or a controller reset, the transport must take several actions: it destroys all SQs and CQs across all poll groups for the specified device, destroys the admin SQ and CQ via the admin thread, stops the PCIe device, and then restarts it.
Regardless of the reason for the reset, the flow begins from nvmf_doca_pci_dev_admin_reset(). This function marks the start of the asynchronous process for resetting the PCIe device's NVMe-oF context. The flow consists of callbacks that are triggered in sequence to track the completion of each process and proceed to the next step. The flow proceeds as follows:
If the admin QP exists, the process first checks for any I/O SQs
If I/O SQs are found, an asynchronous flow begins to stop all I/O SQs
For each I/O SQ associated with the admin queue, the corresponding poll group responsible for destroying its specific SQ is retrieved, as no other poll group can perform this action
Once all I/O SQs are stopped, if any I/O CQs remain, a message is sent to each poll group instructing them to delete their I/O CQs
After all I/O queues are destroyed, the flow proceeds to destroy the admin CQ and SQ
Once this flow is complete, it moves to nvmf_doca_pci_dev_admin_reset_continue() to finalize the reset:
If a reset is issued by configuring the NVMe controller registers, CSTS.BITS.SHST is set to SPDK_NVME_SHST_COMPLETE and CSTS.BITS.RDY is set to 0
If the reset is triggered by an FLR, the PCIe device context is stopped using doca_ctx_stop(doca_devemu_pci_dev_as_ctx())
I/O QP Create/Destroy
The process of creating and destroying I/O QPs begins with the initiator (host) sending an NVMe-oF connect command to the NVMe-oF target after the transport has completed its initialization. The NVMe-oF target receives an admin command and begins processing it through nvmf_doca_on_fetch_sqe_complete().
Based on the command opcode, the following steps are executed:
Creating an I/O CQ – SPDK_NVME_OPC_CREATE_IO_CQ – handle_create_io_cq():
The system selects a poll group responsible for creating the CQ.
The target searches for the nvmf_doca_pci_dev_poll_group entity within the selected poll group. If it is not found, this indicates that it is the first queue associated with the specific device managed by this poll group, requiring the creation of a new entity.
The target allocates a DOCA IO (nvmf_doca_io) using the attributes and data provided in the command, such as CQ size, CQ ID, CQ address, and CQ MSI-X.
The DMA context, the message queues, the consumer and producer handles, and the DPA thread are created.
Once the asynchronous allocation and setup are complete, the target posts a CQE to indicate that the operation has succeeded. This is done through the static function nvmf_doca_poll_group_create_io_cq_done().
Creating an I/O SQ – SPDK_NVME_OPC_CREATE_IO_SQ – handle_create_io_sq():
Based on the CQ ID argument provided in the command, the system first identifies the I/O entity to which the new SQ should be added. From this entity, it retrieves the associated nvmf_doca_pci_dev_poll_group.
A DOCA SQ (nvmf_doca_sq) is allocated using the attributes and data specified in the command, including the SQ size, SQ ID, and SQ address. This allocation is handled within the nvmf_doca_poll_group_create_io_sq() function.
The DMA context and the DB completions are also created.
Once the asynchronous process of creating the SQ is complete, a CQE is posted via the nvmf_doca_poll_group_create_io_sq_done() function.
The target begins the QP allocation, which creates a new QP comprising an SQ and a CQ.
The target determines the optimal poll group for the new QP by calling get_optimal_poll_group(), ensuring that both the CQ and the SQ attached to it run on the same poll group.
The target adds the newly created QP to it using nvmf_doca_poll_group_add(), enabling management of the QP's events and I/O operations.
After the connection is established, the initiator can start sending I/O commands through the newly created QP.
Destroying an I/O CQ – SPDK_NVME_OPC_DELETE_IO_CQ – handle_delete_io_cq():
The process starts by fetching the identifier of the queue to be deleted from the NVMe request.
The nvmf_doca_io entity corresponding to the identifier is located.
The associated poll group is extracted from the I/O, as it is responsible for destroying the CQ associated with it. The retrieved thread schedules the execution of nvmf_doca_pci_dev_poll_group_stop_io_cq() using spdk_thread_send_msg.
The nvmf_doca_io_stop() function is called to initiate the stopping process. If there are CQs in this I/O that are not idle, it triggers nvmf_doca_io_stop_continue() to advance the sequence. This flow then executes a series of asynchronous callback functions in order, ensuring that each step completes fully before the next begins, performing the following actions:
Stop the DOCA devemu PCIe device doorbell to prevent triggering completions on the associated doorbell completion context.
Stop the NVMe-oF DOCA DMA pool.
Stop the DMA context associated with the CQ and free all elements of the NVMe-oF DOCA CQ.
Stop the NVMe-oF DOCA DPA communication channel, halting both the receive and send message queues.
Stop and destroy the PCIe device DB.
Stop the MSI-X.
Stop and destroy the DB completion.
Destroy the communication channel.
Destroy the DPA thread.
The target posts a CQE to indicate that the operation has succeeded.
Destroying an I/O SQ – SPDK_NVME_OPC_DELETE_IO_SQ – handle_delete_io_sq():
The process starts by fetching the identifier of the queue to be deleted from the NVMe request.
The nvmf_doca_sq entity corresponding to the identifier is located.
The associated poll group is then extracted from the I/O, as it is responsible for destroying the SQ associated with it. The retrieved thread schedules the execution of nvmf_doca_pci_dev_poll_group_stop_io_sq() using spdk_thread_send_msg.
The nvmf_doca_sq_stop() function is called to initiate the stopping process.
The stopping process begins by calling spdk_nvmf_qpair_disconnect() to disconnect the NVMe-oF QP, clean up associated resources, and terminate the connection.
nvmf_doca_sq_stop_continue() is triggered to initiate a sequence of asynchronous callback functions, ensuring that each step completes before proceeding to the next. It performs the following actions:
Disconnect the NVMe-oF QP, cleaning up associated resources and terminating the connection.
Stop the DOCA devemu PCIe device doorbell to prevent triggering completions on the associated doorbell completion context.
Send an unbind SQ doorbell message to the DPA.
Stop the NVMe-oF DOCA DMA pool.
Stop the DMA context associated with the SQ and free all elements of the NVMe-oF DOCA SQ.
Destroy the resources associated with the SQ (i.e., the DMA pool, the queue, and the request pool).
The target posts a CQE to indicate that the operation has succeeded.
Data Path Flows
From the host's perspective, it communicates with a standard NVMe device. To simulate this experience, the DOCA transport uses the available NVMe-oF APIs to effectively mimic the behavior of a real NVMe device.
The data path flow involves a series of steps that handle the transfer and processing of commands between the host and the NVMe-oF target. It begins with the host writing commands to the SQ entries, specifying operations such as read or write requests that the NVMe-oF target processes.
The following diagram provides an overview of the data path steps:
The host writes the SQE.
The host rings the doorbell (DB), and the DB value is received by the DPA.
The producer forwards the DB value to the BlueField Arm.
The system reads the SQE from the host.
The SQE is processed.
A CQE is written back.
If MSI-X is enabled, the producer triggers the MSI-X interrupt.
The DPU's consumer raises the MSI-X interrupt to the host.
Each of these steps is further expanded upon in the following subsections.
Retrieving Doorbell
The process begins by retrieving the doorbell values, which indicate new commands submitted by the host, allowing the system to identify and process pending commands:
The DPA wakes up and checks the reason for activation. This could be due to the DPU posting something for the DPA consumer or a new doorbell value that needs to be passed to the DPU. In this case, the DPA detects the new doorbell value and sends it to the DPU via message queues. The DPU then calculates the number of SQEs that need to be fetched from the host's SQ and retrieves the commands using DMA via the nvmf_doca_sq_update_pi() function.
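The number of new SQEs to fetch follows from simple circular-ring arithmetic on the doorbell value. The helper below is a self-contained sketch of that calculation (the names and the queue-depth parameter are illustrative and not taken from the application):

#include <stdint.h>

/* Distance on a circular ring between the newly written doorbell (SQ tail)
 * and the last producer index the DPU has already processed. */
static inline uint16_t sq_entries_to_fetch(uint16_t doorbell_tail,
                                           uint16_t last_pi,
                                           uint16_t queue_depth)
{
    return (uint16_t)((doorbell_tail + queue_depth - last_pi) % queue_depth);
}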
Fetching SQE from Host
After retrieving the SQE, the system must translate this command into a request object that the NVMe-oF target can understand and process. The SQE contains crucial command details, such as read or write operations. This process involves populating an spdk_nvmf_request structure, which includes:
Command parameters extracted from the SQE.
Associated data buffer locations (if any data is to be read from or written to).
Metadata and additional information necessary for processing the command.
For admin commands, the SQEs are handled by the nvmf_doca_on_fetch_sqe_complete() function, while I/O NVMe-oF commands are managed by the nvmf_doca_on_fetch_nvm_sqe_complete() function. Both functions are responsible for filling the nvmf_doca_request structure.
A request is obtained from the SQ request pool, as described previously, and is populated by setting various attributes based on the specific command being issued. These attributes may include the Namespace ID, the length of the request, and the queue to which the request belongs.
There are three options for how data direction is handled in the command, as illustrated in the sketch after this list:
No data transfer – After preparing the request, the system sets the callback function to post_cqe_from_response(). The request is executed using spdk_nvmf_request_exec(). The system posts the CQE to indicate completion.
Data transfer from host to DPU – After preparing the request, the system retrieves buffers from the SQ pool and initializes them with the details of the data to be copied from the host. It invokes nvmf_doca_sq_copy_data(), which performs a DMA copy from the host to the DPU. Once the asynchronous copy completes, spdk_nvmf_request_exec() is called to continue processing. The system posts the CQE to signal completion.
Data transfer from DPU to host – After preparing the request, the system retrieves buffers from the SQ pool and initializes them with the data to be copied and the destination address on the host. It invokes nvmf_doca_sq_copy_data() to perform a DMA copy from the DPU to the host. Once the asynchronous copy finishes, spdk_nvmf_request_exec() is called to complete processing. The system posts the CQE to indicate completion.
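The three cases above amount to a switch on the command's data-transfer direction. The sketch below illustrates that dispatch; spdk_nvme_opc_get_data_transfer() is real SPDK API, while the three handler prototypes are placeholders for the per-case steps listed above.

#include <spdk/nvme_spec.h>

void handle_no_data_transfer(void *request);   /* placeholder: see the first bullet above */
void handle_host_to_dpu(void *request);        /* placeholder: see the second bullet above */
void handle_dpu_to_host(void *request);        /* placeholder: see the third bullet above */

/* Pick the processing path for a fetched SQE according to the opcode's
 * data-transfer direction, mirroring the three bullets above. */
static void dispatch_by_data_direction(uint8_t opcode, void *request)
{
    switch (spdk_nvme_opc_get_data_transfer(opcode)) {
    case SPDK_NVME_DATA_NONE:
        handle_no_data_transfer(request);
        break;
    case SPDK_NVME_DATA_HOST_TO_CONTROLLER:
        handle_host_to_dpu(request);
        break;
    case SPDK_NVME_DATA_CONTROLLER_TO_HOST:
        handle_dpu_to_host(request);
        break;
    default:
        break;
    }
}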
While the overall flow for NVMe commands and admin commands is similar, there are subtle differences in the transport implementation to address the unique requirements of each command type. For I/O commands like read and write, the system may involve large data block transfers. Here, the PRPs come into play, as they are used to describe the memory locations for the data to be read or written. The PRPs provide a list of physical addresses that the NVMe device uses to access the host's memory directly.
In this scenario, multiple DMA operations may be required for copying the data. After preparing the request, the function nvme_cmd_map_prps() is invoked to iterate over the entire PRP list, preparing the retrieved buffers from the pool and initializing them with the corresponding data and destination addresses. Once the buffers are properly set up, the function buffer_ready_copy_data_host_to_dpu() is called, which iterates through all the buffers and invokes nvmf_doca_sq_copy_data() for each one. Only after all asynchronous copy tasks for the buffers are completed does the function nvmf_doca_request_complete() get called to signal the end of the request processing.
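For readers unfamiliar with PRPs, the following self-contained sketch shows the kind of walk the paragraph describes: PRP1 covers the first (possibly offset) page, PRP2 is either a second data pointer or the address of a PRP list, and each list entry maps one further page. It is illustrative only and is not the application's nvme_cmd_map_prps() implementation; it assumes a 4 KiB page size and a PRP list that has already been copied from host memory.

#include <stdint.h>
#include <stddef.h>

#define NVME_PAGE_SIZE 4096u /* assumed memory page size (CC.MPS = 0) */

/* Translate PRP1/PRP2 of a command into (address, length) segments.
 * 'prp_list' must already contain the PRP list pointed to by PRP2 when the
 * transfer spans more than two pages (chaining of the list itself across
 * pages is omitted for brevity). Returns the segment count, or -1 if
 * 'max_segs' is too small. */
static int map_prps_sketch(uint64_t prp1, uint64_t prp2, uint32_t length,
                           const uint64_t *prp_list,
                           uint64_t *seg_addr, uint32_t *seg_len, uint32_t max_segs)
{
    uint32_t n = 0;
    uint32_t first = NVME_PAGE_SIZE - (uint32_t)(prp1 & (NVME_PAGE_SIZE - 1));

    if (max_segs == 0)
        return -1;

    /* First segment: PRP1, possibly starting at an offset inside the page. */
    seg_addr[n] = prp1;
    seg_len[n] = length < first ? length : first;
    length -= seg_len[n++];

    if (length == 0)
        return (int)n;

    if (length <= NVME_PAGE_SIZE) {
        /* Exactly one more page needed: PRP2 is a direct data pointer. */
        if (n >= max_segs)
            return -1;
        seg_addr[n] = prp2;
        seg_len[n++] = length;
        return (int)n;
    }

    /* Otherwise PRP2 points to a PRP list; each entry maps one full page. */
    for (size_t i = 0; length > 0; i++) {
        if (n >= max_segs)
            return -1;
        seg_addr[n] = prp_list[i];
        seg_len[n] = length < NVME_PAGE_SIZE ? length : NVME_PAGE_SIZE;
        length -= seg_len[n++];
    }
    return (int)n;
}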
Dispatching Command to NVMe-oF Target
Once the request is built, the function spdk_nvmf_request_exec() is called to execute it. This function initiates the processing of the request by determining the command type and dispatching it to the appropriate handler within the NVMe-oF target for processing. The NVMe-oF target then interprets the command, allocates necessary resources, and performs the requested operation.
When a request is dispatched to the NVMe-oF target for execution via spdk_nvmf_request_exec(), a completion callback function is typically configured. This callback is invoked once the request has been fully processed, signaling to the transport layer that the NVMe-oF target has completed handling the request.
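The pairing of spdk_nvmf_request_exec() with a completion callback can be sketched as follows. spdk_nvmf_request_exec() and struct spdk_nvmf_request are real SPDK API; the nvmf_doca_request wrapper and its on_complete field are simplified placeholders for the transport's internal request type, and the comment about where the callback fires reflects the description above rather than the actual code.

#include <spdk/nvmf_transport.h>

/* Simplified stand-in for the transport's request wrapper. */
struct nvmf_doca_request {
    struct spdk_nvmf_request spdk_req;                   /* embedded SPDK request */
    void (*on_complete)(struct nvmf_doca_request *req);  /* e.g. post_cqe_from_response() */
};

/* Record the completion callback, then hand the request to the NVMe-oF target.
 * Once the target finishes processing, the transport is notified and can
 * invoke req->on_complete to post the CQE. */
static void submit_request(struct nvmf_doca_request *req,
                           void (*cb)(struct nvmf_doca_request *req))
{
    req->on_complete = cb;
    spdk_nvmf_request_exec(&req->spdk_req);
}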
Posting CQE to Host
After the command has been processed, a CQE is created and posted back to the host to indicate the completion status of the command. This entry includes details about the operation's success or any errors encountered.
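The entry posted back to the host follows the 16-byte completion format from the NVMe specification. The sketch below uses a locally defined struct rather than the application's types, and a plain memcpy in place of the DMA write, to show how such an entry is stamped with the current phase tag and placed at the CQ tail:

#include <stdint.h>
#include <string.h>

/* 16-byte NVMe completion entry layout (field names follow the NVMe spec). */
struct nvme_cqe {
    uint32_t cdw0;    /* command-specific result */
    uint32_t rsvd;
    uint16_t sqhd;    /* SQ head pointer at completion time */
    uint16_t sqid;    /* SQ the command was submitted on */
    uint16_t cid;     /* command identifier copied from the SQE */
    uint16_t status;  /* bit 0 = phase tag, bits 15:1 = status field */
};

/* Write one CQE at the CQ tail, stamping the current phase tag, then advance
 * the tail and flip the phase on wrap-around. In the emulation path the memcpy
 * is replaced by a DMA write into the host's CQ ring. */
static void post_cqe(struct nvme_cqe *cq_ring, uint32_t cq_depth,
                     uint32_t *tail, uint8_t *phase, struct nvme_cqe cqe)
{
    cqe.status = (uint16_t)((cqe.status & ~1u) | *phase);
    memcpy(&cq_ring[*tail], &cqe, sizeof(cqe));

    if (++(*tail) == cq_depth) {
        *tail = 0;
        *phase ^= 1u;
    }
}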
Raising MSI-X
If the DMA for posting the CQE completes successfully, MSI-X is raised to the host to inform it that a completion event has occurred. This allows the host to read the CQE and process the result of the request.
In this scenario, the DPU calls nvmf_doca_io_raise_msix(), which sends a message through nvmf_doca_dpa_msgq_send(). This action prompts the DPA to wake up and attempt to retrieve the consumer completion context. The DPA then receives a message from the DPU instructing it to raise MSI-X.
This flow diagram illustrates the steps from fetching SQEs and preparing the request to posting the CQE, highlighting the three possible data scenarios (where PRP is not involved):
Limitations
Supported SPDK Versions
The supported SPDK version is 23.01.
Supported Commands
Refer to section "Supported Admin Commands" for the list of admin commands currently supported by the transport
Refer to section "Supported NVMe Commands" for the list of NVMe commands currently supported by the transport
SPDK Stop Listener Flow
The stop listener flow (spdk_nvmf_stop_listen) can be initiated through the remove_listener RPC. The current version of SPDK has limitations regarding the asynchronous handling of remove_listener requests. For this reason, it is recommended that the stop listener function be called only after freeing up resources such as memory buffers and QPs. This can be accomplished on the host side by issuing the unbind script:
python3 samples/doca_devemu/devemu_pci_vfio_bind.py --unbind 0000:62:00.0
This application leverages the following DOCA libraries:
For additional information about the used DOCA libraries, please refer to the respective programming guides.
NVIDIA® BlueField®-3
SPDK version 23.01
Please refer to the NVIDIA DOCA Installation Guide for Linux for details on how to install BlueField-related software.
The installation of DOCA's reference application contains the sources of the applications, alongside the matching compilation instructions. This allows for compiling the applications "as-is" and provides the ability to modify the sources, then compile a new version of the application.
For more information about the applications as well as development and compilation tips, refer to the DOCA Reference Applications page.
The sources of the application can be found under the application's directory: /opt/mellanox/doca/applications/nvme_emulation/.
Compiling All Applications
All DOCA applications are defined under a single meson project, so by default the compilation includes all of them.
To build all the applications together, run:
cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build
Info: doca_nvme_emulation is created under /tmp/build/applications/nvme_emulation/.
Alternatively, one can set the desired flags in the meson_options.txt file instead of providing them in the compilation command line:
Edit the following flags in /opt/mellanox/doca/applications/meson_options.txt:
Set enable_all_applications to false
Set enable_nvme_emulation to true
The same compilation commands should be used, as were shown in the previous section:
cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build
Compiling With Custom SPDK
If you plan to use a custom or alternative SPDK version, update the paths in the following variables via meson:
spdk_lib_path
spdk_incl_path
spdk_dpdk_lib_path
spdk_isal_prefix
Troubleshooting
Refer to the NVIDIA BlueField Platform Software Troubleshooting Guide for any issue you may encounter with the compilation of the DOCA applications.
Prerequisites
From the server on the DPU – The user must allocate hugepages and run the application, which will remain active and continuously process incoming RPC requests. This can be done with the following commands:
$ echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
$ sudo mount -t hugetlbfs -o pagesize=2M nodev /mnt/huge
$ sudo /tmp/build/applications/nvme_emulation/doca_nvme_emulation
From the client on the DPU – The user can send various RPC requests to the application during its execution. For example, to remove a listener, the following command can be used:
$ sudo PYTHONPATH=/doca/applications/nvme_emulation/ /usr/bin/spdk_rpc.py nvmf_subsystem_remove_listener nqn.2016-06.io.spdk:cnode1 -t doca -a MT2306XZ00AYGES1D0F0
Application Execution
The NVMe emulation application is provided in source form, hence a compilation is required before the application can be executed.
Application usage instructions:
Usage: doca_nvme_application [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                      Print a help synopsis
  -v, --version                   Print program version information
  -l, --log-level                 Set the (numeric) log level for the program <10=DISABLE,20=CRITICAL,30=ERROR,40=WARNING,50=INFO,60=DEBUG,70=TRACE>
  --sdk-log-level                 Set the SDK (numeric) log level for the program <10=DISABLE,20=CRITICAL,30=ERROR,40=WARNING,50=INFO,60=DEBUG,70=TRACE>
  -j, --json <path>               Parse all command flags from an input json file

Program Flags:
  -p, --pci-addr                  DOCA Comm Channel device PCIe address
  -r, --rep-pci                   DOCA Comm Channel device representor PCIe address
  -f, --file                      File to send by the client / File to write by the server
  -t, --timeout                   Application timeout for receiving file content messages, default is 5 sec
  -c, --config <config>           JSON config file (default none)
  --json <config>                 JSON config file (default none)
  --json-ignore-init-errors       don't exit on invalid config entry
  -d, --limit-coredump            do not set max coredump size to RLIM_INFINITY
  -g, --single-file-segments      force creating just one hugetlbfs file
  -h, --help                      show this usage
  -i, --shm-id <id>               shared memory ID (optional)
  -m, --cpumask <mask or list>    core mask (like 0xF) or core list of '[]' embraced (like [0,1,10]) for DPDK
  -n, --mem-channels <num>        channel number of memory channels used for DPDK
  -p, --main-core <id>            main (primary) core for DPDK
  -r, --rpc-socket <path>         RPC listen address (default /var/tmp/spdk.sock)
  -s, --mem-size <size>           memory size in MB for DPDK (default: 0MB)
  --disable-cpumask-locks         Disable CPU core lock files.
  --silence-noticelog             disable notice level logging to stderr
  --msg-mempool-size <size>       global message memory pool size in count (default: 262143)
  -u, --no-pci                    disable PCIe access
  --wait-for-rpc                  wait for RPCs to initialize subsystems
  --max-delay <num>               maximum reactor delay (in microseconds)
  -B, --pci-blocked <bdf>         PCIe addr to block (can be used more than once)
  -R, --huge-unlink               unlink huge files after initialization
  -v, --version                   print SPDK version
  -A, --pci-allowed <bdf>         PCIe addr to allow (-B and -A cannot be used at the same time)
  --huge-dir <path>               use a specific hugetlbfs mount to reserve memory from
  --iova-mode <pa/va>             set IOVA mode ('pa' for IOVA_PA and 'va' for IOVA_VA)
  --base-virtaddr <addr>          the base virtual address for DPDK (default: 0x200000000000)
  --num-trace-entries <num>       number of trace entries for each core, must be power of 2, setting 0 to disable trace (default 32768)
  --rpcs-allowed                  comma-separated list of permitted RPCS
  --env-context                   Opaque context for use of the env implementation
  --vfio-vf-token                 VF token (UUID) shared between SR-IOV PF and VFs for vfio_pci driver
  -L, --logflag <flag>            enable log flag (all, accel, aio, app_config, app_rpc, bdev, bdev_concat, bdev_ftl, bdev_group, bdev_malloc, bdev_null, bdev_nvme, bdev_raid, bdev_raid0, bdev_raid1, bdev_raid5f, blob, blob_esnap, blob_rw, blobfs, blobfs_bdev, blobfs_bdev_rpc, blobfs_rw, ftl_core, ftl_init, gpt_parse, json_util, log, log_rpc, lvol, lvol_rpc, notify_rpc, nvme, nvme_vfio, nvmf, nvmf_tcp, opal, rdma, reactor, rpc, rpc_client, sock, sock_posix, thread, trace, uring, vbdev_delay, vbdev_gpt, vbdev_lvol, vbdev_opal, vbdev_passthru, vbdev_split, vbdev_zone_block, vfio_pci, vfio_user, virtio, virtio_blk, virtio_dev, virtio_pci, virtio_user, virtio_vfio_user, vmd)
  -e, --tpoint-group <group-name>[:<tpoint_mask>]
                                  group_name - tracepoint group name for spdk trace buffers (bdev, nvmf_rdma, nvmf_tcp, blobfs, thread, nvme_pcie, nvme_tcp, bdev_nvme, nvme_nvda_tcp, all)
                                  tpoint_mask - tracepoint mask for enabling individual tpoints inside a tracepoint group. First tpoint inside a group can be enabled by setting tpoint_mask to 1 (e.g. bdev:0x1). Groups and masks can be combined (e.g. thread,bdev:0x1). All available tpoints can be found in /include/spdk_internal/trace_defs.h

Info: The above usage printout can be printed to the command line using the -h (or --help) options:
/tmp/build/applications/nvme_emulation/doca_nvme_emulation -h
Command Line Flags
The application uses the same command-line flags as SPDK, enabling configuration and behavior control similar to standard SPDK applications.
For more details, refer to the official SPDK Documentation.
Troubleshooting
Refer to the NVIDIA BlueField Platform Software Troubleshooting Guide for any issue you may encounter with the installation or execution of the DOCA applications.
References
/opt/mellanox/doca/applications/nvme_emulation/
/opt/mellanox/doca/applications/nvme_emulation/build_device_code.sh
/opt/mellanox/doca/applications/nvme_emulation/dependencies
/opt/mellanox/doca/applications/nvme_emulation/device
/opt/mellanox/doca/applications/nvme_emulation/host
/opt/mellanox/doca/applications/nvme_emulation/meson.build
/opt/mellanox/doca/applications/nvme_emulation/nvme_emulation.c
/opt/mellanox/doca/applications/nvme_emulation/rpc_nvmf_doca.py