DOCA Storage Comch to RDMA Zero Copy Application Guide
The doca_storage_comch_to_rdma_zero_copy application serves as a bridge between the initiator and a single storage target. Its only role in the data path is to forward IO requests and IO responses between the initiator and the storage target.
The doca_storage_comch_to_rdma_zero_copy application performs the following functions:
Relay of IO requests from the initiator to the storage target
Relay of IO responses from the storage target to the initiator
To achieve this, it expects to be able to connect to a storage target using TCP connections, and it then listens for an incoming connection from a single initiator using doca_comch_server.
The doca_storage_comch_to_rdma_zero_copy application is split into two functional areas:
Control time and shared resources
Per thread data path resources
The flow of the application similarly executes in two main phases:
Control phase
Data path phase
Control Phase
The control phase starts by connecting to the storage target and then waiting for a client connection. Once all connections are established, the application waits for the appropriate control commands:
Query storage
Init storage
Start storage
Processing each control command follows a similar pattern of:
Relay the command to the storage target
Wait for the storage target to respond
Do the required post processing and consistency checks on the storage responses
Respond to the client
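The four steps above can be sketched in a few lines of standalone C++. This is a minimal illustration, not the application's code: control_message, control_channel, and relay_control_command are hypothetical names introduced here, and the real application uses the control channel abstraction from storage_common instead.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical, simplified control message (not the application's real type).
struct control_message {
    std::string type;    // for example "query_storage_request"
    std::string payload; // opaque payload for this sketch
};

// Hypothetical channel: send a message, or try to receive one without blocking.
struct control_channel {
    std::function<void(control_message const &)> send;
    std::function<std::optional<control_message>()> try_receive;
};

// Relay one control command to the storage target and return the message
// that should be sent back to the client. max_polls bounds the wait so the
// sketch cannot spin forever.
inline control_message relay_control_command(control_channel &target,
                                             control_message const &request,
                                             std::string const &expected_response_type,
                                             int max_polls)
{
    assert(max_polls > 0);

    // 1. Relay the command to the storage target.
    target.send(request);

    // 2. Wait for the storage target to respond.
    std::optional<control_message> response;
    for (int i = 0; i < max_polls && !response; ++i)
        response = target.try_receive();

    // 3. Post-processing and consistency checks on the response.
    if (!response)
        return {"error_response", "timed out waiting for storage target"};
    if (response->type != expected_response_type)
        return {"error_response", "unexpected response: " + response->type};

    // 4. The caller sends this message to the client.
    return *response;
}
```

In this sketch a timeout or an unexpected message type is turned into an error response for the client, which mirrors the "response on success, error on failure" behaviour described for each control command later in this guide.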
The start storage control command will kick off the data path phase. Data threads will begin executing while the main thread proceeds to wait for the final control messages to complete the application lifecycle:
Stop storage
Shutdown
Data Path Phase
This phase happens per thread: each thread performs the IO operations requested by the client. Read and write requests are simply forwarded to the storage target; the data threads carry out no actual processing.
Read Data Flow
The regular read flow consists of the stages detailed in the following subsections.
1. Initiator Request
The initiator sends an I/O request to the zero copy application.
The zero copy application forwards the request verbatim to the storage target.
2. RDMA Transfer
The storage target performs an RDMA write operation directly into the initiator memory.
3. Target Response
The zero copy application receives a response from the storage target.
The zero copy application forwards the response verbatim to the initiator.
Write Data Flow
1. Initiator Request
The initiator sends an I/O request to the zero copy application.
The zero copy application forwards the request verbatim to the storage target.
2. RDMA Transfer
The storage target performs an RDMA read operation directly from the initiator memory.
3. Target Response
The zero copy application receives a response from the storage target.
The zero copy application forwards the response verbatim to the initiator.
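The difference between the two flows is only the direction of the RDMA transfer performed by the storage target. The following standalone sketch models that idea with plain memory copies; io_message, forward_verbatim, and storage_target are hypothetical names, and std::memcpy merely stands in for the RDMA read and write operations that the real target performs against initiator memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class io_type { read, write };

// Hypothetical IO message; the real layout is defined by the storage applications.
struct io_message {
    io_type type;
    std::size_t offset; // byte offset into the storage
    std::size_t length; // number of bytes to transfer
    bool is_response;
};

// The zero copy service never touches the data: it forwards messages unchanged.
inline io_message forward_verbatim(io_message const &msg) { return msg; }

// Toy storage target. initiator_block stands in for initiator memory that the
// target can reach directly.
struct storage_target {
    std::vector<std::uint8_t> disk;

    io_message handle(io_message const &req, std::uint8_t *initiator_block)
    {
        assert(req.offset + req.length <= disk.size());
        if (req.type == io_type::read) {
            // Read flow: the target writes storage data into initiator memory
            // (models the RDMA write).
            std::memcpy(initiator_block, disk.data() + req.offset, req.length);
        } else {
            // Write flow: the target reads data from initiator memory
            // (models the RDMA read).
            std::memcpy(disk.data() + req.offset, initiator_block, req.length);
        }
        io_message rsp = req;
        rsp.is_response = true;
        return rsp;
    }
};
```

Note that the data itself never passes through forward_verbatim in this model, which is the property the "zero copy" name refers to.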
This application leverages the following DOCA libraries:
This application is compiled as part of the set of storage applications. For compilation instructions, refer to the DOCA Storage page.
Application Execution
This application can only run within the NVIDIA® BlueField® DPU.
DOCA Storage Comch to RDMA Zero Copy is provided in source form. Therefore, compilation is required before the application can be executed.
Application usage instructions:
Usage: doca_storage_comch_to_rdma_zero_copy [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                        Print a help synopsis
  -v, --version                     Print program version information
  -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                 Parse command line flags from an input json file

Program Flags:
  -d, --device                      Device identifier
  -r, --representor                 Device host side representor identifier
  --cpu                             CPU core to which the process affinity can be set
  --storage-server                  Storage server addresses in <ip_addr>:<port> format
  --command-channel-name            Name of the channel used by the doca_comch_client. Default: "doca_storage_comch"
  --control-timeout                 Time (in seconds) to wait while performing control operations. Default: 5
Info: This usage printout can be printed to the command line using the -h (or --help) option:
./doca_storage_comch_to_rdma_zero_copy -h
For additional information, refer to section "Command-line Flags".
CLI example for running the application on the BlueField:
./doca_storage_comch_to_rdma_zero_copy -d 03:00.0 -r 3b:00.0 --storage-server 172.17.0.1:12345 --cpu 0
Note: Both the DOCA Comch device PCIe address (03:00.0) and the DOCA Comch device representor PCIe address (3b:00.0) should match the addresses of the desired PCIe devices.
Note: The storage target IP address:port tuples should be updated to refer to the running storage target applications.
Command-line Flags
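| Flag Type | Short Flag | Long Flag | Description |
| General flags | h | help | Print a help synopsis |
| General flags | v | version | Print program version information |
| General flags | l | log-level | Set the log level for the application: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE |
| General flags | N/A | sdk-log-level | Set the SDK log level for the program: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE |
| General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided) |
| Program flags | d | device | DOCA device identifier (for example, the PCIe address 03:00.0 used in the CLI example). Note: This flag is mandatory. |
| Program flags | r | representor | DOCA Comch device representor PCIe address. Note: This flag is mandatory. |
| Program flags | N/A | cpu | Index of CPU to use. One data path thread is spawned per CPU. Index starts at 0. Note: The user can specify this argument multiple times to create more threads. Note: This flag is mandatory. |
| Program flags | N/A | storage-server | IP address and port, in <ip_addr>:<port> format, to use to establish the control TCP connection to the target. Note: This flag is mandatory. |
| Program flags | N/A | command-channel-name | Allows customizing the server name used for this application instance if multiple Comch servers exist on the same device. Default: "doca_storage_comch" |
| Program flags | N/A | control-timeout | Time, in seconds, to wait while performing control operations. Default: 5 |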
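As an illustration of the <ip_addr>:<port> format accepted by --storage-server, the following standalone sketch splits such a value into its two parts. The names ip_address and parse_ip_and_port are hypothetical and are not taken from the application source; the sketch only checks the port and does not validate the address portion.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical result type for this sketch.
struct ip_address {
    std::string addr;
    std::uint16_t port;
};

// Split "<ip_addr>:<port>" at the last ':' and validate the port range.
// Returns std::nullopt when the value does not match the expected shape.
inline std::optional<ip_address> parse_ip_and_port(std::string const &value)
{
    auto const sep = value.rfind(':');
    if (sep == std::string::npos || sep == 0 || sep + 1 == value.size())
        return std::nullopt; // missing separator, address, or port

    std::string const port_str = value.substr(sep + 1);
    if (port_str.size() > 5)
        return std::nullopt; // cannot be a valid 16-bit port

    unsigned long port = 0;
    for (char c : port_str) {
        if (c < '0' || c > '9')
            return std::nullopt; // non-digit character in the port
        port = port * 10 + static_cast<unsigned long>(c - '0');
    }
    if (port == 0 || port > 65535)
        return std::nullopt;

    return ip_address{value.substr(0, sep), static_cast<std::uint16_t>(port)};
}
```

With the value from the CLI example, parse_ip_and_port("172.17.0.1:12345") yields the address 172.17.0.1 and the port 12345.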
Troubleshooting
Refer to the NVIDIA BlueField Platform Software Troubleshooting Guide for any issue encountered with the installation or execution of the DOCA applications.
The flow of the application is broken down into key functions / steps:
zero_copy_app app{parse_cli_args(argc, argv)};
storage::install_ctrl_c_handler([&app]() {
app.abort("User requested abort");
});
app.connect_to_storage();
app.wait_for_comch_client_connection();
app.wait_for_and_process_query_storage();
app.wait_for_and_process_init_storage();
app.wait_for_and_process_start_storage();
app.wait_for_and_process_stop_storage();
app.wait_for_and_process_shutdown();
app.display_stats();
Main/Control Thread Flow
zero_copy_app app{parse_cli_args(argc, argv)};
Parse CLI arguments and use these to create the application instance. Initial resources are also created at this stage:
    DOCA_LOG_INFO("Open doca_dev: %s", m_cfg.device_id.c_str());
    m_dev = storage::open_device(m_cfg.device_id);
    Open a doca_dev as specified by the CLI argument: -d or --device
    DOCA_LOG_INFO("Open doca_dev_rep: %s", m_cfg.representor_id.c_str());
    m_dev_rep = storage::open_representor(m_dev, m_cfg.representor_id);
    Open a doca_dev_rep as specified by the CLI argument: -r or --representor
    m_storage_control_channel = storage::control::make_tcp_client_control_channel(m_cfg.storage_server_address);
    Create a TCP client control channel. Control channel objects provide a unified API so that a TCP client, TCP server, doca_comch_client, and doca_comch_server all present a consistent interface.
    Info: See storage_common/control_channel.hpp for more information about the control channel abstraction.
    m_client_control_channel = storage::control::make_comch_server_control_channel(m_dev, m_dev_rep, m_cfg.command_channel_name.c_str(), this, new_comch_consumer_callback, expired_comch_consumer_callback);
    Create a Comch server control channel (containing a doca_comch_server instance) using the device, the representor, and the channel name specified by the CLI argument --command-channel-name, or the default value if none was specified.
storage::install_ctrl_c_handler([&app]() { app.abort("User requested abort"); });
Set a signal handler for Ctrl+C keyboard input so the application can shut down gracefully.
app.connect_to_storage();
Connect to the TCP server hosted by the storage target as defined by the CLI argument: --storage-server
    void zero_copy_app::connect_to_storage(void)
    {
        while (!m_storage_control_channel->is_connected()) {
            std::this_thread::sleep_for(std::chrono::milliseconds{100});
            if (m_abort_flag) {
                throw storage::runtime_error{DOCA_ERROR_CONNECTION_ABORTED, "Aborted while connecting to storage"};
            }
        }
    }
    Poll the storage target control channel until either it connects or the user aborts the application.
app.wait_for_comch_client_connection();
Wait for a doca_comch_client to connect.
    void zero_copy_app::wait_for_comch_client_connection(void)
    {
        while (!m_client_control_channel->is_connected()) {
            std::this_thread::sleep_for(std::chrono::milliseconds{100});
            if (m_abort_flag) {
                throw storage::runtime_error{DOCA_ERROR_CONNECTION_ABORTED, "Aborted while connecting to client"};
            }
        }
    }
    Poll the Comch server control channel until a doca_comch_client has connected or the user aborts the application. If any further Comch client attempts to connect to the server, it is automatically rejected by the control channel, which is designed for a 1:1 relationship between clients and servers. A sleep is placed in this loop because it may take the user / operator a few seconds to start the client, so there is no gain in polling any faster.
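Both connection waits above share the same shape: poll a connection predicate, sleep between polls, and give up when the abort flag is set. A minimal standalone sketch of that pattern follows. It uses only the C++ standard library; wait_until_connected is a hypothetical helper, and std::runtime_error stands in for the application's storage::runtime_error.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Poll is_connected until it reports true. Between polls, sleep for
// poll_interval and check whether an abort was requested (for example by a
// Ctrl+C handler running on another thread).
inline void wait_until_connected(std::function<bool()> const &is_connected,
                                 std::atomic_bool const &abort_flag,
                                 std::chrono::milliseconds poll_interval)
{
    while (!is_connected()) {
        std::this_thread::sleep_for(poll_interval);
        if (abort_flag.load()) {
            throw std::runtime_error{"Aborted while connecting"};
        }
    }
}
```

Using an atomic flag lets a signal handler or another thread request the abort safely while the main thread is inside the loop; whether the application's own m_abort_flag is an atomic is not stated in this guide.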
app.wait_for_and_process_query_storage();
Wait for the initiator to send a query_storage_request control message and then perform the required actions to fulfill the request:
    Forward the query storage request to the storage target.
    Wait for the storage target to respond.
    Send a response to the initiator:
        Send a query_storage_response message upon success or an error_response message if anything failed.
app.wait_for_and_process_init_storage();
Wait for the initiator to send an init_storage_request control message and then perform the required actions to fulfill the request:
    Use the init_storage_payload data to:
        Set the core count (m_core_count) to the number of cores requested by the initiator (the number of --cpu arguments provided to the initiator), or fail if this is more than the number of --cpu arguments provided to the service.
        Set the number of transactions per core (m_transaction_count) to double the number of transactions requested by the initiator. The count is doubled to allow for batched task submission and to avoid a race condition: the initiator can see the response to a transaction and try to re-submit it before the service has received the associated Comch producer event callback. In that case the initiator keeps retrying the send, degrading performance, until the service catches up and re-submits the consumer task. This should be uncommon, but allocating double the transaction count guarantees that, even if every transaction on the initiator hit this issue, a full second set of transactions is ready on the service side to receive the tasks without contention. A user can experiment with reducing this value to save memory if desired.
        Import and then re-export the initiator IO blocks mmap. This allows the storage target to read from and write to the initiator memory directly.
    Send the init storage request to the storage target using:
        The service transaction count (double the initiator value).
        The initiator core count.
        The re-exported IO blocks mmap.
    Send a response to the initiator:
        Send an init_storage_response message upon success or an error_response message if anything failed.
    Perform the first stages of the worker thread initialization. These steps are carried out for each thread, but only one thread performs the steps at any time. This simplifies the sending and receiving of control messages; the user could modify this flow to execute in parallel if desired.
        Create a thread bound to the Nth CPU provided to the service via the --cpu CLI arguments.
        m_workers[ii].execute_control_command(worker_create_objects_control_command{m_dev, m_client_control_channel->get_comch_connection(), m_transaction_count});
        Initialize the thread context (asynchronously).
        connect_rdma(ii, storage::control::rdma_connection_role::io_data, cid);
        Create the RDMA data connection (asynchronously). The thread connects to the storage target and creates an RDMA context which is idle from the service's point of view, but is used by the storage target to perform RDMA read / write operations. See the DOCA Storage Target RDMA Application Guide for an explanation of why there are two RDMA contexts per thread.
        connect_rdma(ii, storage::control::rdma_connection_role::io_control, cid);
        Create the RDMA control connection (asynchronously). The thread connects to the storage target and creates an RDMA context which is used to exchange IO requests and responses using RDMA send/recv tasks. See the DOCA Storage Target RDMA Application Guide for an explanation of why there are two RDMA contexts per thread.
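The core count check and the transaction count doubling described for the init storage command can be expressed compactly. The sketch below is a hypothetical illustration: the field names of init_storage_payload, the service_limits type, and the function plan_init_storage are assumptions made for this example and are not taken from the application source.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical view of the values carried by the init storage request.
struct init_storage_payload {
    std::uint32_t core_count;        // cores requested by the initiator
    std::uint32_t transaction_count; // transactions per core requested by the initiator
};

// Values the service would use for itself and forward to the storage target.
struct service_limits {
    std::uint32_t core_count;
    std::uint32_t transaction_count;
};

inline service_limits plan_init_storage(init_storage_payload const &request,
                                        std::uint32_t configured_cpu_count)
{
    // Fail if the initiator asks for more cores than the number of --cpu
    // arguments that were given to the service.
    if (request.core_count > configured_cpu_count) {
        throw std::invalid_argument{"initiator requested more cores than the service was given"};
    }

    // Allocate twice the requested transactions so a spare set is always
    // available to receive tasks that the initiator re-submits early.
    return service_limits{request.core_count, request.transaction_count * 2};
}
```

For example, an initiator asking for 2 cores and 64 transactions per core against a service started with four --cpu arguments would result in 2 cores and 128 transactions per core in this model.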
app.wait_for_and_process_start_storage();
Wait for the initiator to send a start_storage_request control message and then perform the required actions to fulfill the request:
    Forward the start storage request to the storage target.
    Wait for the storage target to respond.
    Signal all worker threads to begin data path operation.
    Send a response to the initiator:
        Send a start_storage_response message upon success or an error_response message if anything failed.
Data path execution now takes place until either the user aborts the program or a stop message is received.
app.wait_for_and_process_stop_storage();
Wait for the initiator to send a stop_storage_request control message and then perform the required actions to fulfill the request:
    Forward the stop storage request to the storage target.
    Wait for the storage target to respond.
    Signal all worker threads to stop data path operation.
    Collect run time stats.
    Send a response to the initiator:
        Send a stop_storage_response message upon success or an error_response message if anything failed.
app.wait_for_and_process_shutdown();
Wait for the initiator to send a shutdown_request control message and then perform the required actions to fulfill the request:
    Forward the shutdown request to the storage target.
    Wait for the storage target to respond.
    Destroy worker thread objects.
    Send a response to the initiator:
        Send a shutdown_response message upon success or an error_response message if anything failed.
app.display_stats();
Display runtime statistics.
Application destructor is triggered:
    Destroy control channels.
    Destroy initiator IO blocks doca_mmap.
    Close doca_dev_rep.
    Close doca_dev.
Program exits.
Worker/Data Path Thread Flow
The worker thread proc executes in two phases: a control / configuration phase, followed by a data path phase in which read and write operations are forwarded.
Worker Init Process
void zero_copy_app_worker::thread_proc(zero_copy_app_worker *self, uint16_t core_idx) noexcept
The worker starts by executing a loop of:
Lock mutex.
If message pointer is not null:
Process the configuration message.
Set the operation result.
Unlock the mutex.
Yield.
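The loop above is effectively a single-slot mailbox guarded by a mutex: the main thread publishes a pointer to a command, and the worker picks it up, runs it, and records the result. The following standalone sketch shows one way to build this with the standard library. The control_command and worker types here are hypothetical simplifications and do not reproduce the application's worker class; only the method name execute_control_command is borrowed from the snippet shown earlier.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical command: the operation to run plus a slot for its result.
struct control_command {
    std::function<bool()> operation;
    bool done = false;
    bool result = false;
};

class worker {
public:
    worker() : m_thread{[this]() { thread_proc(); }} {}
    ~worker()
    {
        m_stop = true;
        m_thread.join();
    }

    // Called from the main thread: publish a command, then wait for its result.
    bool execute_control_command(std::function<bool()> operation)
    {
        control_command cmd;
        cmd.operation = std::move(operation);
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_message = &cmd;
        }
        for (;;) {
            {
                std::lock_guard<std::mutex> lock{m_mutex};
                if (cmd.done)
                    return cmd.result;
            }
            std::this_thread::yield();
        }
    }

private:
    void thread_proc()
    {
        while (!m_stop) {
            {
                std::lock_guard<std::mutex> lock{m_mutex};      // lock mutex
                if (m_message != nullptr) {                      // message pending?
                    m_message->result = m_message->operation();  // process the message
                    m_message->done = true;                      // set the operation result
                    m_message = nullptr;
                }
            }                                                    // unlock the mutex
            std::this_thread::yield();                           // yield
        }
    }

    std::mutex m_mutex;
    control_command *m_message = nullptr;
    std::atomic_bool m_stop{false};
    std::thread m_thread; // declared last so it starts after the other members exist
};
```

Because only one command can be pending at a time, the main thread naturally serializes configuration steps, which matches the one-thread-at-a-time initialization described for the init storage command.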
The following configuration operations can be performed by the worker thread:
void zero_copy_app_worker::create_worker_objects(worker_create_objects_control_command const &cmd)
Create general worker objects:
    Create IO message memory.
    Create IO message mmap (to allow the messages to be accessed by DOCA Comch and DOCA RDMA).
    Allocate doca buffer inventory.
    Create doca_pe to drive the DOCA contexts (doca_rdma, doca_comch_consumer, doca_comch_producer).
    Create doca_comch_consumer and doca_comch_producer contexts.
    Initialize and start contexts.
void zero_copy_app_worker::export_local_rdma_connection_blob(worker_export_local_rdma_connection_command &cmd)
Export RDMA context connection binary blob.
void zero_copy_app_worker::import_remote_rdma_connection_blob(worker_import_local_rdma_connection_command const &cmd)
Import remote RDMA context connection binary blob.
void zero_copy_app_worker::are_contexts_ready(worker_are_contexts_ready_control_command &cmd) const noexcept
Poll all contexts to check they are ready to perform data path operations.
void zero_copy_app_worker::prepare_tasks(worker_prepare_tasks_control_command const &cmd)
Prepare transaction contexts by:
    Allocating doca_buf objects.
    Allocating doca_task objects.
    Setting task user data.
void zero_copy_app_worker::start_data_path(void)
Break out of the wait for configuration event loop and start the data path loop.
After the configuration phase, the mutex is not used again.
After breaking out of the initial configuration loop, the thread submits receive tasks (comch consumer tasks, and RDMA recv tasks) then enters the data path function: run_data_path_ops.
Worker Data Path Process
void zero_copy_app_worker::run_data_path_ops(zero_copy_app_worker::hot_data &hot_data)
{
DOCA_LOG_INFO("Core: %u running", hot_data.core_idx);
while (hot_data.run_flag) {
doca_pe_progress(hot_data.pe) ? ++(hot_data.pe_hit_count) : ++(hot_data.pe_miss_count);
}
while (hot_data.error_flag == false && hot_data.in_flight_transaction_count != 0) {
doca_pe_progress(hot_data.pe) ? ++(hot_data.pe_hit_count) : ++(hot_data.pe_miss_count);
}
DOCA_LOG_INFO("Core: %u complete", hot_data.core_idx);
}
During the data path phase, the thread simply polls the doca_pe as quickly as possible to check for a task completion from one of the thread's DOCA contexts (doca_rdma, doca_comch_consumer, doca_comch_producer). The interesting work is done in the callbacks of these tasks. The flow always starts with a consumer task completion, which is the reception of an IO message from the initiator. For the zero copy use case, the callbacks simply forward the IO requests and responses:
Comch consumer → RDMA send
RDMA recv → Comch producer
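To make the callback chaining concrete, the sketch below models the two forwarding steps with a toy progress engine. Nothing here uses the DOCA API: toy_pe, toy_worker, and the string messages are hypothetical stand-ins, and the storage target is simulated by immediately queueing a response completion for every forwarded request.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy progress engine: a queue of pending completion callbacks.
struct toy_pe {
    std::deque<std::function<void()>> completions;

    // Run at most one completion callback and report whether one was run.
    bool progress()
    {
        if (completions.empty())
            return false;
        auto callback = std::move(completions.front());
        completions.pop_front();
        callback();
        return true;
    }
};

struct toy_worker {
    toy_pe pe;
    std::vector<std::string> sent_to_target;    // models RDMA send
    std::vector<std::string> sent_to_initiator; // models Comch producer send
    std::uint64_t pe_hit_count = 0;
    std::uint64_t pe_miss_count = 0;
    std::uint32_t in_flight_transaction_count = 0;

    // Comch consumer completion: forward the IO request (RDMA send).
    void on_consumer_completion(std::string io_request)
    {
        ++in_flight_transaction_count;
        sent_to_target.push_back(io_request);
        // Simulated storage target: its response arrives as an RDMA recv completion.
        pe.completions.push_back(
            [this, io_request]() { on_rdma_recv_completion(io_request + ":done"); });
    }

    // RDMA recv completion: forward the IO response (Comch producer send).
    void on_rdma_recv_completion(std::string io_response)
    {
        sent_to_initiator.push_back(std::move(io_response));
        --in_flight_transaction_count;
    }

    // Poll the engine a fixed number of times, counting hits and misses.
    void run(std::uint32_t polls)
    {
        for (std::uint32_t i = 0; i < polls; ++i)
            pe.progress() ? ++pe_hit_count : ++pe_miss_count;
    }
};
```

The hit and miss counters mirror the pe_hit_count and pe_miss_count fields updated in run_data_path_ops, giving a rough measure of how often a poll found work to do.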
/opt/mellanox/doca/applications/storage/