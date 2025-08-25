
DOCA Storage Comch to RDMA GGA Offload Application Guide
The doca_storage_comch_to_rdma_gga_offload application acts as a bridge between the initiator and three storage targets. It actively participates in data transfer and uses DOCA libraries to accelerate certain data operations transparently to both the initiator and the targets.
The doca_storage_comch_to_rdma_gga_offload application performs the following key functions:
- Orchestrates communication between the initiator and the three storage instances
- Periodically triggers data recovery flows to simulate data loss
- Performs data recovery using erasure coding
- Performs inline data decompression
To accomplish these tasks, the application establishes TCP connections to three storage targets and listens for an incoming connection from a single initiator using doca_comch_server.
The doca_storage_comch_to_rdma_gga_offload application is divided into two main functional areas:
- Control-time and shared resources
- Per-thread data path resources
The application execution follows two primary phases:
- Control phase
- Data path phase
Control Phase
This phase begins by establishing connections to the required storage targets and then waits for a client connection. Once all connections are in place, the application waits for specific control commands:
- Query storage
- Init storage
- Start storage
Each control command is processed using the following sequence:
1. Relay the command to each connected storage target.
2. Wait for responses from all storage targets.
3. Perform required post-processing and consistency checks on the responses.
4. Send a response back to the client.
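The relay, wait, check, and respond sequence above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's code: storage_target_sketch, control_response, and process_control_command are hypothetical names, and the real application exchanges control messages over TCP (storage targets) and doca_comch (initiator) rather than through std::function callbacks. The consistency check shown (every target succeeds and reports the same storage size) mirrors the query storage handling described later in this guide.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical summary of one storage target's reply.
struct control_response {
    bool ok;
    uint64_t storage_size;
};

// Hypothetical handle for one connected storage target.
struct storage_target_sketch {
    std::function<void()> send_command;               // relay the command
    std::function<control_response()> wait_response;  // block until the reply arrives
};

// Returns the response to forward to the client, or std::nullopt if any
// target failed or the targets disagree.
inline std::optional<control_response> process_control_command(
    std::vector<storage_target_sketch> &targets) {
    // 1. Relay the command to each connected storage target.
    for (auto &target : targets)
        target.send_command();

    // 2. Wait for responses from all storage targets.
    std::vector<control_response> responses;
    for (auto &target : targets)
        responses.push_back(target.wait_response());

    // 3. Consistency check: every target succeeded and reports the same size.
    if (responses.empty())
        return std::nullopt;
    for (const auto &r : responses) {
        if (!r.ok || r.storage_size != responses.front().storage_size)
            return std::nullopt;
    }

    // 4. A single response is sent back to the client.
    return responses.front();
}
```

Relaying to all targets before waiting on any of them lets the three targets process the command concurrently instead of one after another.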
Issuing the start storage command initiates the data path phase. While the data threads begin execution, the main thread continues to wait for final control commands to complete the application's lifecycle:
- Stop storage
- Shutdown
Data Path Phase
This phase runs on each data path thread, and each thread performs I/O operations requested by the client. For a read I/O operation, one of two flows is used:
- Regular read (no recovery)
- Recovery read
By default, only the regular read flow is executed. Periodic recovery reads can be enabled with the --trigger-recovery-read-every-n flag; see the "Command-line Flags" section for details.
Regular Read Data Flow
The regular read flow consists of the stages detailed in the following subsections.
1. Initiator Request
The initiator sends an I/O request to the GGA offload application.
The GGA offload application translates the I/O address from the initiator's memory range to the corresponding GGA offload memory range.
For example, a read at offset 8 KB in initiator memory is remapped to <GGA-offload-base> + 8 KB.
2. RDMA Transfers
The GGA offload application sends the adjusted read requests to the data 1 and data 2 storage targets.
Each storage target performs an RDMA write into the GGA offload memory region at the designated offsets.
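Stages 1 and 2 amount to an address translation followed by a split of the block into two half-block requests, matching the part A / part B addressing (local_io_addr and local_io_addr + half_block_size) in the IO request preparation code later in this guide. The sketch below is illustrative only: half_request, read_plan, and plan_regular_read are hypothetical names, block_size is assumed to be even, and the remote offsets on the storage targets are omitted because this guide does not specify how they are derived.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical request layout; the application fills storage::io_message_view
// fields instead.
struct half_request {
    uint64_t io_address; // where the target RDMA-writes into GGA offload memory
    uint32_t io_size;
};

struct read_plan {
    uint64_t gga_io_addr; // translated address in GGA offload memory
    half_request data_1;  // request sent to the data 1 storage target
    half_request data_2;  // request sent to the data 2 storage target
};

// Translate the initiator address into the GGA offload range, then split the
// (assumed even-sized) block into two consecutive halves.
inline read_plan plan_regular_read(uint64_t initiator_base, uint64_t initiator_io_addr,
                                   uint64_t gga_offload_base, uint32_t block_size) {
    assert(initiator_io_addr >= initiator_base);
    assert(block_size % 2 == 0);
    const uint64_t local_io_addr = gga_offload_base + (initiator_io_addr - initiator_base);
    const uint32_t half_block_size = block_size / 2;
    return read_plan{local_io_addr,
                     half_request{local_io_addr, half_block_size},
                     half_request{local_io_addr + half_block_size, half_block_size}};
}
```

For the guide's example, an initiator read at offset 8 KB maps to <GGA-offload-base> + 8 KB for the data 1 half, and the data 2 half lands immediately after it.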
3. Target Responses
Both storage targets send I/O responses upon completing the RDMA transfers.
The GGA offload application waits until both responses are received.
4. Decompression and Final Response
The application decompresses the received data blocks. Decompression output is written directly to the initiator memory using a DOCA Compress LZ4 stream decompression task (doca_compress_task_decompress_lz4_stream).
The application sends an I/O response to the initiator, completing the operation.
Recovery Read Data Flow
This flow is similar to the regular read, but with added steps for error correction using erasure coding. The process involves the stages detailed in the following subsections.
1. Initiator Request
The initiator sends an I/O request to the GGA offload application.
When a recovery read is required, the application:
- Translates the I/O address normally.
- Modifies one of the two storage requests to target the parity partition instead of a standard data partition.
- Adds an output offset field to redirect the parity data into a reserved region of GGA offload memory.
2. RDMA Transfers
Two RDMA transfers are issued:
- One to the surviving data partition.
- One to the parity partition.
The transferred blocks are written into the GGA offload memory with appropriate alignment.
3. Target Responses
Both storage targets reply with I/O responses.
The GGA offload application waits for both to complete before proceeding.
4. Data Recovery
The application performs erasure-coding recovery (using doca_ec) from the surviving half block and the parity data to reconstruct the missing half block.
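The application delegates recovery to doca_ec, using a Vandermonde or Cauchy matrix (see --matrix-type). As a simplified stand-in, the sketch below uses a single-parity XOR code, which is not necessarily what doca_ec computes, to show why one surviving half plus parity is enough to rebuild the missing half. make_xor_parity and recover_missing_half are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using bytes = std::vector<uint8_t>;

// Simplified stand-in for erasure coding: parity = data_1 XOR data_2.
// (The application itself uses doca_ec with a Vandermonde or Cauchy matrix.)
inline bytes make_xor_parity(const bytes &data_1, const bytes &data_2) {
    assert(data_1.size() == data_2.size());
    bytes parity(data_1.size());
    for (std::size_t i = 0; i < parity.size(); ++i)
        parity[i] = static_cast<uint8_t>(data_1[i] ^ data_2[i]);
    return parity;
}

// Rebuilds the missing half: because XOR is its own inverse,
// surviving_half XOR parity yields the lost half.
inline bytes recover_missing_half(const bytes &surviving_half, const bytes &parity) {
    return make_xor_parity(surviving_half, parity);
}
```

In the guide's terms, recover_a rebuilds data 1 from data 2 and parity, and recover_b rebuilds data 2 from data 1 and parity.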
5. Decompression and Final Response
The recovered data is then decompressed. As with the regular read, decompression output is written directly to initiator memory.
An I/O response is sent to the initiator, completing the operation.
This application leverages the following DOCA libraries:
- DOCA Comch
- DOCA Core (progress engine)
- DOCA RDMA
- DOCA Erasure Coding
- DOCA Compress
This application is compiled as part of the set of storage applications. For compilation instructions, refer to the DOCA Storage Applications page.
Application Execution
This application can only run within the NVIDIA® BlueField® DPU.
DOCA Storage Comch to RDMA GGA Offload is provided in source form. Therefore, compilation is required before the application can be executed.
Application usage instructions:
Usage: doca_storage_comch_to_rdma_gga_offload [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                        Print a help synopsis
  -v, --version                     Print program version information
  -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                 Parse all command flags from an input json file

Program Flags:
  -d, --device                      Device identifier
  -r, --representor                 Device host side representor identifier
  --cpu                             CPU core to which the process affinity can be set
  --data-1-storage                  Storage server addresses in <ip_addr>:<port> format
  --data-2-storage                  Storage server addresses in <ip_addr>:<port> format
  --data-p-storage                  Storage server addresses in <ip_addr>:<port> format
  --matrix-type                     Type of matrix to use. One of: cauchy, vandermonde Default: vandermonde
  --command-channel-name            Name of the channel used by the doca_comch_client. Default: "doca_storage_comch"
  --control-timeout                 Time (in seconds) to wait while performing control operations. Default: 5
  --trigger-recovery-read-every-n   Trigger a recovery read flow every N th request. Default: 0 (disabled)
Info
This usage printout can be printed to the command line using the -h (or --help) option:
./doca_storage_comch_to_rdma_gga_offload -h
For additional information, refer to section "Command-line Flags".
CLI example for running the application on the BlueField:
./doca_storage_comch_to_rdma_gga_offload -d 03:00.0 -r 3b:00.0 --data-1-storage 172.17.0.1:12345 --data-2-storage 172.17.0.2:12345 --data-p-storage 172.17.0.3:12345 --cpu 0
Note
Both the DOCA Comch device PCIe address (03:00.0) and the DOCA Comch device representor PCIe address (3b:00.0) should match the addresses of the desired PCIe devices.
Note
Storage target IP address:port tuples should be updated to refer to the running storage target applications.
The application also supports a JSON-based deployment mode in which all command-line arguments are provided through a JSON file:
./doca_storage_comch_to_rdma_gga_offload --json [json_file]
For example:
./doca_storage_comch_to_rdma_gga_offload --json doca_storage_comch_to_rdma_gga_offload_params.json
Note
Before execution, ensure that the JSON file contains valid configuration parameters, particularly the correct PCIe device addresses required for deployment.
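Command-line Flags
Each long flag name also serves as the key when flags are supplied through a JSON file with --json.
General flags:
- -h, --help: Print a help synopsis. No JSON equivalent.
- -v, --version: Print program version information. No JSON equivalent.
- -l, --log-level: Set the log level for the application: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE.
- --sdk-log-level (no short flag): Set the log level for the program (DOCA SDK): 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE.
- -j, --json: Parse all command flags from an input JSON file. No JSON equivalent.
Program flags:
- -d, --device: DOCA device identifier (for example, the PCIe address 03:00.0). Mandatory.
- -r, --representor: DOCA Comch device representor PCIe address. Mandatory.
- --cpu (no short flag): Index of CPU to use. One data path thread is spawned per CPU. Index starts at 0. Specify this flag multiple times to create more threads. Mandatory.
- --data-1-storage (no short flag): IP address and port (<ip_addr>:<port>) used to establish the control TCP connection to the data 1 storage target. Mandatory.
- --data-2-storage (no short flag): IP address and port (<ip_addr>:<port>) used to establish the control TCP connection to the data 2 storage target. Mandatory.
- --data-p-storage (no short flag): IP address and port (<ip_addr>:<port>) used to establish the control TCP connection to the parity storage target. Mandatory.
- --matrix-type (no short flag): Type of erasure coding matrix to use. One of: cauchy, vandermonde. Default: vandermonde.
- --command-channel-name (no short flag): Allows customizing the server name used for this application instance if multiple comch servers exist on the same device. Default: "doca_storage_comch".
- --control-timeout (no short flag): Time, in seconds, to wait while performing control operations. Default: 5.
- --trigger-recovery-read-every-n (no short flag): Trigger a recovery read flow every Nth request. Set to 0 to disable recovery reads entirely. For example, a value of 1 triggers a recovery read for every request, while a value of 10 triggers a recovery read for one out of every ten requests. Default: 0 (disabled).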
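One way to implement the every-Nth-request semantics of --trigger-recovery-read-every-n is a simple counter, sketched below. recovery_read_trigger is a hypothetical helper; the guide does not describe how the application counts requests (for example, per thread or globally) or which data partition it treats as lost.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: 0 disables recovery reads, 1 makes every request a
// recovery read, and N makes one out of every N requests a recovery read.
class recovery_read_trigger {
public:
    explicit recovery_read_trigger(uint32_t every_n) : m_every_n{every_n} {}

    // Call once per incoming read request; returns true when this request
    // should use the recovery read flow.
    bool should_recover() {
        if (m_every_n == 0)
            return false;
        if (++m_count < m_every_n)
            return false;
        m_count = 0; // the Nth request fires, then counting restarts
        return true;
    }

private:
    uint32_t m_every_n;
    uint32_t m_count = 0;
};
```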
Application Code Flow
Control Phase
- gga_offload_app app{parse_cli_args(argc, argv)};
  Parse CLI arguments, apply default values, and create the application instance.
- app.connect_to_storage();
  Connect to each of the three storage targets over TCP.
- app.wait_for_comch_client_connection();
  Create a doca_comch_server instance and wait for a doca_comch_client to connect.
- app.wait_for_and_process_query_storage();
  Wait for the initiator to send a query storage control message, then:
  - Send a query storage message to each storage target
  - Verify that each storage target reports the same storage size
  - Send a query storage response back to the initiator
- app.wait_for_and_process_init_storage();
  Wait for the initiator to send an init storage control message, then:
  - Verify that the requested core count does not exceed the available cores
  - Create and export storage memory
  - Send an init storage message to each storage target
  - Wait for responses from all storage targets
  - Create data path resources:
    - Worker threads
    - IO message memory regions
    - doca_pe objects
    - doca_comch_consumer objects
    - doca_comch_producer objects
    - doca_rdma connection objects
    - doca_ec objects
    - doca_compress objects
  - Send an init storage response
- app.wait_for_and_process_start_storage();
  Wait for the initiator to send a start storage control message, then:
  - Send a start storage message to each storage target
  - Wait for responses from all storage targets
  - Create task objects
  - Submit listening tasks (doca_comch_consumer and RDMA receive tasks)
  - Signal worker threads to begin processing
  - Send a start storage response
- app.wait_for_and_process_stop_storage();
  Wait for the initiator to send a stop storage control message (test complete), then:
  - Send a stop storage message to each storage target
  - Wait for responses from all storage targets
  - Signal worker threads to stop
  - Gather and post-process execution statistics
  - Destroy doca_comch_consumer objects
  - Destroy doca_comch_producer objects
  - Send a stop storage response
- app.wait_for_and_process_shutdown();
  Wait for the initiator to send a shutdown control message, then:
  - Send a shutdown message to each storage target
  - Wait for responses from all storage targets
  - Destroy all remaining data path objects
  - Send a shutdown response
- app.display_stats();
  Display collected statistics and destroy all control path objects.
Data Path Phase
- while (m_hot_data.run_flag == false) {
      std::this_thread::yield();
      if (m_hot_data.error_flag)
          return;
  }
  The data path thread enters a spin-wait loop, yielding execution until all threads and resources are initialized. If an error is detected (error_flag is set), the thread exits early.
- while (m_hot_data.run_flag) {
      doca_pe_progress(m_hot_data.pe) ? ++(m_hot_data.pe_hit_count) : ++(m_hot_data.pe_miss_count);
  }
  Once started, the thread enters a tight loop, continuously polling the progress engine (doca_pe_progress). Each iteration increments the hit or miss counter depending on whether any task completions were triggered. This loop drives the data path by processing task completions as fast as possible.
- while (m_hot_data.error_flag == false && m_hot_data.in_flight_transaction_count != 0) {
      doca_pe_progress(m_hot_data.pe) ? ++(m_hot_data.pe_hit_count) : ++(m_hot_data.pe_miss_count);
  }
  This final loop ensures that all in-flight transactions complete before exiting. It continues polling the progress engine as long as there are active transactions and no error has occurred.
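The three loops above can be combined into one runnable sketch. To keep it self-contained, hot_data_sketch and run_worker are hypothetical stand-ins for m_hot_data and the worker thread function, the flags are std::atomic (an assumption; the guide does not state how they are synchronized), and a std::function replaces doca_pe_progress so the loop can be exercised without a BlueField device.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical stand-in for m_hot_data.
struct hot_data_sketch {
    std::atomic<bool> run_flag{false};
    std::atomic<bool> error_flag{false};
    std::atomic<uint32_t> in_flight_transaction_count{0};
    uint64_t pe_hit_count = 0;
    uint64_t pe_miss_count = 0;
    std::function<bool()> progress; // stand-in for doca_pe_progress(pe)
};

inline void run_worker(hot_data_sketch &hot_data) {
    // Phase 1: spin until the control thread signals start; bail out on error.
    while (!hot_data.run_flag.load()) {
        std::this_thread::yield();
        if (hot_data.error_flag.load())
            return;
    }

    // Phase 2: poll as fast as possible and count hits and misses.
    while (hot_data.run_flag.load()) {
        if (hot_data.progress())
            ++hot_data.pe_hit_count;
        else
            ++hot_data.pe_miss_count;
    }

    // Phase 3: drain in-flight transactions unless an error occurred.
    while (!hot_data.error_flag.load() && hot_data.in_flight_transaction_count.load() != 0) {
        if (hot_data.progress())
            ++hot_data.pe_hit_count;
        else
            ++hot_data.pe_miss_count;
    }
}
```

Keeping the drain loop separate from the run loop lets the control thread clear run_flag at any time without abandoning transactions that storage targets are still completing.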
doca_comch_consumer_task_post_recv_cb
This is the comch consumer callback function which initiates the read or recovery flow. It marks the first step in processing the initiator's request, determining which data must be fetched and which two storage targets to query.
doca_comch_consumer_task_post_recv_cb reinterprets the task and context user data, and invokes gga_offload_app_worker::hot_data::start_transaction.
start_transaction prepares two IO requests, depending on the required mode:
- read: data 1 and data 2
- recover_a: parity and data 2 (to recover data 1)
- recover_b: data 1 and parity (to recover data 2)
IO request preparation:
storage::io_message_view::set_correlation_id(cid, part_a_io_message);
storage::io_message_view::set_correlation_id(cid, part_b_io_message);
storage::io_message_view::set_correlation_id(cid, response_io_message);

storage::io_message_view::set_type(type, part_a_io_message);
storage::io_message_view::set_type(type, part_b_io_message);
storage::io_message_view::set_type(storage::io_message_type::result, response_io_message);

storage::io_message_view::set_user_data(user_data, part_a_io_message);
storage::io_message_view::set_user_data(user_data, part_b_io_message);
storage::io_message_view::set_user_data(user_data, response_io_message);

storage::io_message_view::set_io_address(local_io_addr, part_a_io_message);
storage::io_message_view::set_io_address(local_io_addr + half_block_size, part_b_io_message);
storage::io_message_view::set_io_address(host_io_addr, response_io_message);

storage::io_message_view::set_io_size(half_block_size, part_a_io_message);
storage::io_message_view::set_io_size(half_block_size, part_b_io_message);
storage::io_message_view::set_io_size(transaction.io_size, response_io_message);

storage::io_message_view::set_remote_offset(remote_offset_a, part_a_io_message);
storage::io_message_view::set_remote_offset(remote_offset_b, part_b_io_message);
The transaction mode (read, recover_a, or recover_b) is saved.
The prepared IO requests are then sent to the corresponding storage targets.
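The mode-to-target selection performed by start_transaction reduces to a small lookup, sketched below. transaction_mode, partition, and partitions_for are hypothetical names; the order of the two entries follows the mode list in this guide and is not meant to imply which request becomes part A or part B.

```cpp
#include <array>
#include <cassert>

enum class transaction_mode { read, recover_a, recover_b };
enum class partition { data_1, data_2, parity };

// Returns the two partitions fetched for a given transaction mode.
inline std::array<partition, 2> partitions_for(transaction_mode mode) {
    switch (mode) {
    case transaction_mode::read:
        return {partition::data_1, partition::data_2};
    case transaction_mode::recover_a: // rebuild data 1
        return {partition::parity, partition::data_2};
    case transaction_mode::recover_b: // rebuild data 2
        return {partition::data_1, partition::parity};
    }
    assert(false && "unknown transaction mode");
    return {partition::data_1, partition::data_2};
}
```

Every mode fetches exactly two of the three partitions, which is why each transaction always waits for exactly two storage responses.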
doca_rdma_task_receive_cb
After each storage target completes its respective data transfer, it sends a response. Once the callback receives the second of the two responses, it checks the transaction mode. If the mode is read, it invokes the decompression step, gga_offload_app_worker::hot_data::start_decompress. Otherwise, it invokes the EC recovery step, gga_offload_app_worker::hot_data::start_recover.
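The logic of acting only on the second response can be modeled with a per-transaction pending counter. This is a sketch under the assumption that each transaction tracks its outstanding storage responses; transaction_sketch, next_step, and on_storage_response are hypothetical names, and the real callback calls start_decompress or start_recover instead of returning a value.

```cpp
#include <cassert>
#include <cstdint>

enum class transaction_mode { read, recover_a, recover_b };
enum class next_step { wait, decompress, recover };

// Hypothetical per-transaction state.
struct transaction_sketch {
    transaction_mode mode = transaction_mode::read;
    uint8_t pending_responses = 2; // one response per storage request
};

// Called once per storage response; only the second response moves the
// transaction forward, to decompression (read) or EC recovery (recover_a/b).
inline next_step on_storage_response(transaction_sketch &transaction) {
    assert(transaction.pending_responses > 0);
    if (--transaction.pending_responses != 0)
        return next_step::wait;
    return transaction.mode == transaction_mode::read ? next_step::decompress
                                                      : next_step::recover;
}
```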
doca_ec_task_recover_cb
In scenarios requiring data recovery, the decompression step (gga_offload_app_worker::hot_data::start_decompress) is invoked immediately after the recovery completes.
doca_compress_task_decompress_lz4_stream_cb
Because decompression writes its output directly into the remote initiator's memory, completion of the decompression task means the requested data has been delivered to the initiator. At this point, the transaction is considered complete and may be safely reused.
