DOCA Documentation v3.0.0

DOCA Storage Comch to RDMA GGA Offload Application Guide

The doca_storage_comch_to_rdma_gga_offload application acts as a bridge between the initiator and three storage targets. It actively participates in data transfer and leverages DOCA libraries to accelerate certain data operations transparently to both the initiator and the targets.

The doca_storage_comch_to_rdma_gga_offload application performs the following key functions:

  • Orchestrates communication between the initiator and the three storage instances

  • Periodically triggers data recovery flows to simulate data loss

  • Performs data recovery using erasure coding

  • Performs inline data decompression

To accomplish these tasks, the application establishes TCP connections to three storage targets and listens for an incoming connection from a single initiator using doca_comch_server.

The doca_storage_comch_to_rdma_gga_offload application is divided into two main functional areas:

  • Control-time and shared resources

  • Per-thread data path resources

[Figure: GGA offload application objects]

The application execution follows two primary phases:

  • Control phase

  • Data path phase

Control Phase

This phase begins with establishing connections to the required storage targets, followed by awaiting a client connection. Once all connections are in place, the application waits for specific control commands:

  • Query storage

  • Init storage

  • Start storage

Each control command is processed using the following sequence (a code sketch follows the list):

  1. Relay the command to each connected storage target.

  2. Wait for responses from all storage targets.

  3. Perform required post-processing and consistency checks on the responses.

  4. Send a response back to the client.
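
A minimal sketch of this sequence, under stated assumptions: the types and helpers (control_message, storage_connection, client_connection, relay_control_message, await_response, verify_consistent, send_client_response) are hypothetical stand-ins, not names from the application source.

    #include <array>
    #include <cstddef>

    // Sketch only: hypothetical helper and type names.
    void process_control_command(const control_message &cmd,
                                 std::array<storage_connection, 3> &targets,
                                 client_connection &client)
    {
        // 1. Relay the command to each connected storage target.
        for (auto &target : targets)
            relay_control_message(cmd, target);

        // 2. Wait for responses from all storage targets (bounded by
        //    --control-timeout in the real application).
        std::array<control_message, 3> responses;
        for (std::size_t i = 0; i != targets.size(); ++i)
            responses[i] = await_response(targets[i]);

        // 3. Post-process and consistency-check the responses; for example,
        //    query storage verifies all targets report the same storage size.
        verify_consistent(responses);

        // 4. Send a response back to the client.
        send_client_response(cmd, client);
    }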

Issuing the start storage command initiates the data path phase. While the data threads begin execution, the main thread continues to wait for final control commands to complete the application's lifecycle:

  • Stop storage

  • Shutdown

Data Path Phase

This phase is executed per thread and involves each thread performing I/O operations requested by the client. For a read I/O operation, one of two flows is used:

  • Regular read (no recovery)

  • Recovery read

By default, only the regular read flow is executed. However, periodic recovery operations can be enabled by the user; see the "Command-line Flags" section for details.

Regular Read Data Flow

The regular read flow consists of the stages detailed in the following subsections.

1. Initiator Request

  1. The initiator sends an I/O request to the GGA offload application.

  2. The GGA offload application translates the I/O address from the initiator's memory range to the corresponding GGA offload memory range.

    • For example, a read at offset 8 KB in initiator memory is remapped to <GGA-offload-base> + 8 KB.

[Figure: Regular read, step 1 - I/O request]
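
The address translation in step 2 is a plain rebase of the I/O offset from one memory range onto another. A self-contained sketch (the function and parameter names are illustrative, not from the application source):

    #include <cstdint>

    // Rebase an initiator-relative I/O address onto the GGA offload memory
    // range: a read at initiator offset 8 KB lands at <GGA-offload-base> + 8 KB.
    std::uint64_t translate_io_address(std::uint64_t initiator_addr,
                                       std::uint64_t initiator_base,
                                       std::uint64_t offload_base)
    {
        return offload_base + (initiator_addr - initiator_base);
    }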

2. RDMA Transfers

  1. The GGA offload application sends the adjusted read requests to the data 1 and data 2 storage targets.

  2. Each storage target performs an RDMA write into the GGA offload memory region at the designated offsets.

[Figure: Regular read, step 2 - RDMA transfers]

3. Target Responses

  1. Both storage targets send I/O responses upon completing the RDMA transfers.

  2. The GGA offload application waits until both responses are received.

[Figure: Regular read, step 3 - RDMA I/O responses]

4. Decompression and Final Response

  1. The application performs decompression using the received data blocks. Decompression output is written directly to the initiator memory by the doca_compress LZ4 decompress task.

    [Figure: Regular read, step 4 - decompression]

  2. The application sends an I/O response to the initiator, completing the operation.

    [Figure: Regular read, step 5 - I/O response]

Recovery Read Data Flow

This flow is similar to the regular read, but with added steps for error correction using erasure coding. The process involves the stages detailed in the following subsections.

1. Initiator Request

  1. The initiator sends an I/O request to the GGA offload application.

  2. Given that recovery is required, the application:

    1. Translates the I/O address normally.

    2. Modifies one of the two storage requests to target the parity partition instead of a standard data partition.

    3. Adds an output offset field to redirect the parity data into a reserved region of GGA offload memory (see the sketch after the figure).

[Figure: Recovery read, step 1 - I/O request]
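
The rewrite in steps 2-3 can be expressed with the storage::io_message_view setters shown later in this guide. The offset and address variables below are illustrative assumptions for the recover_a case (data 1 lost), not the application's actual variables:

    // Hypothetical sketch: retarget the request that would have fetched the
    // lost data 1 partition so it fetches the parity partition instead...
    storage::io_message_view::set_remote_offset(parity_partition_offset, part_a_io_message);
    // ...and redirect its output into the reserved recovery region of GGA
    // offload memory rather than the regular data placement.
    storage::io_message_view::set_io_address(recovery_region_addr, part_a_io_message);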

2. RDMA Transfers

  1. Two RDMA transfers are issued:

    • One to the surviving data partition.

    • One to the parity partition.

  2. The transferred blocks are written into the GGA offload memory with appropriate alignment.

[Figure: Recovery read, step 2 - RDMA transfers]

3. Target Responses

  1. Both storage targets reply with I/O responses.

  2. The GGA offload application waits for both to complete before proceeding.

[Figure: Recovery read, step 3 - RDMA I/O responses]

4. Data Recovery

  1. The application performs recovery using the available half block and parity data to reconstruct the missing block.

    [Figure: Recovery read, step 4 - EC recovery]
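
The reconstruction itself is carried out by a doca_ec recover task. Conceptually, with two data fragments protected by one parity fragment, recovering the missing fragment resembles the XOR sketch below; this is an illustration only, as the application uses a Cauchy or Vandermonde coding matrix (see --matrix-type):

    #include <cstddef>
    #include <cstdint>

    // Illustrative stand-in for single-parity recovery: the missing fragment
    // is recomputed from the surviving fragment and the parity fragment.
    void recover_fragment(const std::uint8_t *surviving,
                          const std::uint8_t *parity,
                          std::uint8_t *recovered,
                          std::size_t len)
    {
        for (std::size_t i = 0; i != len; ++i)
            recovered[i] = surviving[i] ^ parity[i];
    }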

5. Decompression and Final Response

  1. The recovered data is then decompressed. As with the regular read, decompression output is written directly to initiator memory.

    [Figure: Recovery read, step 5 - decompression]

  2. The application sends an I/O response to the initiator, completing the operation.

    [Figure: Recovery read, step 5 - I/O response]

This application leverages the following DOCA libraries:

  • DOCA Comch

  • DOCA Compress

  • DOCA Erasure Coding

  • DOCA RDMA

This application is compiled as part of the set of storage applications. For compilation instructions, refer to the DOCA Storage Applications page.

Application Execution

Warning

This application can only run on the NVIDIA® BlueField® DPU.

Info

DOCA Storage Comch to RDMA GGA Offload is provided in source form. Therefore, compilation is required before the application can be executed.

  • Application usage instructions:

    Usage: doca_storage_comch_to_rdma_gga_offload [DOCA Flags] [Program Flags]

    DOCA Flags:
      -h, --help                        Print a help synopsis
      -v, --version                     Print program version information
      -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      -j, --json <path>                 Parse all command flags from an input json file

    Program Flags:
      -d, --device                      Device identifier
      -r, --representor                 Device host side representor identifier
      --cpu                             CPU core to which the process affinity can be set
      --data-1-storage                  Storage server addresses in <ip_addr>:<port> format
      --data-2-storage                  Storage server addresses in <ip_addr>:<port> format
      --data-p-storage                  Storage server addresses in <ip_addr>:<port> format
      --matrix-type                     Type of matrix to use. One of: cauchy, vandermonde. Default: vandermonde
      --command-channel-name            Name of the channel used by the doca_comch_client. Default: "doca_storage_comch"
      --control-timeout                 Time (in seconds) to wait while performing control operations. Default: 5
      --trigger-recovery-read-every-n   Trigger a recovery read flow every Nth request. Default: 0 (disabled)

    Info

    This usage printout can be printed to the command line using the -h (or --help) option:


    ./doca_storage_comch_to_rdma_gga_offload -h

    For additional information, refer to section "Command-line Flags".

  • CLI example for running the application on the BlueField:


    ./doca_storage_comch_to_rdma_gga_offload -d 03:00.0 -r 3b:00.0 --data-1-storage 172.17.0.1:12345 --data-2-storage 172.17.0.2:12345 --data-p-storage 172.17.0.3:12345 --cpu 0

    Note

    Both the DOCA Comch device PCIe address (03:00.0) and the DOCA Comch device representor PCIe address (3b:00.0) should match the addresses of the desired PCIe devices.

    Note

    Storage target IP address:port tuples should be updated to refer to the running storage target applications.

  • The application also supports a JSON-based deployment mode in which all command-line arguments are provided through a JSON file:


    ./doca_storage_comch_to_rdma_gga_offload --json [json_file]

    For example:


    ./doca_storage_comch_to_rdma_gga_offload --json doca_storage_comch_to_rdma_gga_offload_params.json

    Note

    Before execution, ensure that the JSON file contains valid configuration parameters, particularly the correct PCIe device addresses required for deployment.
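
    A minimal example of such a parameters file, assembled from the JSON keys listed in the "Command-line Flags" section below (all values are placeholders and must be adjusted to the actual deployment):

    {
      "device": "03:00.0",
      "representor": "3b:00.0",
      "cpu": 0,
      "data-1-storage": "172.17.0.1:12345",
      "data-2-storage": "172.17.0.2:12345",
      "data-p-storage": "172.17.0.3:12345",
      "matrix-type": "vandermonde",
      "command-channel-name": "doca_storage_comch",
      "control-timeout": 5,
      "trigger-recovery-read-every-n": 0
    }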

Command-line Flags

General flags

  • -h, --help

    Print a help synopsis

    JSON content: N/A

  • -v, --version

    Print program version information

    JSON content: N/A

  • -l, --log-level

    Set the log level for the application:

      • DISABLE=10
      • CRITICAL=20
      • ERROR=30
      • WARNING=40
      • INFO=50
      • DEBUG=60
      • TRACE=70 (requires compilation with TRACE log level support)

    JSON content: "log-level": 60

  • --sdk-log-level (no short flag)

    Set the SDK log level for the program:

      • DISABLE=10
      • CRITICAL=20
      • ERROR=30
      • WARNING=40
      • INFO=50
      • DEBUG=60
      • TRACE=70

    JSON content: "sdk-log-level": 40

  • -j, --json

    Parse all command flags from an input JSON file

    JSON content: N/A

Program flags

  • -d, --device

    DOCA device identifier. One of:

      • PCIe address: 3b:00.0
      • InfiniBand name: mlx5_0
      • Network interface name: en3f0pf0sf0

    Note: This flag is mandatory.

    JSON content: "device": "03:00.0"

  • -r, --representor

    DOCA Comch device representor PCIe address

    Note: This flag is mandatory.

    JSON content: "representor": "3b:00.0"

  • --cpu (no short flag)

    Index of the CPU to use. One data path thread is spawned per CPU. Index starts at 0.

    Note: This argument can be specified multiple times to create more threads.

    Note: This flag is mandatory.

    JSON content: "cpu": 6

  • --data-1-storage (no short flag)

    IP address and port used to establish the control TCP connection to the data 1 storage target.

    Note: This flag is mandatory.

    JSON content: "data-1-storage": "172.17.0.1:12345"

  • --data-2-storage (no short flag)

    IP address and port used to establish the control TCP connection to the data 2 storage target.

    Note: This flag is mandatory.

    JSON content: "data-2-storage": "172.17.0.1:12345"

  • --data-p-storage (no short flag)

    IP address and port used to establish the control TCP connection to the parity storage target.

    Note: This flag is mandatory.

    JSON content: "data-p-storage": "172.17.0.1:12345"

  • --matrix-type (no short flag)

    Type of matrix to use. One of:

      • cauchy
      • vandermonde

    JSON content: "matrix-type": "vandermonde"

  • --command-channel-name (no short flag)

    Allows customizing the server name used for this application instance if multiple comch servers exist on the same device.

    JSON content: "command-channel-name": "doca_storage_comch"

  • --control-timeout (no short flag)

    Time, in seconds, to wait while performing control operations.

    JSON content: "control-timeout": 5

  • --trigger-recovery-read-every-n (no short flag)

    Trigger a recovery read flow every Nth request. Set to 0 to disable recovery reads entirely. For example, a value of 1 triggers a recovery read for every request, while a value of 10 triggers a recovery read for one out of every ten requests.

    JSON content: "trigger-recovery-read-every-n": 0


Troubleshooting

Refer to the NVIDIA BlueField Platform Software Troubleshooting Guide for any issue encountered with the installation or execution of the DOCA applications.

Control Phase

  1. gga_offload_app app{parse_cli_args(argc, argv)};

    Parse CLI arguments, apply default values, and create the application instance.

  2. app.connect_to_storage();

    Connect to each of the three storage targets over TCP.

  3. app.wait_for_comch_client_connection();

    Create a doca_comch_server instance and wait for a doca_comch_client to connect.

  4. app.wait_for_and_process_query_storage();

    Wait for the initiator to send a query storage control message, then:

    • Send a query storage message to each storage target

    • Verify that each storage target reports the same storage size

    • Send a query storage response back to the initiator

  5. app.wait_for_and_process_init_storage();

    Wait for the initiator to send an init storage control message, then:

    • Verify that the requested core count does not exceed the available cores

    • Create and export storage memory

    • Send an init storage message to each storage target

    • Wait for responses from all storage targets

    • Create data path resources:

      • Worker threads

      • IO message memory regions

      • doca_pe objects

      • doca_comch_consumer objects

      • doca_comch_producer objects

      • doca_rdma connection objects

      • doca_ec objects

      • doca_compress objects

    • Send an init storage response

  6. app.wait_for_and_process_start_storage();

    Wait for the initiator to send a start storage control message, then:

    • Send a start storage message to each storage target

    • Wait for responses from all storage targets

    • Create task objects

    • Submit listening tasks (doca_comch_consumer and RDMA receive tasks)

    • Signal worker threads to begin processing

    • Send a start storage response

  7. app.wait_for_and_process_stop_storage();

    Wait for the initiator to send a stop storage control message (test complete), then:

    • Send a stop storage message to each storage target

    • Wait for responses from all storage targets

    • Signal worker threads to stop

    • Gather and post-process execution statistics

    • Destroy doca_comch_consumer objects

    • Destroy doca_comch_producer objects

    • Send a stop storage response

  8. app.wait_for_and_process_shutdown();

    Wait for the initiator to send a shutdown control message, then:

    • Send a shutdown message to each storage target

    • Wait for responses from all storage targets

    • Destroy all remaining data path objects

    • Send a shutdown response

  9. app.display_stats();

    Display collected statistics and destroy all control path objects.
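
Taken together, the control phase amounts to the sequence below. This condensed sketch is an assumption about how the calls compose (error handling omitted), not the application's verbatim main():

    int main(int argc, char **argv)
    {
        gga_offload_app app{parse_cli_args(argc, argv)};

        app.connect_to_storage();                 // TCP to the three targets
        app.wait_for_comch_client_connection();   // doca_comch_server accept

        app.wait_for_and_process_query_storage();
        app.wait_for_and_process_init_storage();
        app.wait_for_and_process_start_storage(); // data path threads start here

        app.wait_for_and_process_stop_storage();
        app.wait_for_and_process_shutdown();

        app.display_stats();
        return 0;
    }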

Data Path Phase

  1. while (m_hot_data.run_flag == false) {
         std::this_thread::yield();
         if (m_hot_data.error_flag)
             return;
     }

    The main data thread enters a spin-wait loop, yielding execution until all threads and resources are initialized. If an error is detected (error_flag is set), the thread exits early.

  2. while (m_hot_data.run_flag) {
         doca_pe_progress(m_hot_data.pe) ? ++(m_hot_data.pe_hit_count)
                                         : ++(m_hot_data.pe_miss_count);
     }

    Once started, the thread enters a tight loop, continuously polling the progress engine (doca_pe_progress). Each iteration updates the hit/miss counters based on whether any task completions were triggered. This loop drives the data path by processing task completions as fast as possible.

  3. while (m_hot_data.error_flag == false && m_hot_data.in_flight_transaction_count != 0) {
         doca_pe_progress(m_hot_data.pe) ? ++(m_hot_data.pe_hit_count)
                                         : ++(m_hot_data.pe_miss_count);
     }

    This final loop ensures that all in-flight transactions complete before exiting. It continues polling the progress engine as long as there are active transactions and no error has occurred.

doca_comch_consumer_task_post_recv_cb

This is the comch consumer callback function which initiates the read or recovery flow. It marks the first step in processing the initiator's request, determining which data must be fetched and which two storage targets to query.

  1. doca_comch_consumer_task_post_recv_cb reinterprets the task and context user data, and invokes gga_offload_app_worker::hot_data::start_transaction.

  2. start_transaction prepares two IO requests, depending on the required mode:

    • read: data 1 and data 2

    • recover_a: parity and data 2 (to recover data 1)

    • recover_b: data 1 and parity (to recover data 2)

    • IO request preparation:

      storage::io_message_view::set_correlation_id(cid, part_a_io_message);
      storage::io_message_view::set_correlation_id(cid, part_b_io_message);
      storage::io_message_view::set_correlation_id(cid, response_io_message);

      storage::io_message_view::set_type(type, part_a_io_message);
      storage::io_message_view::set_type(type, part_b_io_message);
      storage::io_message_view::set_type(storage::io_message_type::result, response_io_message);

      storage::io_message_view::set_user_data(user_data, part_a_io_message);
      storage::io_message_view::set_user_data(user_data, part_b_io_message);
      storage::io_message_view::set_user_data(user_data, response_io_message);

      storage::io_message_view::set_io_address(local_io_addr, part_a_io_message);
      storage::io_message_view::set_io_address(local_io_addr + half_block_size, part_b_io_message);
      storage::io_message_view::set_io_address(host_io_addr, response_io_message);

      storage::io_message_view::set_io_size(half_block_size, part_a_io_message);
      storage::io_message_view::set_io_size(half_block_size, part_b_io_message);
      storage::io_message_view::set_io_size(transaction.io_size, response_io_message);

      storage::io_message_view::set_remote_offset(remote_offset_a, part_a_io_message);
      storage::io_message_view::set_remote_offset(remote_offset_b, part_b_io_message);

  3. The transaction mode (read, recover_a, or recover_b) is saved.

  4. The prepared IO requests are then sent to the corresponding storage targets.
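
One plausible way the per-request mode could be derived from --trigger-recovery-read-every-n is sketched below. The function, its parameters, and the alternation between recover_a and recover_b are assumptions, not the application's actual policy:

    #include <cstdint>

    enum class transaction_mode { read, recover_a, recover_b };

    // Every Nth request becomes a recovery read; N == 0 disables recovery.
    transaction_mode select_mode(std::uint64_t request_index, std::uint32_t every_n)
    {
        if (every_n == 0 || request_index % every_n != 0)
            return transaction_mode::read;
        // Alternate which data partition is treated as lost.
        return ((request_index / every_n) % 2 == 0) ? transaction_mode::recover_a
                                                    : transaction_mode::recover_b;
    }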

doca_rdma_task_receive_cb

After each storage target completes its respective data transfer, it sends a response. Once the callback receives the second of the two responses, it checks the transaction mode. If the mode is read, it invokes the decompression step: gga_offload_app_worker::hot_data::start_decompress. Otherwise, it invokes the EC recovery step: gga_offload_app_worker::hot_data::start_recover.
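
A sketch of that rendezvous logic follows; the transaction fields and the callback shape are illustrative assumptions, while start_decompress and start_recover are the entry points named above:

    // Called once per storage target response belonging to this transaction.
    void on_storage_response(transaction &txn, hot_data &hd)
    {
        if (++txn.responses_received != 2)
            return; // first response: keep waiting for the second

        if (txn.mode == transaction_mode::read)
            hd.start_decompress(txn); // regular read: straight to decompression
        else
            hd.start_recover(txn);    // recovery read: run EC recovery first
    }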

doca_ec_task_recover_cb

In scenarios requiring data recovery, the decompression step (gga_offload_app_worker::hot_data::start_decompress) is invoked immediately after the recovery completes.

doca_compress_task_decompress_lz4_stream_cb

Completion of the decompression also entails storing the resulting decompressed data into the remote initiator's memory. At this point, the transaction is considered complete and may be safely reused.
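
A sketch of this completion step (helper and field names are illustrative assumptions; in_flight_transaction_count is the counter drained by the final data path loop shown above):

    void on_decompress_complete(transaction &txn, hot_data &hd)
    {
        // The response message was prepared up front in start_transaction and
        // is sent back to the initiator, e.g. via the doca_comch_producer.
        send_io_response(txn.response_io_message);

        --hd.in_flight_transaction_count; // lets the drain loop finish cleanly
        txn.reset();                      // transaction may now be safely reused
    }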

References

  • /opt/mellanox/doca/applications/storage/

© Copyright 2025, NVIDIA. Last updated on May 22, 2025.