DOCA Documentation v2.9.0

NVIDIA DOCA Storage Zero Copy Initiator Comch Application Guide

DOCA Storage Zero Copy Initiator Comch application (initiator_comch) plays the following roles:

  1. Demonstrates how to utilize the DOCA Comch API (client-server) to communicate configuration between the x86 host and BlueField.

  2. Demonstrates how to utilize the DOCA Comch API (producer-consumer) and hardware acceleration to offload the efficient transfer of messages between the x86 host and BlueField in the data path.

  3. Provides a benchmark for the performance of such an application/use case.

The DOCA Storage Zero Copy Initiator Comch application creates an area of local memory and a set of message buffers, which it uses to instruct the doca_storage_zero_copy_comch_to_rdma (comch_to_rdma) application to perform read and write operations to and from the created local memory region. The initiator_comch application is responsible for providing the memory region details and access details to comch_to_rdma. The initiator_comch application has no knowledge of the specifics of the doca_storage_zero_copy_target_rdma (target_rdma) application and is not directly involved in the actions required to carry out the RDMA operations that effect the transfer of data to and from target_rdma.

Data path objects are created per thread. To maintain simplicity, a single memory region is used, and each thread and its IO messages refer to a different segment of the single exported memory region. Ensuring that each thread uses a separate segment of the exported memory removes the complexity of multi-threaded access to the memory. If desired, users may choose to expand the application to support multiple unique memory regions so there is one per thread.
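The per-thread segmentation described above can be sketched as follows. This is a minimal illustration rather than the application's code: segment and segment_for_thread are hypothetical names, and the arithmetic assumes that every thread owns the same number of equally sized buffers laid out back to back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view of the slice of the shared region owned by one thread.
struct segment {
	uint8_t *begin; // first byte owned by the thread
	size_t size;    // number of bytes owned by the thread
};

// Assumption: each thread owns `buffers_per_thread` buffers of `buffer_size`
// bytes, placed contiguously in thread-index order within `region`.
inline segment segment_for_thread(uint8_t *region,
				  uint32_t thread_idx,
				  uint32_t buffers_per_thread,
				  size_t buffer_size)
{
	assert(region != nullptr);
	size_t const bytes_per_thread = static_cast<size_t>(buffers_per_thread) * buffer_size;
	return segment{region + thread_idx * bytes_per_thread, bytes_per_thread};
}
```

Because the segments never overlap, no thread needs a lock to touch its own buffers.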

[Figure: Initiator system design]

DOCA Storage Zero Copy Initiator Comch executes in three stages:

  1. Preparation.

  2. Data path.

  3. Teardown.

Preparation Stage

During this stage the application performs the following:

  1. Allocates the required DOCA objects and memory for the control path.

  2. Creates a DOCA Comch client and connects to comch_to_rdma.

  3. Sends a "configure data path" control message (buffer count, buffer size, doca_mmap export details) to comch_to_rdma.

  4. Waits for a "configure data path" control message response from comch_to_rdma.

  5. Creates data path objects.

  6. Sends a "start data path connections" control message to comch_to_rdma.

  7. Waits for a "start data path connections" control message response from comch_to_rdma.

  8. Populates all IO messages with the necessary data.

  9. Sends a "start storage" control message to comch_to_rdma.

  10. Waits for a "start storage" control message response from comch_to_rdma.
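The request/response ordering of the preparation stage can be sketched as a small tracker that only allows the next control message to be sent once the previous one has been answered. This is an illustrative sketch under stated assumptions: preparation_tracker is a hypothetical helper, and the enumerator names other than start_storage are made up for the example.

```cpp
#include <array>
#include <cstddef>

// Hypothetical control message identifiers for the preparation stage.
enum class control_message_type {
	configure_data_path,
	start_data_path_connections,
	start_storage,
};

// Order in which the preparation stage sends its control messages.
constexpr std::array<control_message_type, 3> preparation_sequence{
	control_message_type::configure_data_path,
	control_message_type::start_data_path_connections,
	control_message_type::start_storage,
};

class preparation_tracker {
public:
	// Returns true only if `msg` is the next message in the sequence and
	// no earlier message is still waiting for its response.
	bool send(control_message_type msg)
	{
		if (m_awaiting_response || m_next == preparation_sequence.size() ||
		    preparation_sequence[m_next] != msg)
			return false;
		m_awaiting_response = true;
		return true;
	}

	// Records the response to the outstanding message.
	bool on_response()
	{
		if (!m_awaiting_response)
			return false;
		m_awaiting_response = false;
		++m_next;
		return true;
	}

	// True once every message has been sent and answered.
	bool ready() const { return m_next == preparation_sequence.size() && !m_awaiting_response; }

private:
	size_t m_next = 0;
	bool m_awaiting_response = false;
};
```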

[Figure: Initiator preparation stage]

Data Path Stage

The data path stage serves as both an example and a built-in benchmark and uses only data path objects. No control path objects or code are used during this stage.

The benchmark begins by submitting all tasks as quickly as possible to start all the transactions, then polls the progress engine (PE) as quickly as possible. Each thread executes the same data path function. As each task completes, it decrements the transaction reference count. Once this value reaches 0, the transaction can start again. The reference count is required because there are no temporal guarantees between DOCA Comch producer and consumer event callbacks: it is possible to be notified of the consumer completion before being notified of the producer send completion. Once a thread has completed its required number of transactions (the total transaction run limit specified by --run-limit-operation-count, divided by the number of threads), that thread exits. Once all threads have joined, the application sends a stop IO message and moves on to the teardown stage.
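The reference counting scheme can be sketched as follows. This is a simplified illustration, not the application's implementation: transaction, begin_transaction, on_task_complete, and per_thread_run_limit are hypothetical names, each transaction is assumed to consist of exactly one producer send and one consumer receive, and the per-thread limit is assumed to use integer division.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical transaction: one producer send plus one consumer receive.
struct transaction {
	uint32_t refcount = 0;       // completions still outstanding
	uint32_t completed_runs = 0; // number of fully completed runs
};

// Arm the transaction: two completions (send and receive) are expected.
inline void begin_transaction(transaction &t)
{
	t.refcount = 2;
}

// Called from either completion callback, in whichever order they arrive.
// Returns true when both completions have been seen, meaning the
// transaction may be started again.
inline bool on_task_complete(transaction &t)
{
	assert(t.refcount != 0);
	if (--t.refcount != 0)
		return false;
	++t.completed_runs;
	return true;
}

// Assumption: the run limit is split evenly using integer division.
inline uint32_t per_thread_run_limit(uint32_t run_limit_operation_count, uint32_t thread_count)
{
	assert(thread_count != 0);
	return run_limit_operation_count / thread_count;
}
```

Since the counter only cares about how many completions have arrived, the transaction restarts correctly whether the consumer or the producer callback fires first.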

[Figure: Initiator data path stage]

Teardown Stage

To tear down, the application performs the following:

  1. Displays execution statistics.

  2. Sends a "destroy objects" control message to comch_to_rdma.

  3. Waits for a "destroy objects" control message response from comch_to_rdma.

  4. Destroys data path objects.

  5. Destroys control path objects.

  6. Destroys any other allocated memory/objects.

This application leverages the following DOCA libraries:

This application is compiled as part of the set of storage zero copy applications. For compilation instructions, refer to NVIDIA DOCA Storage Zero Copy.

Application Execution

Info

This application can only be run on the host.

DOCA Storage Zero Copy Initiator Comch is provided in source form. Therefore, it must be compiled before it can be executed.

  • Application usage instructions:

    Usage: doca_storage_zero_copy_initiator_comch [DOCA Flags] [Program Flags]

    DOCA Flags:
      -h, --help                        Print a help synopsis
      -v, --version                     Print program version information
      -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      -j, --json <path>                 Parse all command flags from an input json file

    Program Flags:
      -d, --device                      Device identifier
      --operation                       Operation to perform. One of: read|write
      --run-limit-operation-count       Run N operations then stop
      --cpu                             CPU core to which the process affinity can be set
      --per-cpu-buffer-count            Number of memory buffers to create. Default: 64
      --buffer-size                     Size of each created buffer. Default: 4096
      --validate-writes                 Enable validation of writes operations by reading them back afterwards. Default: false
      --command-channel-name            Name of the channel used by the doca_comch_client. Default: storage_zero_copy_comch
      --control-timeout                 Time (in seconds) to wait while performing control operations. Default: 10
      --batch-size                      Batch size: Default: ${per-cpu-buffer-count} / 2

    Info

    This usage printout can be printed to the command line using the -h (or --help) options:

    ./doca_storage_zero_copy_initiator_comch -h

    For additional information, refer to section "Command Line Flags".

  • CLI example for running the application on the host:

    ./doca_storage_zero_copy_initiator_comch -d 3b:00.0 --operation read --run-limit-operation-count 10000000 --cpu 5

    Info

    The DOCA device PCIe address, 3b:00.0, should match the address of the desired PCIe device.

  • The application also supports a JSON-based deployment mode, in which all command-line arguments are provided through a JSON file:

    ./doca_storage_zero_copy_initiator_comch --json [json_file]

    For example:

    ./doca_storage_zero_copy_initiator_comch --json doca_storage_reference_zero_copy_host_params.json

    Note

    Before execution, ensure that the used JSON file contains the correct configuration parameters, and especially the PCIe addresses necessary for the deployment.

Command Line Flags

| Flag Type | Short Flag | Long Flag/JSON Key | Description | JSON Content |
| --- | --- | --- | --- | --- |
| General flags | h | help | Print a help synopsis | N/A |
| General flags | v | version | Print program version information | N/A |
| General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (requires compilation with TRACE log level support) | "log-level": 60 |
| General flags | N/A | sdk-log-level | Set the SDK log level for the program: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 | "sdk-log-level": 40 |
| General flags | j | json | Parse all command flags from an input JSON file | N/A |
| Program flags | d | device | DOCA device identifier. One of: PCIe address (3b:00.0), InfiniBand name (mlx5_0), network interface name (en3f0pf0sf0). This flag is mandatory. | "device": "3b:00.0" |
| Program flags | N/A | operation | Operation to perform: either read or write. This flag is mandatory. | "operation": "read" |
| Program flags | N/A | run-limit-operation-count | Run N operations (transactions) then stop. This flag is mandatory. | "run-limit-operation-count": 10000000 |
| Program flags | N/A | cpu | Index of CPU to use. One data path thread is spawned per CPU. Index starts at 0. This argument can be specified multiple times to create more threads. This flag is mandatory. | "cpu": 6 |
| Program flags | N/A | per-cpu-buffer-count | Number of buffers (all buffers execute in parallel) to use per CPU | "per-cpu-buffer-count": 64 |
| Program flags | N/A | buffer-size | Size of buffer to use for data transfers. Should be a value representative of a disk block size. | "buffer-size": 4096 |
| Program flags | N/A | validate-writes | Run a functional test instead of a performance test. Only compatible with write operation mode. | "validate-writes": true |
| Program flags | N/A | command-channel-name | Allows customizing the server name used for this application instance if multiple comch servers exist on the same device | "command-channel-name": "storage_zero_copy_comch" |
| Program flags | N/A | control-timeout | Timeout (in seconds) for a control operation to complete. If any control operation exceeds this time, the application aborts. | "control-timeout": 10 |
| Program flags | N/A | batch-size | Batch size to use when submitting tasks using the batched API | "batch-size": 8 |


Troubleshooting

Refer to the NVIDIA DOCA Troubleshooting Guide for any issue encountered with the installation or execution of the DOCA applications.

Control Thread Flow

  1. Parse application arguments:

    auto const cfg = parse_cli_args(argc, argv);

    1. Prepare the parser (doca_argp_init).

    2. Register parameters (doca_argp_param_create).

    3. Parse the arguments (doca_argp_start).

    4. Destroy the parser (doca_argp_destroy).

  2. Display the configuration:

    print_config(cfg);

  3. Create application instance:

    g_app.reset(storage::zero_copy::make_host_application(cfg));

  4. Run the application:

    g_app->run();

    1. Find and open the specified device:

      m_dev = storage::common::open_device(m_cfg.device_id);

    2. Create control path progress engine:

      doca_pe_create(&m_ctrl_pe);

    3. Create comch control objects:

      create_comch_control();

    4. Connect to comch server:

      connect_comch_control();

    5. Configure storage:

      configure_storage();

      1. Allocate local memory region.

      2. Create doca_mmap.

      3. Send configure data path control message to comch_to_rdma.

      4. Wait for a configure data path control message response from comch_to_rdma.

    6. Prepare data path:

      prepare_data_path();

      1. Create per thread data context:

        1. Create IO messages.

        2. Create transaction objects.

        3. Create progress engine.

        4. Create mmap for IO message buffers.

        5. Create Comch producer.

        6. Create Comch consumer.

      2. Send start data path connections control message to comch_to_rdma.

      3. Wait for a start data path connections control message response from comch_to_rdma.

      4. Poll progress engine until:

        1. Remote consumer ID values have been received.

        2. All consumers are running.

        3. All producers are running.

    7. Create tasks:

      m_thread_contexts[ii].create_tasks(m_raw_io_data + (ii * per_thread_task_count * m_cfg.buffer_size),
                                         m_cfg.buffer_size,
                                         m_remote_consumer_ids[ii],
                                         op_type,
                                         m_cfg.batch_size);

    8. Create threads:

      if (op_type == io_message_type::read) {
          m_thread_contexts[ii].thread = std::thread{&thread_hot_data::non_validated_test,
                                                     std::addressof(m_thread_contexts[ii].hot_context)};
      } else if (op_type == io_message_type::write) {
          if (m_cfg.validate_writes) {
              m_thread_contexts[ii].thread = std::thread{&thread_hot_data::validated_test,
                                                         std::addressof(m_thread_contexts[ii].hot_context)};
          } else {
              m_thread_contexts[ii].thread = std::thread{&thread_hot_data::non_validated_test,
                                                         std::addressof(m_thread_contexts[ii].hot_context)};
          }
      }

    9. Start the data path:

      wait_for_control_response(send_control_message(control_message_type::start_storage));

    10. Record the start time.

    11. Submit initial DOCA Comch consumer tasks.

    12. Start data path threads.

    13. Wait for all threads to complete.

    14. Record the end time.

    15. Stop storage.

    16. Shutdown.

  5. Display statistics:

    printf("+================================================+\n");
    printf("| Stats\n");
    printf("+================================================+\n");
    printf("| Duration (seconds): %2.06lf\n", duration_secs_float);
    printf("| Operation count: %u\n", stats.operation_count);
    printf("| Data rate: %.03lf GiB/s\n", GiBs / duration_secs_float);
    printf("| IO rate: %.03lf MIOP/s\n", miops);
    printf("| PE hit rate: %2.03lf%% (%lu:%lu)\n", pe_hit_rate_pct, stats.pe_hit_count, stats.pe_miss_count);
    printf("| Latency:\n");
    printf("| \tMin: %uus\n", stats.latency_min);
    printf("| \tMax: %uus\n", stats.latency_max);
    printf("| \tMean: %uus\n", stats.latency_mean);
    printf("+================================================+\n");
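The data rate and IO rate figures in the statistics printout could be derived along the following lines. This is a hedged sketch: throughput and compute_throughput are hypothetical names, and it assumes each operation transfers exactly one buffer of buffer_size bytes, which the source does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the two throughput figures.
struct throughput {
	double gib_per_sec; // data rate in GiB/s
	double miops;       // IO rate in millions of operations per second
};

// Assumption: every operation moves exactly one buffer of `buffer_size` bytes.
inline throughput compute_throughput(uint64_t operation_count, uint64_t buffer_size, double duration_secs)
{
	assert(duration_secs > 0.0);
	double const gib = static_cast<double>(operation_count * buffer_size) / (1024.0 * 1024.0 * 1024.0);
	double const miops = (static_cast<double>(operation_count) / duration_secs) / 1000000.0;
	return throughput{gib / duration_secs, miops};
}
```

For example, 262144 operations of 4096 bytes completed in one second amount to exactly 1 GiB/s.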

Performance Data Path Thread Flow

  1. Start transactions:

    for (uint32_t ii = 0; ii != transactions_size; ++ii)
        start_transaction(transactions[ii], std::chrono::steady_clock::now());

  2. Run until N operations have been completed:

    while (run_flag) {
        doca_pe_progress(data_pe) ? ++(pe_hit_count) : ++(pe_miss_count);
    }
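The polling loop above also feeds the "PE hit rate" statistic. The sketch below reproduces that bookkeeping without any DOCA dependency: poll_stats, poll_until_done, and pe_hit_rate_pct are hypothetical names, the progress callable stands in for doca_pe_progress(data_pe), and the keep_running callable stands in for the run_flag check.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical counters mirroring pe_hit_count / pe_miss_count.
struct poll_stats {
	uint64_t pe_hit_count = 0;  // polls that processed at least one event
	uint64_t pe_miss_count = 0; // polls that found nothing to do
};

// `progress` returns non-zero when an event was processed (stand-in for
// doca_pe_progress); `keep_running` is a stand-in for the run_flag check.
inline poll_stats poll_until_done(std::function<uint8_t()> const &progress,
				  std::function<bool()> const &keep_running)
{
	poll_stats stats;
	while (keep_running()) {
		progress() ? ++stats.pe_hit_count : ++stats.pe_miss_count;
	}
	return stats;
}

// Percentage of polls that found work; 0 when no polls were made.
inline double pe_hit_rate_pct(poll_stats const &stats)
{
	uint64_t const total = stats.pe_hit_count + stats.pe_miss_count;
	return total == 0 ? 0.0 : 100.0 * static_cast<double>(stats.pe_hit_count) / static_cast<double>(total);
}
```

A low hit rate suggests the thread spends most of its time polling an idle progress engine.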

Functional Data Path Thread Flow

  1. Determine the number of iterations to execute (each iteration is up to --per-cpu-buffer-count transactions):

    uint32_t const iteration_count =
        (remaining_tx_ops / transactions_size) + ((remaining_tx_ops % transactions_size) == 0 ? 0 : 1);

  2. For each iteration:

    1. Set data in local memory region to a fixed pattern.

    2. Set all transactions to write mode:

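      void thread_hot_data::set_operation(io_message_type operation)
      {
          for (uint32_t ii = 0; ii != transactions_size; ++ii) {
              // Fetch the IO message bytes backing this transaction's request.
              // The operation type is written into this message (the setter call is not shown in this excerpt).
              auto *io_message = const_cast<char *>(storage::common::get_buffer_bytes(
                  doca_comch_producer_task_send_get_buf(transactions[ii].request)));
          }
      }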

    3. Start all transactions.

    4. Poll the PE until all transactions complete.

    5. Set data in local memory region to an alternative fixed pattern.

    6. Set all transactions to read mode.

    7. Start all transactions.

    8. Poll the PE until all transactions complete.

    9. Validate that all data in local memory region has been modified and reflects the original data pattern and not the alternative pattern.
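The functional flow above can be approximated with an in-memory stand-in for the remote storage. This is a sketch under stated assumptions, not the application's code: validated_iteration, the fake_storage vector, and the two pattern values are hypothetical, and the "write" and "read" steps are plain copies rather than Comch/RDMA transactions. The iteration_count helper wraps the ceiling division shown in step 1.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical fill patterns; the real values are not given in this guide.
constexpr uint8_t original_pattern = 0xAA;
constexpr uint8_t alternative_pattern = 0x55;

// Ceiling division: number of iterations needed to cover all operations.
inline uint32_t iteration_count(uint32_t remaining_tx_ops, uint32_t transactions_size)
{
	return (remaining_tx_ops / transactions_size) + ((remaining_tx_ops % transactions_size) == 0 ? 0 : 1);
}

// One validated iteration against an in-memory stand-in for remote storage.
inline bool validated_iteration(std::vector<uint8_t> &local, std::vector<uint8_t> &fake_storage)
{
	// Fill local memory with the original pattern and "write" it out.
	std::fill(local.begin(), local.end(), original_pattern);
	fake_storage = local;

	// Overwrite local memory so a read that does nothing is detectable.
	std::fill(local.begin(), local.end(), alternative_pattern);

	// "Read" the data back and check every byte holds the original pattern.
	local = fake_storage;
	return std::all_of(local.begin(), local.end(), [](uint8_t b) { return b == original_pattern; });
}
```

Overwriting local memory with the alternative pattern before the read is what makes the check meaningful: if the read silently failed, the buffer would still contain the alternative pattern and validation would fail.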

  • /opt/mellanox/doca/applications/storage/

© Copyright 2024, NVIDIA. Last updated on Nov 19, 2024.