DOCA Bench allows users to specify a series of operations to be performed and then scale that workload across multiple CPU cores/threads to estimate how the workload performs and to gain some insight into which stage(s), if any, cause performance problems. The user can then modify various configuration properties to explore how the workload can be tuned to better serve their needs.

When running, DOCA Bench creates a number of execution threads, each with its affinity set to a CPU core selected by the user. Each thread creates its own jobs pool (with job data initialized by a data provider) and its own pipeline of workload steps.

Many factors are involved when carrying out performance tests. One of them is CPU selection:

The user should consider NUMA regions when selecting which cores to use, as using a CPU that is distant from the device under test can reduce the achievable performance.

The user may also wish to avoid core 0 as this is typically the default core for kernel interrupt handlers.

Note CPU core selection has an impact on the total memory footprint of the test. See section "Test Memory Footprint" for more details.

Default value: 0x02

Core mask is the simplest way to specify which cores to use but is limited in that it can only specify up to 32 CPUs (0-31). Usage example: --core-mask 0xF001 selects CPU cores 0, 12, 13, 14, and 15.

Core list can specify any/all CPU cores in a given system as a list, range, or combination of the two. Usage example: --core-list 0,3,6-10 selects CPU cores 0, 3, 6, 7, 8, 9, and 10.

The user can select the first N cores from a given core set (list or mask) if desired. Usage example: --core-count N.
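The selection rules above can be illustrated with a short Python sketch. This is not DOCA Bench code: the helper names are hypothetical, and it assumes that "first N cores" means the N lowest-numbered entries of a mask or the first N entries of a list.

```python
def cores_from_mask(mask: int) -> list[int]:
    """Return the indices of the set bits in a core mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def cores_from_list(spec: str) -> list[int]:
    """Expand a spec such as "0,3,6-10" into individual core numbers."""
    cores = []
    for item in spec.split(","):
        if "-" in item:
            first, last = item.split("-")
            cores.extend(range(int(first), int(last) + 1))
        else:
            cores.append(int(item))
    return cores


def first_n(cores: list[int], count: int) -> list[int]:
    """Keep only the first `count` cores of a selection (assumed ordering)."""
    return cores[:count]


print(cores_from_mask(0xF001))                  # [0, 12, 13, 14, 15]
print(cores_from_list("0,3,6-10"))              # [0, 3, 6, 7, 8, 9, 10]
print(first_n(cores_from_list("0,3,6-10"), 3))  # [0, 3, 6]
```

Applied to the default mask 0x02, the same helper yields core 1 only.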

Info Sweep testing is supported. See section "Sweep Tests" for more details.





To test the impact of contention within a single CPU core, the user can set this value so that N threads, rather than one, are created for each selected core, each with its affinity mask set to that core. For example, 3 cores and 2 threads per core create 6 threads in total.

Info Sweep testing is supported. See section "Sweep Tests" for more details.

The test requires the use of at least one BlueField to execute. With remote system testing, a second device may be required.

Specify the device to use from the perspective of the system under test. The value can be any one of the device PCIe address (e.g., 03:00.0), the device IB device name (e.g., mlx5_0), or the device interface name (e.g., ens4f0).

This option is used only when performing remote memory operations between a BlueField device and its host using DOCA Comch. This is typically automated by the companion connection string but exists for some developer debug use-cases.

Info This option used to be important before the companion connection string property was introduced but now is rarely used.

DOCA Bench supports multiple methods of acquiring data to use to initialize job buffers. The user can also configure the output/intermediate buffers associated with each job.

Info Input data and buffer size configuration has an impact on the total memory footprint of the test. See section "Test Memory Footprint" for more details.

DOCA Bench supports several different input data sources:

file

file-set

random-data

The file data provider produces uniform/non-structured data buffers by using a single input file. The input data is striped and/or repeated to fill each data buffer as required, returning to the start of the file each time it is exhausted to collect more data. This is desirable when the component(s) under test are expected to show different performance characteristics depending on the input data supplied.

For example, doca_dma and doca_sha execute in constant time regardless of the input data, whereas doca_compress is faster for data with more duplication, slower for truly random data, and produces different output depending on the input data.

Given a small input file (i.e., smaller than the data buffer size), the file contents are repeated until the buffer is filled and then continue into the next buffer(s). So, if the input file contained the data 012345 and the user requested two 20-byte buffers, the buffers would appear as follows:

01234501234501234501

23450123450123450123

Given a large input file (i.e., greater than the data buffer size), the file contents are distributed across the data buffers. If the input file contained the data 0123456789abcdef and the user requested three 12-byte buffers, the buffers would appear as follows:

0123456789ab

cdef01234567

89abcdef0123
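Both examples follow from the same wrap-around rule. The Python sketch below reproduces them; it is only an illustration of the described behavior, and the function name fill_buffers is hypothetical rather than part of DOCA Bench.

```python
def fill_buffers(file_data: bytes, buffer_size: int, buffer_count: int) -> list[bytes]:
    """Fill fixed-size buffers from file_data, wrapping to the start when exhausted."""
    if not file_data:
        raise ValueError("input file must not be empty")
    buffers = []
    offset = 0  # read position in the file, carried over from buffer to buffer
    for _ in range(buffer_count):
        chunk = bytearray()
        while len(chunk) < buffer_size:
            # Take as much as fits in the buffer or remains in the file
            take = min(buffer_size - len(chunk), len(file_data) - offset)
            chunk += file_data[offset:offset + take]
            offset = (offset + take) % len(file_data)
        buffers.append(bytes(chunk))
    return buffers


# Small file, two 20-byte buffers
for buf in fill_buffers(b"012345", 20, 2):
    print(buf.decode())

# Large file, three 12-byte buffers
for buf in fill_buffers(b"0123456789abcdef", 12, 3):
    print(buf.decode())
```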

The file set data provider produces structured data. The file set input file is itself a file containing one or more filenames (relative to the input "command working directory (cwd)", not relative to the file set file). Each file listed inside the file set has its entire contents used as a job buffer. This is useful for operations where the data must be a complete, valid data block for the operation to succeed, such as decompression with doca_compress or decryption with doca_aes.

Given a file set in the "command working directory (cwd)" referring to data_1.bin and data_2.bin (one filename per line), where data_1.bin contains 33 bytes and data_2.bin contains 69 bytes, the job buffers are filled from these two files in a round-robin manner until all buffers are populated. Unlike uniform (non-structured) data, each task can have a different length.
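A minimal sketch of the round-robin assignment, assuming each job buffer receives the whole contents of the next file in the set. The helper name is hypothetical, and in-memory byte strings stand in for data_1.bin and data_2.bin.

```python
from itertools import cycle, islice


def job_buffers_from_file_set(file_contents: list[bytes], job_count: int) -> list[bytes]:
    """Assign whole files to job buffers, cycling through the set in order."""
    return list(islice(cycle(file_contents), job_count))


data_1 = bytes(33)  # stand-in for data_1.bin (33 bytes)
data_2 = bytes(69)  # stand-in for data_2.bin (69 bytes)

buffers = job_buffers_from_file_set([data_1, data_2], 5)
print([len(b) for b in buffers])  # [33, 69, 33, 69, 33]
```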

The random-data data provider provides uniform (non-structured) data from a random data source. Each buffer has unique (pseudo) random content.

Default value: 128

Each thread in DOCA Bench has its own allocation of job data buffers to avoid memory contention issues. Users may select how many jobs should be created per thread using this parameter.

Info Sweep testing is supported. See section "Sweep Tests" for more details.





For data providers that use an input file, the filename can be specified here. The filename is relative to the input_cwd.

Info Sweep testing is supported. See section "Sweep Tests" for more details.





Specify the size of uniform input buffers (in bytes) that should be created.

Note Does not apply and should not be specified when using structured data input sources.

Info Sweep testing is supported. See section "Sweep Tests" for more details.





Default value: 16384

Specify the size of output/intermediate buffers (in bytes). Each job has 3 buffers: an immutable input buffer and two output/intermediate buffers. This allows a pipeline to mutate the data any number of times, lets each step use the mutated data produced by the previous step, and allows the job to be reset and re-used at the end.

To ease configuration management, the user may opt to use a separate folder for the input data for a given scenario outside of the DOCA build/install directory.

Tip It is recommended to use relative file paths for the input files.

Consider a user executing DOCA Bench from /home/bob/doca/build. Values specified in --data-provider-input-file and filenames within a file set are resolved relative to the shell's "command working directory (cwd)": /home/bob/doca/build. Their command might look something like:

doca_bench --data-provider file-set --data-provider-input-file my_file_set.txt

Assuming my_file_set.txt contains data_1.bin, the files that would be loaded by DOCA Bench after path resolution would be:

/home/bob/doca/build/my_file_set.txt

/home/bob/doca/build/data_1.bin

Now consider the user executing that same test from one level up, with something like:

build/doca_bench --data-provider file-set --data-provider-input-file build/my_file_set.txt

The files to be loaded would be:

/home/bob/doca/build/my_file_set.txt

/home/bob/doca/data_1.bin

Notice how both files were loaded relative to the "command working directory (cwd)" and the data file was not loaded relative to the file set.

The user can solve this easily by keeping all input files in a single directory and then referring to that directory using the parameter input-cwd. In this case, the command line may look something like:

build/doca_bench --data-provider file-set --data-provider-input-file my_file_set.txt --input-cwd build

Note that the value for --data-provider-input-file also changed to be relative to the new "command working directory (cwd)".

The files loaded this time are back to being what is expected:

/home/bob/doca/build/my_file_set.txt

/home/bob/doca/build/data_1.bin
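The three scenarios above can be reproduced with a small path-resolution sketch. The helper resolve_input is hypothetical and only models the rule described in this section: every input path is joined to the shell's working directory, or to the input-cwd directory when one is given. How DOCA Bench normalizes paths internally is not stated here.

```python
import posixpath


def resolve_input(shell_cwd: str, filename: str, input_cwd: str = "") -> str:
    """Resolve an input filename against the effective input directory."""
    base = posixpath.join(shell_cwd, input_cwd) if input_cwd else shell_cwd
    return posixpath.normpath(posixpath.join(base, filename))


# Run from /home/bob/doca/build
print(resolve_input("/home/bob/doca/build", "my_file_set.txt"))
print(resolve_input("/home/bob/doca/build", "data_1.bin"))

# Run from one level up: the data file is no longer found next to the file set
print(resolve_input("/home/bob/doca", "build/my_file_set.txt"))
print(resolve_input("/home/bob/doca", "data_1.bin"))

# Run from one level up with --input-cwd build
print(resolve_input("/home/bob/doca", "my_file_set.txt", "build"))
print(resolve_input("/home/bob/doca", "data_1.bin", "build"))
```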

DOCA Bench supports multiple test modes and run execution limits to allow the user to configure the test type and duration.

Default value: throughput

Select which type of test is to be performed.

Throughput mode is optimized to increase the volume of data processed in a given period with little or no regard for latency impact. Throughput mode tries to keep each component under test as busy as possible. A summary of the bandwidth and job execution rate is provided as output.

Bulk latency mode strikes a balance between throughput and latency, submitting a batch of jobs and waiting for them all to complete to measure the latency of each job. This mode uses a bucketing mechanism to allow DOCA Bench to handle many millions of jobs' worth of results. DOCA Bench keeps a count of the number of jobs that complete within each bucket to allow it to run for long periods of time. A summary of the distribution of results, with an ASCII histogram of the results, is provided as output. The latency reported is the time between the first job submission (for a batch of jobs) and receipt of the final job response (for that same batch of jobs).

Precision latency mode executes one job at a time to allow DOCA Bench to calculate the minimum possible latency of the jobs. This causes the components which can process many jobs in parallel to be vastly underutilized and so greatly reduces bandwidth. As this mode records every result individually, it should not be used to execute more than several thousand jobs. Precision latency mode requires 8 bytes of storage for each result, so be mindful of the memory overhead of the number of jobs to be executed.

A statistical analysis including the minimum, maximum, mean, median, and some percentiles of the latency values is provided as output.

Default value: 100ms,10ms

Only applicable to bulk-latency mode. Allows the user to specify the starting value of the buckets and the width of each bucket. There are 100 buckets of the given size, plus an underflow bucket and an overflow bucket for results that fall outside of the central range.

For example:

--latency-bucket-range 10us,100us

This would start with the lowest bucket measuring response times below 10μs, followed by 100 buckets that are each 100μs wide, and a final bucket for results taking longer than 10010μs.
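The arithmetic behind that example can be checked with a short sketch. The function names are hypothetical, and the treatment of a result that lands exactly on a bucket edge is an assumption, since this section does not specify it.

```python
def upper_edge(start_us: int, width_us: int, buckets: int = 100) -> int:
    """Latency at which the overflow bucket begins."""
    return start_us + buckets * width_us


def bucket_index(latency_us: int, start_us: int, width_us: int, buckets: int = 100) -> int:
    """Map a latency to a bucket: 0 = underflow, 1..buckets = central, buckets + 1 = overflow."""
    if latency_us < start_us:
        return 0
    index = (latency_us - start_us) // width_us
    return min(index, buckets) + 1


print(upper_edge(10, 100))           # 10010
print(bucket_index(5, 10, 100))      # 0   (underflow)
print(bucket_index(10, 10, 100))     # 1   (first central bucket)
print(bucket_index(10009, 10, 100))  # 100 (last central bucket)
print(bucket_index(20000, 10, 100))  # 101 (overflow)
```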

DOCA supports two methods of waiting on completion of tasks:

Busy-wait (or polling) mode

Notification-driven mode

Info Refer to the "DOCA SDK Architecture" documentation for more information.

By default, DOCA Bench uses busy-wait mode to ensure maximum bandwidth (and low latency) for any given pipeline and its tasks, at the cost of high utilization of any allocated CPU resources.

As with all high-performance software that utilizes GGAs or hardware accelerators, performance is usually CPU-bound at smaller packet sizes (i.e., at smaller payload sizes, the CPU spends a long time generating tasks and dealing with completions). For larger packet sizes, the CPU submits fewer tasks, as each task contains more data. It may therefore easily submit more data than the GGA or hardware accelerator can accept, resulting in periods where the CPU is busy-waiting on completions before being able to submit further tasks.

Info To execute any tests using notification-driven mode, use the options detailed in the following subsections.

This option causes DOCA Bench to use the notification-driven method of waiting on task completion.

Note At smaller packet sizes, the benchmark may still be CPU bound.





If specified, this option reports CPU statistics for any CPU cores DOCA Bench is executing on. This provides guidance on how much CPU time is returned, and thus available to other processes or threads, should the "notification-driven" mode be active.

Note Short-duration tests may not produce enough data to generate CPU usage statistics.

The statistics provided include min, max, median, and mean values for the CPU usage. Also included are a number of percentile results, showing 90th, 95th, and a number of 99th percentile values. Example output:

CPU Usage stats
min: 25%
max: 50%
median: 50%
mean: 45.8333%
90th %ile: 50%
95th %ile: 50%
99th %ile: 50%
99.9th %ile: 50%
99.99th %ile: 50%

By default, a test runs forever. This is typically undesirable so the user can specify a limit to the test.

Note Precision-latency mode only supports job-limited execution.

Runs the test for N seconds as specified by the user.

Runs the test until at least N jobs have been submitted, then allows in-flight jobs to complete before exiting. More than N jobs may be executed, depending on the batch size.

Runs the test until at least N bytes of data have been submitted, then allows in-flight jobs to complete before exiting. More data may be processed than requested if the limit is not a multiple of the job input buffer size.
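The overshoot on both limits is a rounding effect. The sketch below illustrates it with hypothetical helpers; it assumes that a job limit is rounded up to a whole number of batches and that a byte limit is rounded up to a whole number of input buffers, which is one plausible reading of the descriptions above rather than a statement of DOCA Bench internals.

```python
def jobs_for_job_limit(job_limit: int, batch_size: int = 1) -> int:
    """Jobs actually submitted when submission happens in whole batches (assumed)."""
    batches = (job_limit + batch_size - 1) // batch_size  # ceiling division
    return batches * batch_size


def bytes_for_byte_limit(byte_limit: int, buffer_size: int) -> int:
    """Bytes actually submitted when each job carries one full input buffer (assumed)."""
    jobs = (byte_limit + buffer_size - 1) // buffer_size  # ceiling division
    return jobs * buffer_size


print(jobs_for_job_limit(10, batch_size=4))  # 12 jobs for a limit of 10
print(bytes_for_byte_limit(10000, 4096))     # 12288 bytes for a limit of 10000
print(bytes_for_byte_limit(8192, 4096))      # 8192, the limit is an exact multiple
```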

DOCA Bench supports submission of jobs in batches to improve performance.

Default value: none

Specifies the batching mode to use. The following options are available:

--batch-mode none – no batching, ring the doorbell after each job is submitted

--batch-mode batch-submit – submit a batch of jobs and then ring the doorbell. The number of jobs to be submitted is specified by the --batch-size parameter. If the number of jobs left to be submitted is fewer than batch-size , then the doorbell is rung after the last job is submitted.

--batch-mode batch-submit-minimal-reports – submit a batch of jobs and ring the doorbell, but defer the completion callback of the job until a later time. This reduces the number of completion notifications received after submitting a batch of jobs, because a single completion notification is received for the entire batch.

Default value: 1

Specifies the number of jobs to be included in each batch before ringing the doorbell.
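To make the doorbell behavior concrete, the sketch below lists the job counts after which the doorbell would be rung under the rules described for none and batch-submit. The function doorbell_points is hypothetical; a batch size of 1 models the none mode.

```python
def doorbell_points(job_count: int, batch_size: int = 1) -> list[int]:
    """Return the 1-based job numbers after which the doorbell is rung."""
    points = list(range(batch_size, job_count + 1, batch_size))
    if job_count % batch_size:
        # Fewer than batch_size jobs remain: ring after the last job
        points.append(job_count)
    return points


print(doorbell_points(5))                  # [1, 2, 3, 4, 5]  (no batching)
print(doorbell_points(10, batch_size=4))   # [4, 8, 10]
```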

Note Batching is currently only supported for the doca_sha and doca_dma step types and can only be specified in the throughput mode of operation.

Gather support involves breaking incoming input data from a single buffer into multiple buffers, which are "gathered" into a single gather list. Currently only gather is supported.

Default value: 1

Specifies the partitioning of input data from a single buffer into a gather list. The value can be specified in two flavors:

--gather-value 4 – splits input buffers into 4 parts as evenly as possible, with any leftover bytes in the last segment

--gather-value 4KiB – splits buffers after each 4 KiB of data. See doca_bench/utility/byte_unit.hpp for the list of possible units.
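The two flavors can be sketched as follows. Both helpers are hypothetical, and the reading of "as evenly as possible" (equal parts, with the remainder added to the last segment) is an assumption based on the wording above.

```python
def split_into_parts(buffer_size: int, parts: int) -> list[int]:
    """Split a buffer into `parts` segments; leftover bytes go to the last one."""
    sizes = [buffer_size // parts] * parts
    sizes[-1] += buffer_size % parts
    return sizes


def split_by_size(buffer_size: int, segment_size: int) -> list[int]:
    """Split a buffer after every `segment_size` bytes; the tail may be shorter."""
    full, rest = divmod(buffer_size, segment_size)
    return [segment_size] * full + ([rest] if rest else [])


print(split_into_parts(4098, 4))   # [1024, 1024, 1024, 1026]
print(split_by_size(10000, 4096))  # [4096, 4096, 1808]
```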

Default value: 1

Specifies the partitioning of output data into a scatter list. The value can be specified in two flavors:

--scatter-value 4 – splits output buffers into 4 parts as evenly as possible, with any leftover bytes in the last segment

--scatter-value 4KiB – splits buffers after each 4 KiB of data. See doca_bench/utility/byte_unit.hpp for the list of possible units.

By default, DOCA Bench emits the results of an iteration once it completes. The user can ask for transient snapshots of the stats as the test progresses by providing the --rt-stats-interval argument with a value representing the number of milliseconds between stat prints. The end-result of the run is still displayed as normal.

Note This may produce a large amount of console output.





DOCA Bench can produce an output file as part of its execution, containing stats and the configuration values used to produce them. This is enabled by specifying the --csv-output-file argument with a file path as the value. Providing a value for this argument enables CSV stats output (in addition to the normal console output). When performing a sweep test, one line is written per iteration of the sweep.

By default, the CSV output contains every possible value. The user can tune this by applying a filter.

Provide one or more filters (positive or negative) to tune which stats are displayed. The value for this argument is a comma-separated list of filter strings. Negative filters start with a minus sign ('-').

--csv-stats "stats.*"

Note The quotes around the * prevent the shell from interpreting it as a wild card for filenames in the command.





--csv-stats "stats.*,-attribute*"

Default: false

When enabled, DOCA Bench appends to a CSV file if it exists or creates a new one. It is assumed that all invocations use the exact same set of output values. This is not verified by DOCA Bench, so the user must ensure that all tests that append to the CSV use the same set of output values.

This is a special case which creates a non-standard CSV file. All values that are not supported by sweep tests are reported once at the start, followed by a new line of headers for the values emitted during the test, and then a row for each test result. This is reserved for an internal use case and should not be relied upon by anyone else.

Instructs DOCA Bench to collect detailed system information as part of the test startup procedure, which is then made available for output in the CSV. The same details are also gathered from the companion side if the companion is in use.

Warning This collection can take a long time (up to a few minutes in some circumstances) to complete, so it is not recommended unless you know you need it.

Some libraries (e.g., doca_dma) support the use of remote memory. To enable this, the user can specify one or both of the remote memory flags --use-remote-input-buffers and --use-remote-output-buffers. This tells DOCA Bench to use the companion to create a remote mmap. This remote mmap is then used to create buffers that are submitted to the component under test.

Note These flags should be used with caution. DOCA Bench does not automatically check whether the underlying components under test support this scenario, so it is the user's responsibility to ensure the flags are used appropriately.

Specifies that the memory used for the initial immutable job input buffers into a pipeline should be backed by an mmap on the remote side.

Note Requires the companion app to be configured.





Specifies that all output and translation buffers in use are backed by an mmap on the remote side.