DOCA Perftest
DOCA Perftest is an RDMA benchmarking tool designed for compute clusters, enabling fine-tuned evaluation of bandwidth, message rate, and latency across a wide range of RDMA operations. It supports both simple client-server setups and large-scale cluster scenarios, offering fine-grained control over benchmarking parameters and supporting diverse RDMA verbs and configurations.
Key features:
Comprehensive RDMA benchmarking – Supports bandwidth, message rate, and latency measurements for various RDMA verbs
Unified application – A single tool for all RDMA operations, with broad configurability and CUDA support for GPUDirect RDMA
Cluster-wide testing – Execute distributed benchmarks from a central host across multiple nodes
Flexible scenario configuration – Define complex multi-node, multi-benchmark tests using a JSON configuration file
Command-line simplicity – Quickly execute single-node benchmarks with a straightforward CLI interface
Synchronized execution – Ensures accurate timing across multi-node and multi-process benchmarks
DOCA Perftest streamlines the performance validation process for RDMA-based applications.
DOCA Perftest relies on the following components, provided with the DOCA SDK:
rdma-core
libibverbs
OpenMPI
doca-verbs (optional)
Required:
DOCA version 3.0.0 or later
Optional:
Passwordless SSH setup across all servers for multi-node testing
CUDA 12.8 or later, required for GPUDirect RDMA benchmarks
DOCA Perftest supports the following usage modes to accommodate both basic and advanced benchmarking needs:
Simple benchmarks – Use command-line arguments directly on the client and server nodes to perform quick bandwidth or latency tests.
This mode is ideal for testing individual parameters or small-scale deployments.
Complex scenarios – Use a JSON input file to define multi-node, multi-benchmark configurations.
This mode enables synchronized, distributed benchmarking across multiple systems with centralized result aggregation.
Simple Point-To-Point Benchmarks
To run a basic RDMA performance test, execute doca_perftest on both the server and client nodes using command-line parameters.
Example usage:
Client (requestor):
doca_perftest -N 3 -d mlx5_0 -c UC -v send -m bw -s 1024 -D 10 -q 50 --server_name remote-server-name
Parameter breakdown:
-N 3 – Launches 3 synchronized processes with automatic core selection (use -C to specify cores explicitly)
-d mlx5_0 – Uses the local InfiniBand device mlx5_0
-c UC – Sets the RDMA connection type to Unreliable Connection (UC)
-v send – Uses the send RDMA verb
-m bw – Measures bandwidth
-s 1024 – Sets the message size to 1024 bytes
-D 10 – Sets the test duration to 10 seconds
-q 50 – Allocates 50 queue pairs per process
--server_name remote-server-name – Specifies the remote server hostname
Info: If passwordless SSH is configured, the client automatically invokes the server instance. If the server is already running manually, the client detects it and skips the invocation.
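The example above covers only the client side. A plausible server-side counterpart is the same binary with matching traffic parameters and without --server_name; this is a sketch under that assumption (the server may accept a different flag set — confirm with doca_perftest -h on your installation):

```shell
# Server (responder) side — assumed mirror of the client command above.
# Start this on remote-server-name before launching the client, unless
# passwordless SSH lets the client invoke the server automatically.
doca_perftest -N 3 -d mlx5_0 -c UC -v send -m bw -s 1024 -D 10 -q 50
```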
The following options are available via -h / --help:
doca_perftest -h
| Argument | Description | Default |
|---|---|---|
| -h / --help | Display help message | |
| | Display version information | |
| | Specify the log level: … | |
| -in | Specify input JSON file path. If specified, all other options are ignored (except …) | |
| | Specify the max time in seconds for the server to wait for a client connection | |
| | Depends on … | |
| -C / --cores | Specify CPU cores to use. Use a comma-separated list or a dash for ranges; ranges and lists can be mixed. | |
| -c | Specify connection type: … | |
| | Specify traffic class for the QP attributes | |
| -v | Specify RDMA verb type. Supported: … | |
| -m | Specify metric type: … | |
| -s | Specify message size (in bytes) | |
| | Specify inline size, up to 1024 bytes | |
| | Specify the depth of the send queue | |
| | Set the maximum number of WCs (cookies) in a single poll | |
| | Specify cookie moderation value (up to 8K messages) | |
| | Use old post send instead of … | |
| | Enhanced reorder allows responders (Rx) to receive packets with all opcode types out of order. Supported: … | |
| | Specify the number of outstanding reads | |
| | Specify the IB service level (0–15) used for selecting the virtual lane and prioritizing traffic | |
| -q | Specify the number of QPs to use | |
| | Enable bidirectional traffic, where both server and client act as both requestor and responder | |
| | Print the average number of cookies per poll for each process | No print |
| | Print a histogram of QP workload fairness | No print |
| -i / --iterations | Specify the number of iterations for each QP | |
| -D / --duration | Specify the overall traffic duration in seconds | Iteration |
| | Specify specific output: … | No output |
| --server_name | Specify the server's hostname or IPv4 address on the client side | |
| | Specify the server's port | |
| -d | Specify the local device name | First device in the system |
| | Specify the local GID index | |
| | Specify warmup time in seconds | |
| | Disable PCIe relaxed ordering | Enabled if possible |
| | Latency only. Save raw data to a JSON file: … | No save |
| | Print the config and output to a JSON-format file. Can be followed by a path, like: … | No JSON print |
| | Specify executor name for JSON output | |
| | Specify session description for JSON output | |
| | Specify DUT file path | |
| | Specify the number of seconds to wait before destroying allocated resources | |
| | Specify MTU size in bytes: … | |
| | Specify where to allocate the memory for the … | |
| | Specify processing hints for TPH: 0=Bidirectional, 1=Requester, 2=Target (Complete), 3=Target with priority | |
| | Optional. Specify the core ID to use for TPH. Must be set with … | |
| | Optional. Specify TPH memory type, persistent or volatile. Must be set with … | |
| | Specify the CUDA device ID | |
| | Specify the CUDA library path | |
| | Specify RDMA verbs driver: … | |
| | Control server launch mode: … | |
| | Specify CPU cores to use for the server process. Use a comma-separated list or a dash for ranges; ranges and lists can be mixed. | Same as client |
| | Specify the device name for the server process | Same as client device |
| | Specify memory type for the server process: … | Same as client |
| | Specify an alternate path for the server executable | Current directory |
| | Enable path selection with hints file. The file must contain exactly 32 bytes of hints data; if no file path is provided, default hints data (all zeros) is used. | |
The options -N/--num_processes and -C/--cores are mutually exclusive. If neither is set, 1 process is used and a single core is auto-selected.
The options -i/--iterations and -D/--duration are mutually exclusive. If neither is set, the test runs in iteration mode with the default iteration count.
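The core specification accepted by -C/--cores mixes comma-separated core IDs with dash-delimited ranges (for example, 0,2,4-7). As an illustration only (this helper is not part of doca_perftest, and assumes ranges are inclusive), such a specification expands like this:

```python
def expand_cores(spec: str) -> list[int]:
    """Expand a core spec like "0,2,4-7" into [0, 2, 4, 5, 6, 7].

    Illustrative sketch of the -C/--cores list/range syntax; not the
    tool's own parser.
    """
    cores = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            cores.append(int(part))
    return cores

print(expand_cores("0,2,4-7"))  # → [0, 2, 4, 5, 6, 7]
```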
Complex Scenarios
For advanced benchmarking scenarios involving multiple hosts, users can define tests using a structured JSON input file. Example:
doca_perftest -in path_to_scenario_file.json
Capabilities:
Automatically deploys and coordinates execution across all defined hosts
Synchronized test initiation across all nodes
Collects and aggregates results on the invoking node
Use cases:
Cluster-wide RDMA performance testing
Multi-benchmark test suites
Automation and repeatability of complex test setups
Notes:
JSON examples can be found in
/usr/share/doc/doca-perftest/examples
It is recommended to start from an example and customize as needed
Bandwidth
Bandwidth tests measure the total data throughput and message-handling efficiency across all active test processes.
Metrics collected:
Message Rate (Mpps) – Total number of Completion Queue Entries (CQEs) processed per second across all processes. Indicates how efficiently the system handles many small messages.
Bandwidth (Gb/s) – Total data transfer rate, calculated as bandwidth = message_rate × message_size. Measures sustained throughput.
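To sanity-check reported numbers against this formula, here is a minimal Python sketch; it assumes the usual decimal units (Mpps = 10^6 messages/s, Gb/s = 10^9 bits/s):

```python
def bandwidth_gbps(message_rate_mpps: float, message_size_bytes: int) -> float:
    """bandwidth = message_rate × message_size, converted to Gb/s.

    message_rate_mpps: aggregate CQEs per second, in millions (Mpps).
    """
    bits_per_second = message_rate_mpps * 1e6 * message_size_bytes * 8
    return bits_per_second / 1e9

# e.g. 10 Mpps of 1024-byte messages:
print(bandwidth_gbps(10, 1024))  # → 81.92 (Gb/s)
```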
How it is measured:
The result is aggregated across all active test processes
Concurrency is controlled using -co (CLI) or the "cores" field (JSON)
The test duration is averaged across processes for consistency
Interpretation tips:
High message rate, low bandwidth → Likely using small message sizes
High bandwidth, moderate message rate → Larger messages or fewer CQEs
Results help determine network saturation, queue depth tuning, and core scaling
Latency
Latency tests measure the time taken for a message to travel across the network. The direction and scope of the measurement depend on the RDMA verb.
RDMA verb modes:
Send/Receive – Measures one-way latency (client → server)
Write – Measures round-trip latency (client → server → client), i.e., ping-pong
Metrics collected:
Minimum latency – Fastest observed transmission
Maximum latency – Longest time taken, reflects worst-case delays
Mean latency – Arithmetic average across all iterations
Median latency – Middle value; less affected by outliers than the mean
Standard deviation – Indicates variability; smaller is more consistent
99% tail latency – 99% of messages completed within this latency
99.9% tail latency – Extreme edge-case analysis; only 0.1% of messages exceed this
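When raw per-message latencies are saved (see the raw-data JSON option above), the same statistics can be recomputed offline. A standard-library Python sketch, not the tool's exact implementation (in particular, the nearest-rank percentile method here is an assumption):

```python
import math
import statistics

def latency_summary(samples_us):
    """Summarize raw per-message latencies (microseconds) with the
    statistics DOCA Perftest reports: min, max, mean, median, stddev,
    and 99% / 99.9% tails."""
    ordered = sorted(samples_us)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank method: smallest sample covering p% of messages.
        return ordered[math.ceil(p / 100 * n) - 1]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stddev": statistics.stdev(ordered),
        "p99": percentile(99),
        "p99.9": percentile(99.9),
    }
```

For example, feeding in 100 synthetic samples of 1–100 µs yields a p99 of 99 µs and a p99.9 of 100 µs, matching the "99% of messages completed within this latency" reading above.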
How it is measured:
Uses RDMA verbs in a loop with tightly synchronized messaging
Time measurements are taken on the sender side for accuracy
Results are aggregated and reported per test
Interpretation tips:
Low mean/median, but high max/tail → Potential jitter or queuing delays
Low standard deviation → Reliable, predictable performance
Use tail latency (99%, 99.9%) for SLA validation and real-time applications
Important notes:
DOCA Perftest offers improved Write latency accuracy compared to the legacy perftest tool
Differences in latency measurement logic may exist between tools, so compare results carefully