DOCA Perftest
DOCA Perftest is a flexible RDMA benchmarking tool for compute clusters, enabling fine-tuned evaluation of bandwidth, message rate, and latency across a wide range of RDMA operations. It scales from simple client-server tests to distributed, multi-node cluster benchmarks, offers extensive configurability, and supports modern RDMA features, including GPUDirect RDMA integration.
Key features:
Comprehensive RDMA benchmarking – Supports bandwidth, message rate, and latency measurements for various RDMA verbs
Unified application – A single tool for all RDMA operations, with broad configurability and CUDA support for GPUDirect RDMA
Cluster-wide testing – Execute distributed benchmarks from a central host across multiple nodes
Flexible scenario configuration – Define complex multi-node, multi-benchmark tests using a JSON configuration file
Command-line simplicity – Quickly execute single-node benchmarks with a straightforward CLI interface
Synchronized execution – Ensures accurate timing across multi-node and multi-process benchmarks
DOCA Perftest streamlines the performance validation process for RDMA-based applications.
DOCA Perftest relies on the following components, provided with the DOCA SDK:
rdma-core
libibverbs
OpenMPI
DOCA version 3.0.0 or later
Optional:
Passwordless SSH setup across all servers for multi-node testing
CUDA 12.8 or later, required for GPUDirect RDMA benchmarks
DOCA Perftest supports the following usage modes to accommodate both basic and advanced benchmarking needs:
Simple benchmarks – Use command-line arguments directly on the client and server nodes to perform quick bandwidth or latency tests. This mode is ideal for testing individual parameters or small-scale deployments.
Complex scenarios – Use a JSON input file to define multi-node, multi-benchmark configurations. This mode enables synchronized, distributed benchmarking across multiple systems with centralized result aggregation.
Simple Benchmarks
To perform a basic benchmark test, run doca_perftest on both the server and client nodes using CLI parameters.
Example:
Server (responder):
doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10
Client (requestor):
doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name
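The responder is typically started first so that it is listening when the requestor connects; the client command mirrors the server's parameters and adds -sn to identify the target host.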
Parameter breakdown:
-co 1,2 – Run synchronously on CPU cores 1 and 2
-d mlx5_0 – Use the local IB device mlx5_0
-ct UC – Use the Unreliable Connection (UC) type
-v send – Use the send RDMA verb
-m bw – Measure bandwidth (BW)
-s 1024 – Message size is 1024 bytes
-D 10 – Set test duration to 10 seconds
-sn remote-server-name – (Client only) Target server hostname
The following options are available via -h / --help:
doca_perftest -h
| Argument | Description | Default Value |
|----------|-------------|---------------|
| -in | JSON input file for complex scenarios. If specified, overrides all other options (except help/debug). | |
| -co | MANDATORY. Comma-separated list or range of CPU cores. | |
| -ct | Connection type. | |
| -v | RDMA verb. | |
| -m | Metric type. | |
| -s | Message size in bytes. | |
| | Inline size, up to 1024 bytes. | |
| | Send queue depth. | |
| | Max number of WCs (cookies) per poll. | |
| | Cookie moderation value (up to 8k messages). | |
| | Use legacy post-send mode. | |
| | Enhanced reorder: allows responders (Rx) to receive packets with all opcode types out of order (OOO). | |
| | Number of outstanding reads. | |
| | Number of QPs. | |
| | Print average number of cookies polled per process. | Disabled |
| | Print workload fairness histogram for QPs. | Disabled |
| | Number of iterations per QP. Mutually exclusive with -D. | |
| -D | Traffic duration in seconds. Mutually exclusive with the iterations option. | Iteration-based |
| | Restrict output to the specified results. | No filtering |
| | Server IPv4 address. | Hardcoded |
| -sn | Server hostname (client side only). Mutually exclusive with the server IP option. | |
| | Server port. | |
| -d | IB device name. | First available device |
| | Warmup time in seconds. | |
| | Disable PCI relaxed ordering. | Enabled (if possible) |
| | Latency only. Save raw latency to a JSON file. Path optional. | Disabled |
| | Print config/output in JSON format to a file. Optional path. | Disabled |
| | Executor name (used in JSON output). | |
| | Session description (used in JSON output). | |
| | Use CUDA memory (GPUDirect RDMA). | Host memory |
Complex Scenarios
For advanced benchmarking scenarios involving multiple hosts, users can define tests using a structured JSON input file. Example:
doca_perftest -in path_to_scenario_file.json
Capabilities:
Automatically deploys and coordinates execution across all defined hosts
Synchronized test initiation across all nodes
Collects and aggregates results on the invoking node
Use cases:
Cluster-wide RDMA performance testing
Multi-benchmark test suites
Automation and repeatability of complex test setups
Notes:
JSON examples can be found in /usr/share/doc/doca-perftest/examples
It is recommended to start from an example and customize as needed
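For orientation, the sketch below shows the general shape such a scenario file might take. Apart from the "cores" field, which is documented in the Bandwidth section below, every field name here is a hypothetical illustration; treat the shipped examples as the authoritative schema.

{
  "hosts": ["node-01", "node-02"],
  "benchmarks": [
    {
      "verb": "send",
      "metric": "bw",
      "message_size": 1024,
      "cores": [1, 2]
    }
  ]
}

Here hosts, benchmarks, verb, metric, and message_size are assumed names mirroring the -v, -m, and -s CLI options; only cores corresponds to a field confirmed by this document.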
Bandwidth
Bandwidth tests measure the total data throughput and message-handling efficiency across all active test processes.
Metrics collected:
Message Rate (Mpps) – Total number of Completion Queue Entries (CQEs) processed per second across all processes. Indicates how efficiently the system handles many small messages.
Bandwidth (Gb/s) – Total data transfer rate, calculated as bandwidth = message_rate × message_size. Measures sustained throughput.
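For example, a measured message rate of 10 Mpps with 1024-byte messages corresponds to 10 × 10⁶ × 1024 × 8 ≈ 81.9 Gb/s (the factor of 8 converts bytes to bits).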
How it is measured:
The result is aggregated across all active test processes
Concurrency is controlled using the -co option (CLI) or the "cores" field (JSON)
The test duration is averaged across processes for consistency
Interpretation tips:
High message rate, low bandwidth → Likely using small message sizes
High bandwidth, moderate message rate → Larger messages or fewer CQEs
Results help determine network saturation, queue depth tuning, and core scaling
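To probe core scaling directly, a sketch such as the following (the same confirmed flags as the earlier example, with a wider core list) can be compared against the two-core run:

Server: doca_perftest -co 1,2,3,4 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10
Client: doca_perftest -co 1,2,3,4 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name

If bandwidth stops growing as cores are added while the per-core message rate drops, the link or queue configuration, rather than the CPU, is likely the bottleneck.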
Latency
Latency tests measure the time taken for a message to travel across the network. The direction and scope of the measurement depend on the RDMA verb.
RDMA verb modes:
Send/Receive – Measures one-way latency (client → server)
Write – Measures round-trip latency (client → server → client), i.e., ping-pong
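As an illustration, a round-trip Write latency run could look like the sketch below. The lat metric value and the RC connection type are assumptions; confirm the accepted values with doca_perftest -h:

Server: doca_perftest -co 1 -d mlx5_0 -ct RC -v write -m lat -s 64 -D 10
Client: doca_perftest -co 1 -d mlx5_0 -ct RC -v write -m lat -s 64 -D 10 -sn remote-server-name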
Metrics collected:
Minimum latency – Fastest observed transmission
Maximum latency – Longest time taken, reflects worst-case delays
Mean latency – Arithmetic average across all iterations
Median latency – Middle value; less affected by outliers than the mean
Standard deviation – Indicates variability; smaller is more consistent
99% tail latency – 99% of messages completed within this latency
99.9% tail latency – Extreme edge-case analysis; only 0.1% of messages exceed this
How it is measured:
Uses RDMA verbs in a loop with tightly synchronized messaging
Time measurements are taken on the sender side for accuracy
Results are aggregated and reported per test
Interpretation tips:
Low mean/median, but high max/tail → Potential jitter or queuing delays
Low standard deviation → Reliable, predictable performance
Use tail latency (99%, 99.9%) for SLA validation and real-time applications
Important notes:
DOCA Perftest offers improved Write latency accuracy compared to the legacy perftest tool
Differences in latency measurement logic may exist between tools; compare results carefully