Date: 2025-07-15
DOCA Perftest
DOCA Perftest is an RDMA benchmarking tool designed for compute clusters, enabling fine-tuned evaluation of bandwidth, message rate, and latency across a wide range of RDMA operations.
DOCA Perftest is a flexible RDMA benchmarking application that scales from simple client-server tests to distributed, multi-node cluster benchmarks. It provides extensive configurability and supports modern RDMA features, including GPUDirect RDMA integration.
Key features:
Comprehensive RDMA benchmarking – Supports bandwidth, message rate, and latency measurements for various RDMA verbs
Unified application – A single tool for all RDMA operations, with broad configurability and CUDA support for GPUDirect RDMA
Cluster-wide testing – Execute distributed benchmarks from a central host across multiple nodes
Flexible scenario configuration – Define complex multi-node, multi-benchmark tests using a JSON configuration file
Command-line simplicity – Quickly execute single-node benchmarks with a straightforward CLI
Synchronized execution – Ensures accurate timing across multi-node and multi-process benchmarks
DOCA Perftest streamlines the performance validation process for RDMA-based applications.
DOCA Perftest relies on the following components, provided with the DOCA SDK:
rdma-core
libibverbs
OpenMPI
DOCA version 3.0.0 or later
Optional:
Passwordless SSH setup across all servers for multi-node testing
CUDA 12.8 or later, required for GPUDirect RDMA benchmarks
DOCA Perftest supports the following usage modes to accommodate both basic and advanced benchmarking needs:
Simple benchmarks – Use command-line arguments directly on the client and server nodes to perform quick bandwidth or latency tests. This mode is ideal for testing individual parameters or small-scale deployments.
Complex scenarios – Use a JSON input file to define multi-node, multi-benchmark configurations. This mode enables synchronized, distributed benchmarking across multiple systems with centralized result aggregation.
Simple Benchmarks
To perform a basic benchmark test, run doca_perftest on both the server and client nodes using CLI parameters.
Example:
Server (responder):
doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10
Client (requestor):
doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name
Parameter breakdown:
-co 1,2 – Run synchronously on CPU cores 1 and 2
-d mlx5_0 – Use the local IB device mlx5_0
-ct UC – Use the Unreliable Connection (UC) type
-v send – Use the send RDMA verb
-m bw – Measure bandwidth (BW)
-s 1024 – Message size is 1024 bytes
-D 10 – Set test duration to 10 seconds
-sn remote-server-name – (Client only) Target server hostname
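The server and client invocations above differ only in the trailing -sn flag, so both can be generated from one parameter set. The following is a minimal sketch, not part of DOCA Perftest; the helper name build_commands is hypothetical, and it uses only the flags shown in the example above.

```python
import shlex


def build_commands(cores, device, conn_type, verb, metric, msg_size, duration, server_name):
    """Return (server_cmd, client_cmd) strings that share every common flag.

    Hypothetical helper: only flags that appear in the example above are used.
    """
    common = [
        "doca_perftest",
        "-co", ",".join(str(c) for c in cores),  # CPU cores, comma-separated
        "-d", device,                            # local IB device
        "-ct", conn_type,                        # connection type
        "-v", verb,                              # RDMA verb
        "-m", metric,                            # metric type
        "-s", str(msg_size),                     # message size in bytes
        "-D", str(duration),                     # test duration in seconds
    ]
    server_cmd = shlex.join(common)
    # The client adds the server hostname; everything else must match the server.
    client_cmd = shlex.join(common + ["-sn", server_name])
    return server_cmd, client_cmd


server_cmd, client_cmd = build_commands(
    [1, 2], "mlx5_0", "UC", "send", "bw", 1024, 10, "remote-server-name"
)
print(server_cmd)
print(client_cmd)
```

Generating both commands from the same values helps avoid a mismatch between the two sides, such as different message sizes or connection types.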
The following options are available via -h / --help:
doca_perftest -h
Argument | Description | Default Value
--- | --- | ---
-in | JSON input file for complex scenarios. If specified, overrides all other options (except help/debug). |
-co | MANDATORY. Comma-separated list or range of CPU cores. |
-ct | Connection type (for example, UC). |
-v | RDMA verb (for example, send). |
-m | Metric type (for example, bw). |
-s | Message size in bytes. |
 | Inline size, up to 1024 bytes. |
 | Send queue depth. |
 | Max number of WCs (CQEs) per poll. |
 | CQE moderation value (up to 8k messages). |
 | Use legacy post-send mode instead of |
 | Enhanced reorder allows responders (Rx) to receive packets with all types of opcodes out of order (OOO): |
 | Number of outstanding reads. |
 | Number of QPs. |
 | Print average number of CQEs polled per process. | Disabled
 | Print workload fairness histogram for QPs. | Disabled
 | Number of iterations per QP. Mutually exclusive with the traffic duration option. |
-D | Traffic duration in seconds. Mutually exclusive with the iteration count option. | Iteration-based
 | Specify specific output: | No filtering
 | Server IPv4 address. | Hardcoded
-sn | Server hostname (client side only). Mutually exclusive with the server IPv4 address option. |
 | Server port. |
-d | IB device name. | First available device
 | Warmup time in seconds. |
 | Disable PCI relaxed ordering. | Enabled (if possible)
 | Latency-only. Save raw latency to JSON file. Path optional. | Disabled
 | Print config/output in JSON format to file. Optional path. | Disabled
 | Executor name (used in JSON output). |
 | Session description (used in JSON output). |
 | Use CUDA memory (GPUDirect RDMA). | Host memory
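The core list option accepts a comma-separated list or a range. As an illustration only, the sketch below expands such a specification into individual core IDs. The "a-b" range syntax and the helper name parse_core_list are assumptions, since the help text does not spell out the exact format.

```python
def parse_core_list(spec):
    """Expand a core spec such as "1,2" or "0-3,8" into a sorted list of core IDs.

    Assumption: ranges are written as "start-end" (inclusive). The help text
    only says "list or range", so check the tool's actual syntax before relying
    on this.
    """
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in core spec {spec!r}")
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range {part!r}")
            cores.update(range(start, end + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


print(parse_core_list("1,2"))
print(parse_core_list("0-3,8"))
```

A helper like this can be useful for checking how many processes a given core list implies before launching a run.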
Complex Scenarios
For advanced benchmarking scenarios involving multiple hosts, users can define tests using a structured JSON input file. Example:
doca_perftest -in path_to_scenario_file.json
Capabilities:
Automatically deploys and coordinates execution across all defined hosts
Synchronized test initiation across all nodes
Collects and aggregates results on the invoking node
Use cases:
Cluster-wide RDMA performance testing
Multi-benchmark test suites
Automation and repeatability of complex test setups
Notes:
JSON examples can be found in /usr/share/doc/doca-perftest/examples
It is recommended to start from an example and customize as needed
Bandwidth
Bandwidth tests measure the total data throughput and message-handling efficiency across all active test processes.
Metrics collected:
Message Rate (Mpps) – Total number of Completion Queue Entries (CQEs) processed per second across all processes. Indicates how efficiently the system handles many small messages.
Bandwidth (Gb/s) – Total data transfer rate, calculated as bandwidth = message_rate × message_size. Measures sustained throughput.
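The relation between the two metrics can be checked by hand. The sketch below applies bandwidth = message_rate × message_size with explicit unit conversion, assuming that Mpps means 10^6 messages per second, that the message size is in bytes, and that Gb/s means 10^9 bits per second. The function name bandwidth_gbps is illustrative and not part of the tool.

```python
def bandwidth_gbps(message_rate_mpps, message_size_bytes):
    """Convert a message rate (Mpps) and message size (bytes) to Gb/s.

    Assumes decimal units: 1 Mpps = 1e6 messages/s, 1 Gb/s = 1e9 bits/s.
    """
    messages_per_second = message_rate_mpps * 1e6
    bits_per_second = messages_per_second * message_size_bytes * 8
    return bits_per_second / 1e9


# 10 Mpps of 1024-byte messages
print(bandwidth_gbps(10.0, 1024))
# The same message rate with 64-byte messages yields far less bandwidth
print(bandwidth_gbps(10.0, 64))
```

With these assumptions, 10 Mpps at 1024 bytes corresponds to about 81.92 Gb/s, while the same rate at 64 bytes corresponds to about 5.12 Gb/s.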
How it is measured:
The result is aggregated across all active test processes
Concurrency is controlled using:
-co (CLI)
"cores" field (JSON)
The test duration is averaged across processes for consistency
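One possible reading of the aggregation described above is sketched here: completions are summed across processes and divided by the mean per-process duration. This is an assumption about the arithmetic, not the tool's confirmed implementation, and the function name and input layout are hypothetical.

```python
def aggregate(per_process, message_size_bytes):
    """Aggregate per-process results into a total message rate and bandwidth.

    per_process: list of (cqe_count, duration_seconds) tuples, one per process.
    Assumption: total rate = sum of CQEs / mean duration; decimal units.
    Returns (message_rate_mpps, bandwidth_gbps).
    """
    if not per_process:
        raise ValueError("need at least one process result")
    total_cqes = sum(count for count, _ in per_process)
    mean_duration = sum(duration for _, duration in per_process) / len(per_process)
    rate_mpps = total_cqes / mean_duration / 1e6
    gbps = rate_mpps * 1e6 * message_size_bytes * 8 / 1e9
    return rate_mpps, gbps


# Two processes with slightly different run times (made-up numbers)
rate, gbps = aggregate([(49_000_000, 9.9), (51_000_000, 10.1)], 1024)
print(f"{rate:.2f} Mpps, {gbps:.2f} Gb/s")
```

Averaging the duration keeps one slow or fast process from skewing the reported total as much as using the longest or shortest run time would.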
Interpretation tips:
High message rate, low bandwidth → Likely using small message sizes
High bandwidth, moderate message rate → Larger messages or fewer CQEs
Results help determine network saturation, queue depth tuning, and core scaling
Latency
Latency tests measure the time taken for a message to travel across the network. The direction and scope of the measurement depend on the RDMA verb.
RDMA verb modes:
Send/Receive – Measures one-way latency (client → server)
Write – Measures round-trip latency (client → server → client), i.e., ping-pong
Metrics collected:
Minimum latency – Fastest observed transmission
Maximum latency – Longest time taken, reflects worst-case delays
Mean latency – Arithmetic average across all iterations
Median latency – Middle value; less affected by outliers than the mean
Standard deviation – Indicates variability; smaller is more consistent
99% tail latency – 99% of messages completed within this latency
99.9% tail latency – Extreme edge-case analysis; only 0.1% of messages exceed this
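The metrics listed above can be reproduced from raw latency samples, for example when post-processing a saved latency file. The sketch below is illustrative: it assumes the samples are already loaded into a list, uses the nearest-rank method for percentiles, and uses the population standard deviation. The tool itself may use different definitions.

```python
import statistics


def nearest_rank(sorted_samples, per_mille):
    """Return the nearest-rank percentile; per_mille=990 means 99%, 999 means 99.9%.

    Integer arithmetic avoids floating-point rounding in the rank computation.
    """
    n = len(sorted_samples)
    rank = -(-per_mille * n // 1000)  # ceiling division
    return sorted_samples[max(rank, 1) - 1]


def summarize(samples):
    """Compute the latency summary metrics from a list of samples (same unit)."""
    data = sorted(samples)
    return {
        "min": data[0],
        "max": data[-1],
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": statistics.pstdev(data),  # population standard deviation (assumption)
        "p99": nearest_rank(data, 990),
        "p99.9": nearest_rank(data, 999),
    }


# Synthetic samples 1..1000, for illustration only
summary = summarize(list(range(1, 1001)))
for name, value in summary.items():
    print(name, value)
```

For the synthetic 1..1000 data set, the 99% tail is 990 and the 99.9% tail is 999, which matches the definitions above: 99% and 99.9% of samples fall at or below those values.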
How it is measured:
Uses RDMA verbs in a loop with tightly synchronized messaging
Time measurements are taken on the sender side for accuracy
Results are aggregated and reported per test
Interpretation tips:
Low mean/median, but high max/tail → Potential jitter or queuing delays
Low standard deviation → Reliable, predictable performance
Use tail latency (99%, 99.9%) for SLA validation and real-time applications
Important notes:
DOCA Perftest offers improved Write latency accuracy compared to the legacy perftest tool
Differences in latency measurement logic may exist between tools, so compare results carefully