DOCA Perftest - NVIDIA Docs

DOCA Perftest is an RDMA benchmarking tool designed for compute clusters, enabling fine-tuned evaluation of bandwidth, message rate, and latency across a wide range of RDMA operations.

Introduction

DOCA Perftest is a flexible RDMA benchmarking application that scales from simple client-server tests to distributed, multi-node cluster benchmarks. It provides extensive configurability and supports modern RDMA features, including GPUDirect RDMA integration.

Key features:

Comprehensive RDMA benchmarking – Supports bandwidth, message rate, and latency measurements for various RDMA verbs
Unified application – A single tool for all RDMA operations, with broad configurability and CUDA support for GPUDirect RDMA
Cluster-wide testing – Execute distributed benchmarks from a central host across multiple nodes
Flexible scenario configuration – Define complex multi-node, multi-benchmark tests using a JSON configuration file
Command-line simplicity – Quickly execute single-node benchmarks with a straightforward CLI interface
Synchronized execution – Ensures accurate timing across multi-node and multi-process benchmarks

DOCA Perftest streamlines the performance validation process for RDMA-based applications.

Dependencies

DOCA Perftest relies on the following components, provided with the DOCA SDK:

rdma-core
libibverbs
OpenMPI

Prerequisites

DOCA version 3.0.0 or later
Optional:
- Passwordless SSH setup across all servers for multi-node testing
- CUDA 12.8 or later, required for GPUDirect RDMA benchmarks

Usage

DOCA Perftest supports the following usage modes to accommodate both basic and advanced benchmarking needs:

Simple benchmarks – Use command-line arguments directly on the client and server nodes to perform quick bandwidth or latency tests. This mode is ideal for testing individual parameters or small-scale deployments.
Complex scenarios – Use a JSON input file to define multi-node, multi-benchmark configurations. This mode enables synchronized, distributed benchmarking across multiple systems with centralized result aggregation.

Simple Benchmarks

To perform a basic benchmark test, run doca_perftest on both the server and client nodes using CLI parameters.

Example:

Server (responder):

Copy
Copied!

            
            doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10

Client (requestor):

Copy
Copied!

            
            doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name

Parameter breakdown:

-co 1,2 – Run synchronously on CPU cores 1 and 2
-d mlx5_0 – Use the local IB device mlx5_0
-ct UC – Use Unreliable Connection (UC) type
-v send – Use send RDMA verb
-m bw – Measure bandwidth (BW)
-s 1024 – Message size is 1024 bytes
-D 10 – Set test duration to 10 seconds
-sn remote-server-name – (Client only) Target server hostname

The following options are available via -h / --help:

Copy
Copied!

            
            doca_perftest -h

Argument	Description	Default Value
`-in` / `--input_file`	JSON input file for complex scenarios. If specified, overrides all other options (except help/debug).
`-co` / `--cores`	MANDATORY. Comma-separated list or range of CPU cores.
`-ct` / `--connection_type`	Connection type: `RC` or `UC`.
`-v` / `--verb`	RDMA verb: `read`, `write`, `send`, or `writeImm`.	`write`
`-m` / `--metric`	Metric type: `bw` (bandwidth) or `lat` (latency).	`bw`
`-s` / `--msg_size`	Message size in bytes.	`64k`
`-l` / `--inline_size`	Inline size, up to 1024 bytes.	`0` or message size (if `< 220`)
`-tx` / `--tx_depth`	Send queue depth.	`128`
`--poll_batch_size`	Max number of WCs (cookies) per poll.	`16`
`-mod`	Cookie moderation value (up to 8k messages).	`1`
`-os` / `--old_post_send`	Use legacy post-send mode instead of `IB_WR_API`.	`IB_WR_API`
`-er` / `--enhanced_reorder`	Enhanced reorder allows responders (Rx) to receive packets with all types of opcodes OOO: `auto` or `disable`	`auto`
`-or` / `--out_reads`	Number of outstanding reads.	`1`
`-qp`	Number of QPs.	`1`
`--poll_stat`	Print average number of cookies polled per process.	Disabled
`--qp_histogram`	Print workload fairness histogram for QPs.	Disabled
`-i` / `--iterations`	Number of iterations per QP. Mutually exclusive with `-D`.	`5000`
`-D` / `--duration`	Traffic duration in seconds. Mutually exclusive with `-i`.	Iteration-based
`-o` / `--output`	Specify specific output: `BW`, `Lat`, or `MR`. Suppresses all other outputs.	No filtering
`-ip`	Server IPv4 address.	Hardcoded
`-sn` / `--server_name`	Server hostname (client side only). Mutually exclusive with `-ip`.
`-sp` / `--server_port`	Server port.	`18555`
`-d` / `--device_name`	IB device name.	First available device
`-w` / `--warmup`	Warmup time in seconds.	`2`
`--disable_pcir`	Disable PCI relaxed ordering.	Enabled (if possible)
`--save_raw_data`	Latency-only. Save raw latency to JSON file. Path optional.	Disabled
`-j` / `--json`	Print config/output in JSON format to file. Optional path.	Disabled
`-u` / `--user`	Executor name (used in JSON output).
`-sd` / `--session_desc`	Session description (used in JSON output).
`--cuda <cuda device id>`	Use CUDA memory (GPUDirect RDMA).	Host memory

Complex Scenarios

For advanced benchmarking scenarios involving multiple hosts, users can define tests using a structured JSON input file. Example:

Copy
Copied!

            
            doca_perftest -in path_to_scenario_file.json

Capabilities:

Automatically deploys and coordinates execution across all defined hosts
Synchronized test initiation across all nodes
Collects and aggregates results on the invoking node

Use cases:

Cluster-wide RDMA performance testing
Multi-benchmark test suites
Automation and repeatability of complex test setups

Notes:

JSON examples can be found in /usr/share/doc/doca-perftest/examples
It is recommended to start from an example and customize as needed

Benchmark Results

Bandwidth

Bandwidth tests measure the total data throughput and message-handling efficiency across all active test processes.

Metrics collected:

Message Rate (Mpps) – Total number of Completion Queue Entries (CQEs) processed per second across all processes. Indicates how efficiently the system handles many small messages.
Bandwidth (Gb/s) – Total data transfer rate, calculated as bandwidth = message_rate × message_size. Measures sustained throughput.

How it is measured:

The result is aggregated across all active test processes
Concurrency is controlled using:
- -co (CLI)
- "cores" field (JSON)
The test duration is averaged across processes for consistency

Interpretation tips:

High message rate, low bandwidth → Likely using small message sizes
High bandwidth, moderate message rate → Larger messages or fewer CQEs
Results help determine network saturation, queue depth tuning, and core scaling

Latency

Latency tests measure the time taken for a message to travel across the network. The direction and scope of the measurement depend on the RDMA verb.

RDMA verb modes:

Send/Receive – Measures one-way latency (client → server)
Write – Measures round-trip latency (client → server → client), i.e., ping-pong

Metrics collected:

Minimum latency – Fastest observed transmission
Maximum latency – Longest time taken, reflects worst-case delays
Mean latency – Arithmetic average across all iterations
Median latency – Middle value; less affected by outliers than the mean
Standard deviation – Indicates variability; smaller is more consistent
99% tail latency – 99% of messages completed within this latency
99.9% tail latency – Extreme edge-case analysis; only 0.1% of messages exceed this

How it is measured:

Uses RDMA verbs in a loop with tightly synchronized messaging
Time measurements are taken on the sender side for accuracy
Results are aggregated and reported per test

Interpretation tips:

Low mean/median, but high max/tail → Potential jitter or queuing delays
Low standard deviation → Reliable, predictable performance
Use tail latency (99%, 99.9%) for SLA validation and real-time applications

Important notes:

DOCA Perftest offers improved Write latency accuracy compared to the legacy perftest tool
Differences in latency measurement logic may exist between tools—compare results carefully

On This Page