DOCA Documentation v3.0.0

DOCA Perftest

DOCA Perftest is an RDMA benchmarking tool designed for compute clusters, enabling fine-tuned evaluation of bandwidth, message rate, and latency across a wide range of RDMA operations.

It scales from simple client-server tests to distributed, multi-node cluster benchmarks, offers extensive configurability, and supports modern RDMA features, including GPUDirect RDMA integration.

Key features:

  • Comprehensive RDMA benchmarking – Supports bandwidth, message rate, and latency measurements for various RDMA verbs

  • Unified application – A single tool for all RDMA operations, with broad configurability and CUDA support for GPUDirect RDMA

  • Cluster-wide testing – Execute distributed benchmarks from a central host across multiple nodes

  • Flexible scenario configuration – Define complex multi-node, multi-benchmark tests using a JSON configuration file

  • Command-line simplicity – Quickly execute single-node benchmarks with a straightforward CLI

  • Synchronized execution – Ensures accurate timing across multi-node and multi-process benchmarks

DOCA Perftest streamlines the performance validation process for RDMA-based applications.

DOCA Perftest relies on the following components, provided with the DOCA SDK:

  • rdma-core

  • libibverbs

  • OpenMPI

  • DOCA version 3.0.0 or later

  • Optional:

    • Passwordless SSH setup across all servers for multi-node testing

    • CUDA 12.8 or later, required for GPUDirect RDMA benchmarks

DOCA Perftest supports the following usage modes to accommodate both basic and advanced benchmarking needs:

  • Simple benchmarks – Use command-line arguments directly on the client and server nodes to perform quick bandwidth or latency tests. This mode is ideal for testing individual parameters or small-scale deployments.

  • Complex scenarios – Use a JSON input file to define multi-node, multi-benchmark configurations. This mode enables synchronized, distributed benchmarking across multiple systems with centralized result aggregation.

Simple Benchmarks

To perform a basic benchmark test, run doca_perftest on both the server and client nodes using CLI parameters.

Example:

  • Server (responder):

    doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10

  • Client (requestor):

    doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name

Parameter breakdown:

  • -co 1,2 – Run synchronously on CPU cores 1 and 2

  • -d mlx5_0 – Use the local IB device mlx5_0

  • -ct UC – Use Unreliable Connection (UC) type

  • -v send – Use send RDMA verb

  • -m bw – Measure bandwidth (BW)

  • -s 1024 – Message size is 1024 bytes

  • -D 10 – Set test duration to 10 seconds

  • -sn remote-server-name – (Client only) Target server hostname
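The server and client commands above differ only in the -sn option, so both can be generated from one set of parameters. The following is a minimal Python sketch and is not part of DOCA Perftest: the helper name build_perftest_cmd and its keyword arguments are hypothetical, and only the flags shown in the example above are emitted.

```python
import shlex


def build_perftest_cmd(cores, device, conn_type, verb, metric,
                       msg_size, duration, server_name=None):
    """Return a doca_perftest argument list (hypothetical helper).

    Passing server_name yields the client (requestor) command;
    omitting it yields the server (responder) command.
    """
    cmd = [
        "doca_perftest",
        "-co", ",".join(str(c) for c in cores),  # CPU cores, comma-separated
        "-d", device,                            # local IB device
        "-ct", conn_type,                        # connection type: RC or UC
        "-v", verb,                              # RDMA verb
        "-m", metric,                            # bw or lat
        "-s", str(msg_size),                     # message size in bytes
        "-D", str(duration),                     # test duration in seconds
    ]
    if server_name is not None:
        cmd += ["-sn", server_name]              # client only: target server
    return cmd


params = dict(cores=[1, 2], device="mlx5_0", conn_type="UC",
              verb="send", metric="bw", msg_size=1024, duration=10)

server_cmd = build_perftest_cmd(**params)
client_cmd = build_perftest_cmd(**params, server_name="remote-server-name")

print(shlex.join(server_cmd))
print(shlex.join(client_cmd))
```

Each list can be passed as-is to subprocess.run on the corresponding node; keeping one shared parameter set helps ensure that both sides run with matching settings.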

The following options are available via -h / --help:

doca_perftest -h

| Argument | Description | Default Value |
|---|---|---|
| -in / --input_file | JSON input file for complex scenarios. If specified, overrides all other options (except help/debug). | |
| -co / --cores | MANDATORY. Comma-separated list or range of CPU cores. | |
| -ct / --connection_type | Connection type: RC or UC. | |
| -v / --verb | RDMA verb: read, write, send, or writeImm. | write |
| -m / --metric | Metric type: bw (bandwidth) or lat (latency). | bw |
| -s / --msg_size | Message size in bytes. | 64k |
| -l / --inline_size | Inline size, up to 1024 bytes. | 0 or message size (if < 220) |
| -tx / --tx_depth | Send queue depth. | 128 |
| --poll_batch_size | Max number of WCs (cookies) per poll. | 16 |
| -mod | Cookie moderation value (up to 8k messages). | 1 |
| -os / --old_post_send | Use legacy post-send mode instead of IB_WR_API. | IB_WR_API |
| -er / --enhanced_reorder | Enhanced reorder allows responders (Rx) to receive packets with all types of opcodes out of order (OOO). Values: auto or disable. | auto |
| -or / --out_reads | Number of outstanding reads. | 1 |
| -qp | Number of QPs. | 1 |
| --poll_stat | Print average number of cookies polled per process. | Disabled |
| --qp_histogram | Print workload fairness histogram for QPs. | Disabled |
| -i / --iterations | Number of iterations per QP. Mutually exclusive with -D. | 5000 |
| -D / --duration | Traffic duration in seconds. Mutually exclusive with -i. | Iteration-based |
| -o / --output | Select a specific output: BW, Lat, or MR. Suppresses all other outputs. | No filtering |
| -ip | Server IPv4 address. | Hardcoded |
| -sn / --server_name | Server hostname (client side only). Mutually exclusive with -ip. | |
| -sp / --server_port | Server port. | 18555 |
| -d / --device_name | IB device name. | First available device |
| -w / --warmup | Warmup time in seconds. | 2 |
| --disable_pcir | Disable PCI relaxed ordering. | Enabled (if possible) |
| --save_raw_data | Latency-only. Save raw latency to JSON file. Path optional. | Disabled |
| -j / --json | Print config/output in JSON format to file. Optional path. | Disabled |
| -u / --user | Executor name (used in JSON output). | |
| -sd / --session_desc | Session description (used in JSON output). | |
| --cuda <cuda device id> | Use CUDA memory (GPUDirect RDMA). | Host memory |
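Several constraints in the option list (the mandatory -co option, the two mutually exclusive pairs, and the enumerated values) can be checked before launching a test. The following Python sketch illustrates one way to do so; check_options is a hypothetical helper, not a DOCA Perftest API, and it encodes only rules stated in the option list above.

```python
# Allowed values as listed in the option descriptions.
ALLOWED = {
    "-ct": {"RC", "UC"},
    "-v": {"read", "write", "send", "writeImm"},
    "-m": {"bw", "lat"},
}

# Pairs of options documented as mutually exclusive.
EXCLUSIVE = [("-i", "-D"), ("-ip", "-sn")]


def check_options(opts):
    """Return a list of problems found in a dict of short flag -> value.

    Hypothetical pre-flight check; an empty list means no documented
    rule is violated. It does not replace the tool's own validation.
    """
    problems = []
    if "-co" not in opts:
        problems.append("-co is mandatory")
    for a, b in EXCLUSIVE:
        if a in opts and b in opts:
            problems.append(f"{a} and {b} are mutually exclusive")
    for flag, allowed in ALLOWED.items():
        if flag in opts and opts[flag] not in allowed:
            problems.append(f"{flag} must be one of {sorted(allowed)}")
    return problems


print(check_options({"-co": "1,2", "-ct": "UC", "-v": "send", "-D": 10}))
print(check_options({"-co": "1,2", "-i": 5000, "-D": 10}))
print(check_options({"-m": "bw"}))
```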


Complex Scenarios

For advanced benchmarking scenarios involving multiple hosts, users can define tests using a structured JSON input file. Example:

doca_perftest -in path_to_scenario_file.json

Capabilities:

  • Automatically deploys and coordinates execution across all defined hosts

  • Synchronizes test initiation across all nodes

  • Collects and aggregates results on the invoking node

Use cases:

  • Cluster-wide RDMA performance testing

  • Multi-benchmark test suites

  • Automation and repeatability of complex test setups

Notes:

  • JSON examples can be found in /usr/share/doc/doca-perftest/examples

  • It is recommended to start from an example and customize as needed

Bandwidth

Bandwidth tests measure the total data throughput and message-handling efficiency across all active test processes.

Metrics collected:

  • Message Rate (Mpps) – Total number of Completion Queue Entries (CQEs) processed per second across all processes. Indicates how efficiently the system handles many small messages.

  • Bandwidth (Gb/s) – Total data transfer rate, calculated as bandwidth = message_rate × message_size and expressed in bits per second. Measures sustained throughput.
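As a worked example of the relationship between the two metrics, the Python sketch below converts a message rate and message size into bandwidth. The unit conventions are assumptions rather than statements from the tool: 1 Mpps is taken as 10^6 messages per second, message size is in bytes, and 1 Gb/s is taken as 10^9 bits per second. The function name bandwidth_gbps is hypothetical.

```python
def bandwidth_gbps(message_rate_mpps, message_size_bytes):
    """bandwidth = message_rate x message_size, converted to Gb/s.

    Assumes decimal units: Mpps = 1e6 messages/s, Gb/s = 1e9 bits/s.
    """
    bits_per_second = message_rate_mpps * 1e6 * message_size_bytes * 8
    return bits_per_second / 1e9


# 10 Mpps of 1024-byte messages
print(f"{bandwidth_gbps(10, 1024):.2f} Gb/s")   # 81.92 Gb/s

# 100 Mpps of 64-byte messages: ten times the message rate, lower bandwidth
print(f"{bandwidth_gbps(100, 64):.2f} Gb/s")    # 51.20 Gb/s
```

The second case shows why a high message rate does not imply high bandwidth when messages are small.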

How it is measured:

  • The result is aggregated across all active test processes

  • Concurrency is controlled using:

    • -co (CLI)

    • "cores" field (JSON)

  • The test duration is averaged across processes for consistency
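The aggregation described above can be illustrated with a short Python sketch. This is only one plausible reading of "aggregated across processes" and "duration averaged across processes", not the tool's actual implementation: it assumes each process reports a completed-message count and its own measured duration, and the function name aggregate_message_rate is hypothetical.

```python
from statistics import mean


def aggregate_message_rate(per_process):
    """Combine per-process results into one message rate in Mpps.

    per_process: list of (messages_completed, duration_seconds) tuples.
    Assumption: messages are summed over all processes and divided by
    the mean of the per-process durations.
    """
    total_messages = sum(messages for messages, _ in per_process)
    avg_duration = mean([duration for _, duration in per_process])
    return total_messages / avg_duration / 1e6


# Two processes (for example, -co 1,2), each completing 50 million messages
results = [(50_000_000, 9.9), (50_000_000, 10.1)]
print(f"{aggregate_message_rate(results):.2f} Mpps")
```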

Interpretation tips:

  • High message rate, low bandwidth → Likely using small message sizes

  • High bandwidth, moderate message rate → Larger messages or fewer CQEs

  • Results help determine network saturation, queue depth tuning, and core scaling

Latency

Latency tests measure the time taken for a message to travel across the network. The direction and scope of the measurement depend on the RDMA verb.

RDMA verb modes:

  • Send/Receive – Measures one-way latency (client → server)

  • Write – Measures round-trip latency (client → server → client), i.e., ping-pong

Metrics collected:

  • Minimum latency – Fastest observed transmission

  • Maximum latency – Longest time taken, reflects worst-case delays

  • Mean latency – Arithmetic average across all iterations

  • Median latency – Middle value; less affected by outliers than the mean

  • Standard deviation – Indicates variability; smaller is more consistent

  • 99% tail latency – 99% of messages completed within this latency

  • 99.9% tail latency – Extreme edge-case analysis; only 0.1% of messages exceed this
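To make the metric definitions concrete, the Python sketch below computes the same statistics from a list of raw latency samples. It is an illustration under stated assumptions, not the tool's implementation: tail latencies use the nearest-rank percentile method, the standard deviation is the population form, and the function names and the synthetic sample data are hypothetical.

```python
import statistics


def tail_latency(sorted_samples, per_mille):
    """Nearest-rank percentile; per_mille=990 means 99%, 999 means 99.9%.

    Integer arithmetic avoids floating-point rounding in the rank.
    """
    n = len(sorted_samples)
    rank = -(-per_mille * n // 1000)  # ceil(per_mille * n / 1000)
    return sorted_samples[max(rank, 1) - 1]


def summarize(samples):
    """Return the latency metrics listed above for a list of samples."""
    s = sorted(samples)
    return {
        "min": s[0],
        "max": s[-1],
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "stdev": statistics.pstdev(s),  # population standard deviation
        "p99": tail_latency(s, 990),
        "p99.9": tail_latency(s, 999),
    }


# Synthetic data in microseconds: mostly fast, with a few slow outliers
samples = [2.0] * 985 + [5.0] * 10 + [50.0] * 5
for name, value in summarize(samples).items():
    print(f"{name:>6}: {value:.3f}")
```

With this data the median stays at the typical value while the 99.9% tail latency and maximum expose the outliers, which is the pattern described as jitter or queuing delay in the interpretation tips.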

How it is measured:

  • Uses RDMA verbs in a loop with tightly synchronized messaging

  • Time measurements are taken on the sender side for accuracy

  • Results are aggregated and reported per test

Interpretation tips:

  • Low mean/median, but high max/tail → Potential jitter or queuing delays

  • Low standard deviation → Reliable, predictable performance

  • Use tail latency (99%, 99.9%) for SLA validation and real-time applications

Important notes:

  • DOCA Perftest offers improved Write latency accuracy compared to the legacy perftest tool

  • Latency measurement logic may differ between tools, so compare results across tools carefully

© Copyright 2025, NVIDIA. Last updated on May 5, 2025.