DOCA Documentation v3.1.0

DOCA Perftest

DOCA Perftest is an RDMA benchmarking tool designed for compute clusters, enabling fine-tuned evaluation of bandwidth, message rate, and latency across a wide range of RDMA operations.

It covers both simple client-server setups and large-scale cluster scenarios, with fine-grained control over benchmarking parameters.

Key features:

  • Comprehensive RDMA benchmarking – Supports bandwidth, message rate, and latency measurements for various RDMA verbs

  • Unified application – A single tool for all RDMA operations, with broad configurability and CUDA support for GPUDirect RDMA

  • Cluster-wide testing – Execute distributed benchmarks from a central host across multiple nodes

  • Flexible scenario configuration – Define complex multi-node, multi-benchmark tests using a JSON configuration file

  • Command-line simplicity – Quickly execute single-node benchmarks with a straightforward CLI

  • Synchronized execution – Ensures accurate timing across multi-node and multi-process benchmarks

DOCA Perftest streamlines the performance validation process for RDMA-based applications.

DOCA Perftest relies on the following components, provided with the DOCA SDK:

  • rdma-core

  • libibverbs

  • OpenMPI

  • doca-verbs (optional)

  • DOCA version 3.0.0 or later

  • Optional:

    • Passwordless SSH setup across all servers for multi-node testing

    • CUDA 12.8 or later, required for GPUDirect RDMA benchmarks

DOCA Perftest supports the following usage modes to accommodate both basic and advanced benchmarking needs:

  • Simple benchmarks – Use command-line arguments directly on the client and server nodes to perform quick bandwidth or latency tests.

    This mode is ideal for testing individual parameters or small-scale deployments.

  • Complex scenarios – Use a JSON input file to define multi-node, multi-benchmark configurations.

    This mode enables synchronized, distributed benchmarking across multiple systems with centralized result aggregation.

Simple Point-To-Point Benchmarks

To run a basic RDMA performance test, execute doca_perftest on both the server and client nodes using command-line parameters.

Example usage:

  • Client (requestor):

    doca_perftest -N 3 -d mlx5_0 -c UC -v send -m bw -s 1024 -D 10 -q 50 --server_name remote-server-name

Parameter breakdown:

  • -N 3 – Launches 3 synchronized processes with automatic core selection

    (Use -C to specify cores explicitly.)

  • -d mlx5_0 – Uses the local InfiniBand device mlx5_0

  • -c UC – Sets the RDMA connection type to Unreliable Connection (UC)

  • -v send – Uses the send RDMA verb

  • -m bw – Measures bandwidth

  • -s 1024 – Sets the message size to 1024 bytes

  • -D 10 – Sets the test duration to 10 seconds

  • -q 50 – Allocates 50 queue pairs per process

  • --server_name remote-server-name – Specifies the remote server hostname

    Info

    If passwordless SSH is configured, the client automatically invokes the server instance. If the server is already running manually, the client detects it and skips the invocation.
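When the client is launched from a script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of DOCA Perftest: the helper name `build_client_command` and its keyword defaults are hypothetical, and it uses only the flags listed in the parameter breakdown.

```python
import shlex


def build_client_command(server_name, device, processes, connection="UC",
                         verb="send", metric="bw", msg_size=1024,
                         duration=10, qps=50):
    """Assemble a doca_perftest client argv list (hypothetical helper).

    Only flags documented in the parameter breakdown are used.
    """
    return [
        "doca_perftest",
        "-N", str(processes),    # number of synchronized processes
        "-d", device,            # local device name
        "-c", connection,        # connection type
        "-v", verb,              # RDMA verb
        "-m", metric,            # bw or lat
        "-s", str(msg_size),     # message size in bytes
        "-D", str(duration),     # test duration in seconds
        "-q", str(qps),          # queue pairs
        "--server_name", server_name,
    ]


argv = build_client_command("remote-server-name", "mlx5_0", 3)
command_line = shlex.join(argv)  # quoted form, safe to paste into a shell
print(command_line)
```

The resulting list can be passed to `subprocess.run(argv)` on a host where `doca_perftest` is installed.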

The following options are available via -h / --help:

doca_perftest -h

| Argument | Description | Default |
|---|---|---|
| -h / --help | Display help message | |
| -V / --version | Display version information | |
| -L / --log | Specify the log level: debug/info/warning/error | error |
| -f / --input_file | Specify input JSON file path. If specified, all other options are ignored (except --help / --version / --log) | |
| -T / --connection_timeout | Specify the maximum time, in seconds, that the server waits for a client connection | 30 |
| -N / --num_processes (note 1) | Depends on -d / --device. Specify number of processes to use. | 1 |
| -C / --cores (note 1) | Specify CPU cores to use. Use a comma-separated list or a dash for ranges. Ranges and lists can be mixed. | |
| -c / --connection_type | Specify connection type: RC/UC. | |
| --traffic_class | Specify traffic class for the QP attributes. | 0 |
| -v / --verb | Specify RDMA verb type. Supported: read/write/send/writeImm. | write |
| -m / --metric | Specify metric type: bw/lat. | bw |
| -s / --msg_size | Specify message size (in bytes). | 64K |
| -I / --inline_size | Specify inline size, up to 1024 bytes. | 0 |
| -t / --tx_depth | Specify the depth of the send queue. | 128 |
| --poll_batch_size | Set the maximum number of work completions (CQEs) in a single poll. | 16 |
| -Q / --cq_mod | Specify CQE moderation value (up to 8k messages). | 1 |
| --old_post_send | Use old post send instead of IB_WR_API. | IB_WR_API (new post send) |
| --enhanced_reorder | Enhanced reorder allows responders (Rx) to receive packets with all types of opcodes out-of-order. Supported: auto/disabled. | auto |
| --out_reads | Specify the number of outstanding reads. | auto |
| -S / --service_level | Specify the IB service level (0–15) used for selecting the virtual lane and prioritizing traffic. | 0 |
| -q / --qp | Specify the number of QPs to use. | 1 |
| -b / --bidirectional | Enable bidirectional traffic, where both server and client act as requestor and responder | |
| --poll_stat | Print the average number of CQEs per poll for each process. | No print |
| -H / --qp_histogram | Print a histogram of QP workload fairness. | No print |
| -i / --iterations (note 2) | Specify the number of iterations for each QP. | 5000 |
| -D / --duration (note 2) | Specify the overall traffic duration in seconds. | Iteration mode |
| -o / --output | Print only the specified result: BW/LAT/MR. All other prints are removed. | No output |
| -n / --server_name | Specify the server's hostname or IPv4 address on the client side. | |
| -p / --server_port | Specify the server's port. | 18555 |
| -d / --device | Specify the local device name. By default, the first device in the system is used. | |
| -g / --gid_index | Specify the local GID index | |
| -w / --warmup | Specify warmup time in seconds. | 2 |
| --disable_pcir | Disable PCIe relaxed ordering. | Enabled if possible |
| --save_raw_data | Latency only. Save raw data to a JSON file: results/raw_lat_<current time>.json. Can be followed by a path, like: ~/rawResults/testNumber12.json. | No save |
| -j / --json | Print the config and output to a JSON file. Can be followed by a path, like: ~/docaPtResults/example/pathFor.json. If selected without a path, the default is ./results/doca_perftest_<testTime>_<Pid>.json. | No JSON print |
| --user | Specify executor name for JSON output. | |
| --session_desc | Specify session description for JSON output. | |
| --dut_path | Specify DUT file path. | |
| --wait_destroy | Specify the number of seconds to wait before destroying allocated resources | 0 |
| --mtu_size | Specify MTU size in bytes: 256/512/1024/2048/4096 | 4096 |
| -M / --memory_type | Specify where to allocate the memory for the memory region: host/cuda/nullmr/device | host |
| --ph | Specify processing hints for TPH: 0=Bidirectional, 1=Requester, 2=Target (Complete), 3=Target with priority | |
| --tph_core_id | Optional. Specify core ID to use for TPH. Must be set with --tph_mem. | |
| --tph_mem | Optional. Specify TPH memory type, persistent or volatile: pm/vm. Must be set with --tph_core_id. | |
| -G / --cuda_dev | Specify the CUDA device ID | 0 |
| --cuda_lib_path | Specify the CUDA library path | /usr/local/cuda/lib64 |
| -r / --rdma_driver | Specify RDMA verbs driver: ibv (ibverbs) / dv (doca_verbs) | ibv |
| --launch_server | Control server launch mode: auto/disable. When set to auto, the tool automatically launches a server process. When set to disable, no server is launched. | auto |
| --server_cores | Specify CPU cores to use for the server process. Use a comma-separated list or a dash for ranges. Ranges and lists can be mixed. | Same as client |
| --server_device | Specify the device name for the server process. | Same as client device |
| --server_mem_type | Specify memory type for the server process: host/cuda/nullmr/device. | Same as client |
| --server_exe | Specify an alternate path for the server executable. | Current directory |
| -P / --path_selection | Enable path selection with a hints file. The file must contain exactly 32 bytes of hints data. If no file path is provided, default hints data (all zeros) is used. | none |

  1. The options -N/--num_processes and -C/--cores are mutually exclusive. If neither is set, 1 process is used and a single core is auto-selected.  

  2. The options -i/--iterations and -D/--duration are mutually exclusive. If neither is set, the test runs in iteration mode with the default of 5000 iterations per QP.
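The core-list syntax accepted by -C / --cores (comma-separated values, dash ranges, or a mix of both) can be illustrated with a small parser. This is a sketch of the syntax only, with a hypothetical function name. How doca_perftest itself parses or validates the value, for example its handling of duplicates, is not described here and may differ.

```python
def parse_core_list(spec):
    """Expand a core specification such as "0,2-4,7" into a list of core IDs.

    Hypothetical helper illustrating the documented list/range syntax.
    """
    cores = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A dash denotes an inclusive range, e.g. "2-4" -> 2, 3, 4
            first, last = (int(value) for value in part.split("-", 1))
            if last < first:
                raise ValueError(f"invalid core range: {part!r}")
            cores.extend(range(first, last + 1))
        else:
            cores.append(int(part))
    return cores


example_cores = parse_core_list("0,2-4,7")
print(example_cores)
```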

Complex Scenarios

For advanced benchmarking scenarios involving multiple hosts, users can define tests using a structured JSON input file. Example:

doca_perftest -f path_to_scenario_file.json

Capabilities:

  • Automatically deploys and coordinates execution across all defined hosts

  • Synchronizes test initiation across all nodes

  • Collects and aggregates results on the invoking node

Use cases:

  • Cluster-wide RDMA performance testing

  • Multi-benchmark test suites

  • Automation and repeatability of complex test setups

Notes:

  • JSON examples can be found in /usr/share/doc/doca-perftest/examples

  • It is recommended to start from an example and customize as needed

Bandwidth

Bandwidth tests measure the total data throughput and message-handling efficiency across all active test processes.

Metrics collected:

  • Message Rate (Mpps) – Total number of Completion Queue Entries (CQEs) processed per second across all processes. Indicates how efficiently the system handles many small messages.

  • Bandwidth (Gb/s) – Total data transfer rate, calculated as bandwidth = message_rate × message_size. Measures sustained throughput.
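The relation between the two metrics can be checked by hand. The sketch below assumes that the message rate is given in millions of messages per second, that the message size is the payload size in bytes, and that 1 Gb equals 10^9 bits. These unit conventions are assumptions for illustration and should be verified against the tool's actual output.

```python
def bandwidth_gbps(message_rate_mpps, message_size_bytes):
    """Convert a message rate (Mpps) and message size (bytes) to Gb/s.

    Assumes 1 Gb = 1e9 bits and counts payload bytes only.
    """
    messages_per_second = message_rate_mpps * 1e6
    bits_per_second = messages_per_second * message_size_bytes * 8
    return bits_per_second / 1e9


# Example: 10 Mpps of 1024-byte messages
example_bw = bandwidth_gbps(10.0, 1024)
print(f"{example_bw:.2f} Gb/s")
```

The same function shows why small messages yield a high message rate but low bandwidth: at 64 bytes, the same 10 Mpps corresponds to only about 5 Gb/s.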

How it is measured:

  • The result is aggregated across all active test processes

  • Concurrency is controlled using:

    • -C / --cores (CLI)

    • "cores" field (JSON)

  • The test duration is averaged across processes for consistency
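One possible reading of the aggregation described above is sketched below: message counts are summed over all processes, and the sum is divided by the mean of the per-process durations. The function name and the input layout are hypothetical, and the tool's internal computation may differ in detail.

```python
from statistics import mean


def aggregate_message_rate_mpps(per_process):
    """Aggregate per-process results into one message rate in Mpps.

    per_process: list of (completed_messages, duration_seconds) tuples,
    one per test process (hypothetical layout).
    """
    total_messages = sum(count for count, _ in per_process)
    average_duration = mean(duration for _, duration in per_process)
    return total_messages / average_duration / 1e6


# Three processes that each completed 10 million messages in roughly 10 s
samples = [(10_000_000, 9.9), (10_000_000, 10.0), (10_000_000, 10.1)]
aggregate_rate = aggregate_message_rate_mpps(samples)
print(f"{aggregate_rate:.3f} Mpps")
```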

Interpretation tips:

  • High message rate, low bandwidth → Likely using small message sizes

  • High bandwidth, moderate message rate → Larger messages or fewer CQEs

  • Results help determine network saturation, queue depth tuning, and core scaling

Latency

Latency tests measure the time taken for a message to travel across the network. The direction and scope of the measurement depend on the RDMA verb.

RDMA verb modes:

  • Send/Receive – Measures one-way latency (client → server)

  • Write – Measures round-trip latency (client → server → client), i.e., ping-pong

Metrics collected:

  • Minimum latency – Fastest observed transmission

  • Maximum latency – Longest time taken, reflects worst-case delays

  • Mean latency – Arithmetic average across all iterations

  • Median latency – Middle value; less affected by outliers than the mean

  • Standard deviation – Indicates variability; smaller is more consistent

  • 99% tail latency – 99% of messages completed within this latency

  • 99.9% tail latency – Extreme edge-case analysis; only 0.1% of messages exceed this
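The metrics above can be reproduced from a list of raw latency samples, for instance when post-processing saved data. The sketch below uses the nearest-rank method for percentiles and the population standard deviation. Both choices are assumptions, since the exact definitions used by DOCA Perftest are not stated here.

```python
import math
import statistics


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending list (pct in 0-100)."""
    rank = math.ceil(pct * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


def summarize_latency(samples_usec):
    """Compute the latency metrics listed above from raw samples."""
    ordered = sorted(samples_usec)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.pstdev(ordered),  # population std. deviation
        "p99": percentile(ordered, 99),
        "p99.9": percentile(ordered, 99.9),
    }


# Synthetic data: mostly 2 us, with a small slow tail
samples = [2.0] * 980 + [5.0] * 15 + [50.0] * 5
summary = summarize_latency(samples)
for name, value in summary.items():
    print(f"{name}: {value:.3f} us")
```

In this synthetic data set the median stays at the common-case value while the tail percentiles expose the slow samples, which is why the tail metrics are reported separately from the mean.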

How it is measured:

  • Uses RDMA verbs in a loop with tightly synchronized messaging

  • Time measurements are taken on the sender side for accuracy

  • Results are aggregated and reported per test

Interpretation tips:

  • Low mean/median, but high max/tail → Potential jitter or queuing delays

  • Low standard deviation → Reliable, predictable performance

  • Use tail latency (99%, 99.9%) for SLA validation and real-time applications
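These tips can be turned into an automated check on a results summary. The sketch below is illustrative only: the dictionary keys, the SLA limit, and the jitter ratio of 3.0 are hypothetical values chosen for the example, not thresholds defined by DOCA Perftest.

```python
def check_latency(summary, p99_limit_usec, jitter_ratio=3.0):
    """Return a list of findings for a latency summary.

    summary: dict with "median", "p99" and "p99.9" in microseconds
    (hypothetical layout). jitter_ratio is an arbitrary example threshold.
    """
    findings = []
    if summary["p99"] > p99_limit_usec:
        findings.append("p99 exceeds the SLA limit")
    if summary["p99.9"] > jitter_ratio * summary["median"]:
        findings.append("tail far above median: possible jitter or queuing")
    return findings


example_summary = {"median": 2.0, "p99": 5.0, "p99.9": 50.0}
findings = check_latency(example_summary, p99_limit_usec=10.0)
print(findings)
```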

Important notes:

  • DOCA Perftest offers improved Write latency accuracy compared to the legacy perftest tool

  • Differences in latency measurement logic may exist between tools—compare results carefully

© Copyright 2025, NVIDIA. Last updated on Sep 4, 2025.