
DOCA Perftest


NVIDIA DOCA Perftest is an RDMA benchmarking application for compute clusters, designed to evaluate performance from simple client-server setups up to cluster-level scenarios.

It provides benchmarking capabilities for bandwidth, message rate, and latency, supporting a wide range of RDMA operations with fine-grained control over test parameters.

Main Features:

  • Comprehensive RDMA Benchmarks – Supports bandwidth, message rate, and latency tests.

  • Unified Application for RDMA Operations – A single tool for all RDMA verbs, with extensive configuration options, including CUDA integration for GPUDirect RDMA.

  • Cluster-Wide Benchmarking – Run distributed tests across multiple nodes from a single initiating host.

  • Flexible Scenario Configuration – Define complex, multi-node, and multi-benchmark scenarios using a JSON input file.

  • Command-Line Simplicity – Run quick, single-node benchmarks via CLI for straightforward performance testing.

  • Synchronized Execution – Ensures multi-benchmark, multi-process, and multi-node tests start and finish simultaneously.

DOCA Perftest simplifies performance evaluation for RDMA-based applications.

DOCA Perftest depends on the following components, released as part of the DOCA package:

  • RDMA core

  • libibverbs

  • Open MPI

  • DOCA 3.0.0 and higher

  • Optional:

    • Passwordless SSH between all participating servers (required for multi-node benchmarks; a minimal setup sketch follows this list)

    • CUDA 12.8 (required for GPUDirect RDMA)
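
As a minimal sketch of the SSH prerequisite (assuming a standard OpenSSH installation; the user and host names are hypothetical), generate a key once on the initiating host and copy it to every participating server:

ssh-keygen -t ed25519
ssh-copy-id user@node01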

Simple Benchmarks

For simple benchmarks, run DOCA Perftest via CLI on both the client and server nodes.

Example Use Case:

Server (Responder):

doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10

Client (Requestor):


doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name

Parameter Breakdown:

  • -co 1,2 – Runs synchronously on CPU cores 1 and 2.

  • -d mlx5_0 – Uses the local IB device mlx5_0.

  • -ct UC – Establishes an Unreliable Connection (UC).

  • -v send – Uses the Send verb for data transmission.

  • -m bw – Measures bandwidth (BW).

  • -s 1024 – Sets the message size to 1024 bytes.

  • -D 10 – Runs the test for 10 seconds.

  • -sn remote-server-name (Client only) – Specifies the remote server name to connect to.
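
The same flags compose other measurements. As an illustrative variation (not one of the shipped examples), a latency test over a Reliable Connection using the same device and server names would look like this:

Server (Responder):

doca_perftest -co 1 -d mlx5_0 -ct RC -v send -m lat -s 64 -D 10

Client (Requestor):

doca_perftest -co 1 -d mlx5_0 -ct RC -v send -m lat -s 64 -D 10 -sn remote-server-name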

Full list of command-line arguments

The following reference is also available from the command-line interface via the -h (or --help) option:


doca_perftest -h

  • -in / --input_file – Specify the input JSON file path. If specified, all other options are ignored (except -h / --debug).

  • -co / --cores – Mandatory. Specify the CPU cores to use, as a comma-separated list, a dash-separated range, or a mix of both.

  • -ct / --connection_type – Specify the connection type: RC/UC.

  • -v / --verb – Specify the RDMA verb type: read/write/send/writeImm. Default: write.

  • -m / --metric – Specify the metric type: bw/lat. Default: bw.

  • -s / --msg_size – Specify the message size in bytes. Default: 64k.

  • -l / --inline_size – Specify the inline size, up to 1024 bytes. Default: 0 (or the message size for Write/Send when the message size is < 220).

  • -tx / --tx_depth – Specify the depth of the send queue. Default: 128.

  • --poll_batch_size – Set the maximum number of WCs (cookies) in a single poll. Default: 16.

  • -mod – Specify the cookie moderation value, up to 8k messages. Default: 1.

  • -os / --old_post_send – Use the old post send instead of IB_WR_API. Default: IB_WR_API (new post send).

  • -nd / --no_ddp – Disable DDP. Default: DDP on in supported devices.

  • -or / --out_reads – Specify the number of outstanding reads. Default: 1.

  • -qp – Specify the number of QPs to use. Default: 1.

  • --poll_stat – Print the average number of cookies per poll for each process. Default: no print.

  • --qp_histogram – Print a histogram of QP workload fairness. Default: no print.

  • -i / --iterations – Specify the number of iterations for each QP. Mutually exclusive with -D. Default: 5000.

  • -D / --duration – Specify the overall traffic duration in seconds. Mutually exclusive with -i. Default: iteration-based run (-i).

  • -o / --output – Print only the specified output (BW/Lat/MR) and remove all other prints. Default: no specific output.

  • -ip – Specify the server's IPv4 address. Default: hardcoded.

  • -sn / --server_name – Specify the server's hostname on the client side. Mutually exclusive with -ip.

  • -sp / --server_port – Specify the server's port. Default: 18555.

  • -d / --device_name – Specify the local IB device name. Default: the first device in the system's list.

  • -w / --warmup – Specify the warmup time in seconds. Default: 2 seconds.

  • --disable_pcir – Disable PCI relaxed ordering. Default: enabled if possible.

  • --save_raw_data – Latency only. Save raw data to a JSON file (./results/raw_lat_<current_time>.json). Can be followed by a path, e.g., ~/rawResults/testNumber12.json. Default: no save.

  • -j / --json – Print the config and output to a JSON format file. Can be followed by a path, e.g., ~/docaPtResults/example/pathFor.db. If selected without a path, the default is ./results/doca_perftest_<testTime>_<Pid>.db. Default: no JSON print.

  • -u / --user – Specify the executor name for the JSON output.

  • -sd / --session_desc – Specify the session description for the JSON output.

  • --cuda <cuda device id> – Use CUDA memory instead of host memory. Default: host memory.
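
For example, a GPUDirect RDMA bandwidth test combines --cuda with the usual flags. The following is an illustrative sketch (assuming CUDA device 0 and the mlx5_0 device from the earlier examples), with -sn added on the client side as before:

doca_perftest -co 1,2 -d mlx5_0 -ct RC -v write -m bw -D 10 --cuda 0 -sn remote-server-name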

Complex Scenarios

For more advanced benchmarking setups, where configuring multiple nodes and benchmarks via the command line would be impractical, a JSON input file is used.

The JSON file allows users to define multi-node and multi-benchmark scenarios, specifying parameters for each participating host. To execute a scenario, pass the JSON file to doca_perftest using the -in argument:

doca_perftest -in path_to_scenario_file.json

Key Features:

  • Can be invoked from any node connected to the cluster.

  • Automatically deploys and runs benchmarks on all specified hosts.

  • Ensures synchronized execution, starting traffic simultaneously across all nodes.

  • Aggregates results on the invoking host for centralized analysis.

Using a JSON file simplifies the management of complex tests, enabling large-scale performance evaluations with minimal manual configuration.

Info

Examples of JSON scenario files can be found at /usr/share/doc/doca-perftest/examples.

We recommend starting by making a copy of a suitable example.
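
For orientation only, a hypothetical scenario file might be shaped like the following. Apart from "cores" (referenced in the Bandwidth section below), the field names here are illustrative assumptions rather than the documented schema, so treat the shipped examples as authoritative:

{
    "benchmarks": [
        {
            "cores": "1,2",
            "device": "mlx5_0",
            "connection_type": "RC",
            "verb": "write",
            "metric": "bw",
            "msg_size": 1024,
            "duration": 10,
            "server": "node01",
            "client": "node02"
        }
    ]
}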

Bandwidth

Bandwidth results represent the aggregate performance across all concurrent test processes, as specified by the user.

Performance Metrics:

  • Message Rate (million packets per second) – The total number of Completion Queue Entries (CQEs) processed per second across all test processes. This metric indicates how efficiently the system can handle a high volume of small messages.

  • Bandwidth (Gigabits per second) – The total data transfer rate, calculated by multiplying the number of received CQEs per second by the configured message size (-s / --msg_size). This metric evaluates the system's ability to sustain high-throughput communication.
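
As an illustrative calculation: a measured message rate of 10 million CQEs per second at a 4096-byte message size corresponds to 10,000,000 × 4096 bytes × 8 bits ≈ 327.7 Gb/s of bandwidth.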

Measurement Considerations:

  • Concurrency Handling – Results reflect the sum of CQEs across all concurrent test processes, as specified by the -co command-line argument or "cores" field in the input JSON file.

  • Test Duration – The duration is averaged across all test processes, ensuring consistency in measuring sustained performance over time.

Latency

The latency test measures the time for a message to travel between the client and server and back. The results provide key statistical insights into network performance. The interpretation of latency depends on the RDMA verb used in the test:

  • Send/Receive – Measures the one-way latency of messages sent by the client and received by the server.

  • Write – Measures the full round-trip time of a "ping-pong" RDMA operation: the client sends messages and the server writes them back.

The reported latency statistics provide deeper insights:

  • Minimum latency – The shortest observed time, representing the best-case scenario.

  • Maximum latency – The longest observed time, capturing worst-case delays.

  • Median latency – The middle value of all measurements, reducing the impact of extreme outliers.

  • Mean latency – The average latency across all iterations, providing an overall performance view.

  • Standard deviation – Measures the variability in latency, indicating consistency in network performance.

  • 99% tail latency – The maximum latency experienced by 99% of messages, useful for understanding typical worst-case scenarios.

  • 99.9% tail latency – The highest latency observed in 99.9% of cases, helping assess extreme outliers in performance-sensitive applications.
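
As a hypothetical reading: in a run of one million messages, the reported 99.9% tail latency means that roughly 1,000 messages (0.1%) were slower than that value.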

Info

RDMA benchmarking applications may measure latency differently and report different results (e.g., DOCA Perftest improves the accuracy of Write latency measurements compared to the legacy perftest tool).

© Copyright 2025, NVIDIA. Last updated on May 5, 2025.