
DOCA Perftest


NVIDIA DOCA Perftest is an RDMA benchmarking application for compute clusters, designed to evaluate performance from simple client-server setups up to cluster-level scenarios.

It provides benchmarking capabilities for bandwidth, message rate, and latency, supporting a wide range of RDMA operations with fine-grained control over test parameters.

Main Features:

  • Comprehensive RDMA Benchmarks – Supports bandwidth, message rate, and latency tests.

  • Unified Application for RDMA Operations – A single tool for all RDMA verbs, with extensive configuration options, including CUDA integration for GPUDirect RDMA.

  • Cluster-Wide Benchmarking – Run distributed tests across multiple nodes from a single initiating host.

  • Flexible Scenario Configuration – Define complex, multi-node, and multi-benchmark scenarios using a JSON input file.

  • Command-Line Simplicity – Run quick, single-node benchmarks via CLI for straightforward performance testing.

  • Synchronized Execution – Ensures multi-benchmark, multi-process, and multi-node tests start and finish simultaneously.

DOCA Perftest simplifies performance evaluation for RDMA-based applications.

DOCA Perftest depends on the following components, released as part of the DOCA package:

  • RDMA core

  • libibverbs

  • Open MPI

  • DOCA 3.0.0 and higher

  • Optional:

    • Passwordless SSH between all participating servers (required for multi-node benchmarks; a minimal setup sketch follows this list)

    • CUDA 12.8 (required for GPUDirect RDMA)
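
As a minimal sketch of the SSH prerequisite (assuming a standard OpenSSH installation; the user and host names are hypothetical), generate a key once on the initiating host and copy it to every participating server:

ssh-keygen -t ed25519
ssh-copy-id user@node01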

Simple Benchmarks

For simple benchmarks, run DOCA Perftest via CLI on both the client and server nodes.

Example Use Case:

Server (Responder):

doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10

Client (Requestor):


doca_perftest -co 1,2 -d mlx5_0 -ct UC -v send -m bw -s 1024 -D 10 -sn remote-server-name

Parameter Breakdown:

  • -co 1,2 – Runs synchronously on CPU cores 1 and 2.

  • -d mlx5_0 – Uses the local IB device mlx5_0.

  • -ct UC – Establishes an Unreliable Connection (UC).

  • -v send – Uses the Send verb for data transmission.

  • -m bw – Measures bandwidth (BW).

  • -s 1024 – Sets the message size to 1024 bytes.

  • -D 10 – Runs the test for 10 seconds.

  • -sn remote-server-name (Client only) – Specifies the remote server name to connect to.
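
The same flags compose other measurements. As an illustrative variation (not one of the shipped examples), a latency test over a Reliable Connection using the same device and server names would look like this:

Server (Responder):

doca_perftest -co 1 -d mlx5_0 -ct RC -v send -m lat -s 64 -D 10

Client (Requestor):

doca_perftest -co 1 -d mlx5_0 -ct RC -v send -m lat -s 64 -D 10 -sn remote-server-name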

Full list of command-line arguments

The following reference is also available from the command-line interface via the -h (or --help) option:


doca_perftest -h

  • -in / --input_file – Specify the input JSON file path. If specified, all other options are ignored (except -h / --debug).

  • -co / --cores – Mandatory. Specify the CPU cores to use, as a comma-separated list, a dash-separated range, or a mix of both.

  • -ct / --connection_type – Specify the connection type: RC/UC.

  • -v / --verb – Specify the RDMA verb type: read/write/send/writeImm. Default: write.

  • -m / --metric – Specify the metric type: bw/lat. Default: bw.

  • -s / --msg_size – Specify the message size in bytes. Default: 64k.

  • -l / --inline_size – Specify the inline size, up to 1024 bytes. Default: 0 (or the message size for Write/Send when the message size is < 220).

  • -tx / --tx_depth – Specify the depth of the send queue. Default: 128.

  • --poll_batch_size – Set the maximum number of WCs (cookies) in a single poll. Default: 16.

  • -mod – Specify the cookie moderation value, up to 8k messages. Default: 1.

  • -os / --old_post_send – Use the old post send instead of IB_WR_API. Default: IB_WR_API (new post send).

  • -nd / --no_ddp – Disable DDP. Default: DDP on in supported devices.

  • -or / --out_reads – Specify the number of outstanding reads. Default: 1.

  • -qp – Specify the number of QPs to use. Default: 1.

  • --poll_stat – Print the average number of cookies per poll for each process. Default: no print.

  • --qp_histogram – Print a histogram of QP workload fairness. Default: no print.

  • -i / --iterations – Specify the number of iterations for each QP. Mutually exclusive with -D. Default: 5000.

  • -D / --duration – Specify the overall traffic duration in seconds. Mutually exclusive with -i. Default: iteration-based run (-i).

  • -o / --output – Print only the specified output (BW/Lat/MR) and remove all other prints. Default: no specific output.

  • -ip – Specify the server's IPv4 address. Default: hardcoded.

  • -sn / --server_name – Specify the server's hostname on the client side. Mutually exclusive with -ip.

  • -sp / --server_port – Specify the server's port. Default: 18555.

  • -d / --device_name – Specify the local IB device name. Default: the first device in the system's list.

  • -w / --warmup – Specify the warmup time in seconds. Default: 2 seconds.

  • --disable_pcir – Disable PCI relaxed ordering. Default: enabled if possible.

  • --save_raw_data – Latency only. Save raw data to a JSON file (./results/raw_lat_<current_time>.json). Can be followed by a path, e.g., ~/rawResults/testNumber12.json. Default: no save.

  • -j / --json – Print the config and output to a JSON format file. Can be followed by a path, e.g., ~/docaPtResults/example/pathFor.db. If selected without a path, the default is ./results/doca_perftest_<testTime>_<Pid>.db. Default: no JSON print.

  • -u / --user – Specify the executor name for the JSON output.

  • -sd / --session_desc – Specify the session description for the JSON output.

  • --cuda <cuda device id> – Use CUDA memory instead of host memory. Default: host memory.
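
For example, a GPUDirect RDMA bandwidth test combines --cuda with the usual flags. The following is an illustrative sketch (assuming CUDA device 0 and the mlx5_0 device from the earlier examples), with -sn added on the client side as before:

doca_perftest -co 1,2 -d mlx5_0 -ct RC -v write -m bw -D 10 --cuda 0 -sn remote-server-name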

Complex Scenarios

For more advanced benchmarking setups, where configuring multiple nodes and benchmarks via the command line would be impractical, a JSON input file is used.

The JSON file allows users to define multi-node and multi-benchmark scenarios, specifying parameters for each participating host. To execute a scenario, pass the JSON file to doca_perftest using the -in argument:

doca_perftest -in path_to_scenario_file.json

Key Features:

  • Can be invoked from any node connected to the cluster.

  • Automatically deploys and runs benchmarks on all specified hosts.

  • Ensures synchronized execution, starting traffic simultaneously across all nodes.

  • Aggregates results on the invoking host for centralized analysis.

Using a JSON file simplifies the management of complex tests, enabling large-scale performance evaluations with minimal manual configuration.

Info

Examples of JSON scenario files can be found at /usr/share/doc/doca-perftest/examples.

We recommend starting by making a copy of a suitable example.
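
For orientation only, a hypothetical scenario file might be shaped like the following. Apart from "cores" (referenced in the Bandwidth section below), the field names here are illustrative assumptions rather than the documented schema, so treat the shipped examples as authoritative:

{
    "benchmarks": [
        {
            "cores": "1,2",
            "device": "mlx5_0",
            "connection_type": "RC",
            "verb": "write",
            "metric": "bw",
            "msg_size": 1024,
            "duration": 10,
            "server": "node01",
            "client": "node02"
        }
    ]
}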

Bandwidth

Bandwidth results represent the aggregate performance across all concurrent test processes, as specified by the user.

Performance Metrics:

  • Message Rate (million packets per second) – The total number of Completion Queue Entries (CQEs) processed per second across all test processes. This metric indicates how efficiently the system can handle a high volume of small messages.

  • Bandwidth (Gigabits per second) – The total data transfer rate, calculated by multiplying the number of received CQEs per second by the configured message size (-s / --msg_size). This metric evaluates the system's ability to sustain high-throughput communication.
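
As an illustrative calculation: a measured message rate of 10 million CQEs per second at a 4096-byte message size corresponds to 10,000,000 × 4096 bytes × 8 bits ≈ 327.7 Gb/s of bandwidth.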

Measurement Considerations:

  • Concurrency Handling – Results reflect the sum of CQEs across all concurrent test processes, as specified by the -co command-line argument or "cores" field in the input JSON file.

  • Test Duration – The duration is averaged across all test processes, ensuring consistency in measuring sustained performance over time.

Latency

The latency test measures the time for a message to travel between the client and server and back. The results provide key statistical insights into network performance. The interpretation of latency depends on the RDMA verb used in the test:

  • Send/Receive – Measures the one-way latency of messages sent by the client and received by the server.

  • Write – Measures the full round-trip time of a "ping-pong" RDMA operation: the client sends messages and the server writes them back.

The reported latency statistics provide deeper insights:

  • Minimum latency – The shortest observed time, representing the best-case scenario.

  • Maximum latency – The longest observed time, capturing worst-case delays.

  • Median latency – The middle value of all measurements, reducing the impact of extreme outliers.

  • Mean latency – The average latency across all iterations, providing an overall performance view.

  • Standard deviation – Measures the variability in latency, indicating consistency in network performance.

  • 99% tail latency – The maximum latency experienced by 99% of messages, useful for understanding typical worst-case scenarios.

  • 99.9% tail latency – The highest latency observed in 99.9% of cases, helping assess extreme outliers in performance-sensitive applications.
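
As a hypothetical reading: in a run of one million messages, the reported 99.9% tail latency means that roughly 1,000 messages (0.1%) were slower than that value.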

Info

RDMA benchmarking applications may measure latency differently and report different results (e.g., DOCA Perftest improves the accuracy of Write latency measurements compared to the legacy perftest tool).

© Copyright 2025, NVIDIA. Last updated on May 5, 2025.