DOCA Documentation v3.2.0

DOCA Perftest

This guide describes DOCA Perftest, an RDMA benchmarking tool designed for compute clusters that enables fine-tuned evaluation of bandwidth, message rate, and latency across various RDMA operations and complex multi-node scenarios.

NVIDIA® doca-perftest is an RDMA benchmarking utility designed to evaluate performance across a wide range of compute and networking environments—from simple client-server tests to complex, distributed cluster scenarios.

It provides fine-grained benchmarking of bandwidth, message rate, and latency, while supporting diverse RDMA operations and configurations.

Key features:

  • Comprehensive RDMA Benchmarks – Supports bandwidth, message rate, and latency testing.

  • Unified RDMA Testing Tool – A single executable for all RDMA verbs, with rich configuration options and CUDA/GPUDirect RDMA integration.

  • Cluster-Wide Benchmarking – Run distributed tests across multiple nodes, initiated from a single host, with aggregated performance results.

  • Flexible Scenario Definition – Define complex multi-node, multi-test configurations via a JSON input file.

  • Command-Line Simplicity – Quickly run local or point-to-point benchmarks directly from the CLI.

  • Synchronized Execution – Ensures all benchmarks begin and end simultaneously for consistent results.

The doca-perftest utility simplifies evaluation and comparison of RDMA performance across applications and environments.

For simple benchmarks, doca-perftest can be run directly from the command line.

When invoked on the client, the utility automatically launches the corresponding server process (requires passwordless SSH) and selects optimal CPU cores on both systems based on NUMA affinity.

Example command:


# Run on client
doca_perftest -d mlx5_0 -n <server-host-name>

This is equivalent to running:


# On server
doca_perftest -d mlx5_0 -N 1 -c RC -v write -m bw -s 65536 -D 10

# On client
doca_perftest -d mlx5_0 -N 1 -c RC -v write -m bw -s 65536 -D 10 -n <server-host-name>

Parameter breakdown:

Parameter               Description
-d mlx5_0               Uses the device mlx5_0.
-N 1                    Runs one process, automatically selecting an optimal core. (Use -C <core> to specify manually.)
-c RC                   Uses a Reliable Connection (RC) transport.
-v write                Selects the Write verb for transmission.
-m bw                   Measures bandwidth.
-s 65536                Sets the message size to 65,536 bytes.
-D 10                   Runs for 10 seconds.
-n <server-host-name>   (Client only) Specifies the remote target host.

Info

For a full list of CLI arguments, run doca_perftest -h or man doca_perftest.

Info

If passwordless SSH is not configured, you must manually run doca-perftest on both client and server, ensuring parameters match.

For large-scale or multi-benchmark configurations, doca-perftest accepts a JSON input file defining all participating nodes, benchmarks, and parameters.

Example invocation:


doca_perftest -f path_to_scenario_file.json

JSON mode advantages:

  • Can be initiated from any node in the cluster (even non-participating ones).

  • Synchronizes benchmark start and stop across all nodes.

  • Aggregates all metrics on the initiating host.

  • Supports predefined traffic patterns such as ALL_TO_ALL, MANY_TO_ONE, ONE_TO_MANY, and BISECTION.

  • Fully compatible with all CLI parameters — JSON parameters inherit the same defaults.

Info

Example JSON configuration files are provided under: /usr/share/doc/doca-perftest/examples/. It is recommended to start by copying and modifying an existing example file.
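For orientation, the following is a minimal sketch of the core fields a scenario file might contain, reusing the field names from the traffic-pattern examples later in this guide (testNodes, deviceName, trafficPattern, trafficDirection); treat the shipped example files as the authoritative schema:

"testNodes": [
    {"hostname": "node01", "deviceName": "mlx5_0"},
    {"hostname": "node02", "deviceName": "mlx5_0"}
],
"trafficPattern": "ONE_TO_ONE",
"trafficDirection": "UNIDIR"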

Bandwidth

Bandwidth tests measure the aggregate data transfer rate and message-handling efficiency across all participating processes.

Metrics collected:

  • Message Rate (Mpps): Number of Completion Queue Entries (CQEs) processed per second.

  • Bandwidth (Gb/s): Total throughput (bandwidth = message_rate × message_size).
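For example, with 65,536-byte messages, a hypothetical measured rate of 0.19 Mpps corresponds to 0.19 × 10⁶ msg/s × 65,536 B × 8 b/B ≈ 99.6 Gb/s, close to saturating a 100 Gb/s link.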

Measurement notes:

  • Results are aggregated across all active test processes.

  • Concurrency is controlled via -co (CLI) or the cores field (JSON).

  • Test duration is averaged across processes for consistent sampling.

Interpretation tips:

Observation                             Possible Cause
High message rate, low bandwidth        Small message sizes
High bandwidth, moderate message rate   Larger messages or fewer CQEs

These results help optimize network saturation, queue depth, and core allocation strategies.

Latency

Latency tests measure the delay between message transmission and acknowledgment. The measured direction depends on the RDMA verb used.

RDMA verb modes:

Verb           Measurement Type
Send/Receive   One-way latency (Client → Server)
Write          Round-trip latency (Client → Server → Client)

Metrics collected:

  • Minimum latency – Fastest observed transaction

  • Maximum latency – Longest observed transaction

  • Mean latency – Average across all iterations

  • Median latency – Midpoint value (less influenced by outliers)

  • Standard deviation – Variability indicator

  • 99% tail latency – 99% of messages completed within this time

  • 99.9% tail latency – Outlier detection for extreme cases

Measurement notes:

  • Latency measured using tight RDMA verb loops.

  • Timing collected on the sender side for accuracy.

  • Aggregated across processes for final reporting.

Interpretation tips:

Pattern                          Insight
Low mean/median, high max/tail   Indicates jitter or queue buildup
Low standard deviation           Indicates stable and predictable performance
High 99%/99.9% tail              Indicates possible SLA breaches in real-time workloads

Info

doca-perftest provides improved write latency accuracy over legacy perftest tools.

Info

Differences in latency measurement methodologies exist; compare tools carefully when validating results.


This section highlights some of the most commonly used parameters and use-cases.

Unidirectional vs Bidirectional Traffic

doca-perftest supports two traffic-flow modes that fundamentally change how data moves between nodes and how resources are allocated.

Unidirectional Traffic (Default)

  • In unidirectional mode, traffic flows in one direction only.

  • The client (requestor) initiates operations, and the server (responder) receives them.

  • This is the default mode and provides clear, predictable performance metrics.

Bidirectional Traffic

In bidirectional mode, traffic flows in both directions simultaneously. Each side acts as both requestor and responder, creating full-duplex communication.

Bidirectional tests use two traffic runners (requestor + responder) that share resources, so the aggregate bandwidth may differ from twice the unidirectional result.

Run bidirectional traffic from the command line:


# Enable bidirectional traffic
doca_perftest -d mlx5_0 -n <server-name> -b

For JSON mode, use the "trafficDirection" field and set it to "BIDIR" or "UNIDIR".
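For example, a one-line addition to a scenario file (field casing matching the pattern examples below):

"trafficDirection": "BIDIR"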

Traffic Patterns

Traffic patterns provide built-in shortcuts for complex multi-node communication scenarios.

While these configurations were always possible through detailed JSON definitions, traffic patterns dramatically simplify setup for common topologies.

Example JSONs using traffic patterns are available under /usr/share/doc/doca-perftest/examples.

Available patterns:

  • ONE_TO_ONE

  • ONE_TO_MANY

  • MANY_TO_ONE

  • ALL_TO_ALL

  • BISECTION

Note

Multicast is not supported. Each connection is point-to-point, synchronized to start simultaneously.

Traffic patterns collapse complex multi-node wiring into a few lines of JSON: instead of manually listing dozens of connections, you specify a regex-like host list and a pattern (e.g., ALL_TO_ALL), and doca-perftest generates and synchronizes all connections for you.

One-to-One (O2O)

Simple point-to-point between two nodes; useful for baseline performance testing.


"testNodes": [ {"hostname": "node01", "deviceName": "mlx5_0"}, {"hostname": "node02", "deviceName": "mlx5_0"} ], "trafficPattern": "ONE_TO_ONE"


One-to-Many (O2M)

Single sender to multiple receivers; the first node sends to all others.


"testNodes": [ {"hostname": "sender", "deviceName": "mlx5_0"}, {"hostname": "receiver[1-10]", "deviceName": "mlx5_0"} ], "trafficPattern": "ONE_TO_MANY"

This creates 10 connections: sender→receiver1, sender→receiver2, ..., sender→receiver10.

Many-to-One (M2O)

Multiple senders to one receiver; all nodes send to the first node.


"testNodes": [ {"hostname": "aggregator", "deviceName": "mlx5_0"}, {"hostname": "client[01-20]", "deviceName": "mlx5_0"} ], "trafficPattern": "MANY_TO_ONE"

This creates 20 connections: client1→aggregator, client2→aggregator, ..., client20→aggregator.

All-to-All (A2A)

Full-mesh connectivity; every node connects to every other node.


"testNodes": [ {"hostname": "compute[01-16]", "deviceName": "mlx5_0"} ], "trafficPattern": "ALL_TO_ALL", "trafficDirection": "UNIDIR"

This creates 240 connections (16×15) for unidirectional, or 120 bidirectional pairs.

Bisection (B)

Divides nodes into two equal halves; the first half connects to the second half. Requires an even number of nodes.


"testNodes": [ {"hostname": "rack1-[01-10]", "deviceName": "mlx5_0"}, {"hostname": "rack2-[01-10]", "deviceName": "mlx5_0"} ], "trafficPattern": "BISECTION"

This creates 10 connections: rack1-01↔rack2-01, rack1-02↔rack2-02, ..., rack1-10↔rack2-10.

Multiprocess (Cores)

doca-perftest can run synchronized multi-process tests, ensuring traffic starts simultaneously across all cores.

By default, it runs a single process on one automatically selected core.
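Before pinning cores manually with -C, it can help to check which NUMA node the device belongs to; the commands below are standard Linux tooling, not specific to doca-perftest:

# NUMA node of the RDMA device's PCI function
cat /sys/class/infiniband/mlx5_0/device/numa_node

# CPU ranges per NUMA node
lscpu | grep NUMA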

Process and core selection:

Option                 Description
-N / "num_processes"   Number of processes to run; cores are auto-selected.
-C / "cores"           Explicitly specify core IDs or ranges.

Examples:


# Run on 3 synchronized processes (cores auto-selected)
doca_perftest -d mlx5_0 -n <server> -N 3

# Run on specific cores
doca_perftest -d mlx5_0 -n <server> -C 5
doca_perftest -d mlx5_0 -n <server> -C 5,7
doca_perftest -d mlx5_0 -n <server> -C 5-9


Working with GPUs – Device Selection

doca-perftest can automatically select the most suitable GPU for each network device based on PCIe topology proximity. The ranking follows NVIDIA's nvidia-smi topo hierarchy: NV > PIX > PXB > PHB > NODE > SYS.

This ensures that the GPU closest to the NIC is chosen, minimizing latency and maximizing throughput.
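To inspect this topology yourself, nvidia-smi (installed with the NVIDIA driver) can print the connectivity matrix the ranking is based on:

# Show the GPU/NIC PCIe connectivity matrix; the legend explains NV, PIX, PXB, PHB, NODE, and SYS
nvidia-smi topo -m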

Although auto-selection is the default behavior, users can still manually specify a GPU device using the -G argument in CLI mode, or the "cuda_dev" field in JSON mode.


# Manually choose a specific GPU
doca_perftest -d mlx5_0 -n server-name -G 0

# Automatically select both GPU and memory type (recommended)
doca_perftest -d mlx5_0 -n server-name -M cuda

# Deprecated syntax (still supported, equivalent to cuda_auto_detect)
doca_perftest -d mlx5_0 -n server-name --cuda 0


Working with GPUs – Memory Types

RDMA operations can leverage GPU memory directly, bypassing CPU involvement for maximum throughput and minimal latency.

doca-perftest supports several CUDA memory modes optimized for different hardware and driver configurations.

Auto-Detection Mode (cuda_auto_detect)

Automatically selects the best available CUDA memory type in this order:

  1. Data Direct

  2. DMA-BUF

  3. Peermem

This is the recommended mode for most users.

Example usage:


# Auto-detect best GPU memory type (recommended)
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0

# With custom CUDA library path
doca_perftest -d mlx5_0 -n server-name -M cuda -G 0 --cuda_lib_path /usr/local/cuda-12/lib64

# Deprecated but equivalent syntax
doca_perftest -d mlx5_0 -n server-name --cuda 0

Info

Fallback behavior: With -M cuda_auto_detect, doca_perftest automatically tries cuda_data_direct → cuda_dmabuf → cuda_peermem, in that order.


Standard CUDA Memory (cuda_peermem)

Traditional CUDA peer-memory allocation.

Supported on all CUDA-capable systems, though with slightly higher overhead compared to newer methods.


# Explicitly force peermem (bypasses auto-detect)
doca_perftest -d mlx5_0 -n server-name -M cuda_peermem -G 0

# Auto-detect fallback order (when using -M cuda_auto_detect):
# 1) cuda_data_direct (fastest, requires HW/driver support)
# 2) cuda_dmabuf
# 3) cuda_peermem (universal fallback)


DMA-BUF Memory (cuda_dmabuf)

Uses the Linux DMA-BUF framework for zero-copy GPU–NIC transfers. Requires CUDA 11.7+ and kernel support.


doca_perftest -d mlx5_0 -n server-name -M cuda_dmabuf -G 0


Data Direct Memory (cuda_data_direct)

Most efficient GPU memory access method using direct PCIe mappings. Requires specific hardware and driver support; provides the lowest latency and highest throughput.


doca_perftest -d mlx5_0 -n server-name -M cuda_data_direct -G 0

Memory Types

Beyond GPU memory types, doca-perftest supports several memory allocation strategies for RDMA operations.

Host Memory (host)

Default mode using standard system RAM.


# Default host memory usage
doca_perftest -d mlx5_0 -n <server-name>

# Explicitly specify host memory
doca_perftest -d mlx5_0 -n <server-name> -M host


Null Memory Region (nullmr)

Does not allocate real memory; useful for ultra-low-latency synthetic tests.


# Null memory region for bandwidth testing
doca_perftest -d mlx5_0 -n <server-name> -M nullmr


Device Memory (device)

Allocates memory directly on the adapter hardware (limited by on-board capacity).


# Device memory allocation on the adapter
doca_perftest -d mlx5_0 -n <server-name> -M device

RDMA Drivers

The following RDMA driver backends are supported:

Note

The available drivers depend on your installed packages and hardware.

Driver             Prerequisites                                                             Usage
IBV (libibverbs)   Installed via MLNX_OFED or inbox drivers; works on all IB/RoCE adapters   -r ibv (default)
DV (doca_verbs)    Requires the doca-sdk-verbs package                                       -r dv


Auto-Launching Remote Server

doca-perftest can automatically launch the remote server via SSH (CLI-only).

Requires passwordless SSH and identical versions on both sides.


# Auto-launch server (default)
doca_perftest -d mlx5_0 -n server-name

# Disable auto-launch
doca_perftest -d mlx5_0 -n server-name --launch_server disable
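If passwordless SSH is not yet configured, the standard OpenSSH tooling is sufficient (shown here as a convenience; the user and host names are placeholders):

# Generate a key if you do not already have one
ssh-keygen -t ed25519

# Install the public key on the server
ssh-copy-id <user>@<server-name>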

Server override examples:


# Server uses a different device than the client
doca_perftest -d mlx5_0 -n server-name --server_device mlx5_1

# Server uses a different memory type
doca_perftest -d mlx5_0 -n server-name -M host --server_mem_type cuda_auto_detect

# Server runs on specific cores
doca_perftest -d mlx5_0 -n server-name -C 0-3 --server_cores 4-7

# Alternate server executable path
doca_perftest -d mlx5_0 -n server-name --server_exe /tmp/other_doca_perftest_version

# Different SSH username (must be covered by the passwordless SSH setup)
doca_perftest -d mlx5_0 -n server-name --server_username testuser


QP Histogram

The QP histogram provides visibility into how work is distributed across multiple queue pairs during a test. This is useful for identifying load balancing issues, scheduling inefficiencies, or hardware limitations when using multiple QPs.

Enabling QP histogram:


# Enable QP histogram with multiple queue pairs
doca_perftest -d mlx5_0 -n server-name -q 8 -H

Example output:


--------------------- QP WORK DISTRIBUTION ---------------------
Qp num 0: ████████████████████████ 45.23 Gbit/sec | Relative deviation: -2.1%
Qp num 1: █████████████████████████ 46.89 Gbit/sec | Relative deviation: 1.5%
Qp num 2: ████████████████████████ 45.67 Gbit/sec | Relative deviation: -1.2%
Qp num 3: █████████████████████████████ 48.21 Gbit/sec | Relative deviation: 4.3%


TPH

TPH (TLP Processing Hints) is a PCIe optimization that provides hints to CPUs for cache management and reduced memory-access latency.

Info

Requires ConnectX-6 or later hardware and a TPH-enabled kernel.

Parameters:

Option          Meaning
--ph            Processing hint: 0 = Bidirectional (default), 1 = Requester, 2 = Completer, 3 = High-priority completer
--tph_core_id   Target CPU core for TPH handling
--tph_mem       Memory type: pm = Persistent, vm = Volatile

Examples:


# Invalid: Core ID without memory type
doca_perftest -d mlx5_0 -n server-name --tph_core_id 0   # ERROR

# Invalid: Memory type without core ID
doca_perftest -d mlx5_0 -n server-name --tph_mem pm   # ERROR

# Valid: Both or neither
doca_perftest -d mlx5_0 -n server-name --ph 1                                # OK (hints only)
doca_perftest -d mlx5_0 -n server-name --ph 1 --tph_core_id 0 --tph_mem pm   # OK (full config)


doca-perftest integrates seamlessly with the SLURM job scheduler, leveraging MPI for multi-node orchestration within SLURM allocations.

The following is a basic usage example with salloc:

  1. Allocate nodes via SLURM (e.g., salloc -N8).

  2. Update the JSON to include the allocated nodes (a snippet for listing the allocated hostnames follows these steps). Simple bisection example:


    "testNodes": [  {"hostname""rack1-[01-03]""deviceName""mlx5_0"},                 {"hostname""rack2-[04-07]""deviceName""mlx5_0"} ], "trafficPattern""BISECTION"

  3. Run doca-perftest with the updated JSON:


    # Run the scenario with the updated JSON
    doca_perftest -f <updated-json>
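To fill in step 2 quickly, the hostnames granted by SLURM can be listed from inside the allocation; the commands below use standard SLURM utilities, and bisection.json is a hypothetical file name:

# List the hosts granted to this allocation
scontrol show hostnames "$SLURM_JOB_NODELIST"

# Launch the updated scenario from any node in the allocation
doca_perftest -f bisection.json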

© Copyright 2025, NVIDIA. Last updated on Nov 20, 2025