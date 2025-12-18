This section highlights some of the most commonly used parameters and use-cases.

doca-perftest supports two traffic-flow modes that fundamentally change how data moves between nodes and how resources are allocated.

In unidirectional mode, traffic flows in one direction only.

The client (requestor) initiates operations, and the server (responder) receives them.

This is the default mode and provides clear, predictable performance metrics.

In bidirectional mode, traffic flows in both directions simultaneously. Each side acts as both requestor and responder, creating full-duplex communication.

Bidirectional tests use two traffic runners (requestor and responder) that share resources, so the aggregate bandwidth may differ from twice the unidirectional result.

Run bidirectional traffic from the command line:

    # Enable bidirectional traffic
    doca_perftest -d mlx5_0 -n <server-name> -b

For JSON mode, set the "trafficDirection" field to "BIDIR" or "UNIDIR".

Traffic patterns provide built-in shortcuts for complex multi-node communication scenarios.

While these configurations were always possible through detailed JSON definitions, traffic patterns dramatically simplify setup for common topologies.

Example JSONs using traffic patterns are available under /usr/share/doc/doca-perftest/examples.

Available patterns:

ONE_TO_ONE

ONE_TO_MANY

MANY_TO_ONE

ALL_TO_ALL

BISECTION

Note: Multicast is not supported. Each connection is point-to-point, and all connections are synchronized to start simultaneously.

Traffic patterns collapse complex multi-node wiring into a few lines of JSON. Instead of manually listing dozens of connections, you specify a regex-like host list and a pattern (e.g., ALL_TO_ALL), and doca-perftest generates and synchronizes all connections for you.
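To make the host-list shorthand concrete, here is a minimal Python sketch of how a bracketed range such as `receiver[1-10]` or `client[01-20]` could expand into explicit hostnames. The syntax handled here (`prefix[START-END]suffix`, with zero padding kept when the start bound has a leading zero) is inferred from the examples in this section; `expand_hosts` is a hypothetical helper, not part of doca-perftest, and the tool's own parser may accept more forms.

```python
import re

# Assumed syntax, inferred from the examples: "prefix[START-END]suffix".
_RANGE = re.compile(r"^(.*?)\[(\d+)-(\d+)\](.*)$")


def expand_hosts(pattern: str) -> list[str]:
    """Expand one host pattern into a list of explicit hostnames."""
    match = _RANGE.match(pattern)
    if match is None:
        return [pattern]  # plain hostname, nothing to expand
    prefix, start, end, suffix = match.groups()
    # Keep zero padding when the start bound is written with a leading zero.
    width = len(start) if start.startswith("0") else 0
    return [
        f"{prefix}{str(i).zfill(width)}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]


print(expand_hosts("receiver[1-10]")[:3])   # first three receivers
print(expand_hosts("client[01-20]")[-1])    # last client
```

A plain hostname such as `sender` passes through unchanged, so the same helper can be applied to every "hostname" entry.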

ONE_TO_ONE: Simple point-to-point between two nodes; useful for baseline performance testing.

    "testNodes": [
      { "hostname": "node01", "deviceName": "mlx5_0" },
      { "hostname": "node02", "deviceName": "mlx5_0" }
    ],
    "trafficPattern": "ONE_TO_ONE"
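When many such fragments are needed, they can be generated rather than written by hand. The sketch below builds only the keys that appear in this section ("testNodes", "trafficPattern", and optionally "trafficDirection"); the rest of the JSON schema is not shown here, so treat the output as a fragment to merge into a complete configuration. `traffic_fragment` is a hypothetical helper name.

```python
import json

VALID_PATTERNS = {"ONE_TO_ONE", "ONE_TO_MANY", "MANY_TO_ONE", "ALL_TO_ALL", "BISECTION"}
VALID_DIRECTIONS = {"UNIDIR", "BIDIR"}


def traffic_fragment(nodes, pattern, direction=None):
    """Build the traffic-pattern keys; nodes is a list of (hostname, device) pairs."""
    if pattern not in VALID_PATTERNS:
        raise ValueError(f"unknown traffic pattern: {pattern}")
    if direction is not None and direction not in VALID_DIRECTIONS:
        raise ValueError(f"unknown traffic direction: {direction}")
    fragment = {
        "testNodes": [{"hostname": h, "deviceName": d} for h, d in nodes],
        "trafficPattern": pattern,
    }
    if direction is not None:
        # Key spelling follows the ALL_TO_ALL example in this section.
        fragment["trafficDirection"] = direction
    return fragment


nodes = [("node01", "mlx5_0"), ("node02", "mlx5_0")]
print(json.dumps(traffic_fragment(nodes, "ONE_TO_ONE"), indent=2))
```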





ONE_TO_MANY: Single sender to multiple receivers; the first node sends to all others.

    "testNodes": [
      { "hostname": "sender", "deviceName": "mlx5_0" },
      { "hostname": "receiver[1-10]", "deviceName": "mlx5_0" }
    ],
    "trafficPattern": "ONE_TO_MANY"

This creates 10 connections: sender→receiver1, sender→receiver2, ..., sender→receiver10.

MANY_TO_ONE: Multiple senders to one receiver; all nodes send to the first node.

    "testNodes": [
      { "hostname": "aggregator", "deviceName": "mlx5_0" },
      { "hostname": "client[01-20]", "deviceName": "mlx5_0" }
    ],
    "trafficPattern": "MANY_TO_ONE"

This creates 20 connections: client01→aggregator, client02→aggregator, ..., client20→aggregator.

ALL_TO_ALL: Full-mesh connectivity; every node connects to every other node.

    "testNodes": [
      { "hostname": "compute[01-16]", "deviceName": "mlx5_0" }
    ],
    "trafficPattern": "ALL_TO_ALL",
    "trafficDirection": "UNIDIR"

This creates 240 connections (16×15) in unidirectional mode, or 120 bidirectional pairs.

BISECTION: Divides nodes into two equal halves; the first half connects to the second half. Requires an even number of nodes.

    "testNodes": [
      { "hostname": "rack1-[01-10]", "deviceName": "mlx5_0" },
      { "hostname": "rack2-[01-10]", "deviceName": "mlx5_0" }
    ],
    "trafficPattern": "BISECTION"

This creates 10 connections: rack1-01↔rack2-01, rack1-02↔rack2-02, ..., rack1-10↔rack2-10.
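The connection counts quoted for each pattern can be reproduced with a short Python sketch. It models the pairing rules exactly as described in this section (first node as the single sender or receiver, ordered pairs for the full mesh, index-wise pairing of the two halves for bisection). `connections` is a hypothetical helper for illustration and says nothing about how doca-perftest builds its connections internally.

```python
from itertools import permutations


def connections(pattern, nodes):
    """Return (source, destination) pairs for an already expanded node list."""
    if len(nodes) < 2:
        raise ValueError("at least two nodes are required")
    first, rest = nodes[0], nodes[1:]
    if pattern == "ONE_TO_ONE":
        if len(nodes) != 2:
            raise ValueError("ONE_TO_ONE expects exactly two nodes")
        return [(nodes[0], nodes[1])]
    if pattern == "ONE_TO_MANY":
        return [(first, dst) for dst in rest]
    if pattern == "MANY_TO_ONE":
        return [(src, first) for src in rest]
    if pattern == "ALL_TO_ALL":
        return list(permutations(nodes, 2))  # every ordered pair, n * (n - 1)
    if pattern == "BISECTION":
        if len(nodes) % 2:
            raise ValueError("BISECTION requires an even number of nodes")
        half = len(nodes) // 2
        return list(zip(nodes[:half], nodes[half:]))
    raise ValueError(f"unknown traffic pattern: {pattern}")


compute = [f"compute{i:02d}" for i in range(1, 17)]
racks = [f"rack1-{i:02d}" for i in range(1, 11)] + [f"rack2-{i:02d}" for i in range(1, 11)]
print(len(connections("ALL_TO_ALL", compute)))  # 240
print(len(connections("BISECTION", racks)))     # 10
```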

doca-perftest can run synchronized multi-process tests, ensuring traffic starts simultaneously across all cores.

By default, it runs a single process on one automatically selected core.

Process and core selection:

| Option | Description |
|---|---|
| -N / "num_processes" | Number of processes; cores auto-selected. |
| -C / "cores" | Explicitly specify core IDs or ranges. |

Examples:

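    # Run 3 processes on automatically selected cores
    doca_perftest -d mlx5_0 -n <server> -N 3
    # Run on core 5
    doca_perftest -d mlx5_0 -n <server> -C 5
    # Run on cores 5 and 7
    doca_perftest -d mlx5_0 -n <server> -C 5,7
    # Run on cores 5 through 9
    doca_perftest -d mlx5_0 -n <server> -C 5-9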
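For scripting around the -C option, the following Python sketch expands a core specification into explicit core IDs. It covers the three forms shown above (single ID, comma-separated list, inclusive range); accepting a mix such as `0-3,8` is an assumption of this sketch, and `parse_cores` is a hypothetical helper rather than doca-perftest code.

```python
def parse_cores(spec: str) -> list[int]:
    """Expand a core spec such as "5", "5,7" or "5-9" into a list of core IDs."""
    cores = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {part}")
            cores.extend(range(low, high + 1))  # ranges are inclusive
        else:
            cores.append(int(part))
    return cores


print(parse_cores("5-9"))  # [5, 6, 7, 8, 9]
```

The length of the returned list gives the number of processes a given -C value would pin.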





doca-perftest can automatically select the most suitable GPU for each network device based on PCIe topology proximity. The ranking follows NVIDIA's nvidia-smi topo hierarchy: NV > PIX > PXB > PHB > NODE > SYS.

This ensures that the GPU closest to the NIC is chosen, minimizing latency and maximizing throughput.
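The ranking can be illustrated with a small Python sketch that picks the GPU with the closest link type to a given NIC. The input mapping (GPU index to link label, as read from one row of the `nvidia-smi topo -m` matrix) and the tie-break on the lowest GPU index are assumptions of this sketch; `closest_gpu` is a hypothetical helper and not the selection code used by doca-perftest.

```python
# Closest first, following the hierarchy quoted above.
RANK = ["NV", "PIX", "PXB", "PHB", "NODE", "SYS"]


def link_rank(link: str) -> int:
    """Map a topology label to its position in the hierarchy (0 is closest)."""
    # nvidia-smi reports NVLink connections as NV1, NV2, ...; treat them all as NV.
    key = "NV" if link.startswith("NV") else link
    return RANK.index(key)  # raises ValueError for an unknown label


def closest_gpu(links: dict[int, str]) -> int:
    """Return the GPU index with the best-ranked link; ties go to the lowest index."""
    return min(links, key=lambda gpu: (link_rank(links[gpu]), gpu))


# Hypothetical link labels between one NIC and three GPUs.
print(closest_gpu({0: "SYS", 1: "PIX", 2: "NODE"}))  # 1
```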

Although auto-selection is the default behavior, users can still manually specify a GPU device using the -G argument in CLI mode, or the "cuda_dev" field in JSON mode.

    doca_perftest -d mlx5_0 -n server-name -G 0
    doca_perftest -d mlx5_0 -n server-name -M cuda
    doca_perftest -d mlx5_0 -n server-name --cuda 0





RDMA operations can leverage GPU memory directly, bypassing CPU involvement for maximum throughput and minimal latency.

doca-perftest supports several CUDA memory modes optimized for different hardware and driver configurations.

Auto-detect mode automatically selects the best available CUDA memory type, trying them in this order:

1. Data Direct
2. DMA-BUF
3. Peermem

This is the recommended mode for most users.

Examples:

    doca_perftest -d mlx5_0 -n server-name -M cuda -G 0
    doca_perftest -d mlx5_0 -n server-name -M cuda -G 0 --cuda_lib_path /usr/local/cuda-12/lib64
    doca_perftest -d mlx5_0 -n server-name --cuda 0

Info: Fallback behavior: with -M cuda_auto_detect, doca_perftest automatically tries cuda_data_direct → cuda_dmabuf → cuda_peermem, in this order.
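The fallback chain amounts to a first-match search over an ordered list. The Python sketch below shows that logic in isolation; the `is_supported` callback is a stand-in for whatever hardware and driver probing the tool performs, which this section does not describe, and `pick_cuda_mem_type` is a hypothetical name.

```python
# Order taken from the fallback description above.
FALLBACK_ORDER = ["cuda_data_direct", "cuda_dmabuf", "cuda_peermem"]


def pick_cuda_mem_type(is_supported):
    """Return the first memory type in FALLBACK_ORDER that the probe accepts."""
    for mem_type in FALLBACK_ORDER:
        if is_supported(mem_type):
            return mem_type
    raise RuntimeError("no supported CUDA memory type found")


# Hypothetical system where Data Direct is unavailable.
available = {"cuda_dmabuf", "cuda_peermem"}
print(pick_cuda_mem_type(lambda mem_type: mem_type in available))  # cuda_dmabuf
```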





Peermem (-M cuda_peermem): traditional CUDA peer-memory allocation.

Supported on all CUDA-capable systems, though with slightly higher overhead compared to newer methods.

    # Explicitly force peermem (bypasses auto-detect)
    doca_perftest -d mlx5_0 -n server-name -M cuda_peermem -G 0

    # Auto-detect fallback order (when using -M cuda_auto_detect):
    # 1) cuda_data_direct (fastest, requires HW/driver support)
    # 2) cuda_dmabuf
    # 3) cuda_peermem (universal fallback)





DMA-BUF (-M cuda_dmabuf): uses the Linux DMA-BUF framework for zero-copy GPU–NIC transfers. Requires CUDA 11.7+ and kernel support.

    doca_perftest -d mlx5_0 -n server-name -M cuda_dmabuf -G 0





Data Direct (-M cuda_data_direct): the most efficient GPU memory access method, using direct PCIe mappings. Requires specific hardware and driver support; provides the lowest latency and highest throughput.

    doca_perftest -d mlx5_0 -n server-name -M cuda_data_direct -G 0

Beyond GPU memory types, doca-perftest supports several memory allocation strategies for RDMA operations.

Host memory (-M host): the default mode, using standard system RAM.

    # Default host memory usage
    doca_perftest -d mlx5_0 -n <server-name>

    # Explicitly specify host memory
    doca_perftest -d mlx5_0 -n <server-name> -M host





Null memory region (-M nullmr): does not allocate real memory; useful for ultra-low-latency synthetic tests.

    doca_perftest -d mlx5_0 -n <server-name> -M nullmr





Device memory (-M device): allocates memory directly on the adapter hardware (limited by on-board capacity).

    doca_perftest -d mlx5_0 -n <server-name> -M device

The following RDMA driver backends are supported:

Note: The available drivers depend on your installed packages and hardware.

| Driver | Prerequisites | Usage |
|---|---|---|
| IBV (libibverbs) | Installed via MLNX_OFED or inbox drivers; works on all IB/RoCE adapters | -r ibv (default) |
| DV (doca_verbs) | Requires doca-sdk-verbs package | -r dv |

doca-perftest can automatically launch the remote server via SSH (CLI-only).

Requires passwordless SSH and identical versions on both sides.

    doca_perftest -d mlx5_0 -n server-name
    # Disable automatic server launch
    doca_perftest -d mlx5_0 -n server-name --launch_server disable
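Because automatic launch depends on passwordless SSH, it can help to verify that prerequisite before starting a test. The Python sketch below uses the standard OpenSSH `BatchMode=yes` option, which makes `ssh` fail instead of prompting for a password. The helper names and the 5-second timeout are choices made for this sketch, and the check is independent of doca-perftest itself.

```python
import subprocess


def ssh_check_command(host: str) -> list[str]:
    """Build an ssh command that succeeds only if no password prompt is needed."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def has_passwordless_ssh(host: str) -> bool:
    """Run the check; returns False on any failure, including a missing ssh binary."""
    try:
        result = subprocess.run(ssh_check_command(host), capture_output=True)
    except FileNotFoundError:
        return False
    return result.returncode == 0


print(" ".join(ssh_check_command("server-name")))
```

Calling `has_passwordless_ssh("server-name")` before the test gives an early, readable failure instead of a stalled launch.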

Server override examples:

    doca_perftest -d mlx5_0 -n server-name --server_device mlx5_1
    doca_perftest -d mlx5_0 -n server-name -M host --server_mem_type cuda_auto_detect
    doca_perftest -d mlx5_0 -n server-name -C 0-3 --server_cores 4-7
    doca_perftest -d mlx5_0 -n server-name --server_exe /tmp/other_doca_perftest_version
    doca_perftest -d mlx5_0 -n server-name --server_username testuser





The QP histogram provides visibility into how work is distributed across multiple queue pairs during a test. This is useful for identifying load balancing issues, scheduling inefficiencies, or hardware limitations when using multiple QPs.

Enabling QP histogram:

    doca_perftest -d mlx5_0 -n server-name -q 8 -H

Example output:

    --------------------- QP WORK DISTRIBUTION ---------------------
    Qp num 0: ████████████████████████ 45.23 Gbit/sec | Relative deviation: -2.1%
    Qp num 1: █████████████████████████ 46.89 Gbit/sec | Relative deviation: 1.5%
    Qp num 2: ████████████████████████ 45.67 Gbit/sec | Relative deviation: -1.2%
    Qp num 3: █████████████████████████████ 48.21 Gbit/sec | Relative deviation: 4.3%
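One plausible reading of the "Relative deviation" column is each QP's bandwidth compared with the mean across all QPs in the test. The Python sketch below computes that quantity; the definition is an assumption made here for illustration, since this section does not state the formula doca-perftest uses, and the sample bandwidths are hypothetical.

```python
def relative_deviation(bandwidths):
    """Percent deviation of each QP's bandwidth from the mean over all QPs."""
    mean = sum(bandwidths) / len(bandwidths)
    return [100 * (bw - mean) / mean for bw in bandwidths]


# Hypothetical per-QP bandwidths in Gbit/sec.
for qp, dev in enumerate(relative_deviation([40.0, 50.0, 60.0])):
    print(f"Qp num {qp}: relative deviation {dev:+.1f}%")
```

Under this definition the deviations always sum to zero, so a single large positive value implies that other QPs are being underserved.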





TLP Processing Hints (TPH) is a PCIe optimization that gives CPUs hints for cache management, reducing memory-access latency.

Info: Requires ConnectX-6 or later hardware and a TPH-enabled kernel.

Parameters:

| Option | Meaning |
|---|---|
| --ph | Processing hint: 0 = Bidirectional (default), 1 = Requester, 2 = Completer, 3 = High-priority completer |
| --tph_core_id | Target CPU core for TPH handling |
| --tph_mem | Memory type: pm = Persistent, vm = Volatile |

Examples:

    doca_perftest -d mlx5_0 -n server-name --tph_core_id 0
    doca_perftest -d mlx5_0 -n server-name --tph_mem pm
    doca_perftest -d mlx5_0 -n server-name --ph 1
    doca_perftest -d mlx5_0 -n server-name --ph 1 --tph_core_id 0 --tph_mem pm
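When TPH options are set from a script, validating them before launch avoids a failed run. The Python sketch below checks values against the table above and assembles the argument list; `tph_args` is a hypothetical wrapper, and only the flags and values listed in this section are used.

```python
# Allowed values, copied from the parameter table.
PROCESSING_HINTS = {0: "Bidirectional", 1: "Requester", 2: "Completer", 3: "High-priority completer"}
TPH_MEM_TYPES = {"pm": "Persistent", "vm": "Volatile"}


def tph_args(ph=None, core_id=None, mem=None):
    """Return the TPH command-line arguments for the options that are set."""
    args = []
    if ph is not None:
        if ph not in PROCESSING_HINTS:
            raise ValueError(f"--ph must be one of {sorted(PROCESSING_HINTS)}")
        args += ["--ph", str(ph)]
    if core_id is not None:
        if core_id < 0:
            raise ValueError("--tph_core_id must be a non-negative core ID")
        args += ["--tph_core_id", str(core_id)]
    if mem is not None:
        if mem not in TPH_MEM_TYPES:
            raise ValueError(f"--tph_mem must be one of {sorted(TPH_MEM_TYPES)}")
        args += ["--tph_mem", mem]
    return args


cmd = ["doca_perftest", "-d", "mlx5_0", "-n", "server-name"] + tph_args(ph=1, core_id=0, mem="pm")
print(" ".join(cmd))
```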



