DOCA Telemetry Adp Retx
This guide provides instructions for building and developing applications that require telemetry data collection from NVIDIA® BlueField® and NVIDIA® ConnectX® families of networking platforms.
The doca_telemetry_adp_retx library provides statistics on Adaptive Retransmission Algorithm timeouts that have been configured on a given DOCA device, corresponding to an NVIDIA® BlueField® or NVIDIA® ConnectX® network card.
The library includes mechanisms for configuring and reading Adaptive Retransmissions in a histogram format. Each histogram read provides a series of bins, where each bin corresponds to a specific time range. The value of the bin is a count of the retransmissions that occurred due to a timeout falling within that time range.
The histogram can return information about events on all QPs of functions associated with the DOCA device, or it can be configured to track the QPs of a single VHCA ID.
DOCA Telemetry Adp Retx is supported at an alpha level.
To use DOCA Telemetry Adp Retx, the following prerequisites must be met:
fwctl driver installed and loaded (see instructions in NVIDIA MLNX_OFED Documentation v24.07-0.6.1.0)
Note: To verify whether the fwctl driver is successfully loaded:
$ ls /sys/class/fwctl/
Expected output:
fwctl0 fwctl1
If the directory /sys/class/fwctl does not exist or is empty, follow these steps:
Search for the fwctl package:
$ apt search fwctl
The output may indicate either fwctl-dkms or fwctl-modules.
Install the appropriate package:
$ sudo apt install fwctl-dkms
Or:
$ sudo apt install fwctl-modules
Load the mlx5_fwctl module:
$ sudo modprobe mlx5_fwctl
Confirm the module is loaded:
$ lsmod | grep fwctl
Expected output:
mlx5_fwctl 20480 0
fwctl 24576 1 mlx5_fwctl
mlx5_core 2211840 2 mlx5_fwctl,mlx5_ib
mlx_compat 20480 17 rdma_cm,ib_ipoib,mlxdevm,nvme,mlxfw,mlx5_fwctl,iw_cm,nvme_core,nvme_fabrics,ib_umad,fwctl,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Firmware version greater than 28.47.1000 for ConnectX-7, 40.47.1000 for ConnectX-8, or 32.47.1000 for BlueField-3
DOCA Telemetry-based applications can run on either the host machine (ConnectX-7 or BlueField-3 and newer) or on the DPU (BlueField-3 and newer).
The doca_telemetry_adp_retx library provides statistics on Adaptive Retransmission configured devices, including the number of retransmissions and their timeout ranges in a histogram format.
To interact with a device (typically corresponding to a specific NIC port), you must create a doca_telemetry_adp_retx context using doca_telemetry_adp_retx_create().
Device Support
A DOCA device is required for the library to operate. For guidance on selecting a device, refer to the "DOCA Core Device Discovery" documentation.
Device support for doca_telemetry_adp_retx and its features can be checked with the following capability calls:
doca_telemetry_adp_retx_cap_is_supported()
doca_telemetry_adp_retx_cap_histogram_is_supported()
The maximum number of bins and the supported time units can be queried using:
doca_telemetry_adp_retx_cap_get_hist_max_bins()
doca_telemetry_adp_retx_cap_get_hist_time_units()
Histogram Configuration
The histogram divides retransmission events into bins, each representing a time range. If a retransmission timeout falls within a bin's range, that bin's counter is incremented. The number of bins and their time ranges are configurable.
The bin widths and timespans are determined by five main configuration options:
API Configuration | Description |
| Number of bins to use in the histogram |
| Width (in time units) of the first bin |
| Width (in time units) of the second bin; also used as the base for calculating subsequent bins |
| The time unit for bin0 and bin1 widths (e.g., |
| The calculation mode for bins after bin1: either |
Example:
Fixed Mode: 4 bins, bin0_width=50, bin1_width=100, time_unit=msec, width_mode=fixed.
Bin 0: 0-50 msec
Bin 1: 50-150 msec (base + 100)
Bin 2: 150-250 msec (base + 100)
Bin 3: 250-350 msec (base + 100)
Double Mode: 5 bins, bin0_width=50, bin1_width=100, time_unit=msec, width_mode=double.
Bin 0: 0-50 msec
Bin 1: 50-150 msec (base + 100)
Bin 2: 150-350 msec (base + 200)
Bin 3: 350-750 msec (base + 400)
Bin 4: 750-1550 msec (base + 800)
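The two examples above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the DOCA API: the enum, the function names, and the choice to treat each upper edge as exclusive are all assumptions made for illustration (this guide does not state whether bin edges are inclusive).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; these are not DOCA types. */
enum width_mode { WIDTH_MODE_FIXED, WIDTH_MODE_DOUBLE };

/*
 * Fill upper[i] with the upper edge of bin i, in configured time units.
 * Bin 0 starts at 0; bin i starts where bin i-1 ends.
 */
static void bin_upper_edges(enum width_mode mode, size_t num_bins,
                            uint64_t bin0_width, uint64_t bin1_width,
                            uint64_t *upper)
{
    assert(upper != NULL && num_bins > 0);

    uint64_t width = bin1_width;
    upper[0] = bin0_width;
    for (size_t i = 1; i < num_bins; i++) {
        upper[i] = upper[i - 1] + width;
        if (mode == WIDTH_MODE_DOUBLE)
            width *= 2; /* the next bin is twice as wide as this one */
    }
}

/*
 * Return the index of the bin that a timeout falls into, or num_bins if it
 * lies beyond the last bin. Assumption: upper edges are treated as exclusive.
 */
static size_t bin_for_timeout(const uint64_t *upper, size_t num_bins,
                              uint64_t timeout)
{
    assert(upper != NULL);

    for (size_t i = 0; i < num_bins; i++) {
        if (timeout < upper[i])
            return i;
    }
    return num_bins;
}
```

With bin0_width=50 and bin1_width=100, fixed mode over 4 bins yields upper edges 50, 150, 250, 350, and double mode over 5 bins yields 50, 150, 350, 750, 1550, matching the ranges listed above.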
Further options control how the histogram is populated:
API Configuration | Description |
| Populates the histogram with retransmissions from a single VHCA ID only |
| Clears (resets to 0) the histogram bin counters after each read |
| Enables the counters. This must be set for the histogram to start gathering statistics. |
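When clear-on-read is not set, an application that wants per-interval counts would likely need to subtract consecutive reads itself. The sketch below assumes that the counters keep accumulating between reads in that case, which this guide does not state explicitly; the function name is hypothetical and no DOCA calls are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute per-interval counts from two consecutive snapshots of the bin
 * counters. Assumption: counters only grow between reads (no clear on read,
 * no external reconfiguration). Unsigned subtraction wraps modulo 2^64.
 */
static void bin_deltas(const uint64_t *prev, const uint64_t *curr,
                       size_t num_bins, uint64_t *delta)
{
    assert(prev != NULL && curr != NULL && delta != NULL);

    for (size_t i = 0; i < num_bins; i++)
        delta[i] = curr[i] - prev[i];
}
```

With clear-on-read set, each read already represents one interval and no subtraction is needed.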
After configuration, the histogram is loaded and begins running on the device when doca_telemetry_adp_retx_start() is called. The bin counters can then be read from the device.
doca_telemetry_adp_retx contexts do not have sole ownership or a locking mechanism on the device histogram. It is possible for another process to update the histogram's configuration while your context is in the execution phase, which can lead to misinterpretation of the bin counters.
The user is responsible for ensuring sole ownership of the histogram and verifying data integrity. An API function is provided to help detect these external changes.
The following functions are used during the execution phase:
API Datapath Functions | Description |
| Reads the configured N histogram bin counters as an array of N 64-bit values |
| Indicates if the device's active histogram configuration matches the one defined in the context |
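Once the N counters have been read into an array, summarizing them is plain arithmetic. The helpers below are an illustrative sketch with hypothetical names; they operate on an array the application has already filled and do not call any DOCA function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all bin counters: total retransmissions recorded in the histogram. */
static uint64_t hist_total(const uint64_t *counters, size_t num_bins)
{
    assert(counters != NULL);

    uint64_t total = 0;
    for (size_t i = 0; i < num_bins; i++)
        total += counters[i];
    return total;
}

/*
 * Sum of the counters from first_bin to the last bin: retransmissions whose
 * timeout fell in the longer time ranges. Returns 0 if first_bin >= num_bins.
 */
static uint64_t hist_tail(const uint64_t *counters, size_t num_bins,
                          size_t first_bin)
{
    assert(counters != NULL);

    uint64_t tail = 0;
    for (size_t i = first_bin; i < num_bins; i++)
        tail += counters[i];
    return tail;
}
```

Such a summary is only meaningful if the device's active configuration still matches the one the context defined, so the configuration check described in the table above should come first.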
This section outlines the states of the doca_telemetry_adp_retx context.
Idle
The context has been created and is Idle.
In this state, it is expected for the application to:
Destroy the context.
Start the context for processing.
Allowed operations:
Configuring the context according to section "Configuration".
It is possible to reach this state as follows:
Previous State | Transition Action |
None | Create the context |
Running | Call stop |
Running
In this state it is expected for the application to:
Stop the context.
Allowed operations:
Reading data from the device according to section "Execution".
It is possible to reach this state as follows:
Previous State | Transition Action |
Idle | Successfully start the context |
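The two states and their transitions can be summarized as a small lookup. This is a sketch of the rules as listed in this section, using hypothetical enum and function names; it is not how the library represents context state internally.

```c
#include <stdbool.h>

/* Hypothetical names mirroring the states and actions described above. */
enum ctx_state { CTX_IDLE, CTX_RUNNING };
enum ctx_action { ACT_CONFIGURE, ACT_START, ACT_READ, ACT_STOP, ACT_DESTROY };

/* Return true if the section above lists the action for the given state. */
static bool action_allowed(enum ctx_state state, enum ctx_action action)
{
    switch (state) {
    case CTX_IDLE:
        return action == ACT_CONFIGURE || action == ACT_START ||
               action == ACT_DESTROY;
    case CTX_RUNNING:
        return action == ACT_READ || action == ACT_STOP;
    }
    return false;
}

/* Return the state that follows a successful, allowed action. */
static enum ctx_state next_state(enum ctx_state state, enum ctx_action action)
{
    if (state == CTX_IDLE && action == ACT_START)
        return CTX_RUNNING;
    if (state == CTX_RUNNING && action == ACT_STOP)
        return CTX_IDLE;
    return state; /* configure and read do not change the state */
}
```

An application wrapper could consult such a table before each call, for example to refuse a configuration change while the context is Running.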
DOCA Telemetry Adp Retx supports only CPU-based datapaths.
The doca_telemetry_adp_retx sample demonstrates how to configure the histogram from command-line arguments, run for a set period, and then print the values of the configured bin counters. This sample is also available on GitHub.
Running the Sample
Before you begin, refer to the following documents:
DOCA Installation Guide for Linux: For details on installing BlueField-related software.
NVIDIA BlueField Platform Software Troubleshooting Guide: For any issues with installation, compilation, or execution.
To build a given sample:
# Update path if you downloaded from GitHub
cd /opt/mellanox/doca/samples/doca_telemetry/telemetry_adp_retx
meson /tmp/build
ninja -C /tmp/build
The binary doca_telemetry_adp_retx is created under /tmp/build/.
Sample usage:
Usage: doca_telemetry_adp_retx [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help            Print a help synopsis
  -v, --version         Print program version information
  -l, --log-level       Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level       Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>     Parse all command flags from an input json file

Program Flags:
  -p, --pci-addr        DOCA device PCI device address
  -u, --time-unit       Time unit to use - 'nsec', 'usec', 'usec_100', or 'msec'
  -w, --width-mode      Bin width mode to use - 'fixed', or 'double'
  -n, --number-bins     The number of bins to configure the histogram for
  -vid, --vhca-id       VHCA ID to get histogram events from
  -b0, --bin-0-width    Width of bin 0 to configure histogram
  -b1, --bin-1-width    Width of bin 1 to configure histogram
  -t, --wait-time       Time in seconds to wait before reading histogram bins
The sample performs the following steps:
Locates and opens a DOCA device.
Creates a doca_telemetry_adp_retx instance.
Queries the device for histogram support, max bins, and time unit capabilities.
Configures the histogram with the values provided via command line (number of bins, bin widths, time unit, width mode, VHCA ID, clear on read, and counter enable).
Waits for the specified time, then reads and displays the value of each bin.
Destroys the doca_telemetry_adp_retx context.