DOCA Telemetry PCI
This guide provides instructions for building and developing applications that require telemetry data collection from NVIDIA® BlueField and NVIDIA® ConnectX® families of networking platforms.
The doca_telemetry_pci library provides access to PCIe status and performance information from BlueField or ConnectX networking platforms.
DOCA Telemetry PCI is supported at alpha level.
To use DOCA Telemetry PCI, the following prerequisites must be met:
fwctl driver installed and loaded (see instructions in NVIDIA MLNX_OFED Documentation v24.07-0.6.1.0)
Note
To verify whether the fwctl driver is successfully loaded:
$ ls /sys/class/fwctl/
Expected output:
fwctl0 fwctl1
If the directory /sys/class/fwctl does not exist or is empty, follow these steps:
Search for the fwctl package:
$ apt search fwctl
The output may indicate either fwctl-dkms or fwctl-modules.
Install the appropriate package:
$ sudo apt install fwctl-dkms
Or:
$ sudo apt install fwctl-modules
Load the mlx5_fwctl module:
$ sudo modprobe mlx5_fwctl
Confirm the module is loaded:
$ lsmod | grep fwctl
Expected output:
mlx5_fwctl 20480 0
fwctl 24576 1 mlx5_fwctl
mlx5_core 2211840 2 mlx5_fwctl,mlx5_ib
mlx_compat 20480 17 rdma_cm,ib_ipoib,mlxdevm,nvme,mlxfw,mlx5_fwctl,iw_cm,nvme_core,nvme_fabrics,ib_umad,fwctl,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Firmware version 28.43.1000 for ConnectX-7 or 32.43.1000 for BlueField-3
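An application can repeat the driver check above at startup instead of relying on a manual shell step. The following is a minimal sketch, assuming a Linux host where a loaded fwctl driver exposes at least one entry under /sys/class/fwctl; the helper names count_dir_entries and fwctl_is_loaded are hypothetical and are not part of DOCA.

```c
#include <dirent.h>
#include <string.h>

/* Count the entries in a directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened (for example, it does not exist). */
int count_dir_entries(const char *path)
{
	DIR *dir = opendir(path);
	if (dir == NULL)
		return -1;

	int count = 0;
	struct dirent *entry;
	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0)
			count++;
	}
	closedir(dir);
	return count;
}

/* Hypothetical helper: returns 1 if at least one fwctl device (e.g. fwctl0) is visible, else 0. */
int fwctl_is_loaded(void)
{
	return count_dir_entries("/sys/class/fwctl") > 0;
}
```

If fwctl_is_loaded() returns 0, the application can report that the driver is missing and point the user to the installation steps above before attempting to create a telemetry context.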
DOCA Telemetry-based applications can run on either the host machine (ConnectX-7 or BlueField-3 and newer) or the DPU target (BlueField-3 and newer).
DOCA Telemetry PCI provides insights into PCIe devices, including:
Management information: PCIe link and speed details, power usage, function count, error detection flags, and more.
PCIe performance counters: Data transfer rates, error rates, stall counters, L0 recovery count, and other performance metrics.
PCIe latency histogram: Helps understand the duration of PCIe operations.
To interact with a device, typically corresponding to a specific NIC port, create a DOCA Telemetry PCI context using doca_telemetry_pci_create().
Device Support
DOCA Telemetry PCI requires a device to operate. To pick a device, refer to "DOCA Core Device Discovery".
Because device capabilities may vary, it is recommended to check that the device supports the set of PCIe telemetry options you need before opening it, so you can be confident the required operations are available. The capability checks available for DOCA Telemetry PCI are outlined below:
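Functionality | Method |
PCI Management Information | doca_telemetry_pci_cap_management_info_is_supported |
PCI Performance Counters Group 1 | doca_telemetry_pci_cap_perf_counters_1_is_supported |
PCI Performance Counters Group 2 | |
PCI Latency Histogram | doca_telemetry_pci_cap_latency_histogram_is_supported |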
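When an application needs several of these capabilities at once, the checks can be driven from a small table so that the first missing capability is reported by name. The following is a sketch of that pattern only: status_t, cap_check_fn, struct cap_requirement, first_missing_capability, and the two demo functions are hypothetical stand-ins so that the code compiles without the DOCA headers. In a real application each table entry would presumably wrap one of the capability check functions above, called with the device's devinfo.

```c
#include <stddef.h>

/* Stand-in for doca_error_t so this sketch builds without DOCA headers (assumption). */
typedef int status_t;
#define STATUS_OK 0

/* One capability check; in real code this would wrap a DOCA capability check call. */
typedef status_t (*cap_check_fn)(const void *devinfo);

struct cap_requirement {
	const char *name;   /* Human-readable name used in error reports */
	cap_check_fn check; /* Returns STATUS_OK when the capability is supported */
};

/* Returns NULL when every requirement passes, otherwise the name of the first one that fails. */
const char *first_missing_capability(const void *devinfo,
				     const struct cap_requirement *reqs,
				     size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (reqs[i].check(devinfo) != STATUS_OK)
			return reqs[i].name;
	}
	return NULL;
}

/* Demo checks used only to exercise the pattern; they do not query any device. */
status_t demo_supported(const void *devinfo)
{
	(void)devinfo;
	return STATUS_OK;
}

status_t demo_unsupported(const void *devinfo)
{
	(void)devinfo;
	return 1;
}
```

Running all checks before creating the context keeps the failure path simple, since nothing has been allocated yet when a capability turns out to be missing.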
Within the structures provided during the execution phase, some fields are populated only when a further sub-capability is also supported:
Functionality | Field(s) | Method |
PCI Management Information | | |
PCI Performance Counters Group 1 | | |
Retrieving PCIe Management Information
With a running doca_telemetry_pci context on a device that supports PCIe management information, call doca_telemetry_pci_read_management_info() as often as needed. Each call returns the most recent data available.
The following is a more complete example:
// Assumes dev (struct doca_dev *) and devinfo (its struct doca_devinfo *) were obtained
// through DOCA Core device discovery, and that this snippet runs inside a function
// returning doca_error_t.
doca_error_t result;

// Check for the ability to read management info
result = doca_telemetry_pci_cap_management_info_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
	// Capability is not supported or an error occurred, stop
	return result;
}
// Check any sub-capabilities if you require those fields

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
	// Handle failure to create telemetry instance
	return result;
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
	// Handle failure to start telemetry instance
	doca_telemetry_pci_destroy(pci_telem);
	return result;
}

// Read management info
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
struct doca_telemetry_pci_management_info management_info = {0};
result = doca_telemetry_pci_read_management_info(pci_telem, dpn, &management_info);
if (result != DOCA_SUCCESS) {
	// Handle failure to read data
} else {
	// Use the data
}

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
Retrieving Performance Counters Group 1
With a running doca_telemetry_pci context on a device that supports performance counters group 1, call doca_telemetry_pci_read_perf_counters_1() as often as needed. Each call returns the most recent data available.
The following is a more complete example:
// Assumes dev (struct doca_dev *) and devinfo (its struct doca_devinfo *) were obtained
// through DOCA Core device discovery, and that this snippet runs inside a function
// returning doca_error_t.
doca_error_t result;

// Check for the ability to read perf counters group 1
result = doca_telemetry_pci_cap_perf_counters_1_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
	// Capability is not supported or an error occurred, stop
	return result;
}
// Check any sub-capabilities if you require those fields

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
	// Handle failure to create telemetry instance
	return result;
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
	// Handle failure to start telemetry instance
	doca_telemetry_pci_destroy(pci_telem);
	return result;
}

// Read perf counters group 1
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
struct doca_telemetry_pci_perf_counters_1 counters = {0};
result = doca_telemetry_pci_read_perf_counters_1(pci_telem, dpn, &counters);
if (result != DOCA_SUCCESS) {
	// Handle failure to read data
} else {
	// Use the data
}

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
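Because the counters can be read repeatedly, a common way to use them is to take two snapshots and derive a rate from the difference. The following is a minimal sketch, assuming the counter of interest is a cumulative, monotonically increasing 64-bit value; the guide does not confirm this for any particular field, so verify it against the structure definition. The function name counter_rate_per_sec is hypothetical, and the caller is expected to measure the elapsed time between the two reads.

```c
#include <stdint.h>

/* Events per second between two snapshots of a cumulative counter.
 * prev and curr are the counter values from two successive reads;
 * elapsed_ns is the time between the reads, measured by the caller.
 * Returns 0.0 when no time has elapsed. */
double counter_rate_per_sec(uint64_t prev, uint64_t curr, uint64_t elapsed_ns)
{
	if (elapsed_ns == 0)
		return 0.0;

	/* Unsigned subtraction stays correct across a single 64-bit wraparound. */
	uint64_t delta = curr - prev;

	return (double)delta * 1e9 / (double)elapsed_ns;
}
```

For example, a counter that moves from 1000 to 3000 over one second (1000000000 ns) yields a rate of 2000 events per second.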
Retrieving Latency Histogram
With a running doca_telemetry_pci context on a device that supports the latency histogram, first call doca_telemetry_pci_get_latency_histogram_dimensions() to learn the dimensions of the histogram, then allocate an array large enough to hold the bucket values. After that, call doca_telemetry_pci_read_latency_histogram() as often as needed. Each call returns the most recent data available.
The following is a more complete example:
// Requires <stdint.h> and <stdlib.h>.
// Assumes dev (struct doca_dev *) and devinfo (its struct doca_devinfo *) were obtained
// through DOCA Core device discovery, and that this snippet runs inside a function
// returning doca_error_t.
doca_error_t result;

// Check for the ability to read the latency histogram
result = doca_telemetry_pci_cap_latency_histogram_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
	// Capability is not supported or an error occurred, stop
	return result;
}

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
	// Handle failure to create telemetry instance
	return result;
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
	// Handle failure to start telemetry instance
	doca_telemetry_pci_destroy(pci_telem);
	return result;
}

// Learn the histogram's dimensions
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
uint32_t bucket_count;
uint32_t bucket_width_ns;
result = doca_telemetry_pci_get_latency_histogram_dimensions(pci_telem, dpn, &bucket_count, &bucket_width_ns);
if (result != DOCA_SUCCESS) {
	// Handle failure to get histogram dimensions
	doca_telemetry_pci_stop(pci_telem);
	doca_telemetry_pci_destroy(pci_telem);
	return result;
}

// Allocate memory to hold histogram data
uint64_t *buckets_arr = malloc(bucket_count * sizeof(uint64_t));
if (buckets_arr == NULL) {
	// Handle failure to allocate memory
	doca_telemetry_pci_stop(pci_telem);
	doca_telemetry_pci_destroy(pci_telem);
	return DOCA_ERROR_NO_MEMORY;
}

// Fetch histogram data
result = doca_telemetry_pci_read_latency_histogram(pci_telem, dpn, buckets_arr);
if (result != DOCA_SUCCESS) {
	// Handle failure to read data
} else {
	// Use the data
}

// Cleanup
free(buckets_arr);
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
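Once the bucket array has been filled, it can be summarized into a latency percentile. The following is a minimal sketch, assuming bucket i counts operations whose latency falls in the range from i * bucket_width_ns up to (i + 1) * bucket_width_ns; the guide does not spell out the bucket layout, so confirm it before relying on the result. The function name histogram_percentile_ns is hypothetical and is not part of DOCA.

```c
#include <stdint.h>

/* Approximate latency percentile from a fixed-width histogram.
 * fraction is in the range 0.0 to 1.0 (for example 0.99 for the 99th percentile).
 * Returns the upper edge, in nanoseconds, of the first bucket at which the
 * cumulative count reaches fraction * total, or 0 if the histogram is empty. */
uint64_t histogram_percentile_ns(const uint64_t *buckets,
				 uint32_t bucket_count,
				 uint32_t bucket_width_ns,
				 double fraction)
{
	uint64_t total = 0;
	for (uint32_t i = 0; i < bucket_count; i++)
		total += buckets[i];
	if (total == 0)
		return 0;

	double target = fraction * (double)total;
	uint64_t cumulative = 0;
	for (uint32_t i = 0; i < bucket_count; i++) {
		cumulative += buckets[i];
		if ((double)cumulative >= target)
			return (uint64_t)(i + 1) * bucket_width_ns;
	}

	/* Only reached if fraction is greater than 1.0: report the top edge of the histogram. */
	return (uint64_t)bucket_count * bucket_width_ns;
}
```

With the values from the example above, a call such as histogram_percentile_ns(buckets_arr, bucket_count, bucket_width_ns, 0.99) gives an upper bound on the 99th-percentile latency. The result is only as precise as the bucket width, because the position of samples inside a bucket is unknown.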
DOCA Telemetry PCI supports only CPU-based datapaths.