DOCA Telemetry PCI
This guide provides instructions for building and developing applications that require telemetry data collection from NVIDIA® BlueField and NVIDIA® ConnectX® families of networking platforms.
The doca_telemetry_pci library provides access to PCIe status and performance information from NVIDIA® BlueField or ConnectX® networking platforms.
DOCA Telemetry PCI is supported at alpha level.
To use DOCA Telemetry PCI, the following prerequisites must be met:
fwctl driver installed and loaded (see instructions in NVIDIA MLNX_OFED Documentation v24.07-0.6.1.0)
Note: To verify whether the fwctl driver is successfully loaded:
$ ls /sys/class/fwctl/
Expected output:
fwctl0 fwctl1
If the directory /sys/class/fwctl does not exist or is empty, follow these steps:
Search for the fwctl package:
$ apt search fwctl
The output may indicate either fwctl-dkms or fwctl-modules.
Install the appropriate package:
$ sudo apt install fwctl-dkms
Or:
$ sudo apt install fwctl-modules
Load the mlx5_fwctl module:
$ sudo modprobe mlx5_fwctl
Confirm the module is loaded:
$ lsmod | grep fwctl
Expected output:
mlx5_fwctl 20480 0
fwctl 24576 1 mlx5_fwctl
mlx5_core 2211840 2 mlx5_fwctl,mlx5_ib
mlx_compat 20480 17 rdma_cm,ib_ipoib,mlxdevm,nvme,mlxfw,mlx5_fwctl,iw_cm,nvme_core,nvme_fabrics,ib_umad,fwctl,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Firmware version 28.43.1000 for ConnectX-7 or 32.43.1000 for BlueField-3
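An application may want to perform the fwctl check from the note above programmatically before it attempts to use the library. The following is a minimal sketch, assuming that a non-empty /sys/class/fwctl directory is a sufficient sign that the driver is loaded. The helper names count_dir_entries and fwctl_is_available are hypothetical and are not part of DOCA.

```c
#include <dirent.h>
#include <string.h>

/* Count the entries in a directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened (for example, it does not exist). */
int count_dir_entries(const char *path)
{
	DIR *dir = opendir(path);
	if (dir == NULL)
		return -1;

	int count = 0;
	struct dirent *entry;
	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
			continue;
		count++;
	}
	closedir(dir);
	return count;
}

/* Hypothetical helper: returns 1 when at least one entry (such as fwctl0)
 * is present under the given sysfs path, and 0 otherwise. */
int fwctl_is_available(const char *sysfs_path)
{
	return count_dir_entries(sysfs_path) > 0;
}
```

Calling fwctl_is_available("/sys/class/fwctl") mirrors the ls check; the path is passed as a parameter so the helper can be exercised against any directory.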
DOCA Telemetry-based applications can run on either the host machine (ConnectX-7 or BlueField-3 and newer) or the DPU target (BlueField-3 and newer).
DOCA Telemetry PCI provides insights into PCI devices, including:
Management information: PCIe link and speed details, power usage, function count, error detection flags, and more.
PCI performance counters: Data transfer rates, error rates, stall counters, L0 recovery count, and other performance metrics.
PCI latency histogram: Helps understand the duration of PCI operations.
To interact with a device, typically corresponding to a specific NIC port, create a DOCA Telemetry PCI context using doca_telemetry_pci_create().
Device Support
DOCA Telemetry PCI requires a device to operate. To pick a device, refer to "DOCA Core Device Discovery".
As device capabilities may change, it is recommended to check that your device supports the set of PCI telemetry options you require before opening it, so you can be confident that the operations you need are available. The available capability checks for DOCA Telemetry PCI are outlined below:
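Functionality | Method |
PCI Management Information | doca_telemetry_pci_cap_management_info_is_supported |
PCI Performance Counters Group 1 | doca_telemetry_pci_cap_perf_counters_1_is_supported |
PCI Performance Counters Group 2 | doca_telemetry_pci_cap_perf_counters_2_is_supported |
PCI Latency Histogram | doca_telemetry_pci_cap_latency_histogram_is_supported |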
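One way to apply the recommendation above is to run each capability check once per candidate device and record the outcomes in a bitmask, which can then be compared with the set of features the application requires. The sketch below shows only that bookkeeping. The enum and both functions are hypothetical names invented for illustration, not DOCA identifiers.

```c
#include <stdint.h>

/* Hypothetical feature bits, one per capability check in the table above. */
enum pci_telem_feature {
	PCI_TELEM_MANAGEMENT_INFO = 1u << 0,
	PCI_TELEM_PERF_COUNTERS_1 = 1u << 1,
	PCI_TELEM_PERF_COUNTERS_2 = 1u << 2,
	PCI_TELEM_LATENCY_HISTOGRAM = 1u << 3,
};

/* Record the outcome of one capability check.
 * check_ok is non-zero when the check reported success. */
uint32_t record_feature(uint32_t supported, enum pci_telem_feature feature, int check_ok)
{
	return check_ok ? (supported | (uint32_t)feature) : supported;
}

/* Returns non-zero when every required feature bit is present in supported. */
int features_satisfied(uint32_t supported, uint32_t required)
{
	return (supported & required) == required;
}
```

In an application, check_ok would be the result of comparing a capability check's return value with DOCA_SUCCESS, and a device would be opened only if features_satisfied returns non-zero for it.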
Within the structures provided during the execution phase, some fields are only populated when a further sub-capability is also supported:
Functionality | Field(s) | Method |
PCI Management Information |
|
|
|
| |
PCI Performance Counters Group 1 |
|
|
|
| |
|
| |
|
|
Retrieving PCI Management Information
Using a running doca_telemetry_pci context that supports PCI management information, the user can call doca_telemetry_pci_read_management_info as many times as needed; each call returns the most recent data available.
The following is a more complete example:
doca_error_t result;

// Check for ability to read management info
result = doca_telemetry_pci_cap_management_info_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}

// Check any sub capabilities if you require those fields

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Read management info
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
struct doca_telemetry_pci_management_info management_info = {0};
result = doca_telemetry_pci_read_management_info(pci_telem, dpn, &management_info);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
Retrieving Performance Counters Group 1
Using a running doca_telemetry_pci context that supports performance counters group 1, the user can call doca_telemetry_pci_read_perf_counters_1 as many times as needed; each call returns the most recent data available.
The following is a more complete example:
doca_error_t result;

// Check for ability to read perf counters group 1
result = doca_telemetry_pci_cap_perf_counters_1_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}

// Check any sub capabilities if you require those fields

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Read perf counters group 1
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
struct doca_telemetry_pci_perf_counters_1 counters = {0};
result = doca_telemetry_pci_read_perf_counters_1(pci_telem, dpn, &counters);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
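Because the counters can be read repeatedly, a common use is to take two samples and derive a rate from their difference. The following is a minimal sketch of that arithmetic, assuming the sampled field is a cumulative, free-running unsigned 64-bit counter; this guide does not state whether a given field behaves that way, so check the field's documentation before applying it. The function names are hypothetical, and the caller supplies the sample values and the elapsed time.

```c
#include <stdint.h>

/* Difference between two samples of a free-running unsigned 64-bit counter.
 * Unsigned subtraction is modular, so the result is still correct if the
 * counter wrapped around once between the two samples. */
uint64_t counter_delta(uint64_t previous, uint64_t current)
{
	return current - previous;
}

/* Events per second over the sampling interval.
 * Returns 0.0 when elapsed_ns is zero to avoid dividing by zero. */
double counter_rate_per_sec(uint64_t previous, uint64_t current, uint64_t elapsed_ns)
{
	if (elapsed_ns == 0)
		return 0.0;
	return (double)counter_delta(previous, current) * 1e9 / (double)elapsed_ns;
}
```

For example, a counter that advances from 0 to 500 over half a second corresponds to a rate of 1000 events per second.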
Retrieving Performance Counters Group 2
Using a running doca_telemetry_pci context that supports performance counters group 2, the user can call doca_telemetry_pci_read_perf_counters_2 as many times as needed; each call returns the most recent data available.
The following is a more complete example:
doca_error_t result;

// Check for ability to read perf counters group 2
result = doca_telemetry_pci_cap_perf_counters_2_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Read perf counters group 2
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
struct doca_telemetry_pci_perf_counters_2 counters = {0};
result = doca_telemetry_pci_read_perf_counters_2(pci_telem, dpn, &counters);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
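When the read call is issued periodically, it can help to timestamp each sample so that the interval between reads is known. The sketch below is one possible way to compute that interval from two struct timespec values. The helper name elapsed_ns is hypothetical, and the choice of clock is an assumption left to the application.

```c
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from start to end.
 * Returns 0 if end does not come after start. */
uint64_t elapsed_ns(struct timespec start, struct timespec end)
{
	int64_t sec = (int64_t)end.tv_sec - (int64_t)start.tv_sec;
	int64_t nsec = (int64_t)end.tv_nsec - (int64_t)start.tv_nsec;
	int64_t total = sec * 1000000000LL + nsec;

	return total > 0 ? (uint64_t)total : 0;
}
```

On POSIX systems the timestamps could be taken with clock_gettime(CLOCK_MONOTONIC, &ts) immediately after each successful read, which avoids jumps caused by wall-clock adjustments.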
Retrieving Latency Histogram
Using a running doca_telemetry_pci context that supports the latency histogram, the user must first call doca_telemetry_pci_get_latency_histogram_dimensions to learn the dimensions of the histogram. They can then allocate an array of histogram values and call doca_telemetry_pci_read_latency_histogram as many times as needed; each call returns the most recent data available.
The following is a more complete example:
doca_error_t result;

// Check for ability to read the latency histogram
result = doca_telemetry_pci_cap_latency_histogram_is_supported(devinfo);
if (result != DOCA_SUCCESS) {
    // Capability is not supported or an error occurred, stop
}

// Create PCI telemetry
struct doca_telemetry_pci *pci_telem;
result = doca_telemetry_pci_create(dev, &pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to create telemetry instance
}

// Start PCI telemetry
result = doca_telemetry_pci_start(pci_telem);
if (result != DOCA_SUCCESS) {
    // Handle failure to start telemetry instance
}

// Learn the histogram's dimensions
struct doca_telemetry_pci_dpn dpn = {0, 0, 0};
uint32_t bucket_count;
uint32_t bucket_width_ns;
result = doca_telemetry_pci_get_latency_histogram_dimensions(pci_telem, dpn, &bucket_count, &bucket_width_ns);
if (result != DOCA_SUCCESS) {
    // Handle failure to get histogram dimensions
}

// Allocate memory to hold histogram data
uint64_t *buckets_arr = malloc(bucket_count * sizeof(uint64_t));
if (buckets_arr == NULL) {
    // Handle failure to allocate memory
}

// Fetch histogram data
result = doca_telemetry_pci_read_latency_histogram(pci_telem, dpn, buckets_arr);
if (result != DOCA_SUCCESS) {
    // Handle failure to read data
}

// Use the data

// Cleanup
free(buckets_arr);
doca_telemetry_pci_stop(pci_telem);
doca_telemetry_pci_destroy(pci_telem);
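Once the bucket array has been filled, the "use the data" step often amounts to summarizing the distribution. The following sketch estimates an upper bound for a latency percentile from the bucket counts. It assumes that bucket i covers latencies from i * bucket_width_ns up to (i + 1) * bucket_width_ns, which is an assumption this guide does not confirm. The function names are hypothetical.

```c
#include <stdint.h>

/* Total number of samples across all buckets. */
uint64_t histogram_total(const uint64_t *buckets, uint32_t bucket_count)
{
	uint64_t total = 0;
	for (uint32_t i = 0; i < bucket_count; i++)
		total += buckets[i];
	return total;
}

/* Upper latency bound (ns) of the bucket that contains the given percentile.
 * ASSUMPTION: bucket i covers [i * width, (i + 1) * width) nanoseconds.
 * percentile must be in the range 1..100; returns 0 for an empty histogram
 * or an out-of-range percentile. Assumes total * 100 fits in 64 bits. */
uint64_t histogram_percentile_upper_ns(const uint64_t *buckets, uint32_t bucket_count,
				       uint32_t bucket_width_ns, uint32_t percentile)
{
	uint64_t total = histogram_total(buckets, bucket_count);
	if (total == 0 || percentile == 0 || percentile > 100)
		return 0;

	/* Smallest sample rank that reaches the percentile (ceiling division). */
	uint64_t target = (total * percentile + 99) / 100;
	uint64_t seen = 0;
	for (uint32_t i = 0; i < bucket_count; i++) {
		seen += buckets[i];
		if (seen >= target)
			return ((uint64_t)i + 1) * bucket_width_ns;
	}
	return (uint64_t)bucket_count * bucket_width_ns;
}
```

With buckets {10, 70, 15, 5} and a width of 100 ns, the 50th percentile falls in the second bucket, so the function reports an upper bound of 200 ns. The result is only as precise as the bucket width.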
DOCA Telemetry PCI supports only CPU-based datapaths.