DTS supports on-board data collection from the sysfs, ethtool, and tc providers.

Fluent and Prometheus aggregator providers can collect the data from other applications.

The sysfs provider has several components: ib_port, hw_port, mr_cache, eth, hwmon, and bf_ptm. By default, all components except bf_ptm are enabled when the provider is enabled:

#disable-provider=sysfs

The components can be disabled separately. For instance, to disable eth:

enable-provider=sysfs
disable-provider=sysfs.eth

Warning ib_port and ib_hw are state counters which are collected per port. These counters are only collected for ports whose state is active.

ib_port counters:

{hca_name}:{port_num}:ib_port_state
{hca_name}:{port_num}:VL15_dropped
{hca_name}:{port_num}:excessive_buffer_overrun_errors
{hca_name}:{port_num}:link_downed
{hca_name}:{port_num}:link_error_recovery
{hca_name}:{port_num}:local_link_integrity_errors
{hca_name}:{port_num}:multicast_rcv_packets
{hca_name}:{port_num}:multicast_xmit_packets
{hca_name}:{port_num}:port_rcv_constraint_errors
{hca_name}:{port_num}:port_rcv_data
{hca_name}:{port_num}:port_rcv_errors
{hca_name}:{port_num}:port_rcv_packets
{hca_name}:{port_num}:port_rcv_remote_physical_errors
{hca_name}:{port_num}:port_rcv_switch_relay_errors
{hca_name}:{port_num}:port_xmit_constraint_errors
{hca_name}:{port_num}:port_xmit_data
{hca_name}:{port_num}:port_xmit_discards
{hca_name}:{port_num}:port_xmit_packets
{hca_name}:{port_num}:port_xmit_wait
{hca_name}:{port_num}:symbol_error
{hca_name}:{port_num}:unicast_rcv_packets
{hca_name}:{port_num}:unicast_xmit_packets

ib_hw counters:

{hca_name}:{port_num}:hw_state
{hca_name}:{port_num}:hw_duplicate_request
{hca_name}:{port_num}:hw_implied_nak_seq_err
{hca_name}:{port_num}:hw_lifespan
{hca_name}:{port_num}:hw_local_ack_timeout_err
{hca_name}:{port_num}:hw_out_of_buffer
{hca_name}:{port_num}:hw_out_of_sequence
{hca_name}:{port_num}:hw_packet_seq_err
{hca_name}:{port_num}:hw_req_cqe_error
{hca_name}:{port_num}:hw_req_cqe_flush_error
{hca_name}:{port_num}:hw_req_remote_access_errors
{hca_name}:{port_num}:hw_req_remote_invalid_request
{hca_name}:{port_num}:hw_resp_cqe_error
{hca_name}:{port_num}:hw_resp_cqe_flush_error
{hca_name}:{port_num}:hw_resp_local_length_error
{hca_name}:{port_num}:hw_resp_remote_access_errors
{hca_name}:{port_num}:hw_rnr_nak_retry_err
{hca_name}:{port_num}:hw_rx_atomic_requests
{hca_name}:{port_num}:hw_rx_dct_connect
{hca_name}:{port_num}:hw_rx_icrc_encapsulated
{hca_name}:{port_num}:hw_rx_read_requests
{hca_name}:{port_num}:hw_rx_write_requests

ib_mr_cache counters:

{hca_name}:mr_cache:size_{n}:cur
{hca_name}:mr_cache:size_{n}:limit
{hca_name}:mr_cache:size_{n}:miss
{hca_name}:mr_cache:size_{n}:size

Warning Where n ranges from 0 to 24.
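To illustrate the naming pattern, the full set of ib_mr_cache counter names for one device can be enumerated as in the following sketch. It is not part of DTS: the helper name is hypothetical and mlx5_0 is only an example device name.

```python
# Hypothetical helper (not a DTS API): expand the ib_mr_cache name pattern.
MR_CACHE_FIELDS = ("cur", "limit", "miss", "size")


def mr_cache_counter_names(hca_name, max_n=24):
    """Return every {hca_name}:mr_cache:size_{n}:{field} name for n in 0..max_n."""
    return [
        f"{hca_name}:mr_cache:size_{n}:{field}"
        for n in range(max_n + 1)  # n ranges from 0 to 24, inclusive
        for field in MR_CACHE_FIELDS
    ]


names = mr_cache_counter_names("mlx5_0")  # "mlx5_0" is an example device name
print(len(names), names[0], names[-1])
```

With n covering 25 values and four fields per size, one device yields 100 counter names.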

eth counters:

{hca_name}:{device_name}:eth_collisions
{hca_name}:{device_name}:eth_multicast
{hca_name}:{device_name}:eth_rx_bytes
{hca_name}:{device_name}:eth_rx_compressed
{hca_name}:{device_name}:eth_rx_crc_errors
{hca_name}:{device_name}:eth_rx_dropped
{hca_name}:{device_name}:eth_rx_errors
{hca_name}:{device_name}:eth_rx_fifo_errors
{hca_name}:{device_name}:eth_rx_frame_errors
{hca_name}:{device_name}:eth_rx_length_errors
{hca_name}:{device_name}:eth_rx_missed_errors
{hca_name}:{device_name}:eth_rx_nohandler
{hca_name}:{device_name}:eth_rx_over_errors
{hca_name}:{device_name}:eth_rx_packets
{hca_name}:{device_name}:eth_tx_aborted_errors
{hca_name}:{device_name}:eth_tx_bytes
{hca_name}:{device_name}:eth_tx_carrier_errors
{hca_name}:{device_name}:eth_tx_compressed
{hca_name}:{device_name}:eth_tx_dropped
{hca_name}:{device_name}:eth_tx_errors
{hca_name}:{device_name}:eth_tx_fifo_errors
{hca_name}:{device_name}:eth_tx_heartbeat_errors
{hca_name}:{device_name}:eth_tx_packets
{hca_name}:{device_name}:eth_tx_window_errors

BlueField-2 hwmon counters:

{hwmon_name}:{l3cache}:CYCLES
{hwmon_name}:{l3cache}:HITS_BANK0
{hwmon_name}:{l3cache}:HITS_BANK1
{hwmon_name}:{l3cache}:MISSES_BANK0
{hwmon_name}:{l3cache}:MISSES_BANK1
{hwmon_name}:{pcie}:IN_C_BYTE_CNT
{hwmon_name}:{pcie}:IN_C_PKT_CNT
{hwmon_name}:{pcie}:IN_NP_BYTE_CNT
{hwmon_name}:{pcie}:IN_NP_PKT_CNT
{hwmon_name}:{pcie}:IN_P_BYTE_CNT
{hwmon_name}:{pcie}:IN_P_PKT_CNT
{hwmon_name}:{pcie}:OUT_C_BYTE_CNT
{hwmon_name}:{pcie}:OUT_C_PKT_CNT
{hwmon_name}:{pcie}:OUT_NP_BYTE_CNT
{hwmon_name}:{pcie}:OUT_NP_PKT_CNT
{hwmon_name}:{pcie}:OUT_P_PKT_CNT
{hwmon_name}:{tile}:MEMORY_READS
{hwmon_name}:{tile}:MEMORY_WRITES
{hwmon_name}:{tile}:MSS_NO_CREDIT
{hwmon_name}:{tile}:VICTIM_WRITE
{hwmon_name}:{tilenet}:CDN_DIAG_C_OUT_OF_CRED
{hwmon_name}:{tilenet}:CDN_REQ
{hwmon_name}:{tilenet}:DDN_REQ
{hwmon_name}:{tilenet}:NDN_REQ
{hwmon_name}:{trio}:TDMA_DATA_BEAT
{hwmon_name}:{trio}:TDMA_PBUF_MAC_AF
{hwmon_name}:{trio}:TDMA_RT_AF
{hwmon_name}:{trio}:TPIO_DATA_BEAT
{hwmon_name}:{triogen}:TX_DAT_AF
{hwmon_name}:{triogen}:TX_DAT_AF

BlueField-3 hwmon counters:

{hwmon_name}:{llt}:GDC_BANK0_RD_REQ
{hwmon_name}:{llt}:GDC_BANK1_RD_REQ
{hwmon_name}:{llt}:GDC_BANK0_WR_REQ
{hwmon_name}:{llt}:GDC_BANK1_WR_REQ
{hwmon_name}:{llt_miss}:GDC_MISS_MACHINE_RD_REQ
{hwmon_name}:{llt_miss}:GDC_MISS_MACHINE_WR_REQ
{hwmon_name}:{mss}:SKYLIB_DDN_TX_FLITS
{hwmon_name}:{mss}:SKYLIB_DDN_RX_FLITS

BlueField-3 bf_ptm counters:

bf:ptm:active_power_profile
bf:ptm:atx_power_available
bf:ptm:core_temp
bf:ptm:ddr_temp
bf:ptm:error_state
bf:ptm:power_envelope
bf:ptm:power_throttling_event_count
bf:ptm:power_throttling_state
bf:ptm:thermal_throttling_event_count
bf:ptm:thermal_throttling_state
bf:ptm:throttling_state
bf:ptm:total_power
bf:ptm:vr0_power
bf:ptm:vr1_power

The bf_ptm component collects BlueField-3 power and thermal counters using remote collection. It is disabled by default and can be enabled as follows:

1. Load the kernel module mlxbf-ptm:

modprobe -v mlxbf-ptm

2. Enable the component using remote collection:

enable-provider=grpc.sysfs.bf_ptm

Warning DPE server should be active before changing the dts_config.ini file. See section "Remote Collection" for details.

Ethtool counters are a generated list of counters that corresponds to the output of the Ethtool utility. Counters are generated on a per-device basis. See this community post for more information on mlx5 ethtool counters.

The following TC objects are supported and reported regarding the ingress filters:

Filters: flower

Actions: mirred, tunnel_key



The info is provided as one of the following events:

Basic filter event

Flower/IPv4 filter event

Flower/IPv6 filter event

Basic action event

Mirred action event

Tunnel_key/IPv4 action event

Tunnel_key/IPv6 action event

General notes:

Actions always belong to a filter, so action events share the filter event's ID via the event_id data member

Basic filter event only contains textual kind (so users can see which real-life objects are not yet supported)

Basic action event only contains textual kind and some basic common statistics if available

fluent_aggr listens on a port for Fluent Bit Forward protocol input connections. Received data can be streamed via a Fluent Bit exporter.

The default port is 42442. This can be changed by updating the following option:

fluent-aggr-port=42442





prometheus_aggr polls data from a list of Prometheus endpoints.

Each endpoint is listed in the following format:

prometheus_aggr_endpoint.{N}={host_name},{host_port_url},{poll_interval_msec}

Where N starts from 0.
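As an illustration of the endpoint format, the following sketch renders the configuration lines for a list of endpoints, numbering them from 0. The helper name, host names, URLs, and polling intervals are hypothetical placeholders, and the exact form DTS expects for the URL field is an assumption.

```python
# Hypothetical helper (not a DTS API): render prometheus_aggr endpoint lines.
def endpoint_lines(endpoints):
    """endpoints: iterable of (host_name, host_port_url, poll_interval_msec)."""
    return [
        f"prometheus_aggr_endpoint.{n}={host},{url},{interval}"
        for n, (host, url, interval) in enumerate(endpoints)  # N starts from 0
    ]


# Placeholder values, chosen only for the example.
example_endpoints = [
    ("host-a", "http://10.0.0.1:9100/metrics", 1000),
    ("host-b", "http://10.0.0.2:9100/metrics", 5000),
]
config_lines = endpoint_lines(example_endpoints)
print("\n".join(config_lines))
```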

Aggregated data can be exported via a Prometheus Aggr Exporter endpoint.

ifconfig collects network interface data. To enable, set:

enable-provider=ifconfig

If the Prometheus endpoint is enabled, add the following configuration to cache every collected network interface and arrange the index according to their names:

prometheus-fset-indexes=name

Metrics are collected for each network interface as follows:

name
rx_packets
tx_packets
rx_bytes
tx_bytes
rx_errors
tx_errors
rx_dropped
tx_dropped
multicast
collisions
rx_length_errors
rx_over_errors
rx_crc_errors
rx_frame_errors
rx_fifo_errors
rx_missed_errors
tx_aborted_errors
tx_carrier_errors
tx_fifo_errors
tx_heartbeat_errors
tx_window_errors
rx_compressed
tx_compressed
rx_nohandler





hcaperf collects HCA performance data. Since it requires access to an RDMA device, it must use remote collection on the DPU. On the host, run the container in privileged mode with the RDMA device mounted.

The counter list is device dependent.

To enable hcaperf in remote collection mode, set:

enable-provider=grpc.hcaperf
# specify HCAs to sample
grpc.hcaperf.mlx5_0=sample
grpc.hcaperf.mlx5_1=sample

Warning DPE server should be active before changing the dts_config.ini file. See section "Remote Collection" for details.





To enable hcaperf in regular mode, set:

enable-provider=hcaperf
# specify HCAs to sample
hcaperf.mlx5_0=sample
hcaperf.mlx5_1=sample

The nvidia-smi provider collects GPU and GPU process information provided by the NVIDIA system management interface.

This provider is supported only on x86_64 hosts with installed GPUs. All GPU cards supported by nvidia-smi are supported by this provider.

The counter list is GPU dependent. Additionally, per-process information is collected for the first nvidia_smi_max_processes processes (20 by default).

Counters can be either collected as string data "as is" in nvidia-smi or converted to numbers when nvsmi_with_numeric_fields is set.

To enable nvidia-smi provider and change parameters, set:

enable-provider=nvidia-smi
# Optional parameters:
#nvidia_smi_max_processes=20
#nvsmi_with_numeric_fields=1





The dcgm provider collects GPU information provided by the NVIDIA Data Center GPU Manager (DCGM) API.

This provider is supported only on x86_64 hosts with installed GPUs, and requires running the nv-hostengine service (refer to DCGM documentation for details).

DCGM counters are split into several groups by context:

GPU – basic GPU information (always)

COMMON – common fields that can be collected from all devices

PROF – profiling fields

ECC – ECC errors

NVLINK / NVSWITCH / VGPU – fields depending on the device type

To enable DCGM provider and counter groups, set:

enable-provider=dcgm
dcgm_events_enable_common_fields=1
#dcgm_events_enable_prof_fields=0
#dcgm_events_enable_ecc_fields=0
#dcgm_events_enable_nvlink_fields=0
#dcgm_events_enable_nvswitch_fields=0
#dcgm_events_enable_vgpu_fields=0

DTS can send the collected data to the following outputs:

Data writer (saves binary data to disk)

Fluent Bit (push-model streaming)

Prometheus endpoint (keeps the most recent data to be pulled)

The data writer is disabled by default to save space on BlueField. Steps for activating data write during debug can be found under section Enabling Data Write.

The schema folder contains JSON-formatted metadata files which allow reading the binary files containing the actual data. The binary files are written according to the naming convention shown in the following example (the tree utility can be installed with apt install tree):

tree /opt/mellanox/doca/services/telemetry/data/
/opt/mellanox/doca/services/telemetry/data/
├── {year}
│   └── {mmdd}
│       └── {hash}
│           ├── {source_id}
│           │   └── {source_tag}{timestamp}.bin
│           └── {another_source_id}
│               └── {another_source_tag}{timestamp}.bin
└── schema
    └── schema_{MD5_digest}.json

New binary files appear when the service starts or when the binary file age or size limit is reached. If the schema or data folders are missing, refer to the Troubleshooting section.

Warning source_id is usually set to the machine hostname. source_tag is a line describing the collected counters, and it is often set as the provider's name or name of user-counters.

Reading the binary data can be done from within the DTS container using the following command:

crictl exec -it <Container ID> /opt/mellanox/collectx/bin/clx_read -s /data/schema /data/path/to/datafile.bin

Warning The path to the data file must be an absolute path.

Example output:

{
    "timestamp": 1634815738799728,
    "event_number": 0,
    "iter_num": 0,
    "string_number": 0,
    "example_string": "example_str_1"
}
{
    "timestamp": 1634815738799768,
    "event_number": 1,
    "iter_num": 0,
    "string_number": 1,
    "example_string": "example_str_2"
}
…
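Because the example output is a sequence of JSON objects rather than a single JSON document, it cannot be passed to a one-shot JSON parser as is. The following sketch shows one way to split such a stream into records, assuming the output has been captured into a string and contains only complete objects separated by whitespace. The function name is hypothetical and the sample text mirrors the example above.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object found in a whitespace-separated stream."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Sample text shaped like the example output (assumed captured to a string).
sample = """
{ "timestamp": 1634815738799728, "event_number": 0, "iter_num": 0,
  "string_number": 0, "example_string": "example_str_1" }
{ "timestamp": 1634815738799768, "event_number": 1, "iter_num": 0,
  "string_number": 1, "example_string": "example_str_2" }
"""
records = list(iter_json_objects(sample))
print(len(records), records[1]["example_string"])
```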





The Prometheus endpoint keeps the most recent data to be pulled by the Prometheus server and is enabled by default.

To check that data is available, run the following command on BlueField:

curl -s http://0.0.0.0:9100/metrics

The command dumps every counter in the following format:

counter_name {list of meta fields} counter_value timestamp
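As a rough illustration, a line in this format can be split into its four parts as in the sketch below. It assumes the meta fields follow the usual Prometheus label syntax (key="value" pairs inside braces). The sample line, its label names, and its values are made up for the example, and label values containing a closing brace are not handled.

```python
import re

# Assumed line shape: name{key="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[^\s{]+)\s*(?:\{(?P<labels>[^}]*)\})?\s+'
    r'(?P<value>\S+)(?:\s+(?P<ts>\S+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metric_line(line):
    """Split one metrics line into name, meta fields, value, and timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized metrics line: {line!r}")
    return {
        "name": match.group("name"),
        "meta": dict(LABEL_RE.findall(match.group("labels") or "")),
        "value": float(match.group("value")),
        "timestamp": int(match.group("ts")) if match.group("ts") else None,
    }


# Made-up sample line; real counter names and meta fields will differ.
parsed = parse_metric_line(
    'port_rcv_data{device_name="p0",source="host1"} 1234 1634815738799'
)
print(parsed)
```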

Additionally, the endpoint supports JSON and CSV formats:

curl -s http://0.0.0.0:9100/json/metrics
curl -s http://0.0.0.0:9100/csv/metrics

Warning The default port for Prometheus can be changed in dts_config.ini.





Prometheus is configured as a part of dts_config.ini.

By default, the Prometheus HTTP endpoint is set to port 9100. Comment this line out to disable Prometheus export.

prometheus=http://0.0.0.0:9100

Prometheus can use the data field as an index to keep several data records with different index values. Index fields are added to Prometheus labels.

# Comma-separated counter set description for Prometheus indexing:
#prometheus-indexes=idx1,idx2
# Comma-separated fieldset description for prometheus indexing
#prometheus-fset-indexes=idx1,idx2

The default fset index is device_name. It allows Prometheus to keep ethtool data for both the p0 and p1 devices.

prometheus-fset-indexes=device_name

If the fset index is not set, the data from p1 overwrites p0's data.
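The effect of an index can be pictured with a small conceptual model: the endpoint keeps the latest value per key, and an index field becomes part of that key. The sketch below is only an illustration of this idea, not DTS's internal data structure. The record layout and function name are assumptions.

```python
def keep_latest(records, index_fields=()):
    """Keep the most recent value per (counter name, index values) key."""
    latest = {}
    for record in records:
        index = tuple(record["meta"].get(field) for field in index_fields)
        for name, value in record["counters"].items():
            latest[(name,) + index] = value  # same key: newer value replaces older
    return latest


# Two hypothetical ethtool records, one per device, in arrival order.
records = [
    {"meta": {"device_name": "p0"}, "counters": {"rx_packets": 10}},
    {"meta": {"device_name": "p1"}, "counters": {"rx_packets": 20}},
]
without_index = keep_latest(records)
with_index = keep_latest(records, index_fields=("device_name",))
print(without_index)
print(with_index)
```

Without an index both records share one key, so only the p1 value survives. With device_name as the index, each device keeps its own entry.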

For quick name filtering, the Prometheus exporter supports being provided with a comma-separated list of counter names to be ignored:

#prometheus-ignore-names=counter_name1,counter_name_2

For quick filtering of data by tag, the Prometheus exporter supports being provided with a comma-separated list of data source tags to be ignored.

Users should add tags for all streaming data since the Prometheus exporter cannot be used for streaming. By default, FI_metrics are disabled.

prometheus-ignore-tags=FI_metrics





Prometheus aggregator exporter is an endpoint that keeps the latest aggregated data using prometheus_aggr.

This exporter labels data according to its source.

To enable this provider, users must set 2 parameters in dts_config.ini:

prometheus-aggr-exporter-host=0.0.0.0
prometheus-aggr-exporter-port=33333





Fluent Bit allows streaming to multiple destinations. Destinations are configured in .exp files that are documented in-place and can be found under:

/opt/mellanox/doca/services/telemetry/config/fluent_bit_configs

Fluent Bit allows exporting data via the "Forward" protocol, which connects to the Fluent Bit/FluentD instance on the customer side.

Export can be enabled manually:

1. Uncomment the line with fluent_bit_configs=… in dts_config.ini.
2. Set enable=1 in required .exp files for the desired plugins.
3. Additional configurations can be set according to instructions in the .exp file if needed.
4. Restart the DTS.
5. Set up receiving instance of Fluent Bit/FluentD if needed.
6. See the data on the receiving side.

Export file destinations are set by configuring .exp files or creating new ones. It is recommended to start by going over documented example files. Documented examples exist for the following supported plugins:

forward

file

stdout

kafka

es (Elasticsearch)

influx

Warning All .exp files are disabled by default if not configured by initContainer entry point through .yaml file.

Warning To forward the data to several destinations, create several forward_{num}.exp files. Each of these files must have its own destination host and port.

Each export destination has the following fields:

name – configuration name

plugin_name – Fluent Bit plugin name

enable – 1 or 0 values to enable/disable this destination

host – the host for Fluent Bit plugin

port – port for Fluent Bit plugin

msgpack_data_layout – the msgpacked data format. Default is flb_std. The other option is custom. See section Msgpack Data Layout for details.

plugin_key=val – key-value pairs of Fluent Bit plugin parameter (optional)

counterset / fieldset – file paths (optional). See details in section Cset/Fset Filtering.

source_tag=source_tag1,source_tag2 – comma-separated list of data page source tags for filtering. All other tags are filtered out during export. Event tags are event provider names. Counters cannot be selected individually here: they are all enabled or disabled together using the counters keyword.

Warning Use # to comment a configuration line.
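To make the multi-destination case concrete, the sketch below renders the text of several forward_{num}.exp files from a list of destinations. It assumes the files use plain key=value lines, as the field descriptions above suggest. The helper name, the numbering from 0, and the host addresses are hypothetical, so the documented example files remain the authoritative reference for the syntax.

```python
# Hypothetical helper: the key=value layout is inferred from the field list above.
def render_forward_exp(num, host, port, enable=1):
    """Return the assumed text of a forward_{num}.exp file for one destination."""
    lines = [
        f"name=forward_{num}",
        "plugin_name=forward",
        f"enable={enable}",
        f"host={host}",
        f"port={port}",
        "msgpack_data_layout=flb_std",
    ]
    return "\n".join(lines) + "\n"


# Placeholder destinations; 24224 is the usual Fluent Bit forward input port.
destinations = [("10.0.0.10", 24224), ("10.0.0.11", 24224)]
exp_files = {
    f"forward_{num}.exp": render_forward_exp(num, host, port)
    for num, (host, port) in enumerate(destinations)
}
for file_name, body in exp_files.items():
    print(f"--- {file_name}\n{body}")
```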





Data layout can be configured using .exp files by setting msgpack_data_layout=layout. There are two available layouts: Standard and Custom.

The standard flb_std data layout is an array of 2 fields:

timestamp double value

a plain dictionary (key-value pairs)

The standard layout is appropriate for all Fluent Bit plugins. For example:

[timestamp_val, {"timestamp"=>ts_val, "type"=>"counters/events", "source"=>"source_val", "key_1"=>val_1, "key_2"=>val_2,...}]

The custom data layout is a dictionary of meta-fields and counter fields. Values are placed into a separate plain dictionary. The custom data format can be dumped with the stdout_raw output plugin of an installed Fluent Bit, or forwarded with the forward output plugin.

Counters example:

{"timestamp"=>timestamp_val, "type"=>"counters", "source"=>"source_val", "values"=> {"key_1"=>val_1, "key_2"=>val_2,...}}

Events example:

{"timestamp"=>timestamp_val, "type"=>"events", "type_name"=>"type_name_val", "source"=>"source_val", "values"=>{"key_1"=>val_1, "key_2"=>val_2,...}}
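The difference between the two layouts can be sketched with plain Python objects standing in for the msgpack structures. This is only an illustration of the shapes described above. The function names and sample values are hypothetical, and the actual msgpack serialization is not shown.

```python
def to_flb_std(timestamp, record_type, source, values):
    """Standard layout: [timestamp, flat dictionary of meta fields and values]."""
    flat = {"timestamp": timestamp, "type": record_type, "source": source}
    flat.update(values)  # counters sit next to the meta fields
    return [timestamp, flat]


def to_custom(timestamp, record_type, source, values, type_name=None):
    """Custom layout: meta fields at the top level, values in their own dictionary."""
    record = {"timestamp": timestamp, "type": record_type}
    if record_type == "events":
        record["type_name"] = type_name  # only events carry a type name
    record["source"] = source
    record["values"] = dict(values)
    return record


# Made-up sample data.
sample_values = {"key_1": 1, "key_2": 2}
std = to_flb_std(1634815738.79, "counters", "source_val", sample_values)
custom = to_custom(1634815738.79, "counters", "source_val", sample_values)
event = to_custom(1634815738.79, "events", "source_val", sample_values,
                  type_name="type_name_val")
print(std)
print(custom)
print(event)
```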





Each export file can optionally use one cset and one fset file to filter UFM telemetry counters and events data.

cset contains tokens per line to filter data with "type"="counters".

fset contains several blocks started with the header line [event_type_name] and tokens under that header. An fset file is used to filter data with "type"="events".

Warning Event type names could be prefixed to apply the same tokens to all fitting types. For example, to filter all ethtool events, use [ethtool_event_*].

If several tokens must be matched simultaneously, use <tok1>+<tok2>+<tok3> . Exclusive tokens are available as well. For example, the line <tok1>+<tok2>-<tok3>-<tok4> filters names that match both tok1 and tok2 and do not match tok3 or tok4.

The following are the details of writing cset files:

# Put tokens on separate lines
# Tokens are the actual name 'fragments' to be matched
# port$ # match names ending with token "port"
# ^port # match names starting with token "port"
# ^port$ # include name that is exact token "port"
# port+xmit # match names that contain both tokens "port" and "xmit"
# port-support # match names that contain the token "port" and do not match the "-" token "support"
#
# Tip: To disable counter export put a single token line that fits nothing
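The token rules above can be approximated with the following sketch, which evaluates a single token line against a counter name. It is an interpretation of the documented rules rather than the exporter's actual matcher: the function names are hypothetical, and tokens that themselves contain + or - cannot be expressed.

```python
import re


def _token_hits(token, name):
    """Apply one token, honoring the ^ (start) and $ (end) anchors."""
    starts = token.startswith("^")
    ends = token.endswith("$")
    core = token.lstrip("^").rstrip("$")
    if starts and ends:
        return name == core
    if starts:
        return name.startswith(core)
    if ends:
        return name.endswith(core)
    return core in name


def line_matches(line, name):
    """Evaluate a token line such as 'port+xmit-support' against a name."""
    parts = re.findall(r"([+-]?)([^+-]+)", line.strip())
    if not parts:
        return False
    for sign, token in parts:
        hit = _token_hits(token, name)
        if sign == "-":
            if hit:
                return False  # an excluded token is present
        elif not hit:
            return False  # a required token is missing
    return True


for rule in ("port$", "^port", "port+xmit", "port-support"):
    print(rule, line_matches(rule, "port_xmit_data"))
```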

The following are the details of writing fset files:

# Put your events here
# Usage:
#
# [type_name_1]
# tokens
# [type_name_2]
# tokens
# [type_name_3]
# tokens
# ...
# Tokens are the actual name 'fragments' to be matched
# port$ # match names ending with token "port"
# ^port # match names starting with token "port"
# ^port$ # include name that is exact token "port"
# port+xmit # match names that contain both tokens "port" and "xmit"
# port-support # match names that contain the token "port" and do not match the "-" token "support"
# The next example exports all "tc" events, and events with type prefix "ethtool_" filtered with token "packet":
# [tc]
#
# [ethtool_*]
# packet
# To know which event type names are available check export and find field "type_name"=>"ethtool_event_p0"
# ...
# Corner cases:
# 1. Empty fset file will export all events.
# 2. Tokens written above/without [event_type] will be ignored.
# 3. If cannot open fset file, warning will be printed, all event types will be exported.
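The block structure of an fset file can be read as in the sketch below, which groups token lines under their headers and drops tokens that appear before any header, as corner case 2 describes. The function names are hypothetical, and treating the * in a header as a shell-style wildcard is an assumption about how the prefix matching behaves.

```python
from fnmatch import fnmatchcase


def parse_fset(text):
    """Return {header pattern: [token lines]} for the given fset text."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # tokens written above the first header are ignored
    return sections


def tokens_for(sections, type_name):
    """Return the tokens of the first header matching type_name, or None."""
    for pattern, tokens in sections.items():
        if fnmatchcase(type_name, pattern):  # assumed wildcard semantics
            return tokens
    return None


sample_fset = "stray_token\n[tc]\n\n[ethtool_*]\npacket\n"
sections = parse_fset(sample_fset)
print(sections)
print(tokens_for(sections, "ethtool_event_p0"))
```

An empty token list for a matched header, as with [tc] here, corresponds to exporting every event of that type.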

NetFlow exporter must be used when data is collected as NetFlow packets from the telemetry client applications. In this case, DOCA Telemetry NetFlow API sends NetFlow data packages to DTS via IPC. DTS uses NetFlow exporter to send data to the NetFlow collector (3rd party service).

To enable the NetFlow exporter, set netflow-collector-ip and netflow-collector-port in dts_config.ini. netflow-collector-ip can be set to either an IP or an address.