NVIDIA UFM Telemetry Documentation v1.10

Prometheus Endpoint Support

UFM Telemetry can expose an http or https endpoint to allow simple and effective integration with monitoring systems that work in poll mode and support Prometheus, CSV, or JSON data formats. The endpoint provides only the last data sample. The user cannot obtain statistics for time points in the past.

An http endpoint provides data in Prometheus format by default. It also supports JSON and CSV formats. The user can request the desired format using a URL prefix, as shown in the table below.

| Data Format | URL Prefix |
|-------------|------------|
| Prometheus  | -          |
| JSON        | /json      |
| CSV         | /csv       |

An http endpoint can provide all sampled data using the default /metrics URL. The filtering functionality described in the Cset/Fset Filtering section is also supported. To use it, place a <name>.cset or <name>.fset file in the appropriate folder. The folder must be specified in the configuration file. See section "Configuring Data Polling Endpoint" for more details.

To filter the output through a particular .cset or .fset file, include the filter file name, without its extension, in the URL. For example, if there are two filter files named name1.cset and name2.cset, then the URLs /name1 (or /cset/name1) and /name2 (or /cset/name2) return the output filtered as described in the respective files.

The URL prefixes /cset and /fset can also be used to specify which filter file is meant.

| URL   | File Extension | Folder Parameter in Configuration File | Note |
|-------|----------------|----------------------------------------|------|
| /cset | *.cset         | plugin_env_PROMETHEUS_CSET_DIR         | If the cset folder is not explicitly specified in the configuration file, then the cset directory is set the same as the fset directory. |
| /fset | *.fset         | plugin_env_PROMETHEUS_FSET_DIR         | If the fset folder is not explicitly specified in the configuration file, then the fset directory is set the same as the cset directory. |

Warning

If a URL prefix is not specified, then the filter file will be searched under both cset and fset folders. If they both have files with the same names, then both filters will be applied.
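The lookup rules above can be modeled in a few lines of Python. This is only a sketch of the documented behavior, not the UFM Telemetry implementation; the function name `resolve_filter_files` and its arguments are hypothetical and exist purely for illustration.

```python
from pathlib import Path
from typing import List, Optional


def resolve_filter_files(
    name: str,
    prefix: Optional[str],
    cset_dir: Optional[str],
    fset_dir: Optional[str],
) -> List[Path]:
    """Return the filter files a request for `name` would select.

    `prefix` is "cset", "fset", or None when the URL carries neither
    /cset nor /fset. Hypothetical helper, not part of UFM Telemetry.
    """
    if prefix not in (None, "cset", "fset"):
        raise ValueError(f"unknown filter prefix: {prefix!r}")

    # Documented fallback: an unset folder takes the value of the other one.
    cset_dir = cset_dir or fset_dir
    fset_dir = fset_dir or cset_dir
    if cset_dir is None or fset_dir is None:
        return []

    candidates = []
    if prefix in (None, "cset"):
        candidates.append(Path(cset_dir) / f"{name}.cset")
    if prefix in (None, "fset"):
        candidates.append(Path(fset_dir) / f"{name}.fset")

    # Without a prefix both folders are searched, and every match applies.
    return [path for path in candidates if path.is_file()]
```

With `prefix=None` and a same-named file in each folder, the function returns both paths, which mirrors the warning that both filters are applied.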

A set of URL prefixes can be used to manipulate the data output. The prefixes must be used in the correct order, as they have assigned priorities. The table below shows the URL prefix priority assignments with examples:

| Priority | Prefix       | Link Examples | Description |
|----------|--------------|---------------|-------------|
| 1        | /labels      | /labels/metrics, /metrics | Used to show labels from metadata files |
| 2        | /json, /csv  | /json/metrics, /csv/metrics, /labels/json/metrics, /labels/csv/metrics | Used to specify output format |
| 3        | /cset, /fset | /cset/filter1, /fset/filter2, /labels/cset/filter1, /labels/fset/filter2, /json/cset/filter1, /json/fset/filter2, /csv/cset/filter1, /csv/fset/filter2, /labels/json/cset/filter1, /labels/json/fset/filter2, /labels/csv/cset/filter1, /labels/csv/fset/filter2 | Used to specify which type of filter file should be applied |
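The prefix ordering can be captured in a small client-side helper. The sketch below assumes nothing beyond the priorities listed in the table; `build_path` and its parameters are hypothetical names introduced for illustration.

```python
from typing import Optional


def build_path(
    target: str = "metrics",
    labels: bool = False,
    fmt: Optional[str] = None,
    filter_kind: Optional[str] = None,
) -> str:
    """Assemble a request path with prefixes in priority order:
    /labels (1), then /json or /csv (2), then /cset or /fset (3).

    `target` is "metrics" or the name of a filter file without extension.
    """
    if fmt not in (None, "json", "csv"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if filter_kind not in (None, "cset", "fset"):
        raise ValueError(f"unsupported filter kind: {filter_kind!r}")

    parts = []
    if labels:
        parts.append("labels")
    if fmt:
        parts.append(fmt)
    if filter_kind:
        parts.append(filter_kind)
    parts.append(target)
    return "/" + "/".join(parts)
```

For instance, `build_path("filter1", labels=True, fmt="json", filter_kind="cset")` yields `/labels/json/cset/filter1`, matching one of the link examples in the table.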

To configure the Prometheus endpoint, the keys listed below need to be set in the launch_ibdiagnet_config.ini file.


plugin_env_PROMETHEUS_ENDPOINT http://0.0.0.0:9100
plugin_env_PROMETHEUS_PROXY_ENDPOINT_PORT 9200
plugin_env_PROMETHEUS_INDEXES port_num
plugin_env_PROMETHEUS_FSET_INDEXES port,lid,guid
plugin_env_PROMETHEUS_CSET_DIR /config/prometheus_configs/cset
…

There are several options related to configuring the HTTP polling endpoint. The key plugin_env_PROMETHEUS_ENDPOINT is used to configure the IP interface for endpoint binding. The “0.0.0.0” part in the setting above means that any of the host's valid IP addresses can be used. Note that the user can also specify the host's IP address explicitly.

The plugin_env_PROMETHEUS_ENDPOINT key also configures the data transport. For regular HTTP, set the prefix to http. To send data over a TLS connection, set the prefix to https, set the mandatory parameters (keys) above, and select the existing security keys.

A DH (key exchange protocol) file can also be specified, if needed, as follows:


plugin_env_CLX_SSL_DH_FILE=/certs/dh.pem
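The value of plugin_env_PROMETHEUS_ENDPOINT combines the transport (http or https), the bind address, and the port. As a rough illustration of how a deployment script might sanity-check that value before writing it to the configuration file, the following sketch splits it with the Python standard library. The helper name `describe_endpoint` and the returned dictionary keys are assumptions made for this example.

```python
from urllib.parse import urlsplit


def describe_endpoint(value: str) -> dict:
    """Split an endpoint setting such as http://0.0.0.0:9100 into its parts.

    Hypothetical helper for validating a configuration value; it does not
    read or modify any UFM Telemetry file.
    """
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"endpoint must start with http:// or https://: {value!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"endpoint must include an address and a port: {value!r}")
    return {
        "tls": parts.scheme == "https",                 # https selects TLS transport
        "host": parts.hostname,
        "port": parts.port,
        "all_interfaces": parts.hostname == "0.0.0.0",  # bind on any host address
    }
```

Calling `describe_endpoint("http://0.0.0.0:9100")` reports plain HTTP on port 9100, bound to all interfaces.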

To use custom labels for Prometheus statistics, a metadata file is used. For details about labels and label file format, see sections "Prometheus Labels" and "Prometheus Label Generation".

There are several options that allow configuring metadata. The file containing the labels used in Prometheus generation is set as follows:


plugin_env_CLX_METADATA_FILE=/config/labels.txt

The user can create the metadata file upon system setup or have it generated automatically by a script, using the following parameter:


plugin_env_CLX_METADATA_COMMAND=/opt/mellanox/collectx/telem/bin/gen_metadata --fabric compute --file /var/log/ibdiagnet2.ibnetdiscover --output /config/labels.txt

In the above example, the script generates metadata from /var/log/ibdiagnet2.ibnetdiscover. If the user wishes to create the label file manually, the above option should be commented out to prevent periodic overwriting of the content of the metadata file.

By default, the Prometheus endpoint provides statistics with the collection timestamps. The user can decide whether counter values will be passed with or without timestamps by setting the plugin_env_PROMETHEUS_SHOW_TIMESTAMPS parameter to T (true) or F (false), respectively. For example, to send counter values without timestamps, set the parameter as follows:


plugin_env_PROMETHEUS_SHOW_TIMESTAMPS=F

To use filter folders containing counter sets and field sets, configure the directories where the filter files are stored as follows:


plugin_env_PROMETHEUS_CSET_DIR=/telemetry.config/prometheus_configs/cset
plugin_env_PROMETHEUS_FSET_DIR=/telemetry.config/prometheus_configs/fset

Important

Any parameters not explicitly documented should not be changed and should be considered read-only.

For use cases such as UFM Enterprise or UFM Cyber AI where the network topology is known, a human-readable name can be presented based on the GUID.


# TYPE PortXmitDataExtended counter
# TYPE PortXmitPktsExtended counter
PortXmitDataExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", port_guid="2c90300f172a2", port_num="2"} 85554128244 1628683905941
PortXmitPktsExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", port_guid="2c90300f172a2", port_num="2"} 1188251785 1628683905941
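Each sample line in the output above carries a metric name, a label set, a value, and (unless plugin_env_PROMETHEUS_SHOW_TIMESTAMPS is set to F) a trailing timestamp, which in the Prometheus text format is expressed in milliseconds since the Unix epoch. The following sketch shows one way a consumer might split such a line. The regular expressions are a simplification that covers lines shaped like the examples in this section, not the complete Prometheus grammar, and `Sample` and `parse_sample` are hypothetical names.

```python
import re
from typing import Dict, NamedTuple, Optional

# Simplified patterns: enough for lines shaped like the samples above.
SAMPLE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\s*"
    r"(?:\{(?P<labels>.*)\})?\s+"
    r"(?P<value>\S+)"
    r"(?:\s+(?P<ts>-?\d+))?\s*$"
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]  # None when timestamps are disabled


def parse_sample(line: str) -> Optional[Sample]:
    """Parse one exposition line; return None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    ts = match.group("ts")
    return Sample(
        name=match.group("name"),
        labels=labels,
        value=float(match.group("value")),
        timestamp_ms=int(ts) if ts is not None else None,
    )
```

Because the endpoint returns only the most recent sample, a consumer that needs history would have to store parsed samples itself, keyed for example by metric name, labels, and timestamp.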

For integration with third-party applications, labels which are more human-readable may be generated using a labels metadata file, as described below.

To generate custom labels, a file containing key-value pairs is used. When a key is matched, the corresponding key-value pairs are added to the generated Prometheus labels.

The following is an example of the format of a labels metadata file:


ec0d9a0300b41a50_36|port_id|ec0d9a0300b41a50_36|device_name|SwitchIB Mellanox Technologies|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|server
ec0d9a0300b41a50_37|port_id|ec0d9a0300b41a50_37|device_name|SwitchIB Mellanox Technologies|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|
ec0d9a0300b41a58_1|port_id|ec0d9a0300b41a58_1|device_name||device_type|switch|fabric|compute|hostname|aggregation|node_desc|aggregation node|level||peer_level|leaf
98039b0300640b92_1|port_id|98039b0300640b92_1|device_name||device_type|host|fabric|compute|hostname|agx-1|node_desc|agx-1 mlx5_0|level|server|peer_level|leaf
98039b0300640c22_1|port_id|98039b0300640c22_1|device_name||device_type|host|fabric|compute|hostname|agx-2|node_desc|agx-2 mlx5_0|level|server|peer_level|leaf
0002c90300f172a0_2|port_id|0002c90300f172a0_2|device_name||device_type|host|fabric|compute|hostname|agx-3|node_desc|agx-3 mlx4_0|level|server|peer_level|leaf
98039b0300640b9a_1|port_id|98039b0300640b9a_1|device_name||device_type|host|fabric|compute|hostname|agx-3|node_desc|agx-3 mlx5_0|level|server|peer_level|leaf
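Judging from the sample above, each record appears to consist of a lookup key followed by alternating label names and label values, all separated by the `|` character, with empty values allowed. Under that assumption (the layout is inferred from the example rather than taken from a format specification), a record can be read as in the following sketch. The function name `parse_label_record` is hypothetical.

```python
from typing import Dict, Tuple


def parse_label_record(record: str) -> Tuple[str, Dict[str, str]]:
    """Split one metadata record into (key, {label_name: label_value}).

    Assumes the layout key|name1|value1|name2|value2|... inferred from
    the sample file; empty values are kept as empty strings.
    """
    fields = record.rstrip("\r\n").split("|")
    key, rest = fields[0], fields[1:]
    if not key:
        raise ValueError("record has no key")
    if len(rest) % 2 != 0:
        raise ValueError(f"label names and values do not pair up: {record!r}")
    # Even positions hold label names, odd positions hold their values.
    return key, dict(zip(rest[0::2], rest[1::2]))
```

Applied to the first record of the sample, this returns the key `ec0d9a0300b41a50_36` and a dictionary in which `device_type` is `switch` and `hostname` is an empty string.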

The following is an example of the generated Prometheus output:


# TYPE infiniband_port_xmit_data_bytes counter
# TYPE infiniband_port_rcv_data_bytes counter
# TYPE infiniband_link_error_recovery_events counter
# TYPE infiniband_link_downed_events counter
# TYPE infiniband_cbw gauge
infiniband_port_xmit_data_bytes {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 82218360540 1628602711924
infiniband_port_rcv_data_bytes {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 82218429458 1628602711924
infiniband_link_error_recovery_events {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
infiniband_link_downed_events {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
infiniband_cbw {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924

where ADDITIONAL_LABELS include:
hostname="agx-3"
node_desc="agx-3 mlx5_0"
device_name=""
device_type="host"
fabric="compute"
level="server"
peer_level="leaf"

To enable this functionality, the following additional keys need to be configured:


plugin_env_CLX_EXPORT_API_IBNETDISCOVER_RUN_ONCE 1
# Without this, the gen_metadata.py script cannot generate the human readable names, nor the level and peer_level.
plugin_env_CLX_METADATA_FILE /path/to/labels/file
plugin_env_CLX_METADATA_COMMAND "python3 /opt/mellanox/collectx/telem/bin/gen_metadata.py --fabric compute --file /var/log/ibdiagnet2.ibnetdiscover -o /path/to/labels/file"

To test, the curl command can be used as follows:


[root@jazz11 /]# curl --silent IP_ADDR_OF_HOST:9100/metrics |egrep "xmit|rcv" | tail
port_xmit_discard{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
port_rcv_switch_relay_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
port_rcv_constraint_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
port_xmit_constraint_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
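The same check can be scripted for monitoring tools that poll the endpoint programmatically. The sketch below mirrors the curl, egrep, and tail pipeline using only the Python standard library. The function names are hypothetical, the response is assumed to be UTF-8 text, and the base URL passed to `fetch_metrics` must be replaced with the address of the actual host.

```python
import re
import urllib.request
from typing import List


def fetch_metrics(base_url: str, path: str = "/metrics", timeout: float = 10.0) -> str:
    """Fetch the latest sample from the polling endpoint as text.

    `base_url` is for example "http://IP_ADDR_OF_HOST:9100" with the
    placeholder replaced by a real address. UTF-8 decoding is an assumption.
    """
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def grep_tail(text: str, pattern: str = r"xmit|rcv", count: int = 10) -> List[str]:
    """Keep lines matching `pattern` and return the last `count` of them,
    similar to piping the output through egrep and tail."""
    if count <= 0:
        return []
    matching = [line for line in text.splitlines() if re.search(pattern, line)]
    return matching[-count:]
```

A typical call would be `grep_tail(fetch_metrics("http://IP_ADDR_OF_HOST:9100"))` once the placeholder address has been substituted.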

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.