Prometheus Endpoint Support

NVIDIA UFM Telemetry Documentation v1.7

UFM Telemetry exposes a Prometheus endpoint for simple and effective integration with Prometheus. To configure the endpoint, set the following keys in the launch_ibdiagnet_config.ini file:

    plugin_env_PROMETHEUS_ENDPOINT http://0.0.0.0:9100
    plugin_env_PROMETHEUS_PROXY_ENDPOINT_PORT 9200
    plugin_env_PROMETHEUS_INDEXES port_num
    plugin_env_PROMETHEUS_FSET_INDEXES port,lid,guid
    plugin_env_PROMETHEUS_CSET_DIR /config/prometheus_configs/cset

The default output includes the node_guid, port_guid, and port_num.

For use cases such as UFM Enterprise or UFM Cyber AI, this is sufficient: the network topology is known, so a human-readable name can be presented based on the GUID.

    # TYPE PortXmitDataExtended counter
    # TYPE PortXmitPktsExtended counter
    PortXmitDataExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", port_guid="2c90300f172a2", port_num="2"} 85554128244 1628683905941
    PortXmitPktsExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", port_guid="2c90300f172a2", port_num="2"} 1188251785 1628683905941
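
A third-party consumer usually needs to split such a line into its metric name, labels, value, and timestamp. The following is a minimal sketch of that step, assuming the output follows the usual Prometheus text layout shown above; the function name `parse_sample` is hypothetical and not part of UFM Telemetry, and escaped characters inside label values are left as-is.

```python
import re

# name{labels} value [timestamp]; whitespace before "{" is tolerated.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\s*'
    r'(?:\{(?P<labels>.*)\})?\s+'
    r'(?P<value>\S+)(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One key="value" pair inside the braces.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value, timestamp_ms), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = SAMPLE_RE.match(line)
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    ts = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, float(m.group("value")), ts


# Sample line copied from the default output above.
line = ('PortXmitDataExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", '
        'port_guid="2c90300f172a2", port_num="2"} 85554128244 1628683905941')
name, labels, value, ts = parse_sample(line)
print(name, labels["port_guid"], labels["port_num"], int(value), ts)
# PortXmitDataExtended 2c90300f172a2 2 85554128244 1628683905941
```

For production use, an existing Prometheus client or parser library is likely a better choice than a hand-written regular expression.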

However, for integration with third-party applications, more human-readable labels can be generated using a labels metadata file, as described below.

To generate custom labels, a file containing key-value pairs is used. When a key in the file is matched, the key-value pairs associated with it are added to the Prometheus labels.

An example labels metadata file has the following format:

    ec0d9a0300b41a50_36|port_id|ec0d9a0300b41a50_36|device_name|SwitchIB Mellanox Technologies|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|server
    ec0d9a0300b41a50_37|port_id|ec0d9a0300b41a50_37|device_name|SwitchIB Mellanox Technologies|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|
    ec0d9a0300b41a58_1|port_id|ec0d9a0300b41a58_1|device_name||device_type|switch|fabric|compute|hostname|aggregation|node_desc|aggregation node|level||peer_level|leaf
    98039b0300640b92_1|port_id|98039b0300640b92_1|device_name||device_type|host|fabric|compute|hostname|agx-1|node_desc|agx-1 mlx5_0|level|server|peer_level|leaf
    98039b0300640c22_1|port_id|98039b0300640c22_1|device_name||device_type|host|fabric|compute|hostname|agx-2|node_desc|agx-2 mlx5_0|level|server|peer_level|leaf
    0002c90300f172a0_2|port_id|0002c90300f172a0_2|device_name||device_type|host|fabric|compute|hostname|agx-3|node_desc|agx-3 mlx4_0|level|server|peer_level|leaf
    98039b0300640b9a_1|port_id|98039b0300640b9a_1|device_name||device_type|host|fabric|compute|hostname|agx-3|node_desc|agx-3 mlx5_0|level|server|peer_level|leaf
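
Judging by the example, each line appears to consist of a match key followed by alternating label names and values, all separated by `|`, with empty values allowed. The sketch below reads such lines into a dictionary under that assumption; the layout is inferred from the example only, and the function names `parse_metadata_line` and `load_metadata` are hypothetical.

```python
def parse_metadata_line(line):
    """Split 'KEY|k1|v1|k2|v2|...' into (KEY, {k1: v1, ...}); empty values are kept."""
    fields = line.strip().split("|")
    key, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"unpaired label field in line keyed {key!r}")
    # Even positions are label names, odd positions are the matching values.
    return key, dict(zip(rest[0::2], rest[1::2]))


def load_metadata(lines):
    """Build {match_key: labels} from an iterable of lines, skipping blank ones."""
    return dict(parse_metadata_line(ln) for ln in lines if ln.strip())


# Two lines copied from the example file above.
sample = [
    "ec0d9a0300b41a50_37|port_id|ec0d9a0300b41a50_37|device_name|SwitchIB Mellanox Technologies"
    "|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|",
    "98039b0300640b92_1|port_id|98039b0300640b92_1|device_name||device_type|host"
    "|fabric|compute|hostname|agx-1|node_desc|agx-1 mlx5_0|level|server|peer_level|leaf",
]
metadata = load_metadata(sample)
print(metadata["98039b0300640b92_1"]["hostname"])  # agx-1
```

To load a real file, the same function can be passed an open file object, for example `load_metadata(open("/path/to/labels/file"))`.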

With this metadata file, the generated Prometheus output looks as follows:

    # TYPE infiniband_port_xmit_data_bytes counter
    # TYPE infiniband_port_rcv_data_bytes counter
    # TYPE infiniband_link_error_recovery_events counter
    # TYPE infiniband_link_downed_events counter
    # TYPE infiniband_cbw gauge
    infiniband_port_xmit_data_bytes {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 82218360540 1628602711924
    infiniband_port_rcv_data_bytes {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 82218429458 1628602711924
    infiniband_link_error_recovery_events {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
    infiniband_link_downed_events {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
    infiniband_cbw {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924

where ADDITIONAL_LABELS include:

    hostname="agx-3"
    node_desc="agx-3 mlx4_0"
    device_name=""
    device_type="host"
    fabric="compute"
    level="server"
    peer_level="leaf"
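
In the examples on this page, the `port_id` value looks like the node GUID zero-padded to 16 hexadecimal digits, followed by an underscore and the port number (node GUID `2c90300f172a0`, port `2` gives `0002c90300f172a0_2`). That reading is inferred from the examples and is not a documented rule. Under that assumption, a consumer of the default output could build the same key as sketched here; the helper name `port_id_from_labels` is hypothetical.

```python
def port_id_from_labels(labels):
    """Build '<16-hex-digit node GUID>_<port number>' from default-output labels.

    Assumption: this mirrors the key format seen in the examples, nothing more.
    """
    guid = int(labels["node_guid"], 16)  # node_guid is hex without a 0x prefix
    return f"{guid:016x}_{int(labels['port_num'])}"


# Labels copied from the default output example.
default_labels = {
    "source": "0x0002c90300f172a0",
    "node_guid": "2c90300f172a0",
    "port_guid": "2c90300f172a2",
    "port_num": "2",
}
print(port_id_from_labels(default_labels))  # 0002c90300f172a0_2
```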

To enable this functionality, the following additional keys need to be configured:

    plugin_env_CLX_EXPORT_API_IBNETDISCOVER_RUN_ONCE 1  # Without this, the gen_metadata.py script cannot generate the human readable names, nor the level and peer_level.
    plugin_env_CLX_METADATA_FILE /path/to/labels/file
    plugin_env_CLX_METADATA_COMMAND "python3 /opt/mellanox/collectx/telem/bin/gen_metadata.py --fabric compute --file /var/log/ibdiagnet2.ibnetdiscover -o /path/to/labels/file"

To test, the curl command can be used as follows:

    [root@jazz11 /]# curl --silent IP_ADDR_OF_HOST:9100/metrics |egrep "xmit|rcv" | tail
    port_xmit_discard{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
    port_rcv_switch_relay_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
    port_rcv_constraint_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
    port_xmit_constraint_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
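
Where curl is not available, the same check can be approximated with the Python standard library. This is a sketch only: it assumes the endpoint listens on port 9100 as configured above, and the function names `fetch_metrics` and `filter_metrics` are hypothetical.

```python
import re
import sys
import urllib.request


def filter_metrics(text, pattern=r"xmit|rcv", last=10):
    """Keep lines matching `pattern` and return the final `last` of them (egrep | tail)."""
    rx = re.compile(pattern)
    matched = [ln for ln in text.splitlines() if rx.search(ln)]
    return matched[-last:] if last > 0 else []


def fetch_metrics(host, port=9100, timeout=5.0):
    """GET http://<host>:<port>/metrics and return the response body as text."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 check_metrics.py IP_ADDR_OF_HOST
    for line in filter_metrics(fetch_metrics(sys.argv[1])):
        print(line)
```

Like the egrep pipeline, the filter also keeps `# TYPE` comment lines whose metric name matches the pattern.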

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.