Prometheus Endpoint Support
UFM Telemetry exposes a Prometheus endpoint to allow simple and effective integration with Prometheus. To configure the Prometheus endpoint, set the following keys in the launch_ibdiagnet_config.ini file:
plugin_env_PROMETHEUS_ENDPOINT http://0.0.0.0:9100
plugin_env_PROMETHEUS_PROXY_ENDPOINT_PORT 9200
plugin_env_PROMETHEUS_INDEXES port_num
plugin_env_PROMETHEUS_FSET_INDEXES port,lid,guid
plugin_env_PROMETHEUS_CSET_DIR /config/prometheus_configs/cset
The default output includes the node_guid, port_guid, and port_num.
For use cases such as UFM Enterprise or UFM Cyber AI, this is sufficient: the network topology is known, so a human-readable name can be presented based on the GUID.
# TYPE PortXmitDataExtended counter
# TYPE PortXmitPktsExtended counter
PortXmitDataExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", port_guid="2c90300f172a2", port_num="2"} 85554128244 1628683905941
PortXmitPktsExtended{source="0x0002c90300f172a0", node_guid="2c90300f172a0", port_guid="2c90300f172a2", port_num="2"} 1188251785 1628683905941
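As an illustration of how a consumer might read the default output above, the following is a minimal Python sketch that splits one sample line into its metric name, labels, value, and timestamp. The function name parse_sample is hypothetical and not part of UFM Telemetry, and the regular expressions cover only the simple line shape shown here, not the full Prometheus exposition format.

```python
import re

# Hypothetical helper, not part of UFM Telemetry: handles only the simple
# "name{labels} value timestamp" shape shown in the sample output.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\s*'
    r'(?:\{(?P<labels>.*)\})?\s+'
    r'(?P<value>\S+)(?:\s+(?P<ts>-?\d+))?\s*$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value, timestamp_ms) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    # Collect every key="value" pair found between the braces.
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    ts = match.group("ts")
    return (
        match.group("name"),
        labels,
        float(match.group("value")),
        int(ts) if ts is not None else None,
    )
```

Lines starting with # (such as the TYPE lines) are not sample lines and would need to be skipped before calling this helper.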
However, for integration with third-party applications, labels which are more human-readable may be generated using a labels metadata file, as described below.
To generate custom labels, a file containing key-value pairs is used. When a key in the file matches a port, the key/value pairs associated with that key are added to the Prometheus labels of that port's metrics.
An example labels metadata file has the following format:
ec0d9a0300b41a50_36|port_id|ec0d9a0300b41a50_36|device_name|SwitchIB Mellanox Technologies|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|server
ec0d9a0300b41a50_37|port_id|ec0d9a0300b41a50_37|device_name|SwitchIB Mellanox Technologies|device_type|switch|fabric|compute|hostname||node_desc||level|leaf|peer_level|
ec0d9a0300b41a58_1|port_id|ec0d9a0300b41a58_1|device_name||device_type|switch|fabric|compute|hostname|aggregation|node_desc|aggregation node|level||peer_level|leaf
98039b0300640b92_1|port_id|98039b0300640b92_1|device_name||device_type|host|fabric|compute|hostname|agx-1|node_desc|agx-1 mlx5_0|level|server|peer_level|leaf
98039b0300640c22_1|port_id|98039b0300640c22_1|device_name||device_type|host|fabric|compute|hostname|agx-2|node_desc|agx-2 mlx5_0|level|server|peer_level|leaf
0002c90300f172a0_2|port_id|0002c90300f172a0_2|device_name||device_type|host|fabric|compute|hostname|agx-3|node_desc|agx-3 mlx4_0|level|server|peer_level|leaf
98039b0300640b9a_1|port_id|98039b0300640b9a_1|device_name||device_type|host|fabric|compute|hostname|agx-3|node_desc|agx-3 mlx5_0|level|server|peer_level|leaf
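The layout of the example lines above can be read as a match key followed by alternating label names and values, all separated by the | character. The following Python sketch parses one such line under that assumption. The format is inferred from the example only, and parse_metadata_line is a hypothetical name, not a UFM Telemetry function.

```python
def parse_metadata_line(line):
    """Split 'key|name1|value1|name2|value2|...' into (key, labels).

    Assumption: the format is inferred from the example file; empty
    values (two adjacent '|' or a trailing '|') are kept as "".
    """
    fields = line.rstrip("\n").split("|")
    key, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"unpaired label name or value in line: {line!r}")
    # Even positions hold label names, odd positions hold their values.
    labels = dict(zip(rest[0::2], rest[1::2]))
    return key, labels


def load_metadata(lines):
    """Build a {key: labels} mapping from an iterable of lines."""
    metadata = {}
    for line in lines:
        if line.strip():
            key, labels = parse_metadata_line(line)
            metadata[key] = labels
    return metadata
```

Passing an open file object to load_metadata would yield a dictionary keyed by the first field of each line, for example 0002c90300f172a0_2.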
The generated Prometheus output is as follows:
# TYPE infiniband_port_xmit_data_bytes counter
# TYPE infiniband_port_rcv_data_bytes counter
# TYPE infiniband_link_error_recovery_events counter
# TYPE infiniband_link_downed_events counter
# TYPE infiniband_cbw gauge
infiniband_port_xmit_data_bytes {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 82218360540 1628602711924
infiniband_port_rcv_data_bytes {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 82218429458 1628602711924
infiniband_link_error_recovery_events {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
infiniband_link_downed_events {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
infiniband_cbw {port_id="0002c90300f172a0_2", ADDITIONAL_LABELS} 0 1628602711924
where ADDITIONAL_LABELS includes:
hostname="agx-3"
node_desc="agx-3 mlx4_0"
device_name=""
device_type="host"
fabric="compute"
level="server"
peer_level="leaf"
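To make the effect of the label matching concrete, the sketch below renders one sample line by looking up a port_id in a metadata mapping and attaching the matched key/value pairs as labels. This only illustrates the resulting output and is not the exporter's implementation. The names render_sample and escape_label_value, and the choice to sort labels alphabetically, are assumptions made for the example.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline for a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_sample(name, port_id, metadata, value, timestamp_ms):
    """Render one sample line, adding labels matched by port_id.

    Hypothetical illustration: `metadata` maps a port_id key to a dict
    of extra labels; an unmatched port keeps only its port_id label.
    """
    labels = {"port_id": port_id}
    labels.update(metadata.get(port_id, {}))
    body = ",".join(
        f'{key}="{escape_label_value(val)}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {value} {timestamp_ms}"
```

A port whose key is missing from the mapping is still rendered, with port_id as its only label.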
To enable this functionality, the following additional keys need to be configured:
plugin_env_CLX_EXPORT_API_IBNETDISCOVER_RUN_ONCE 1 # Without this, the gen_metadata.py script cannot generate the human-readable names or the level and peer_level values.
plugin_env_CLX_METADATA_FILE /path/to/labels/file
plugin_env_CLX_METADATA_COMMAND "python3 /opt/mellanox/collectx/telem/bin/gen_metadata.py --fabric compute --file /var/log/ibdiagnet2.ibnetdiscover -o /path/to/labels/file"
To test the endpoint, use the curl command as follows:
[root@jazz11 /]# curl --silent IP_ADDR_OF_HOST:9100/metrics |egrep "xmit|rcv" | tail
port_xmit_discard{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
port_rcv_switch_relay_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
port_rcv_constraint_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
port_xmit_constraint_errors{device_name="",device_type="host",fabric="compute",hostname="jazz32",level="server",node_desc="jazz32 mlx5_2",peer_level="leaf",port_id="ec0d9a0300c04a54_1"} 0 1629194120043
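The same check can be scripted. The following Python sketch mirrors the curl, egrep, and tail pipeline above using only the standard library. The function names are hypothetical, and the URL in the usage comment reuses the IP_ADDR_OF_HOST placeholder and port 9100 from the example, which must be replaced with the actual host address.

```python
import re
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download the metrics page and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def filter_metrics(text, pattern=r"xmit|rcv", last=10):
    """Keep lines matching `pattern` and return the last `last` of them.

    Equivalent in spirit to: egrep "xmit|rcv" | tail
    """
    regex = re.compile(pattern)
    matched = [line for line in text.splitlines() if regex.search(line)]
    return matched[-last:] if last > 0 else []


# Usage (placeholder host, as in the curl example):
#   text = fetch_metrics("http://IP_ADDR_OF_HOST:9100/metrics")
#   for line in filter_metrics(text):
#       print(line)
```

Keeping the download and the filtering in separate functions allows the filter to be exercised on saved output without a running endpoint.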