Key Performance Indicators (KPI) Plugin
The KPI plugin periodically collects telemetry metrics and topology data from one or multiple UFM Telemetry and UFM clusters to calculate high-level Key Performance Indicators (KPIs). It can operate as a standalone Docker container or as a UFM plugin.
The calculated KPIs and collected telemetry metrics are stored in a Prometheus time-series database using the Prometheus remote-write protocol.
NICs Connectivity
Name: connected_endpoints
Description: This KPI shows the percentage of connected NICs, that is, out of all available NICs, how many are currently connected. The desired value is 100%, meaning all NICs are connected.
The default threshold for "bad" values is ≤ 95%.
Note: The complete list of NICs includes all those detected at least once since the plugin was started.
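As an illustration (a minimal sketch, not the plugin's internal code; the function and variable names below are hypothetical), the percentage can be computed as the share of currently connected NICs among all NICs ever observed:

def connected_endpoints_pct(ever_seen_nics, currently_connected_nics):
    """Percentage of connected NICs; 100.0 means all observed NICs are connected."""
    # All NICs detected at least once since the plugin started form the denominator.
    observed = set(ever_seen_nics)
    if not observed:
        return 100.0
    connected = len(set(currently_connected_nics) & observed)
    return 100.0 * connected / len(observed)

# Example: 97 of 100 observed NICs are connected -> 97.0%, not flagged by the default <=95% threshold.
print(connected_endpoints_pct([f"nic{i}" for i in range(100)], [f"nic{i}" for i in range(97)]))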
Topology Correctness
Name: topology_correctness
Description: This KPI reports the number of wrongly connected links. The ability to spread traffic over multiple minimal-distance paths is critical in utilizing the bandwidth provided by the network.
Therefore, a pass/fail criterion for the topology not being broken by mis-connections is key for the network to provide reasonable bandwidth to running applications. The threshold for failure is > 0.
Link Stability
Name: stability
Description: For each UID (Unique Identifier), the KPI checks for changes in the link down counter. A change may indicate a link recovery, suggesting there was a period with no link. The statistics are tracked per UID, and the ideal value for link down errors is 0, indicating no problematic links.
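A minimal sketch of the underlying check (assumed logic with hypothetical names, not the plugin's code): a positive delta in a UID's link-down counter between two consecutive samples counts as a potential link recovery.

def link_down_deltas(prev_counters, curr_counters):
    """Map UID -> increase of the link-down counter since the previous sample."""
    deltas = {}
    for uid, curr_value in curr_counters.items():
        prev_value = prev_counters.get(uid, curr_value)  # first sighting: no delta
        if curr_value > prev_value:
            deltas[uid] = curr_value - prev_value
    return deltas

# Example: UID 0xabc moved from 3 to 4 link-down events -> one suspected recovery.
print(link_down_deltas({"0xabc": 3, "0xdef": 7}, {"0xabc": 4, "0xdef": 7}))  # {'0xabc': 1}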
Overheated Components
Name: operating_conditions
Description: This KPI displays the total number of elements (ports and devices) experiencing operating condition violations. Each hardware element has a predefined normal operating temperature range, with a common default threshold set at 70°C (158°F) or higher.
Bandwidth Loss due to Congestion
Name: bw_loss_by_congestion
Description: An InfiniBand (IB) network is lossless and can therefore experience congestion spreading. Features like congestion control aim to minimize this issue. This KPI reports the percentage of bandwidth loss due to congestion for each layer and direction, as measured by the port-xmit-wait counter.
The reported percentage is calculated by averaging the xmit-wait equivalent time over the time between samples, across all links in the specified layer-direction group.
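For illustration only (assumed arithmetic, not the plugin's exact implementation): if each link's xmit-wait counter delta is first converted to an equivalent wait time, the per-group loss is that wait time averaged against the sampling interval across the links of the layer-direction group.

def bw_loss_pct(xmit_wait_seconds_per_link, interval_seconds):
    """Average share (in %) of the sampling interval spent waiting, over one link group."""
    if not xmit_wait_seconds_per_link or interval_seconds <= 0:
        return 0.0
    shares = [min(wait, interval_seconds) / interval_seconds
              for wait in xmit_wait_seconds_per_link]
    return 100.0 * sum(shares) / len(shares)

# Example: two links waited 30s and 60s out of a 300s interval -> 15% average bandwidth loss.
print(bw_loss_pct([30.0, 60.0], 300.0))  # 15.0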
Fat Tree (Single Root) to Tree Conversion
Name: tree_over_subscription
Description: Fat Trees were introduced by Leiserson to enable building a tree structure out of fixed-radix (K), fixed-capacity (T) switches. Consider a regular (K-1)-ary tree (each switch has one parent and K-1 children), which has a single root switch and, after H-1 levels, (K-1)^(H-1) leaf switches connecting (K-1)^H hosts, each with bandwidth B. Consider the worst-case traffic pattern, in which every leaf switch sends traffic to other leaf switches such that all traffic must cross the top of the tree. The connection from each leaf switch upward should then carry B(K-1) traffic, and the capacity of that switch must be a bi-directional B(K-1) as well. One level up, the required switch and up-link capacity is B(K-1)^2, and so on. This requires exponentially growing capacity from links and switches.
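For reference, the capacity argument above can be summarized in one formula (a restatement in LaTeX notation; B, K, and H are as defined in the paragraph):

\[
  C_{\text{up}}(h) \;=\; B\,(K-1)^{h}, \qquad h = 1, \dots, H-1
\]

That is, the capacity required from switches and up-links grows geometrically with the level h toward the root, which is exactly the exponential growth noted above.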
For example, the diagram below shows a three-level fat tree:
The Single Root Tree:
The solution to that problem was to split each single switch of the original Single Root Tree (SRT) into multiple switches, forming a Multi-Rooted Tree - which is a Fat Tree. In each level there is exactly the same number of switches connecting up and down, except for the top level, which carries half the number of switches - connecting only down.
The Fat Tree formulations define which switches connect to which other switches, but they do not break the basic concept that a Fat Tree can be collapsed back into a simple tree by merging switches into much larger ones. A pair of leaf switches can still be classified by their distance - or the level of their common parents - just like in the original SRT.
When we want to evaluate the damage to the perfect tree structure due to link faults, it is very hard to get an exact maximum-flow bandwidth without a lengthy O(N^2) algorithm. However, if we examine the SRT obtained by collapsing the Fat Tree back into a tree, we can get an upper bound on the available bandwidth between different branches of the SRT.
This metric evaluates the topology of Fat Tree clusters and extracts their original tree structure. It then evaluates the over-subscription of each sub-tree. This way, the impact of the exact set of missing links can be evaluated in terms of the bandwidth taken out between sub-trees of the topology.
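As a rough illustration of the idea (an assumed simplification, not the plugin's exact formula or API): for a collapsed sub-tree, over-subscription can be viewed as the ratio between its downward link count and its surviving upward link count.

def over_subscription(down_links, surviving_up_links):
    """Illustrative over-subscription ratio of a collapsed sub-tree."""
    if surviving_up_links == 0:
        return float("inf")  # the sub-tree is fully cut off from the rest of the tree
    return down_links / surviving_up_links

# Example: a sub-tree with 16 down-links that lost 4 of its 16 up-links
# becomes roughly 1.33x over-subscribed for traffic crossing its top.
print(round(over_subscription(16, 12), 2))  # 1.33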
Top-of-rack (ToR) to ToR Max Flow (Bandwidth) Matrix
Name: tor_to_tor_bw
Description: This KPI provides a simple metric for the impact of link failures. One such network property is the available bandwidth between pairs of ToRs, which is a meaningful yet traffic-independent metric, so no prior knowledge of the exact set of instantaneous traffic patterns is required.
It is most important to look at the ToR up-ports, since on Fat Trees the number of possible paths grows exponentially when going up the topology towards the roots. The impact of link faults therefore decays exponentially with their level towards the roots.
It would have been convenient to simply count the number of missing up-links on each ToR in the pair and claim that, if ToR X lost x up-links and ToR Y lost y up-links, the lost bandwidth is linkBW*min(x,y). In reality, however, the worst-case lost bandwidth is linkBW*(x+y). This is an artifact of the specific links lost and the network structure.
The code we provide performs a Fat Tree-specific computation of the exact available bandwidth for each ToR pair.
The algorithm assigns unique identifiers to sub-trees and considers the sub-tree that each of a ToR's up-links connects to. It then sums the minimum number of links connected from each of the two ToRs to each of the sub-trees.
A more straightforward approach using networkx, or a direct implementation of Dinic's max-flow algorithm, yielded a much higher runtime, rendering this metric impractical to use.
We use an example to demonstrate these concepts:
Consider the Fat Tree GFT(3;4,4,4;1,4,4) depicted in the diagram below:
We re-arrange the top row of switches (cores) such that the full-bipartites between switch levels 2 and 3 are close together:
The re-ordered picture highlights that if we imagine the red links as missing links, then:
The missing links of ToRs 200 and 210 connect to the same bipartite, and thus they only lose 1 of the 4 paths between them.
However, the missing links of ToRs 200 and 220 connect to 2 different L2-L3 bipartites, and thus 2 of the 4 possible paths are lost.
The resulting ToR-to-ToR max-flow matrix, giving the maximal bandwidth between each pair in units of a single link's bandwidth, is provided below:
The Algorithm
The following steps are conducted in order to produce that table (a minimal sketch of the computation follows the list):
DFS from top to bottom, collecting the list of parents for each switch. We require that all ToRs be reached, so we try all top-level switches until we find one that is connected to all ToRs. Once done, we can ask, for every ToR pair, what the level of their lowest common parent is. This is required since routing will only use shortest paths, and will therefore go up only to that level.
Establish, for each ToR and upward, the unique bottom-up sub-tree it is part of. The same sub-tree number should be used for all parents of a switch that are on the same level. Since the topology may be missing some links, it is not simple to check whether such a number was previously allocated, so first all parents are scanned, and then a consolidation step decides which levels are missing a number. If any number needs to be set, another step applies the consolidated value.
For each ToR switch, we sum the total number of links that connect to switches with the same bottom-up tree (recognized by their assigned number). We do that for every parent level.
Now that we have this data for every ToR, we compute the ToR-to-ToR max-flow by first looking up the level of their common parent and then summing, over the bottom-up trees they both connect to, the minimal number of links each of them connects to that tree.
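The sketch below illustrates the final step under stated assumptions (illustrative only, not the plugin's code; the data layout and names are hypothetical): for each ToR we keep, per parent level, the number of up-links that reach each bottom-up sub-tree, and the pairwise max-flow is the sum of the per-sub-tree minima at the level of the pair's lowest common parent.

def tor_pair_max_flow(tor_links, tor_a, tor_b, common_parent_level):
    """Upper bound on ToR-to-ToR bandwidth, in units of a single link.

    tor_links[tor][level][subtree_id] = number of up-links that ToR has into
    the given bottom-up sub-tree at that parent level.
    """
    links_a = tor_links[tor_a].get(common_parent_level, {})
    links_b = tor_links[tor_b].get(common_parent_level, {})
    shared = links_a.keys() & links_b.keys()
    return sum(min(links_a[s], links_b[s]) for s in shared)

# Toy data matching the red-link discussion above: every ToR starts with 4 up-links,
# 2 into each of 2 L2-L3 bipartites (sub-trees 0 and 1), and one link is missing per ToR.
tor_links = {
    "200": {2: {0: 1, 1: 2}},  # lost one link into sub-tree 0
    "210": {2: {0: 1, 1: 2}},  # lost one link into the same sub-tree 0
    "220": {2: {0: 2, 1: 1}},  # lost one link into sub-tree 1
}
print(tor_pair_max_flow(tor_links, "200", "210", 2))  # 3 -> only 1 of 4 paths lost
print(tor_pair_max_flow(tor_links, "200", "220", 2))  # 2 -> 2 of 4 paths lost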
The plugin can be deployed as a standalone application or as a UFM plugin.
Deploy the KPI as a UFM plugin
Pull/load the latest image of the plugin:
docker pull mellanox/ufm-plugin-kpi
Add the plugin to UFM, specifying the image tag:
/opt/ufm/scripts/manage_ufm_plugins.sh add -p kpi -t <TAG>
Deploy the KPI as a Standalone application
Pull/load the latest image of the plugin:
docker pull mellanox/ufm-plugin-kpi
Run the plugin container:
docker run --network host -v /opt/ufm/files/conf/plugins/kpi:/config -v /opt/ufm/files/logs/plugins/kpi:/logs --rm -dit $IMAGE
The configurations can be managed through a configuration file.
Within the container, the configuration file can be found under /config/kpi_plugin.conf
On the host, the shared volume location of the /config is /opt/ufm/files/conf/plugins/kpi/kpi_plugin.conf
Clusters Configurations
The default KPI configuration includes one default cluster called `unknown`. Once the plugin starts, the default cluster is configured to collect from the secondary telemetry endpoint http://localhost:9002/csv/metrics and from the local UFM topology data.
To change the parameters of the default cluster, or to add additional clusters, refer to the following configuration under kpi_plugin.conf:
### Set name to "cluster-config-$cluster_name". Add section per each cluster
[cluster-config-unknown]
### uncomment and set 2 following options:
#host_list = host_name[1-1024],another_host_name[1-100]
### threshold to distinguish between hosts and switches (inclusive)
#host_max_ports = 4
### OR
### If hostlist format is not possible, another option is per-level regular expressions
### That is a list of regular expressions to detect nodes level in topology (0 is lowest - hosts)
### Key structure is 'level.<level number>.<per-level running index>'
#level.0.0=some-host-pattern-\d+
#level.0.1=another-host-pattern-\d+
#level.1.0=leaf-pattern-l\d+
#level.2.0=spine-pattern-s\d+
#level.3.0=core-pattern-c\d+
### If running as standalone app please uncomment and set next 3 lines
#ufm_ip=0.0.0.0
#ufm_access_token=1234567890abcdefghijklmnopqrst
#telemetry_url=http://0.0.0.0:9100
### If running as UFM plugin and UFM port has changed
#ufm_port=1234
| Property | Description | Default Value |
|---|---|---|
| cluster.host_list | Set of hosts in hostlist format. Used to detect the topology leafs. | None |
| cluster.host_max_ports | The maximal number of ports in a server. Used as a threshold to classify a node as a server or a switch. | 4 |
| cluster.level.X.Y | Regular expression to capture topology levels. The first index (X) is the level, and the second index (Y) is a running index within level X. | None |
| cluster.ufm_ip | UFM IP used to collect the topology data | '127.0.0.1' |
| cluster.ufm_port | UFM port used to collect the topology data in case the cluster is local | '8000' |
| cluster.ufm_access_token | The UFM access token that should be provided in case the cluster collects data from a remote UFM | None |
| cluster.telemetry_url | UFM telemetry endpoint URL used to collect the telemetry metrics | |
| cluster.telemetry_metrics_push_delta_only | If True, only changed telemetry metrics are pushed to Prometheus after each pulling interval, with a fallback to push unchanged metrics if they remain static for over an hour. Otherwise, all fetched metrics are pushed to Prometheus after each pulling interval. | True |
KPI Configurations
These configurations relate to the KPI plugin itself. Please refer to the configuration section below in order to manage the generic KPI configurations:
[kpi-config]
### Optional comma-separated list of the disabled KPIs that we don't want to store in the Prometheus DB
### Available KPIs:
### connected_endpoints,topology_correctness,stability,operating_conditions,bw_loss_by_congestion,tree_over_subscription,tor_to_tor_bw,general,telemetry_metrics
disabled_kpis=telemetry_metrics
| Property | Description | Default Value |
|---|---|---|
| disabled_kpis | Optional comma-separated list of the KPIs that should not be stored in the Prometheus DB. The available KPIs are: connected_endpoints, topology_correctness, stability, operating_conditions, bw_loss_by_congestion, tree_over_subscription, tor_to_tor_bw, general, telemetry_metrics | telemetry_metrics |
Prometheus Configurations
The plugin includes a local Prometheus server instance that is used by default. The following parameters are used to manage the Prometheus configurations:
[prometheus-config]
prometheus_ip=0.0.0.0
prometheus_port=9090
prometheus_db_data_retention_size=120GB
prometheus_db_data_retention_time=15d
| Property | Description | Default Value |
|---|---|---|
| prometheus_ip | IP of Prometheus server | 0.0.0.0 |
| prometheus_port | Port of Prometheus server | 9090 |
| prometheus_db_data_retention_size | Data retention policy by size (used only for the local Prometheus server) | 120GB |
| prometheus_db_data_retention_time | Data retention policy by time (used only for the local Prometheus server) | 15d |
The data storage path of the local Prometheus DB is under /opt/ufm/files/conf/plugins/kpi/prometheus_data.
Global Time Interval Configurations
The following settings are used globally by all clusters; each property can be overridden by adding it under a cluster's section.
[time-interval-config]
telemetry_interval=300
ufm_interval=60
| Property | Description | Default Value |
|---|---|---|
| telemetry_interval | Polling interval for the telemetry metrics data in seconds | 300 |
| ufm_interval | Polling interval for the UFM APIs in seconds | 60 |
| disabled_kpis | Polling interval for the connected_endpoints KPI specifically | 60 |
Logs Configurations
The configurations below manage the kpi_plugin.log file:
[logs-config]
logs_level = INFO
logs_file_name = /log/kpi_plugin.log
log_file_max_size = 10485760
log_file_backup_count = 5
Grafana Dashboard Templates
The KPI plugin provides several Grafana dashboard templates that Grafana users can import. These dashboards display the KPI panels and graphs. The main dashboard is a cluster view that shows the cluster's KPIs. The dashboard should be connected to the KPI Prometheus server.
The dashboard JSON template can be found under /opt/ufm/files/conf/plugins/kpi/grafana/:
Logs
The following logs are exposed under /opt/ufm/files/logs/plugins/kpi/ in UFM plugin mode, and under /log inside the container in standalone mode.
kpi_plugin.log - The application logs.
kpi_plugin_stderr/stdout - The application service logs.
prometheus.log - The local Prometheus logs.
Get List of Configured Clusters
URL:
UFM Plugin: https://<IP>/ufmRest/plugin/kpi/api/cluster/__names__
Standalone: http://<IP>:8686/api/cluster/__names__
Method: GET
Response: A list of cluster name strings.
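For example, the endpoint can be queried with a short Python client (a sketch; the host, port, and credentials are placeholders, and the authentication details depend on your UFM setup):

import requests

# Standalone mode (plugin assumed reachable on port 8686):
print(requests.get("http://127.0.0.1:8686/api/cluster/__names__", timeout=10).json())

# UFM plugin mode, through the UFM REST gateway (placeholder credentials):
print(requests.get(
    "https://127.0.0.1/ufmRest/plugin/kpi/api/cluster/__names__",
    auth=("ufm_user", "ufm_password"),  # placeholder UFM credentials
    verify=False,                       # only if the UFM certificate is self-signed
    timeout=10,
).json())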
Get KPIs Information
URL:
UFM Plugin: https://<IP>/ufmRest/plugin/kpi/files/kpi_info
Standalone: http://<IP>:8686/files/kpi_info
Method: GET
Response: List of KPI names and HTML descriptions.
Get KPIs Values
URL:
UFM Plugin: https://<IP>/ufmRest/plugin/kpi/api/cluster/<cluster_name>
Standalone: http://<IP>:8686/api/cluster/<cluster_name>
Method: GET
Response: A list of KPIs with their calculated graph values, to be used in the UI. Generally, 2 main components are expected: the table data (key “2d_matrix_data”) and the graph data (key “graph_data”). Each value follows the structure expected by the UI (a short example of reading this response follows the list):
display name – KPI display name
2d_matrix_data – Latest data in a table-like format. Every list item is a dict that makes a table row - corresponding to the columns of that row.
data – List of dicts; each dict is a column in the row. May contain:
percentage – Value in percentage.
value – Raw value.
name – Column header
direction – For arrows icon (up / down)
name – Row header
graph_data – dictionary with the following items:
multi_line_graph – time series graph for one or more series.
title – Graph title.
x_label – X axis label.
x_values – X axis values.
y_labels – Y axis labels.
y_values – Y axis values.
info – A string with information about the graph data.
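For example (an illustrative sketch in standalone mode; the host, port, and cluster name are placeholders, and the exact key names should be validated against a live response):

import json
import requests

payload = requests.get("http://127.0.0.1:8686/api/cluster/unknown", timeout=10).json()
# The text above describes a list of KPI objects; normalize defensively if a dict is returned.
kpis = payload if isinstance(payload, list) else list(payload.values())
for kpi in kpis:
    # Table component: one dict per row, columns under the "data" key.
    for row in kpi.get("2d_matrix_data", []):
        print(row.get("name"), [col.get("value") for col in row.get("data", [])])
    # Graph component: time-series payload used by the UI graphs.
    print(json.dumps(kpi.get("graph_data", {}), indent=2)[:400])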
Get KPI Plugin Overview
URL:
UFM Plugin: https://<IP>/ufmRest/plugin/kpi/overview?time_window=<value> (in hours)
Standalone: http://<IP>:8686/overview?time_window=<value> (in hours)
Method: GET
Response:
{
"clusters_telemetry_info": [
{
"cluster_name": "unknown",
"interval": 60,
"url": "http://pjazz:9002/csv/metrics"
}
],
"prometheus_info": {
"db_statistics": {
"appended_samples_rate_per_sec": 11.46,
"bytes_rate_per_sample": 9.227,
"total_compressed_blocks_size_bytes": 272329313.0,
"total_head_chunks_size_bytes": 1937238.0,
"total_wal_size_bytes": 2989807.0
},
"url": "http://pjazz:9191"
}
}
Where:
clusters_telemetry_info: Contains a list of the configured clusters' information; each cluster has the following properties:
| Property | Description | Default Value |
|---|---|---|
| cluster_name | Cluster’s name | One cluster with name ‘unknown’ |
| interval | Cluster’s telemetry pulling time interval | 60 |
| url | The cluster’s telemetry URL | |
prometheus_info: Contains the general Prometheus configurations (e.g. the URL) and statistics about the collected samples (for the local Prometheus mode)
| Property | Description |
|---|---|
| db_statistics.appended_samples_rate_per_sec | Rate of the total appended samples per second, calculated by the following Prometheus expression: rate(prometheus_tsdb_head_samples_appended_total[{rate_window_time}h]) |
| db_statistics.bytes_rate_per_sample | Rate of ingested bytes per sample, calculated by the following Prometheus expression: rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[{rate_window_time}h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[{rate_window_time}h]) |
| db_statistics.total_compressed_blocks_size_bytes | Total size, in bytes, of the compressed blocks stored by Prometheus, given by the following Prometheus expression: prometheus_tsdb_storage_blocks_bytes |
| db_statistics.total_head_chunks_size_bytes | Total size, in bytes, of the HEAD chunks stored by Prometheus, given by the following Prometheus expression: prometheus_tsdb_head_chunks_storage_size_bytes |
| db_statistics.total_wal_size_bytes | Total WAL size in bytes, given by the following Prometheus expression: prometheus_tsdb_wal_storage_size_bytes |
| prometheus_info.url | The Prometheus server’s URL |