NVIDIA UFM Enterprise User Manual v6.18.0

Key Performance Indexes (KPI) Plugin

The KPI plugin periodically collects telemetry metrics and topology data from one or multiple UFM Telemetry and UFM clusters to calculate high-level Key Performance Indicators (KPIs). It can operate as a standalone Docker container or as a UFM plugin.

The calculated KPIs and collected telemetry metrics are stored in a Prometheus time-series database using the remote-write Prometheus protocol.

NICs Connectivity

  • Name: connected_endpoints

  • Description: This KPI shows the percentage of NIC connectivity, namely, out of all available NICs, how many are currently connected. The desired value is 100%, meaning all NICs are connected (a minimal computation sketch follows the note below).

    The default threshold for "bad" values is ≤ 95%.

    Note

    The complete list of NICs includes all those detected at least once since the plugin was started.
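A minimal computation sketch, assuming hypothetical sets of NICs seen since startup and NICs currently connected (the plugin's internal bookkeeping may differ):

# Minimal sketch (not the plugin's actual code): compute the connected_endpoints
# KPI from a hypothetical set of NICs seen so far and the NICs currently up.
DEFAULT_THRESHOLD = 95.0  # percent; values at or below this are considered "bad"

def connected_endpoints(seen_nics: set, connected_nics: set) -> float:
    """Percentage of NICs currently connected out of all NICs ever seen."""
    if not seen_nics:
        return 100.0
    return 100.0 * len(connected_nics & seen_nics) / len(seen_nics)

# Example: 98 of 100 NICs detected since startup are currently connected.
seen = {f"nic-{i}" for i in range(100)}
up = seen - {"nic-3", "nic-42"}
value = connected_endpoints(seen, up)
print(f"connected_endpoints = {value:.1f}% (bad: {value <= DEFAULT_THRESHOLD})")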

Topology Correctness

  • Name: topology_correctness

  • Description: This KPI reports the number of wrongly connected links. The ability to spread traffic over multiple minimal-distance paths is critical in utilizing the bandwidth provided by the network.

    Therefore, a pass/fail criterion for the topology not being broken by mis-connections is key to the network providing reasonable bandwidth to running applications. The threshold for failure is > 0.

Link Stability

  • Name: stability

  • Description: For each UID (Unique Identifier), the KPI checks for changes in the link down counter. A change may indicate a link recovery, suggesting there was a period with no link. The statistics are tracked per UID, and the ideal value for link down errors is 0, indicating no problematic links. A minimal per-UID check is sketched below.
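A minimal per-UID check, assuming hypothetical dictionaries that map a port UID to its link-down counter in two consecutive samples (not the plugin's actual code):

# Sketch: count UIDs whose link-down counter increased between two samples.
# `previous` and `current` map a port UID to its link_down_counter value.
def unstable_links(previous: dict, current: dict) -> int:
    count = 0
    for uid, counter in current.items():
        if uid in previous and counter > previous[uid]:
            count += 1  # the link went down (and recovered) since the last sample
    return count

prev = {"0x1": 0, "0x2": 3}
curr = {"0x1": 0, "0x2": 4}   # UID 0x2 recorded one more link-down event
print(unstable_links(prev, curr))  # -> 1; the ideal value is 0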

Overheated Components

  • Name: operating_conditions

  • Description: This KPI displays the total number of elements (ports and devices) experiencing operating condition violations. Each hardware element has a predefined normal operating temperature range, with a common default threshold set at 70°C (158°F) or higher.

Bandwidth Loss due to Congestion

  • Name: bw_loss_by_congestion

  • Description: An InfiniBand (IB) network is lossless and can therefore experience congestion spread. Features like congestion control aim to minimize this issue. This KPI reports the percentage of bandwidth loss due to congestion for each layer and direction, as measured by the port-xmit-wait counter.

    The reported percentage is calculated by dividing the xmit-wait equivalent time by the time between samples and averaging the result across all links in the specified layer-direction group (see the sketch below).
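The sketch below illustrates one way such a percentage could be derived from port-xmit-wait deltas for a group of links; the tick-to-seconds conversion factor and the input shapes are assumptions for illustration, not the plugin's exact implementation:

# Sketch: estimate bandwidth loss (%) for one layer/direction group of links.
# delta_xmit_wait: per-link increase of the port-xmit-wait counter (in ticks)
# tick_seconds: duration of a single xmit-wait tick (assumed known per device)
# interval_seconds: time between the two telemetry samples
def bw_loss_percent(delta_xmit_wait: list, tick_seconds: float, interval_seconds: float) -> float:
    if not delta_xmit_wait or interval_seconds <= 0:
        return 0.0
    per_link_loss = [d * tick_seconds / interval_seconds for d in delta_xmit_wait]
    return 100.0 * sum(per_link_loss) / len(per_link_loss)

# Example: three links sampled 300 s apart, with xmit-wait ticks of ~4 ns each.
print(f"{bw_loss_percent([2.0e9, 0.0, 1.0e9], 4e-9, 300.0):.2f}% loss")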

Fat Tree (Single Root) to Tree Conversion

  • Name: tree_over_subscription

  • Description: Fat Trees were constructed by Leiserson in order to enable building a tree structure using switches of fixed radix K and capacity T. Consider a regular (K-1)-ary tree (each switch has one parent and K-1 children) with a single root switch; after H-1 levels it has (K-1)^(H-1) leaf switches connecting (K-1)^H hosts, each with bandwidth B. Consider the worst-case traffic pattern, where each leaf switch sends traffic to other leaf switches such that it must cross the top of the tree. The up-link from each leaf switch should then carry B(K-1) traffic, and the capacity of that switch must be a bi-directional B(K-1) as well. One level up, the required switch and up-link capacity is B(K-1)^2, and so on. Such a tree therefore requires exponentially growing link and switch capacity.

For example, the diagram below shows a three-level fat tree:

[Figure: three-level Fat Tree]

The Single Root Tree:

[Figure: Single Root Tree]

The solution to that problem was to split each single switch in the original Single Root Tree (SRT) into multiple switches, producing a Multi-Rooted Tree - which is a Fat Tree. In each level there is exactly the same number of switches connecting up and down, except for the top level, which carries half the number of switches - connecting only down.

The Fat Tree formulations define which switches connect to which other switches, but they do not break the basic concept that a Fat Tree can be collapsed back into a simple tree by merging switches into much larger ones. A pair of leaf switches can still be classified by their distance - or the level of their common parent - just like in the original SRT.

When we want to evaluate the damage to the perfect tree structure due to link faults, it is very hard to get an exact maximum-flow bandwidth without a lengthy O(N^2) algorithm. However, if we examine the SRT obtained by collapsing the Fat Tree back into a tree, we can get an upper bound on the available bandwidth between different branches of the SRT.

This metric evaluates the topology of Fat Tree clusters and extracts their original tree structure. It then evaluates the over-subscription of each sub-tree. This way, the impact of the exact set of missing links can be evaluated in terms of the bandwidth taken out between sub-trees of the topology (a rough illustration follows below).
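As a rough illustration of the idea (a hedged sketch, not the plugin's algorithm), once the Fat Tree is collapsed into its SRT, the over-subscription of a collapsed sub-tree can be estimated as the ratio between the down-link capacity feeding it and the up-links that actually survive:

# Rough sketch (not the plugin's algorithm): over-subscription of a collapsed
# sub-tree in the SRT view, in units of a single link's bandwidth.
def over_subscription(down_links: int, surviving_up_links: int) -> float:
    """A ratio > 1.0 means the surviving up-links cannot carry the full down-link demand."""
    if surviving_up_links == 0:
        return float("inf")   # the sub-tree is completely cut off from the rest of the tree
    return down_links / surviving_up_links

# Example: a sub-tree built for a 1:1 ratio (16 down-links, 16 up-links) that lost 4 up-links.
print(over_subscription(16, 12))   # -> 1.333..., i.e. 4:3 over-subscription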

Top-of-rack (ToR) to ToR Max Flow (Bandwidth) Matrix

  • Name: tor_to_tor_bw

  • Description: This KPI provides a simple metric for the impact of link failures: the available bandwidth between pairs of ToRs. This is a meaningful yet traffic-independent metric, so no prior knowledge of the exact set of instantaneous traffic patterns is required.

It is most important to look at the ToR up-ports, since on Fat Trees the number of possible paths grows exponentially when going up the topology towards the roots. The impact of link faults therefore decays exponentially with their level towards the roots.

It would have been convenient to simply count the number of missing up-links on each of the ToRs in the pair and claim that, if ToR X lost x up-links and ToR Y lost y up-links, the lost bandwidth is linkBW*min(x,y). In reality, however, the worst-case lost bandwidth is linkBW*(x+y). This is an artifact of the specific links lost and of the network structure.

The code we provide performs a Fat Tree specific computation of the exact available bandwidth for each TOR's pair.

The algorithm assigns unique identifiers to sub-trees and considers the sub-tree each of a ToR's links connects to. It then sums, over the sub-trees, the minimum number of links connected from each of the two ToRs to that sub-tree.

A more straightforward approach, using networkx or a direct implementation of Dinic's max-flow algorithm, yielded a much higher runtime, rendering the metric impractical to use.

We use an example to demonstrate these concepts:

Consider the Fat Tree GFT(3;4,4,4;1,4,4) depicted in the diagram below:

[Figure: the GFT(3;4,4,4;1,4,4) Fat Tree]

We re-arrange the top row of switches (cores) such that the full-bipartites between switch levels 2 and 3 are close together:

[Figure: GFT(3;4,4,4;1,4,4) with the core switches re-ordered to group the L2-L3 bipartites]

The re-ordered picture highlights that if we imagine the red links as missing links, then:

  1. The missing links of ToRs 200 and 210 connect to the same bipartite, so only 1 of the 4 paths between them is lost.

  2. However, the missing links of ToRs 200 and 220 connect to 2 different L2-L3 bipartites, so 2 of the 4 possible paths are lost.

The resulting ToR-to-ToR max-flow matrix, which gives the maximal bandwidth between each pair in units of a single link's bandwidth, is shown below:

[Figure: ToR-to-ToR max-flow matrix for the example above]

The Algorithm

The following steps are conducted in order to produce that table (a sketch of the final step follows the list):

  1. DFS from top to bottom, collecting the list of parents for each switch. We require that all ToRs be reached, so we try all top-level switches until we find one that is connected to all ToRs. Once done, we can ask, for every pair of ToRs, what the level of their lowest common parent is. This is required since routing only uses shortest paths, and thus goes up to that level only.

  2. Establish, for each ToR and each switch above it, the unique bottom-up sub-tree it is part of. The same sub-tree number should be used for all parents of a switch that are on the same level. Since the topology may be missing some links, it is not simple to check whether such a number was previously allocated, so first all parents are scanned, then a consolidation step decides which levels are missing a number, and if any needs to be set, a further step applies the consolidated value.

  3. For each ToR switch, we sum the total number of links that connect to switches within the same bottom-up tree (recognized by their assigned number). We do that for every parent level.

  4. Now that we have that data for every ToR, we compute the ToR-to-ToR max flow by first looking up the level of their common parent and then summing, over the bottom-up trees at that level, the minimal number of links each of the two ToRs connects to that tree.
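Under assumed data structures (the per-ToR, per-level link counts produced by steps 1-3), step 4 could look roughly like the following sketch; it is not the plugin's actual code:

# Sketch of step 4 only: given, for each ToR and each parent level, how many of
# its up-links reach each bottom-up sub-tree, compute the ToR-to-ToR max flow
# in units of a single link's bandwidth.
from typing import Dict

# links[tor][level][subtree_id] = number of links from `tor` into `subtree_id`
# at that parent level (collected in steps 1-3).
Links = Dict[str, Dict[int, Dict[int, int]]]

def tor_to_tor_max_flow(links: Links, tor_x: str, tor_y: str, common_level: int) -> int:
    x, y = links[tor_x][common_level], links[tor_y][common_level]
    # Flows can only meet inside a shared bottom-up sub-tree; each sub-tree
    # contributes the minimum of the two ToRs' link counts into it.
    return sum(min(n, y.get(subtree, 0)) for subtree, n in x.items())

# Example matching the GFT(3;4,4,4;1,4,4) discussion: ToRs 200 and 220 each lost
# one up-link, but into two *different* L2-L3 bipartites (sub-trees 0 and 1).
links = {
    "T200": {2: {0: 1, 1: 2}},   # lost a link into sub-tree 0
    "T220": {2: {0: 2, 1: 1}},   # lost a link into sub-tree 1
}
print(tor_to_tor_max_flow(links, "T200", "T220", 2))  # -> 2 of the 4 paths remain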

The plugin can be deployed as a standalone application or as a UFM plugin.

Deploy the KPI as a UFM plugin

  1. Pull/load the latest image of the plugin:


    docker pull mellanox/ufm-plugin-kpi

  2. Add the plugin to UFM by running:


    /opt/ufm/scripts/manage_ufm_plugins.sh add -p kpi -t <TAG>

Deploy the KPI as a Standalone application

  1. Pull/load the latest image of the plugin:


    docker pull mellanox/ufm-plugin-kpi

  2. Run the plugin container (replace $IMAGE with the image pulled in step 1):


    docker run --network host -v /opt/ufm/files/conf/plugins/kpi:/config -v /opt/ufm/files/logs/plugins/kpi:/logs --rm -dit $IMAGE

The configurations can be managed through a configuration file.

  • Within the container, the configuration file can be found under /config/kpi_plugin.conf

  • On the host, the /config shared volume maps the file to /opt/ufm/files/conf/plugins/kpi/kpi_plugin.conf

Clusters Configurations

The default KPI configuration includes one default cluster called `unknown`. Once the plugin starts, the default cluster is configured to collect from the secondary telemetry endpoint http://localhost:9002/csv/metrics and from the local UFM topology data.

To change the parameters of the default cluster, or to add additional clusters, please refer to the following configuration section in kpi_plugin.conf:


### Set name to "cluster-config-$cluster_name". Add section per each cluster
[cluster-config-unknown]

### uncomment and set 2 following options:
#host_list = host_name[1-1024],another_host_name[1-100]
### threshold to distinguish between hosts and switches (inclusive)
#host_max_ports = 4

### OR

### If hostlist format is not possible, another option is per-level regular expressions
### That is a list of regular expressions to detect nodes level in topology (0 is lowest - hosts)
### Key structure is 'level.<level number>.<per-level running index>'
#level.0.0=some-host-pattern-\d+
#level.0.1=another-host-pattern-\d+
#level.1.0=leaf-pattern-l\d+
#level.2.0=spine-pattern-s\d+
#level.3.0=core-pattern-c\d+

### If running as standalone app please uncomment and set next 3 lines
#ufm_ip=0.0.0.0
#ufm_access_token=1234567890abcdefghijklmnopqrst
#telemetry_url=http://0.0.0.0:9100

### If running as UFM plugin and UFM port has changed
#ufm_port=1234

Property | Description | Default Value
cluster.host_list | Set of hosts in hostlist format. Used to detect the topology leaves. | None
cluster.host_max_ports | The maximal number of ports in a server. Used as a threshold to classify a node as a server or a switch. | 4
cluster.level.X.Y | Regular expression to capture topology levels. The first index (X) is the level and the second index (Y) is a running index within level X. | None
cluster.ufm_ip | UFM IP used to collect the topology data | '127.0.0.1'
cluster.ufm_port | UFM port used to collect the topology data in case the cluster is local | '8000'
cluster.ufm_access_token | The UFM access token that should be provided in case the cluster collects data from a remote UFM | None
cluster.telemetry_url | UFM telemetry endpoint URL used to collect the telemetry metrics | http://127.0.0.1:9002
cluster.telemetry_metrics_push_delta_only | If True, only changed telemetry metrics are pushed to Prometheus after each polling interval, with a fallback to push unchanged metrics if they remain static for over an hour. Otherwise, all fetched metrics are pushed to Prometheus after each polling interval. | True


KPI Configurations

These configurations relate to the KPI plugin itself. Please refer to the configuration section below in order to manage the generic KPI configurations:


[kpi-config]
### Optional comma-separated list of the disabled KPIs that should not be stored in the Prometheus DB
### Available KPIs:
### connected_endpoints,topology_correctness,stability,operating_conditions,bw_loss_by_congestion,tree_over_subscription,tor_to_tor_bw,general,telemetry_metrics
disabled_kpis=telemetry_metrics

Property | Description | Default Value
disabled_kpis | Optional comma-separated list of the disabled KPIs that should not be stored in the Prometheus DB. The available KPIs are: connected_endpoints, topology_correctness, stability, operating_conditions, bw_loss_by_congestion, tree_over_subscription, tor_to_tor_bw, general, telemetry_metrics | telemetry_metrics


Prometheus Configurations

The plugin includes a local Prometheus server instance that is used by default. The following parameters are used to manage the Prometheus configurations:


[prometheus-config]
prometheus_ip=0.0.0.0
prometheus_port=9090
prometheus_db_data_retention_size=120GB
prometheus_db_data_retention_time=15d

Property | Description | Default Value
prometheus_ip | IP of the Prometheus server | 0.0.0.0
prometheus_port | Port of the Prometheus server | 9090
prometheus_db_data_retention_size | Data retention policy by size (used only for the local Prometheus server) | 120GB
prometheus_db_data_retention_time | Data retention policy by time (used only for the local Prometheus server) | 15d

The data storage path of the local Prometheus DB is under /opt/ufm/files/conf/plugins/kpi/prometheus_data.
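Since the KPIs and telemetry metrics end up in the bundled Prometheus instance, they can be read back with the standard Prometheus HTTP query API. A minimal sketch in Python, assuming the default prometheus_ip/prometheus_port and that the KPI series is named connected_endpoints (the actual series name may differ):

# Sketch: query the plugin's local Prometheus server for the latest value of a KPI.
# The metric name "connected_endpoints" is an assumption; adjust to the actual series name.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://127.0.0.1:9090"   # prometheus_ip / prometheus_port defaults

def query_latest(metric: str):
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # Each result carries its label set and [unix_timestamp, value_as_string]
    return [(r["metric"], r["value"][1]) for r in payload["data"]["result"]]

print(query_latest("connected_endpoints"))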

Global Time Interval Configurations

The following settings are used by all clusters globally; each property can be overridden by adding it under a cluster's section.


[time-interval-config]
telemetry_interval=300
ufm_interval=60

Property | Description | Default Value
telemetry_interval | Polling interval for the telemetry metrics data, in seconds | 300
ufm_interval | Polling interval for the UFM APIs, in seconds | 60
disabled_kpis | Polling interval for the connected_endpoints KPI specifically | 60


Logs Configurations

The configurations below manage the kpi_plugin.log file:


[logs-config]
logs_level = INFO
logs_file_name = /log/kpi_plugin.log
log_file_max_size = 10485760
log_file_backup_count = 5


Grafana Dashboard Templates

The KPI plugin provides several Grafana dashboard templates that Grafana users can import; these dashboards display the KPI panels and graphs. The main dashboard is a cluster view that shows the cluster’s KPIs. The dashboard should be connected to the KPI Prometheus server.

The dashboard JSON templates can be found under /opt/ufm/files/conf/plugins/kpi/grafana/.



Logs

The following logs are exposed under /opt/ufm/files/logs/plugins/kpi/ in UFM plugin mode, and under /log inside the container in standalone mode.

  • kpi_plugin.log - The application logs.

  • kpi_plugin_stderr/stdout - The application service logs.

  • prometheus.log - The local Prometheus logs.

Get List of Configured Clusters

  • URL:

    • UFM Plugin: https://<IP>/ufmRest/plugin/kpi/api/cluster/__names__

    • Standalone: http://<IP>:8686/api/cluster/__names__

  • Method: GET

  • Response: A list of cluster name strings (see the request sketch below).
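A minimal request sketch in Python (standalone mode assumed; in UFM plugin mode the call goes through the UFM REST gateway and requires UFM credentials):

# Sketch: fetch the list of configured cluster names from a standalone KPI plugin.
# Replace the host with the machine running the plugin; authentication is only
# needed when going through the UFM gateway (ufmRest) instead of port 8686.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8686/api/cluster/__names__") as resp:
    cluster_names = json.load(resp)

print(cluster_names)   # e.g. ["unknown"]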

Get KPIs Information

  • URL:

    • UFM Plugin: https://<IP>/ufmRest/plugin/kpi/files/kpi_info

    • Standalone: http://<IP>:8686/files/kpi_info

  • Method: GET

  • Response: List of KPI names and HTML descriptions.

Get KPIs Values

  • URL:

    • UFM Plugin: https://<IP>/ufmRest/plugin/kpi/api/cluster/<cluster_name>

    • Standalone: http://<IP>:8686/api/cluster/<cluster_name>

  • Method: GET

Response: A list of KPIs with the related calculated graph values to be used in the UI. Generally, two main components are expected: the table data (key “2d_matrix_data”) and the graph data (key “graph_data”). Each value corresponds to rules expected by the UI (a parsing sketch follows the field list):

  • display name – KPI display name

  • 2d_matrix_data – Latest data in a table-like format. Every list item is a dict that forms a table row, containing the columns of that row.

    • data – List of dict, each dict is a column in the row. May contain:

      • percentage – Value in percentage.

      • value – Raw value.

      • name – Column header

    • direction – For arrows icon (up / down)

    • name – Row header

  • graph_data – dictionary with the following items:

    • multi_line_graph – time series graph for one or more series.

      • title – Graph title.

      • x_label – X axis label.

      • x_values – X axis values.

      • y_labels – Y axis labels.

      • y_values – Y axis values.

    • info – String with information about the graph data.
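To illustrate the shape described above, the following sketch fetches the KPIs of the default 'unknown' cluster in standalone mode and walks the documented keys; it assumes the response is a JSON list of KPI objects with the field names exactly as listed:

# Sketch: fetch and walk the KPI values of the default "unknown" cluster.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8686/api/cluster/unknown") as resp:
    kpis = json.load(resp)

for kpi in kpis:
    print(kpi.get("display name"))
    for row in kpi.get("2d_matrix_data", []):          # table rows
        cells = [c.get("name") for c in row.get("data", [])]
        print("  row:", row.get("name"), "columns:", cells)
    graph = kpi.get("graph_data", {})
    if "multi_line_graph" in graph:                     # time-series graph
        print("  graph:", graph["multi_line_graph"].get("title"))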

Get KPI Plugin Overview

  • URL:

    • UFM Plugin: https://<IP>/ufmRest/plugin/kpi/overview?time_window=<value> (in hours)

    • Standalone: http://<IP>:8686/overview?time_window=<value> (in hours)

  • Method: GET

Response:


{
    "clusters_telemetry_info": [
        {
            "cluster_name": "unknown",
            "interval": 60,
            "url": "http://pjazz:9002/csv/metrics"
        }
    ],
    "prometheus_info": {
        "db_statistics": {
            "appended_samples_rate_per_sec": 11.46,
            "bytes_rate_per_sample": 9.227,
            "total_compressed_blocks_size_bytes": 272329313.0,
            "total_head_chunks_size_bytes": 1937238.0,
            "total_wal_size_bytes": 2989807.0
        },
        "url": "http://pjazz:9191"
    }
}

Where:

clusters_telemetry_info: Contains a list of the configured clusters' information. Each cluster has the following properties:

Property | Description | Default Value
cluster_name | Cluster’s name | One cluster with the name ‘unknown’
interval | Cluster’s telemetry polling time interval | 60
url | The cluster’s telemetry URL | http://0.0.0.0:9002/csv/metrics

prometheus_info: Contains the general Prometheus configurations (e.g. the URL) and statistics about the collected samples (for the local Prometheus mode)

Property | Description
db_statistics.appended_samples_rate_per_sec | Prometheus rate of the total appended samples per second, calculated by the following Prometheus expression: rate(prometheus_tsdb_head_samples_appended_total[{rate_window_time}h])
db_statistics.bytes_rate_per_sample | Prometheus rate of ingested bytes per sample, calculated by the following Prometheus expression: rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[{rate_window_time}h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[{rate_window_time}h])
db_statistics.total_compressed_blocks_size_bytes | Total compressed blocks size in bytes stored by Prometheus, calculated by the following Prometheus expression: prometheus_tsdb_storage_blocks_bytes
db_statistics.total_head_chunks_size_bytes | Total HEAD chunks size in bytes stored by Prometheus, calculated by the following Prometheus expression: prometheus_tsdb_head_chunks_storage_size_bytes
db_statistics.total_wal_size_bytes | Total WAL size in bytes, calculated by the following Prometheus expression: prometheus_tsdb_wal_storage_size_bytes
prometheus_info.url | The Prometheus server URL


© Copyright 2024, NVIDIA. Last updated on Aug 27, 2024.