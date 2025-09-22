Name : connected_endpoints

Description: This KPI shows the percentage of NIC connectivity: of all known NICs, how many are connected. The desired value is 100%, meaning all NICs are connected. The default threshold for "bad" values is ≤ 95%. Note: The complete list of NICs includes every NIC detected at least once since the plugin started.
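The percentage and the default threshold described above can be sketched as follows. This is a minimal illustration, not the plugin's code: the function names, the set-based inputs, and the choice to report 100% when no NIC has been seen yet are all assumptions.

```python
BAD_THRESHOLD_PCT = 95.0  # default "bad" threshold taken from the description


def connected_endpoints_pct(seen_nics, connected_nics):
    """Percentage of known NICs that are currently connected.

    seen_nics: every NIC detected at least once since the plugin started.
    connected_nics: NICs that currently report a connection.
    """
    seen = set(seen_nics)
    if not seen:
        return 100.0  # assumption: with no known NICs, nothing is missing
    connected = set(connected_nics) & seen
    return 100.0 * len(connected) / len(seen)


def is_bad(pct, threshold=BAD_THRESHOLD_PCT):
    """A value at or below the threshold counts as "bad"."""
    return pct <= threshold
```

For example, three connected NICs out of four known NICs give 75%, which falls below the default threshold.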

Topology Correctness

Name : topology_correctness

Description: This KPI reports the number of wrongly connected links. The ability to spread traffic over multiple minimal-distance paths is critical to utilizing the bandwidth provided by the network. Therefore, a pass/fail criterion for the topology not being broken by misconnections is key for the network to provide reasonable bandwidth to running applications. The threshold for failure is > 0.

Name : stability

Description: For each UID (Unique Identifier), the KPI checks for changes in the link down counter. A change may indicate a link recovery, suggesting there was a period with no link. The statistics are tracked per UID, and the ideal value for link down errors is 0, indicating no problematic links.
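The per-UID check described above amounts to comparing two snapshots of the link down counter. The sketch below assumes the counters arrive as plain dictionaries keyed by UID, and it treats the first sighting of a UID as a baseline rather than a change; both are illustrative assumptions, not the plugin's behavior.

```python
def link_down_changes(previous, current):
    """Return {uid: delta} for every UID whose link down counter changed.

    previous, current: hypothetical {uid: link_down_counter} snapshots.
    A UID seen for the first time is used as a baseline and not reported.
    """
    changes = {}
    for uid, value in current.items():
        delta = value - previous.get(uid, value)
        if delta != 0:
            changes[uid] = delta
    return changes
```

An empty result corresponds to the ideal case of no problematic links between the two samples.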

Name : operating_conditions

Description: This KPI displays the total number of elements (ports and devices) experiencing operating condition violations. Each hardware element has a predefined normal operating temperature range, with a common default threshold set at 70°C (158°F) or higher.

Name : bw_loss_by_congestion

Description: An InfiniBand (IB) network is lossless and can therefore experience congestion spreading. Features like congestion control aim to minimize this issue. This KPI reports the percentage of bandwidth lost to congestion for each layer and direction, as measured by the port-xmit-wait counter. The reported percentage is calculated by comparing the xmit-wait equivalent time with the time between samples and averaging the result across all links in the specified layer-direction group.
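One way to read the calculation above is as the mean, over all links in a group, of the ratio between the xmit-wait equivalent time and the sampling interval. The sketch below follows that reading. It assumes the counter has already been converted to seconds (the conversion from counter ticks is not covered here), and the function name and input layout are hypothetical.

```python
from statistics import mean


def bw_loss_pct(links):
    """Average bandwidth loss, in percent, for one layer-direction group.

    links: iterable of (xmit_wait_time_s, sample_interval_s) pairs, one per
    link, where xmit_wait_time_s is the xmit-wait equivalent time in seconds.
    """
    ratios = [wait / interval for wait, interval in links if interval > 0]
    if not ratios:
        return 0.0  # assumption: no usable samples means no measured loss
    return 100.0 * mean(ratios)
```

With two links that waited 0.1 s and 0.3 s during a 1 s interval, the group reports roughly 20% loss.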

Name : tree_over_subscription

Description: Fat Trees were introduced by Leiserson to enable building a tree structure from switches of fixed radix K and capacity T. Consider a regular K-1 tree (each switch has one parent and K-1 children), which has a single root switch and, after H-1 levels, (K-1)^(H-1) leaf switches connecting (K-1)^H hosts, each with bandwidth B. Consider the worst-case traffic pattern, where each leaf switch sends traffic to other leaf switches and that traffic must cross the top. The connection from each leaf switch up should carry B(K-1) traffic, and the bi-directional capacity of that switch is B(K-1) too. One level up, the required switch capacity and up-link capacity is B(K-1)^2, and so on. But that requires exponentially growing capacity from links and switches.
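The growth described above is easy to tabulate. The helper below is only an illustration of the B(K-1)^level expression from the text; the function name and the level numbering (level 1 is a leaf switch) are assumptions made for this sketch.

```python
def required_uplink_capacity(b, k, level):
    """Up-link capacity needed by a switch `level` levels above the hosts.

    b: host bandwidth B.
    k: switch radix K (each switch has one parent and K-1 children).
    level: 1 for a leaf switch, 2 for its parent, and so on.
    """
    return b * (k - 1) ** level


# With B = 1 and K = 5 the requirement grows as 4, 16, 64.
for level in range(1, 4):
    print(level, required_uplink_capacity(b=1, k=5, level=level))
```

Each extra level multiplies the requirement by K-1, which is why a single root tree cannot be built from fixed-capacity switches.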

For example, the diagram below shows a three-level Fat Tree:

The Single Root Tree:

The solution to that problem was to split each single switch in the original Single Root Tree (SRT) into several switches, producing a Multi Rooted Tree, which is a Fat Tree. Each level has exactly the same number of switches, connected both up and down, except for the top level, which carries half that number of switches, connected only down.

The Fat Tree formulations define which switches connect to which other switches, but they do not break the basic concept that a Fat Tree can be collapsed back into a simple tree by merging switches into much larger ones. A pair of leaf switches can still be classified by their distance - or the level of their common parents - just like in the original SRT.

When we want to evaluate the damage to the perfect tree structure due to link faults, it is very hard to get an exact maximum flow bandwidth without a lengthy algorithm of N^2 complexity. However, if we examine the SRT obtained by collapsing the Fat Tree back into a tree, we can get an upper bound on the available bandwidth between different branches of the SRT.

This metric evaluates the topology of Fat Tree clusters and extracts their original tree structure. Then it evaluates the over-subscription of each sub-tree. This way, the impact of the exact set of missing links can be evaluated in terms of the bandwidth taken out between sub-trees of the topology.

Name : tor_to_tor_bw

Description: This KPI provides a simple metric for the impact of link failures: the available bandwidth between pairs of TORs. This is a meaningful yet traffic-independent metric, so no prior knowledge of the exact set of instantaneous traffic patterns is required. It is most important to look at the TOR up-ports, since on Fat Trees the number of possible paths grows exponentially when going up the topology towards the roots, so the impact of link faults decays exponentially with their level towards the roots.

It would have been nice if we could just count the number of missing uplinks on each of the TORs in the pair and claim that, if TOR X lost x uplinks and TOR Y lost y uplinks, the lost bandwidth is linkBW*min(x,y). But in reality the worst-case lost bandwidth is linkBW*(x+y). This is an artifact of the specific links lost and the network structure.

The code we provide performs a Fat Tree specific computation of the exact available bandwidth for each TOR pair. The algorithm assigns unique identifiers to sub-trees and considers the sub-tree that each uplink of the TORs connects to. Then it sums, over the sub-trees, the minimum number of links connected from each of the two TORs to that sub-tree. A more straightforward approach using networkx or a direct implementation of Dinic's max-flow algorithm yielded a much higher runtime, rendering this metric impractical to use.

We use an example to demonstrate these concepts:

Consider the Fat Tree GFT(3;4,4,4;1,4,4) depicted in the diagram below:

We re-arrange the top row of switches (cores) such that the full-bipartites between switch levels 2 and 3 are close together:

The re-ordered picture highlights that if we imagine the red links as missing links, then:

The missing links of TORs 200 and 210 connect to the same bipartite, and thus they lose only 1 of the 4 paths between them. However, the missing links of TORs 200 and 220 connect to 2 different L2-L3 bipartites, and thus 2 of the 4 possible paths are lost.

The resulting TOR to TOR max-flow matrix, giving the maximal bandwidth between each pair in units of single link bandwidth, is provided below:
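The two cases above can be reproduced with the "sum of minimums" rule from the KPI description. In this sketch the sub-tree identifiers 0 to 3 are hypothetical labels for the four L2-L3 bipartites, and which bipartite each red link belongs to is assumed for illustration; only the rule itself comes from the text.

```python
from collections import Counter


def tor_pair_bw(links_a, links_b):
    """Available bandwidth between two TORs, in units of one link.

    links_a, links_b: {subtree_id: number of uplinks into that sub-tree}.
    For every sub-tree, only the smaller of the two link counts is usable.
    """
    return sum(min(count, links_b.get(subtree, 0))
               for subtree, count in links_a.items())


# A healthy TOR has one uplink into each of the four bipartites.
full = Counter({0: 1, 1: 1, 2: 1, 3: 1})
tor_200 = full - Counter({0: 1})  # lost its uplink into bipartite 0
tor_210 = full - Counter({0: 1})  # lost its uplink into the same bipartite
tor_220 = full - Counter({1: 1})  # lost its uplink into a different bipartite

print(tor_pair_bw(tor_200, tor_210))  # 3: only one of the four paths is lost
print(tor_pair_bw(tor_200, tor_220))  # 2: two of the four paths are lost
```

Both pairs lost one uplink per TOR, yet the available bandwidth differs, which is why counting missing uplinks alone is not enough.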

The Algorithm

The following steps are conducted in order to provide that table:

1. Run a DFS from top to bottom, collecting the list of parents for each switch. We require that all TORs be reached, so we try all top-level switches until we find one that is connected to all TORs. After we are done, we can ask, for every pair of TORs, what the level of their lowest common parent is. This is required since routing will only use shortest paths, and thus go up to that level only.
2. Establish, for each TOR and upwards, the unique bottom-up sub-tree it is part of. The same sub-tree number should be used for all parents of each switch that are on the same level. Since the topology may be missing some links, it is not simple to check whether such a number was previously allocated, so first all parents are scanned, and then a consolidation step decides which levels are missing a number. If any needs to be set, then another step applies the consolidated value.
3. For each TOR switch, sum the total number of links that connect to switches with the same bottom-up tree (recognized by their assigned number). We do that for every parent level.
4. Now that we have that data for every TOR, compute the TOR to TOR max-flow by first looking up the level of their common parent and then computing the minimal number of flows they connect to each bottom-up tree they connect to.
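The lookup of the lowest common parent level can be sketched as below. Note that this illustration walks upward from each TOR instead of running the top-down DFS described above, which is a simplification chosen to keep the example short. The `parents` and `level_of` dictionaries and both function names are hypothetical, with levels growing towards the roots.

```python
def ancestors_by_level(switch, parents, level_of):
    """Map each level to the set of ancestors of `switch` at that level."""
    found = {}
    seen = {switch}
    stack = [switch]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                found.setdefault(level_of[parent], set()).add(parent)
                stack.append(parent)
    return found


def lowest_common_level(tor_a, tor_b, parents, level_of):
    """Lowest level at which the two TORs share a parent, or None."""
    up_a = ancestors_by_level(tor_a, parents, level_of)
    up_b = ancestors_by_level(tor_b, parents, level_of)
    for level in sorted(up_a.keys() & up_b.keys()):
        if up_a[level] & up_b[level]:
            return level
    return None


# Tiny illustrative topology: t1 and t2 share level-2 switches, t3 does not.
parents = {"t1": ["a1", "a2"], "t2": ["a1", "a2"], "t3": ["a3"],
           "a1": ["c1"], "a2": ["c1"], "a3": ["c1"]}
level_of = {"a1": 2, "a2": 2, "a3": 2, "c1": 3}
```

Here `lowest_common_level("t1", "t2", parents, level_of)` is 2, while the pair `t1`, `t3` only meets at level 3, so its shortest paths must climb one level higher.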

Name : queue_depth_average, queue_depth_variance

Description : This is a post-processing of the Egress Queue Depth Histogram feature. The egress queue depth is the length of the queue in which packet pointers must wait before they are transmitted via the port. A “healthy” queue depth is roughly constant and short: longer queues imply higher xmit-wait and latency, and a jittering queue depth means the congestion on the port changes rapidly and unpredictably. Following that definition of a healthy queue depth, we would like to know the average and the variance of the queue depth. The reported values are calculated per port.

Required telemetry configuration: plugin_env_CLX_EXPORT_API_SKIP_PERFORMANCE_HISTOGRAM_BUFFER_DATA=0 in /opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini
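The average and variance described above can be derived from a histogram as a weighted mean and a weighted (population) variance. The sketch below assumes each histogram bin can be represented by a single depth value, such as the bin midpoint; the actual bin layout of the Egress Queue Depth Histogram is not specified here, and the function name is hypothetical.

```python
def queue_depth_stats(bin_depths, bin_counts):
    """Return (average, variance) of the queue depth for one port.

    bin_depths: representative queue depth of each histogram bin (assumed).
    bin_counts: number of samples that fell into each bin.
    """
    total = sum(bin_counts)
    if total == 0:
        return None, None  # no samples, so nothing to report
    pairs = list(zip(bin_depths, bin_counts))
    average = sum(depth * count for depth, count in pairs) / total
    variance = sum(count * (depth - average) ** 2
                   for depth, count in pairs) / total
    return average, variance
```

A port whose samples all fall into one bin has zero variance, which matches the "roughly constant" behavior the description calls healthy.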

Name : lost_bandwidth_percentage

Description : This is a post-processing of the PortRcvData Histogram feature. The PortRcvData Histogram samples the amount of data at the ingress at short timescales, where the expected behavior is either 100% BW usage or 0% BW usage. Anything in between might be a sign of an “unhealthy” port, as described in the NCCL pickle research. The reported values are calculated per port.

Required telemetry configuration: plugin_env_CLX_EXPORT_API_SKIP_PERFORMANCE_HISTOGRAM_PORTS_DATA=0 in /opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini