NVIDIA UFM Enterprise User Manual v6.20.1

IB Link Resiliency Plugin

The primary objective of the IB Link Resiliency (IBLR) plugin is to enhance cluster availability and improve the rate of job completion.

This objective is accomplished by combining different mechanisms, both ML-based and rule-based, for identifying problematic links. Then, the plugin autonomously applies corrective measures to these links with the aim of restoring their normal function.

For cluster topologies where no redundancy exists at the level of access links, the plugin will only execute a mitigation procedure for trunk links.

The IBLR plugin execution cycle comprises the following tasks (a schematic code sketch follows the list):

  1. Collects telemetry data from the UFM secondary telemetry service, which samples the counters every 5 minutes by default.

  2. Employs ML-based prediction models and rule-based detection logic to alert on problematic ports.

  3. Based on the alerts issued by the different alert engines, determines which measures should be taken:

    1. Which links are underperforming and should be isolated from the fabric.

    2. Which isolated links have recovered following the mitigation process, and hence should be reinstated to the fabric.

  4. Applies the required actions through the UFM and reports to the UFM events table.
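The cycle can be summarized by the minimal sketch below. It is illustrative only; the objects and method names (telemetry, alert_engines, decision, ufm) are hypothetical placeholders rather than the plugin's actual interfaces.

```python
def run_iblr_cycle(telemetry, alert_engines, decision, ufm):
    """One IBLR execution cycle: collect, alert, decide, act, report."""
    # 1. Collect the latest counters from the secondary telemetry service.
    counters = telemetry.fetch_latest()

    # 2. Run the ML-based and rule-based engines to flag problematic ports.
    alerts = [alert for engine in alert_engines for alert in engine.evaluate(counters)]

    # 3. Decide which links to isolate and which recovered links to reinstate.
    to_isolate, to_deisolate = decision.evaluate(alerts)

    # 4. Apply the actions through the UFM and report to the UFM events table.
    for link in to_isolate:
        ufm.isolate(link)
        ufm.report_event(f"isolated {link}")
    for link in to_deisolate:
        ufm.deisolate(link)
        ufm.report_event(f"de-isolated {link}")

# Driver loop (driver objects omitted); secondary telemetry refreshes every 5 minutes by default:
#   while True:
#       run_iblr_cycle(telemetry, alert_engines, decision, ufm)
#       time.sleep(5 * 60)
```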

Schematic Flow: External View

[Figure: IBLR schematic flow, external view]

Schematic Flow: Internal View

[Figure: IBLR execution cycle, internal view]

The plugin collects its telemetry data from the secondary telemetry endpoint, a low-frequency UFM Telemetry service that collects a large set of counters. Secondary telemetry is enabled by default; if it is deactivated by the user, the plugin will not be able to operate.

In addition to telemetry data, the plugin periodically reads the cluster topology information from the UFM in order to align its internal logic with any changes that have taken place in the cluster.
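For reference, the sketch below shows how the same two inputs, secondary telemetry counters and UFM topology, could be pulled by an external script. The telemetry port, URL paths, and credentials are assumptions for illustration only; consult the UFM REST API and UFM Telemetry documentation for the exact endpoints in your installation.

```python
# Hedged sketch only: the port, paths, and credentials below are assumptions,
# not guaranteed endpoints; verify them against your UFM documentation.
import csv
import io

import requests

UFM_HOST = "ufm.example.com"   # hypothetical UFM server address
AUTH = ("admin", "123456")     # hypothetical credentials

def fetch_secondary_telemetry():
    # Assumed low-frequency (secondary) telemetry CSV endpoint, sampled every 5 minutes by default.
    resp = requests.get(f"http://{UFM_HOST}:9002/csv/xcset/low_freq_debug", timeout=30)
    resp.raise_for_status()
    return list(csv.DictReader(io.StringIO(resp.text)))

def fetch_topology():
    # UFM REST resources used to stay aligned with the current cluster topology.
    base = f"https://{UFM_HOST}/ufmRest"
    ports = requests.get(f"{base}/resources/ports", auth=AUTH, verify=False, timeout=30).json()
    links = requests.get(f"{base}/resources/links", auth=AUTH, verify=False, timeout=30).json()
    return ports, links
```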

Main Collected Counters

| Name | Description |
|------|-------------|
| hist0-hist15 | FEC histogram counters. Counter hist{i} is incremented by one every time a FEC block arrives with i bit errors. |
| phy_effective_errors | Number of FEC blocks that could not be corrected (8 or more errors for the FEC8 algorithm). |
| phy_symbol_errors | Number of symbols dropped due to errors. |
| PortRcvErrorsExtended | Number of data blocks dropped due to errors. |
| PortRcvDataExtended | Number of received data blocks. |
| CableInfo.Temperature | Module temperature. |
| CableInfo.diag_supply_voltage | Module voltage. |
| link_down_events | Number of times the link was down. |


The alert generation block is composed of four independent engines, each employing either ML-based or rule-based logic in order to identify problematic links:

  1. Failure prediction

  2. Failure detection

  3. Link flaps detection

  4. Operating conditions violation

For each alert engine, each port can be configured to operate in one of two modes:

  • Shadow mode: the plugin will not act on this link based on alerts from this engine.

  • Active mode: the plugin is allowed to automatically execute mitigation steps on this link if an alert is raised by this engine.

Alert Engine Operation Mode Configurations

| Name | Type | Values | Default Value | Description |
|------|------|--------|---------------|-------------|
| mode | str | shadow, active | shadow | Assigns the default operation mode for the respective alert engine. |
| except_list | List[str] | List of port identifiers (node GUID and port number) | empty | Ports included in the exception list operate in the mode opposite to the one indicated in the mode field. This makes it possible to define a subset of ports that operates differently from the remainder of the cluster. |
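The interplay between mode and except_list can be illustrated with a minimal sketch; the function and identifiers below are placeholders, not the plugin's actual code.

```python
def effective_mode(default_mode: str, except_list: set, port_id: str) -> str:
    """default_mode is 'shadow' or 'active'; port_id is '<node_guid>_<port_number>'."""
    # Ports on the except_list run in the opposite mode from the engine's default.
    other = "active" if default_mode == "shadow" else "shadow"
    return other if port_id in except_list else default_mode

# Example: the engine runs in shadow mode cluster-wide, except one port that is active.
assert effective_mode("shadow", {"0xabcd_1"}, "0xabcd_1") == "active"
assert effective_mode("shadow", {"0xabcd_1"}, "0xbeef_7") == "shadow"
```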


Failure Prediction

The port failure prediction logic is based on a binary classification ML model, trained to predict ahead of time cases where physical layer issues will eventually result in data loss.

The model relies mainly on the error histogram counters of the Forward Error Correction (FEC) block.

The model was validated to provide valuable predictions on both HDR and NDR generation clusters.
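As an illustration of this design, the sketch below builds a feature vector from per-interval increments of the FEC histogram counters and compares the classifier's output probability to the model_threshold configuration described further down. The feature construction and the model object are hypothetical; the actual model and its features are internal to the plugin.

```python
import numpy as np

def predict_failure(prev: dict, curr: dict, model, model_threshold: float = 0.8) -> bool:
    """Illustrative only: alert when the classifier's failure probability exceeds the threshold."""
    # Per-interval increments of the FEC histogram counters hist0..hist15.
    deltas = np.array([curr[f"hist{i}"] - prev[f"hist{i}"] for i in range(16)], dtype=float)
    total = deltas.sum()
    features = deltas / total if total > 0 else deltas            # normalized histogram shape
    proba = model.predict_proba(features.reshape(1, -1))[0, 1]    # scikit-learn style API
    return proba >= model_threshold
```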

[Figure: Failure prediction model]

Failure Prediction Configurations

| Name | Type | Values | Default Value | Description |
|------|------|--------|---------------|-------------|
| model_threshold | float | (0, 1) | 0.8 | The decision threshold of the binary classification model. If the model's output probability exceeds this threshold, an alert is triggered. The recommended value range is 0.65-0.8, with lower values favoring high recall and higher values favoring high precision. |

Failure Detection

The failure detection module comprises two distinct rules, aimed at detecting, based on post-FEC errors, when a link is not meeting the desired performance specifications:

  1. Packet drop rate: calculates the rate of dropped packets (local and remote errors) vs. successfully received packets.

  2. Symbol errors: calculates the rate of symbol errors during two different time intervals (12 minutes and 125 minutes). If the threshold is exceeded for either interval, the rule triggers the failure detection alert.

    This rule is only supported in NDR clusters, and is ignored in HDR clusters.

In case either of the rules exceeds its respective threshold, a failure detection alert will be raised.
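The two rules can be summarized by the hedged sketch below. The exact counter combinations and rate normalization used internally are assumptions here; the default thresholds match the configuration table that follows.

```python
def packet_drop_rate_alert(dropped_blocks: int, received_blocks: int,
                           threshold: float = 1e-12) -> bool:
    """Rate of dropped blocks (local and remote errors) vs. successfully received blocks."""
    if received_blocks == 0:
        return False
    return dropped_blocks / received_blocks > threshold

def symbol_error_alert(short_interval_rate: float, long_interval_rate: float,
                       short_threshold: float = 2.88, long_threshold: float = 3.0) -> bool:
    """Symbol error rates over the 12-minute and 125-minute windows (NDR clusters only);
    alert if either window exceeds its threshold."""
    return short_interval_rate > short_threshold or long_interval_rate > long_threshold
```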

Failure Detection Configurations

| Name | Type | Default Value | Description |
|------|------|---------------|-------------|
| detect_pdr_rule_enabled | bool | True | Flag indicating whether the packet drop rate rule is enabled. If disabled, alerts will not be generated based on this rule. |
| detect_symbol_error_rule_enabled | bool | True | Flag indicating whether the symbol errors rule is enabled. If disabled, alerts will not be generated based on this rule. |
| detect_pdr_error_rate_threshold | float | 1e-12 | The threshold for the packet drop rate rule. |
| detect_symbol_error_short_interval_threshold | float | 2.88 | The threshold for the symbol errors rule, 12-minute interval. |
| detect_symbol_error_long_interval_threshold | float | 3 | The threshold for the symbol errors rule, 125-minute interval. |

Link Flaps Detection

A heuristic is used to detect link flap events. This rule defines a link flap as a scenario where a link down occurred on both ends of the link more times than defined by the link flap threshold.

By setting the right threshold, this rule is meant to capture unintentional link flaps and filter out links that were intentionally reset or suffered from a rare link down event.
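In code form, the heuristic reduces to a simple check on both ends of the link; the names below are illustrative placeholders, and the default matches the configuration table that follows.

```python
def is_link_flapping(down_events_side_a: int, down_events_side_b: int,
                     threshold: int = 15) -> bool:
    # A link flap requires the link-down count to reach the threshold on both ends of the link.
    return down_events_side_a >= threshold and down_events_side_b >= threshold
```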

Link Flaps Detection Configurations

| Name | Type | Default Value | Description |
|------|------|---------------|-------------|
| link_down_threshold_flapping_trunk_link | int | 15 | The minimal number of link down events considered a link flap, for switch-switch links. |
| link_down_threshold_flapping_access_link | int | 15 | The minimal number of link down events considered a link flap, for host-switch links. |

Operating Conditions Violation

This module validates that the ports are within their nominal operating conditions. Configurable lower and upper thresholds indicate whether the voltage and temperature of the port fall outside the desired range.
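A minimal sketch of the check, assuming the default thresholds listed in the table below:

```python
def operating_conditions_alert(temperature_c: float, voltage_mv: float,
                               temp_range=(-10, 70), volt_range=(3100, 3500)) -> bool:
    """Illustrative only: alert when temperature (°C) or module voltage (mV) leaves its range."""
    temp_ok = temp_range[0] <= temperature_c <= temp_range[1]
    volt_ok = volt_range[0] <= voltage_mv <= volt_range[1]
    return not (temp_ok and volt_ok)
```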

Operating Conditions Violation Configurations

| Name | Type | Default Value | Description |
|------|------|---------------|-------------|
| port_temperature_upper_threshold | int | 70 | The temperature upper threshold, in °C. |
| port_temperature_lower_threshold | int | -10 | The temperature lower threshold, in °C. |
| port_voltage_upper_threshold | int | 3500 | The voltage upper threshold, in mV. |
| port_voltage_lower_threshold | int | 3100 | The voltage lower threshold, in mV. |

The action execution block processes inputs from the alert generation module, the UFM and the internal port state table in order to determine which links should undergo isolation or de-isolation.

Then, it executes the actions via a UFM API and finally reports both the actions that were taken and those that were not, via messages to the UFM events table.

Action Execution Flow

The action execution flow comprises six main modules (shown on the right side of the image below), which can be aggregated into three main steps:

  1. Update of the internal port state DB, based on data collected from the alert generation module and from the UFM.

  2. Based on the ports' state as captured by the port state DB, run two decision processes to determine which links should be isolated and which should be de-isolated.

  3. Apply the necessary actions through the UFM.

Once a link is selected for action, it enters a mitigation cycle as outlined below.

Isolation and de-isolation of links is done via updating the UFM Unhealthy Ports file. This file is submitted by the UFM to the SM, triggering the update of the routing tables throughout the fabric.

Notably, even while the link is isolated and no application traffic flows through it, the physical layer remains active; hence, the plugin can keep monitoring the state of the link using the same aforementioned alert engines.

This enables the decision process to determine whether the link has successfully recovered following the mitigation procedure and, if so, to reinstate it into the network.
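A hedged sketch of that decision is shown below, using parameter names from the de-isolation checkpoint configurations described later; the actual bookkeeping inside the plugin may differ.

```python
from datetime import datetime, timedelta

def next_step_for_isolated_link(isolated_at: datetime,
                                last_alert_at: datetime,
                                now: datetime,
                                min_health_time_before_deisolation_min: int,
                                max_recovery_time_threshold_hours: int) -> str:
    """Illustrative only: decide what to do with a link the plugin has isolated."""
    alert_free_for = now - last_alert_at
    isolated_for = now - isolated_at
    if alert_free_for >= timedelta(minutes=min_health_time_before_deisolation_min):
        return "de-isolate"            # link recovered; reinstate it into the fabric
    if isolated_for >= timedelta(hours=max_recovery_time_threshold_hours):
        return "mark-unrecoverable"    # operator inspection required
    return "keep-isolated"             # keep monitoring with the same alert engines
```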

[Figure: Action execution flow]

Link State Transition Diagram

The diagram below depicts the link state machine managed by the IBLR plugin.

Application traffic will flow through the link only if it is in Healthy state. In all other states, the link is isolated and traffic is redirected to other parallel links.

The black arrows in the diagram indicate state transitions triggered by the plugin, while the red dotted arrows indicate state transitions triggered outside the plugin (e.g. by the user).

While the main flow of isolating a problematic link, waiting for it to recover and reinstating it is fully operated by the plugin, there are two additional flows that require user intervention:

  1. If a link that was isolated by the plugin has not recovered within a pre-determined period, it will be moved into the unrecoverable state.

    This indicates that the network operator should inspect the link, attempt to recover it, and, if successful, de-isolate the link to signal that it was maintained.

  2. If a link was isolated outside the scope of the plugin, the plugin will not de-isolate the link automatically, as it has insufficient knowledge with respect to the isolation reason.

    In that case, de-isolation should also be handled by the user or external system that isolated the link.
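Putting these transitions together, the sketch below models them as a simple lookup table. The state and event names are paraphrased from the diagram and are not the plugin's internal identifiers.

```python
def next_state(state: str, event: str) -> str:
    """Illustrative link state machine; unknown (state, event) pairs leave the state unchanged."""
    transitions = {
        ("healthy", "alert_raised"): "isolated",                      # plugin isolates the link
        ("isolated", "recovered"): "healthy",                         # plugin reinstates the link
        ("isolated", "recovery_timeout"): "unrecoverable",            # recovery window exceeded
        ("unrecoverable", "user_deisolation"): "healthy",             # operator maintained the link
        ("healthy", "external_isolation"): "isolated_externally",     # isolated outside the plugin
        ("isolated_externally", "external_deisolation"): "healthy",   # user/external system only
    }
    return transitions.get((state, event), state)
```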

[Figure: Link state transition diagram]

Action Execution Checkpoints

The action execution decision process includes a set of configurable checkpoints that give the user the ability to determine under what conditions actions will be taken.

These configurable checkpoints can be broadly divided into the following groups (a sketch of the isolation checks follows the list):

  1. Isolation and de-isolation rate limit constraints.

  2. Topology-dependent constraints that prevent isolations in case this will lead to insufficient redundancy in the fabric.

  3. Time limits for isolated links to qualify for de-isolation or to be declared unrecoverable.
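As referenced above, the sketch below illustrates how the isolation checkpoints could be combined. Parameter names mirror the configuration tables that follow; the actual checkpoint order and bookkeeping inside the plugin may differ.

```python
def isolation_allowed(isolations_last_hour: int, max_per_hour: int,
                      healthy_links_between_pair: int, min_links_per_switch_pair: int,
                      active_ports_on_switch: int, min_active_ports_per_switch: int) -> bool:
    """Illustrative only: every checkpoint must pass before a link may be isolated."""
    if isolations_last_hour >= max_per_hour:
        return False   # rate limit reached for this time window
    if healthy_links_between_pair - 1 < min_links_per_switch_pair:
        return False   # would leave too little redundancy between the two switches
    if active_ports_on_switch - 1 < min_active_ports_per_switch:
        return False   # would leave the switch with too few active ports
    return True
```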

The tables below describe the main configuration parameters used by the decision process checkpoints.

Isolation Checkpoints Configurations

| Name | Type | Description |
|------|------|-------------|
| max_per_hour/day/week/month | int | The maximum number of distinct links allowed to be isolated within the given time window. |
| min_links_per_switch_pair | int | The minimum number of healthy links between two switches; once reached, no further isolations are allowed. |
| min_active_ports_per_switch | int | The minimum number of active ports per switch; once reached, no further isolations are allowed. |


De-isolation Checkpoints Configurations

| Name | Type | Description |
|------|------|-------------|
| max_per_hour | int | The maximum number of distinct links allowed to be de-isolated within an hour. |
| min_health_time_before_deisolation_min | int | The minimum period, in minutes, that a link must be clear of alerts in order to qualify for de-isolation. |
| max_recovery_time_threshold_hours | int | The maximum period, in hours, that a link is granted for recovery and de-isolation before it is declared unrecoverable. |

Reporting

Customers may be looking to extend or replace the plugin's mitigation procedure with their private business logic.

For that purpose, the action execution module reports both the actions it has taken and the actions it has not taken because the decision process failed at one of the checkpoints.

Reporting is done by sending event-specific messages to the UFM events table.
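For example, an external system could consume these reports by polling the UFM events table and matching the message formats listed below. The REST path and field names in this sketch are assumptions; verify them against the UFM REST API documentation.

```python
import requests

UFM_HOST = "ufm.example.com"   # hypothetical UFM server address
AUTH = ("admin", "123456")     # hypothetical credentials

def fetch_iblr_events():
    # Assumed events endpoint and response fields; adjust to your UFM installation.
    resp = requests.get(f"https://{UFM_HOST}/ufmRest/app/events",
                        auth=AUTH, verify=False, timeout=30)
    resp.raise_for_status()
    events = resp.json()
    # Keep only messages produced by the action execution module (see the tables below).
    keywords = ("isolation of the link", "deisolation of the link",
                "Isolation will not be performed", "Deisolation will not be performed")
    return [e for e in events if any(k in e.get("description", "") for k in keywords)]
```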

Event Messages: Indication of Action

| Event | Message Format | Comments |
|-------|----------------|----------|
| Isolation | Completed isolation of the link between ports <src_guid>_<src_num> <-> <dest_guid>_<dest_num>, isolation reasons: <alert_reasons> | |
| De-isolation | Completed deisolation of the link between ports <src_guid>_<src_num> <-> <dest_guid>_<dest_num> after no alerts were triggered for <isolation_reasons> | Isolation reasons are the alerts that triggered the initial isolation. |
| Declaring unrecoverable port | Port <src_guid>_<src_num> could not be deisolated for more than <threshold value> hours due to repeating <alert_reason> alerts, hence it is marked as unrecoverable | The threshold is configurable; the plugin will not deisolate unrecoverable ports. |


Event Messages: Indication of Action Not Taken

| Event | Message Format | Comments |
|-------|----------------|----------|
| Total alerts count exceeding alert threshold per iteration | Isolation will not be performed because the number of ports alerted for isolation (<actual count>) is greater than the threshold for the maximum number of isolation alerts (<threshold value>) | The threshold is configurable. |
| Access link alert | Isolation will not be performed for port <guid>_<num> because this is an access link, triggered alerts: <alert_reasons> | |
| Shadow mode preventing isolation | Isolation will not be performed for port <guid>_<num> because the port is in shadow mode, triggered alerts: <alert_reasons> | |
| Shadow mode preventing de-isolation | Deisolation will not be performed for port <guid>_<num> because the port is in shadow mode, triggered alerts: <alert_reasons> | Only if the mode changed from active to shadow while the port was isolated. |
| Exceeding isolation count per time | Isolation will not be performed for port <guid>_<num> because the number of isolated ports in the last <time frame> (<actual count>) exceeds the threshold for max isolations allowed (<threshold value>), triggered alerts: <alert_reasons> | The time frame is one of hour, day, week, or month; each has a configurable threshold. |
| Exceeding de-isolation count per time | Deisolation will not be performed for port <guid>_<num> because the number of de-isolated ports in the last <time frame> (<actual count>) exceeds the threshold for max de-isolations allowed (<threshold value>) | The time frame is an hour; the threshold is configurable. |
| Insufficient redundancy in the connectivity between the switches | Isolation will not be performed for port <guid>_<num> because the number of links between switches <src_guid> and <dest_guid> (<actual count>) will decrease below the threshold for minimum number of healthy links (<threshold value>), triggered alerts: <alert_reasons> | The threshold is configurable. |
| Insufficient number of active ports in the switch | Isolation will not be performed for port <guid>_<num> because the number of active ports for switch <switch_guid> (<actual count>) will decrease below the threshold for minimum number of active ports (<threshold value>), triggered alerts: <alert_reasons> | The threshold is configurable and validated for both switches that the link connects. |

The IB Link Resiliency plugin can be deployed using the following methods:

  1. On the UFM Appliance

  2. On the UFM Software

To deploy the plugin, follow these steps:

  1. Download the ufm-plugin-ib-link-resiliency image from Docker Hub.

  2. Load the downloaded image onto the UFM server. This can be done either through the UFM GUI, by navigating to the Settings -> Plugins Management tab, or by loading the image via the following instructions:

    1. Log in to the UFM server terminal.

    2. Run:


      docker load -i <path_to_image>

  3. After successfully loading the plugin image, the plugin should become visible in the Plugins Management table within the UFM GUI. To initiate the plugin's execution, simply right-click on the respective line in the table.

[Screenshot: Plugins Management table]

Note

The supported InfiniBand hardware technologies are HDR and NDR.

After the successful deployment of the plugin, a new item is shown in the UFM side menu for the IB Link Resiliency plugin: 

[Screenshot: IB Link Resiliency item in the UFM side menu]

Current State

This page displays the Current State table presenting the cluster status, outlining the following counts:

  1. Number of ports.

  2. Number of isolated ports.

  3. Number of ports operating in active/shadow mode, for each of the alert engines.

  4. Number of predicted failures, reflecting ports which were recently flagged by the ML model as having a high probability of failure.

[Screenshot: Current State table]

The Port Level Status table is displayed below and shows the status of the cluster ports, as stored in the port state DB.

[Screenshot: Port Level Status table]

The user can filter the Port Level Status table by clicking on any value in the Current State table.

For example, if the user clicks on the number of isolated switch-to-switch ports, the Port Level Status table will display only the isolated switch-to-switch ports.

[Screenshot: Port Level Status table filtered to isolated switch-to-switch ports]

Configuration

This page displays a subset of the aforementioned IBLR configuration parameters.

Below are representative screenshots showing some of the configuration parameters accessible through the UI:

  • Action execution configurations:

    [Screenshot: Action execution configurations]

  • Failure prediction operating mode configurations:

[Screenshot: Failure prediction operating mode configurations]

© Copyright 2025, NVIDIA. Last updated on Feb 20, 2025.