NVIDIA UFM Enterprise User Manual v6.20.1

IB Link Resiliency Plugin

The primary objective of the IB Link Resiliency (IBLR) plugin is to enhance cluster availability and improve the rate of job completion.

This objective is accomplished by combining different mechanisms, both ML-based and rule-based, for identifying problematic links. Then, the plugin autonomously applies corrective measures to these links with the aim of restoring their normal function.

For cluster topologies where no redundancy exists at the level of access links, the plugin will only execute a mitigation procedure for trunk links.

The IBLR plugin execution cycle comprises the following tasks (a schematic code sketch follows the list):

  1. Collects telemetry data from the UFM secondary telemetry service, which samples the counters every 5 minutes by default.

  2. Employs ML-based prediction models and rule-based detection logic to alert on problematic ports.

  3. Based on the alerts issued by the different alert engines, determines which measures should be taken:

    1. Which links are underperforming and should be isolated from the fabric.

    2. Which isolated links have recovered following the mitigation process, and hence should be reinstated to the fabric.

  4. Applies the required actions through the UFM and reports to the UFM events table.
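The cycle can be summarized by the minimal sketch below. It is illustrative only; the objects and method names (telemetry, alert_engines, decision, ufm) are hypothetical placeholders rather than the plugin's actual interfaces.

```python
def run_iblr_cycle(telemetry, alert_engines, decision, ufm):
    """One IBLR execution cycle: collect, alert, decide, act, report."""
    # 1. Collect the latest counters from the secondary telemetry service.
    counters = telemetry.fetch_latest()

    # 2. Run the ML-based and rule-based engines to flag problematic ports.
    alerts = [alert for engine in alert_engines for alert in engine.evaluate(counters)]

    # 3. Decide which links to isolate and which recovered links to reinstate.
    to_isolate, to_deisolate = decision.evaluate(alerts)

    # 4. Apply the actions through the UFM and report to the UFM events table.
    for link in to_isolate:
        ufm.isolate(link)
        ufm.report_event(f"isolated {link}")
    for link in to_deisolate:
        ufm.deisolate(link)
        ufm.report_event(f"de-isolated {link}")

# Driver loop (driver objects omitted); secondary telemetry refreshes every 5 minutes by default:
#   while True:
#       run_iblr_cycle(telemetry, alert_engines, decision, ufm)
#       time.sleep(5 * 60)
```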

Schematic Flow: External View

[Figure: IBLR schematic flow, external view]

Schematic Flow: Internal View

[Figure: IBLR execution cycle, internal view]

The plugin collects its telemetry data from the secondary telemetry endpoint, a low-frequency UFM Telemetry service that collects a large set of counters. Secondary telemetry is enabled by default; if it is deactivated by the user, the plugin will not be able to operate.

In addition to telemetry data, the plugin periodically reads the cluster topology information from the UFM in order to align its internal logic with any changes that have taken place in the cluster.
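For reference, the sketch below shows how the same two inputs, secondary telemetry counters and UFM topology, could be pulled by an external script. The telemetry port, URL paths, and credentials are assumptions for illustration only; consult the UFM REST API and UFM Telemetry documentation for the exact endpoints in your installation.

```python
# Hedged sketch only: the port, paths, and credentials below are assumptions,
# not guaranteed endpoints; verify them against your UFM documentation.
import csv
import io

import requests

UFM_HOST = "ufm.example.com"   # hypothetical UFM server address
AUTH = ("admin", "123456")     # hypothetical credentials

def fetch_secondary_telemetry():
    # Assumed low-frequency (secondary) telemetry CSV endpoint, sampled every 5 minutes by default.
    resp = requests.get(f"http://{UFM_HOST}:9002/csv/xcset/low_freq_debug", timeout=30)
    resp.raise_for_status()
    return list(csv.DictReader(io.StringIO(resp.text)))

def fetch_topology():
    # UFM REST resources used to stay aligned with the current cluster topology.
    base = f"https://{UFM_HOST}/ufmRest"
    ports = requests.get(f"{base}/resources/ports", auth=AUTH, verify=False, timeout=30).json()
    links = requests.get(f"{base}/resources/links", auth=AUTH, verify=False, timeout=30).json()
    return ports, links
```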

Main Collected Counters

| Name | Description |
|------|-------------|
| hist0-hist15 | FEC histogram counters. Counter hist{i} is incremented by one every time a FEC block arrives with i bit errors. |
| phy_effective_errors | Number of FEC blocks that could not be corrected (8 or more errors for the FEC8 algorithm). |
| phy_symbol_errors | Number of symbols dropped due to errors. |
| PortRcvErrorsExtended | Number of data blocks dropped due to errors. |
| PortRcvDataExtended | Number of received data blocks. |
| CableInfo.Temperature | Module temperature. |
| CableInfo.diag_supply_voltage | Module voltage. |
| link_down_events | Number of times the link was down. |


The alert generation block is composed of four independent engines, each employing either ML-based or rule-based logic in order to identify problematic links:

  1. Failure prediction

  2. Failure detection

  3. Link flaps detection

  4. Operating conditions violation

For each alert engine, each port can be configured to operate in one of two modes:

  • Shadow mode: the plugin will not act on this link based on alerts from this engine.

  • Active mode: the plugin is allowed to automatically execute mitigation steps on this link if an alert is raised by this engine.

Alert Engine Operation Mode Configurations

| Name | Type | Values | Default Value | Description |
|------|------|--------|---------------|-------------|
| mode | str | shadow, active | shadow | Assigns the default operation mode for the respective alert engine. |
| except_list | List[str] | List of port identifiers (node GUID and port number) | empty | Ports included in the exception list operate in the mode opposite to the one indicated in the mode field. This makes it possible to define a subset of ports that operates differently from the remainder of the cluster. |
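The interplay between mode and except_list can be illustrated with a minimal sketch; the function and identifiers below are placeholders, not the plugin's actual code.

```python
def effective_mode(default_mode: str, except_list: set, port_id: str) -> str:
    """default_mode is 'shadow' or 'active'; port_id is '<node_guid>_<port_number>'."""
    # Ports on the except_list run in the opposite mode from the engine's default.
    other = "active" if default_mode == "shadow" else "shadow"
    return other if port_id in except_list else default_mode

# Example: the engine runs in shadow mode cluster-wide, except one port that is active.
assert effective_mode("shadow", {"0xabcd_1"}, "0xabcd_1") == "active"
assert effective_mode("shadow", {"0xabcd_1"}, "0xbeef_7") == "shadow"
```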


Failure Prediction

The port failure prediction logic is based on a binary classification ML model, trained to predict ahead of time cases where physical layer issues will eventually result in data loss.

The model relies mainly on the error histogram counters of the Forward Error Correction (FEC) block.

The model was validated to provide valuable predictions on both HDR and NDR generation clusters.
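As an illustration of this design, the sketch below builds a feature vector from per-interval increments of the FEC histogram counters and compares the classifier's output probability to the model_threshold configuration described further down. The feature construction and the model object are hypothetical; the actual model and its features are internal to the plugin.

```python
import numpy as np

def predict_failure(prev: dict, curr: dict, model, model_threshold: float = 0.8) -> bool:
    """Illustrative only: alert when the classifier's failure probability exceeds the threshold."""
    # Per-interval increments of the FEC histogram counters hist0..hist15.
    deltas = np.array([curr[f"hist{i}"] - prev[f"hist{i}"] for i in range(16)], dtype=float)
    total = deltas.sum()
    features = deltas / total if total > 0 else deltas            # normalized histogram shape
    proba = model.predict_proba(features.reshape(1, -1))[0, 1]    # scikit-learn style API
    return proba >= model_threshold
```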

[Figure: Failure prediction model]

Failure Prediction Configurations

| Name | Type | Values | Default Value | Description |
|------|------|--------|---------------|-------------|
| model_threshold | float | (0, 1) | 0.8 | The decision threshold of the binary classification model. If the model's output probability exceeds this threshold, an alert is triggered. The recommended value range is 0.65-0.8, with lower values favoring high recall and higher values favoring high precision. |

Failure Detection

The failure detection module comprises two distinct rules, aimed at detecting, based on post-FEC errors, when a link is not meeting the desired performance specifications:

  1. Packet drop rate: calculates the rate of dropped packets (local and remote errors) vs. successfully received packets.

  2. Symbol errors: calculates the rate of symbol errors during two different time intervals (12 minutes and 125 minutes). If the threshold is exceeded for either interval, the rule triggers the failure detection alert.

    This rule is only supported in NDR clusters, and is ignored in HDR clusters.

In case either of the rules exceeds its respective threshold, a failure detection alert will be raised.
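The two rules can be summarized by the hedged sketch below. The exact counter combinations and rate normalization used internally are assumptions here; the default thresholds match the configuration table that follows.

```python
def packet_drop_rate_alert(dropped_blocks: int, received_blocks: int,
                           threshold: float = 1e-12) -> bool:
    """Rate of dropped blocks (local and remote errors) vs. successfully received blocks."""
    if received_blocks == 0:
        return False
    return dropped_blocks / received_blocks > threshold

def symbol_error_alert(short_interval_rate: float, long_interval_rate: float,
                       short_threshold: float = 2.88, long_threshold: float = 3.0) -> bool:
    """Symbol error rates over the 12-minute and 125-minute windows (NDR clusters only);
    alert if either window exceeds its threshold."""
    return short_interval_rate > short_threshold or long_interval_rate > long_threshold
```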

Failure Detection Configurations

| Name | Type | Default Value | Description |
|------|------|---------------|-------------|
| detect_pdr_rule_enabled | bool | True | Flag indicating whether the packet drop rate rule is enabled. If disabled, alerts will not be generated based on this rule. |
| detect_symbol_error_rule_enabled | bool | True | Flag indicating whether the symbol errors rule is enabled. If disabled, alerts will not be generated based on this rule. |
| detect_pdr_error_rate_threshold | float | 1e-12 | The threshold for the packet drop rate rule. |
| detect_symbol_error_short_interval_threshold | float | 2.88 | The threshold for the symbol errors rule, 12-minute interval. |
| detect_symbol_error_long_interval_threshold | float | 3 | The threshold for the symbol errors rule, 125-minute interval. |

Link Flaps Detection

A heuristic is used to detect link flap events. This rule defines a link flap as a scenario where a link down occurred on both ends of the link more times than defined by the link flap threshold.

By setting the right threshold, this rule is meant to capture unintentional link flaps and filter out links that were intentionally reset or suffered from a rare link down event.
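In code form, the heuristic reduces to a simple check on both ends of the link; the names below are illustrative placeholders, and the default matches the configuration table that follows.

```python
def is_link_flapping(down_events_side_a: int, down_events_side_b: int,
                     threshold: int = 15) -> bool:
    # A link flap requires the link-down count to reach the threshold on both ends of the link.
    return down_events_side_a >= threshold and down_events_side_b >= threshold
```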

Link Flaps Detection Configurations

| Name | Type | Default Value | Description |
|------|------|---------------|-------------|
| link_down_threshold_flapping_trunk_link | int | 15 | The minimal number of link down events considered a link flap, for switch-switch links. |
| link_down_threshold_flapping_access_link | int | 15 | The minimal number of link down events considered a link flap, for host-switch links. |

Operating Conditions Violation

This module validates that the ports are within their nominal operating conditions. Configurable lower and upper thresholds indicate whether the voltage and temperature of the port fall outside the desired range.
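A minimal sketch of the check, assuming the default thresholds listed in the table below:

```python
def operating_conditions_alert(temperature_c: float, voltage_mv: float,
                               temp_range=(-10, 70), volt_range=(3100, 3500)) -> bool:
    """Illustrative only: alert when temperature (°C) or module voltage (mV) leaves its range."""
    temp_ok = temp_range[0] <= temperature_c <= temp_range[1]
    volt_ok = volt_range[0] <= voltage_mv <= volt_range[1]
    return not (temp_ok and volt_ok)
```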

Operating Conditions Violation Configurations

| Name | Type | Default Value | Description |
|------|------|---------------|-------------|
| port_temperature_upper_threshold | int | 70 | The temperature upper threshold, in °C. |
| port_temperature_lower_threshold | int | -10 | The temperature lower threshold, in °C. |
| port_voltage_upper_threshold | int | 3500 | The voltage upper threshold, in mV. |
| port_voltage_lower_threshold | int | 3100 | The voltage lower threshold, in mV. |

The action execution block processes inputs from the alert generation module, the UFM and the internal port state table in order to determine which links should undergo isolation or de-isolation.

Then, it executes the actions via a UFM API and finally reports both the actions that were taken and those that were not, via messages to the UFM events table.

Action Execution Flow

The action execution flow comprises six main modules (shown on the right side of the image below), which can be aggregated into three main steps:

  1. Update of the internal port state DB, based on data collected from the alert generation module and from the UFM.

  2. Based on the ports' state as captured by the port state DB, run two decision processes to determine which links should be isolated and which should be de-isolated.

  3. Apply the necessary actions through the UFM.

Once a link is selected for action, it enters a mitigation cycle as outlined below.

Isolation and de-isolation of links is done via updating the UFM Unhealthy Ports file. This file is submitted by the UFM to the SM, triggering the update of the routing tables throughout the fabric.

Notably, even while the link is isolated and no application traffic flows through it, the physical layer remains active; hence, the plugin can keep monitoring the state of the link using the same aforementioned alert engines.

This enables the decision process to determine whether the link has successfully recovered following the mitigation procedure and, if so, to reinstate it into the network.
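A hedged sketch of that decision is shown below, using parameter names from the de-isolation checkpoint configurations described later; the actual bookkeeping inside the plugin may differ.

```python
from datetime import datetime, timedelta

def next_step_for_isolated_link(isolated_at: datetime,
                                last_alert_at: datetime,
                                now: datetime,
                                min_health_time_before_deisolation_min: int,
                                max_recovery_time_threshold_hours: int) -> str:
    """Illustrative only: decide what to do with a link the plugin has isolated."""
    alert_free_for = now - last_alert_at
    isolated_for = now - isolated_at
    if alert_free_for >= timedelta(minutes=min_health_time_before_deisolation_min):
        return "de-isolate"            # link recovered; reinstate it into the fabric
    if isolated_for >= timedelta(hours=max_recovery_time_threshold_hours):
        return "mark-unrecoverable"    # operator inspection required
    return "keep-isolated"             # keep monitoring with the same alert engines
```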

[Figure: Action execution flow]

Link State Transition Diagram

The diagram below depicts the link state machine managed by the IBLR plugin.

Application traffic will flow through the link only if it is in Healthy state. In all other states, the link is isolated and traffic is redirected to other parallel links.

The black arrows in the diagram indicate state transitions triggered by the plugin, while the red dotted arrows indicate state transitions triggered outside the plugin (e.g. by the user).

While the main flow of isolating a problematic link, waiting for it to recover and reinstating it is fully operated by the plugin, there are two additional flows that require user intervention:

  1. If a link that was isolated by the plugin has not recovered within a pre-determined period, it will be moved into the unrecoverable state.

    This indicates that the network operator should inspect the link, attempt to recover it, and, if successful, de-isolate the link to signal that it was maintained.

  2. If a link was isolated outside the scope of the plugin, the plugin will not de-isolate the link automatically, as it has insufficient knowledge with respect to the isolation reason.

    In that case, de-isolation should also be handled by the user or external system that isolated the link.
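Putting these transitions together, the sketch below models them as a simple lookup table. The state and event names are paraphrased from the diagram and are not the plugin's internal identifiers.

```python
def next_state(state: str, event: str) -> str:
    """Illustrative link state machine; unknown (state, event) pairs leave the state unchanged."""
    transitions = {
        ("healthy", "alert_raised"): "isolated",                      # plugin isolates the link
        ("isolated", "recovered"): "healthy",                         # plugin reinstates the link
        ("isolated", "recovery_timeout"): "unrecoverable",            # recovery window exceeded
        ("unrecoverable", "user_deisolation"): "healthy",             # operator maintained the link
        ("healthy", "external_isolation"): "isolated_externally",     # isolated outside the plugin
        ("isolated_externally", "external_deisolation"): "healthy",   # user/external system only
    }
    return transitions.get((state, event), state)
```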

[Figure: Link state transition diagram]

Action Execution Checkpoints

The action execution decision process includes a set of configurable checkpoints that give the user the ability to determine under what conditions actions will be taken.

These configurable checkpoints can be broadly divided into the following groups (a sketch of the isolation checks follows the list):

  1. Isolation and de-isolation rate limit constraints.

  2. Topology-dependent constraints that prevent isolations in case this will lead to insufficient redundancy in the fabric.

  3. Time limits for isolated links to qualify for de-isolation or to be declared unrecoverable.
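As referenced above, the sketch below illustrates how the isolation checkpoints could be combined. Parameter names mirror the configuration tables that follow; the actual checkpoint order and bookkeeping inside the plugin may differ.

```python
def isolation_allowed(isolations_last_hour: int, max_per_hour: int,
                      healthy_links_between_pair: int, min_links_per_switch_pair: int,
                      active_ports_on_switch: int, min_active_ports_per_switch: int) -> bool:
    """Illustrative only: every checkpoint must pass before a link may be isolated."""
    if isolations_last_hour >= max_per_hour:
        return False   # rate limit reached for this time window
    if healthy_links_between_pair - 1 < min_links_per_switch_pair:
        return False   # would leave too little redundancy between the two switches
    if active_ports_on_switch - 1 < min_active_ports_per_switch:
        return False   # would leave the switch with too few active ports
    return True
```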

The tables below describe the main configuration parameters used by the decision process checkpoints.

Isolation Checkpoints Configurations

| Name | Type | Description |
|------|------|-------------|
| max_per_hour/day/week/month | int | The maximum number of distinct links allowed to be isolated within the given time window. |
| min_links_per_switch_pair | int | The minimum number of healthy links between two switches; once reached, no further isolations are allowed. |
| min_active_ports_per_switch | int | The minimum number of active ports per switch; once reached, no further isolations are allowed. |


De-isolation Checkpoints Configurations

| Name | Type | Description |
|------|------|-------------|
| max_per_hour | int | The maximum number of distinct links allowed to be de-isolated within an hour. |
| min_health_time_before_deisolation_min | int | The minimum period, in minutes, that a link must be clear of alerts in order to qualify for de-isolation. |
| max_recovery_time_threshold_hours | int | The maximum period, in hours, that a link is granted for recovery and de-isolation before it is declared unrecoverable. |

Reporting

Customers may be looking to extend or replace the plugin's mitigation procedure with their private business logic.

For that purpose, the action execution module reports both the actions it has taken and the actions it has not taken because the decision process failed at one of the checkpoints.

Reporting is done by sending event-specific messages to the UFM events table.
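For example, an external system could consume these reports by polling the UFM events table and matching the message formats listed below. The REST path and field names in this sketch are assumptions; verify them against the UFM REST API documentation.

```python
import requests

UFM_HOST = "ufm.example.com"   # hypothetical UFM server address
AUTH = ("admin", "123456")     # hypothetical credentials

def fetch_iblr_events():
    # Assumed events endpoint and response fields; adjust to your UFM installation.
    resp = requests.get(f"https://{UFM_HOST}/ufmRest/app/events",
                        auth=AUTH, verify=False, timeout=30)
    resp.raise_for_status()
    events = resp.json()
    # Keep only messages produced by the action execution module (see the tables below).
    keywords = ("isolation of the link", "deisolation of the link",
                "Isolation will not be performed", "Deisolation will not be performed")
    return [e for e in events if any(k in e.get("description", "") for k in keywords)]
```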

Event Messages: Indication of Action

| Event | Message Format | Comments |
|-------|----------------|----------|
| Isolation | Completed isolation of the link between ports <src_guid>_<src_num> <-> <dest_guid>_<dest_num>, isolation reasons: <alert_reasons> | |
| De-isolation | Completed deisolation of the link between ports <src_guid>_<src_num> <-> <dest_guid>_<dest_num> after no alerts were triggered for <isolation_reasons> | Isolation reasons are the alerts that triggered the initial isolation. |
| Declaring unrecoverable port | Port <src_guid>_<src_num> could not be deisolated for more than <threshold value> hours due to repeating <alert_reason> alerts, hence it is marked as unrecoverable | The threshold is configurable; the plugin will not deisolate unrecoverable ports. |


Event Messages: Indication of Action Not Taken

| Event | Message Format | Comments |
|-------|----------------|----------|
| Total alerts count exceeding alert threshold per iteration | Isolation will not be performed because the number of ports alerted for isolation (<actual count>) is greater than the threshold for the maximum number of isolation alerts (<threshold value>) | The threshold is configurable. |
| Access link alert | Isolation will not be performed for port <guid>_<num> because this is an access link, triggered alerts: <alert_reasons> | |
| Shadow mode preventing isolation | Isolation will not be performed for port <guid>_<num> because the port is in shadow mode, triggered alerts: <alert_reasons> | |
| Shadow mode preventing de-isolation | Deisolation will not be performed for port <guid>_<num> because the port is in shadow mode, triggered alerts: <alert_reasons> | Only if the mode changed from active to shadow while the port was isolated. |
| Exceeding isolation count per time | Isolation will not be performed for port <guid>_<num> because the number of isolated ports in the last <time frame> (<actual count>) exceeds the threshold for max isolations allowed (<threshold value>), triggered alerts: <alert_reasons> | The time frame is one of hour, day, week, or month; each has a configurable threshold. |
| Exceeding de-isolation count per time | Deisolation will not be performed for port <guid>_<num> because the number of de-isolated ports in the last <time frame> (<actual count>) exceeds the threshold for max de-isolations allowed (<threshold value>) | The time frame is an hour; the threshold is configurable. |
| Insufficient redundancy in the connectivity between the switches | Isolation will not be performed for port <guid>_<num> because the number of links between switches <src_guid> and <dest_guid> (<actual count>) will decrease below the threshold for minimum number of healthy links (<threshold value>), triggered alerts: <alert_reasons> | The threshold is configurable. |
| Insufficient number of active ports in the switch | Isolation will not be performed for port <guid>_<num> because the number of active ports for switch <switch_guid> (<actual count>) will decrease below the threshold for minimum number of active ports (<threshold value>), triggered alerts: <alert_reasons> | The threshold is configurable and validated for both switches that the link connects. |

The IB Link Resiliency plugin can be deployed using the following methods:

  1. On the UFM Appliance

  2. On the UFM Software

To deploy the plugin, follow these steps:

  1. Download the ufm-plugin-ib-link-resiliency image from Docker Hub.

  2. Load the downloaded image onto the UFM server. This can be done either through the UFM GUI, by navigating to the Settings -> Plugins Management tab, or by loading the image via the following instructions:

    1. Log in to the UFM server terminal.

    2. Run:


      docker load -i <path_to_image>

  3. After successfully loading the plugin image, the plugin should become visible in the Plugins Management table within the UFM GUI. To initiate the plugin's execution, simply right-click on the respective line in the table.

[Screenshot: Plugins Management table]

Note

The supported InfiniBand hardware technologies are HDR and NDR.

After the successful deployment of the plugin, a new item is shown in the UFM side menu for the IB Link Resiliency plugin: 

[Screenshot: IB Link Resiliency item in the UFM side menu]

Current State

This page displays the Current State table presenting the cluster status, outlining the following counts:

  1. Number of ports.

  2. Number of isolated ports.

  3. Number of ports operating in active/shadow mode, for each of the alert engines.

  4. Number of predicted failures, reflecting ports which were recently flagged by the ML model as having a high probability of failure.

[Screenshot: Current State table]

The Port Level Status table is displayed below and shows the status of the cluster ports, as stored in the port state DB.

[Screenshot: Port Level Status table]

The user can filter the Port Level Status table by clicking on any value in the Current State table.

For example, if the user clicks on the number of isolated switch-to-switch ports, the Port Level Status table will display only the isolated switch-to-switch ports.

[Screenshot: Port Level Status table filtered to isolated switch-to-switch ports]

Configuration

This page displays a subset of the aforementioned IBLR configuration parameters.

Below are representative screenshots showing some of the configuration parameters accessible through the UI:

  • Action execution configurations:

    [Screenshot: Action execution configurations]

  • Failure prediction operating mode configurations:

[Screenshot: Failure prediction operating mode configurations]

© Copyright 2025, NVIDIA. Last updated on Feb 20, 2025.