NVIDIA UFM Enterprise User Manual v6.19.5

IB Link Resiliency Plugin

The primary objective of the IB Link Resiliency plugin is to enhance cluster availability and improve the rate of job completion.

This objective is accomplished by combining three different mechanisms for identifying problematic links: machine learning (ML) models that predict the probability of a link dropping packets, rule-based logic that detects links with a high packet drop rate, and threshold-based logic that validates that ports are within their nominal operating conditions.

The plugin then applies corrective measures to links that were flagged by the aforementioned mechanisms: it isolates the links, implements maintenance procedures on them, and subsequently restores the fixed links to their original state by removing the isolation.

The IB Link Resiliency plugin performs the following tasks:

  1. Collects telemetry data from UFM and employs ML models and rule-based logic to determine which links need to be isolated/de-isolated.

  2. Isolates these links to avert any interruption to traffic flow.

  3. Maintains a record of maintenance procedures that can be executed to restore an isolated link.

  4. Verifies, after the required maintenance has been performed, whether the links can be de-isolated and restored to operational status (brought back online).

For each mechanism of problematic link identification, the IB Link Resiliency plugin operates in one of the following two modes:

  1. Shadow mode (supported for both switch-to-switch and switch-to-host links)

    • Collects telemetry data, runs problematic links identification logic, and saves the predictions to files.

  2. Active mode (supported only for switch-to-switch links)

    • Collects telemetry data, runs problematic links identification logic, and saves the predictions to files.

    • Automatically isolates and de-isolates the links flagged by the identification mechanism.
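The mode rules above can be summarized in a short sketch. The names `Mode`, `LinkType`, and `actions_for` are hypothetical and chosen for illustration only; they are not part of the plugin's interface.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    SHADOW = "shadow"
    ACTIVE = "active"


class LinkType(Enum):
    SWITCH_TO_SWITCH = "switch-to-switch"
    SWITCH_TO_HOST = "switch-to-host"


# Link types each mode supports, as stated in the list above.
SUPPORTED_LINKS = {
    Mode.SHADOW: {LinkType.SWITCH_TO_SWITCH, LinkType.SWITCH_TO_HOST},
    Mode.ACTIVE: {LinkType.SWITCH_TO_SWITCH},
}


def actions_for(mode: Mode, link_type: LinkType) -> List[str]:
    """Return the actions performed for a link of this type in this mode."""
    if link_type not in SUPPORTED_LINKS[mode]:
        raise ValueError(
            f"{mode.value} mode does not support {link_type.value} links"
        )
    # Both modes collect data, run the identification logic, and save results.
    actions = [
        "collect telemetry",
        "run identification logic",
        "save predictions to files",
    ]
    # Only active mode acts on the flagged links.
    if mode is Mode.ACTIVE:
        actions.append("isolate/de-isolate flagged links")
    return actions
```

In this sketch, the only difference between the two modes is the final isolation step, which matches the description above: shadow mode observes and records, while active mode also acts on the result.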

The IB Link Resiliency plugin can be deployed using the following methods:

  1. On the UFM Appliance

  2. On the UFM Software

To deploy the plugin, follow these steps:

  1. Download the ufm-plugin-ib-link-resiliency-image from DockerHub.

  2. Load the downloaded image onto the UFM server. This can be done either from the UFM GUI, by navigating to the Settings -> Plugins Management tab, or from the command line, as described in the following steps:

  3. Log in to the UFM server terminal.

  4. Run:


    docker load -i <path_to_image>

  5. After the plugin image has been loaded successfully, the plugin becomes visible in the plugins management table in the UFM GUI. To start the plugin, right-click the respective row in the table.


Note

The supported InfiniBand hardware technologies are HDR and, at Beta level, NDR.

The IB Link Resiliency plugin collects data from the UFM Enterprise appliance as follows:

  1. Low-frequency collection: When the IB Link Resiliency plugin uses secondary telemetry, this process runs every 5 minutes and gathers data for the following counters:

    hist0, hist1, hist2, hist3, hist4, hist5, hist6, hist7, hist8, hist9, hist10, hist11, hist12, hist13, hist14, hist15,
    phy_effective_errors, phy_symbol_errors, CableInfo.Temperature, switch_temperature, CableInfo.diag_supply_voltage,
    PortRcvErrorsExtended, PortRcvDataExtended, snr_host_lane0, snr_host_lane1, snr_host_lane2, snr_host_lane3,
    snr_media_lane0, snr_media_lane1, snr_media_lane2, snr_media_lane3, link_down_events, LinkErrorRecoveryCounterExtended,
    time_since_last_clear

  2. The collected counters are configurable and can be customized to suit your requirements. The counters can be found at /opt/ufm/files/conf/plugins/ib-link-resiliency/counters.cfg
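When customizing the counter list, it can help to normalize a pasted, comma-separated list before editing it. The following is a minimal sketch; the `parse_counter_list` helper is hypothetical, and because the on-disk layout of counters.cfg is not described here, the sketch works on a plain string only and does not read or write that file.

```python
from typing import List


def parse_counter_list(text: str) -> List[str]:
    """Split a comma-separated counter list into clean names.

    Line breaks are treated as separators; surrounding whitespace,
    empty entries (e.g. from trailing commas), and duplicates are dropped.
    The original order is preserved.
    """
    names: List[str] = []
    for chunk in text.replace("\n", ",").split(","):
        name = chunk.strip()
        if name and name not in names:
            names.append(name)
    return names


# The counters listed in this section, pasted as-is.
DEFAULT_COUNTERS = parse_counter_list("""
hist0,hist1,hist2,hist3,hist4,hist5,hist6,hist7,hist8,hist9,hist10,hist11,hist12,hist13,hist14,hist15,
phy_effective_errors,phy_symbol_errors,CableInfo.Temperature,switch_temperature,CableInfo.diag_supply_voltage,
PortRcvErrorsExtended,PortRcvDataExtended,snr_host_lane0,snr_host_lane1,snr_host_lane2,snr_host_lane3,
snr_media_lane0,snr_media_lane1,snr_media_lane2,snr_media_lane3,link_down_events,LinkErrorRecoveryCounterExtended,
time_since_last_clear
""")
```

A helper like this makes it easy to compare a customized list against the counters named in this section, for example with `set(DEFAULT_COUNTERS) - set(custom_counters)`, where `custom_counters` is whatever list you are preparing.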


The IB Link Resiliency configuration is used for controlling data collection and isolation/de-isolation. The configuration can be found under /opt/ufm/files/conf/plugins/ib-link-resiliency/iblr.cfg.
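For scripted checks of this file, a sketch along the following lines may be useful. It assumes that iblr.cfg uses the common INI layout (`[Section]` headers with `key = value` pairs), which the section and option names in this chapter suggest but do not confirm. The sample content and all values in it are placeholders, not defaults, and the `load_settings` and `mechanism_modes` helpers are hypothetical.

```python
import configparser
from typing import Dict

# Hypothetical sample content. The values are placeholders, not plugin defaults.
SAMPLE_CFG = """
[Prediction]
mode = shadow

[Detection]
mode = shadow

[NOC]
mode = active

[Isolation]
max_per_hour = 5
min_links_per_switch_pair = 2

[DeIsolation]
max_per_hour = 5
"""


def load_settings(text: str) -> configparser.ConfigParser:
    """Parse INI-style configuration text (assumed format, see above)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def mechanism_modes(cfg: configparser.ConfigParser) -> Dict[str, str]:
    """Return the mode of each identification mechanism section that is present."""
    modes: Dict[str, str] = {}
    for section in ("Prediction", "Detection", "NOC"):
        if not cfg.has_section(section):
            continue
        mode = cfg.get(section, "mode").strip().lower()
        if mode not in ("active", "shadow"):
            raise ValueError(f"[{section}] mode must be 'active' or 'shadow', got {mode!r}")
        modes[section] = mode
    return modes


cfg = load_settings(SAMPLE_CFG)
modes = mechanism_modes(cfg)
hourly_limit = cfg.getint("Isolation", "max_per_hour")
```

To inspect a real file instead of the sample string, `parser.read(path)` could replace `read_string`, provided the INI assumption holds for your installation.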

| Name | Section name | Description |
|---|---|---|
| mode | Prediction | The mode can be either "active" or "shadow". In active mode, the IB Link Resiliency plugin enforces isolation/de-isolation rules on all ports that are predicted to fail, except those listed in the except list. In shadow mode, the plugin enforces isolation/de-isolation rules on ports that are predicted to fail only if they are listed in the except list. |
| except_list | Prediction | The ports that receive the opposite treatment to the configured mode. The except list is saved in /opt/ufm/files/conf/plugins/ib-link-resiliency/predict.csv. Format: port_guid,port_number. Example entries: 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3 |
| mode | Detection | The mode can be either "active" or "shadow". In active mode, the IB Link Resiliency plugin enforces isolation/de-isolation rules on all ports where failures were detected, except those listed in the except list. In shadow mode, the plugin enforces isolation/de-isolation rules on ports where failures were detected only if they are listed in the except list. |
| except_list | Detection | The ports that receive the opposite treatment to the configured mode. The except list is saved in /opt/ufm/files/conf/plugins/ib-link-resiliency/detect.csv. Format: port_guid,port_number. Example entries: 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3 |
| mode | NOC | The mode can be either "active" or "shadow". In active mode, the IB Link Resiliency plugin enforces isolation/de-isolation rules on all ports that are considered to be outside nominal operating conditions, except those listed in the except list. In shadow mode, the plugin enforces isolation/de-isolation rules on ports that are considered to be outside nominal operating conditions only if they are listed in the except list. |
| except_list | NOC | The ports that receive the opposite treatment to the configured mode. The except list is saved in /opt/ufm/files/conf/plugins/ib-link-resiliency/noc.csv. Format: port_guid,port_number. Example entries: 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3 |
| max_per_hour | Isolation | The maximum number of ports that can be isolated in an hour |
| max_per_week | Isolation | The maximum number of ports that can be isolated in a week |
| max_per_month | Isolation | The maximum number of ports that can be isolated in a month |
| min_links_per_switch_pair | Isolation | The minimum number of links between two switches required to perform isolation |
| min_active_ports_per_switch | Isolation | The minimum number of active ports per switch required before performing isolation |
| deisolation_time | DeIsolation | The waiting time before an isolated port is de-isolated |
| max_per_hour | DeIsolation | The maximum number of ports that can be de-isolated per hour |
| absolute_threshold_of_isolated_ports | Isolation | The maximum number of ports that can be isolated in one iteration (i.e., one telemetry collection cycle) |
| interval | LowFreqCollector | The periodic interval for low-frequency collection when IB Link Resiliency dynamic telemetry is used |
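The interaction between a mechanism's mode and its except list is easy to misread, so the following sketch spells it out. It is an illustration of the rules described above, not the plugin's implementation: the `parse_except_list` and `enforces_isolation` helpers are hypothetical, and the sketch accepts the except-list file with or without a `port_guid,port_number` header row because the description does not say which applies.

```python
import csv
import io
from typing import Set, Tuple

Port = Tuple[str, int]  # (port GUID, port number)


def parse_except_list(text: str) -> Set[Port]:
    """Parse 'port_guid,port_number' rows into a set of (guid, port) pairs."""
    ports: Set[Port] = set()
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        guid, number = row[0].strip(), row[1].strip()
        if guid == "port_guid":
            continue  # skip the header row if one is present
        ports.add((guid.lower(), int(number)))
    return ports


def enforces_isolation(mode: str, port: Port, except_list: Set[Port]) -> bool:
    """Return True if isolation/de-isolation rules apply to a flagged port.

    active: rules apply unless the port is in the except list.
    shadow: rules apply only if the port is in the except list.
    """
    if mode not in ("active", "shadow"):
        raise ValueError(f"unknown mode: {mode!r}")
    listed = port in except_list
    return listed if mode == "shadow" else not listed


sample = """port_guid,port_number
0x1070fd03001769b4,1
0x1070fd03001769b4,3
"""
except_ports = parse_except_list(sample)
```

With the sample entries, a flagged port `("0x1070fd03001769b4", 1)` is left alone in active mode and acted on in shadow mode, while an unlisted port on the same GUID gets the reverse treatment. In other words, the except list inverts the configured mode for the ports it names.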

After the successful deployment of the plugin, a new item is shown in the UFM side menu for the IB Link Resiliency plugin: 


Current State

This page displays a table presenting the current cluster status, outlining the following counts:

  1. Number of ports

  2. Number of isolated ports

  3. Number of ports in active/shadow failure prediction mode

  4. Number of ports in active/shadow operating conditions validation mode

  5. Number of ports in active/shadow failure detection mode

  6. Failure predictions, that is, the number of ports that were recently flagged by the ML models as having a high probability of dropping packets.


The "Port Level Status" table is displayed below the "Current Status" table and shows the status of the cluster ports.


The user can filter the "Port Level Status" table by clicking on any value in the "Current Status" table. The "Port Level Status" table will then be filtered based on the selected value.

For example, if the user clicks on the number of isolated switch-to-switch ports, the "Port Level Status" table will display only the isolated switch-to-switch ports.



Configuration

This page displays the IB Link Resiliency plugin configuration and allows it to be updated.

The IB Link Resiliency configuration is divided into the following sections:

  • General Configurations


  • Failure Prediction


  • Failure Detection


  • Port Operating Conditions Validation


  • ML Model Configurations



© Copyright 2024, NVIDIA. Last updated on Jan 7, 2025.