IB Link Resiliency Plugin
The primary objective of the IB Link Resiliency plugin is to enhance cluster availability and improve the rate of job completion.
This objective is accomplished by combining three different mechanisms for identifying problematic links: machine learning (ML) models that predict the probability that a link will drop packets, rule-based logic for detecting links with a high packet drop rate, and threshold-based logic for validating that ports are within their nominal operating conditions.
The plugin then applies corrective measures to links that were flagged by the aforementioned mechanisms: it isolates the links, implements maintenance procedures on them, and subsequently restores the fixed links to their original state by removing the isolation.
The IB Link Resiliency plugin performs the following tasks:
Collects telemetry data from UFM and employs ML models and rule-based logic to determine which links need to be isolated/de-isolated.
Isolates these links to avert any interruption to traffic flow.
Maintains a record of maintenance procedures that can be executed to restore an isolated link.
Verifies, after the required maintenance has been performed, whether the links can be de-isolated and restored to operational status (brought back online).
For each problematic-link identification mechanism, the IB Link Resiliency plugin operates in one of the following two modes:
Shadow mode (supported for both switch-to-switch and switch-to-host links)
Collects telemetry data, runs problematic links identification logic, and saves the predictions to files.
Active mode (supported only for switch-to-switch links)
Collects telemetry data, runs problematic links identification logic, and saves the predictions to files.
Automatically isolates and de-isolates the links flagged by the identification mechanism.
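The difference between the two modes can be illustrated with a minimal Python sketch. It is not the plugin's implementation: the `Link` and `Outcome` types and the `process_flagged_links` function are hypothetical names introduced here, and the sketch assumes that a switch-to-host link flagged under active mode is only recorded, because active mode is supported only for switch-to-switch links.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    # Hypothetical representation of a link; not a UFM data structure.
    name: str
    switch_to_switch: bool


@dataclass
class Outcome:
    saved_predictions: list = field(default_factory=list)
    isolated: list = field(default_factory=list)


def process_flagged_links(mode, flagged_links):
    """Record every flagged link; isolate only in active mode."""
    if mode not in ("shadow", "active"):
        raise ValueError(f"unknown mode: {mode!r}")
    outcome = Outcome()
    for link in flagged_links:
        # Both modes save the prediction.
        outcome.saved_predictions.append(link.name)
        # Assumption: only switch-to-switch links are acted on in active mode.
        if mode == "active" and link.switch_to_switch:
            outcome.isolated.append(link.name)
    return outcome


links = [Link("sw1-sw2", True), Link("sw1-host1", False)]
shadow = process_flagged_links("shadow", links)
active = process_flagged_links("active", links)
print(shadow.isolated, active.isolated)
```

In this sketch, shadow mode leaves the fabric untouched while still producing the same record of predictions, which allows the identification logic to be evaluated before it is given control over isolation.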
The IB Link Resiliency plugin can be deployed using the following methods:
On the UFM Appliance
On the UFM Software
To deploy the plugin, follow these steps:
Download the ufm-plugin-ib-link-resiliency-image from DockerHub.
Load the downloaded image onto the UFM server. This can be done either from the UFM GUI, under the Settings -> Plugins Management tab, or from the command line as follows:
Log in to the UFM server terminal.
Run:
docker load -i <path_to_image>
After the plugin image is loaded successfully, the plugin should appear in the plugins management table in the UFM GUI. To start the plugin, right-click the respective plugin in the table.
The supported InfiniBand hardware technologies are HDR and NDR (beta).
The IB Link Resiliency plugin collects data from the UFM Enterprise appliance in the following two methods:
Low-frequency collection: This process runs every 5 minutes when the IB Link Resiliency plugin uses secondary telemetry, and gathers data for the following counters:
hist0,hist1,hist2,hist3,hist4,hist5,hist6,hist7,hist8,hist9,hist10,hist11,hist12,hist13,hist14,hist15,phy_effective_errors,phy_symbol_errors,CableInfo.Temperature,switch_temperature,CableInfo.diag_supply_voltage,PortRcvErrorsExtended,PortRcvDataExtended,snr_host_lane0,snr_host_lane1,snr_host_lane2,snr_host_lane3,snr_media_lane0,snr_media_lane1,snr_media_lane2,snr_media_lane3,link_down_events,LinkErrorRecoveryCounterExtended,time_since_last_clear
The set of collected counters is configurable and can be customized to suit your requirements. The counters are listed in /opt/ufm/files/conf/plugins/ib-link-resiliency/counters.cfg.
The IB Link Resiliency configuration is used for controlling data collection and isolation/de-isolation. The configuration can be found under /opt/ufm/files/conf/plugins/ib-link-resiliency/iblr.cfg.
Name | Section name | Description |
| Prediction | The mode can be either "active" or "shadow". In active mode, the IB Link Resiliency plugin enforces isolation/de-isolation rules on all ports that are predicted to fail, except those listed in the "except" list. In shadow mode, the plugin enforces isolation/de-isolation rules on ports that are predicted to fail only if they are listed in the "except" list. |
| Prediction | Lists the ports that receive the opposite treatment to the configured mode. The except list is saved in /opt/ufm/files/conf/plugins/ib-link-resiliency/predict.csv. Format: port_guid,port_number (for example, 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3). |
| Detection | The mode can be either "active" or "shadow". In active mode, the IB Link Resiliency plugin enforces isolation/de-isolation rules on all ports where failures were detected, except those listed in the "except" list. In shadow mode, the plugin enforces isolation/de-isolation rules on ports where failures were detected only if they are listed in the "except" list. |
| Detection | Lists the ports that receive the opposite treatment to the configured mode. The except list is saved in /opt/ufm/files/conf/plugins/ib-link-resiliency/detect.csv. Format: port_guid,port_number (for example, 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3). |
| NOC | The mode can be either "active" or "shadow". In active mode, the IB Link Resiliency plugin enforces isolation/de-isolation rules on all ports that are considered to be outside their nominal operating conditions, except those listed in the "except" list. In shadow mode, the plugin enforces isolation/de-isolation rules on ports that are considered to be outside their nominal operating conditions only if they are listed in the "except" list. |
| NOC | Lists the ports that receive the opposite treatment to the configured mode. The except list is saved in /opt/ufm/files/conf/plugins/ib-link-resiliency/noc.csv. Format: port_guid,port_number (for example, 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3). |
| Isolation | The maximum number of ports that can be isolated in an hour |
| Isolation | The maximum number of ports that can be isolated in a week |
| Isolation | The maximum number of ports that can be isolated in a month |
| Isolation | The minimum number of links between two switches required to perform isolation |
| Isolation | The minimum number of active ports per switch required before performing isolation |
| DeIsolation | The waiting time before de-isolating an isolated port |
| DeIsolation | The maximum number of ports that can be de-isolated per hour |
| Isolation | The maximum number of ports that can be isolated in one iteration (i.e., one telemetry collection cycle) |
| LowFreqCollector | The interval for low-frequency collection when the IB Link Resiliency plugin uses dynamic telemetry |
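The interaction between a mechanism's mode, its "except" list, and the hourly isolation limit can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's code: all function and class names are hypothetical, the first `port_guid,port_number` line is assumed to be a header row, and the hourly limit is modelled as a sliding 3600-second window, which may differ from how the plugin actually counts.

```python
import csv
import io
from collections import deque


def parse_except_list(text):
    """Parse 'port_guid,port_number' rows into a set of (guid, port) tuples."""
    ports = set()
    for row in csv.reader(io.StringIO(text)):
        # Skip blank or malformed lines and the assumed header row.
        if len(row) != 2 or row[0].strip() == "port_guid":
            continue
        ports.add((row[0].strip().lower(), int(row[1])))
    return ports


def effective_mode(configured_mode, port, except_ports):
    """Ports in the except list get the opposite of the configured mode."""
    opposite = {"active": "shadow", "shadow": "active"}
    if configured_mode not in opposite:
        raise ValueError(f"unknown mode: {configured_mode!r}")
    return opposite[configured_mode] if port in except_ports else configured_mode


class HourlyBudget:
    """Assumed sliding-window limit on isolations per hour."""

    def __init__(self, max_per_hour):
        self.max_per_hour = max_per_hour
        self._stamps = deque()

    def try_consume(self, now):
        # Drop isolation timestamps older than one hour.
        while self._stamps and now - self._stamps[0] >= 3600:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_per_hour:
            return False
        self._stamps.append(now)
        return True


def should_isolate(configured_mode, port, except_ports, budget, now):
    """Isolate a flagged port only if it is effectively active and budget remains."""
    if effective_mode(configured_mode, port, except_ports) != "active":
        return False
    return budget.try_consume(now)


except_ports = parse_except_list(
    "port_guid,port_number\n0x1070fd03001769b4,1\n0x1070fd03001769b4,3\n"
)
print(effective_mode("active", ("0x1070fd03001769b4", 1), except_ports))
```

With the mode set to "active", the two ports from the sample file are treated as shadow ports, while every other flagged port remains eligible for isolation, subject to the remaining budget.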
After the successful deployment of the plugin, a new item is shown in the UFM side menu for the IB Link Resiliency plugin:
Current State
This page displays a table presenting the current cluster status, outlining the following counts:
Number of ports
Number of isolated ports
Number of ports in active/shadow failure prediction mode
Number of ports in active/shadow operating conditions validation mode
Number of ports in active/shadow failure detection mode
Number of ports with a failure prediction, that is, ports that were recently flagged by the ML models as having a high probability of dropping packets.
The "Port Level Status" table is displayed below the "Current Status" table and shows the status of the cluster ports.
The user can filter the "Port Level Status" table by clicking on any value in the "Current Status" table. The "Port Level Status" table will then be filtered based on the selected value.
For example, if the user clicks on the number of isolated switch-to-switch ports, the "Port Level Status" table will display only the isolated switch-to-switch ports.
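The filtering behaviour described above amounts to keeping only the rows that match the clicked value. A minimal Python sketch, assuming hypothetical row fields (`link_type`, `isolated`) that are not taken from the plugin:

```python
def filter_ports(rows, **criteria):
    """Return the rows whose fields match every given criterion."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical port rows; the field names are illustrative only.
ports = [
    {"port": "sw1/1", "link_type": "switch-to-switch", "isolated": True},
    {"port": "sw1/2", "link_type": "switch-to-switch", "isolated": False},
    {"port": "sw2/7", "link_type": "switch-to-host", "isolated": True},
]

# Clicking the count of isolated switch-to-switch ports corresponds to:
selected = filter_ports(ports, link_type="switch-to-switch", isolated=True)
print([row["port"] for row in selected])
```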
Configuration
This page is used to view and update the IB Link Resiliency plugin configuration.
The IB Link Resiliency configuration is divided into the following sections:
General Configurations
Failure Prediction
Failure Detection
Port Operating Conditions Validation
ML Model Configurations