NVIDIA UFM Enterprise User Manual v6.18.0

Autonomous Link Maintenance (ALM) Plugin

The primary objective of the Autonomous Link Maintenance (ALM) plugin is to enhance cluster availability and improve the job completion rate. It does so by using machine learning (ML) models to predict potential link failures. The plugin then isolates the links that are expected to fail, performs maintenance procedures on them, and restores the repaired links to their original state by removing the isolation.

The ALM plugin performs the following tasks:

  1. Collects telemetry data from UFM and runs ML jobs to predict which ports need to be isolated or de-isolated.

  2. Identifies potential link failures and isolates the affected links to avert any interruption to traffic flow.

  3. Maintains a record of maintenance procedures that can be executed to restore an isolated link.

  4. Verifies, after the required maintenance has been performed, whether the links can be de-isolated and brought back online.

The ALM plugin operates in the following two distinct modes:

  1. Shadow mode

    • Collects telemetry data, runs ML prediction jobs, and saves the predictions to files.

  2. Active mode

    • Collects telemetry data, runs ML prediction jobs, and saves the predictions to files.

    • Automatically isolates and de-isolates ports based on the predictions.

    • A subset of the links must be specified in the allow list to enable this functionality.

The ALM plugin can be deployed in either of the following environments:

  1. UFM Appliance

  2. UFM Software

To deploy the plugin, follow these steps:

  1. Download the ufm-plugin-alm-image from the NVIDIA License Portal (NLP).

  2. Load the downloaded image onto the UFM server, either through the UFM GUI (Settings -> Plugins Management tab) or from the command line:

    • Log in to the UFM server terminal.

    • Run:

    docker load -i <path_to_image>

  3. After the plugin image is loaded successfully, the plugin becomes visible in the plugins management table in the UFM GUI. To start the plugin, right-click the respective plugin entry in the table.
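For scripted deployments, the command-line load step could be wrapped as in the following minimal Python sketch. The function names build_load_command and load_plugin_image are hypothetical and are not part of UFM; the sketch only assumes that the docker CLI is installed on the UFM server and that the image archive has already been downloaded.

```python
import subprocess
from pathlib import Path


def build_load_command(image_path):
    """Return the argument list that loads an image archive into Docker."""
    return ["docker", "load", "-i", str(Path(image_path))]


def load_plugin_image(image_path):
    """Run `docker load` on the archive and return Docker's standard output.

    Raises FileNotFoundError if the archive does not exist and
    subprocess.CalledProcessError if Docker reports a failure.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"image archive not found: {path}")
    result = subprocess.run(
        build_load_command(path), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Checking for the archive before calling Docker keeps the error message specific when the path is mistyped.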

Note

The supported InfiniBand hardware technologies are HDR and NDR (NDR support is in Beta).

The ALM plugin collects data from the UFM Enterprise appliance using the following two methods:

  1. Low-frequency collection: This process runs every 5 minutes when the ALM uses secondary telemetry (the default), or every 7 minutes when the ALM uses the dynamic telemetry API. It gathers data for the following counters: hist0, hist1, hist2, hist3, hist4, hist5, hist6, hist7, hist8, hist9, hist10, hist11, hist12, hist13, hist14, hist15, phy_effective_errors, phy_symbol_errors, CableInfo.Temperature, switch_temperature, CableInfo.diag_supply_voltage, PortRcvErrorsExtended, PortRcvDataExtended, snr_host_lane0, snr_host_lane1, snr_host_lane2, snr_host_lane3, snr_media_lane0, snr_media_lane1, snr_media_lane2, snr_media_lane3, link_down_events, LinkErrorRecoveryCounterExtended, time_since_last_clear

  2. High-frequency collection (disabled by default): This process runs every 10 seconds and gathers data for the following counters: phy_state, logical_state, link_speed_active, link_width_active, fec_mode_active, raw_ber, eff_ber, symbol_ber, phy_raw_errors_lane0, phy_raw_errors_lane1, phy_raw_errors_lane2, phy_raw_errors_lane3, phy_effective_errors, phy_symbol_errors, time_since_last_clear, hist0, hist1, hist2, hist3, hist4, switch_temperature, CableInfo.temperature, link_down_events, plr_rcv_codes, plr_rcv_code_err, plr_rcv_uncorrectable_code, plr_xmit_codes, plr_xmit_retry_codes, plr_xmit_retry_events, plr_sync_events, hi_retransmission_rate, fast_link_up_status, time_to_link_up, status_opcode, status_message, down_blame, local_reason_opcode, remote_reason_opcode, e2e_reason_opcode, num_of_ber_alarams, PortRcvRemotePhysicalErrorsExtended, PortRcvErrorsExtended, PortXmitDiscardsExtended, PortRcvSwitchRelayErrorsExtended, PortRcvConstraintErrorsExtended, VL15DroppedExtended, PortXmitWaitExtended, PortXmitDataExtended, PortRcvDataExtended, PortXmitPktsExtended, PortRcvPktsExtended, PortUniCastXmitPktsExtended, PortUniCastRcvPktsExtended, PortMultiCastXmitPktsExtended, PortMultiCastRcvPktsExtended

The collected counters are configurable and can be customized to suit your requirements. The counter lists are defined in /opt/ufm/conf/plugins/alm/counters.cfg.
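To make the relationship between the two collection rates concrete, the following short Python sketch works out how many high-frequency samples fall within one low-frequency cycle. The constants use the 5-minute, 7-minute, and 10-second periods stated above. The function names are hypothetical, and the sketch does not read counters.cfg, whose format is not described here.

```python
LOW_FREQ_SECONDARY_MINUTES = 5  # secondary telemetry (the default source)
LOW_FREQ_DYNAMIC_MINUTES = 7    # dynamic telemetry API
HIGH_FREQ_SECONDS = 10          # high-frequency collection (disabled by default)


def low_freq_interval_minutes(use_secondary=True):
    """Return the low-frequency collection period for the chosen source."""
    if use_secondary:
        return LOW_FREQ_SECONDARY_MINUTES
    return LOW_FREQ_DYNAMIC_MINUTES


def high_freq_samples_per_cycle(use_secondary=True):
    """Return how many 10-second samples fit in one low-frequency cycle."""
    return low_freq_interval_minutes(use_secondary) * 60 // HIGH_FREQ_SECONDS


print(high_freq_samples_per_cycle(True))   # 30
print(high_freq_samples_per_cycle(False))  # 42
```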

The ALM configuration is used for controlling data collection and isolation/de-isolation. The configuration can be found under /opt/ufm/cyber-ai/conf/cyberai.cfg.

| Name | Section name | Description |
|---|---|---|
| mode | Prediction | Either "active" or "shadow". In active mode, the ALM enforces isolation/de-isolation rules on all ports that are predicted to fail, except those listed in the "except" list. In shadow mode, the ALM enforces isolation/de-isolation rules only on ports that are listed in the "except" list and are predicted to fail. |
| except_list | Prediction | The ports that receive the opposite treatment to the one defined by the mode. The except list is saved in /opt/ufm/files/conf/plugin/alm/predict.csv. Format: port_guid,port_number (for example, 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3, one pair per line). |
| mode | NOC | Either "active" or "shadow". In active mode, the ALM enforces isolation/de-isolation rules on all ports that are considered out of nominal condition, except those listed in the "except" list. In shadow mode, the ALM enforces isolation/de-isolation rules only on the ports listed in the "except" list. |
| except_list | NOC | The ports that receive the opposite treatment to the one defined by the mode. The except list is saved in /opt/ufm/files/conf/plugin/alm/noc.csv. Format: port_guid,port_number (for example, 0x1070fd03001769b4,1 and 0x1070fd03001769b4,3, one pair per line). |
| max_per_hour | Isolation | Maximum number of ports that can be isolated in an hour. |
| max_per_week | Isolation | Maximum number of ports that can be isolated in a week. |
| max_per_month | Isolation | Maximum number of ports that can be isolated in a month. |
| min_links_per_switch_pair | Isolation | Minimum number of links between two switches required for isolation to be performed. |
| min_active_ports_per_switch | Isolation | Minimum number of active ports per switch required before isolation is performed. |
| Deisolation_time | DeIsolation | The waiting time before an isolated port is de-isolated. |
| max_per_hour | DeIsolation | Maximum number of ports that can be de-isolated per hour. |
| absolute_threshold_of_isolated_ports | Isolation | Maximum number of ports that can be isolated in one sample. |
| use_secondary | LowFreqCollector | Flag that selects the telemetry source. If true, the ALM uses secondary telemetry; otherwise, it uses dynamic telemetry. |
| secondary_interval | LowFreqCollector | The periodic interval for low-frequency collection when use_secondary is set to true. |
| interval | LowFreqCollector | The periodic interval for low-frequency collection when the ALM uses dynamic telemetry. |
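The interaction between the mode and the except list can be illustrated with a small Python sketch. It is not the plugin's implementation: parse_except_list and should_enforce are hypothetical names, the header row handling is an assumption about the CSV file, and the flagged argument stands in for "predicted to fail" (Prediction section) or "out of nominal condition" (NOC section).

```python
import csv
import io


def parse_except_list(text):
    """Parse `port_guid,port_number` rows into a set of (guid, port) tuples."""
    ports = set()
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, malformed rows, and an optional header row
        # (whether the file carries a header is an assumption).
        if len(row) != 2 or row[0].strip() == "port_guid":
            continue
        ports.add((row[0].strip().lower(), int(row[1])))
    return ports


def should_enforce(mode, port, except_ports, flagged):
    """Return True if isolation/de-isolation rules apply to this port.

    Active mode covers every flagged port that is NOT in the except list;
    shadow mode covers only flagged ports that ARE in the except list.
    """
    if mode not in ("active", "shadow"):
        raise ValueError(f"unknown mode: {mode!r}")
    listed = port in except_ports
    if mode == "active":
        return flagged and not listed
    return flagged and listed


except_ports = parse_except_list(
    "port_guid,port_number\n0x1070fd03001769b4,1\n0x1070fd03001769b4,3\n"
)
print(should_enforce("active", ("0x1070fd03001769b4", 1), except_ports, True))  # False
print(should_enforce("shadow", ("0x1070fd03001769b4", 1), except_ports, True))  # True
```

In other words, the same list acts as an exclusion list in active mode and as an inclusion list in shadow mode.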

After the plugin is deployed successfully, a new item for the ALM plugin appears in the UFM side menu.

Current State

This page displays a table presenting the current cluster status, outlining the following counts:

  1. Number of ports

  2. Number of isolated ports

  3. Number of ports in active/shadow prediction mode

  4. Number of ports in active/shadow NOC mode

  5. Number of ports out of NOC


Events Summary

This page displays a table presenting a port count summary, outlining the following counts:

  1. Number of isolated ports in the past hour, week, and month for ‘host to switch’ and ‘switch to switch’.

  2. Number of de-isolated ports in the past hour, week, and month for ‘host to switch’ and ‘switch to switch’.

  3. Number of isolation actions not taken from prediction by ALM in the past hour, week, and month for ‘host to switch’ and ‘switch to switch’.

  4. Number of isolation actions not taken from NOC by ALM in the past hour, week, and month for ‘host to switch’ and ‘switch to switch’.
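The hour, week, and month counts can be thought of as trailing-window tallies over event timestamps. The Python sketch below shows one way to compute such tallies and compare them with limits in the style of max_per_hour, max_per_week, and max_per_month. It is an illustration only: the function names, the 30-day month, and the way limits are applied are assumptions, and it does not split counts into 'host to switch' and 'switch to switch'.

```python
from datetime import datetime, timedelta

WINDOWS = {
    "hour": timedelta(hours=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),  # assumption: a month is treated as 30 days
}


def count_recent(event_times, now):
    """Count events inside each trailing window that ends at `now`."""
    return {
        name: sum(1 for t in event_times if now - span < t <= now)
        for name, span in WINDOWS.items()
    }


def can_isolate(event_times, now, limits):
    """Return True if one more isolation stays within every limit."""
    counts = count_recent(event_times, now)
    return all(counts[name] < limits[name] for name in WINDOWS)


now = datetime(2024, 8, 27, 12, 0, 0)
isolations = [
    now - timedelta(minutes=10),
    now - timedelta(days=2),
    now - timedelta(days=20),
    now - timedelta(days=40),
]
print(count_recent(isolations, now))  # {'hour': 1, 'week': 2, 'month': 3}
```

A request that fails such a check would be one candidate for the "isolation actions not taken" counts, although the plugin's actual criteria are not documented here.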


Port Level Status

This page displays a table presenting the cluster ports.


Configuration

This page is used to update the ALM plugin configuration.

The ALM configuration is divided into four sections:

  • General Configurations


  • Prediction Mode


  • NOC Mode


  • ML Model Configurations



The table presented below displays the names and descriptions of ALM jobs. These jobs are designed to predict the ports that require isolation/de-isolation. Upon enabling the ALM plugin, these ALM jobs run periodically.

| ALM Job Name | Description | Frequency |
|---|---|---|
| Port_hist | Uses the low-frequency bit error histogram counters to identify the ports that will be monitored at high frequency in the next time interval. The job generates an output file that is later read by the high-frequency telemetry monitoring job. It prioritizes links that are more susceptible to failure. | 5 or 7 minutes, based on configuration |
| Low_freq_predict | Predicts the likelihood of a port failure by analyzing low-frequency telemetry data, using only physical layer counters. The prediction works for isolated ports as well. The output of this job is a critical input for deciding whether to isolate or de-isolate ports. | 5 or 7 minutes, based on configuration |
| Data Filter | Because the ALM generates vast amounts of data in real time, it is impossible to store all of it for offline analysis. The data filter therefore applies a set of rules to select the N "most interesting" ports (N << total number of ports in the cluster) at a given time point, based on the current, past, and future telemetry data. The data of the selected ports persists for a longer time period, so it can be used offline for debugging, model validation, model training, and so forth. | 5 or 7 minutes, based on configuration |
| Metric Filter | Keeps only the data samples that are needed for online calculation of the model's performance. Specifically, it collects only samples where a failure was either predicted or actually occurred (e.g., a packet drop or symbol error event). This filtering enables the performance calculation to cover the entire history of the running plugin without storing excessive amounts of data. | 5 or 7 minutes, based on configuration |
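The two filtering jobs described in the table can be pictured with the following Python sketch. It is a simplified illustration, not the plugin's logic: the per-port score is a hypothetical stand-in for the undocumented rules that decide which ports are "most interesting", and the sample field names predicted_failure and actual_failure are assumptions.

```python
import heapq


def select_interesting_ports(scores, n):
    """Return the n port identifiers with the highest (hypothetical) score."""
    return heapq.nlargest(n, scores, key=scores.get)


def metric_filter(samples):
    """Keep samples where a failure was predicted or actually occurred."""
    return [
        s for s in samples
        if s["predicted_failure"] or s["actual_failure"]
    ]


scores = {"port_a": 0.1, "port_b": 0.9, "port_c": 0.5}
print(select_interesting_ports(scores, 2))  # ['port_b', 'port_c']

samples = [
    {"port": "port_a", "predicted_failure": False, "actual_failure": False},
    {"port": "port_b", "predicted_failure": True, "actual_failure": False},
    {"port": "port_c", "predicted_failure": False, "actual_failure": True},
]
print([s["port"] for s in metric_filter(samples)])  # ['port_b', 'port_c']
```

Keeping only the top N ports and only failure-related samples bounds storage, which is the stated reason for both filters.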

© Copyright 2024, NVIDIA. Last updated on Aug 27, 2024.