Autonomous Link Maintenance (ALM) Plugin

The Autonomous Link Maintenance (ALM) plugin performs the following tasks:

  1. Collects telemetry data from UFM and runs machine learning (ML) jobs to predict which ports need to be isolated or de-isolated

  2. Detects potential link failures and isolates the affected links to prevent disruption to traffic flow

  3. Keeps track of maintenance procedures that can be carried out to restore an isolated link

  4. After maintenance has been performed, checks whether the links can be de-isolated and brought back online
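
The task list above implies a simple link lifecycle: a link is isolated when a failure is predicted, maintained, and then re-checked before it returns to service. The following is a minimal sketch of that lifecycle as a state machine. The state names, event names, and transition table are hypothetical illustrations and are not taken from the plugin's implementation.

```python
from enum import Enum


class LinkState(Enum):
    # Hypothetical states, named only for this illustration.
    ACTIVE = "active"
    ISOLATED = "isolated"
    MAINTAINED = "maintained"  # maintenance done, awaiting a health check


class Event(Enum):
    # Hypothetical events corresponding to tasks 2 to 4 above.
    FAILURE_PREDICTED = "failure_predicted"
    MAINTENANCE_DONE = "maintenance_done"
    CHECK_PASSED = "check_passed"
    CHECK_FAILED = "check_failed"


# Allowed transitions; any other (state, event) pair leaves the state unchanged.
TRANSITIONS = {
    (LinkState.ACTIVE, Event.FAILURE_PREDICTED): LinkState.ISOLATED,
    (LinkState.ISOLATED, Event.MAINTENANCE_DONE): LinkState.MAINTAINED,
    (LinkState.MAINTAINED, Event.CHECK_PASSED): LinkState.ACTIVE,
    (LinkState.MAINTAINED, Event.CHECK_FAILED): LinkState.ISOLATED,
}


def next_state(state: LinkState, event: Event) -> LinkState:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)


# Walk one link through isolation, maintenance, and a successful re-check.
state = LinkState.ACTIVE
for event in (Event.FAILURE_PREDICTED, Event.MAINTENANCE_DONE, Event.CHECK_PASSED):
    state = next_state(state, event)
    print(event.value, "->", state.value)
```

In this sketch a link that fails its post-maintenance check goes back to the isolated state rather than returning to service.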

The ALM plugin can be deployed in either of the following ways:

  1. On UFM Appliance

  2. On UFM Software

First, download the ufm-plugin-alm-image from the NVIDIA License Portal (NLP). Then load the image on the UFM server, either through the UFM GUI (Settings -> Plugins Management tab) or from the terminal as follows:

  1. Log in to the UFM server terminal.

  2. Run:

    docker load -i <path_to_image>
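
If the load step needs to be scripted, it can be wrapped in a few lines of Python. The sketch below only assembles and runs the `docker load` command with its `-i` (input file) option. The helper names `build_load_command` and `load_image`, and the example path in the usage note, are hypothetical.

```python
import os
import subprocess


def build_load_command(image_path):
    """Build the argument list for loading a saved image archive."""
    return ["docker", "load", "-i", os.fspath(image_path)]


def load_image(image_path):
    """Run `docker load` and return its standard output.

    Raises subprocess.CalledProcessError if docker exits with an error,
    and FileNotFoundError if the docker executable is not installed.
    """
    result = subprocess.run(
        build_load_command(image_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `load_image("/tmp/ufm-plugin-alm-image.tar")` would load an archive saved at that (assumed) location and return the message that docker prints.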

Once the plugin image is loaded, the plugin should become visible in the plugins management table within the UFM GUI. To execute the plugin, right-click on the corresponding row within the table.

ALM.png

The ALM plugin collects data from the UFM Enterprise appliance using the following two methods:

  1. Low-frequency collection: Runs every 10 minutes and collects data for the following counters: hist0, hist1, hist2, hist3, hist4, phy_effective_errors, phy_symbol_errors

  2. High-frequency collection: Runs every 10 seconds and collects data for the following counters: phy_state, logical_state, link_speed_active, link_width_active, fec_mode_active, raw_ber, eff_ber, symbol_ber, phy_raw_errors_lane0, phy_raw_errors_lane1, phy_raw_errors_lane2, phy_raw_errors_lane3, phy_effective_errors, phy_symbol_errors, time_since_last_clear, hist0, hist1, hist2, hist3, hist4, switch_temperature, CableInfo.temperature, link_down_events, plr_rcv_codes, plr_rcv_code_err, plr_rcv_uncorrectable_code, plr_xmit_codes, plr_xmit_retry_codes, plr_xmit_retry_events, plr_sync_events, hi_retransmission_rate, fast_link_up_status, time_to_link_up, status_opcode, status_message, down_blame, local_reason_opcode, remote_reason_opcode, e2e_reason_opcode, num_of_ber_alarams, PortRcvRemotePhysicalErrorsExtended, PortRcvErrorsExtended, PortXmitDiscardsExtended, PortRcvSwitchRelayErrorsExtended, PortRcvConstraintErrorsExtended, VL15DroppedExtended, PortXmitWaitExtended, PortXmitDataExtended, PortRcvDataExtended, PortXmitPktsExtended, PortRcvPktsExtended, PortUniCastXmitPktsExtended, PortUniCastRcvPktsExtended, PortMultiCastXmitPktsExtended, PortMultiCastRcvPktsExtended
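
The relationship between the two collection modes can be expressed in a few lines of Python. This is an illustrative sketch only: the `CollectionMode` class and its field names are hypothetical, and the high-frequency counter list is abridged to the first twenty counters listed above. The intervals and counter names come from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionMode:
    # Hypothetical container; not part of the plugin's configuration format.
    name: str
    interval_seconds: int
    counters: tuple


LOW_FREQUENCY = CollectionMode(
    name="low_frequency",
    interval_seconds=10 * 60,  # every 10 minutes
    counters=(
        "hist0", "hist1", "hist2", "hist3", "hist4",
        "phy_effective_errors", "phy_symbol_errors",
    ),
)

HIGH_FREQUENCY = CollectionMode(
    name="high_frequency",
    interval_seconds=10,  # every 10 seconds
    counters=(  # abridged: only the first twenty counters from the list
        "phy_state", "logical_state", "link_speed_active",
        "link_width_active", "fec_mode_active", "raw_ber", "eff_ber",
        "symbol_ber", "phy_raw_errors_lane0", "phy_raw_errors_lane1",
        "phy_raw_errors_lane2", "phy_raw_errors_lane3",
        "phy_effective_errors", "phy_symbol_errors",
        "time_since_last_clear",
        "hist0", "hist1", "hist2", "hist3", "hist4",
    ),
)


def samples_per_low_cycle(low: CollectionMode, high: CollectionMode) -> int:
    """Number of high-frequency samples taken during one low-frequency interval."""
    return low.interval_seconds // high.interval_seconds


# Counters collected in both modes.
shared = set(LOW_FREQUENCY.counters) & set(HIGH_FREQUENCY.counters)

print(samples_per_low_cycle(LOW_FREQUENCY, HIGH_FREQUENCY))  # 60
print(sorted(shared))
```

Every low-frequency counter also appears in the high-frequency list, so a port that is monitored at high frequency reports the same histogram and error counters at sixty times the sampling rate.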

The table below lists the ALM jobs, their descriptions, and how often they run. The ALM jobs predict which ports need to be isolated or de-isolated, and they run periodically once the ALM plugin is enabled.

| ALM Job Name | Description | Frequency |
|---|---|---|
| Port_hist | Uses the low-frequency bit error histogram counters to determine which ports will be monitored at high frequency in the next time interval. This job outputs the file that is later read by the high-frequency telemetry monitoring job. Links that are more susceptible to failure are chosen with higher probability. | 600 seconds |
| Low_freq_predict | Predicts the likelihood of a port failure by analyzing input data from low-frequency telemetry, using only physical layer counters. The prediction works for isolated ports as well. The output of this job is a critical input for deciding whether to isolate or de-isolate ports. | 10 seconds |
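
The Port_hist selection rule, in which links that are more susceptible to failure are more likely to be chosen, can be illustrated with weighted random sampling without replacement. The sketch below is not the plugin's algorithm. The scoring function, the bin weights, the `floor` parameter, and the port names are all assumptions made for illustration.

```python
import random


def susceptibility_score(hist, bin_weights=(0, 1, 2, 4, 8)):
    """Hypothetical score: weight counts in hist0..hist4, higher bins more heavily."""
    return sum(weight * count for weight, count in zip(bin_weights, hist))


def choose_ports(port_hists, k, rng, floor=1.0):
    """Pick up to k distinct ports, each draw proportional to score + floor.

    The floor keeps every weight positive, so clean ports can still be
    selected occasionally and random.choices never sees all-zero weights.
    """
    remaining = {
        port: susceptibility_score(hist) + floor
        for port, hist in port_hists.items()
    }
    chosen = []
    while remaining and len(chosen) < k:
        ports = list(remaining)
        weights = [remaining[port] for port in ports]
        pick = rng.choices(ports, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Hypothetical hist0..hist4 readings for three ports.
example_hists = {
    "switch1/port1": (1000, 0, 0, 0, 0),     # clean link
    "switch1/port2": (900, 50, 20, 10, 5),   # noisy link
    "switch1/port3": (980, 10, 0, 0, 0),     # mildly noisy link
}

print(choose_ports(example_hists, k=2, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance makes the selection reproducible, which helps when comparing the chosen port set across runs.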

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.