Autonomous Link Maintenance (ALM) Plugin
The Autonomous Link Maintenance (ALM) plugin performs the following tasks:
Collects telemetry data from UFM and runs machine learning (ML) jobs to predict which ports need to be isolated or de-isolated
Detects potential link failures and isolates the affected links to prevent disruption to traffic flow
Keeps track of maintenance procedures that can be carried out to restore an isolated link
Once maintenance has been performed, the system checks to see if the links can be de-isolated and brought back online
The ALM plugin can be deployed in the following ways:
On UFM Appliance
On UFM Software
First, download the ufm-plugin-alm-image from the NVIDIA License Portal (NLP). Then load the image on the UFM server, either through the UFM GUI (Settings -> Plugins Management tab) or from the command line as follows:
Log in to the UFM server terminal.
Run:
docker load -i <path_to_image>
Once the plugin image is loaded, the plugin should become visible in the plugins management table within the UFM GUI. To execute the plugin, right-click on the corresponding row within the table.
The ALM plugin collects data from the UFM Enterprise appliance using the following two methods:
Low frequency collection: Runs every 10 minutes and collects data for the following counters: hist0, hist1, hist2, hist3, hist4, phy_effective_errors, phy_symbol_errors
High frequency collection: Runs every 10 seconds and collects data for the following counters: phy_state, logical_state, link_speed_active, link_width_active, fec_mode_active, raw_ber, eff_ber, symbol_ber, phy_raw_errors_lane0, phy_raw_errors_lane1, phy_raw_errors_lane2, phy_raw_errors_lane3, phy_effective_errors, phy_symbol_errors, time_since_last_clear, hist0, hist1, hist2, hist3, hist4, switch_temperature, CableInfo.temperature, link_down_events, plr_rcv_codes, plr_rcv_code_err, plr_rcv_uncorrectable_code, plr_xmit_codes, plr_xmit_retry_codes, plr_xmit_retry_events, plr_sync_events, hi_retransmission_rate, fast_link_up_status, time_to_link_up, status_opcode, status_message, down_blame, local_reason_opcode, remote_reason_opcode, e2e_reason_opcode, num_of_ber_alarams, PortRcvRemotePhysicalErrorsExtended, PortRcvErrorsExtended, PortXmitDiscardsExtended, PortRcvSwitchRelayErrorsExtended, PortRcvConstraintErrorsExtended, VL15DroppedExtended, PortXmitWaitExtended, PortXmitDataExtended, PortRcvDataExtended, PortXmitPktsExtended, PortRcvPktsExtended, PortUniCastXmitPktsExtended, PortUniCastRcvPktsExtended, PortMultiCastXmitPktsExtended, PortMultiCastRcvPktsExtended
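The two collection schedules above can be pictured as a pair of tiers, each with its own interval and counter set. The following Python sketch is illustrative only and is not part of the plugin: the names CollectionTier, LOW_FREQ, HIGH_FREQ, and tiers_due are hypothetical, and the high frequency counter tuple is an abbreviated subset of the full list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionTier:
    """Hypothetical description of one telemetry collection schedule."""
    name: str
    interval_s: int   # collection period in seconds
    counters: tuple   # counter names sampled on each run


# Intervals and counter names are taken from the text above.
LOW_FREQ = CollectionTier(
    name="low",
    interval_s=600,  # every 10 minutes
    counters=("hist0", "hist1", "hist2", "hist3", "hist4",
              "phy_effective_errors", "phy_symbol_errors"),
)

HIGH_FREQ = CollectionTier(
    name="high",
    interval_s=10,  # every 10 seconds
    # Abbreviated subset; the full list is given in the text.
    counters=("phy_state", "logical_state", "raw_ber", "eff_ber",
              "symbol_ber", "link_down_events"),
)


def tiers_due(elapsed_s, tiers=(LOW_FREQ, HIGH_FREQ)):
    """Return the names of the tiers that would run at this elapsed second."""
    return [t.name for t in tiers if elapsed_s % t.interval_s == 0]
```

Under these assumptions, tiers_due(10) reports only the high frequency tier, while tiers_due(600) reports both, because 600 seconds is a multiple of both intervals.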
The following table lists the ALM jobs and their descriptions. These jobs predict which ports need to be isolated or de-isolated, and they run periodically once the ALM plugin is enabled.
ALM Job Name | Description | Frequency |
Port_hist | Uses the low frequency bit error histogram counters to determine which ports will be monitored at high frequency in the next time interval. This job outputs the file that is later read by the high frequency telemetry monitoring job. Links that are more susceptible to failure are chosen with higher probability. | 600 seconds |
Low_freq_predict | Predicts the likelihood of a port failure by analyzing input data from low frequency telemetry, using only physical layer counters. The prediction also works for isolated ports. The output of this job is a critical input for deciding whether to isolate or de-isolate ports. | 10 seconds |
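The Port_hist description says that links more susceptible to failure are chosen with higher probability. One way to picture that idea is weighted sampling without replacement, sketched below in Python. This is not the plugin's actual algorithm: the scoring rule in port_score (weighting higher histogram bins more heavily), the function names, and the port identifiers are all assumptions made for illustration.

```python
import random


def port_score(hist):
    """Hypothetical susceptibility score from hist0..hist4 counts.

    Assumption: higher-numbered bins indicate worse links, so they
    receive larger weights. The +1.0 keeps every port selectable.
    """
    return 1.0 + sum(i * count for i, count in enumerate(hist))


def choose_ports(port_hists, k, seed=None):
    """Pick up to k distinct ports, favoring those with higher scores."""
    rng = random.Random(seed)
    candidates = {port: port_score(h) for port, h in port_hists.items()}
    chosen = []
    while candidates and len(chosen) < k:
        ports = list(candidates)
        weights = [candidates[p] for p in ports]
        pick = rng.choices(ports, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]  # sample without replacement
    return chosen


# Example input: hypothetical port names mapped to [hist0, ..., hist4].
example = {
    "sw1/p1": [900, 5, 0, 0, 0],
    "sw1/p2": [100, 40, 20, 10, 5],
    "sw2/p7": [500, 0, 0, 0, 0],
}
monitored = choose_ports(example, k=2, seed=42)
```

Every port keeps a nonzero chance of selection, so healthy-looking links are still sampled occasionally, while links with counts in the higher bins are picked far more often.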