NVIDIA UFM Cyber-AI Documentation v2.6.1

Cyber-AI Analytics

image2022-4-21_14-6-56.png

  • Network Alerts: Alerts for the entire cluster. The algorithm checks for unusual changes in several important metrics and notifies the user.

  • Tenant/Application Alerts: Triggered by PKey monitoring in the cluster. It checks the most congested PKeys for a better understanding of applications' health.

  • Link Failure Prediction: Predicts future link failures 1 to 24 hours in advance using machine learning algorithms. Each alert includes a probability indicator and the counters that most influenced its triggering.

  • Link Anomaly: Detects anomalous behavior in the cluster with a probability indicator, and identifies the most significant influencers on the anomaly.

Network Anomalies

The purpose of this tab is to detect abnormal behavior at the level of the entire cluster.

An ETL process runs hourly and calculates aggregated network statistics, while another process compares the current statistics with statistics aggregated over the previous month. If a difference of more than 20% is detected (the default value, which can be changed), the system triggers an alert with the relevant information. A recommended action can also be viewed by clicking the relevant icon for each alert.
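
The comparison described above can be illustrated with a minimal sketch. It assumes that the "difference" is a relative change against the monthly baseline. The metric names and the function names are hypothetical and are not taken from UFM Cyber-AI.

```python
def relative_change(current: float, baseline: float) -> float:
    """Return the relative difference between a current value and its baseline."""
    if baseline == 0:
        # Avoid division by zero: any non-zero value counts as an unbounded change.
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def find_network_anomalies(current_stats, monthly_baseline, threshold=0.20):
    """Return the metrics whose relative change exceeds the threshold (20% by default)."""
    alerts = {}
    for metric, value in current_stats.items():
        if metric not in monthly_baseline:
            continue  # No history for this metric, so nothing to compare against.
        change = relative_change(value, monthly_baseline[metric])
        if change > threshold:
            alerts[metric] = change
    return alerts


# Hypothetical metric names, used for illustration only.
current = {"hypothetical_congestion": 130.0, "hypothetical_bandwidth": 105.0}
baseline = {"hypothetical_congestion": 100.0, "hypothetical_bandwidth": 100.0}
alerts = find_network_anomalies(current, baseline)
print(alerts)
```

In this example only the first metric deviates by more than 20% from its baseline, so only that metric would raise an alert.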

The web UI provides a list of alerts as shown in the following:

image2022-4-21_14-8-10.png

Clicking any alert provides an additional layer of analysis that shows the anomalous parameter over three different time ranges.

network-anomalies-over-time.png


Tenant/Application Alerts

The ETL process of UFM Cyber-AI combines a partitioning key (PKey) topology with network telemetry to monitor PKey performance.

Based on normalized congestion measurements (the default threshold is greater than 70%), the system detects the most congested PKeys. This is done by counting the number of times the alert is received.

In addition, a resource allocation pie chart is available, which shows the nodes allocated to PKeys versus free nodes.
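
As a rough illustration of this ranking, the sketch below counts how often each PKey exceeds the congestion threshold and returns the most frequent ones. It assumes that congestion is normalized to the range 0 to 1 and that each threshold crossing counts as one alert. The PKey values and the function name are hypothetical.

```python
from collections import Counter


def most_congested_pkeys(samples, threshold=0.70, top_n=3):
    """Count threshold crossings per PKey and return the most frequent ones."""
    counts = Counter(
        pkey for pkey, congestion in samples if congestion > threshold
    )
    return counts.most_common(top_n)


# (pkey, normalized congestion) pairs; the values are made up for illustration.
samples = [
    ("0x8001", 0.82), ("0x8001", 0.75), ("0x8002", 0.40),
    ("0x8002", 0.91), ("0x8003", 0.10), ("0x8001", 0.65),
]
ranking = most_congested_pkeys(samples)
print(ranking)
```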

Detailed event information is provided to the user regarding PKey alerts, where the user can see PKey details and descriptions of the alert.

image2022-4-21_14-13-19.png

Clicking any PKey alert shows six graphs representing network statistics in general and for the selected PKey.

image2022-4-21_14-13-43.png

This way, the user can see the impact of a specific PKey on the entire network and check whether the PKey activity is normal in terms of both performance and duration of usage (that is, whether the activity takes place within a reasonable time).

Pic3.JPG


Link Failure Prediction

UFM Cyber-AI trains machine learning algorithms to predict future failures. Monitoring information collected over a time window (e.g., 1-24 hours) before previous, retrospectively known failures serves as the training data, from which the algorithms learn the connection between different parameters over time.

Using the machine learning algorithm, the processor derives the potential failure pattern and, for example, alerts on the expected future failure times of components. The processor repeatedly updates the alerted future failure times based on newly collected failures.

The dashboard provides a list of the ports with the most Link Failure Prediction alerts raised, and the relation between the number of alerted devices and the total number of devices in the cluster.
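
One way to picture how such training data could be assembled is the sketch below. It labels each telemetry sample with the number of hours remaining until a known failure on the same port, provided the sample falls 1 to 24 hours before that failure. This is an assumption-based illustration only: the record layout, the counter name, and the function name are hypothetical, and the actual training procedure is not described in this document.

```python
from datetime import datetime, timedelta


def label_samples(samples, failures, min_hours=1, max_hours=24):
    """Attach an hours-to-fail label to samples taken shortly before a known failure."""
    labeled = []
    for port, timestamp, counters in samples:
        label = None
        for failed_port, failure_time in failures:
            if failed_port != port:
                continue
            # Lead time in hours between the sample and the failure.
            lead = (failure_time - timestamp) / timedelta(hours=1)
            if min_hours <= lead <= max_hours:
                # Keep the nearest upcoming failure as the label.
                label = lead if label is None else min(label, lead)
        labeled.append((port, timestamp, counters, label))
    return labeled


# Hypothetical failure log and telemetry samples.
failures = [("port-1", datetime(2023, 1, 2, 12, 0))]
samples = [
    ("port-1", datetime(2023, 1, 2, 6, 0), {"hypothetical_counter": 17}),
    ("port-1", datetime(2022, 12, 30, 6, 0), {"hypothetical_counter": 2}),
    ("port-2", datetime(2023, 1, 2, 6, 0), {"hypothetical_counter": 1}),
]
labeled = label_samples(samples, failures)
for row in labeled:
    print(row[0], row[1], row[3])
```

Samples without a label (here, the one taken several days before the failure and the one from a port that never failed) would serve as negative examples under this assumption.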

image2022-4-21_14-15-35.png

In the “Top Port by link anomaly” graph, the user can filter the alerts table below by clicking any node name on the graph to add the appropriate filters to the table.

Users may see the detailed events through an event list where alert details like Node Name, Port, Hours to Fail, and alert Description are available.

PIC4.JPG

Clicking any alert in the list shows more information and recommended actions related to the alerted node. It also shows any alerts related to the cable connected to this node, if there are any. In addition, three graphs representing the counters that influenced the triggering of the alert are shown below. Several time ranges are available.

pic5.JPG

The default view provides two lines for each graph: one for current data, and another for historical data, which is calculated based on average values from the prior week.

Users can switch from the Weekly average (default) to the Day of Week average.

The Day of Week average is calculated from the statistics for the same hours and day of the week over the past month. For example, the average for 8AM–9AM on Mondays during the past month.
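
The Day of Week average can be sketched as a grouping of hourly values by weekday and hour. The example below is a simplified illustration with made-up values. It assumes that the input already covers only the past month, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def day_of_week_average(hourly_values):
    """Average hourly values per (weekday, hour) slot; Monday is weekday 0."""
    slots = defaultdict(list)
    for timestamp, value in hourly_values:
        slots[(timestamp.weekday(), timestamp.hour)].append(value)
    return {slot: mean(values) for slot, values in slots.items()}


# Four Mondays at 08:00 in January 2023, plus one Tuesday sample.
history = [
    (datetime(2023, 1, 2, 8), 10.0),
    (datetime(2023, 1, 9, 8), 20.0),
    (datetime(2023, 1, 16, 8), 30.0),
    (datetime(2023, 1, 23, 8), 40.0),
    (datetime(2023, 1, 3, 8), 100.0),
]
averages = day_of_week_average(history)
print(averages[(0, 8)])  # Average for Mondays, 8AM to 9AM
```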

image2022-4-21_14-16-12.png

Also, users can add more graphs for more counters by clicking the "Add More" button below the graphs.

PIC6.JPG

A new counter can then be chosen, and a new graph for that counter is added.

Link Anomalies

Port anomaly detection is based on defining composite metrics to reliably detect anomalies, where such metrics dynamically change, for example, according to a baseline that is determined and subsequently updated by a system.

In addition, there is a process for defining an anomaly score that provides a statistical estimation, such as the number of standard deviations, or the number of Mean Absolute Errors (MAEs) from a baseline value of the feature (i.e., metrics value), and assigning a degree of severity according to the number of standard deviations or MAEs.
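
The anomaly score described above can be sketched as follows. This is a minimal illustration that assumes a fixed history window as the baseline. The severity cut-offs and all names are hypothetical and do not reflect the values used by UFM Cyber-AI.

```python
from statistics import mean, pstdev


def anomaly_score(value, history, method="std"):
    """Return how far a value lies from the baseline, in standard deviations or MAEs."""
    baseline = mean(history)
    if method == "std":
        spread = pstdev(history)
    else:
        # Mean absolute error of the history around its own baseline.
        spread = mean(abs(x - baseline) for x in history)
    if spread == 0:
        return 0.0 if value == baseline else float("inf")
    return abs(value - baseline) / spread


def severity(score):
    """Map a score to a severity level using hypothetical cut-offs."""
    if score >= 3:
        return "critical"
    if score >= 2:
        return "warning"
    return "normal"


history = [10.0, 12.0, 8.0, 10.0]
std_score = anomaly_score(13.0, history, method="std")
mae_score = anomaly_score(13.0, history, method="mae")
print(std_score, severity(std_score))
print(mae_score, severity(mae_score))
```

The same value can receive a different severity depending on whether the deviation is measured in standard deviations or in MAEs, which is why the choice of unit is part of the metric definition.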

The dashboard provides a list of the top ports reporting link anomalies, including the number of times an anomaly was detected and statistics on the number of alerted devices relative to the total number of devices in the cluster.

linkanomaly.png

In the “Top Port by link anomaly” graph, the user can filter the alerts in the table below by clicking any node name on the graph to add the appropriate filters to the table.

Users can also see detailed events in the events list where the alert details such as Node Name, Probability, and Alert Description are available.

image2022-4-21_14-17-11.png

image2022-4-21_14-17-32.png

Clicking any alert in the list shows more information and recommended actions related to the alerted node. It also shows any alerts related to the cable connected to this node, if there are any. In addition, three graphs representing the counters that influenced the triggering of the alert are shown below. Several time ranges are available.

pic7.JPG

The default view provides two lines for each graph: one for current data, and another for historical data, which is calculated based on average values from the prior week.

Users can switch from the Weekly average (default) to the Day of Week average.

The Day of Week average is calculated from the statistics for the same hours and day of the week over the past month. For example, the average for 8AM–9AM on Mondays during the past month.

image2022-4-21_14-18-26.png

Also, users can add more graphs for more counters by clicking the "Add More" button below the graphs.

Pic_8.JPG

add-counter.JPG

A new counter can then be chosen, and a new graph for that counter is added.

Logical Server Alerts

Logical server data collection and analytic jobs are disabled by default. To enable them, update the related flags in the scheduler_settings.cfg file:

[analytics_job::logical_server_port_join]
interval = 300
delay = 720
max_input = 12
standard_timeout = 180
enabled = true

[analytics_job::logical_server_aggr]
interval = 300
delay = 780
max_input = 12
standard_timeout = 180
enabled = true

[data_prep_ufm::logical_server]
interval = 60
delay = 60
skip_collection = false
json_collection = false
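
To verify such a fragment before deploying it, the settings can be parsed with Python's standard configparser module. The sketch below is only a sanity check on an in-memory copy of the fragment. It assumes that `enabled = true` switches on an analytic job and that `skip_collection = false` switches on data collection. That interpretation is an assumption and is not confirmed by this document.

```python
import configparser

# In-memory copy of the fragment shown above.
SETTINGS = """
[analytics_job::logical_server_port_join]
interval = 300
delay = 720
max_input = 12
standard_timeout = 180
enabled = true

[analytics_job::logical_server_aggr]
interval = 300
delay = 780
max_input = 12
standard_timeout = 180
enabled = true

[data_prep_ufm::logical_server]
interval = 60
delay = 60
skip_collection = false
json_collection = false
"""

parser = configparser.ConfigParser()
parser.read_string(SETTINGS)

status = {}
for section in parser.sections():
    if section.startswith("analytics_job::"):
        # Assumption: "enabled" controls whether the analytic job runs.
        status[section] = parser.getboolean(section, "enabled")
    else:
        # Assumption: collection runs when "skip_collection" is false.
        status[section] = not parser.getboolean(section, "skip_collection")

for section, active in status.items():
    print(section, "active" if active else "inactive")
```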

The ETL process of UFM Cyber-AI combines the logical server topology with network telemetry, allowing the performance of logical servers to be monitored.

Based on utilization measurements (the default threshold is greater than 70%), the system detects the most utilized logical servers. This is done by counting the number of times the alert is received.

In addition, a resource allocation pie chart is available, which shows the nodes allocated to logical servers compared to free nodes.

Detailed event information is provided to the user regarding logical server alerts, where the user can see logical server details and a description of the alert.

image2022-4-21_14-22-31.png

Clicking any logical server alert shows six graphs representing network statistics in general and per selected logical server.

image2022-4-21_14-23-1.png

This way, the user can see the impact of a specific logical server on the entire network and check whether the logical server activity is normal in terms of both performance and duration of usage (that is, whether the activity takes place within a reasonable time).

image2022-4-21_14-23-36.png


Recommended Actions

A recommended action is available for all alert types. The user can click any alert in the alerts table on each page to see the recommended actions for that alert.

recommended-actions.JPG


Specification Description

The purpose of this module is to analyze the anomalies previously found by the ML models and to identify possible common ground among them.

image2021-12-11_12-12-54.png

The table above shows the number of anomalies found by the ML model for each combination of attributes, such as the roles of the source and destination (endpoint, core, tor), cable parameters (length, Pn, Sn, Version, Type, Width), and NIC type.

Event Flow Charts

Total Anomalies Over Time

Number of anomalies over time:

image2021-12-11_12-14-54.png


Anomalies Influencers

Shows the number of anomalies for each combination of influencers.

image2022-4-21_15-52-25.png

Global interactive and general filters can be applied by clicking on any entity in the dashboard.

A different time range can be chosen by clicking "Last 12 hours".

image2021-12-11_12-19-50.png

Clicking on reset will clear all of the filters.

Specification Description

The present invention generally relates to the detection of anomalies over cables and to understanding degradation mechanisms in order to improve stability in data centers.

This innovation includes the detection of trends, intrusion, and any abnormal behavior of cables.

Moreover, with analysis of degradation over time we can determine better future performance strategies.

Customer Output

Threshold Alerts Tab

threshold-alert1.JPG

threshold-alert2.JPG


Deviation from Usual Behavior Tab

deviation1.JPG

deviation2.JPG

Background Art

Cable Anomaly Detection

  1. There are 5 measurements from the management tool (IB) with four thresholds per measure; see the Ethernet example below.

    module_voltage
    Channel_*_tx_power
    Channel_*_rx_power
    Channel_*_tx_bias
    module_temp

  2. A 5-dimensional (5D) Gaussian mixture model (GMM) clusters channel and threshold behavior.

    image2021-12-11_12-30-2.png

  3. To indicate an alert, UFM Cyber-AI calculates, for every new data entry and per measurement, the probabilistic deviation from the channel centroid.

  4. The system defines a probability rate for the deviation indicated above.

  5. Each event per measurement is unique to a node, port, and SN.

  6. For user convenience, the current measurement is shown against the pre-defined thresholds in a tachometer.

  7. The trend graph is updated for every entry chosen in the table.

  8. The trend graph shows the trend of the chosen measurement, to detect abnormal behavior over time.
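
The idea of a probabilistic deviation from a centroid can be illustrated with a deliberately simplified sketch. Instead of the 5D GMM mentioned above, it models a single measurement as one Gaussian and computes the two-sided probability of a value lying at least as far from the centroid as the observed one. The centroid, the standard deviation, the alert cut-off, and the function names are all hypothetical.

```python
import math


def deviation_probability(value, centroid, std_dev):
    """Two-sided probability that a Gaussian value lies at least this far from the centroid."""
    z = abs(value - centroid) / std_dev
    return math.erfc(z / math.sqrt(2))


def is_deviation_alert(value, centroid, std_dev, min_probability=0.01):
    """Raise an alert when the observed deviation is very unlikely (hypothetical cut-off)."""
    return deviation_probability(value, centroid, std_dev) < min_probability


# Hypothetical module_temp cluster: centroid 40.0, standard deviation 2.0.
for reading in (40.0, 44.0, 50.0):
    p = deviation_probability(reading, 40.0, 2.0)
    print(reading, round(p, 6), is_deviation_alert(reading, 40.0, 2.0))
```

A multivariate mixture model would replace the single centroid with one centroid per cluster and weigh the clusters by their mixture coefficients. The single-Gaussian version only shows how a distance from a centroid turns into a probability.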

Introduction

Analytic jobs are critical components in UFM Cyber-AI. Each analytic job has a specific task to accomplish and runs periodically in a docker container. The jobs process raw data collected from UFM Telemetry and generate informative data that is displayed to the user in the form of alerts, which can be used in making decisions. The processing includes splitting the data into chunks of 5 minutes, calculating the delta (the difference between counter values), aggregating the data (hourly, day of week, topology, and PKey), and running inference on the data to detect alerts.

Job Types

  1. File Splitter: This job splits the file if it contains more than one timestamp.

  2. Delta Processing: This job calculates the delta between the current sample and the sample from the previous 5 minutes.

  3. Hourly Aggregation: This job aggregates all delta files from the previous hour into one CSV file.

  4. Network Hourly Aggregation: Similar to hourly aggregation, but averages over all network nodes.

  5. DOW Aggregation: Collects the CSV files from the same day of the week (DOW) and the same hour to be aggregated.

  6. Network DOW Aggregation: Similar to DOW aggregation, but averages over all network nodes.

  7. Network Anomaly: Analyzes the network hourly data with the network DOW aggregation and looks for anomalies.

  8. Topology Aggregation: Merges data collected from hourly aggregation, cables, and UFM topology files, and generates a file to be used by ML hourly aggregation.

  9. ML Hourly Anomaly: Analyzes the topology merged file using ML model files and looks for link anomaly alerts.

  10. ML Hourly Model: Analyzes the topology merged file using ML model files and looks for link failure prediction alerts.

  11. ML Weekly Aggregation: Updates the ML model used by ML hourly aggregation based on the weekly collected topology.

  12. PKEY Port Join: Merges the delta output files with the PKEY data and generates a file to be input for the PKEY aggregation.

  13. PKEY Aggregation: Analyzes the joined PKEY data and looks for PKEY (tenant) alerts.

  14. Logical Server Join: Merges the delta output files with the logical server data and generates a file to be input for the logical server aggregation.

  15. Logical Servers Aggregation: Analyzes the joined logical server data and looks for logical server alerts.

  16. Cable Daily: Analyzes cable counter files and looks for cable threshold and deviation alerts.

  17. Weekly Aggregation: Calculates the weekly average of the hourly data, so that the hourly data can be displayed against the weekly average for the same hour.
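
The first stages of this pipeline (delta processing, hourly aggregation, and network hourly aggregation) can be sketched in a few lines. The example below is an illustration under stated assumptions: samples are (timestamp in seconds, port, counter value) tuples, deltas are summed within each hour, and counter wrap-around is ignored. The actual file formats and aggregation functions used by the jobs are not described in this document, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def compute_deltas(samples):
    """Return the difference between consecutive counter samples of each port."""
    deltas = []
    previous = {}
    for timestamp, port, counter in sorted(samples):
        if port in previous:
            deltas.append((timestamp, port, counter - previous[port]))
        previous[port] = counter
    return deltas


def hourly_aggregate(deltas):
    """Sum the deltas of each port within each hour (assumed aggregation)."""
    totals = defaultdict(int)
    for timestamp, port, delta in deltas:
        totals[(timestamp // 3600, port)] += delta
    return dict(totals)


def network_hourly_aggregate(hourly):
    """Average the hourly totals over all ports in the network."""
    per_hour = defaultdict(list)
    for (hour, _port), total in hourly.items():
        per_hour[hour].append(total)
    return {hour: mean(values) for hour, values in per_hour.items()}


# Counter samples taken every 5 minutes (300 seconds) on two hypothetical ports.
samples = [
    (0, "port-A", 100), (300, "port-A", 150), (600, "port-A", 210),
    (0, "port-B", 10), (300, "port-B", 30), (600, "port-B", 40),
]
deltas = compute_deltas(samples)
hourly = hourly_aggregate(deltas)
network = network_hourly_aggregate(hourly)
print(deltas)
print(hourly)
print(network)
```

The later jobs in the list consume these outputs: the DOW aggregation groups hourly results by weekday and hour, and the anomaly jobs compare the current hour with those aggregates.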

Output Sample

image2021-12-11_12-39-0.png

© Copyright 2023, NVIDIA. Last updated on Dec 11, 2023.