NVIDIA UFM Cyber-AI Documentation v2.13.0

Cyber-AI Analytics

Anomaly Analysis

image-2025-7-16_2-37-10-version-1-modificationdate-1754593611395-api-v2.png

Cluster Status: Periodically collects information about the links.
Link Anomaly: Detects anomalous behavior in the cluster, with a probability indicator, and identifies the most significant influencers on each anomaly.
Link Failure Prediction: Predicts future link failures 1 to 24 hours in advance using machine learning algorithms, with a probability indicator.

UFM Cyber-AI trains machine learning algorithms to predict future failures. It collects monitoring information (i.e., the training data for the machine learning algorithms) over a time window (e.g., 1-24 hours) preceding previous failures that are known in retrospect, and the algorithms learn the connection between different parameters over time.

Using the machine learning algorithm, the processor derives the potential failure pattern, for example by alerting on the future failure times of components. The processor repeatedly updates the alerted future failure times based on newly collected failures.
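The training-data window described above can be illustrated with a minimal sketch. It assumes that a monitoring sample is labeled as "pre-failure" when a known failure follows it within 1 to 24 hours. The function name `label_samples` and the parameters `min_lead` and `max_lead` are hypothetical and are not part of UFM Cyber-AI.

```python
from datetime import datetime, timedelta


def label_samples(sample_times, failure_times,
                  min_lead=timedelta(hours=1), max_lead=timedelta(hours=24)):
    """Return 1 for each sample followed by a known failure within the lead window, else 0.

    Hypothetical helper: the 1-24 hour window mirrors the example in the text,
    not a documented UFM Cyber-AI setting.
    """
    labels = []
    for ts in sample_times:
        in_window = any(min_lead <= failure - ts <= max_lead
                        for failure in failure_times)
        labels.append(1 if in_window else 0)
    return labels


# Illustrative data only: one known failure and four monitoring samples.
failures = [datetime(2024, 10, 3, 12, 0)]
samples = [
    datetime(2024, 10, 3, 6, 0),    # 6 hours before the failure
    datetime(2024, 10, 2, 6, 0),    # 30 hours before the failure
    datetime(2024, 10, 3, 11, 30),  # 30 minutes before the failure
    datetime(2024, 10, 3, 13, 0),   # after the failure
]
labels = label_samples(samples, failures)
```

A real pipeline would label samples per link and feed the counters collected in the window to the model. This sketch only shows how the retrospective window selects training examples.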

The dashboard displays a table of 'Switch to Switch' and 'Switch to Host' link failure prediction alerts, along with the number of alerted devices relative to the total number of devices in the cluster.

image-2024-10-3_12-2-35-version-1-modificationdate-1754593651236-api-v2.png

In the “Predicted Failures” table, clicking any value adds the corresponding filter to the alerts table below.

Users may see the detailed events through an event list where alert details like Node Name, Port, Occurrence and Probability are available.

image-2024-10-3_11-21-10-version-1-modificationdate-1754593650229-api-v2.png

When clicking on the arrow icon in the alert row, the table will expand and will show the history for the specific link.

image-2024-10-3_11-22-13-version-1-modificationdate-1754593650590-api-v2.png

Clicking on any alert in the list displays five graphs representing the counters that influenced the alert's triggering, with several time ranges available.

The default view provides two lines for each graph: one for the current data, and another for the calculated historical data based on average values from the prior week.

Users can switch between Weekly average (default) and Day of Week average.

Day of Week average is calculated from the statistics for the same hour and day of the week over the past month. For example, the average for 8AM–9AM on Mondays during the past month.
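The difference between the two baselines can be sketched as follows. This is a minimal illustration, not the product's implementation. It assumes samples are `(timestamp, value)` pairs, that "the past month" means the previous 28 days, and that the weekly average is taken per hour of day. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def weekly_average(samples, now, hour):
    """Mean of values seen at `hour` on any day during the 7 days before `now`."""
    start = now - timedelta(days=7)
    values = [v for ts, v in samples if start <= ts < now and ts.hour == hour]
    return mean(values) if values else None


def day_of_week_average(samples, now, weekday, hour):
    """Mean of values seen at `hour` on `weekday` (Monday is 0) during the 28 days before `now`."""
    start = now - timedelta(days=28)
    values = [v for ts, v in samples
              if start <= ts < now and ts.weekday() == weekday and ts.hour == hour]
    return mean(values) if values else None


# Illustrative data only: three Mondays and one Tuesday, all at 8AM.
samples = [
    (datetime(2024, 10, 7, 8), 30.0),   # Monday
    (datetime(2024, 10, 14, 8), 20.0),  # Monday
    (datetime(2024, 10, 21, 8), 10.0),  # Monday
    (datetime(2024, 10, 22, 8), 40.0),  # Tuesday
]
now = datetime(2024, 10, 28)
weekly = weekly_average(samples, now, hour=8)
mondays = day_of_week_average(samples, now, weekday=0, hour=8)
```

With this data the weekly baseline mixes Monday and Tuesday values from the last seven days, while the Day of Week baseline uses only the Monday values from the last four weeks.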

image-2024-10-3_11-25-50-version-1-modificationdate-1754593650910-api-v2.png

Also, users can add more graphs for more counters by clicking the "Add More" button below the graphs.

image-2024-2-6_16-35-16-1-version-1-modificationdate-1754593644277-api-v2.png

A new counter can then be chosen, and a graph for that counter is added.

Additionally, the page shows the top telemetry table, which is collapsed by default.

image-2024-10-3_13-30-34-version-1-modificationdate-1754593654994-api-v2.png

The recommended actions section will always appear at the bottom of the page with a reference to the user manual.

image-2024-10-3_12-5-31-version-1-modificationdate-1754593651597-api-v2.png

Link Anomaly

Port anomaly detection is based on defining composite metrics to reliably detect anomalies, where such metrics dynamically change, for example, according to a baseline that is determined and subsequently updated by a system.

In addition, an anomaly score provides a statistical estimate of how far a feature (i.e., a metric value) is from its baseline value, expressed as the number of standard deviations or the number of Mean Absolute Errors (MAEs). A degree of severity is assigned according to that number of standard deviations or MAEs.
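The scoring idea can be illustrated with a short sketch. It assumes the baseline is the mean of recent values of a metric, and that severity is chosen by comparing the score against fixed cut-offs. The cut-off values and severity labels below are hypothetical and are not taken from UFM Cyber-AI.

```python
from statistics import mean, pstdev


def std_score(value, baseline_values):
    """Distance of `value` from the baseline mean, in standard deviations."""
    baseline = mean(baseline_values)
    spread = pstdev(baseline_values)
    if spread == 0:
        return 0.0 if value == baseline else float("inf")
    return abs(value - baseline) / spread


def mae_score(value, baseline_values):
    """Distance of `value` from the baseline mean, in Mean Absolute Errors."""
    baseline = mean(baseline_values)
    mae = mean([abs(v - baseline) for v in baseline_values])
    if mae == 0:
        return 0.0 if value == baseline else float("inf")
    return abs(value - baseline) / mae


def severity(score, cutoffs=((6.0, "critical"), (4.0, "major"), (2.0, "minor"))):
    """Map a score to a severity label; the cut-offs are illustrative assumptions."""
    for limit, label in cutoffs:
        if score >= limit:
            return label
    return "normal"


# Illustrative baseline: mean 5.0, standard deviation 2.0, MAE 1.5.
history = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
score_std = std_score(11.0, history)
score_mae = mae_score(11.0, history)
```

Because the baseline is recomputed from recent data, the same raw value can score differently over time, which matches the dynamically changing metrics described above.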

The dashboard provides the following views:

Time Filter

Users can filter link anomaly data by time via searching for either an absolute or relative time.

image-2024-10-3_12-6-41-version-1-modificationdate-1754593651902-api-v2.png

Event Flow Charts

Event flow charts display anomalies between devices, with each link in the chart describing the number of anomalies between two devices.

The width of the link reflects the number of anomalies that occurred between the two devices.

image-2024-10-3_12-7-11-version-1-modificationdate-1754593652205-api-v2.png

By default, the 'Top 10' button is selected, displaying the top 10 devices with the highest occurrences of anomalies.
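The counts behind such a chart can be sketched as below. This is a minimal illustration that assumes each anomaly record is a `(node, partner_node)` pair and that links are treated as undirected. The function names and device names are hypothetical.

```python
from collections import Counter


def link_anomaly_counts(anomalies):
    """Count anomalies per device pair, treating (a, b) and (b, a) as the same link."""
    return Counter(tuple(sorted(pair)) for pair in anomalies)


def top_devices(anomalies, n=10):
    """Return the `n` devices that take part in the most anomalies."""
    per_device = Counter()
    for node, partner in anomalies:
        per_device[node] += 1
        per_device[partner] += 1
    return [device for device, _ in per_device.most_common(n)]


# Illustrative records only.
anomalies = [("sw1", "sw2"), ("sw2", "sw1"), ("sw1", "host3")]
link_counts = link_anomaly_counts(anomalies)
busiest = top_devices(anomalies, n=1)
```

In this sketch the per-link count is the quantity that the link width in the chart would reflect, and `top_devices` corresponds to the 'Top 10' selection.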

Clicking on a link or a device filters the table below.

Anomaly Details

The table below represents the anomaly details such as Last Anomaly time, Number of Occurrences, Node and Partner Node.

Users can filter the anomaly details table by clicking on either a device or a link in the event flow charts.

anomalydetails-version-1-modificationdate-1754593645826-api-v2.PNG

When clicking on the arrow icon in the anomaly node row, the table will expand and show the full history for the specific alert.

anomalydetails2-version-1-modificationdate-1754593646107-api-v2.PNG

Total Anomalies Over Time

When clicking on any anomaly node, an over-time graph will appear, showing all counters related to that anomaly.

The chart will display the number of anomalies over time, with the time scale based on the selected filter.

image-2024-10-3_13-15-26-version-1-modificationdate-1754593653101-api-v2.png

When clicking on a counter from the expanded table, the chart will display the counter values over time.

image-2024-10-3_13-24-11-version-1-modificationdate-1754593654324-api-v2.png

Link Anomaly Snapshots

Clicking on the anomaly node will display all telemetry counters for the selected port, starting from the selected time range. The table is collapsed by default.

image-2024-10-3_13-29-42-version-1-modificationdate-1754593654640-api-v2.png

Cluster Status

The Cluster Status dashboard provides information about the cluster and the distribution of attribute values according to the selected time range.

The dashboard provides the following views:

Filters

Users can filter link status data by time by searching for either an absolute or relative time, or by link attribute by clicking on the 'All Filters' button.

Clicking on the 'All Filters' button will open the following modal:

filters1-version-1-modificationdate-1754593647334-api-v2.PNG

This modal filters the link status dashboard by the selected attribute values. Each attribute represents a dropdown containing the available values.

Additionally, clicking on the reset button will reset all filters.

Attributes

The most important attributes will be displayed as histograms and donut charts.

By clicking on a graph, the entire link status dashboard will be filtered based on the selection in the graph.

For example, clicking on "Disabled" in the "Physical Mngr Fsm State" graph will filter all the other graphs and tables by this selection.

image-2024-10-3_14-6-20-version-1-modificationdate-1754593655607-api-v2.png

Users can add more graphs for more attributes by clicking the "Add More" button below the graphs.

attributes2-version-1-modificationdate-1754593647944-api-v2.PNG

Counters

The most important counters will be displayed as histograms.

By clicking on a graph, the entire link status dashboard will be filtered based on the selection. Users can also filter the x-axis values (counter values) using the slider above.

image-2024-10-3_14-19-19-version-1-modificationdate-1754593656264-api-v2.png

Additional graphs for more counters can be added by clicking the "Add More" button below the graphs.

add_counter-version-1-modificationdate-1754593648539-api-v2.PNG

Counters Threshold

The user can assign a threshold value for each counter. A new file, /opt/ufm/cyber-ai/conf/counters_threshold.cfg, has been added for this purpose:

[raw_ber]
threshold = 1E-4

[eff_ber]
normal_range = 1E-8

[symbol_ber]
normal_range = 1E-10

Any value equal to the threshold will be highlighted in orange, while any value exceeding the threshold will be highlighted in red.
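The highlighting rule can be sketched with Python's standard `configparser`. This is only an illustration of the rule as stated above, not the dashboard's code. The sample file uses both `threshold` and `normal_range` as key names, so the sketch assumes either key may hold the threshold value. The function names `load_thresholds` and `highlight` are hypothetical.

```python
import configparser

# Inline copy of part of the sample configuration, so the sketch runs without the real file.
SAMPLE_CFG = """
[raw_ber]
threshold = 1E-4

[eff_ber]
normal_range = 1E-8
"""


def load_thresholds(text):
    """Parse the INI text into a {counter_name: threshold} mapping."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    thresholds = {}
    for counter in parser.sections():
        section = parser[counter]
        # Assumption: accept either key name that appears in the sample file.
        raw = section.get("threshold", section.get("normal_range"))
        if raw is not None:
            thresholds[counter] = float(raw)
    return thresholds


def highlight(value, threshold):
    """Return the highlight color for a counter value, or None if below the threshold."""
    if value > threshold:
        return "red"
    if value == threshold:
        return "orange"
    return None


thresholds = load_thresholds(SAMPLE_CFG)
color = highlight(1e-3, thresholds["raw_ber"])
```

Exact equality on floating-point values is rarely hit in practice, so a real implementation might compare within a tolerance. The sketch keeps the literal rule for clarity.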

Users can view the threshold value by hovering over the question mark icon next to the counter name, which will display a popover containing a short description of the chart and the threshold value.

image-2024-10-3_15-19-14-version-1-modificationdate-1754593656563-api-v2.png

Links Snapshots

The table below shows the link snapshot details, including the device name, node GUID, port number, and related counters. By default, the table is collapsed and will automatically expand when a new counter or attribute is added, or when the user clicks on a chart to filter the data. When new attributes or counters are added, they will be included as columns in the table.

image-2024-10-3_15-22-18-version-1-modificationdate-1754593656895-api-v2.png

Anomalies

Clicking on a snapshot in the Links Snapshots table will display the most important counters as time graphs. These charts will show the counter values over time, with the time scale corresponding to the selected time filter.

anomalities-version-1-modificationdate-1754593649238-api-v2.PNG

Users can add more graphs for more counters by clicking the "Add More" button below the graphs.

anomalities2-version-1-modificationdate-1754593649565-api-v2.PNG

Job Analytics

Introduction

Analytic jobs are critical components of Cyber-AI. Each analytic job has a specific task and runs periodically in a Docker container. The jobs process raw data collected from UFM Telemetry and generate informative data that is displayed to the user in the form of alerts to support decision making. Processing includes splitting the data into 5-minute chunks, calculating the delta (the difference between counter values), aggregating the data (hourly, day of week, topology, and PKey), and running inference on the data to detect alerts.

Job Types

File Splitter: Splits the file if it contains more than one timestamp.
Delta Processing: Calculates the delta between the current sample and the sample from the previous 5 minutes.
Hourly Aggregation: Aggregates all delta files from the previous hour into one CSV file.
DOW Aggregation: Collects the CSV files from the same day of the week (DOW) and the same hour, and aggregates them.
Topology Aggregation: Merges data collected from hourly aggregation, cables, and UFM topology files, and generates a file to be used by ML hourly aggregation.
ML Hourly Anomaly: Analyzes the topology merged file using ML model files and looks for link anomaly alerts.
ML Failure Prediction Aggregation: Analyzes the delta output using an ML model and predicts port failures.
Weekly Aggregation: Computes a weekly average of the hourly data, so that the hourly data can be compared with the weekly average for the same hour.
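The first stages of this pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the jobs' actual code. It assumes each 5-minute sample is a mapping of port to counter values, and that hourly aggregation sums the deltas. The function names, the port key format, and the counter name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def compute_deltas(previous, current):
    """Per-port counter differences between two consecutive 5-minute samples."""
    deltas = {}
    for port, counters in current.items():
        if port not in previous:
            continue  # no earlier sample to compare against
        deltas[port] = {name: value - previous[port][name]
                        for name, value in counters.items()
                        if name in previous[port]}
    return deltas


def aggregate_hourly(delta_samples):
    """Sum deltas per (hour, port); `delta_samples` is a list of (timestamp, deltas)."""
    totals = defaultdict(lambda: defaultdict(int))
    for ts, deltas in delta_samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        for port, counters in deltas.items():
            for name, delta in counters.items():
                totals[(hour, port)][name] += delta
    return {key: dict(value) for key, value in totals.items()}


# Illustrative data only: one port, one counter, two 5-minute intervals.
previous = {"sw1:1": {"symbol_errors": 10}}
current = {"sw1:1": {"symbol_errors": 13}}
delta = compute_deltas(previous, current)
hourly = aggregate_hourly([
    (datetime(2024, 10, 3, 11, 5), delta),
    (datetime(2024, 10, 3, 11, 10), delta),
])
```

The hourly result is the kind of input that the DOW, topology, and weekly aggregation jobs would then build on.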

Output Sample

image2021-12-11_12-39-0-version-1-modificationdate-1754593637703-api-v2.png
