NVIDIA UFM Cyber-AI Documentation v2.11.0

Cyber-AI Analytics

image-2024-10-6_16-50-48-version-1-modificationdate-1738834309427-api-v2.png

  • Cluster Status: Collects information about the links periodically.

  • Link Anomaly: Detects anomalous behavior in the cluster with a probability indicator, and identifies the factors that most significantly influenced each anomaly.

  • Link Failure Prediction: Predicts future link failures 1 to 24 hours in advance using machine learning algorithms, with a probability indicator.

  • Network Alerts: Alerts for the entire cluster. The algorithm checks for unusual changes in several important metrics and notifies the user.

  • Tenant/Application Alerts: Triggered by PKey monitoring in the cluster. It checks the most congested PKeys for a better understanding of application health.

Network Alerts

The purpose of this tab is to detect abnormal behavior at the level of the entire cluster.

An ETL process runs hourly and calculates aggregated network statistics, while another process compares the current statistics with statistics aggregated over the previous month. If a difference of more than 20% is detected (the default value, which can be changed), the system triggers an alert with the relevant information. The recommended action for each alert can be viewed by clicking the relevant icon for that alert.
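The comparison described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the metric names, the function names, and the use of a relative difference against the monthly baseline are assumptions made for the example.

```python
def relative_change(current: float, baseline: float) -> float:
    """Return the relative difference between current and baseline (0.25 means 25%)."""
    if baseline == 0:
        # Avoid division by zero: any non-zero value counts as an unbounded deviation.
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def find_network_alerts(current_stats, monthly_baseline, threshold=0.20):
    """Return {metric: relative_change} for metrics that deviate by more than threshold."""
    alerts = {}
    for metric, value in current_stats.items():
        if metric not in monthly_baseline:
            continue  # No baseline yet for this metric, so nothing to compare against.
        change = relative_change(value, monthly_baseline[metric])
        if change > threshold:
            alerts[metric] = change
    return alerts


# Hypothetical hourly statistics and the matching monthly aggregates.
current = {"avg_bandwidth_gbps": 130.0, "avg_congestion": 0.105}
baseline = {"avg_bandwidth_gbps": 100.0, "avg_congestion": 0.10}
print(find_network_alerts(current, baseline))
```

Only the bandwidth metric is reported here, because it differs from its baseline by 30% while the congestion metric differs by about 5%.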

The web UI provides a list of alerts as shown in the following:

image-2024-2-6_16-15-35-version-1-modificationdate-1738834326827-api-v2.png

Clicking any alert opens an additional layer of analysis that shows the recommended actions and the anomalous parameter of the selected alert over three different time ranges.

network_alerts1-version-1-modificationdate-1738834326527-api-v2.PNG

network_alerts2-version-1-modificationdate-1738834326150-api-v2.PNG

Also, users can add more graphs for more counters by clicking the "Add More" button below the graphs.

Tenant/Application Alerts

The ETL process of UFM Cyber-AI combines a partitioning key (PKey) topology with network telemetry to monitor PKey performance.

Based on normalized congestion measurements (the default threshold is greater than 70%), the system detects the most congested PKeys. This is done by counting the number of times the alert is received.
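A minimal sketch of this ranking is shown below. It assumes that each sample is a (PKey, normalized congestion) pair and that an alert is counted whenever a sample exceeds the threshold; the sample data and function name are hypothetical.

```python
from collections import Counter


def most_congested_pkeys(samples, threshold=0.70, top_n=3):
    """Count threshold crossings per PKey and return the top_n as (pkey, count) pairs.

    samples: iterable of (pkey, normalized_congestion) with congestion in [0, 1].
    """
    counts = Counter()
    for pkey, congestion in samples:
        if congestion > threshold:
            counts[pkey] += 1
    return counts.most_common(top_n)


# Hypothetical measurements for three PKeys.
samples = [
    ("0x8001", 0.82),
    ("0x8002", 0.40),
    ("0x8001", 0.91),
    ("0x8003", 0.75),
    ("0x8002", 0.71),
]
print(most_congested_pkeys(samples))
```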

In addition, a resource allocation pie chart is available, which shows the nodes allocated to PKeys compared to free nodes.

Detailed event information is provided to the user regarding PKey alerts, where the user can see PKey details and descriptions of the alert.

tenant-version-1-modificationdate-1738834325720-api-v2.PNG

Clicking any PKey alert shows six graphs representing network statistics in general and for the selected PKey.

image2022-4-21_14-13-43-version-1-modificationdate-1738834350967-api-v2.png

This way, the user can see the impact of a specific PKey on the entire network and check whether the PKey activity is normal in terms of both performance and duration of usage (i.e., whether the activity is happening within a reasonable time).

Pic3-version-1-modificationdate-1738834356917-api-v2.JPG

Link Failure Prediction

UFM Cyber-AI trains machine learning algorithms to predict future failures. It collects monitoring information (i.e., training data for the machine learning algorithms) over a time window (e.g., 1-24 hours) preceding previous failures that are known in retrospect, and has the algorithms learn the connection between different parameters over time.

Using the machine learning algorithm, the processor derives the potential failure pattern by, for example, alerting future failure times of components. The processor repeatedly updates the alerted future failure times based on newly collected failures.
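One way to picture how training data could be labeled from retrospectively known failures is sketched below. This is an illustrative assumption, not the documented training procedure: a monitoring sample is marked positive when a known failure followed it by 1 to 24 hours. The function name and the lead-time parameters are hypothetical.

```python
from datetime import datetime, timedelta


def label_samples(sample_times, failure_times,
                  min_lead=timedelta(hours=1), max_lead=timedelta(hours=24)):
    """Return a 0/1 label per sample: 1 if a failure occurred 1 to 24 hours later."""
    labels = []
    for sample_time in sample_times:
        positive = any(
            min_lead <= failure_time - sample_time <= max_lead
            for failure_time in failure_times
        )
        labels.append(1 if positive else 0)
    return labels


# A single known failure and samples taken 30h, 24h, 6h, 1h, and 0.5h before it.
failure = datetime(2024, 10, 3, 12, 0)
sample_times = [failure - timedelta(hours=h) for h in (30, 24, 6, 1, 0.5)]
print(label_samples(sample_times, [failure]))  # [0, 1, 1, 1, 0]
```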

The dashboard displays a table showing 'Switch to Switch' and 'Switch to Host' link failure prediction alerts, along with the relation between Alerted and the Total number of devices in the cluster.

image-2024-10-3_12-2-35-version-1-modificationdate-1738834316433-api-v2.png

In the “Predicted Failures” table, the user can click any value to apply the corresponding filter to the alerts table below.

Users may see the detailed events through an event list where alert details like Node Name, Port, Occurrence and Probability are available.

image-2024-10-3_11-21-10-version-1-modificationdate-1738834317243-api-v2.png

Clicking the arrow icon in an alert row expands the table to show the history for the specific link.

image-2024-10-3_11-22-13-version-1-modificationdate-1738834316947-api-v2.png

Clicking on any alert in the list displays five graphs representing the counters that influenced the alert's triggering, with several time ranges available.

The default view provides two lines for each graph: one for the current data, and another for the calculated historical data based on average values from the prior week.

Users can switch from the Weekly average (default) to the Day of Week average.

The Day of Week average is calculated from the statistics for the same hours and day of the week over the past month. For example, the average for 8 AM–9 AM on Mondays during the past month.
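The two baselines can be sketched as bucketed averages. The example below is an assumption-laden illustration: it approximates "the past month" as 28 days, treats the weekly average as a per-hour mean over the prior 7 days, and uses hypothetical function names and sample data.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def bucket_average(samples, now, lookback, key):
    """Average sample values per bucket for timestamps in [now - lookback, now)."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        if now - lookback <= timestamp < now:
            buckets[key(timestamp)].append(value)
    return {bucket: mean(values) for bucket, values in buckets.items()}


def weekly_average(samples, now):
    """Per-hour average over the prior week (assumed to be 7 days)."""
    return bucket_average(samples, now, timedelta(days=7), lambda ts: ts.hour)


def day_of_week_average(samples, now):
    """Per (weekday, hour) average over the past month (assumed to be 28 days)."""
    return bucket_average(samples, now, timedelta(days=28),
                          lambda ts: (ts.weekday(), ts.hour))


# Four Monday samples taken between 8 AM and 9 AM (weekday() == 0 is Monday).
now = datetime(2024, 10, 7, 8, 0)
samples = [(datetime(2024, 9, day, 8, 30), value)
           for day, value in ((9, 10.0), (16, 20.0), (23, 30.0), (30, 40.0))]
print(day_of_week_average(samples, now))
print(weekly_average(samples, now))
```

With this data, the Day of Week bucket for Monday 8 AM averages all four samples, while the weekly bucket for 8 AM contains only the sample from the prior week.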

image-2024-10-3_11-25-50-version-1-modificationdate-1738834316690-api-v2.png

Also, users can add more graphs for more counters by clicking the "Add More" button below the graphs.

image-2024-2-6_16-35-16-1-version-1-modificationdate-1738834323907-api-v2.png

A new counter can then be chosen, and a graph for that counter is added.

The page also shows the top telemetry table, which is collapsed by default.

image-2024-10-3_13-30-34-version-1-modificationdate-1738834311540-api-v2.png

The recommended actions section will always appear at the bottom of the page with a reference to the user manual.

image-2024-10-3_12-5-31-version-1-modificationdate-1738834316053-api-v2.png

Link Anomaly

Port anomaly detection is based on defining composite metrics to reliably detect anomalies, where such metrics dynamically change, for example, according to a baseline that is determined and subsequently updated by a system.

In addition, there is a process for defining an anomaly score that provides a statistical estimation, such as the number of standard deviations, or the number of Mean Absolute Errors (MAEs) from a baseline value of the feature (i.e., metrics value), and assigning a degree of severity according to the number of standard deviations or MAEs.
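The scoring idea can be illustrated with a short sketch. Everything specific here is an assumption for the example: the baseline is taken as the mean of recent values, the MAE variant is computed as the mean absolute deviation from that baseline, and the severity cut-offs (2, 3, and 4) are hypothetical.

```python
from statistics import mean, pstdev


def anomaly_score(value, baseline_values, use_mae=False):
    """Distance of value from the baseline, in standard deviations or in MAEs."""
    center = mean(baseline_values)
    if use_mae:
        spread = mean([abs(v - center) for v in baseline_values])
    else:
        spread = pstdev(baseline_values)
    if spread == 0:
        # A flat baseline: any change at all is an unbounded deviation.
        return 0.0 if value == center else float("inf")
    return abs(value - center) / spread


def severity(score):
    """Map a score to a severity label (thresholds are illustrative only)."""
    if score >= 4:
        return "critical"
    if score >= 3:
        return "major"
    if score >= 2:
        return "minor"
    return "normal"


baseline_values = [10, 12, 8, 10, 10]
std_score = anomaly_score(14, baseline_values)
mae_score = anomaly_score(14, baseline_values, use_mae=True)
print(severity(std_score), severity(mae_score))
```

To follow a baseline that is "determined and subsequently updated", the caller would recompute `baseline_values` from a sliding window before each scoring pass.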

The dashboard provides the following views:

Time Filter

Users can filter link anomaly data by time via searching for either an absolute or relative time.

image-2024-10-3_12-6-41-version-1-modificationdate-1738834315717-api-v2.png

Event Flow Charts

Event flow charts display anomalies between devices, with each link in the chart describing the number of anomalies between two devices.

The width of the link reflects the number of anomalies that occurred between the two devices.

image-2024-10-3_12-7-11-version-1-modificationdate-1738834315347-api-v2.png

By default, the 'Top 10' button is selected, displaying the top 10 devices with the highest occurrences of anomalies.

Clicking a link or a device filters the table below.
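The data behind such a chart can be sketched as two counts: anomalies per link (which would drive the link width) and anomalies per device (which would drive the Top 10 selection). The input format and function name below are assumptions for illustration.

```python
from collections import Counter


def summarize_anomalies(anomalies, top_n=10):
    """Count anomalies per link and return the top_n devices by anomaly count.

    anomalies: iterable of (node, partner_node) pairs, one per anomaly.
    """
    link_counts = Counter()
    device_counts = Counter()
    for node, partner in anomalies:
        # Sort the pair so that A-B and B-A are counted as the same link.
        link_counts[tuple(sorted((node, partner)))] += 1
        device_counts[node] += 1
        device_counts[partner] += 1
    return link_counts, device_counts.most_common(top_n)


events = [("sw1", "sw2"), ("sw2", "sw1"), ("sw1", "host3")]
links, top_devices = summarize_anomalies(events)
print(links)
print(top_devices)
```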

Anomaly Details

The table below represents the anomaly details such as Last Anomaly time, Number of Occurrences, Node and Partner Node.

Users can filter the anomaly details table by clicking on either a device or a link in the event flow charts.

anomalydetails-version-1-modificationdate-1738834322010-api-v2.PNG

Clicking the arrow icon in an anomaly node expands the table to show the full history for the specific alert.

anomalydetails2-version-1-modificationdate-1738834321677-api-v2.PNG

Total Anomalies Over Time

Clicking any anomaly node displays a graph over time that shows all counters related to that anomaly.

The chart will display the number of anomalies over time, with the time scale based on the selected filter.

image-2024-10-3_13-15-26-version-1-modificationdate-1738834313840-api-v2.png

When clicking on a counter from the expanded table, the chart will display the counter values over time.

image-2024-10-3_13-24-11-version-1-modificationdate-1738834312187-api-v2.png

Link Anomaly Snapshots

Clicking the anomaly node displays all telemetry counters for the selected port, starting from the selected time range. The table is collapsed by default.

image-2024-10-3_13-29-42-version-1-modificationdate-1738834311963-api-v2.png

Cluster Status

This tab provides information about the cluster and the distribution of attribute values according to the selected time range.

The dashboard provides the following views:

Filters

Users can filter link status data by time by searching for either an absolute or relative time, or by link attribute by clicking on the 'All Filters' button.

filters-version-1-modificationdate-1738834320693-api-v2.PNG

Clicking on the 'All Filters' button will open the following modal:

filters1-version-1-modificationdate-1738834320367-api-v2.PNG

This modal filters the link status dashboard by the selected attribute values. Each attribute is represented by a dropdown containing the available values.

Additionally, clicking on the reset button will reset all filters.

Attributes

The most important attributes will be displayed as histograms and donut charts.

By clicking on a graph, the entire link status dashboard will be filtered based on the selection in the graph.

For example, clicking on "Disabled" in the "Physical Mngr Fsm State" graph will filter all the other graphs and tables by this selection.

image-2024-10-3_14-6-20-version-1-modificationdate-1738834310783-api-v2.png

Users can add more graphs for more attributes by clicking the "Add More" button below the graphs.

attributes2-version-1-modificationdate-1738834319623-api-v2.PNG

Counters

The most important counters will be displayed as histograms.

By clicking on a graph, the entire link status dashboard will be filtered based on the selection. Users can also filter the x-axis values (counter values) using the slider above.

image-2024-10-3_14-19-19-version-1-modificationdate-1738834310260-api-v2.png

Additional graphs for more counters can be added by clicking the "Add More" button below the graphs.

add_counter-version-1-modificationdate-1738834318913-api-v2.PNG

Counters Threshold

The user can assign a threshold value for each counter in the /opt/ufm/cyber-ai/conf/counters_threshold.cfg file:


[raw_ber]
threshold = 1E-4

[eff_ber]
normal_range = 1E-8

[symbol_ber]
normal_range = 1E-10

Any value equal to the threshold will be highlighted in orange, while any value exceeding the threshold will be highlighted in red.
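As an illustration of how such a file could be read and applied, the sketch below parses the sample with Python's standard `configparser` and maps a counter value to a highlight color. The helper names are hypothetical, and accepting either `threshold` or `normal_range` as the key is an assumption based only on the sample above.

```python
import configparser

SAMPLE_CFG = """
[raw_ber]
threshold = 1E-4

[eff_ber]
normal_range = 1E-8

[symbol_ber]
normal_range = 1E-10
"""


def load_thresholds(text):
    """Return {counter_name: threshold} parsed from INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    thresholds = {}
    for counter in parser.sections():
        section = parser[counter]
        # The sample uses both key names, so accept either one (assumption).
        raw = section.get("threshold", fallback=section.get("normal_range"))
        if raw is not None:
            thresholds[counter] = float(raw)
    return thresholds


def highlight(value, threshold):
    """Return 'red' above the threshold, 'orange' at the threshold, otherwise None."""
    if value > threshold:
        return "red"
    if value == threshold:
        return "orange"
    return None


thresholds = load_thresholds(SAMPLE_CFG)
print(thresholds)
print(highlight(2e-4, thresholds["raw_ber"]))
```

Exact equality on floating-point values is fragile, so a real comparison would probably use a tolerance; the strict check is kept here only to mirror the rule as stated.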

Users can view the threshold value by hovering over the question mark icon next to the counter name, which will display a popover containing a short description of the chart and the threshold value.

image-2024-10-3_15-19-14-version-1-modificationdate-1738834310020-api-v2.png

Links Snapshots

The table below shows the link snapshot details, including the device name, node GUID, port number, and related counters. By default, the table is collapsed and will automatically expand when a new counter or attribute is added, or when the user clicks on a chart to filter the data. When new attributes or counters are added, they will be included as columns in the table.

image-2024-10-3_15-22-18-version-1-modificationdate-1738834309740-api-v2.png

Anomalies

Clicking on a snapshot in the Links Snapshots table will display the most important counters as time graphs. These charts will show the counter values over time, with the time scale corresponding to the selected time filter.

anomalities-version-1-modificationdate-1738834318237-api-v2.PNG

Users can add more graphs for more counters by clicking the "Add More" button below the graphs.

anomalities2-version-1-modificationdate-1738834317933-api-v2.PNG

Logical Server Alerts

Logical server data collection and analytic jobs are disabled by default. To enable them, update the related flags in the scheduler_settings.cfg file:


[analytics_job::logical_server_port_join]
interval = 300
delay = 720
max_input = 12
standard_timeout = 180
enabled = true

[analytics_job::logical_server_aggr]
interval = 300
delay = 780
max_input = 12
standard_timeout = 180
enabled = true

[data_prep_ufm::logical_server]
interval = 60
delay = 60
skip_collection = false
json_collection = false
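A quick way to check that the flags are set as shown is sketched below using Python's standard `configparser`. The helper name is hypothetical, and reading `skip_collection = false` as "collection is active" is an assumption based on the flag name.

```python
import configparser

SAMPLE_SETTINGS = """
[analytics_job::logical_server_port_join]
interval = 300
delay = 720
max_input = 12
standard_timeout = 180
enabled = true

[analytics_job::logical_server_aggr]
interval = 300
delay = 780
max_input = 12
standard_timeout = 180
enabled = true

[data_prep_ufm::logical_server]
interval = 60
delay = 60
skip_collection = false
json_collection = false
"""

LOGICAL_SERVER_JOBS = (
    "analytics_job::logical_server_port_join",
    "analytics_job::logical_server_aggr",
)


def logical_server_jobs_enabled(text):
    """Return True if both analytic jobs are enabled and collection is not skipped."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    jobs_on = all(
        parser.getboolean(job, "enabled", fallback=False)
        for job in LOGICAL_SERVER_JOBS
    )
    collecting = not parser.getboolean(
        "data_prep_ufm::logical_server", "skip_collection", fallback=True
    )
    return jobs_on and collecting


print(logical_server_jobs_enabled(SAMPLE_SETTINGS))
```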

The ETL process of UFM Cyber-AI combines the topology of the logical servers with network telemetry, allowing the performance of logical servers to be monitored.

Based on utilization measurements (the default threshold is greater than 70%), the system detects the most utilized logical servers. This is done by counting the number of times the alert is received.

In addition, a resource allocation pie chart is available, which shows the nodes allocated to logical servers compared to free nodes.

Detailed event information is provided to the user regarding logical server alerts, where the user can see logical server details and a description of the alert.

image2022-4-21_14-22-31-version-1-modificationdate-1738834353740-api-v2.png

Clicking any logical server alert shows six graphs representing network statistics in general and per selected logical server.

image2022-4-21_14-23-1-version-1-modificationdate-1738834354080-api-v2.png

This way, the user can see the impact of a specific logical server on the entire network and check whether the logical server activity is normal in terms of both performance and duration of usage (i.e., whether the activity is happening within a reasonable time).

image2022-4-21_14-23-36-version-1-modificationdate-1738834354507-api-v2.png

Recommended Actions

A recommended action is available for all alert types. The user can click any alert in the alerts table on each page to see the recommended actions for that alert.

recommended-actions-version-1-modificationdate-1738834361377-api-v2.JPG

Specification Description

The present invention generally relates to detecting anomalies in cables and understanding degradation mechanisms in order to improve stability in data centers.

This innovation includes the detection of trends, intrusion, and any abnormal behavior of cables.

Moreover, with analysis of degradation over time we can determine better future performance strategies.

Customer Output

Threshold Alerts Tab

threshold-alert1-version-1-modificationdate-1738834361660-api-v2.JPG

threshold-alert2-version-1-modificationdate-1738834362033-api-v2.JPG

Deviation from Usual Behavior Tab

deviation1-version-1-modificationdate-1738834362347-api-v2.JPG

deviation2-version-1-modificationdate-1738834362700-api-v2.JPG

Background Art

Cable Anomaly Detection

  1. There are five measurements from the management tool (IB), with four thresholds per measurement; see the Ethernet example below.


    module_voltage
    Channel_*_tx_power
    Channel_*_rx_power
    Channel_*_tx_bias
    module_temp

  2. There is a 5D (dimensions) GMM model which clusters channel and threshold behavior.

    image2021-12-11_12-30-2-version-1-modificationdate-1738834332490-api-v2.png

  3. To indicate an alert, UFM Cyber-AI calculates, for every new data entry and per measurement, the probabilistic deviation of the entry from the channel centroid.

  4. The system defines a probability rate for the deviation indicated above.

  5. Each event per measurement is unique to a node, port, and SN.

  6. For user convenience, the current measurement is shown against the predefined thresholds in a tachometer.

  7. For every entry chosen in the table, the trend graph is updated.

  8. The trend graph shows the trend of the chosen measurement, to detect abnormal behavior over time.
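The per-measurement deviation check in steps 3 and 4 can be illustrated with a deliberately simplified sketch. It does not fit a 5D GMM: as a stand-in, it models each measurement with a single Gaussian around the channel centroid and reports the probability mass that lies closer to the centroid than the new entry. The measurement keys, the 0.99 alert cut-off, and the function names are assumptions for the example.

```python
import math
from statistics import mean, pstdev

MEASUREMENTS = ("module_voltage", "tx_power", "rx_power", "tx_bias", "module_temp")


def fit_centroid(history):
    """Return {measurement: (mean, std)} computed from a list of measurement dicts."""
    centroid = {}
    for name in MEASUREMENTS:
        values = [row[name] for row in history]
        centroid[name] = (mean(values), pstdev(values))
    return centroid


def deviation_probability(value, center, std):
    """Probability that a Gaussian sample lies closer to the center than value does."""
    if std == 0:
        return 0.0 if value == center else 1.0
    z = abs(value - center) / std
    return math.erf(z / math.sqrt(2))


def check_entry(entry, centroid, alert_probability=0.99):
    """Return {measurement: probability} for measurements that deviate strongly."""
    alerts = {}
    for name, (center, std) in centroid.items():
        probability = deviation_probability(entry[name], center, std)
        if probability >= alert_probability:
            alerts[name] = probability
    return alerts


# Hypothetical history for one channel, followed by a new entry with a hot module.
history = [
    {"module_voltage": 3.30, "tx_power": -1.0, "rx_power": -2.0, "tx_bias": 6.0, "module_temp": 40.0},
    {"module_voltage": 3.32, "tx_power": -1.1, "rx_power": -2.1, "tx_bias": 6.1, "module_temp": 42.0},
    {"module_voltage": 3.28, "tx_power": -0.9, "rx_power": -1.9, "tx_bias": 5.9, "module_temp": 38.0},
]
entry = {"module_voltage": 3.30, "tx_power": -1.0, "rx_power": -2.0, "tx_bias": 6.0, "module_temp": 55.0}
print(check_entry(entry, fit_centroid(history)))
```

A mixture model would replace the single centroid with one centroid per cluster and evaluate the entry against the cluster it belongs to.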

Introduction

Analytic jobs are critical components of UFM Cyber-AI. Each analytic job has a specific task and runs periodically in a Docker container. The jobs process raw data collected from UFM Telemetry and generate informative data that is displayed to the user in the form of alerts to support decision-making. Data processing includes splitting the data into 5-minute chunks, calculating the delta (the difference between counter values), aggregating the data (hourly, day of week, topology, and PKey), and running inference on the data to detect alerts.

Job Types

  1. File Splitter: This job splits the file if it contains more than one timestamp.

  2. Delta Processing: This job calculates the delta from the current sampling and the previous 5 minutes.

  3. Hourly Aggregation: This job aggregates all delta files in the previous hour into one csv file.

  4. Network Hourly Aggregation: Similar to hourly aggregation, but averages over all network nodes.

  5. DOW Aggregation: Collects the CSV files for the same day of the week (DOW) and the same hour, to be aggregated.

  6. Network DOW Aggregation: Similar to DOW aggregation, but averages over all network nodes.

  7. Network Anomaly: Analyzes the network hourly data with the network DOW aggregation and looks for anomalies.

  8. Topology Aggregation: Merges data collected from hourly aggregation, cables, and UFM topology files, and generates a file to be used by ML hourly aggregation.

  9. ML hourly Anomaly: Analyzes the topology merged file using ML model files and looks for link anomaly alerts.

  10. ML Weekly Aggregation: Updates the ML model used by ML hourly aggregation based on the weekly collected topology.

  11. PKEY Port Join: Merges the delta output files with the PKEY data and generates a file to be input for the PKEY aggregation.

  12. PKEY Aggregation: Analyzes the joined PKEY data and looks for PKEY (tenant) alerts.

  13. Logical Server Join: Merges the delta output files with the logical server data and generates a file to be input for the logical server aggregation.

  14. Logical Servers Aggregation: Analyzes the joined logical server data and looks for logical server alerts.

  15. Cable Daily: Analyzes cable counter files and looks for cable threshold and deviation alerts.

  16. Weekly Aggregation: Computes a weekly average of the hourly data so that each hour's data can be displayed against the weekly average for that hour.
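The first stages of this pipeline, delta processing and hourly aggregation, can be sketched as follows. The in-memory data layout, the function names, and the choice of the mean as the hourly aggregate are assumptions for illustration; the actual jobs operate on CSV files.

```python
from collections import defaultdict
from statistics import mean


def compute_delta(previous, current):
    """Return per-port counter differences between two consecutive samples.

    previous, current: {port: {counter_name: cumulative_value}}
    """
    delta = {}
    for port, counters in current.items():
        if port not in previous:
            continue  # A port seen for the first time has no earlier sample.
        delta[port] = {
            name: value - previous[port][name]
            for name, value in counters.items()
            if name in previous[port]
        }
    return delta


def hourly_aggregate(deltas):
    """Average each counter delta per port over all deltas collected in one hour."""
    collected = defaultdict(lambda: defaultdict(list))
    for delta in deltas:
        for port, counters in delta.items():
            for name, value in counters.items():
                collected[port][name].append(value)
    return {
        port: {name: mean(values) for name, values in counters.items()}
        for port, counters in collected.items()
    }


# Three hypothetical 5-minute samples of a cumulative counter on one port.
s0 = {"sw1:1": {"rx_bytes": 100}}
s1 = {"sw1:1": {"rx_bytes": 160}}
s2 = {"sw1:1": {"rx_bytes": 200}}
deltas = [compute_delta(s0, s1), compute_delta(s1, s2)]
print(hourly_aggregate(deltas))
```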

Output Sample

image2021-12-11_12-39-0-version-1-modificationdate-1738834332013-api-v2.png

© Copyright 2025, NVIDIA. Last updated on Feb 10, 2025.