NVIDIA UFM Enterprise User Manual v6.23.1

Telemetry

UFM Telemetry allows the collection and monitoring of InfiniBand fabric port statistics, such as network bandwidth, congestion, errors, latency, and more. The Telemetry view provides:

  • Real-time monitoring views

  • Monitoring of multiple attributes

  • Intelligent Counters for error and congestion monitoring

  • InfiniBand port-based error counters

  • Congestion measurement based on the InfiniBand XmitWait counter

  • InfiniBand port-based bandwidth data

  • Rearrangement via a straightforward drag-and-drop function

  • Resizing by hovering over the panel's border

By default, UFM collects two types of telemetry data, each serving different monitoring purposes:

Primary Telemetry (High-Frequency)

  • Default Sample Rate: 30 seconds

  • Use Cases:

    • Real-time monitoring

    • UFM dashboard charts

    • Port threshold event detection

    • Live telemetry sessions

  • Counters: Collects approximately 30 key performance counters covering bandwidth, congestion, and error metrics

  • Historical Data: Collected every 5 minutes and stored in UFM's SQLite database

Secondary Telemetry (Low-Frequency)

  • Default Sample Rate: 300 seconds (5 minutes)

  • Use Cases:

    • Historical analysis

    • Detailed diagnostics

    • Extended monitoring scenarios

  • Counters: Collects approximately 120 extended counters for comprehensive fabric analysis

Telemetry instances can run in one of two modes:

  • Legacy Mode (via UFM): In this mode, telemetry instances are invoked during UFM startup and fully managed by UFM.

  • Clustered Telemetry (UTM) Mode: In this mode, telemetry instances are managed by the UFM Telemetry Manager (UTM) plugin.

Overview

UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. This feature provides improved performance and scalability for large-scale deployments through workload distribution.

Key Benefits

  • Better Performance: Workload distribution across multiple instances reduces collection bottlenecks

  • HCA Utilization: Leverages multiple network adapters for parallel data collection

  • Scalability: Handles larger fabric deployments more efficiently

  • Flexibility: Customizable instance distribution based on your infrastructure

Prerequisites

  • UFM Telemetry Manager (UTM) Plugin must be deployed and enabled

Switching from Legacy to Clustered Telemetry (UTM) Mode

Follow these steps to enable Clustered Telemetry using UTM mode:

Step 1: Start UFM

Ensure UFM is running on your system.

Step 2: Deploy UTM Plugin

  1. Navigate to Settings > Plugin Management in the UFM WebUI

  2. Locate the UFM Telemetry Manager (UTM) plugin

  3. Click Enable to activate the plugin

Step 3: Configure Telemetry Mode

  1. Edit the UFM configuration file:

    vi /opt/ufm/files/conf/gv.cfg
  2. Locate the [Telemetry] section and set the following parameters:

    [Telemetry]
    primary_telemetry_legacy_mode = false
    secondary_telemetry_legacy_mode = false

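If you prefer to script this edit rather than open the file in an editor, the two flags can be flipped non-interactively with sed. The sketch below works on a scratch copy so it is safe to run anywhere; on a live system you would point sed at /opt/ufm/files/conf/gv.cfg instead.

```shell
# Illustration only: operate on a scratch copy, not the live gv.cfg.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[Telemetry]
primary_telemetry_legacy_mode = true
secondary_telemetry_legacy_mode = true
EOF

# Flip both legacy-mode flags to false in place.
sed -i 's/^\(primary_telemetry_legacy_mode\) = .*/\1 = false/; s/^\(secondary_telemetry_legacy_mode\) = .*/\1 = false/' "$cfg"

cat "$cfg"
```

The same two `s///` expressions applied to the real configuration file accomplish Step 3 in one command.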
Step 4: Restart UFM

Restart the UFM service to apply changes:

/etc/init.d/ufmd restart

Alternatively, restart only the telemetry service:

/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start

Configuration Options

Automatic Configuration (Default)

When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration. This is the recommended approach for most deployments.

Default Behavior:

  • Detects all available HCAs on the system

  • Creates one primary and one secondary telemetry instance on the first HCA

  • Configuration is stored in: /opt/ufm/files/conf/utm/{hostname}_instances_matrix.json

Example Auto-Generated Matrix:


{
  "mlx5_0": { "primary": 1, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 0 }
}


Custom Configuration

For advanced deployments, you can customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.

Auto-Detect and Create Configuration

Automatically detect HCAs and create a configuration file:


/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Manual Custom Configuration

Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT:


/opt/ufm/scripts/generate_telemetry_config.sh \
  /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
  mlx5_0:2:1 \
  mlx5_1:0:2 \
  mlx5_2:1:0

This creates:

  • mlx5_0: 2 primary instances, 1 secondary instance

  • mlx5_1: 0 primary instances, 2 secondary instances

  • mlx5_2: 1 primary instance, 0 secondary instances

Example Custom Matrix:


{
  "mlx5_0": { "primary": 2, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 2 },
  "mlx5_2": { "primary": 1, "secondary": 0 }
}

Validate Configuration

Verify your matrix file is correctly formatted:


/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Get Help

Display usage information and options:


/opt/ufm/scripts/generate_telemetry_config.sh --help

Note

Important: After modifying the matrix configuration file, you must restart UFM for changes to take effect.

Advanced Configuration Parameters

The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.

Parameter                        Section      Default  Description

dashboard_interval               [Server]     30       Sample rate (seconds) for primary telemetry instances

secondary_sample_rate            [Telemetry]  300      Sample rate (seconds) for secondary telemetry instances

primary_telemetry_legacy_mode    [Telemetry]  true     Set to false to enable UTM mode for primary telemetry

secondary_telemetry_legacy_mode  [Telemetry]  true     Set to false to enable UTM mode for secondary telemetry
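Taken together, the defaults above correspond to the following gv.cfg fragment (only the keys listed here are shown; the sections contain other settings as well):

```ini
[Server]
dashboard_interval = 30

[Telemetry]
secondary_sample_rate = 300
primary_telemetry_legacy_mode = true
secondary_telemetry_legacy_mode = true
```
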

Note

Changing sample rates affects data frequency and may impact system performance. Consult NVIDIA support before modifying these values in production environments.

High-Frequency (Primary) Telemetry Instance

A default telemetry session that collects a predefined set of ~30 counters covering bandwidth, congestion, and error metrics, which UFM analyzes and reports.

These counters are used for:

  • Default Telemetry Session - An ongoing session used by UFM to display WebUI dashboard charts and to monitor and analyze port threshold events (the session interval is 30 seconds by default)

  • Real-Time Telemetry - Allows users to define live telemetry sessions for monitoring small subsets of devices or ports and a selected set of counters. For more information, refer to Telemetry.

  • Historical Telemetry - Based on the primary telemetry; collects statistical data from all fabric ports and stores it in an internal UFM SQLite database (the session interval is 5 minutes by default)

REST API:

  • For Default and Real-Time Telemetry: Monitoring REST API

  • For Historical Telemetry: History Telemetry Sessions REST API → History Telemetry Sessions

Low-Frequency (Secondary) Telemetry Instance

Operates automatically upon UFM startup, offering an extended scope of ~120 counters. For a list of the Secondary Telemetry Fields, refer to Low-Frequency (Secondary) Telemetry Fields.

REST API: N/A

For direct telemetry endpoint access, which exposes the list of supported counters:

For the High-Frequency (Primary) Telemetry Instance, run the following command:


curl -s 127.0.0.1:9001/csv/cset/converted_enterprise

For the Low-Frequency (Secondary) Telemetry Instance, run the following command:


curl -s 127.0.0.1:9002/csv/xcset/low_freq_debug
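These endpoints return comma-separated rows whose first line is a header of counter names. Assuming that shape (the sample header below is a made-up illustration; the real one comes from the endpoint), the supported counters can be listed one per line:

```shell
# Hypothetical CSV sample for illustration -- on a live system, pipe the
# output of `curl -s 127.0.0.1:9001/csv/cset/converted_enterprise` instead.
sample='timestamp,port_guid,PortXmitData,PortRcvData,PortXmitWait
1700000000,0x0002c90300a1b2c4,1024,2048,3'

# The first row is the header; print one counter name per line.
printf '%s\n' "$sample" | head -n 1 | tr ',' '\n'
```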


Storage Considerations

UFM periodically collects fabric port statistics and saves them in its SQLite database. Before starting up UFM Enterprise, consider the following disk space utilization for various fabric sizes and durations.

The measurements in the table below were taken with sampling interval set to once per 30 seconds.

Note

Be aware that the default sampling rate is once per 300 seconds; adjust the disk utilization calculations accordingly.

Number of Nodes  Ports per Node  Storage per Hour  Storage per 15 Days   Storage per 30 Days

16               8               1.6 MB            576 MB (0.563 GB)     1152 MB (1.125 GB)

100              8               11 MB             3960 MB (3.867 GB)    7920 MB (7.734 GB)

500              8               50 MB             18000 MB (17.58 GB)   36000 MB (35.16 GB)

1000             8               100 MB            36000 MB (35.16 GB)   72000 MB (70.31 GB)
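Most rows of the table work out to roughly 12.5 KB of database growth per port per hour at the 30-second interval (for example, 500 nodes × 8 ports = 4,000 ports → 50 MB/hour; the 100-node row runs slightly higher). Under that assumption, a rough sizing helper for other fabric sizes might look like this (an estimate derived from the table, not an official formula):

```shell
# Rough sizing sketch: ~12.5 KB of storage per port per hour at a
# 30-second sampling interval, as implied by the table above.
estimate_mb_per_hour() {
  nodes=$1
  ports_per_node=$2
  awk -v n="$nodes" -v p="$ports_per_node" \
    'BEGIN { printf "%.1f\n", n * p * 12.5 / 1000 }'
}

estimate_mb_per_hour 500 8   # compare with the 500-node table row
```

Multiply the hourly figure by 24 × retention days to approximate the totals in the last two columns.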


© Copyright 2025, NVIDIA. Last updated on Nov 20, 2025