Telemetry
UFM Telemetry allows the collection and monitoring of InfiniBand fabric port statistics, such as network bandwidth, congestion, errors, latency, and more.
Real-time monitoring views
Monitoring of multiple attributes
Intelligent counters for error and congestion monitoring
InfiniBand port-based error counters
InfiniBand congestion XmitWait counter-based congestion measurement
InfiniBand port-based bandwidth data
Telemetry Session Panels Supported Actions
Rearrangement via a straightforward drag-and-drop function
Resizing by hovering over the panel's border
Understanding Telemetry Types
By default, UFM collects two types of telemetry data, each serving a different monitoring purpose:
Primary Telemetry (High-Frequency)
Default Sample Rate: 30 seconds
Use Cases:
Real-time monitoring
UFM dashboard charts
Port threshold event detection
Live telemetry sessions
Counters: Collects approximately 30 key performance counters covering bandwidth, congestion, and error metrics
Historical Data: Collected every 5 minutes and stored in UFM's SQLite database
Secondary Telemetry (Low-Frequency)
Default Sample Rate: 300 seconds (5 minutes)
Use Cases:
Historical analysis
Detailed diagnostics
Extended monitoring scenarios
Counters: Collects approximately 120 extended counters for comprehensive fabric analysis
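The two default rates differ by an order of magnitude in data volume; a quick back-of-envelope sketch of what each rate means per counter:

```shell
# Samples collected per counter per day at each default sample rate (sketch).
echo "primary:   $((86400 / 30)) samples/day (30 s rate)"
echo "secondary: $((86400 / 300)) samples/day (300 s rate)"
```

Primary telemetry therefore produces ten times as many samples per counter, which is why it carries the smaller counter set.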
Telemetry Management Methods
Legacy Mode (via UFM): In this mode, telemetry instances are invoked during UFM startup and fully managed by UFM.
Clustered Telemetry (UTM) Mode: In this mode, telemetry instances are managed by the UFM Telemetry Manager (UTM) plugin.
UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. It improves performance and scalability for large-scale deployments through workload distribution.
Key Benefits
Better Performance: Workload distribution across multiple instances reduces collection bottlenecks
HCA Utilization: Leverages multiple network adapters for parallel data collection
Scalability: Handles larger fabric deployments more efficiently
Flexibility: Customizable instance distribution based on your infrastructure
Prerequisites
UFM Telemetry Manager (UTM) Plugin must be deployed and enabled
Deployment Types
UFM Clustered Telemetry supports two deployment scenarios. Choose the appropriate configuration method based on your deployment type:
Deployment Type | Description | Configuration Method |
Single node (Standalone) | Single UFM node, telemetry collected locally | Manual |
HA Cluster | Multiple nodes with shared configuration, telemetry aggregated across all nodes | configure_utm_mode.py script |
Key Differences
Aspect | Standalone | HA Cluster |
Bind Address | Not required | Required (all node IPs) |
Configuration Scope | Single node | Shared across cluster |
Important: Choose the correct configuration method for your deployment. Using the wrong method may result in inaccessible telemetry endpoints or duplicate data collection.
Switching From Legacy to Clustered Telemetry (UTM) Mode
Standalone Deployment Configuration
This section applies to single node (standalone) deployments where UFM runs on a single node.
Step 1: Deploy UTM Plugin
Navigate to Settings > Plugin Management in the UFM WebUI
Locate the UFM Telemetry Manager (UTM) plugin
Click Enable to activate the plugin
Step 2: Configure Telemetry Mode
Edit the UFM configuration file:
Edit the UFM configuration file:
vi /opt/ufm/files/conf/gv.cfg
Locate the [Telemetry] section and set the following parameters:
[Telemetry]
primary_telemetry_legacy_mode = false
secondary_telemetry_legacy_mode = false
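The two flags above can also be flipped non-interactively. The sketch below demonstrates this on a temporary copy of the section; the sed pattern is an assumption, not an official UFM tool, and on a real system you would point `cfg` at /opt/ufm/files/conf/gv.cfg instead:

```shell
# Sketch: set both legacy-mode flags to false without opening an editor.
# Runs against a temp copy; substitute the real gv.cfg path in practice.
cfg=$(mktemp)
printf '[Telemetry]\nprimary_telemetry_legacy_mode = true\nsecondary_telemetry_legacy_mode = true\n' > "$cfg"
sed -i 's/_telemetry_legacy_mode = true/_telemetry_legacy_mode = false/' "$cfg"
grep legacy_mode "$cfg"   # both flags now read "= false"
```

Always back up gv.cfg before scripted edits, since it is shared by other UFM subsystems.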
Step 3: (Optional) Configure Instance Matrix
For custom HCA distribution, refer to Configuration Options - Instance Matrix.
Step 4: Start or Restart UFM
If UFM is not running, start it:
/etc/init.d/ufmd start
If UFM is already running, restart to apply changes:
/etc/init.d/ufmd restart
Alternatively, restart only the telemetry service:
/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start
HA Cluster Deployment Configuration
This section applies to High Availability (HA) cluster deployments (Active-Active deployments only) where multiple nodes share a common gv.cfg configuration file and telemetry must be aggregated across all cluster nodes.
The configure_utm_mode.py script automates the configuration by:
Setting bind addresses to 0.0.0.0 for external telemetry access
Configuring additional_cset_urls in gv.cfg for multi-node telemetry aggregation
Managing legacy mode flags
Updating environment files for proper endpoint configuration
Prerequisites
HA cluster must be configured in active-active mode.
/var/lib/ufm_ha/ha_state file present (or explicit node IPs available)
UTM Plugin deployed (can be enabled before or after configuration)
UFM configured in Infra mode
Recommended: Configure Before Starting UFM
This is the preferred approach as it avoids unnecessary service restarts.
Step 1: Deploy UTM Plugin
Ensure the UTM plugin is deployed on all cluster nodes. You can deploy via CLI or by using ufm_infra_feature_flag.py.
Step 2: Run the Configuration Script
Option A: Auto-detect node IPs from HA state file
/opt/ufm/files/scripts/configure_utm_mode.py --enable
Option B: Specify node IPs explicitly
/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30
Step 3: Start UFM Services
Start UFM services on all cluster nodes:
ufm_ha_cluster start
Alternative: Configure After UFM is Running
If UFM is already running, you can still configure UTM mode and restart the services.
Step 1: Verify UTM Plugin is Enabled
Ensure the UTM plugin is enabled in Settings > Plugin Management.
Step 2: Run the Configuration Script
/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30
Or with auto-detection:
/opt/ufm/files/scripts/configure_utm_mode.py --enable
Step 3: Restart UFM Services on All Nodes
systemctl restart ufm-enterprise
systemctl restart ufm-infra
Script Usage
Enable UTM Mode
Enable with auto-detected node IPs:
./configure_utm_mode.py --enable
Enable with explicit node IPs:
./configure_utm_mode.py --enable --node-ips 10.20.30.40,10.20.30.50
Disable UTM Mode
Revert to legacy mode:
./configure_utm_mode.py --disable
Show Current Status
Display current telemetry configuration:
./configure_utm_mode.py --status
Command-Line Options
Flag | Description |
--enable | Enable UTM mode for telemetry |
--disable | Disable UTM mode (revert to legacy mode) |
--status | Show current telemetry configuration status |
--node-ips | Comma-separated list of cluster node IPs. If not provided, auto-detects from the HA state file |
 | Skip updating environment files |
 | Path to the gv.cfg file |
 | Set logging level: DEBUG, INFO, WARNING, ERROR (default: INFO) |
Configuration Changes
When enabling UTM mode for HA, the script modifies the following parameters:
gv.cfg [Telemetry] section:
Parameter | Old Value | New Value |
primary_telemetry_legacy_mode | true | false |
secondary_telemetry_legacy_mode | true | false |
primary_ip_bind_addr | | 0.0.0.0 |
secondary_ip_bind_addr | | 0.0.0.0 |
additional_cset_urls | (empty) | Space-separated cluster URLs |
Environment files:
Instance | Setting |
Primary | PROMETHEUS_ENDPOINT=http://0.0.0.0:9001 |
Secondary | PROMETHEUS_ENDPOINT=http://0.0.0.0:9002 |
Example Output
Enable command output:
============================================================
UTM mode has been enabled successfully.
============================================================
Configuration changes (shared gv.cfg):
- primary_telemetry_legacy_mode = false
- secondary_telemetry_legacy_mode = false
- primary_ip_bind_addr = 0.0.0.0
- secondary_ip_bind_addr = 0.0.0.0
- additional_cset_urls configured with cluster nodes:
http://10.20.30.1:9001/csv/cset/converted_enterprise
http://10.20.30.2:9001/csv/cset/converted_enterprise
Note: Local node URLs are filtered at runtime by agent_manager.py
to avoid duplicate telemetry collection.
------------------------------------------------------------
IMPORTANT: Please restart UFM services on all nodes to apply changes:
systemctl restart ufm-enterprise
systemctl restart ufm-infra
------------------------------------------------------------
Status command output:
=== Current Telemetry Configuration ===
primary_telemetry_legacy_mode = false
secondary_telemetry_legacy_mode = false
primary_ip_bind_addr = 0.0.0.0
secondary_ip_bind_addr = 0.0.0.0
additional_cset_urls = http://10.20.30.1:9001/csv/cset/converted_enterprise http://10.20.30.2:9001/csv/cset/converted_enterprise
=== Mode Status ===
Current Mode: UTM (non-legacy)
=== Environment Files ===
Primary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9001
Secondary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9002
Configuration Options - Instance Matrix
Both standalone and HA deployments can customize how telemetry instances are distributed across HCAs.
Automatic Configuration (Default)
When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration.
Default Behavior:
Detects all available HCAs on the system
Creates 1 primary and 1 secondary telemetry instance on the first HCA
Configuration is stored in:
/opt/ufm/files/conf/utm/{hostname}_instances_matrix.json
Example Auto-Generated Matrix:
{
"mlx5_0": { "primary": 1, "secondary": 1 },
"mlx5_1": { "primary": 0, "secondary": 0 }
}
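The auto-generated default can be reproduced by hand when needed. The sketch below writes a one-primary/one-secondary matrix for the first HCA to a temp file (on a real system the target is /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json, and mlx5_0/mlx5_1 stand in for whatever HCAs are detected):

```shell
# Write a default matrix: 1 primary + 1 secondary instance on the first HCA.
out=$(mktemp)
printf '{\n  "mlx5_0": { "primary": 1, "secondary": 1 },\n  "mlx5_1": { "primary": 0, "secondary": 0 }\n}\n' > "$out"
grep -c '"primary"' "$out"   # one entry per HCA listed in the file
```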
Custom Configuration
For advanced deployments, customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.
Auto-Detect and Create Configuration
Automatically detect HCAs and create a configuration file:
/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
Manual Custom Configuration
Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT:
/opt/ufm/scripts/generate_telemetry_config.sh \
/opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
mlx5_0:2:1 \
mlx5_1:0:2 \
mlx5_2:1:0
This creates:
mlx5_0: 2 primary instances, 1 secondary instance
mlx5_1: 0 primary instances, 2 secondary instances
mlx5_2: 1 primary instance, 0 secondary instances
Example Custom Matrix:
{
"mlx5_0": { "primary": 2, "secondary": 1 },
"mlx5_1": { "primary": 0, "secondary": 2 },
"mlx5_2": { "primary": 1, "secondary": 0 }
}
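A quick way to sanity-check a matrix file is to total the instance counts per type. This sketch uses grep and awk on an inline copy of the matrix above; on a real system, read the file under /opt/ufm/files/conf/utm/ instead. It assumes the simple `"key": value` JSON shape shown in this guide:

```shell
# Sum primary and secondary instance counts across all HCAs (sketch).
matrix='{"mlx5_0": {"primary": 2, "secondary": 1},
         "mlx5_1": {"primary": 0, "secondary": 2},
         "mlx5_2": {"primary": 1, "secondary": 0}}'
echo "$matrix" | grep -o '"primary": [0-9]*'   | awk '{s += $2} END {print "primary instances:", s}'
echo "$matrix" | grep -o '"secondary": [0-9]*' | awk '{s += $2} END {print "secondary instances:", s}'
```

The totals tell you how many ports the interleaved port allocation (described below under Port Allocation) will consume.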
Validate Configuration
Verify your matrix file is correctly formatted:
/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
Get Help
Display usage information and options:
/opt/ufm/scripts/generate_telemetry_config.sh --help
After modifying the matrix configuration file, you must restart UFM for changes to take effect.
Advanced Configuration Parameters
The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.
Parameter | Section | Default | Description |
 | [Telemetry] | 30 | Sample rate (seconds) for primary telemetry instances |
 | [Telemetry] | 300 | Sample rate (seconds) for secondary telemetry instances |
primary_telemetry_legacy_mode | [Telemetry] | true | Set to false to manage the primary instance via UTM |
secondary_telemetry_legacy_mode | [Telemetry] | true | Set to false to manage the secondary instance via UTM |
Note: Changing sample rates affects data frequency and may impact system performance. Consult with NVIDIA support before modifying these values in production environments.
Telemetry Instance | Description | REST API |
High-Frequency (Primary) Telemetry Instance | A default telemetry session that collects a predefined set of ~30 counters covering bandwidth, congestion, and error metrics, which UFM analyzes and reports. These counters drive real-time monitoring, dashboard charts, port threshold events, and live telemetry sessions. | For default and real-time telemetry: Monitoring REST API. For historical telemetry: History Telemetry Sessions REST API → History Telemetry Sessions |
Low-Frequency (Secondary) Telemetry Instance | Operates automatically upon UFM startup, offering an extended scope of 120 counters. For a list of the Secondary Telemetry Fields, refer to Low-Frequency (Secondary) Telemetry Fields. | N/A |
For direct telemetry endpoint access, which exposes the list of supported counters:
For the High-Frequency (Primary) Telemetry Instance, run the following command:
curl -s 127.0.0.1:9001/csv/cset/converted_enterprise
For the Low-Frequency (Secondary) Telemetry Instance, run the following command:
curl -s 127.0.0.1:9002/csv/xcset/low_freq_debug
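Both endpoints return CSV whose header row carries the counter names. The sketch below extracts those names from a simulated two-row response (the column names shown are illustrative, not the full counter set):

```shell
# List counter names from a (simulated) telemetry CSV response.
# On a live system: curl -s 127.0.0.1:9001/csv/cset/converted_enterprise
csv='timestamp,port_guid,PortXmitData,PortRcvData,PortXmitWait
1700000000,0x0002c90300a1b2c4,1024,2048,0'
echo "$csv" | head -n 1 | tr ',' '\n'
```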
Port Allocation
Default Port Allocation
Primary Telemetry: Base port 9001
Secondary Telemetry: Base port 9002
Multi-Instance Port Strategy
When multiple instances are configured, ports are allocated using an interleaved strategy:
Primary instances: Odd ports (9001, 9003, 9005, 9007...)
Secondary instances: Even ports (9002, 9004, 9006, 9008...)
Example - 2 primary + 2 secondary instances:
Primary: ports 9001, 9003
Secondary: ports 9002, 9004
Port Allocation with Proxy Mode
When enable_utm_proxy = true, ports 9001 and 9002 are reserved for the UTM HTTP proxy, and telemetry instances start from offset ports:
Primary instances: 9003, 9005, 9007, 9009...
Secondary instances: 9004, 9006, 9008, 9010...
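The allocation rules above reduce to simple arithmetic. The helper functions below are illustrative, not part of UFM; the second argument models the proxy-mode offset of two reserved ports:

```shell
# Interleaved port arithmetic (sketch): $1 = 0-based instance index,
# $2 = proxy offset (0 when the proxy is disabled, 2 when enabled).
primary_port()   { echo $((9001 + 2 * $1 + ${2:-0})); }
secondary_port() { echo $((9002 + 2 * $1 + ${2:-0})); }

primary_port 0       # first primary, proxy disabled -> 9001
secondary_port 1     # second secondary              -> 9004
primary_port 0 2     # first primary, proxy enabled  -> 9003
```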
Troubleshooting
Verify Telemetry Status
Check if telemetry instances are running:
ps aux | grep -E "(utm|telemetry)" | grep -v grep
Check Matrix Configuration
Validate the instance matrix file:
/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
View Current Mode
Use the configuration script to display current settings:
/opt/ufm/files/scripts/configure_utm_mode.py --status
Check Lock Files
If telemetry startup hangs, check for stale lock files:
ls -la /tmp/utm_matrix_*.lock
Storage Considerations
UFM periodically collects fabric port statistics and saves them in its SQLite database. Before starting UFM Enterprise, consider the disk space utilization below for various fabric sizes and retention durations.
The measurements in the table below were taken with the sampling interval set to once per 30 seconds.
Note that the default sampling rate is once per 300 seconds; adjust the disk utilization estimates accordingly.
Number of Nodes | Ports per Node | Storage per Hour | Storage per 15 Days | Storage per 30 Days |
16 | 8 | 1.6 MB | 576 MB (0.563 GB) | 1152 MB (1.125 GB) |
100 | 8 | 11 MB | 3960 MB (3.867 GB) | 7920 MB (7.734 GB) |
500 | 8 | 50 MB | 18000 MB (17.58 GB) | 36000 MB (35.16 GB) |
1000 | 8 | 100 MB | 36000 MB (35.16 GB) | 72000 MB (70.31 GB) |
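Since the table was measured at a 30-second interval while the default historical rate is 300 seconds, the estimates scale down by the interval ratio. A sketch of that adjustment using the last table row:

```shell
# Scale the table's 1000-node figure (100 MB/hour at 30 s) to the default 300 s rate.
table_mb_per_hour=100
ratio=$((300 / 30))   # default interval / measured interval = 10
echo "approx $((table_mb_per_hour / ratio)) MB/hour at the default 300 s rate"
```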