Date: 2026-02-27

UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. This feature provides improved performance and scalability for large-scale deployments through workload distribution.

Better Performance: Workload distribution across multiple instances reduces collection bottlenecks

HCA Utilization: Leverages multiple network adapters for parallel data collection

Scalability: Handles larger fabric deployments more efficiently

Flexibility: Customizable instance distribution based on your infrastructure

UFM Telemetry Manager (UTM) Plugin must be deployed and enabled

UFM Clustered Telemetry supports two deployment scenarios. Choose the appropriate configuration method based on your deployment type:

| Deployment Type | Description | Configuration Method | Bind Address |
|---|---|---|---|
| Single node (Standalone) | Single UFM node, telemetry collected locally | Manual gv.cfg edit | 127.0.0.1 |
| HA Cluster | Multiple nodes with shared configuration, telemetry aggregated across all nodes | configure_utm_mode.py script | 0.0.0.0 |

| Setting | Single node (Standalone) | HA Cluster |
|---|---|---|
| Bind Address | 127.0.0.1 (localhost only) | 0.0.0.0 (external access) |
| additional_cset_urls | Not required | Required (all node IPs) |
| Configuration Scope | Single node | Shared across cluster |

Note: Choose the correct configuration method for your deployment. Using the wrong method may result in inaccessible telemetry endpoints or duplicate data collection.

This section applies to single node (standalone) deployments where UFM runs on a single node.

Navigate to Settings > Plugin Management in the UFM WebUI.

Locate the UFM Telemetry Manager (UTM) plugin.

Click Enable to activate the plugin.

Edit the UFM configuration file:

vi /opt/ufm/files/conf/gv.cfg

Locate the [Telemetry] section and set the following parameters:

[Telemetry]
primary_telemetry_legacy_mode = false
secondary_telemetry_legacy_mode = false
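After editing, the two flags above can be checked programmatically. The following is a minimal sketch, assuming gv.cfg parses as a standard INI file; the helper name `utm_mode_enabled` is hypothetical and not part of UFM. It treats a missing flag as legacy mode, matching the documented default of `true`.

```python
import configparser


def utm_mode_enabled(cfg_text: str) -> bool:
    """Return True when both legacy-mode flags in [Telemetry] are false."""
    # interpolation=None keeps any literal '%' in values from raising errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(cfg_text)
    flags = ("primary_telemetry_legacy_mode", "secondary_telemetry_legacy_mode")
    # A missing section or key falls back to True (legacy), the documented default.
    return all(
        not parser.getboolean("Telemetry", flag, fallback=True) for flag in flags
    )


sample = """
[Telemetry]
primary_telemetry_legacy_mode = false
secondary_telemetry_legacy_mode = false
"""
print(utm_mode_enabled(sample))
```

On a UFM node, the text of /opt/ufm/files/conf/gv.cfg could be read and passed in place of `sample`.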

For custom HCA distribution, refer to Configuration Options - Instance Matrix.

If UFM is not running, start it:

/etc/init.d/ufmd start

If UFM is already running, restart to apply changes:

/etc/init.d/ufmd restart

Alternatively, restart only the telemetry service:

/etc/init.d/ufmd ufm_telemetry_stop

/etc/init.d/ufmd ufm_telemetry_start

This section applies to High Availability (HA) cluster deployments, in Active-Active mode only, where multiple nodes share a common gv.cfg configuration file and telemetry needs to be aggregated across all cluster nodes.

The configure_utm_mode.py script automates the configuration by:

Setting bind addresses to 0.0.0.0 for external telemetry access

Configuring additional_cset_urls in gv.cfg for multi-node telemetry aggregation

Managing legacy mode flags

Updating environment files for proper endpoint configuration

HA cluster must be configured in active-active mode.

/var/lib/ufm_ha/ha_state file present (or explicit node IPs available)

UTM Plugin deployed (can be enabled before or after configuration)

UFM configured in Infra mode

Configuring UTM mode before starting UFM is the preferred approach, as it avoids unnecessary service restarts.

Step 1: Deploy UTM Plugin

Ensure the UTM plugin is deployed on all cluster nodes. You can deploy via CLI or by using ufm_infra_feature_flag.py.

Step 2: Run the Configuration Script

Option A: Auto-detect node IPs from HA state file

/opt/ufm/files/scripts/configure_utm_mode.py --enable

Option B: Specify node IPs explicitly

/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30

Step 3: Start UFM Services

Start UFM services on all cluster nodes:

ufm_ha_cluster start

If UFM is already running, you can still configure UTM mode and restart the services.

Step 1: Verify UTM Plugin is Enabled

Ensure the UTM plugin is enabled in Settings > Plugin Management.

Step 2: Run the Configuration Script

/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30

Or with auto-detection:

/opt/ufm/files/scripts/configure_utm_mode.py --enable

Step 3: Restart UFM Services on All Nodes

systemctl restart ufm-enterprise

systemctl restart ufm-infra

Enable UTM Mode

Enable with auto-detected node IPs:

./configure_utm_mode.py --enable

Enable with explicit node IPs:

./configure_utm_mode.py --enable --node-ips 10.20.30.40,10.20.30.50

Disable UTM Mode

Revert to legacy mode:

./configure_utm_mode.py --disable

Show Current Status

Display current telemetry configuration:

./configure_utm_mode.py --status

| Flag | Description |
|---|---|
| --enable, -e | Enable UTM mode for telemetry |
| --disable, -d | Disable UTM mode (revert to legacy mode) |
| --status, -s | Show current telemetry configuration status |
| --node-ips IPs | Comma-separated list of cluster node IPs. If not provided, auto-detects from /var/lib/ufm_ha/ha_state |
| --skip-additional-urls | Skip updating additional_cset_urls configuration |
| --config-file PATH | Path to gv.cfg file (default: /opt/ufm/files/conf/gv.cfg) |
| --log-level LEVEL | Set logging level: DEBUG, INFO, WARNING, ERROR (default: INFO) |
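Before passing a value to --node-ips, it can help to sanity-check the comma-separated list. The sketch below uses Python's standard `ipaddress` module; `parse_node_ips` is a hypothetical helper and does not reflect how configure_utm_mode.py validates its input.

```python
import ipaddress


def parse_node_ips(value: str) -> list[str]:
    """Split a comma-separated IP list and validate each entry."""
    ips = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            raise ValueError("empty entry in node IP list")
        # ip_address raises ValueError for anything that is not IPv4 or IPv6.
        ips.append(str(ipaddress.ip_address(item)))
    if len(set(ips)) != len(ips):
        raise ValueError("duplicate node IP in list")
    return ips


print(parse_node_ips("10.212.23.1,10.209.226.30"))
```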

When enabling UTM mode for HA, the script modifies the following parameters:

gv.cfg [Telemetry] section:

| Parameter | Before | After |
|---|---|---|
| primary_telemetry_legacy_mode | true | false |
| secondary_telemetry_legacy_mode | true | false |
| primary_ip_bind_addr | 127.0.0.1 | 0.0.0.0 |
| secondary_ip_bind_addr | 127.0.0.1 | 0.0.0.0 |
| additional_cset_urls | (empty) | Space-separated cluster URLs |

Environment files:

| File | Before | After |
|---|---|---|
| primary_env.cfg | PROMETHEUS_ENDPOINT=http://127.0.0.1:9001 | PROMETHEUS_ENDPOINT=http://0.0.0.0:9001 |
| secondary_env.cfg | PROMETHEUS_ENDPOINT=http://127.0.0.1:9002 | PROMETHEUS_ENDPOINT=http://0.0.0.0:9002 |
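To confirm which address an environment file binds to, the PROMETHEUS_ENDPOINT value can be parsed with the standard library. This is a rough sketch that assumes the file consists of plain KEY=VALUE lines as shown above; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def prometheus_bind(env_text: str) -> tuple:
    """Return (host, port) from the PROMETHEUS_ENDPOINT line of an env file."""
    for line in env_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "PROMETHEUS_ENDPOINT":
            parts = urlsplit(value.strip())
            return parts.hostname, parts.port
    raise KeyError("PROMETHEUS_ENDPOINT not found")


def externally_reachable(env_text: str) -> bool:
    """True when the endpoint binds to all interfaces (0.0.0.0)."""
    host, _ = prometheus_bind(env_text)
    return host == "0.0.0.0"


print(prometheus_bind("PROMETHEUS_ENDPOINT=http://0.0.0.0:9001"))
```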

Enable command output:

============================================================
UTM mode has been enabled successfully.
============================================================
Configuration changes (shared gv.cfg):
- primary_telemetry_legacy_mode = false
- secondary_telemetry_legacy_mode = false
- primary_ip_bind_addr = 0.0.0.0
- secondary_ip_bind_addr = 0.0.0.0
- additional_cset_urls configured with cluster nodes:
http:
http:
Note: Local node URLs are filtered at runtime by agent_manager.py to avoid duplicate telemetry collection.
------------------------------------------------------------
IMPORTANT: Please restart UFM services on all nodes to apply changes:
systemctl restart ufm-enterprise
systemctl restart ufm-infra
------------------------------------------------------------

Status command output:

=== Current Telemetry Configuration ===
primary_telemetry_legacy_mode = false
secondary_telemetry_legacy_mode = false
primary_ip_bind_addr = 0.0.0.0
secondary_ip_bind_addr = 0.0.0.0
additional_cset_urls = http:
=== Mode Status ===
Current Mode: UTM (non-legacy)
=== Environment Files ===
Primary: PROMETHEUS_ENDPOINT=http:
Secondary: PROMETHEUS_ENDPOINT=http:
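The enable output notes that local node URLs are filtered at runtime to avoid duplicate collection. The sketch below illustrates that idea only; it is not the logic in agent_manager.py, which is not shown here. The URL shape in the example is an assumption, since the URLs in the output above are truncated.

```python
from urllib.parse import urlsplit


def remote_cset_urls(additional_cset_urls: str, local_ips: set) -> list:
    """Drop URLs whose host is one of this node's own addresses."""
    return [
        url
        for url in additional_cset_urls.split()  # the value is space-separated
        if urlsplit(url).hostname not in local_ips
    ]


# Hypothetical URLs: the port and path are assumed for illustration.
urls = (
    "http://10.212.23.1:9001/csv/cset/converted_enterprise "
    "http://10.209.226.30:9001/csv/cset/converted_enterprise"
)
print(remote_cset_urls(urls, {"10.212.23.1", "127.0.0.1"}))
```

Because the shared gv.cfg lists every node, each node would apply a filter of this kind with its own addresses and end up collecting only from its peers.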

Both standalone and HA deployments can customize how telemetry instances are distributed across HCAs.

When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration.

Default Behavior:

Detects all available HCAs on the system

Creates 1 primary and 1 secondary telemetry instance on the first HCA

Configuration is stored in: /opt/ufm/files/conf/utm/{hostname}_instances_matrix.json

Example Auto-Generated Matrix:

{
  "mlx5_0": { "primary": 1, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 0 }
}
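The default behavior described above can be expressed in a few lines. This is an illustrative sketch, not UFM's own code: `default_matrix` is a hypothetical function, and the HCA names are passed in rather than detected.

```python
import json


def default_matrix(hcas: list) -> dict:
    """One primary and one secondary instance on the first HCA, none elsewhere."""
    return {
        hca: {"primary": int(i == 0), "secondary": int(i == 0)}
        for i, hca in enumerate(hcas)
    }


print(json.dumps(default_matrix(["mlx5_0", "mlx5_1"]), indent=2))
```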

For advanced deployments, customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.

Automatically detect HCAs and create a configuration file:

/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT:

/opt/ufm/scripts/generate_telemetry_config.sh \
    /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
    mlx5_0:2:1 \
    mlx5_1:0:2 \
    mlx5_2:1:0

This creates:

mlx5_0: 2 primary instances, 1 secondary instance

mlx5_1: 0 primary instances, 2 secondary instances

mlx5_2: 1 primary instance, 0 secondary instances

Example Custom Matrix:

{
  "mlx5_0": { "primary": 2, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 2 },
  "mlx5_2": { "primary": 1, "secondary": 0 }
}
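The mapping from HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT arguments to the matrix structure can be sketched as follows. This mirrors the documented format only; `matrix_from_specs` is a hypothetical name, and generate_telemetry_config.sh may apply checks that are not reproduced here.

```python
def matrix_from_specs(specs: list) -> dict:
    """Build an instance matrix from HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT strings."""
    matrix = {}
    for spec in specs:
        # Unpacking raises ValueError if the spec does not have exactly three fields.
        name, primary, secondary = spec.split(":")
        matrix[name] = {"primary": int(primary), "secondary": int(secondary)}
    return matrix


print(matrix_from_specs(["mlx5_0:2:1", "mlx5_1:0:2", "mlx5_2:1:0"]))
```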

Verify your matrix file is correctly formatted:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
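For a quick structural check outside the script, something like the following could be used. The rules here (a JSON object keyed by HCA name, with non-negative integer `primary` and `secondary` counts) are inferred from the example matrices in this document; the checks performed by --validate itself are not documented here and may differ.

```python
import json


def matrix_problems(text: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        matrix = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(matrix, dict):
        return ["top level must be an object keyed by HCA name"]
    problems = []
    for hca, counts in matrix.items():
        if not isinstance(counts, dict):
            problems.append(f"{hca}: expected an object")
            continue
        for role in ("primary", "secondary"):
            value = counts.get(role)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                problems.append(f"{hca}: {role} must be a non-negative integer")
    return problems


print(matrix_problems('{"mlx5_0": {"primary": 1, "secondary": 1}}'))
```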

Display usage information and options:

/opt/ufm/scripts/generate_telemetry_config.sh --help

Note: After modifying the matrix configuration file, you must restart UFM for changes to take effect.

The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.

| Parameter | Section | Default | Description |
|---|---|---|---|
| dashboard_interval | [Server] | 30 | Sample rate (seconds) for primary telemetry instances |
| secondary_sample_rate | [Telemetry] | 300 | Sample rate (seconds) for secondary telemetry instances |
| primary_telemetry_legacy_mode | [Telemetry] | true | Set to false to enable UTM mode for primary telemetry |
| secondary_telemetry_legacy_mode | [Telemetry] | true | Set to false to enable UTM mode for secondary telemetry |

Note: Changing sample rates affects data frequency and may impact system performance. Consult with NVIDIA support before modifying these values in production environments.

Telemetry Instance: High-Frequency (Primary) Telemetry Instance

Description: A default telemetry session that collects a predefined set of ~30 counters covering bandwidth, congestion, and error metrics, which UFM analyzes and reports. These counters are used for:

Default Telemetry Session - An ongoing session used by UFM to display the UFM WebUI dashboard charts and to monitor and analyze port threshold events (the session interval is 30 seconds by default).

Real-Time Telemetry - Allows users to define live telemetry sessions for monitoring small subsets of devices or ports and a selected set of counters. For more information, refer to Telemetry.

Historical Telemetry - Based on the primary telemetry; collects statistical data from all fabric ports and stores it in an internal UFM SQLite database (the session interval is 5 minutes by default).

REST API: For Default and Real-Time Telemetry, use the Monitoring REST API. For Historical Telemetry, use the History Telemetry Sessions REST API.

Telemetry Instance: Low-Frequency (Secondary) Telemetry Instance

Description: Operates automatically upon UFM startup, offering an extended scope of 120 counters. For a list of the Secondary Telemetry Fields, refer to Low-Frequency (Secondary) Telemetry Fields.

REST API: N/A

For direct telemetry endpoint access, which exposes the list of supported counters:

For the High-Frequency (Primary) Telemetry Instance, run the following command:

curl -s 127.0.0.1:9001/csv/cset/converted_enterprise

For the Low-Frequency (Secondary) Telemetry Instance, run the following command:

curl -s 127.0.0.1:9002/csv/xcset/low_freq_debug
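The same endpoints can be queried from a script to list the counter names. The sketch below assumes the response is CSV text whose first row holds the column (counter) names; that layout is an assumption based on the /csv/ path and is not confirmed here. Both function names are hypothetical.

```python
import csv
import io
from urllib.request import urlopen


def counter_names(csv_text: str) -> list:
    """Return the column names from the first row of a CSV payload."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def fetch_counter_names(url: str, timeout: float = 10.0) -> list:
    """Download a telemetry CSV endpoint and return its header row."""
    with urlopen(url, timeout=timeout) as resp:
        return counter_names(resp.read().decode("utf-8", errors="replace"))


# Usage on a UFM node with telemetry running (not executed here):
# fetch_counter_names("http://127.0.0.1:9001/csv/cset/converted_enterprise")
```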





Primary Telemetry: Base port 9001

Secondary Telemetry: Base port 9002

When multiple instances are configured, ports are allocated using an interleaved strategy:

Primary instances: Odd ports (9001, 9003, 9005, 9007...)

Secondary instances: Even ports (9002, 9004, 9006, 9008...)

Example - 2 primary + 2 secondary instances:

Primary: ports 9001, 9003

Secondary: ports 9002, 9004

When enable_utm_proxy = true, ports 9001 and 9002 are reserved for the UTM HTTP proxy, and telemetry instances start from offset ports:

Primary instances: 9003, 9005, 9007, 9009...

Secondary instances: 9004, 9006, 9008, 9010...
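The interleaved allocation, with and without the proxy offset, can be reproduced with a small function. This is a sketch of the numbering rule as described above, with instance counts taken as totals per type; how ports map to individual HCAs is not covered here, and `allocate_ports` is a hypothetical name.

```python
def allocate_ports(primary: int, secondary: int, utm_proxy: bool = False) -> dict:
    """Return the port lists for the given number of primary and secondary instances."""
    # With the proxy enabled, 9001 and 9002 are reserved, so each series starts two ports later.
    offset = 2 if utm_proxy else 0
    return {
        "primary": [9001 + offset + 2 * i for i in range(primary)],
        "secondary": [9002 + offset + 2 * i for i in range(secondary)],
    }


print(allocate_ports(2, 2))
print(allocate_ports(2, 2, utm_proxy=True))
```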

Check if telemetry instances are running:

ps aux | grep -E "(utm|telemetry)" | grep -v grep

Validate the instance matrix file:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Use the configuration script to display current settings:

/opt/ufm/files/scripts/configure_utm_mode.py --status

If telemetry startup hangs, check for stale lock files: