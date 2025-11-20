UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. This feature provides improved performance and scalability for large-scale deployments through workload distribution.

Better Performance: Workload distribution across multiple instances reduces collection bottlenecks

HCA Utilization: Leverages multiple network adapters for parallel data collection

Scalability: Handles larger fabric deployments more efficiently

Flexibility: Customizable instance distribution based on your infrastructure

The UFM Telemetry Manager (UTM) plugin must be deployed and enabled.

Follow these steps to enable Clustered Telemetry using UTM mode:

Ensure UFM is running on your system.

Navigate to Settings > Plugin Management in the UFM WebUI.

Locate the UFM Telemetry Manager (UTM) plugin.

Click Enable to activate the plugin.

Edit the UFM configuration file:

vi /opt/ufm/files/conf/gv.cfg

Locate the [Telemetry] section and set the following parameters:

[Telemetry]
primary_telemetry_legacy_mode = false
secondary_telemetry_legacy_mode = false

Restart the UFM service to apply changes:

/etc/init.d/ufmd restart

Alternatively, restart only the telemetry service:

/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start
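To confirm that both legacy-mode flags in the [Telemetry] section hold the intended value, the file can be read back from the command line. The following is a minimal sketch: `cfg_get` is a hypothetical helper (not a UFM tool), and it assumes gv.cfg uses plain INI-style `key = value` lines under `[Section]` headers, as the fragment above suggests.

```shell
#!/usr/bin/env bash
# cfg_get FILE SECTION KEY
# Hypothetical helper: print the value of KEY inside [SECTION] of an
# INI-style file. Comment lines starting with # or ; are skipped.
cfg_get() {
  local file=$1 section=$2 key=$3
  awk -v section="$section" -v key="$key" '
    /^[ \t]*\[/ {                          # section header, e.g. [Telemetry]
      gsub(/^[ \t]*\[|\][ \t]*$/, "")      # strip the brackets
      in_section = ($0 == section)
      next
    }
    in_section && /^[ \t]*[^#;=]+=/ {      # key = value line in the wanted section
      k = $0; sub(/=.*/, "", k); gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) {
        v = $0; sub(/^[^=]*=/, "", v); gsub(/^[ \t]+|[ \t]+$/, "", v)
        print v
        exit
      }
    }
  ' "$file"
}

# Report both flags if the configuration file is readable on this host.
GV_CFG=${GV_CFG:-/opt/ufm/files/conf/gv.cfg}
if [ -r "$GV_CFG" ]; then
  for key in primary_telemetry_legacy_mode secondary_telemetry_legacy_mode; do
    echo "$key = $(cfg_get "$GV_CFG" Telemetry "$key")"
  done
fi
```

Both lines should report `false` when UTM mode is configured. An empty value means the key was not found in that section.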

When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration. This is the recommended approach for most deployments.

Default Behavior:

Detects all available HCAs on the system

Creates one primary and one secondary telemetry instance on the first HCA

Configuration is stored in: /opt/ufm/files/conf/utm/{hostname}_instances_matrix.json

Example Auto-Generated Matrix:

{
  "mlx5_0": { "primary": 1, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 0 }
}





For advanced deployments, you can customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.

Automatically detect HCAs and create a configuration file:

/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT:

/opt/ufm/scripts/generate_telemetry_config.sh \
    /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
    mlx5_0:2:1 \
    mlx5_1:0:2 \
    mlx5_2:1:0

This creates:

mlx5_0: 2 primary instances, 1 secondary instance

mlx5_1: 0 primary instances, 2 secondary instances

mlx5_2: 1 primary instance, 0 secondary instances

Example Custom Matrix:

{
  "mlx5_0": { "primary": 2, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 2 },
  "mlx5_2": { "primary": 1, "secondary": 0 }
}
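The sketch below illustrates how HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT arguments correspond to entries in the matrix. `matrix_from_specs` is a hypothetical function written for this illustration. It is not the logic of generate_telemetry_config.sh, whose internals are not described here, and its output is compact JSON rather than the script's exact formatting.

```shell
#!/usr/bin/env bash
# matrix_from_specs SPEC...
# Hypothetical illustration: turn HCA_NAME:PRIMARY:SECONDARY arguments into
# a JSON object shaped like the instances matrix.
matrix_from_specs() {
  local spec hca primary secondary extra sep="" out="{"
  for spec in "$@"; do
    # Split the argument on ':' into name, primary count, secondary count.
    IFS=: read -r hca primary secondary extra <<< "$spec"
    # Reject a missing name, extra fields, or non-numeric counts.
    if [[ -z $hca || -n $extra || ! $primary =~ ^[0-9]+$ || ! $secondary =~ ^[0-9]+$ ]]; then
      echo "invalid spec: $spec" >&2
      return 1
    fi
    # 10# forces base 10 so a count such as 08 is not parsed as octal.
    out+="${sep}\"${hca}\": {\"primary\": $((10#$primary)), \"secondary\": $((10#$secondary))}"
    sep=", "
  done
  printf '%s}\n' "$out"
}

matrix_from_specs mlx5_0:2:1 mlx5_1:0:2 mlx5_2:1:0
```

Each argument becomes one key in the resulting object, so an HCA that is omitted from the command line has no entry in this sketch's output.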

Verify your matrix file is correctly formatted:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Display usage information and options:

/opt/ufm/scripts/generate_telemetry_config.sh --help

Important: After modifying the matrix configuration file, you must restart UFM for changes to take effect.

The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.

| Parameter | Section | Default | Description |
|---|---|---|---|
| dashboard_interval | [Server] | 30 | Sample rate (seconds) for primary telemetry instances |
| secondary_sample_rate | [Telemetry] | 300 | Sample rate (seconds) for secondary telemetry instances |
| primary_telemetry_legacy_mode | [Telemetry] | true | Set to false to enable UTM mode for primary telemetry |
| secondary_telemetry_legacy_mode | [Telemetry] | true | Set to false to enable UTM mode for secondary telemetry |

Note: Changing sample rates affects data frequency and may impact system performance. Consult with NVIDIA support before modifying these values in production environments.
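Because both sample-rate parameters are expressed in seconds between collections, a smaller value means more frequent sampling. As a rough illustration of the effect on data frequency, the sketch below converts an interval into collections per hour. `samples_per_hour` is a hypothetical helper, and the two calls use the default values of dashboard_interval and secondary_sample_rate.

```shell
#!/usr/bin/env bash
# samples_per_hour INTERVAL_SECONDS
# Hypothetical helper: number of collections per hour for a given sample
# interval in seconds (integer division).
samples_per_hour() {
  local interval=$1
  if ! [[ $interval =~ ^[1-9][0-9]*$ ]]; then
    echo "interval must be a positive integer" >&2
    return 1
  fi
  echo $(( 3600 / interval ))
}

samples_per_hour 30    # primary default (dashboard_interval): prints 120
samples_per_hour 300   # secondary default (secondary_sample_rate): prints 12
```

Halving an interval doubles the number of collections per hour, which is one reason the defaults are recommended for most deployments.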

Telemetry Instance: High-Frequency (Primary) Telemetry Instance

Description: A default telemetry session that collects a predefined set of ~30 counters covering bandwidth, congestion, and error metrics, which UFM analyzes and reports. These counters are used for:

Default Telemetry Session: an ongoing session that UFM uses to populate the UFM WebUI dashboard charts and to monitor and analyze port threshold events (the session interval is 30 seconds by default).

Real-Time Telemetry: allows users to define live telemetry sessions for monitoring small subsets of devices or ports and a selected set of counters. For more information, refer to Telemetry.

Historical Telemetry: based on the primary telemetry; collects statistical data from all fabric ports and stores it in an internal UFM SQLite database (the session interval is 5 minutes by default).

REST API: For Default and Real-Time Telemetry: Monitoring REST API. For Historical Telemetry: History Telemetry Sessions REST API → History Telemetry Sessions.

Telemetry Instance: Low-Frequency (Secondary) Telemetry Instance

Description: Operates automatically upon UFM startup, offering an extended scope of 120 counters. For a list of the Secondary Telemetry Fields, refer to Low-Frequency (Secondary) Telemetry Fields.

REST API: N/A

For direct telemetry endpoint access, which exposes the list of supported counters:

For the High-Frequency (Primary) Telemetry Instance, run the following command:

curl -s 127.0.0.1:9001/csv/cset/converted_enterprise

For the Low-Frequency (Secondary) Telemetry Instance, run the following command:

curl -s 127.0.0.1:9002/csv/xcset/low_freq_debug
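To turn an endpoint response into a readable list of counters, the first line of the output can be split into one name per line. This is a minimal sketch that assumes the response is CSV whose first line is a header row of counter names. The `/csv/` path suggests this, but the exact layout is not specified here. `csv_header_fields` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# csv_header_fields
# Hypothetical helper: read CSV on stdin and print each field name from the
# first line on its own line. Carriage returns are dropped in case the
# response uses CRLF line endings.
csv_header_fields() {
  head -n 1 | tr -d '\r' | tr ',' '\n'
}

# Example usage against the endpoints (requires a running telemetry instance):
#   curl -s 127.0.0.1:9001/csv/cset/converted_enterprise | csv_header_fields
#   curl -s 127.0.0.1:9002/csv/xcset/low_freq_debug | csv_header_fields | wc -l
```

Piping the result through `wc -l` gives the number of columns in the header, which may include non-counter columns such as identifiers or timestamps.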



