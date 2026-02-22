NVIDIA UFM Enterprise User Manual v6.24.1
GNMI-Telemetry Plugin

Plugin Release Notes

Changes and New Features

Plugin Version

Changes and New Features

1.4.1-8

  • Added support for progressive streaming for telemetry data, enabling incremental data delivery for improved responsiveness and reduced memory footprint during telemetry streaming.

  • Implemented memory enhancements optimized for large-scale fabrics.

  • gNMI size reduction

  • gNMI ARM support

1.3.8-5

  • Added the capability for UFM to stream events in JSON format through gNMI, with the option to include device information (for events related to a device or port). For more information, refer to GNMI-Server Configurations (the include_dev_details_in_events parameter). To stream the device information, ensure that UFM version 6.23.1 or later is installed.

1.3.8-4

N/A

1.3.8-3

  • Added an option to disable the HTTP server through configuration. Introduced the enable_http_server=true option under the [HTTP-Server] section in gnmi_telemetry.ini.

  • Enhanced general logging and improved default configuration settings.

  • Optimized dynamic XCSET group loading in the gNMI service, significantly reducing start-up time and removing unclear log entries. The gNMI service now fetches only the configured endpoints and ib.port_counters, instead of retrieving all counter sets.


Bug Fixes

Plugin Version

Bug Fixes

1.4.1-8

N/A

1.3.8-5

N/A

1.3.8-4

  • Fixed a bug that caused the gNMI service to crash due to excessive memory consumption.

  • Removed the data snapshot mechanism during data retrieval, which was causing high memory consumption.

  • Cleaned up orphaned goroutines that held locks indefinitely.

1.3.8-3

  • Fixed telemetry fetch overlap that occurred when telemetry notifications were received.

  • Resolved the “port in use” error by adding a fallback mechanism for the HTTP client’s static port.

  • Added queue depth protection: implemented queue depth monitoring with a 50-message limit to prevent out-of-memory (OOM) conditions.

  • Added HTTP connection leak prevention.

  • Improved resource cleanup, including enhanced handling of client references, message queues, and goroutines.

  • Enhanced logging and monitoring for better observability.

  • Fixed stale port TTL timing calculation to ensure ports are only reported as down after the correct grace period.

  • Resolved lock contention issues and prevented potential deadlocks during concurrent data access.


Known Issues

Plugin Version

Known Issues

1.4.1-8

N/A

1.3.8-5

N/A

1.3.8-4

The gNMI plugin Docker container requires a minimum of 12 GB of RAM on the host system for successful deployment and stable operation.


Overview

The GNMI Telemetry Plugin is a server that uses the gNMI protocol to stream data from UFM telemetry. Users can select the data to stream, specify intervals, and choose to include only deltas (on-change mode).

The server supports three functions: Capability, Get, and Subscribe.

Data Streaming: The streamed data is delivered in CSV format. Headers are provided in full in the first message; by default, subsequent messages carry only the header ID. Data is presented in hex format to conserve space for unchanged data. Values are delivered as an array of strings, each row identified by a unique identifier (GUID) and port. Depending on the mode, rows may be missing if there are no changes for a given GUID and port.

Metadata Streaming: The plugin can stream UFM's metadata, providing an inventory of it. For convenience, examples use the gNMIc client, but any gNMI client can be used.

Configuration and Polling Intervals: The polling intervals for each server cache are configurable with the following defaults:

  • Telemetry: every 5 minutes

  • Inventory: every minute

  • Events: every minute

  • Switch rank: every 6 hours

  • UFM Health KPI: every 5 minutes

The service supports telemetry from switch-level data (fset) and port-level data (xcset), querying the low_freq_debug xcset by default. Multiple telemetries can be polled simultaneously.

Data Sharding: The service supports sharding the cache data on request, allowing many clients to request the same data while each receives a different part.

Deployment

To deploy the plugin with UFM (SA or HA):

  1. Install the latest version of UFM.

  2. Run UFM with /etc/init.d/ufmd start.

  3. Pull the plugin image from the Docker Hub.

  4. Run /opt/ufm/scripts/manage_ufm_plugins.sh add -p gnmi_telemetry -t <version> to enable the plugin, or use the UFM UI to add the plugin via Settings → Plugin Management → Right Click on GNMI-telemetry → Add → select version → Add.

  5. Check that the plugin is running with docker ps.

  6. If the gNMI default port is unavailable, change the configuration file gnmi_telemetry.ini and restart the plugin.

Configurations

The /opt/ufm/files/conf/plugins/gnmi_telemetry/gnmi_telemetry.ini file centralizes the configuration of the GNMI-Telemetry Plugin, allowing users to customize logging, server behavior, telemetry intervals, security settings, and more. Below is an overview of the available configuration parameters, their default values, and their purpose within the plugin.

Common Configurations

Parameter

Description

Default Value

log_level

Sets the logging level for the plugin (e.g., INFO, DEBUG).

INFO

log_file_max_size_MB

Maximum size of the log file before it rotates, measured in MB.

10

log_file_backup_count

Number of backup log files to retain after rotation.

5

log_file_name

Full path of the log file

/log/gnmi_streaming.log


GNMI-Server Configurations

Parameter

Description

Default Value

grpc_port

Port on which the gRPC server listens for incoming connections.

9339

cpu_list

Specifies which CPU cores the server can use (comma-separated for multiple cores, e.g., 3,5,6).

3

memory_usage_threshold

Specifies the server memory usage threshold for forcing garbage collection in MB

500

data_directory

Data directory used by the server. It contains all the files generated by the server.

Do not modify this value when the gNMI server is running as a UFM plugin container.

/data

include_port_sharding

Determines whether the port number is included in the sharding algorithm. When enabled, ports of the same node may not be in the same shard.

false

include_old_data_on_change

Includes previous data when sending notifications in on_change mode.

true

strict_collected_counters

Enforces strict rules for collecting counters based on client requests.

false

disable_events_inventory

Disables the collection and delivery of events and inventory data.

false

include_dev_details_in_events

If true, the GNMI server includes device details in the events notification response (node_guid, node_name, node_type, peer details, etc.).

Note

This feature is available in UFM version 6.23.1 or later.

true

force_heartbeat_notification

In Subscribe on-change mode, setting this flag to true ensures that the server sends a heartbeat notification at every interval, even if no updates occur during the heartbeats.

false

include_headers_on_every_update

In Subscribe mode, headers are included only in the first update by default. If set to true, headers will be included in every update.

false

serial_telemetry_fetch

If true, fetches telemetry serially (one by one) from all the telemetry endpoints.

false

serial_telemetry_interval

The gNMI telemetry fetch interval when applying serial telemetry fetch.

300s

rest_retry_attempts

Number of retry attempts for failed REST API calls

3

rest_retry_backoff

Exponential backoff duration between REST call retry attempts (e.g. 2s, 4s, 6s)

2s

disable_cache_cleanup

If false, the cache will be cleaned up according to the cleanup policy. If true, the cache will not be cleaned up.

false

port_stale_time_multiplier

The multiplier for the stale time of a port. The stale time is the time after which the port is considered stale and is removed from the cache.

The stale time is calculated as the maximum telemetry interval across all the endpoints, multiplied by the stale time multiplier. In other words, the multiplier is the number of telemetry fetch iterations to wait before the port is considered stale and is removed from the cache.

3

gnmic_timeout_multiplier

Server timeout multiplier for gNMI subscription responses. The actual timeout is calculated as: subscription_interval * gnmic_timeout_multiplier. This determines how long the server waits before disconnecting an idle SAMPLE mode client. For example, with a 10s subscription interval and multiplier of 2, the server waits 20s before timing out.

2

batch_size

Number of telemetry rows per streamed message chunk. Controls progressive streaming behavior:

  • If 0: All telemetry data is sent in a single large message

  • If > 0: Data is sent progressively in chunks of this many rows

The progressive streaming (batch_size > 0) reduces memory footprint and enables faster time-to-first-byte for large datasets. Each chunk is sent as a separate gNMI notification with its own timestamp.

1000

enable_na_telemetry_indicators

Controls whether "N/A" values trigger telemetry update notifications:

  • If false (default): N/A values are filtered out and do not trigger onChange updates or replace existing valid counter values.

  • If true: N/A values are treated as valid data changes and will trigger updates and replace previous values.

Set to true if you need to track when counters become unavailable.

false
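
The timing and batching rules described in the table above can be sketched in a few lines of Python. This is a minimal illustration of the documented arithmetic only, not the server's implementation; the function names are hypothetical, and treating a negative batch_size like 0 is an assumption, since the text describes only 0 and positive values.

```python
from typing import Iterator, List, Sequence


def port_stale_time(telemetry_intervals_s: Sequence[float], multiplier: int = 3) -> float:
    """Stale time = largest telemetry interval across endpoints * port_stale_time_multiplier."""
    return max(telemetry_intervals_s) * multiplier


def sample_client_timeout(subscription_interval_s: float, multiplier: int = 2) -> float:
    """Timeout = subscription_interval * gnmic_timeout_multiplier."""
    return subscription_interval_s * multiplier


def chunk_rows(rows: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Split telemetry rows into per-message chunks, as batch_size describes."""
    if batch_size <= 0:
        # batch_size = 0: everything goes out as one message.
        # Negative values are handled the same way here (assumption).
        yield rows
        return
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Two endpoints polled every 300s and 60s, default multiplier of 3.
print(port_stale_time([300, 60]))        # 900
# The example from the table: 10s interval, multiplier of 2.
print(sample_client_timeout(10))         # 20
print(list(chunk_rows(["r1", "r2", "r3"], batch_size=2)))
```

With the defaults (300s telemetry interval, multiplier 3), a port would be dropped from the cache after 900 seconds without data.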


HTTP Server Configurations

The gNMI plugin includes a built-in HTTP Server that enables event-driven data synchronization between UFM Telemetry endpoints and the gNMI server. This real-time communication complements the existing periodic telemetry fetching mechanism controlled by the telemetry_interval parameter.

Parameter

Description

Default Value

enable_http_server

If true, the HTTP server will be enabled

true

http_port

Port for the HTTP server to listen on

9338


Telemetry Configurations

Parameter

Description

Default Value

notification_server_path

Path for the telemetry notification HTTP endpoint

/telemetry/notify/

throtling_interval

Throttling interval - minimum time between processing notifications (in seconds)

10s

For more details about the real-time data synchronization between UFM Telemetry and gNMI server and how to enable it in Telemetry, please refer to section UFM Telemetry Notification Subscription.

Time-Intervals Configurations

Parameter

Description

Default Value

events_interval

Time interval for collecting events, specified in seconds (e.g., 60s).

60s

inventory_interval

Time interval for collecting inventory data, specified in seconds (e.g., 60s).

60s

minimum_sample_rate

Minimum sampling interval for telemetry notifications, specified in seconds.

10s

timeout_rest_call

Timeout for REST API calls, specified in seconds.

30s

switch_rank_interval

Interval for monitoring changes in the switch rank datasource file, specified in hours.

6h

ufm_health_kpi_interval

Interval for monitoring system health metrics, specified in seconds.

300s


Telemetry Cluster-Specific Configurations

Each cluster should have its own section named cluster-config-$cluster_name. For example, [cluster-config-low_freq_debug]. You can add multiple sections if you need to collect data from multiple clusters/telemetries simultaneously.

By default, the plugin comes with a single cluster that collects data from http://127.0.0.1:9002/csv/xcset/low_freq_debug.

Parameter

Description

Default Value

telemetry_endpoint_url

URL for the telemetry endpoint for the cluster.

http://127.0.0.1:9002/csv/xcset/low_freq_debug

default_id_cols

Default identifier columns applied to all rows (ports) for telemetry data.

Node_GUID,Port_Number

id_type_col

Telemetry column used to determine identification schemes based on its value.

N/A

id_cols_<type>

Identifier columns for rows where id_type_col has a specific value (e.g., id_cols_1).

N/A

per_slvl

Enables per_slvl counters for telemetry if supported by the configuration.

false

telemetry_interval

Interval for fetching telemetry data from the endpoint, specified in seconds.

300s

filtered_columns

Columns to exclude from telemetry data, separated by commas.

port_guid
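
As a rough illustration of the per-cluster section naming, the sketch below parses an INI snippet with Python's standard configparser and lists the configured clusters. The first section mirrors the documented default; the second section (example_second, port 9003) is purely hypothetical and only shows how an additional cluster section might look. The list_clusters helper is not part of the plugin.

```python
import configparser

# Sample configuration text. Only the first section reflects the documented
# default; the second one is a made-up example of an extra cluster.
SAMPLE_INI = """
[cluster-config-low_freq_debug]
telemetry_endpoint_url = http://127.0.0.1:9002/csv/xcset/low_freq_debug
telemetry_interval = 300s

[cluster-config-example_second]
telemetry_endpoint_url = http://127.0.0.1:9003/csv/xcset/example_second
telemetry_interval = 60s
"""

PREFIX = "cluster-config-"


def list_clusters(ini_text: str) -> dict:
    """Return {cluster_name: telemetry_endpoint_url} for every cluster section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    clusters = {}
    for section in parser.sections():
        if section.startswith(PREFIX):
            name = section[len(PREFIX):]
            clusters[name] = parser[section]["telemetry_endpoint_url"]
    return clusters


for name, url in list_clusters(SAMPLE_INI).items():
    print(name, "->", url)
```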


UFM Configurations

Parameter

Description

Default Value

default_inventory

Default inventory values if UFM is unavailable, specified as a JSON string.

{"Servers":8,"Switches":4,"HCAs":4,"ActivePorts":16}

ufm_ip

IP address of the UFM instance used for inventory data.

127.0.0.1

ufm_access_token

Access token for authenticating with UFM if running on a different host.

N/A

ufm_users_cache_ttl

Interval for refreshing the UFM users and roles cache. Increase this value if changes to UFM users or roles are infrequent.

10 minutes

switch_rank_datasource_file

File path for the switch rank datasource used by UFM.

/opt/ufm/files/log/opensm-smdb.dump


GNMI-Security Configurations

Parameter

Description

Default Value

secure_mode_enabled

Enables secure mode for gNMI.

true

client_cert_subject_identifier

Specifies the certificate subject identifier (SAN or CN).

SAN

authorized_roles

Comma-separated list of UFM user roles that are authorized to access gNMI. If empty, all users are allowed access.

initial_check_time

Time of day for the initial certificate validation check, specified as HH:mm.

01:30

check_interval

Interval for periodic certificate validation checks, specified in hours.

12h


XDR Configurations

Parameter

Description

Default Value

xdr_mode

Enables XDR Mode Setup

false

xdr_ports_types

Types of XDR ports to collect, separated by commas (e.g., legacy,aggregated,plane).

legacy,aggregated,plane


Authentication

The server's authentication is determined by the gNMI protocol. Two configurable data sources are relevant to authentication: the UFM Telemetry URL and the UFM inventory IP.

  • Authentication is not necessary for the UFM telemetry URL; only the URL itself needs to be configured.

  • The inventory is sourced from the UFM on the local host, but this can be changed to a different machine in the config file. In that case, an access token for that machine is required (see ufm_access_token).

Secure Server using mTLS and Certificate Subject Identifier

The gNMI server can be secured using certificates. To secure the server, set the "secure_mode_enabled" flag to "true" in the configuration (the default is true).

The certificate must be placed under the /opt/ufm/files/conf/webclient folder; the location can be changed by modifying the shared volume. The gNMI server periodically checks its certificates for updates, ensuring they remain up to date. The client certificate naming convention must align with the DNS name (SAN), as in UFM.

The gNMI plugin supports certificate subject identifier (the default value is SAN). Configure the certificate subject identifier under the gNMI-security section to be SAN or CN (Common Name). For example: client_cert_subject_identifier=CN.

Role-Based Access Control

The UFM gNMI plugin supports Role-Based Access Control (RBAC) to enable granular, user-based authorization for gNMI operations. By leveraging existing UFM user management and certificate-based (mTLS) authentication, the plugin enforces access policies according to user role.

Authentication and User Mapping

  • The gNMI server uses mTLS certificates for secure connections, leveraging the UFM certificate infrastructure.

  • Each UFM user is associated with an mTLS certificate by mapping the certificate subject identifier (SAN or CN) to a UFM username. The association information is stored in /opt/ufm/files/conf/webclient/ufm_client_authen.db.

  • Refer to Client-Based Authentication for configuring client authentication and user association in UFM.

Role Cache and Management

  • On plugin startup, the gNMI server queries the UFM Users API to obtain the current list of users and their assigned roles (groups).

  • The user-role mapping is cached to reduce API load. The refresh interval is configurable via the ufm_users_cache_ttl parameter (default: 10 minutes).

  • Any changes to UFM users or roles are reflected after the next cache refresh.

Configuring RBAC

  • Use the authorized_roles configuration option to control which UFM roles are permitted to access the gNMI server. Example: authorized_roles=System_Admin,Monitoring_Only,Custom_Role1

  • Only users whose role matches one of the authorized roles will be allowed to access or execute gNMI operations.

Access Control Flow

  1. Connection Attempt: A gNMI client initiates a connection using an mTLS certificate.

  2. Certificate Validation: The gNMI server validates the client certificate.

  3. Subject Identifier Extraction: The server extracts the subject identifier (as configured—SAN or CN) from the certificate.

  4. User Association: The server uses the identifier to look up the UFM user in ufm_client_authen.db.

  5. Role Lookup: The user’s role is fetched from the cached UFM user-role mapping.

  6. Authorization Check: Access is granted if the user’s role matches one in authorized_roles; otherwise, it is denied.
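
Steps 4 to 6 of the flow above can be sketched as a small lookup function. This is an illustrative model under stated assumptions, not the plugin's code: the dictionaries stand in for ufm_client_authen.db and the cached user-role mapping, all names are hypothetical, and rejecting a certificate that maps to no UFM user is an assumption. Certificate validation (steps 1 to 3) is assumed to have happened already.

```python
from typing import Dict, Set


def is_authorized(
    subject_identifier: str,
    cert_to_user: Dict[str, str],       # stand-in for ufm_client_authen.db
    user_roles: Dict[str, Set[str]],    # stand-in for the cached user-role mapping
    authorized_roles: Set[str],         # parsed authorized_roles option
) -> bool:
    # Step 4: map the certificate subject identifier (SAN or CN) to a UFM user.
    user = cert_to_user.get(subject_identifier)
    if user is None:
        return False  # assumption: unknown certificates are rejected
    # An empty authorized_roles list means all users are allowed.
    if not authorized_roles:
        return True
    # Steps 5-6: grant access if any of the user's roles is authorized.
    return bool(user_roles.get(user, set()) & authorized_roles)


cert_to_user = {"client1.example.com": "alice", "client2.example.com": "bob"}
user_roles = {"alice": {"System_Admin"}, "bob": {"Custom_Role2"}}
allowed = {"System_Admin", "Monitoring_Only", "Custom_Role1"}

print(is_authorized("client1.example.com", cert_to_user, user_roles, allowed))  # True
print(is_authorized("client2.example.com", cert_to_user, user_roles, allowed))  # False
```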

Supported API Requests

The service supports the following requests:

  • Capability: Describes the YANG files the service supports (UFM telemetry).

  • Get: Requires valid paths; receives the cache data from the service.

  • Subscribe: Requires valid paths and an interval; receives cache data at the specified interval. The first message contains headers extracted from the path, and subsequent messages include only the HeaderID. In on-change subscribe mode, a heartbeat interval is provided instead of an interval. If no data changes during a heartbeat interval, a full notification message, similar to the first message, is sent. If some data changes, a notification of the change is sent and no heartbeat message is sent.

Capability Request

The capability request provides information about the YANG files that the server supports, including their versions. This request can be fulfilled without requiring a connection to the telemetry or inventory.

Request Example:

gnmic -a localhost:9339 capability

Response Example:

gNMI version: 1.3.0-2
supported models:
  - nvidia-ib-amber, Nvidia IB, 1.0.0
  - nvidia-ib-amber-ext, Nvidia IB, 1.0.0
  - nvidia-ib-amber-inventory-counters, Nvidia IB, 1.0.0
  - nvidia-ib-amber-port-counters, Nvidia IB, 1.0.0
supported encodings:
  - JSON
  - JSON_IETF


Supported Paths

Telemetry Request Path Construction

To construct a path for a telemetry request, follow these steps:

  1. Begin with "nvidia/ib".

  2. Specify sharding if desired. For example, to partition the data into 10 pieces and take the second partition, use 2/10.

  3. Specify the node_guid to select, using an asterisk (*) to select all nodes.

  4. Specify the desired ports for the selected nodes, using an asterisk (*) to select all ports.

  5. Select "amber" for amBER telemetry.

  6. Specify the desired counters group. If unknown, this step can be skipped.

  7. Specify the counter, using an asterisk (*) to select all the counters in the cache. If a counters group is used, it will return all counters in the specified group.
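
The steps above can be sketched as a small path builder that follows the path shape used in the gnmic examples in this document. The telemetry_path function is hypothetical. The optional sharding segment from step 2 is left out on purpose, because its exact syntax inside the path is not shown here.

```python
from typing import Optional


def telemetry_path(
    guid: str = "*",
    port: object = "*",
    group: Optional[str] = None,
    counter: str = "*",
) -> str:
    """Build a telemetry request path; '*' selects all nodes, ports, or counters."""
    parts = [
        "nvidia/ib",
        f"guid[guid={guid}]",
        f"port[port_number={port}]",
        "amber",
    ]
    if group:  # the counters group is optional (step 6)
        parts.append(group)
    parts.append(counter)
    return "/".join(parts)


print(telemetry_path("0x5255456", 2, "port_counters", "hist0"))
print(telemetry_path())
```

The first call yields the same path as the single-port Get example in this document; the second selects every counter on every port of every node.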

Other Information Requests (Events, Inventory)

  1. Begin with "nvidia/ib".

  2. Specify inventory or events.

Switch Rank Information Path Construction

To construct a path for switch rank information, follow these steps:

  1. Begin with "nvidia/ib".

  2. Specify the node_guid to select, using an asterisk (*) to select all nodes.

  3. Select "amber" for amBER telemetry.

  4. Use Switch_rank as the counter name.

Telemetry Messages - Data Format

Telemetry messages consist of two key components: Headers and Values, both representing telemetry data in a CSV format.

  • Headers: Provided in full in the first message. In a subscribe request, from the second message onward they are replaced by a hash string (the HeaderID) to reduce message size.

  • Values: Each value begins with a timestamp, followed by the node_guid and port number, and then the counter value in the same order as the headers. If a counter is not present for a node, it will be empty in the message.

In on-change subscribe messages, only nodes with changes and their corresponding modified values are included. All other counters for that node will remain empty.

Request Example:

gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/hist0 --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/hist1 -i 30s

Response Example:

[
  {
    "source": "localhost:9339",
    "subscription-name": "default-1690282472",
    "timestamp": 1690282475124352000,
    "time": "2023-07-25T13:54:35.124352063+03:00",
    "updates": [
      {
        "Path": "nvidia/ib/amber/reply/sample",
        "values": {
          "nvidia/ib/amber/reply/sample": {
            "Headers": "timestamp,guid,port,hist0,hist1",
            "HeaderID": "5246201354",
            "Values": [
              "240771222771818,0x8168793592c6a790,1,,2",
              "240771222771818,0x47a67159c915493f,1,1,2",
              "240771222771818,0x667203ac69f3f2bf,1,2,",
              "240771222771818,0x113cd807bfed3853,1,0,"
            ]
          }
        }
      }
    ]
  }
]

From the second message onward, the headers are replaced by their hash value (the HeaderID).
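
On the client side, each entry in "Values" can be paired with the header names to recover named fields. The following is a minimal sketch, assuming the CSV layout shown in the response example; parse_row is a hypothetical helper, and representing empty fields as None is a choice made for this illustration.

```python
from typing import Dict, List, Optional, Union


def parse_row(headers: Union[str, List[str]], row: str) -> Dict[str, Optional[str]]:
    """Pair one CSV row from "Values" with its header names.

    Empty fields (counter not present or unchanged) become None.
    """
    names = headers.split(",") if isinstance(headers, str) else list(headers)
    cells = row.split(",")
    if len(cells) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(cells)}")
    return {name: (cell if cell != "" else None) for name, cell in zip(names, cells)}


headers = "timestamp,guid,port,hist0,hist1"
row = "240771222771818,0x8168793592c6a790,1,,2"
print(parse_row(headers, row))
```

For the first row of the response example, hist0 is empty and therefore maps to None, while hist1 maps to "2".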

GET Request

The Get request retrieves data at a specified path. If the telemetry contains no data, the server returns an empty response. Otherwise, it returns the counters it can locate.

Example:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0

The request retrieves the hist0 counter for port 2 of node_guid 0x5255456.

Example 2:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/hist0

The request retrieves the hist0 counter for all ports of all node_guids.

Example 3:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/*

The request retrieves all counters for port 2 of node_guid 0x5255456.

Example for multi path:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/CableInfo.transmitter_technology --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/sel_gctrln_en_5_lane0 --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/num_plls_7nm --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/rcal_fsm_done --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/LinkErrorRecoveryCounterExtended --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/sel_enc2_ib0_lane2 --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/lockdet_err_cnt_unlocked_sticky

Response Example:

[
  {
    "source": "localhost:9339",
    "timestamp": 1719232374915165200,
    "time": "2024-06-24T15:32:54.915165166+03:00",
    "updates": [
      {
        "Path": "nvidia/ib/amber/reply",
        "values": {
          "nvidia/ib/amber/reply": {
            "Headers": [
              "timestamp",
              "Node_GUID",
              "Port_Number",
              "CableInfo.transmitter_technology",
              "sel_gctrln_en_5_lane0",
              "num_plls_7nm",
              "rcal_fsm_done",
              "LinkErrorRecoveryCounterExtended",
              "sel_enc2_ib0_lane2",
              "lockdet_err_cnt_unlocked_sticky"
            ],
            "Values": [
              "1719232345757948,0x91f87bf42deb3e03,1,5091,7826,6290,8615,4247,8586,6214",
              "1719232345757948,0x7b8c2e08907250ce,1,2891,3293,5774,4398,3681,3548,7408",
              "1719232345757948,0x48b60e6f3670eaca,1,9477,3847,1184,5527,4783,2102,8192",
              "1719232345757948,0xabccdad7f8a3eda6,1,7976,6143,8257,3770,6166,6690,2835",
              "1719232345757948,0x6d9ec4bb5fa45736,1,9051,2982,7145,3604,9256,1061,2638",
              "1719232345757948,0x028cf9e0f9ed7c32,1,5623,7483,2263,2265,6890,4875,5564",
              "1719232345757948,0x92a984c1a491b72a,1,6732,7795,6411,8569,3370,705,5536",
              "1719232345757948,0x8b4b404acd2f34da,1,7610,7128,10064,1880,4834,3411,6724",
              "1719232345757948,0x20f92ed58991d56c,1,6805,1632,5407,2038,1865,7279,8350",
              "1719232345757948,0x1dac004a426bb5f5,1,8351,5757,7925,6181,3260,3081,1554"
            ]
          }
        }
      }
    ]
  }
]


Subscribe Stream Request

The Subscribe request, similar to the get request, provides data from the specified path. When the telemetry is empty, the server responds with an empty result. If data is available, the server responds with the retrieved counters. The stream delivers information at the specified interval. If no interval is specified, the server transmits the information at the default server rate, which is configurable and defaults to 10s.

Example:

gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0 -i 30s

This request retrieves data from node_guid 0x5255456, port 2, where the requested counter is hist0 and the interval is 30 seconds. To test the stream, set the subscription mode to "once"; the stream stops after a single response.

Example:

gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0 -i 30s --mode once

This request retrieves the data from node_guid 0x5255456, port 2, where the request counter is hist0. The stream shuts down after one response, similar to a Get request.

Example:

gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/* -i 10s

The first two notifications from the server are as follows:

{
  "source": "localhost:9339",
  "subscription-name": "default-1719233128",
  "timestamp": 1719233128171946500,
  "time": "2024-06-24T15:45:28.171946518+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/sample",
      "values": {
        "nvidia/ib/amber/reply/sample": {
          "HeaderID": "970426048",
          "Headers": [
            "timestamp",
            "Node_GUID",
            "Port_Number",
            "Counter1",
            "Counter2",
            "Counter3",
            "Counter4",
            "Counter5",
            "Counter6",
            "Counter7"
          ],
          "Values": [
            "1719232345757948,0x91f87bf42deb3e03,1,5091,7826,6290,8615,4247,8586,6214",
            "1719232345757948,0x7b8c2e08907250ce,1,2891,3293,5774,4398,3681,3548,7408",
            "1719232345757948,0x1dac004a426bb5f5,1,8351,5757,7925,6181,3260,3081,1554",
            "1719232345757948,0x48b60e6f3670eaca,1,9477,3847,1184,5527,4783,2102,8192",
            "1719232345757948,0xabccdad7f8a3eda6,1,7976,6143,8257,3770,6166,6690,2835",
            "1719232345757948,0x6d9ec4bb5fa45736,1,9051,2982,7145,3604,9256,1061,2638",
            "1719232345757948,0x028cf9e0f9ed7c32,1,5623,7483,2263,2265,6890,4875,5564",
            "1719232345757948,0x92a984c1a491b72a,1,6732,7795,6411,8569,3370,705,5536",
            "1719232345757948,0x8b4b404acd2f34da,1,7610,7128,10064,1880,4834,3411,6724",
            "1719232345757948,0x20f92ed58991d56c,1,6805,1632,5407,2038,1865,7279,8350"
          ]
        }
      }
    }
  ]
}
{
  "source": "localhost:9339",
  "subscription-name": "default-1719233128",
  "timestamp": 1719233138173907700,
  "time": "2024-06-24T15:45:38.173907825+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/sample",
      "values": {
        "nvidia/ib/amber/reply/sample": {
          "HeaderID": "970426048",
          "Values": [
            "1719232345757948,0x20f92ed58991d56c,1,6805,1632,5407,2038,1865,7279,8350",
            "1719232345757948,0x1dac004a426bb5f5,1,8351,5757,7925,6181,3260,3081,1554",
            "1719232345757948,0x48b60e6f3670eaca,1,9477,3847,1184,5527,4783,2102,8192",
            "1719232345757948,0xabccdad7f8a3eda6,1,7976,6143,8257,3770,6166,6690,2835",
            "1719232345757948,0x6d9ec4bb5fa45736,1,9051,2982,7145,3604,9256,1061,2638",
            "1719232345757948,0x028cf9e0f9ed7c32,1,5623,7483,2263,2265,6890,4875,5564",
            "1719232345757948,0x92a984c1a491b72a,1,6732,7795,6411,8569,3370,705,5536",
            "1719232345757948,0x8b4b404acd2f34da,1,7610,7128,10064,1880,4834,3411,6724",
            "1719232345757948,0x91f87bf42deb3e03,1,5091,7826,6290,8615,4247,8586,6214",
            "1719232345757948,0x7b8c2e08907250ce,1,2891,3293,5774,4398,3681,3548,7408"
          ]
        }
      }
    }
  ]
}
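
Because only the first notification carries the full header list, a client has to remember the headers per HeaderID in order to decode later notifications. The sketch below shows one possible way to do that; the HeaderCache class is hypothetical and operates on the inner reply object (the value under "nvidia/ib/amber/reply/sample") after JSON decoding.

```python
from typing import Dict, List


class HeaderCache:
    """Remember header lists by HeaderID so later notifications can be decoded."""

    def __init__(self) -> None:
        self._by_id: Dict[str, List[str]] = {}

    def decode(self, reply: dict) -> List[Dict[str, str]]:
        header_id = reply["HeaderID"]
        if "Headers" in reply:
            headers = reply["Headers"]
            # Headers appear either as a list or as one comma-separated string.
            if isinstance(headers, str):
                headers = headers.split(",")
            self._by_id[header_id] = list(headers)
        # Raises KeyError if this HeaderID was never seen with full headers.
        names = self._by_id[header_id]
        return [dict(zip(names, row.split(","))) for row in reply["Values"]]


cache = HeaderCache()
first = {
    "HeaderID": "970426048",
    "Headers": ["timestamp", "Node_GUID", "Port_Number", "Counter1"],
    "Values": ["1719232345757948,0x91f87bf42deb3e03,1,5091"],
}
second = {
    "HeaderID": "970426048",
    "Values": ["1719232345757948,0x20f92ed58991d56c,1,6805"],
}
print(cache.decode(first))
print(cache.decode(second))
```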


Subscribe On-Change Request

The subscribe on-change request, similar to the standard subscribe request, provides data from the specified path. If the telemetry lacks data, the server responds with an empty result. When data is available, the server responds with the located counters.

The stream delivers information at the specified interval. If no changes occurred between heartbeats, all cached data will be transmitted. However, if a change occurred and was pushed to the client, no data will be transmitted during the heartbeat.

The path construction follows the same pattern as the get request and includes inventory and event paths. Only updated data will be included in the response, while all other parts remain empty but retain the specified format. Similarly, only the nodes that have been updated will be included in the response.

Example:

gnmic -a localhost:9339 --insecure  sub --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0  --stream-mode on-change --heartbeat-interval 1m

This request retrieves data from node_guid 0x5255456, port 2, with the request counters set to hist0. It periodically checks for changes every minute, and when changes are detected, it promptly sends the updated values.

Example:

gnmic -a localhost:9339 --insecure  sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/* --stream-mode on-change --heartbeat-interval 1m

This request involves all nodes and ports, aiming to retrieve all counters from the telemetry. It periodically checks for changes every minute, and when changes are detected, it promptly sends the updated values.

Below is an example of a response row for a particular GUID in an on-change request for several counters. Only some of the counters have been updated; the counters that have not changed are still returned with their previous value (0 in this example), because the include_old_data_on_change flag defaults to true.

1706532307824,0x0002c903007e5220,1,0,0,0,41447490564,617155163,41423305825,617155163,24184739,17,0,0,0,0,0

The same example with the flag set to false will give this:

1706532307824,0x0002c903007e5220,1,,,,41447490564,617155163,41423305825,617155163,24184739,17,,,,,

Only the values that have changed are returned, while the others are left empty. To get this data format, set include_old_data_on_change to false in the config file.
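
The effect of include_old_data_on_change on a single row can be modeled as follows. This is a simplified sketch of the behavior described above, not the server's implementation; on_change_row is a hypothetical name, and the function looks only at the counter columns (timestamp, GUID, and port are left out).

```python
from typing import List, Optional


def on_change_row(
    previous: List[str],
    current: List[str],
    include_old_data: bool = True,
) -> Optional[List[str]]:
    """Return the counter fields to send for one port, or None if nothing changed."""
    if previous == current:
        return None  # unchanged rows are not part of an on-change update
    if include_old_data:
        # Unchanged counters keep their previous (identical) value.
        return list(current)
    # Unchanged counters are sent as empty fields.
    return [cur if cur != prev else "" for prev, cur in zip(previous, current)]


prev = ["0", "617155163", "17", "0"]
curr = ["0", "617155200", "18", "0"]
print(",".join(on_change_row(prev, curr, include_old_data=True)))
print(",".join(on_change_row(prev, curr, include_old_data=False)))
```

With the flag enabled, the row still shows 0 for the unchanged counters; with it disabled, those positions are empty.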

Example:

gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/* --stream-mode on-change --heartbeat-interval 24h

The server's first two notifications are shown below (with include_old_data_on_change set to true). The last two columns have not changed but still return their previous values. The second message was sent because some rows changed, and it contains only those rows.

{
  "source": "localhost:9339",
  "subscription-name": "default-1719236764",
  "timestamp": 1719236764654659600,
  "time": "2024-06-24T16:46:04.654659517+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/onchange",
      "values": {
        "nvidia/ib/amber/reply/onchange": {
          "HeaderID": "912200528",
          "Headers": [
            "timestamp",
            "Node_GUID",
            "Port_Number",
            "Counter1",
            "Counter2",
            "Counter3",
            "Counter4",
            "Counter5",
            "Counter6",
            "Counter7"
          ],
          "Values": [
            "1719236753818594,0x7e680fb8f81a1950,1,100531,107250,100999,107455,109258,3716,5329",
            "1719236753818594,0x0176438fe4ee507c,1,104269,108884,104887,108502,105366,4540,6673",
            "1719236753818594,0x2e36224302959e79,1,101228,100555,105616,102767,108899,87,9953",
            "1719236753818594,0x8e62a55d7571a9b8,1,100684,108124,106670,102400,106689,2910,4203",
            "1719236753818594,0x0be75a9e97016f5e,1,102227,102735,108903,103547,108705,2629,1830",
            "1719236753818594,0x8307bfad0672adbd,1,106033,103906,106185,107450,105736,2567,6914",
            "1719236753818594,0x2cbe66ec0b1af84c,1,105958,106959,100349,107704,105073,8330,4962",
            "1719236753818594,0x6b6da39a9ec4bbfc,1,104340,106752,109134,103796,103500,7136,3493",
            "1719236753818594,0x6d122dbdd99cfb60,1,104941,107630,104190,105392,109582,5480,7934",
            "1719236753818594,0xeed4bd9cd3b7f325,1,102416,100164,106731,102033,103807,3048,6316"
          ]
        }
      }
    }
  ]
}
{
  "source": "localhost:9339",
  "subscription-name": "default-1719236764",
  "timestamp": 1719237054620929500,
  "time": "2024-06-24T16:50:54.620929561+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/onchange",
      "values": {
        "nvidia/ib/amber/reply/onchange": {
          "HeaderID": "912200528",
          "Values": [
            "1719237054172043,0xeed4bd9cd3b7f325,1,117416,115164,121731,117033,118807,3048,6316",
            "1719237054172043,0x2e36224302959e79,1,116228,115555,120616,117767,123899,87,9953",
            "1719237054172043,0x8e62a55d7571a9b8,1,115684,123124,121670,117400,121689,2910,4203",
            "1719237054172043,0x7e680fb8f81a1950,1,115531,122250,115999,122455,124258,3716,5329",
            "1719237054172043,0x0176438fe4ee507c,1,119269,123884,119887,123502,120366,4540,6673"
          ]
        }
      }
    }
  ]
}
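In the two notifications above, the second message carries the same HeaderID but no Headers list. The sketch below shows one way a client might cope with that, assuming that headers are sent once per HeaderID and can be cached for later messages. This is inferred from the example, not from a documented guarantee. The decode_update function and header_cache dictionary are hypothetical names, and the payloads are trimmed copies of the example.

```python
import json

header_cache = {}  # HeaderID -> list of column names


def decode_update(payload):
    """Return the rows of one on-change payload as dicts keyed by column name."""
    header_id = payload["HeaderID"]
    if "Headers" in payload:
        header_cache[header_id] = payload["Headers"]
    headers = header_cache[header_id]  # KeyError if headers were never seen
    return [dict(zip(headers, line.split(","))) for line in payload["Values"]]


# Trimmed versions of the two notifications (one counter, one row each).
first = json.loads("""{"HeaderID": "912200528",
  "Headers": ["timestamp", "Node_GUID", "Port_Number", "Counter1"],
  "Values": ["1719236753818594,0x7e680fb8f81a1950,1,100531"]}""")
second = json.loads("""{"HeaderID": "912200528",
  "Values": ["1719237054172043,0x7e680fb8f81a1950,1,115531"]}""")

decode_update(first)
rows = decode_update(second)
print(rows[0]["Counter1"])  # decoded with the cached headers
```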


Inventory Requests

Inventory messages are conveyed in separate updates and present the inventory details of the UFM associated with the provided IP. They include the total count of each component type within the UFM (switches, routers, servers, and more), the number of active ports, and the total number of ports, including disabled ones. Inventory responses also include the size of the telemetry, which is not always the same as the number of active ports. If the plugin cannot contact the UFM, it falls back to the default values defined in the configuration file. The path for inventory requests differs from the conventional path structure because it does not depend on specific nodes or ports: the inventory path is placed directly after "nvidia/ib".

Example:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/inventory/*

Response:

[
  {
    "source": "localhost:9339",
    "timestamp": 1698824237536878000,
    "time": "2023-11-01T09:37:17.536878067+02:00",
    "updates": [
      {
        "Path": "nvidia/ib/inventory",
        "values": {
          "nvidia/ib/inventory": {
            "ActivePorts": 4,
            "Cables": 2,
            "Gateways": 0,
            "HCAs": 2,
            "Routers": 0,
            "Servers": 2,
            "Switches": 1,
            "TotalPorts": 38,
            "TelemetrySize": 4,
            "timestamp": 1698824211535069000
          }
        }
      }
    ]
  }
]
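As an illustration of how a script might consume this response, the Python sketch below pulls the inventory object out of the gnmic JSON output and derives the number of ports that are not active. The inventory_of helper is hypothetical, and the embedded response is a shortened copy of the one above.

```python
import json

# Shortened copy of the response shown above.
response = json.loads("""[{"updates": [{"Path": "nvidia/ib/inventory",
  "values": {"nvidia/ib/inventory": {"ActivePorts": 4, "TotalPorts": 38,
                                     "Switches": 1, "TelemetrySize": 4}}}]}]""")


def inventory_of(resp):
    """Return the inventory dict from a gnmic get response, or None."""
    for message in resp:
        for update in message.get("updates", []):
            values = update.get("values", {})
            if "nvidia/ib/inventory" in values:
                return values["nvidia/ib/inventory"]
    return None


inv = inventory_of(response)
inactive = inv["TotalPorts"] - inv["ActivePorts"]
print(f"{inv['ActivePorts']} active, {inactive} inactive ports")
```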


Events Requests

Event messages are provided in separate updates and describe the events occurring within the UFM associated with the specified IP. Because the event metadata is the same for every event in a request, the message uses a CSV-like structure: the Headers section lists the event field names, and the Values section contains the raw event data. Users can subscribe to these events with the on-change feature enabled, receiving only the events triggered within the subscription interval. The path for event requests differs from the typical node- or port-based structure: it is placed directly after "nvidia/ib".

Example:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/events/*

Response:

[
  {
    "source": "localhost:9339",
    "timestamp": 1698824809647515600,
    "time": "2023-11-01T09:46:49.647515575+02:00",
    "updates": [
      {
        "Path": "nvidia/ib/events",
        "values": {
          "nvidia/ib/events": {
            "Headers": [
              "id",
              "object_name",
              "write_to_syslog",
              "description",
              "type",
              "event_type",
              "severity",
              "timestamp",
              "counter",
              "category",
              "object_path",
              "name"
            ],
            "Values": [
              "7718,Grid,false,Disk space usage in /opt/ufm/files/log is above the threshold of 90.0%.,Grid,525,Critical,2023-11-01 07:25:54,N/A,Maintenance,Grid,Disk utilization threshold reached",
              "7717,Grid,false,Disk space usage in /opt/ufm/files/log is above the threshold of 90.0%.,Grid,525,Critical,2023-11-01 07:24:54,N/A,Maintenance,Grid,Disk utilization threshold reached",
              "7716,Grid,false,Disk space usage in /opt/ufm/files/log is above the threshold of 90.0%.,Grid,525,Critical,2023-11-01 07:23:54,N/A,Maintenance,Grid,Disk utilization threshold reached",
              "7491,ec0d9a0300d42e54,false,Mcast group is deleted: ff12601bffff0000, 00000002,Computer,67,Info,2023-10-31 06:39:21,N/A,Fabric Notification,default / Computer: r-ufm59,MCast Group Deleted"
            ]
          }
        }
      }
    ]
  }
]
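Note that the description field in the last row above contains a comma, so a plain split on "," yields 13 fields for 12 headers. The Python sketch below works around this by splitting from both ends, on the assumption that description is the only column with unescaped commas. The source does not state this assumption, and the split_event_row helper is hypothetical.

```python
def split_event_row(row, headers, free_text="description"):
    """Split an event row whose free-text column may contain commas."""
    idx = headers.index(free_text)
    after = len(headers) - idx - 1
    head = row.split(",", idx)            # last element holds the remainder
    tail = head.pop().rsplit(",", after)  # first element is the free text
    return dict(zip(headers, head + tail))


headers = ["id", "object_name", "write_to_syslog", "description", "type",
           "event_type", "severity", "timestamp", "counter", "category",
           "object_path", "name"]
row = ("7491,ec0d9a0300d42e54,false,"
       "Mcast group is deleted: ff12601bffff0000, 00000002,"
       "Computer,67,Info,2023-10-31 06:39:21,N/A,Fabric Notification,"
       "default / Computer: r-ufm59,MCast Group Deleted")
event = split_event_row(row, headers)
print(event["description"])
print(event["severity"], event["name"])
```

If other columns can also contain commas, this approach is not sufficient and the row cannot be split unambiguously without quoting.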


Switch Rank Requests

Switch rank updates are conveyed in separate messages, presenting the rank of the switches in the UFM. This data is derived from a file in the UFM and is updated by the server every 6 hours by default. The switch_rank counter is associated only with switch-level data, so there is no need to specify a port in the path. However, this counter is not connected to the telemetry cache of switch-level data. Note that if the ufm_ip is changed, the switch_rank information will not be available.

Example:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=*]/amber/switch_rank

Response:

{
  "source": "localhost:9339",
  "timestamp": 1719296207323383300,
  "time": "2024-06-25T09:16:47.323383222+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/guid[guid=*]/amber/amber/switch_rank",
      "values": {
        "nvidia/ib/guid/amber/amber/switch_rank": {
          "Headers": "Timestamp,Node_GUID,switch_rank",
          "Values": [
            "1719296205612,0x0002c903007e5220,0"
          ]
        }
      }
    }
  ]
}


UFM Health KPI Requests

UFM Health KPI messages are provided in separate updates and report the health metrics of the UFM associated with the specified IP. The response value is a single Prometheus-formatted string. Users can subscribe to the UFM Health KPIs with the on-change feature enabled; a change in any one item causes the whole set of UFM Health metrics to be sent. The path for UFM Health KPI requests differs from the typical node- or port-based structure: it is placed directly after "nvidia/ib".

Example:

gnmic -a localhost:9339 --insecure get --path nvidia/ib/ufm_health_kpi/*

Response:

{
  "source": "localhost:9339",
  "timestamp": 1719296207323383300,
  "time": "2024-06-25T09:16:47.323383222+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/ufm_health_kpi",
      "values": {
        "nvidia/ib/ufm_health_kpi": {
          "value": "# HELP server_cpu_usage_percent_avg Average of Server CPU usage percent
					# TYPE server_cpu_usage_percent_avg gauge
					server_cpu_usage_percent_avg{duration="Last 5 minutes"} 1.5545454545454547
					server_cpu_usage_percent_avg{duration="Last 1 hour"} 1.4975206611570255
					server_cpu_usage_percent_avg{duration="Last 24 hour"} 1.505277777777778
...
					events_history_counter{duration="Last week",event_name="Director Switch is Down"} 0.0
					events_history_counter{duration="Last week",event_name="Node is Up"} 0.0
					events_history_counter{duration="Last week",event_name="Node is Down"} 0.0
					events_history_counter{duration="Last week",event_name="Link is Up"} 0.0
					events_history_counter{duration="Last week",event_name="Link is Down"} 0.0",
        }
      }
    }
  ]
}
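Because the value arrives as one Prometheus-formatted string, a client has to split it into individual samples itself. The Python sketch below is a deliberately simplified parser for lines of the form shown above (name, optional labels, value). It is not a full implementation of the Prometheus exposition format: it ignores escaped quotes in label values and metric names containing colons, and the parse_metrics name is hypothetical.

```python
import re

SAMPLE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line; skip # comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, dict(LABEL.findall(labels or "")), float(value)


text = '''# TYPE server_cpu_usage_percent_avg gauge
server_cpu_usage_percent_avg{duration="Last 5 minutes"} 1.5545454545454547
events_history_counter{duration="Last week",event_name="Link is Down"} 0.0'''
samples = list(parse_metrics(text))
print(samples[0][1]["duration"], samples[0][2])
```

For production use, an existing Prometheus text-format parser is likely a safer choice than a hand-written regular expression.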

UFM Telemetry Notification Subscription

The gNMI plugin includes a built-in Telemetry Notification Server that enables event-driven data synchronization between UFM Telemetry endpoints and the gNMI server. This real-time communication complements the existing periodic telemetry fetching mechanism controlled by the telemetry_interval parameter. See Telemetry Configurations.

The Telemetry Notification Server allows UFM Telemetry to push updates to the gNMI server immediately when new data becomes available, reducing latency and enhancing responsiveness compared to periodic polling alone.

UFM Telemetry Integration

To enable UFM Telemetry to notify the gNMI server when new data is ready, add the following configuration line to the UFM Telemetry .ini file (for example, /opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini):

plugin_env_UFM_TELEMETRY_NOTIFY_ENDPOINTS=http://localhost:9338/telemetry/notify/<Telemetry_HTTP_PORT>

Replace <Telemetry_HTTP_PORT> with the actual HTTP port of the telemetry endpoint (e.g., 9002 for a secondary telemetry instance).

After updating the configuration, restart the UFM Telemetry service to apply the changes.

# For UFM Bare-metal
/etc/init.d/ufmd ufm_telemetry_restart
# For UFM Docker
docker exec ufm /etc/init.d/ufmd ufm_telemetry_restart

Troubleshooting

#

Use Case

Result

Root Cause

1

UFM restarts

No gNMI data is streamed using gnmic. The notifications include only headers with no data.

The gNMI inner cache is empty due to unresponsive telemetry
