Date: 2025-09-22
On This Page
- Deployment
- Configurations
- Authentication
- Secure Server using mTLS and Certificate Subject Identifier
- Role-Based Access Control
- Supported API Requests
- Capability Request
- Supported Paths
- Telemetry Messages - Data Format
- GET Request
- Subscribe Stream Request
- Subscribe On-Change Request
- Inventory Requests
- Events Requests
- Switch Rank Requests
- UFM Health KPI Requests
- UFM Telemetry Notification Subscription
- Troubleshooting
GNMI-Telemetry Plugin
The GNMI Telemetry Plugin is a server that uses the gNMI protocol to stream data from UFM telemetry. Users can select the data to stream, specify intervals, and choose to include only deltas (on-change mode).
The server supports three functions: Capability, Get, and Subscribe.
Data Streaming: The streamed data is delivered in CSV format. Headers are provided in full in the first message; by default, subsequent messages carry only the header ID. Data is presented in hex format to conserve space for unchanged data. Values are delivered as an array of strings, each string holding one row identified by a GUID and port. Depending on the mode, rows may be missing for GUID and port pairs that have no changes.
Metadata Streaming: The plugin can also stream UFM metadata, providing an inventory of the UFM. For convenience, the examples use the gNMIc client, but any gNMI client can be used.
Configuration and Polling Intervals: The polling intervals for each server cache are configurable with the following defaults:
Telemetry: every 5 minutes
Inventory: every minute
Events: every minute
Switch rank: every 6 hours
UFM Health KPI: every 5 minutes
The service supports telemetry from switch-level data (fset) and port-level data (xcset), querying the low_freq_debug xcset by default. Multiple telemetries can be polled simultaneously.
Data Sharding: The service supports sharding the cache data on request, allowing many clients to request the same data while each receives a different part.
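As an illustration of the idea only, the sketch below assigns rows to shards by hashing the node GUID and, optionally, the port number. The hash function, the name shard_of, and the 1-based shard numbering are assumptions made for this example; the plugin's actual sharding algorithm is not described on this page.

```python
import zlib

def shard_of(guid: str, port: int, total: int, include_port: bool = False) -> int:
    """Return a 1-based shard index for a (guid, port) row.

    Hypothetical scheme: CRC32 of the key modulo the number of shards.
    """
    key = f"{guid}:{port}" if include_port else guid
    return zlib.crc32(key.encode()) % total + 1

rows = [
    ("0x8168793592c6a790", 1),
    ("0x8168793592c6a790", 2),
    ("0x47a67159c915493f", 1),
]

# Without the port in the key, all ports of a node land in the same shard.
shards = [shard_of(guid, port, total=10) for guid, port in rows]
```

With include_port=True the two ports of the first node may fall into different shards, which is the trade-off the corresponding server option describes.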
Deployment
To deploy the plugin with UFM (SA or HA):
Install the latest version of UFM.
Run UFM with /etc/init.d/ufmd start.
Pull the plugin image from Docker Hub.
Run /opt/ufm/scripts/manage_ufm_plugins.sh add -p gnmi_telemetry -t <version> to enable the plugin, or use the UFM UI to add the plugin via Settings → Plugin Management → right-click GNMI-telemetry → Add → select version → Add.
Check that the plugin is running with docker ps.
If the gNMI default port is unavailable, change it in the configuration file gnmi_telemetry.ini and restart the plugin.
Configurations
The /opt/ufm/files/conf/plugins/gnmi_telemetry/gnmi_telemetry.ini file centralizes the configuration of the GNMI-Telemetry Plugin, allowing users to customize logging, server behavior, telemetry intervals, security settings, and more. Below is an overview of the available configuration parameters, their default values, and their purpose within the plugin.
Common Configurations
Parameter
Description
Default Value
Sets the logging level for the plugin (e.g., INFO, DEBUG).
INFO
Maximum size of the log file before it rotates, measured in MB.
10
Number of backup log files to retain after rotation.
5
Full path of the log file
/log/gnmi_streaming.log
GNMI-Server Configurations
Parameter
Description
Default Value
Port on which the gRPC server listens for incoming connections.
9339
Specifies which CPU cores the server can use (comma-separated for multiple cores, e.g., 3,5,6).
3
Data directory used by the server; it contains all the files the server generates.
Do not modify it when the gNMI server is running as a UFM plugin container.
/data
Determines whether the port number is included in the sharding algorithm. If it is included, ports of the same node may not be in the same shard.
false
Includes previous data when sending notifications in
true
Enforces strict rules for collecting counters based on client requests.
false
Disables the collection and delivery of events and inventory data.
false
In Subscribe on-change mode, setting this flag to
false
In Subscribe mode, headers are included only in the first update by default. If set to true, headers will be included in every update.
false
If true, fetches telemetry serially (one by one) from all the telemetry endpoints.
false
The gNMI telemetry fetch interval when applying serial telemetry fetch.
300s
Number of retry attempts for failed REST API calls
3
Exponential backoff duration between REST call retry attempts (e.g. 2s, 4s, 6s)
2s
If false, the cache will be cleaned up according to the cleanup policy. If true, the cache will not be cleaned up.
false
The multiplier for the stale time of a port. The stale time is the time after which a port is considered stale and is removed from the cache. It is calculated as the maximum telemetry interval across all endpoints multiplied by the stale time multiplier; in other words, the multiplier is the number of telemetry fetch iterations to wait before a port is considered stale and removed from the cache.
3
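The stale time rule in the last row can be sketched as a short calculation. This is a minimal illustration, assuming intervals are written with an "s" suffix as in the defaults on this page; the function names are hypothetical and not part of the plugin.

```python
def parse_seconds(value: str) -> int:
    """Parse an interval such as '300s' into seconds (only the 's' suffix is handled)."""
    return int(value.rstrip("s"))

def stale_time_seconds(telemetry_intervals, multiplier=3):
    """Stale time = largest telemetry interval across endpoints * multiplier."""
    return max(parse_seconds(v) for v in telemetry_intervals) * multiplier

# Two endpoints polled every 300s and 60s, with the default multiplier of 3.
stale = stale_time_seconds(["300s", "60s"], multiplier=3)
```

Here a port that has not appeared for 900 seconds (three fetch iterations of the slowest endpoint) would be treated as stale.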
HTTP Server Configurations
The gNMI plugin includes a built-in HTTP Server that enables event-driven data synchronization between UFM Telemetry endpoints and the gNMI server. This real-time communication complements the existing periodic telemetry fetching mechanism controlled by the telemetry_interval parameter.
Parameter
Description
Default Value
Port for the notification server to listen on
9338
Telemetry Configurations
Parameter
Description
Default Value
Path for the telemetry notification HTTP endpoint
/telemetry/notify/
Throttling interval - minimum time between processing notifications (in seconds)
10s
For more details about the real-time data synchronization between UFM Telemetry and gNMI server and how to enable it in Telemetry, please refer to section UFM Telemetry Notification Subscription.
Time-Intervals Configurations
Parameter
Description
Default Value
Time interval for collecting events, specified in seconds (e.g.,
60s
Time interval for collecting inventory data, specified in seconds (e.g.,
60s
Minimum sampling interval for telemetry notifications, specified in seconds.
10s
Timeout for REST API calls, specified in seconds.
30s
Interval for monitoring changes in the switch rank datasource file, specified in hours.
6h
Interval for monitoring system health metrics, specified in seconds.
300s
Telemetry Cluster-Specific Configurations
Each cluster should have its own section named cluster-config-$cluster_name, for example [cluster-config-low_freq_debug]. You can add multiple sections to collect data from multiple clusters/telemetries simultaneously.
By default, the plugin comes with a single cluster to collect data from http://127.0.0.1:9002/csv/xcset/low_freq_debug.
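To illustrate the section naming convention, the sketch below lists the cluster names found in an .ini text using Python's configparser. The sample text is invented for the example, second_cluster is a made-up cluster name, and the sections are left empty because the parameter keys are not reproduced here.

```python
import configparser

# Invented sample; only the section names matter for this sketch.
SAMPLE = """
[cluster-config-low_freq_debug]
# cluster parameters go here

[cluster-config-second_cluster]
# 'second_cluster' is a made-up name for illustration
"""

PREFIX = "cluster-config-"

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Keep only sections that follow the cluster-config-<name> convention.
cluster_names = [s[len(PREFIX):] for s in parser.sections() if s.startswith(PREFIX)]
```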
Parameter
Description
Default Value
URL for the telemetry endpoint for the cluster.
Default identifier columns applied to all rows (ports) for telemetry data.
Node_GUID,Port_Number
Telemetry column used to determine identification schemes based on its value.
N/A
Identifier columns for rows where
N/A
Enables
false
Interval for sending telemetry data to the endpoint, specified in seconds.
300s
Columns to exclude from telemetry data, separated by commas.
port_guid
UFM Configurations
Parameter
Description
Default Value
Default inventory values if UFM is unavailable, specified as a JSON string.
{"Servers":8,"Switches":4,"HCAs":4,"ActivePorts":16}
IP address of the UFM instance used for inventory data.
127.0.0.1
Access token for authenticating with UFM if running on a different host.
N/A
Interval for refreshing the UFM users and roles cache. Increase this value if changes to UFM user or roles are infrequent.
10 minutes
File path for the switch rank datasource used by UFM.
/opt/ufm/files/log/opensm-smdb.dump
GNMI-Security Configurations
Parameter
Description
Default Value
Enables secure mode for gNMI.
true
Specifies the certificate subject identifier (SAN or CN).
SAN
Comma-separated list of UFM user roles that are authorized to access gNMI. If empty, all users are allowed access.
Time of day for the initial certificate validation check, specified as HH:mm.
01:30
Interval for periodic certificate validation checks, specified in hours.
12h
XDR Configurations
Parameter
Description
Default Value
Enables XDR Mode Setup
false
Types of XDR ports to collect, separated by commas (e.g.,
legacy,aggregated,plane
Authentication
The server's authentication is determined by the gNMI protocol. Two configurable items are relevant to authentication: the UFM Telemetry URL and the UFM inventory IP.
The UFM telemetry URL does not require authentication, so only the URL itself needs to be provided.
The inventory is sourced from the UFM on the local host by default, but the config file can point it to a different machine. In that case, an access token for that machine is required.
Secure Server using mTLS and Certificate Subject Identifier
The gNMI server can be secured using certificates. To secure the server, set the secure_mode_enabled flag to true in the configuration (the default is true).
The certificate must be placed under the /opt/ufm/files/conf/webclient folder; the location can be changed by modifying the shared volume. The gNMI server periodically checks its certificates for updates, ensuring they remain up to date. The client certificate naming convention must align with the DNS name (SAN), as in UFM.
The gNMI plugin supports a configurable certificate subject identifier (the default value is SAN). Configure the certificate subject identifier under the gNMI-security section to be SAN or CN (Common Name). For example:
client_cert_subject_identifier=CN
Role-Based Access Control
The UFM gNMI plugin supports Role-Based Access Control (RBAC) to enable granular, user-based authorization for gNMI operations. By leveraging existing UFM user management and certificate-based (mTLS) authentication, the plugin enforces access policies according to user role.
Authentication and User Mapping
The gNMI server uses mTLS certificates for secure connections, leveraging the UFM certificate infrastructure.
Each UFM user is associated with an mTLS certificate by mapping the certificate subject identifier (SAN or CN) to a UFM username. The association information is stored in /opt/ufm/files/conf/webclient/ufm_client_authen.db.
Refer to Client-Based Authentication for configuring client authentication and user association in UFM.
Role Cache and Management
On plugin startup, the gNMI server queries the UFM Users API to obtain the current list of users and their assigned roles (groups).
The user-role mapping is cached to reduce API load. The refresh interval is configurable via the ufm_users_cache_ttl parameter (default: 10 minutes).
Any changes to UFM users or roles are reflected after the next cache refresh.
Configuring RBAC
Use the authorized_roles configuration option to control which UFM roles are permitted to access the gNMI server. Example:
authorized_roles=System_Admin,Monitoring_Only,Custom_Role1
Only users whose role matches one of the authorized roles will be allowed to access or execute gNMI operations.
Access Control Flow
Connection Attempt: A gNMI client initiates a connection using an mTLS certificate.
Certificate Validation: The gNMI server validates the client certificate.
Subject Identifier Extraction: The server extracts the subject identifier (as configured—SAN or CN) from the certificate.
User Association: The server uses the identifier to look up the UFM user in ufm_client_authen.db.
Role Lookup: The user's role is fetched from the cached UFM user-role mapping.
Authorization Check: Access is granted if the user's role matches one in authorized_roles; otherwise, it is denied.
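The access control flow above can be sketched roughly as follows. This is a simplified model, not the plugin's implementation: the two dictionaries are hypothetical in-memory stand-ins for ufm_client_authen.db and the cached user-role mapping, the certificate names, usernames, and Other_Role are invented, and rejecting an unmapped certificate even when authorized_roles is empty is an assumption.

```python
# Hypothetical stand-ins for ufm_client_authen.db and the cached role mapping.
CERT_TO_USER = {"client1.example.com": "alice", "client2.example.com": "bob"}
USER_ROLES = {"alice": "System_Admin", "bob": "Other_Role"}

def parse_roles(option: str) -> set:
    """Split the comma-separated authorized_roles value into a set of role names."""
    return {role.strip() for role in option.split(",") if role.strip()}

def is_authorized(subject_id: str, authorized_roles: str) -> bool:
    """Map the certificate subject identifier to a user, then check the user's role."""
    user = CERT_TO_USER.get(subject_id)
    if user is None:
        return False  # assumption: unmapped certificates are always rejected
    allowed = parse_roles(authorized_roles)
    if not allowed:
        return True  # an empty list allows all users
    return USER_ROLES.get(user) in allowed

ROLES = "System_Admin,Monitoring_Only,Custom_Role1"
admin_ok = is_authorized("client1.example.com", ROLES)
other_ok = is_authorized("client2.example.com", ROLES)
```

Certificate validation itself (the first two steps of the flow) is handled by the TLS layer and is not modeled here.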
Supported API Requests
The service supports the following requests:
Capability: Describes the YANG files the service supports (UFM telemetry).
Get: Requires legal paths; receives the cache data from the service.
Subscribe: Requires legal paths and an interval; receives cache data at the specified interval. The first message contains headers extracted from the path, and subsequent messages include only the header ID. In on-change subscribe mode, a heartbeat interval is provided instead of an interval. If no data changes during the heartbeat interval, no change notification is sent; instead, a full notification message, similar to the first message, is sent at the heartbeat. If some data changes, a notification of the change is sent and no heartbeat message is sent.
Capability Request
The capability request provides information about the YANG files that the server supports, including their versions. This request can be fulfilled without requiring a connection to the telemetry or inventory.
Request Example:
gnmic -a localhost:9339 capability
Response Example:
gNMI version: 1.3.0-2
supported models:
- nvidia-ib-amber, Nvidia IB, 1.0.0
- nvidia-ib-amber-ext, Nvidia IB, 1.0.0
- nvidia-ib-amber-inventory-counters, Nvidia IB, 1.0.0
- nvidia-ib-amber-port-counters, Nvidia IB, 1.0.0
supported encodings:
- JSON
- JSON_IETF
Supported Paths
Telemetry Request Path Construction
To construct a path for a telemetry request, follow these steps:
Begin with "nvidia/ib".
Specify sharding if desired. For example, to partition the data into 10 pieces and take the second partition, use 2/10.
Specify the node_guid to select, using an asterisk (*) to select all nodes.
Specify the desired ports for the selected nodes, using an asterisk (*) to select all ports.
Select "amber" for amBER telemetry.
Specify the desired counters group. If unknown, this step can be skipped.
Specify the counter, using an asterisk (*) to select all the counters in the cache. If a counters group is used, the asterisk returns all counters in the specified group.
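The steps above can be sketched as a small path builder. The helper name telemetry_path is hypothetical, and the segment layout is inferred from the gnmic examples on this page. The optional sharding segment is left out because its exact position in the path is not shown in those examples.

```python
def telemetry_path(guid="*", port="*", group=None, counter="*"):
    """Build a telemetry request path; '*' selects all nodes, ports, or counters."""
    parts = [
        "nvidia/ib",
        f"guid[guid={guid}]",
        f"port[port_number={port}]",
        "amber",
    ]
    if group:
        # The counters group is optional and can be skipped if unknown.
        parts.append(group)
    parts.append(counter)
    return "/".join(parts)

p1 = telemetry_path("0x5255456", 2, "port_counters", "hist0")
p2 = telemetry_path()
```

The resulting strings can be passed to a gNMI client as the request path, for example with the --path option of gnmic.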
Other Information Requests (Events, Inventory)
Begin with "nvidia/ib".
Specify inventory or events.
Switch Rank Information Path Construction
To construct a path for switch rank information, follow these steps:
Begin with "nvidia/ib".
Specify the node_guid to select, using an asterisk (*) to select all nodes.
Select "amber" for amBER telemetry.
Use switch_rank as the counter name.
Telemetry Messages - Data Format
Telemetry messages consist of two key components: Headers and Values, both representing telemetry data in a CSV format.
Headers: Provided in full in the first message of a subscribe request. From the second message onward, they are replaced by a string hash (the header ID) to reduce message size.
Values: Each value begins with a timestamp, followed by the node_guid and port number, and then the counter values in the same order as the headers. If a counter is not present for a node, it will be empty in the message.
In on-change subscribe messages, only nodes with changes and their corresponding modified values are included. All other counters for that node will remain empty.
Request Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/hist0 --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/hist1 -i 30s
Response Example:
[
  {
    "source": "localhost:9339",
    "subscription-name": "default-1690282472",
    "timestamp": 1690282475124352000,
    "time": "2023-07-25T13:54:35.124352063+03:00",
    "updates": [
      {
        "Path": "nvidia/ib/amber/reply/sample",
        "values": {
          "nvidia/ib/amber/reply/sample": {
            "Headers": "timestamp,guid,port,hist0,hist1",
            "HeaderID": "5246201354",
            "Values": [
              "240771222771818,0x8168793592c6a790,1,,2",
              "240771222771818,0x47a67159c915493f,1,1,2",
              "240771222771818,0x667203ac69f3f2bf,1,2,",
              "240771222771818,0x113cd807bfed3853,1,0,"
            ]
          }
        }
      }
    ]
  }
]
From the second message onward, the headers are replaced by their hash value (the HeaderID).
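A client typically needs to pair each value row with the headers. The sketch below shows one way to do that for a reply shaped like the example above; the helper name rows_to_dicts is hypothetical, and the plain comma split assumes that field values contain no commas, which holds for counter data but not necessarily for free-text fields.

```python
def rows_to_dicts(headers, values):
    """Turn Headers/Values from a telemetry reply into a list of dictionaries.

    Headers may arrive as one comma-separated string or as a list of names.
    Empty fields (counter not present) are mapped to None.
    """
    names = headers.split(",") if isinstance(headers, str) else list(headers)
    rows = []
    for line in values:
        fields = line.split(",")
        rows.append({n: (f if f != "" else None) for n, f in zip(names, fields)})
    return rows

reply = {
    "Headers": "timestamp,guid,port,hist0,hist1",
    "Values": [
        "240771222771818,0x8168793592c6a790,1,,2",
        "240771222771818,0x47a67159c915493f,1,1,2",
    ],
}
parsed = rows_to_dicts(reply["Headers"], reply["Values"])
```

Because later subscribe messages carry only the HeaderID, a client would keep the header names from the first message and reuse them for rows that arrive with the same HeaderID.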
GET Request
The Get request retrieves data at a specified path. If the telemetry is devoid of information, the server will respond with an empty response. Otherwise, it will respond with counters it can locate.
Example:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0
The request retrieves data from node_guid 0x5255456, specifically in port number 2, with the request counter set to hist0.
Example 2:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/hist0
The request retrieves the data from all the ports and the node_guids, with the request counter set to hist0.
Example 3:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/*
The request retrieves the data from node_guid 0x5255456, port 2, with the request counters set to "all".
Example for multi path:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/CableInfo.transmitter_technology --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/sel_gctrln_en_5_lane0 --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/num_plls_7nm --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/rcal_fsm_done --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/LinkErrorRecoveryCounterExtended --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/sel_enc2_ib0_lane2 --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/lockdet_err_cnt_unlocked_sticky
Response Example:
[
  {
    "source": "localhost:9339",
    "timestamp": 1719232374915165200,
    "time": "2024-06-24T15:32:54.915165166+03:00",
    "updates": [
      {
        "Path": "nvidia/ib/amber/reply",
        "values": {
          "nvidia/ib/amber/reply": {
            "Headers": [
              "timestamp",
              "Node_GUID",
              "Port_Number",
              "CableInfo.transmitter_technology",
              "sel_gctrln_en_5_lane0",
              "num_plls_7nm",
              "rcal_fsm_done",
              "LinkErrorRecoveryCounterExtended",
              "sel_enc2_ib0_lane2",
              "lockdet_err_cnt_unlocked_sticky"
            ],
            "Values": [
              "1719232345757948,0x91f87bf42deb3e03,1,5091,7826,6290,8615,4247,8586,6214",
              "1719232345757948,0x7b8c2e08907250ce,1,2891,3293,5774,4398,3681,3548,7408",
              "1719232345757948,0x48b60e6f3670eaca,1,9477,3847,1184,5527,4783,2102,8192",
              "1719232345757948,0xabccdad7f8a3eda6,1,7976,6143,8257,3770,6166,6690,2835",
              "1719232345757948,0x6d9ec4bb5fa45736,1,9051,2982,7145,3604,9256,1061,2638",
              "1719232345757948,0x028cf9e0f9ed7c32,1,5623,7483,2263,2265,6890,4875,5564",
              "1719232345757948,0x92a984c1a491b72a,1,6732,7795,6411,8569,3370,705,5536",
              "1719232345757948,0x8b4b404acd2f34da,1,7610,7128,10064,1880,4834,3411,6724",
              "1719232345757948,0x20f92ed58991d56c,1,6805,1632,5407,2038,1865,7279,8350",
              "1719232345757948,0x1dac004a426bb5f5,1,8351,5757,7925,6181,3260,3081,1554"
            ]
          }
        }
      }
    ]
  }
]
Subscribe Stream Request
The Subscribe request, similar to the get request, provides data from the specified path. When the telemetry is empty, the server responds with an empty result. If data is available, the server responds with the retrieved counters. The stream delivers information at the specified interval. If no interval is specified, the server transmits the information at the default server rate, which is configurable and defaults to 10s.
Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0 -i 30s
This request retrieves data from the node_guid 0x5255456, port 2, where the request counter is hist0, and the interval is configured for 30 seconds. To test the stream, the mode can be set to "once"; the stream then stops after a single response.
Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0 -i 30s --mode once
This request retrieves the data from node_guid 0x5255456, port 2, where the request counter is hist0. The stream shuts down after one response, similar to a Get request.
Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/* -i 10s
The first two notifications from the server are as follows:
{
  "source": "localhost:9339",
  "subscription-name": "default-1719233128",
  "timestamp": 1719233128171946500,
  "time": "2024-06-24T15:45:28.171946518+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/sample",
      "values": {
        "nvidia/ib/amber/reply/sample": {
          "HeaderID": "970426048",
          "Headers": [
            "timestamp",
            "Node_GUID",
            "Port_Number",
            "Counter1",
            "Counter2",
            "Counter3",
            "Counter4",
            "Counter5",
            "Counter6",
            "Counter7"
          ],
          "Values": [
            "1719232345757948,0x91f87bf42deb3e03,1,5091,7826,6290,8615,4247,8586,6214",
            "1719232345757948,0x7b8c2e08907250ce,1,2891,3293,5774,4398,3681,3548,7408",
            "1719232345757948,0x1dac004a426bb5f5,1,8351,5757,7925,6181,3260,3081,1554",
            "1719232345757948,0x48b60e6f3670eaca,1,9477,3847,1184,5527,4783,2102,8192",
            "1719232345757948,0xabccdad7f8a3eda6,1,7976,6143,8257,3770,6166,6690,2835",
            "1719232345757948,0x6d9ec4bb5fa45736,1,9051,2982,7145,3604,9256,1061,2638",
            "1719232345757948,0x028cf9e0f9ed7c32,1,5623,7483,2263,2265,6890,4875,5564",
            "1719232345757948,0x92a984c1a491b72a,1,6732,7795,6411,8569,3370,705,5536",
            "1719232345757948,0x8b4b404acd2f34da,1,7610,7128,10064,1880,4834,3411,6724",
            "1719232345757948,0x20f92ed58991d56c,1,6805,1632,5407,2038,1865,7279,8350"
          ]
        }
      }
    }
  ]
}
{
  "source": "localhost:9339",
  "subscription-name": "default-1719233128",
  "timestamp": 1719233138173907700,
  "time": "2024-06-24T15:45:38.173907825+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/sample",
      "values": {
        "nvidia/ib/amber/reply/sample": {
          "HeaderID": "970426048",
          "Values": [
            "1719232345757948,0x20f92ed58991d56c,1,6805,1632,5407,2038,1865,7279,8350",
            "1719232345757948,0x1dac004a426bb5f5,1,8351,5757,7925,6181,3260,3081,1554",
            "1719232345757948,0x48b60e6f3670eaca,1,9477,3847,1184,5527,4783,2102,8192",
            "1719232345757948,0xabccdad7f8a3eda6,1,7976,6143,8257,3770,6166,6690,2835",
            "1719232345757948,0x6d9ec4bb5fa45736,1,9051,2982,7145,3604,9256,1061,2638",
            "1719232345757948,0x028cf9e0f9ed7c32,1,5623,7483,2263,2265,6890,4875,5564",
            "1719232345757948,0x92a984c1a491b72a,1,6732,7795,6411,8569,3370,705,5536",
            "1719232345757948,0x8b4b404acd2f34da,1,7610,7128,10064,1880,4834,3411,6724",
            "1719232345757948,0x91f87bf42deb3e03,1,5091,7826,6290,8615,4247,8586,6214",
            "1719232345757948,0x7b8c2e08907250ce,1,2891,3293,5774,4398,3681,3548,7408"
          ]
        }
      }
    }
  ]
}
Subscribe On-Change Request
The subscribe on-change request, similar to the standard subscribe request, provides data from the specified path. If the telemetry lacks data, the server responds with an empty result. When data is available, the server responds with the located counters.
The stream delivers information at the specified interval. If no changes occurred between heartbeats, all cached data will be transmitted. However, if a change occurred and was pushed to the client, no data will be transmitted during the heartbeat.
The path construction follows the same pattern as the get request and includes inventory and event paths. Only updated data will be included in the response, while all other parts remain empty but retain the specified format. Similarly, only the nodes that have been updated will be included in the response.
Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=0x5255456]/port[port_number=2]/amber/port_counters/hist0 --stream-mode on-change --heartbeat-interval 1m
This request retrieves data from node_guid 0x5255456, port 2, with the request counters set to hist0. It periodically checks for changes every minute, and when changes are detected, it promptly sends the updated values.
Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/port_counters/* --stream-mode on-change --heartbeat-interval 1m
This request involves all nodes and ports, aiming to retrieve all counters from the telemetry. It periodically checks for changes every minute, and when changes are detected, it promptly sends the updated values.
Below is an example of the response for a particular GUID in an on-change request for several counters. Only some of the counters have been updated; those that have not been updated still appear with a value (0 in this example), because the include_old_data_on_change flag defaults to true.
1706532307824,0x0002c903007e5220,1,0,0,0,41447490564,617155163,41423305825,617155163,24184739,17,0,0,0,0,0
The same example with the flag set to false will give this:
1706532307824,0x0002c903007e5220,1,,,,41447490564,617155163,41423305825,617155163,24184739,17,,,,,
Only the values that have changed are returned, while the others are left empty. To get this data format, set include_old_data_on_change to false in the config file.
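When rows arrive with empty fields for unchanged counters, a client can rebuild the full row by keeping the last known values per node and port. The sketch below is one possible client-side approach, not part of the plugin; the function name merge_row is hypothetical and the sample rows are shortened, made-up values.

```python
def merge_row(cache, row):
    """Merge an on-change CSV row into a cache keyed by (node GUID, port).

    Empty fields mean "unchanged", so the previously cached value is kept.
    """
    fields = row.split(",")
    key = (fields[1], fields[2])
    previous = cache.get(key)
    if previous is None:
        cache[key] = fields
        return fields
    merged = [new if new != "" else old for new, old in zip(fields, previous)]
    cache[key] = merged
    return merged

cache = {}
# First (full) row, then an on-change row where only one counter changed.
merge_row(cache, "1706532307000,0x0002c903007e5220,1,0,0,0,5,6")
merged = merge_row(cache, "1706532307824,0x0002c903007e5220,1,,,,7,")
```

The merged row carries the new timestamp and the changed counter, with all other counters taken from the earlier row.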
Example:
gnmic -a localhost:9339 --insecure sub --path nvidia/ib/guid[guid=*]/port[port_number=*]/amber/* --stream-mode on-change --heartbeat-interval 24h
The first two notifications from the server are shown below (with include_old_data_on_change set to true). The last two columns have not changed but still return their previous data. The second message was sent because some rows changed, and it contains only those rows.
{
  "source": "localhost:9339",
  "subscription-name": "default-1719236764",
  "timestamp": 1719236764654659600,
  "time": "2024-06-24T16:46:04.654659517+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/onchange",
      "values": {
        "nvidia/ib/amber/reply/onchange": {
          "HeaderID": "912200528",
          "Headers": [
            "timestamp",
            "Node_GUID",
            "Port_Number",
            "Counter1",
            "Counter2",
            "Counter3",
            "Counter4",
            "Counter5",
            "Counter6",
            "Counter7"
          ],
          "Values": [
            "1719236753818594,0x7e680fb8f81a1950,1,100531,107250,100999,107455,109258,3716,5329",
            "1719236753818594,0x0176438fe4ee507c,1,104269,108884,104887,108502,105366,4540,6673",
            "1719236753818594,0x2e36224302959e79,1,101228,100555,105616,102767,108899,87,9953",
            "1719236753818594,0x8e62a55d7571a9b8,1,100684,108124,106670,102400,106689,2910,4203",
            "1719236753818594,0x0be75a9e97016f5e,1,102227,102735,108903,103547,108705,2629,1830",
            "1719236753818594,0x8307bfad0672adbd,1,106033,103906,106185,107450,105736,2567,6914",
            "1719236753818594,0x2cbe66ec0b1af84c,1,105958,106959,100349,107704,105073,8330,4962",
            "1719236753818594,0x6b6da39a9ec4bbfc,1,104340,106752,109134,103796,103500,7136,3493",
            "1719236753818594,0x6d122dbdd99cfb60,1,104941,107630,104190,105392,109582,5480,7934",
            "1719236753818594,0xeed4bd9cd3b7f325,1,102416,100164,106731,102033,103807,3048,6316"
          ]
        }
      }
    }
  ]
}
{
  "source": "localhost:9339",
  "subscription-name": "default-1719236764",
  "timestamp": 1719237054620929500,
  "time": "2024-06-24T16:50:54.620929561+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/amber/reply/onchange",
      "values": {
        "nvidia/ib/amber/reply/onchange": {
          "HeaderID": "912200528",
          "Values": [
            "1719237054172043,0xeed4bd9cd3b7f325,1,117416,115164,121731,117033,118807,3048,6316",
            "1719237054172043,0x2e36224302959e79,1,116228,115555,120616,117767,123899,87,9953",
            "1719237054172043,0x8e62a55d7571a9b8,1,115684,123124,121670,117400,121689,2910,4203",
            "1719237054172043,0x7e680fb8f81a1950,1,115531,122250,115999,122455,124258,3716,5329",
            "1719237054172043,0x0176438fe4ee507c,1,119269,123884,119887,123502,120366,4540,6673"
          ]
        }
      }
    }
  ]
}
Inventory Requests
Inventory messages are conveyed in separate updates and present the inventory details of the UFM associated with the provided IP. They include the total count of the components within the UFM, such as switches, routers, and servers, along with the number of active ports and the total number of ports, including disabled ones. Inventory requests also include the size of the telemetry, which is not always the same as the number of active ports. If the plugin is unable to contact the UFM, it falls back to the default values defined in the configuration file. The path for inventory requests differs from the conventional path structure because it does not depend on specific nodes or ports; the inventory segment follows directly after "nvidia/ib".
Example:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/inventory/*
Response:
[
  {
    "source": "localhost:9339",
    "timestamp": 1698824237536878000,
    "time": "2023-11-01T09:37:17.536878067+02:00",
    "updates": [
      {
        "Path": "nvidia/ib/inventory",
        "values": {
          "nvidia/ib/inventory": {
            "ActivePorts": 4,
            "Cables": 2,
            "Gateways": 0,
            "HCAs": 2,
            "Routers": 0,
            "Servers": 2,
            "Switches": 1,
            "TotalPorts": 38,
            "TelemetrySize": 4,
            "timestamp": 1698824211535069000
          }
        }
      }
    ]
  }
]
Events Requests
Event messages are provided in separate updates and describe the events occurring within the UFM associated with the specified IP. Because the event metadata is the same for every event, even when a request returns many events, the message format adopts a CSV-like structure. The Headers section contains the metadata field names of the UFM events, while the Values section contains the raw event data. Users can subscribe to these events in on-change mode, receiving only the events triggered within the subscription interval. The path for event requests differs from the typical node or port-based structure; the events segment follows directly after "nvidia/ib".
Example:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/events/*
Response:
[
  {
    "source": "localhost:9339",
    "timestamp": 1698824809647515600,
    "time": "2023-11-01T09:46:49.647515575+02:00",
    "updates": [
      {
        "Path": "nvidia/ib/events",
        "values": {
          "nvidia/ib/events": {
            "Headers": [
              "id",
              "object_name",
              "write_to_syslog",
              "description",
              "type",
              "event_type",
              "severity",
              "timestamp",
              "counter",
              "category",
              "object_path",
              "name"
            ],
            "Values": [
              "7718,Grid,false,Disk space usage in /opt/ufm/files/log is above the threshold of 90.0%.,Grid,525,Critical,2023-11-01 07:25:54,N/A,Maintenance,Grid,Disk utilization threshold reached",
              "7717,Grid,false,Disk space usage in /opt/ufm/files/log is above the threshold of 90.0%.,Grid,525,Critical,2023-11-01 07:24:54,N/A,Maintenance,Grid,Disk utilization threshold reached",
              "7716,Grid,false,Disk space usage in /opt/ufm/files/log is above the threshold of 90.0%.,Grid,525,Critical,2023-11-01 07:23:54,N/A,Maintenance,Grid,Disk utilization threshold reached",
              "7491,ec0d9a0300d42e54,false,Mcast group is deleted: ff12601bffff0000, 00000002,Computer,67,Info,2023-10-31 06:39:21,N/A,Fabric Notification,default / Computer: r-ufm59,MCast Group Deleted"
            ]
          }
        }
      }
    ]
  }
]
Switch Rank Requests
Switch rank updates are conveyed in separate messages, presenting the rank of the switches in the UFM. This data is derived from a file in the UFM and is updated by the server every 6 hours by default. The switch_rank counter is associated only with switch-level data, so there is no need to specify a port in the path. However, this counter is not connected to the telemetry cache of switch-level data. Note that if the ufm_ip is changed, the switch_rank information will not be available.
Example:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/guid[guid=*]/amber/switch_rank
Response:
{
  "source": "localhost:9339",
  "timestamp": 1719296207323383300,
  "time": "2024-06-25T09:16:47.323383222+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/guid[guid=*]/amber/amber/switch_rank",
      "values": {
        "nvidia/ib/guid/amber/amber/switch_rank": {
          "Headers": "Timestamp,Node_GUID,switch_rank",
          "Values": [
            "1719296205612,0x0002c903007e5220,0"
          ]
        }
      }
    }
  ]
}
UFM Health KPI Requests
UFM Health KPI messages are provided in separate updates and report the health metrics of the UFM associated with the specified IP. The response value is in Prometheus format, delivered as one large string. Users can subscribe to the UFM Health KPIs in on-change mode, receiving the complete set of UFM Health metrics whenever any single item changes. The path for UFM Health KPI requests differs from the typical node or port-based structure; the ufm_health_kpi segment follows directly after "nvidia/ib".
Example:
gnmic -a localhost:9339 --insecure get --path nvidia/ib/ufm_health_kpi/*
Response:
{
  "source": "localhost:9339",
  "timestamp": 1719296207323383300,
  "time": "2024-06-25T09:16:47.323383222+03:00",
  "updates": [
    {
      "Path": "nvidia/ib/ufm_health_kpi",
      "values": {
        "nvidia/ib/ufm_health_kpi": {
          "value": "# HELP server_cpu_usage_percent_avg Average of Server CPU usage percent
# TYPE server_cpu_usage_percent_avg gauge
server_cpu_usage_percent_avg{duration="Last 5 minutes"} 1.5545454545454547
server_cpu_usage_percent_avg{duration="Last 1 hour"} 1.4975206611570255
server_cpu_usage_percent_avg{duration="Last 24 hour"} 1.505277777777778
...
events_history_counter{duration="Last week",event_name="Director Switch is Down"} 0.0
events_history_counter{duration="Last week",event_name="Node is Up"} 0.0
events_history_counter{duration="Last week",event_name="Node is Down"} 0.0
events_history_counter{duration="Last week",event_name="Link is Up"} 0.0
events_history_counter{duration="Last week",event_name="Link is Down"} 0.0"
        }
      }
    }
  ]
}
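Since the value arrives as one Prometheus-formatted string, a client has to split it into individual samples. The sketch below is a minimal line parser for text shaped like the example above; it is not a complete Prometheus exposition parser (for instance, it does not handle escaped quotes inside label values), and the name parse_metrics is hypothetical.

```python
import re

# metric_name{label="value",...} number   (the label block is optional)
LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text):
    """Return a list of (metric name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples

text = '''# HELP server_cpu_usage_percent_avg Average of Server CPU usage percent
# TYPE server_cpu_usage_percent_avg gauge
server_cpu_usage_percent_avg{duration="Last 5 minutes"} 1.5545454545454547
events_history_counter{duration="Last week",event_name="Link is Down"} 0.0
'''
samples = parse_metrics(text)
```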
UFM Telemetry Notification Subscription
The gNMI plugin includes a built-in Telemetry Notification Server that enables event-driven data synchronization between UFM Telemetry endpoints and the gNMI server. This real-time communication complements the existing periodic telemetry fetching mechanism controlled by the telemetry_interval parameter. See Telemetry Configurations.
The Telemetry Notification Server allows UFM Telemetry to push updates to the gNMI server immediately when new data becomes available, reducing latency and enhancing responsiveness compared to periodic polling alone.
UFM Telemetry Integration
To enable UFM Telemetry to notify the gNMI server when new data is ready, add the configuration line below to the UFM Telemetry .ini file (for example, /opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini):
plugin_env_UFM_TELEMETRY_NOTIFY_ENDPOINTS=http://localhost:9338/telemetry/notify/<Telemetry_HTTP_PORT>
Replace <Telemetry_HTTP_PORT> with the actual HTTP port of the telemetry endpoint (e.g., 9002 for a secondary telemetry instance).
After updating the configuration, restart the UFM Telemetry service to apply the changes.
# For UFM Bare-metal
/etc/init.d/ufmd ufm_telemetry_restart
# For UFM Docker
docker exec ufm /etc/init.d/ufmd ufm_telemetry_restart
Troubleshooting
#
Use Case
Result
Root Cause
1
UFM restarts
No gNMI data is streamed when using gnmic. The notifications include only headers with no data.
The gNMI inner cache is empty due to unresponsive telemetry