NVIDIA UFM Enterprise User Manual v6.24.2
UFM Telemetry Manager (UTM) Plugin

Plugin Release Notes

Changes and New Features

| Plugin Version | Feature |
|----------------|---------|
| 1.21.4         | N/A     |


Bug Fixes

| Plugin Version | Bug Fix |
|----------------|---------|
| 1.21.4         | N/A     |


Overview

Managed telemetry is a mode that provides high availability and improved performance for UFM Telemetry processes.

Under the control of the UFM Telemetry Manager (UTM), several UFM Telemetry Instances (TIs) run on one or more machines, each collecting telemetry from a subset of the cluster fabric.

UTM manages the following aspects:

  • monitoring of TI states: down, initializing, running, paused

  • TI management commands: add, remove, pause, start, restart

  • partitioning of the fabric based on TI health and fabric changes

  • assigning fabric segments to TIs

  • telemetry coverage check of a cluster

[Figure: UTM communications diagram]

Deployment

The UTM plugin is designed to operate either as a UFM plugin or in standalone mode.

In UFM plugin mode, UTM is deployed through UFM. Telemetry instances can be deployed by UFM or manually with the deployment scripts.

Standalone mode deploys the whole setup (UTM, host TI list, switch telemetry image) with the deployment scripts.

First, get the UTM image. If UTM will run in UFM plugin mode, upload the image to the UFM machine.

docker pull mellanox/ufm-plugin-utm

UFM Plugin Mode

The UTM plugin can be added either via the Command Line Interface or Web-UI.

CLI Deployment

To add the plugin, run:

/opt/ufm/scripts/manage_ufm_plugins.sh add -p utm

To remove the plugin, run:

/opt/ufm/scripts/manage_ufm_plugins.sh remove -p utm


Web-UI Deployment

  1. Navigate to the UFM web UI and click on Settings in the left panel.

  2. Go to the "Plugin Management" tab.

  3. Right-click on the UTM plugin row and select "Add."

    [Figure: adding the UTM plugin in Plugin Management]

  4. Select "Telemetry Status" in the left panel to open the UTM UI page.

  5. Choose how the TIs are run:

    1. Use the default UFM TIs. Depending on the UFM configuration, TIs run either in legacy mode or within UTM.

    2. Start telemetry instances manually using the UTM deployment scripts. See section "Manual Deployment".

To stop the UTM plugin, go to "Plugin Management", right-click on the UTM plugin row, and select "Disable."

Note

If non-default UFM credentials are used, UTM may fail to access the UFM REST API. To resolve this, configure the ufm section of the utm_config.ini file with ufm_user= and ufm_pass= to restore the connection between UTM and UFM.

Default UFM Telemetry Monitoring in Legacy Mode

By default, UFM Telemetry runs two TIs: a high-frequency (Primary) instance and a low-frequency (Secondary) instance.

To enable meaningful monitoring:

  1. Set plugin_env_CLX_EXPORT_API_SHOW_STATISTICS=1 in the config files:

    /opt/ufm/files/conf/telemetry_defaults/launch_ibdiagnet_config.ini
/opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini

  2. Restart telemetry instances with the new config. If UFM Enterprise runs as a docker container, this command should be executed inside the container.

    /etc/init.d/ufmd ufm_telemetry_restart

  3. Allow the TIs some time to update their performance metrics. The required time depends on the update interval of the default TIs.
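Step 1 above could be scripted roughly as follows. This is a minimal sketch, not part of UFM: the helper name set_ini_flag is hypothetical, and it assumes the flag already appears in each file as a key=value line. If the flag is missing, the helper fails instead of guessing where to add it.

```shell
#!/bin/sh
# set_ini_flag FILE KEY VALUE
# Hypothetical helper: rewrites an existing "KEY=..." line in FILE to "KEY=VALUE".
# Returns non-zero and leaves FILE untouched if KEY is not present,
# so that the flag can then be added by hand in the right place.
set_ini_flag() {
    file=$1 key=$2 value=$3
    grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file" || return 1
    tmp=$(mktemp) || return 1
    # Write to a temporary file first, then copy back to keep FILE's permissions.
    sed "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage with the two files named in step 1:
# for f in /opt/ufm/files/conf/telemetry_defaults/launch_ibdiagnet_config.ini \
#          /opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini; do
#     set_ini_flag "$f" plugin_env_CLX_EXPORT_API_SHOW_STATISTICS 1 ||
#         echo "flag not found in $f, edit it manually" >&2
# done
```

After the files are updated, the restart command from step 2 still has to be run for the change to take effect.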

Default UFM Telemetry Instances in UTM

If legacy mode is disabled in the UFM configuration, UTM runs the Primary and Secondary telemetry instances automatically.

Manual Deployment

Additional telemetry instances for UFM plugin mode or the whole standalone setup (UTM and TIs) are deployed using UTM Deployment scripts.

UTM Deployment Scripts

To get the deployment scripts and examples, mount the local folder UTM_DEPLOYMENT_SCRIPTS (/tmp/utm_deployment_scripts in this example) into the UTM container and run get_deployment_scripts.sh:

$ export UTM_DEPLOYMENT_SCRIPTS=/tmp/utm_deployment_scripts
$ docker run -v "$UTM_DEPLOYMENT_SCRIPTS:/deployment_scripts" --rm --name utm-deployment-scripts -ti mellanox/ufm-plugin-utm:latest /bin/sh /get_deployment_scripts.sh

The content of the script folder consists of:

  • examples - Contains run/stop scripts for both standalone and UFM plugin modes. Each example script demonstrates how to call the actual deployment scripts.

  • hostlist.txt - Specifies the hosts, ports, and HCAs of the TIs to be deployed.

  • scripts - Contains the actual deployment scripts. The entry-point script deploy_managed_telemetry.sh invokes the other two scripts, depending on its input arguments.

    $ cd $UTM_DEPLOYMENT_SCRIPTS
$ tree
.
├── examples
│   ├── run_standalone.sh
│   ├── run_with_plugin.sh
│   ├── stop_standalone.sh
│   └── stop_with_plugin.sh
├── hostlist.txt
├── README.md
└── scripts
    ├── deploy_bringup.sh
    ├── deploy_managed_telemetry.sh
    └── deploy_ufm_telemetry.sh

Note

All example/deployment scripts should run from the UTM_DEPLOYMENT_SCRIPTS folder.


Hostlist File

Please note the following:

  • The hostlist.txt file should be set before running any script.

  • The hostname and port are used for communication with the TI; the HCA is used for telemetry collection.

  • UTM only supports a single fabric for managed TIs, even if different HCAs on the same machine are connected to different fabrics.

  • Both local and remote hosts are supported for TI deployments.

$ cat hostlist.txt 
 
# List lines in the following format:
# host:port:hca
#
# where:
#  - host is IP or hostname. Use localhost or 127.0.0.1 for local deployment
#  - port to run telemetry on.
#  - hca is the target host device from which telemetry collects. Run `ssh $host ibstat`
#        to find the active device on the target host.
 
localhost:8123:mlx5_0
localhost:8124:mlx5_0
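
Because every deployment script reads hostlist.txt, it can help to check the file before running anything. The following is a sketch only: the helper name check_hostlist is hypothetical, and it assumes hosts are hostnames or IPv4 addresses (an IPv6 literal would clash with the colon separator).

```shell
#!/bin/sh
# check_hostlist FILE
# Prints every line that is neither a comment nor blank and does not look like
# host:port:hca, and returns non-zero if any such line is found.
check_hostlist() {
    awk -F: '
        /^[ \t]*(#|$)/ { next }    # skip comments and blank lines
        NF != 3 || $1 == "" || $3 == "" || $2 !~ /^[0-9]+$/ || $2 < 1 || $2 > 65535 {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
# check_hostlist hostlist.txt || echo "fix the lines listed above" >&2
```

The check is purely syntactic. It does not verify that the host is reachable or that the HCA exists; `ssh $host ibstat` remains the way to confirm the device.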

Main Deployment Script

For a more customizable setup beyond what the example scripts offer, users have the option to manually run ./scripts/deploy_managed_telemetry.sh. This primary deployment script can deploy multiple TIs and optionally UTM as well.

Run deploy_managed_telemetry.sh --help to list the available options.

./deploy_managed_telemetry.sh --help
./deploy_managed_telemetry.sh options:
 
    mandatory: 
        --hostlist-file=     Path to a file that lists hostname:port:hca lines
    mandatory run options (use only one at the same time):  
        -r,  --run           Deploy and run managed telemetry setup
        -s,  --stop          Stop all processes and cleanup
    mandatory telemetry deployment options (use only one at the same time):  
        -t=, --ufmt-image=   UFM telemetry docker image or tgz/tar.gz-image
      or: 
        --bringup-package=   Bringup tar.gz package
    optional: 
        -m=, --utm-image=       UTM docker image or tgz/tar.gz-image. Runs UTM only if it is set. Configures UTM according hostlist file
        --utm-as-plugin=        if UTM runs as a plugin, set this flag
        -d=, --data-root=       Root directory for run data   |  Default: '/tmp/managed_telemetry/'
        --switch-telem-image=   Switch telemetry image (tar.gz-file or docker image). UTM will be able to deploy it to managed switches if set
        --common-data-dir=      Common data folder for TIs
    -h,  --help              Print this message


Start TIs for UFM Plugin Mode:

  1. Prepare the TI setup using the utm_deployment_scripts example scripts:

  2. Change directory:

    cd $UTM_DEPLOYMENT_SCRIPTS

  3. Open and configure hostlist.txt

  4. Deploy and run TIs according to hostlist.txt and set these TIs to be monitored by UTM:

    sudo ./examples/run_with_plugin.sh

  5. To stop the TIs, clean up the TI setup, and remove the TIs from UTM monitoring, run:

    sudo ./examples/stop_with_plugin.sh

    Note

    This script does not stop the UTM plugin.

Standalone Mode

In standalone mode, UTM periodically tracks fabric changes by itself and does not require UFM Enterprise.

Deploy via example scripts:

  1. Change directory:

    cd $UTM_DEPLOYMENT_SCRIPTS

  2. Open and configure hostlist.txt

  3. Deploy and run TIs according to hostlist.txt and run UTM:

    sudo ./examples/run_standalone.sh

  4. To stop and clean up the TI setup and UTM, run:

    sudo ./examples/stop_standalone.sh

Deployment without Scripts

This section describes how to deploy UTM and managed TIs manually, for corner cases that the convenience scripts do not cover.

UTM Deployment

UTM can be started with two docker run commands.

  1. Set the utm_config, utm_data, utm_log, and utm_image variables.

  2. Initialize UTM config:

    docker run -v $utm_config:/config \
           -v $utm_data:/data \
           --rm --name utm-init \
           --device=/dev/infiniband/ \
           $utm_image /init.sh

  3. Run UTM

    docker run -d --net=host  \
           --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
           --device=/dev/infiniband/ \
           -v $utm_config:/config \
           -v $utm_data:/data \
           -v $utm_log:/log \
           --rm --name utm $utm_image
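
Step 1 above does not prescribe values for the variables. One possible way to set them is sketched below. The helper name prepare_utm_dirs is hypothetical, the config path follows the standalone default given in this manual (/tmp/managed_telemetry/utm/config), and the data and log paths are assumptions; any writable directories should work.

```shell
#!/bin/sh
# prepare_utm_dirs [ROOT]
# Sets utm_config, utm_data, and utm_log under ROOT and creates the directories.
prepare_utm_dirs() {
    utm_root=${1:-/tmp/managed_telemetry/utm}
    utm_config="$utm_root/config"   # standalone default location of utm_config.ini
    utm_data="$utm_root/data"       # assumption: not specified by the manual
    utm_log="$utm_root/log"         # assumption: not specified by the manual
    mkdir -p "$utm_config" "$utm_data" "$utm_log"
}

# Image name as pulled earlier in this document; pin a specific tag if needed.
utm_image=mellanox/ufm-plugin-utm:latest

# Usage: call prepare_utm_dirs, then run the two docker run commands above
# in the same shell so that the variables are visible to them.
```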

Managed/Standalone TIs Manual Deployment

A TI can run either as a UFM Telemetry docker container or as a UFM Telemetry bring-up package.

To run the docker container in managed mode, launch_ibdiagnet_config.ini should have the following flags enabled:

plugin_env_CLX_EXPORT_API_SHOW_STATISTICS=1
plugin_env_UFM_TELEMETRY_MANAGED_MODE=1

To run UFM Telemetry with Distributed Telemetry, enable its receiver and specify the HCA it should use:

plugin_env_CLX_EXPORT_API_RUN_DT_RECEIVER=1
plugin_env_CLX_EXPORT_API_DT_RECEIVER_HCA=$HCA

To run bring-up in managed mode, create an enable_managed.ini file with the same flags and pass it through the custom_config option of collection_start:

collection_start custom_config=./enable_managed.ini
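
The enable_managed.ini file mentioned above could be created as in the sketch below. The helper name write_enable_managed is hypothetical. The sketch writes only the two managed-mode flags listed earlier in this section; whether the bring-up package also expects a section header or further settings in this file is not stated here.

```shell
#!/bin/sh
# write_enable_managed [FILE]
# Writes the two managed-mode flags to FILE (default: ./enable_managed.ini).
write_enable_managed() {
    cat > "${1:-./enable_managed.ini}" <<'EOF'
plugin_env_CLX_EXPORT_API_SHOW_STATISTICS=1
plugin_env_UFM_TELEMETRY_MANAGED_MODE=1
EOF
}

# Usage:
# write_enable_managed ./enable_managed.ini
```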

UTM Configuration File

The UTM configuration file utm_config.ini is located in the configuration folder (referred to as UTM_CONFIG later in this document).

In UFM plugin mode, UTM_CONFIG=/opt/ufm/files/conf/plugins/utm/.

In standalone mode, the default is UTM_CONFIG=/tmp/managed_telemetry/utm/config; it can be changed with the --data-root argument of the deployment script.

When changes are made to the configuration file, UTM initiates a restart of its main process to apply the updated configuration.

Users may wish to adjust timeout and update rate configurations based on their specific setups. However, it is important to note that the remaining configurations are tailored to enable UTM to function as a UFM plugin and should not be modified.

Distributed Telemetry

To enable Distributed Telemetry, set dt_enable=1 in the corresponding section of utm_config.ini.

Note

Distributed Telemetry requires a Switch Telemetry docker image tagged as switch-telemetry:{version} and placed under $UTM_CONFIG/telem_files/ as switch-telemetry_{version}.tar.gz.

UTM scans for this file at startup.

The example deployment scripts handle this for both UFM plugin and standalone modes.

For more details, refer to NVIDIA UFM Telemetry Documentation → Distributed Telemetry - Switch Telemetry Agent.

GUI

To access the GUI within the UFM web UI, navigate to the "Telemetry Status" section in the left panel.

The UI is accessible whether it is running as a part of UFM Enterprise or standalone via the endpoint: http://127.0.0.1:8888/files/index.html.

The GUI comprises several zones:

  • The top pane displays general information and provides options to add a server name/IP and port for monitoring. Users can set the GUI refresh interval in the top right corner.

  • The middle panes showcase TI groups, with the default group being basic. Unmanaged (standalone) TIs can be monitored and are placed in the "Unmanaged" group.

  • Each group pane presents monitoring information for each TI.

  • The bottom pane exhibits system events. Utilize the bottom right menu to navigate through the events history.

[Screenshot: UTM Telemetry Status page]

TI Management

In managed mode, UTM can dispatch commands to TIs. By right-clicking on the TI line, users can:

  • Pause a currently running TI. This action redistributes fabric sharding among the active TIs.

  • Resume a paused TI.

  • Exclude a TI from monitoring. Although the TI remains on the machine, it enters a paused state and is removed from its group. It's important to note that empty TI groups are automatically removed.

Telemetry Status Fields

The table below lists each column of a Telemetry Group panel:

| Field Name | Description |
|------------|-------------|
| URL | TI URL in the format http://{hostname}:{port} or http://{IP}:{port} |
| Mode | standalone or managed / platform |
| DT receiver | Whether the TI runs with a Distributed Telemetry (DT) receiver. If 0, the TI cannot receive DT data from a switch TI. |
| Status | Down, Running, Initializing, Paused, or Restarting |
| Uptime | TI uptime in human-readable format |
| Collected host/switch ports | Ports collected from the host/switch. By default, data that has not changed since the last sample is not re-exported. Such data is shown in the host part as +num_old_ports. In the screenshot above, the first TI of the "Unmanaged" group sampled 0 new data samples and found 35 old ones. Nothing is sampled from Distributed Telemetry because this TI runs without a DT receiver. The resulting value is 0+35/- |
| Configured host/switch ports | Total number of ports configured to be sampled from a host and its corresponding switches. For more details, refer to the Distributed Telemetry documentation. |
| Enabled/discovered ports | Enabled and discovered ports of the fabric |
| Iteration time | Total iteration time of UFM Telemetry data collection |
| Export time | Export time in the last iteration of UFM Telemetry data collection. Included in Iteration time. |
| Port counters time | Time spent only on port counters telemetry collection. Included in Iteration time. |
| Ports/sec | Rate at which new port counters data was collected during the last iteration of UFM Telemetry |
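
The "Collected host/switch ports" value can be split into its parts when post-processing status output. The sketch below is inferred from the single example 0+35/- in the table, so treat the layout (new+old for the host part, then a slash, then the switch part) as an assumption. The helper name parse_collected is hypothetical.

```shell
#!/bin/sh
# parse_collected VALUE
# Splits a value such as "0+35/-" into new host ports, unchanged (old) host
# ports, and the switch part. A host part without "+" is treated as all new.
parse_collected() {
    host=${1%%/*}      # text before the first "/"
    switch=${1#*/}     # text after the first "/"
    case $host in
        *+*) new=${host%%+*}; old=${host#*+} ;;
        *)   new=$host; old=0 ;;
    esac
    echo "new=$new old=$old switch=$switch"
}

# Usage:
# parse_collected '0+35/-'
```

For the example value from the table, the helper reports 0 new ports, 35 old ports, and "-" for the switch part, matching the description of a TI that runs without a DT receiver.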


REST API

All the GUI features including TI management and monitoring can be accessed via REST API.

Accessing UTM API Commands Based on Operating Mode

The method to access UTM API commands varies depending on the mode:

  • In UFM Plugin Mode: Use the UFM REST API proxy:

    curl -s -k https://{UFM_HOST_IP}/ufmRest/plugin/utm/{COMMAND} -u {user}:{pass}

  • In Standalone Mode: Access the UTM HTTPS endpoint on the default port 8888:

    curl -s -k https://{UTM_HOST_IP}:8888/{COMMAND}
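
The two access patterns above differ only in the URL prefix, so a script that must work in both modes can build the endpoint from the mode. This is a small sketch with a hypothetical helper name, utm_endpoint, using the URL layouts and the default port 8888 exactly as shown above.

```shell
#!/bin/sh
# utm_endpoint MODE HOST COMMAND
# Prints the URL of a UTM API command for MODE "plugin" or "standalone".
utm_endpoint() {
    case $1 in
        plugin)     echo "https://$2/ufmRest/plugin/utm/$3" ;;
        standalone) echo "https://$2:8888/$3" ;;
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Usage (host, user, and pass are placeholders):
# curl -s -k "$(utm_endpoint plugin "$UFM_HOST_IP" status)" -u "$user:$pass"
# curl -s -k "$(utm_endpoint standalone 127.0.0.1 status)"
```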

Command List

For simplicity, the following commands are provided for standalone mode.

  • Get the list of supported user endpoints:

    Standalone UTM help

    curl -s -k https://127.0.0.1:8888/help

  • Get the status of the monitored TIs in JSON format:

    curl -k https://127.0.0.1:8888/status

  • Add TI http://127.0.0.1:8123 to the my_group monitoring group:

    curl -k 'https://127.0.0.1:8888/add_server?url=http://127.0.0.1:8123&group=my_group'

  • Add TI http://127.0.0.1:8123 to the default monitoring group:

    curl -k 'https://127.0.0.1:8888/add_server?url=http://127.0.0.1:8123'

  • Remove TI from monitoring (running TI will be paused):

    curl -k 'https://127.0.0.1:8888/remove_server?url=http://127.0.0.1:8123'

  • Pause running TI:

    curl -k 'https://127.0.0.1:8888/pause_server?url=http://127.0.0.1:8123'

  • Resume paused TI:

    curl -k 'https://127.0.0.1:8888/start_server?url=http://127.0.0.1:8123'
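
The TI management commands above share one shape, so they can be wrapped in a single function. The sketch below uses a hypothetical name, utm_ti, and relies on curl's -G and --data-urlencode options to build the query string. It assumes UTM decodes standard percent-encoding in the url and group parameters, which the examples above (which pass the TI URL unencoded) do not confirm.

```shell
#!/bin/sh
# Base URL of UTM (standalone default from this section) and the curl binary.
UTM_BASE=${UTM_BASE:-https://127.0.0.1:8888}
CURL=${CURL:-curl}

# utm_ti COMMAND TI_URL [GROUP]
# COMMAND is one of: add_server, remove_server, pause_server, start_server.
# GROUP is only meaningful for add_server.
utm_ti() {
    cmd=$1 ti=$2 group=${3:-}
    case $cmd in
        add_server|remove_server|pause_server|start_server) ;;
        *) echo "unsupported command: $cmd" >&2; return 1 ;;
    esac
    if [ -n "$group" ]; then
        "$CURL" -s -k -G --data-urlencode "url=$ti" \
            --data-urlencode "group=$group" "$UTM_BASE/$cmd"
    else
        "$CURL" -s -k -G --data-urlencode "url=$ti" "$UTM_BASE/$cmd"
    fi
}

# Usage:
# utm_ti add_server http://127.0.0.1:8123 my_group
# utm_ti pause_server http://127.0.0.1:8123
# utm_ti start_server http://127.0.0.1:8123
```

In UFM plugin mode, UTM_BASE would point at the UFM REST proxy prefix instead, and the curl call would additionally need the -u credentials shown earlier.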

