NVIDIA UFM Enterprise User Manual v6.17.0
Changes and New Features History

Note

The items listed in the table below apply to all UFM license types.

Feature

Description

Rev 6.16.0

Syslog Streaming

Added the option to set the UFM syslog streaming facility. For more information, refer to Configuring Syslog.

Switch Cables REST API

Added the option to query specific switch cables (using Ports API).

Switch Power Information

Added support for switch and module power usage data in UFM telemetry and REST API. For more information, refer to Devices Window and Inventory Window.

UFM Data Streaming

Added the ability to change the UFM Data streaming log facility. For more information, refer to Configuring Syslog and Configuring UFM Logging.

Kerberos Authentication

Added support for Kerberos authentication, a strong network authentication protocol for client-server applications. For more information, refer to Kerberos Authentication and Enabling Kerberos Authentication.

SM Settings

Changed the default maximum number of VLs to 2 (VL0 – VL1). For more information, refer to Appendix – UFM Subnet Manager Default Properties.

Cable Management

Added support for showing transceiver information for downed links. For more information, refer to Cables Window and Network Map.

Secondary Telemetry

Added the secondary_slvl_support flag and information on the default counters. For more information, refer to Secondary Telemetry.

Congestion Control

Added support for SM congestion control settings. For more information, refer to Appendix - OpenSM Configuration Files for Congestion Control.

UFM HA

Enhanced reliability and added support for setting UFM HA on LVM (Logical Volume Manager). For more information, refer to UFM High-Availability Documentation.

Plugins

Packet Mirroring Collector (PMC) Plugin: Added support for an event on the PF indicating that a QP closed with an error on any other GVMI/VF. For more information, refer to Packet Mirroring Collector (PMC) Plugin.

PDR Deterministic Plugin: Updated instructions. For more information, refer to PDR Deterministic Plugin.

GNMI-Telemetry Plugin: Added gNMI telemetry streaming support, including secured-mode streaming. For more information, refer to GNMI-Telemetry Plugin.

NDT Plugin (Subnet Merger): Added the option to validate the extended fabric using the cable validation tool. For more information, refer to the NDT Plugin.

Rev 6.15.2

UFM SM

Added a new routing algorithm for asymmetric QFT topologies.

Rev 6.15.1

SHARP Reservation

Added support for automatic cleanup of zombie SHARP reservations.

Rev 6.15.0

Defining Node Description

Added the option to define a blacklist of node descriptions that should be excluded when grouping multi-NIC HCAs during host startup. This prevents the formation of incorrect multi-NIC groups based on default node description labels. For more information, refer to Defining Node Description Black-List.

Network Reports

Added the ability to view topology change events related to devices and links. For more information, refer to Events History, Device Status Events and Link Status Events.

User Authentication

Introduced a new user authentication login page. For more information, refer to Azure Authentication Login Page and Enabling Azure AD Authentication.

Added support for a separate authentication server. For more information, refer to UFM Authentication Server and Enabling UFM Authentication Server.

Secondary Telemetry

Added the ability to expose SHARP telemetry in UFM Telemetry. For more information, refer to Exposing Switch Aggregation Nodes Telemetry.

Added the ability to stop SHARP telemetry endpoint using CLI commands. For more information, refer to Stopping Telemetry Endpoint Using CLI Command.

REST APIs

Enhanced the logging REST API by adding the ability to get event logs in JSON file format. For more information, refer to Get Events Logs in JSON Format.

Added the ability to expose managed switch power consumption in Web UI. For more information, refer to Get Managed Switches Power Consumption.

Added the ability to filter the event logs by source. For more information, refer to Create Log History.

Added the ability to generate enterprise network reports. For more information, refer to Events History, Device Status Events and Link Status Events.

Introduced REST APIs for various authentication types. For more information, refer to Examples of REST APIs Using Various Authentication Types.

Added the ability to update the UFM configuration via REST API. For more information, refer to UFM Configuration REST API.

Added the option to expose cable information. For more information, refer to Get Ports with Cable Information.

Improved dynamic telemetry by adding the ability to instantiate a new instance and delete a running instance. For more information, refer to UFM Dynamic Telemetry Instances REST API.

Added the option to set “down” ports as unhealthy. For more information, refer to Unhealthy Ports REST API.

Added forge InfiniBand anti-spoofing support. For more information, refer to Forge InfiniBand Anti-Spoofing REST API.

Added the ability to expose the "site_name" field in all supported REST APIs. For more information, refer to REST API Complementary Information.

Plugins

Added support for the gNMI-Telemetry plugin that employs the gNMI protocol to stream data from UFM telemetry. In addition, added support for secure mode based on client authentication. For more information, refer to the GNMI-Telemetry Plugin.

Added support for ALM configuration for controlling isolation/de-isolation. For more information, refer to ALM Configurations.

REST over RDMA Plugin: Moved to Ubuntu 22-based docker container, OFED 5.8-3.0.7.0, ucx_py 0.35.0 and Python 3.10.

Supported Transceivers

Added support for FR4 transceivers.

Rev 6.14.2

Cable and Transceivers Burning

Added support for burning second-source cable transceivers.

Module REST API

Added the HW revision field to the GET module REST API response.

Telemetry

Added support for the MRCS register read in UFM Telemetry.

UFM Reports

The UFM daily report is now disabled by default after an upgrade or a clean installation.

Rev 6.14.0

UFM Upgrade

Added support for in-service upgrade procedure for UFM HA. Refer to the following sections:

User Authorization

Added support for user-defined roles based on REST APIs subsets. Refer to Rest Roles Access Control.

User Authentication

Added support for user authentication based on Azure Active Directory. Refer to Azure AD Authentication.

Plugins Management

Added support for loading UFM plugin to both master and standby nodes in case of UFM HA deployment. Refer to Plugin Management.

Unhealthy Ports Policy Management

Added support for unhealthy ports policy management via UFM Web UI. Refer to Health Policy Management.

REST over RDMA Plugin

Added support for remote ibdiagnet authentication. Refer to rest-rdma Plugin.

SHARP Reservation

Added support for synchronous SHARP reservation REST API (in addition to the existing asynchronous REST API). Refer to the NVIDIA SHARP REST API.

Secondary Telemetry

Added support for secondary telemetry running by default upon UFM startup, fetching NVIDIA Amber counters. Refer to Secondary Telemetry.

Added support for down ports telemetry. Refer to Secondary Telemetry.

PCI Analysis

Added support for PCI analysis as part of UFM Fabric Analysis Report (added new events for degraded hosts PCI devices). Refer to Appendix - Supported Port Counters and Events.

UFM System Dump

Added human-readable timestamps to the dmesg output included in the UFM system dump.

Factory Reset

Added support for UFM Factory Reset. Refer to UFM Factory Reset.

Rev 6.13.0

Network Fast Recovery

Added the ability to automatically isolate a malfunctioning switch port as detected by the switch. Refer to Enabling Network Fast Recovery.

Multi-Subnet UFM

Added support for multiple UFM instances, wherein multiple instances are aggregated, managed and controlled by a centralized UFM instance. Refer to Multi-Subnet UFM.

Switch ASIC Failure Detection

Added support for a new indication (UFM event) that identifies a failure of a specific switch ASIC. Refer to Configuring Partial Switch ASIC Failure Events.

UFM High-Availability Enhancements

Added support for configuring high-availability with dual-link connections to improve the high-availability robustness.

Automatic Switch Grouping

Added support for enabling automatic grouping of 1U switches by UFM, as per a pre-defined user-configured mapping. Refer to Appendix - Switch Grouping.

SHARP Trees APIs

Incorporated support for a new UFM REST API that presents the current active SHARP trees. Refer to NVIDIA SHARP Resource Allocation REST API.

SHARP Reservation APIs

Added support for SHARP Reservation API enhancements. Refer to NVIDIA SHARP Resource Allocation REST API.

Operating System Update support

Implemented functionality to support the installation and upgrade of a standalone UFM after the upgrade of operating system packages (e.g., using yum update/apt upgrade). Furthermore, upgrading operating system packages will not impact a standalone UFM installation.

Email Time-Zone Settings

Added the ability to configure time-zone settings for UFM email notifications, ensuring that sent events or daily reports align with the configured time zone. Refer to Email.

Switch Connectivity Failure Indication

Added support for a new UFM event that indicates failed communication with a specified managed switch. Refer to Appendix - Supported Port Counters and Events.

Dynamic Telemetry

Added APIs that enable the creation and management of UFM Telemetry instances, allowing users to select desired counters and ports as per their requirements. Refer to UFM Dynamic Telemetry Instances REST API.

TFS (Telemetry Fluent Streaming) Plugin

Added support for UFM telemetry data streaming from multiple endpoints to Fluent Bit. Refer to Telemetry to Fluent Streaming (TFS) Plugin REST API.

Added support for enabling white/black counters lists within the TFS Plugin. Refer to Telemetry to Fluent Streaming (TFS) Plugin REST API.

DTS (DPU Telemetry) Plugin

Added support for displaying DPU data within the UFM Web UI. Refer to DTS Plugin.

Cyber-AI Plugin

Added support for displaying Cyber-AI software within the UFM Web UI. Refer to UFM Cyber-AI Plugin.

Packet Mirroring Collector (PMC) Plugin

Added the Packet Mirroring Collector (PMC) plugin that allows users to catch and collect mirrored pFRN and congestion notifications from switches for enhanced real-time network visibility. Refer to Packet Mirroring Collector (PMC) Plugin.

SNMP Traps Listener Plugin

Added the capability to enable registration and monitoring of SNMP traps from managed switches, in addition to updating UFM with the relevant trap information. Refer to SNMP Plugin.

Bright Cluster Integration Plugin

Added support for integration of data from Bright Cluster Manager (BCM) into UFM, providing a more comprehensive network perspective. Refer to UFM Bright Cluster Integration Plugin.

UFM System Dump

UFM System Dump collection enhancement. Refer to UFM System Dump Tab.

Expanding Non-Blocking Fabric (NDT Plugin extension)

Added a feature that facilitates seamless expansion of the IB fabric, ensuring uninterrupted functionality and optimal performance throughout the fabric. Refer to NDT Format – Merger.

PDR (Packet Drop Rate) Plugin

Added a new functionality that enables automatic detection and isolation of port failures through monitoring of PDR (Packet Drop Rate), BER (Bit Error Rate), and high cable temperatures. Refer to PDR Deterministic Plugin.

Rev 6.12.0

Managed Switches - Sysinfo Mechanism

Added the ability to save switch inventory data to JSON files and present the latest fetched switch data upon UFM startup. The saved switch data is also available in the UFM system dump. Refer to Appendix - Managed Switches Configuration Info Persistency.

REST over RDMA Plugin

Introduced security improvements (allowed read-only options in remote ibdiagnet) and added support for Telemetry API. Refer to rest-rdma Plugin.

Events and Notifications

Added support for indicating potential switch ASIC failure by detecting a defined percentage of unhealthy switch ports. Refer to Additional Configuration (Optional)

SHARP AM Multi-Port

Added support for detecting IB fabric interface failure and automatic failover to an alternative active port in SHARP Aggregation Manager (AM). Refer to Multi-port SM

UFM System Dump

Added support for downloading the generated UFM system dump. Refer to UFM System Dump Tab

UFM REST API

Added support for adding or removing hosts to Partition key (PKey) assignments (when adding/removing hosts, all the related host GUIDs are assigned to/removed from the PKey). Refer to Add Host REST API

Improved the UFM System Dump, including a new API for creating a system dump.

UFM SLURM Integration

Enhanced UFM SLURM integration to allow flexible configuration of PKey and SHARP resource usage. Refer to Appendix - UFM SLURM Integration.

UFM HA

Improved UFM HA configuration by setting UFM HA nodes using IP addresses only (removed the need to use hostnames and sync interface names). Refer to Configuring UFM Docker in HA Mode and Installing UFM Server Software for High Availability.

Managed Switch Operations

Added support for persistent enablement/disablement of managed switches ports. Refer to Ports Window

UFM SDK

Created a script to get TopX data by category. Refer to UFM Aggregation TopX README.md file

Proxy Authentication

Added option to delegate authentication to a proxy. Refer to Delegate Authentication to a Proxy

UFM Initial Settings

Removed the requirement to set the IPoIB address to the main IB interface used by UFM/SM (gv.cfg → fabric_interface)

Port auto-isolation

A symbol BER warning does not trigger port auto-isolation; only a symbol BER error does.

MFT Package

Integrated with MFT version 4.23.0-104

Rev 6.11.0

UFM Discovery and Device Management

  • In-band auto-discovery of switches' IP addresses using ibdiagnet

  • Discovering the device's PSID and FW version using ibdiagnet by default instead of using an SM vendor plugin

CPU Affinity

Added the ability for users to control the CPU affinity of UFM's major processes.

gRPC API

Added support for streaming UFM REST API data over gRPC as part of new UFM plugin. Refer to GRPC-Streamer Plugin

Telemetry

  • Added support for flexible counters infrastructure (ability to change counter sets that are sampled by the UFM)

  • Updated the set of available counters for Telemetry (removed General counters from the default view: Raw BER, Effective BER and Device Temperature, which are now available through the secondary telemetry instance). Refer to Secondary Telemetry

EFS UFM Plugin

Added support for streaming UFM events data to a FluentD destination as part of a new UFM plugin. Refer to UFM Telemetry Fluentd Streaming (TFS) Plugin.

General UI Enhancements

• Displayed columns of all tables are persistent per user, with the option to restore defaults. Refer to Displayed Columns

• Improved look and feel in Network Map. Refer to Network Map

• Added Reveal Uptime to the general tab in the devices information tabs. Refer to Device General Tab

High Availability Deployment

REST APIs

Added support for PKey filtering for default session data. Refer to Get Default Monitoring Session Data by PKey Filtering.

Added support for filtering session data by groups. Refer to Monitoring Sessions REST API.

Added support for resetting all unhealthy ports at once. Refer to Mark All Unhealthy Ports as Healthy at Once.

Added support for presenting system uptime in UFM REST API. Refer to Systems REST API.

Deployment Installation

UFM installation is now based on Conda-4.12 (or newer) for the Python 3.9 environment and third-party package deployments.

NVIDIA SHARP Software

Updated NVIDIA SHARP software version to v3.1.1.

UFM Logical Elements

UFM Logical Elements (Environments, Logical Servers, Networks) views are deprecated and will no longer be available starting from UFM v6.12.0 (January 2023 release)

Rev 6.10.0

System health enhancements

Added support for the periodic fabric health report and reflected the ports' results in the UFM dashboard

UFM Plugins Management

Added support for plugin management via the UFM Web UI

UFM Extended Status

  • Added support for showing the current status of UFM's processes (via shell script)

  • Added REST API for exposing UFM readiness

Failover to Other Ports

Added support for SM and UFM Telemetry failover to other ports on the local machine

UFM Appliance Upgrade

Added a set of REST APIs for supporting the UFM Appliance upgrade

Configuration Audit

Added support for tracking changes made in major UFM configuration files (UFM, SM, SHARP, Telemetry)

UFM Plugins

Added support for new SDK plugins

Telemetry

Added support for statistics processing based on the UFM telemetry CSV format

UFM High Availability Installation

UFM high availability installation has changed: it is now based on an independent high availability package, which should be deployed in addition to the UFM Enterprise standalone package. For further details about the new UFM high availability installation, refer to Installing UFM Server Software for High Availability.

© Copyright 2024, NVIDIA. Last updated on Aug 27, 2024.