Date: 2025-08-07
On This Page
- General Settings in gv.cfg
- Enabling SHARP Aggregation Manager
- Enabling MAD Limiter
- Enabling Predefined Groups
- Enabling Multi-NIC Host Grouping
- Defining Node Description Black-List
- Running UFM Over IPv6 Network Protocol
- Configuring SM Plugins via UFM
- Multi-port SM
- Configuring UDP Buffer
- Virtualization
- Static SM LID
- Configuring Log Rotation
- Configuring UFM Logging
- Configuring UFM Over Static IPv4 Address
- Configuring Syslog
- Excluding Unhealthy Ports from Fabric Health Report
- Configuration Examples in gv.cfg
- Managing Dynamic Telemetry
- SM Trap Handler Configuration
- Setting CPU Affinity on UFM
- Quality of Service (QoS) Support
- UFM Failover to Another Port
- Configuring Managed Switches Info Persistency
- Configuring Partial Switch ASIC Failure Events
- Enabling Network Fast Recovery
- Disabling Rest Roles Access Control
- Enabling/Disabling Authentication
- Adjusting UFM Configuration Files Based on Fabric Size
- Setting Up SSL and CA Certificates in UFM
- Default Bare-Metal Cloud Mode
- Faster Detection of Fan/PSU Removal on Unmanaged Switches
- Setting UFM Configurations Without Requiring UFM Restart
- Setting up Telemetry in UFM
- Enabling UFM Telemetry
- Enabling UFM Telemetry Manager Plugin
- Changing Telemetry Endpoint Protocol
- Changing UFM Telemetry Default Configuration
- Supporting Generic Counters Parsing and Display
- Supporting Multiple Telemetry Instances Fetch
- Low-Frequency (Secondary) Telemetry
- Validating UFM Configuration Files
Optional Configurations
Configure general settings in the conf/gv.cfg file.
When running UFM in HA mode, the gv.cfg file is replicated to the standby server.
Enabling SHARP Aggregation Manager
SHARP Aggregation Manager is disabled by default. To enable it, set:
[Sharp]
sharp_enabled = true
Upon startup of UFM or SHARP Aggregation Manager, UFM will resend all existing tenant allocations to SHARP AM.
Enabling MAD Limiter
MAD Limiter is a security tool designed to mitigate Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) attacks targeting the management node.
MAD Limiter is disabled by default. To enable it, set:
[MADLimiter]
mad_limiter_enabled = true
Enabling Predefined Groups
By default, pre-defined groups are enabled. In very large-scale fabrics, pre-defined groups can be disabled in order to allow faster startup of UFM.
enable_predefined_groups = true
Enabling Multi-NIC Host Grouping
Starting with UFM Enterprise v6.4.1, multi-NIC host grouping is enabled by default for fresh installations. However, for users upgrading from earlier versions, this feature remains disabled by default.
It is recommended to set the value of this parameter before running UFM for the first time.
multinic_host_enabled = true
Defining Node Description Black-List
Node descriptions listed in the black-list are not used for Multi-NIC grouping.
During host reboot or initialization (bring-up), most HCAs receive a default label rather than a real description. To prevent incorrect multi-NIC groups from being formed based on these default labels, you can define a black-list of node descriptions to be ignored when grouping Multi-NIC HCAs during host startup. Once a legitimate node description is assigned to the host, the HCAs are grouped into multi-NIC hosts based on their descriptions. It is recommended to configure this parameter before starting UFM for the first time.
For instance, nodes whose initial descriptions are listed in exclude_multinic_desc are not included in Multi-NIC host groups until they obtain an updated, genuine node description.
Modify the exclude_multinic_desc parameter in the gv.cfg file:
exclude_multinic_desc = localhost,generic_name_1,generic_name_2
Running UFM Over IPv6 Network Protocol
The default multicast address is configured to an IPv4 address. To run over IPv6, this must be changed to the following in section UFMAgent of gv.cfg.
[UFMAgent]
...
# if ufmagent works in ipv6 please set this multicast address to FF05:0:0:0:0:0:0:15F
mcast_addr = FF05:0:0:0:0:0:0:15F
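Before editing gv.cfg, it can be useful to confirm that the value is a well-formed IPv6 multicast address. The following is a minimal sketch using Python's standard ipaddress module; the helper name is_ipv6_multicast is hypothetical and is not part of UFM.

```python
import ipaddress


def is_ipv6_multicast(addr: str) -> bool:
    """Return True if addr parses as an IPv6 multicast address (hypothetical helper)."""
    try:
        parsed = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IPv4 or IPv6 address at all.
        return False
    return parsed.version == 6 and parsed.is_multicast


# The value documented for the [UFMAgent] section.
print(is_ipv6_multicast("FF05:0:0:0:0:0:0:15F"))
```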
Configuring SM Plugins via UFM
UFM allows users to configure Subnet Manager (SM) plugins through the event_plugin_name and event_plugin_options parameters in the configuration file. When UFM starts the SM, it automatically launches the specified plugins with the provided options.
To define the SM plugins, use the following parameter:
# Event plugin name(s)
event_plugin_name osmufmpi lossymgr
Add the plug-in options file to the event_plugin_options option:
# Options string that would be passed to the plugin(s)
event_plugin_options --lossy_mgr -f <lossy-mgr-options-file-name>
These plugin settings are applied only when UFM is operating in Management mode, and they are copied to the opensm.conf file accordingly.
Multi-port SM
SM can use up to eight port interfaces for fabric configuration. These interfaces can be provided via /opt/ufm/conf/gv.cfg. Users can specify multiple IPoIB interfaces or bond interfaces in /opt/ufm/conf/gv.cfg; UFM then translates them to GUIDs and adds them to the SM configuration file (/opt/ufm/conf/opensm/opensm.conf). If users specify more than eight interfaces, the extra interfaces are ignored.
[Server]
# disabled (default) | enabled (configure opensm with multiple GUIDs) | ha_enabled (configure multiport SM with high availability)
multi_port_sm = disabled
# When enabling multi_port_sm, specify here the additional fabric interfaces for OpenSM conf
# Example: ib1,ib2,ib5 (OpenSM will support the first 8 GUIDs, where the first GUID will
# be extracted from the fabric_interface, and the remaining GUIDs from additional_fabric_interfaces)
additional_fabric_interfaces =
UFM treats bonds as a group of IPoIB interfaces. So, for example, if bond0 consists of the interfaces ib4 and ib8, then expect to see GUIDs for ib4 and ib8 in opensm.conf.
Duplicate interface names are ignored (e.g. ib1,ib1,ib1,ib2,ib1 = ib1,ib2).
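The rules above (duplicates ignored, at most eight interfaces, fabric_interface first) can be illustrated with a short sketch. It is only an approximation of the described behavior: the function name effective_interfaces is hypothetical, and the sketch does not expand bonds into their member interfaces or resolve GUIDs as UFM does.

```python
MAX_SM_PORTS = 8  # from the text: interfaces beyond the first eight are ignored


def effective_interfaces(fabric_interface: str, additional: str) -> list[str]:
    """Order-preserving de-duplication of interface names, capped at eight.

    Hypothetical helper: real UFM also expands bonds and translates names to GUIDs.
    """
    names = [fabric_interface]
    names += [n.strip() for n in additional.split(",") if n.strip()]
    unique = list(dict.fromkeys(names))  # dict keeps first-seen order
    return unique[:MAX_SM_PORTS]


print(effective_interfaces("ib0", "ib1,ib1,ib1,ib2,ib1"))
```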
Configuring UDP Buffer
This section is relevant only in cases where telemetry_provider=ibpm. (By default, telemetry_provider=telemetry).
To work with large-scale fabrics, users should set the set_udp_buffer flag under the [IBPM] section to "yes" so that UFM sets the buffer size (default is "no").
# By default, UFM does not set the UDP buffer size. For large scale fabrics
# it is recommended to increase the buffer size to 4MB (4194304 bytes).
set_udp_buffer = yes
# UDP buffer size
udp_buffer_size = 4194304
Virtualization
The virtualization feature allows for supporting virtual ports in UFM.
[Virtualization]
# By enabling this flag, UFM will discover all the virtual ports assigned for all hypervisors in the fabric
enable = false
# Interval for checking whether any virtual ports were changed in the fabric
interval = 60
Static SM LID
Users may configure a specific value for the SM LID so that the UFM SM uses it upon UFM startup.
[SubnetManager]
# 1- Zero value (Default): Disable static SM LID functionality and allow the SM to run with any LID.
# Example: sm_lid=0
# 2- Non-zero value: Enable static SM LID functionality so SM will use this LID upon UFM startup.
sm_lid=0
To configure an external SM (UFM server running in sm_only mode), users must manually configure the opensm.conf file (/opt/ufm/conf/opensm/opensm.conf) and align the value of master_sm_lid to the value used for sm_lid in gv.cfg on the main UFM server.
Configuring Log Rotation
This section enables setting up the log file rotation policy. By default, log rotation runs once a day via the cron scheduler.
[logrotate]
#max_files specifies the number of times to rotate a file before it is deleted (this definition will be applied to
#SM and SHARP Aggregation Manager logs, running in the scope of UFM).
#A count of 0 (zero) means no copies are retained. A count of 15 means fifteen copies are retained (default is 15)
max_files = 15
#With max_size, the log file is rotated when the specified size is reached (this definition will be applied to
#SM and SHARP Aggregation Manager logs, running in the scope of UFM). Size may be specified in bytes (default),
#kilobytes (for example: 100k), or megabytes (for example: 10M). if not specified logs will be rotated once a day.
max_size = 3
Configuring UFM Logging
The [Logging] section in the gv.cfg enables setting the UFM logging configurations.
Field | Default Value | Value Options | Description
level | WARNING | CRITICAL, ERROR, WARNING, INFO, DEBUG | The main logging level for UFM components.
smclient_level | WARNING | CRITICAL, ERROR, WARNING, INFO, DEBUG | The logging level for SM client log messages.
event_log_level | INFO | CRITICAL, ERROR, WARNING, INFO, DEBUG | The logging level for UFM events log messages.
rest_log_level | INFO | CRITICAL, ERROR, WARNING, INFO, DEBUG | The logging level for REST API related log messages.
authentication_service_log_level | INFO | CRITICAL, ERROR, WARNING, INFO, DEBUG | The logging level for UFM authentication log messages.
log_dir | /opt/ufm/files/log | N/A | The path to the UFM log directory, which can be changed from the default. The configured log_dir must have read, write and execute permission for the ufmapp user (ufmapp group). In an HA setup, the log directory should be located in the directory that is replicated between the UFM master and standby servers. Changing the default UFM log directory may affect UFM dump creation and the inclusion of UFM logs in the dump.
max_history_lines | 100000 | N/A | The maximum number of lines in log files to be shown in UI output for UFM logging.
stream_all_events | FALSE | TRUE, FALSE | Enabling this parameter streams all events to the UFM logs and syslog, regardless of whether the event is alarmable.
[Logging]
# Optional logging levels: CRITICAL, ERROR, WARNING, INFO, DEBUG.
level = WARNING
smclient_level = WARNING
event_log_level = INFO
rest_log_level = INFO
authentication_service_log_level = INFO
# The configured log_dir must have read, write and execute permission for ufmapp user (ufmapp group).
log_dir = /opt/ufm/files/log
max_history_lines = 100000
stream_all_events = FALSE
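A misspelled level name is an easy mistake to make when editing the [Logging] section by hand. The sketch below shows one way to check the level fields against the documented value options using Python's standard configparser. It assumes the fragment being checked is parseable as plain INI text; the function name invalid_log_levels and the sample text are illustrative only and are not part of UFM.

```python
import configparser

# Value options documented for the level fields.
VALID_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}
LEVEL_KEYS = (
    "level",
    "smclient_level",
    "event_log_level",
    "rest_log_level",
    "authentication_service_log_level",
)


def invalid_log_levels(cfg_text: str) -> dict[str, str]:
    """Return {field: value} for level fields whose value is not a documented option."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["Logging"]
    return {
        key: section[key]
        for key in LEVEL_KEYS
        if key in section and section[key].upper() not in VALID_LEVELS
    }


# Illustrative fragment with one deliberately wrong value.
sample = """
[Logging]
level = WARNING
rest_log_level = VERBOSE
"""
print(invalid_log_levels(sample))
```

An empty dictionary means every level field present in the fragment uses one of the documented options.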
Configuring UFM Over Static IPv4 Address
Follow this procedure to run UFM on a static IP configuration instead of DHCP:
Modify the defined management Ethernet interface network script to be static. Run:
# vi /etc/sysconfig/network-scripts/ifcfg-enp1s0
Update the required interface with the static IP configuration (IP address, netmask, broadcast, and gateway):
NAME="enp1s0"
DEVICE="enp1s0"
ONBOOT="yes"
BOOTPROTO="static"
IPADDR="10.209.37.153"
NETMASK="255.255.252.0"
BROADCAST="10.209.39.255"
GATEWAY="10.209.36.1"
TYPE=Ethernet
DEFROUTE="yes"
Add host entries to the /etc/hosts file. Run:
# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.209.37.153 <hostname>
Check hostname. Run:
# vi /etc/hostname
<hostname>
Set up DNS resolution at /etc/resolv.conf. Run:
# vi /etc/resolv.conf
search mtr.labs.mlnx
nameserver 8.8.8.8
Restart network service. Run:
# service network restart
Check Configuration. Run:
# hostname
<hostname>
# hostname -i
10.209.37.153
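The IP address, netmask, broadcast, and gateway values in the interface script must agree with one another. As a sanity check, the sketch below uses Python's standard ipaddress module to derive the network from the address and netmask and then compares the other two values against it. The function name check_static_config is hypothetical; the addresses are the example values from this procedure.

```python
import ipaddress


def check_static_config(ipaddr: str, netmask: str, broadcast: str, gateway: str) -> dict:
    """Check that broadcast and gateway are consistent with ipaddr/netmask."""
    iface = ipaddress.ip_interface(f"{ipaddr}/{netmask}")
    net = iface.network
    return {
        "network": str(net),
        "broadcast_ok": ipaddress.ip_address(broadcast) == net.broadcast_address,
        "gateway_ok": ipaddress.ip_address(gateway) in net,
    }


result = check_static_config(
    "10.209.37.153", "255.255.252.0", "10.209.39.255", "10.209.36.1"
)
print(result)
```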
Configuring Syslog
This configuration enables the UFM to send log messages to syslog, including remote syslog. The configuration described below is located in the [Logging] section of the gv.cfg file.
Field | Default Value | Value Options | Description
syslog | false | True or False | Enables/disables the UFM syslog option.
syslog_addr | /dev/log (for remote: rsyslog_hostname:514) | N/A | UFM syslog configuration. For working with local syslog, set the value to /dev/log. For working with an external machine, set the value to host:port. Important note: the default remote syslog server port is 514.
ufm_syslog | false | True or False | Sets the syslog option for main UFM process logging messages. False: do not send. True: send.
smclient_syslog | false | True or False | Sets the syslog option for OpenSM logging messages. False: do not send. True: send.
event_syslog | false | True or False | Sets the syslog option for events logging messages. False: do not send. True: send.
rest_syslog | false | True or False | Sets the syslog option for UFM REST API logging messages. False: do not send. True: send.
authentication_syslog | false | True or False | Sets the syslog option for UFM authentication logging messages. False: do not send. True: send.
syslog_level | WARNING | CRITICAL, ERROR, WARNING, INFO, DEBUG | Sets the global syslog messages logging level. The syslog level is common for all the UFM components. The level that is sent to syslog is the highest among the syslog level and the component log level defined in the above section.
syslog_facility | LOG_USER | LOG_KERN, LOG_USER, LOG_MAIL, LOG_DAEMON, LOG_AUTH, LOG_SYSLOG, LOG_LPR, LOG_NEWS, LOG_UUCP, LOG_CRON, LOG_AUTHPRIV, LOG_FTP, LOG_NTP, LOG_SECURITY, LOG_CONSOLE, LOG_SOLCRON | Includes the remote syslog package header value for log message facility.
Because UFM log messages can be sent to a remote server, the rsyslog configuration on the remote server must be changed. Edit the /etc/rsyslog.conf file and uncomment the following two sections:
# Provides UDP syslog reception
$ModLoad imudp
$UDPServerRun 514
# Provides TCP syslog reception
$ModLoad imtcp
$InputTCPServerRun 514
Then restart the remote syslog service.
syslog = false
#syslog configuration (syslog_addr)
# For working with local syslog, set value to: /dev/log
# For working with external machine, set value to: host:port
syslog_addr = /dev/log
# The configured log_dir must have read, write and execute permission for ufmapp user (ufmapp group).
log_dir = /opt/ufm/files/log
# Main ufm log.
ufm_syslog = false
smclient_syslog = false
event_syslog = false
rest_syslog = false
authentication_syslog = false
syslog_level = WARNING
# Syslog facility. By default - LOG_USER
# possible facility codes: LOG_KERN,LOG_USER,LOG_MAIL,LOG_DAEMON,LOG_AUTH,LOG_SYSLOG,
# LOG_LPR,LOG_NEWS,LOG_UUCP,LOG_CRON,LOG_AUTHPRIV,LOG_FTP, LOG_NTP,LOG_SECURITY,LOG_CONSOLE,LOG_SOLCRON
# for reference https://en.wikipedia.org/wiki/Syslog
syslog_facility = LOG_USER
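Two of the syslog settings lend themselves to a small illustration: the two accepted forms of syslog_addr (a local path or host:port) and the rule that the level sent to syslog is the highest of the global syslog level and the component level. The sketch below is an interpretation of those descriptions, not UFM's implementation. Both function names are hypothetical, "highest" is assumed to mean the most severe of the two levels, and the numeric ranks follow Python's logging module.

```python
# Severity ranks, matching the numeric values used by Python's logging module.
LEVEL_RANK = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}
DEFAULT_REMOTE_PORT = 514  # default remote syslog server port per the text


def parse_syslog_addr(value: str):
    """Return a path string for local syslog, or a (host, port) tuple for remote."""
    value = value.strip()
    if value.startswith("/"):
        return value  # local socket such as /dev/log
    host, sep, port = value.rpartition(":")
    if not sep:
        return (value, DEFAULT_REMOTE_PORT)  # no port given
    return (host, int(port))


def effective_syslog_level(syslog_level: str, component_level: str) -> str:
    """Assumed rule: the more severe of the two levels is the one applied."""
    return max(syslog_level.upper(), component_level.upper(), key=LEVEL_RANK.__getitem__)


print(parse_syslog_addr("/dev/log"))
print(parse_syslog_addr("rsyslog_hostname:514"))
print(effective_syslog_level("WARNING", "INFO"))
```

The return shape of parse_syslog_addr mirrors what Python's logging.handlers.SysLogHandler accepts for its address argument, which is either a path string or a (host, port) tuple.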
Excluding Unhealthy Ports from Fabric Health Report
The gv.cfg file contains a section named UnhealthyPorts whose parameters control how UFM manages unhealthy ports.
A port can be marked as unhealthy by the user, via the UI or a REST API request, or it can be reported as unhealthy by OpenSM or ibutilities.
UFM can periodically check the health of fabric ports and either report unhealthy ports or automatically perform a predefined isolation action on them.
In addition, the exclude_unhealthy_ports key in the UnhealthyPorts section can be used to exclude unhealthy ports from ibdiagnet reports.
By default, this parameter is set to false, which means that unhealthy ports appear in ibdiagnet reports. To exclude unhealthy ports from ibdiagnet reports, set this parameter to true and restart the UFM server for the change to take effect.
During startup, UFM configures the ibdiagnet configuration file with the appropriate parameters, and unhealthy ports will not appear in the UFM health and Fabric health reports.
[UnhealthyPorts]
enable_ibdiagnet = true
log_level = INFO
syslog = false
# scheduling_mode possible values: fixed_time/interval.
# If fixed_time - ibdiagnet will run every 24 hours on the specified time - <fixed_time>.
# If interval - ibdiagnet will run first time after <start_delay> minutes from UFM startup and every <interval> hours (default scheduling mode).
scheduling_mode = interval
# First ibdiagnet start delay interval (minutes)
start_delay = 5
# ibdiagnet run interval (hours)
interval = 3
# ibdiagnet run at a fixed time (example: 23:17:35)
fixed_time = 23:00:00
# By enabling this flag all the discovered high ber ports will be marked as unhealthy automatically by UFM
high_ber_ports_auto_isolation = false
# Auto isolation mode - which type of ports should be isolated.
# Options: switch-switch,switch-host,all (default: switch-switch).
auto_isolation_mode = switch-switch
# Trigger Partial Switch ASIC Failure whenever number of unhealthy ports exceed the defined percent of the total number of the switch ports.
switch_asic_fault_threshold = 20
# exclude unhealthy ports from ibdiagnet reports
exclude_unhealthy_ports = false
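The two scheduling modes described in the comments of the UnhealthyPorts section can be made concrete with a short sketch that computes the first few ibdiagnet run times from a given UFM startup time. This is an interpretation of those comments, not UFM's scheduler: the function name next_ibdiagnet_runs is hypothetical, and in interval mode the sketch assumes that subsequent runs are counted from the first run.

```python
from datetime import datetime, timedelta


def next_ibdiagnet_runs(startup, scheduling_mode="interval", start_delay=5,
                        interval=3, fixed_time="23:00:00", count=3):
    """Return the first `count` run times after `startup` (hypothetical helper)."""
    if scheduling_mode == "interval":
        # First run <start_delay> minutes after startup, then every <interval> hours.
        first = startup + timedelta(minutes=start_delay)
        step = timedelta(hours=interval)
    elif scheduling_mode == "fixed_time":
        # Every 24 hours at <fixed_time>; start today if that time is still ahead.
        hour, minute, second = (int(part) for part in fixed_time.split(":"))
        first = startup.replace(hour=hour, minute=minute, second=second, microsecond=0)
        if first <= startup:
            first += timedelta(days=1)
        step = timedelta(hours=24)
    else:
        raise ValueError(f"unknown scheduling_mode: {scheduling_mode}")
    return [first + i * step for i in range(count)]


startup = datetime(2025, 1, 1, 12, 0, 0)
for run in next_ibdiagnet_runs(startup):
    print(run.isoformat())
```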
Configuration Examples in gv.cfg
The following are examples of configuration settings in the gv.cfg file:
Polling interval for Fabric Dashboard information
ui_polling_interval = 30
[Optional] UFM Server local IP address resolution (by default, the UFM resolves the address by gethostip). UFM Web UI should have access to this address.
ws_address = <specific IP address>
HTTP/HTTPS Port Configuration
# WebServices Protocol (http/https) and Port
ws_port = 8088
ws_protocol = http
Connection (port and protocol) between the UFM server and the APACHE server
ws_protocol = <http or https>
ws_port = <port number>
For more information, see Launching a UFM Web UI Session.
SNMP get-community string for switches (fabric wide or per switch)
# default snmp access point for all devices
[SNMP]
port = 161
gcommunity = public
Enhanced Event Management (Alarmed Devices Group)
[Server]
auto_remove_from_alerted = yes
Log verbosity
[Logging]
# optional logging levels
#CRITICAL, ERROR, WARNING, INFO, DEBUG
level = INFO
For more information, see UFM Logs.
Settings for saving port counters to a CSV file
[CSV]
write_interval = 60
ext_ports_only = no
Max number of CSV files (UFM Advanced)
[CSV]
max_files = 1
Note
The access credentials that are defined in the following sections of the conf/gv.cfg file are used only for initialization:
SSH_Server
SSH_Switch
TELNET
IPMI
SNMP
MLNX_OS
To modify these access credentials, use the UFM Web UI. For more information, see "Device Access".
Configuring the UFM communication protocol with MLNX-OS switches. The available protocols are:
http
https (default protocol for secure communication)
For configuring the UFM communication protocol after fresh installation and prior to the first run, set the MLNX-OS protocol as shown below.
Example:
[MLNX_OS]
protocol = https
port = 443
Once UFM is started, all UFM communication with MLNX-OS switches will take place via the configured protocol.
For changing the UFM communication protocol while UFM is running, perform the following:
Set the desired protocol of MLNX-OS in the conf/gv.cfg file (as shown in the example above).
Restart UFM.
Update the MLNX-OS global access credentials configuration with the relevant protocol port. Refer to Device Access for help.
For the http protocol - default port is 80.
For the https protocol - default port is 443.
Update the MLNX-OS access credentials with the relevant port in all managed switches that have a valid IP address.
Managing Dynamic Telemetry
Dynamic telemetry management lets users request the creation of multiple telemetry instances. UFM enables users to create new UFM Telemetry instances with their preferred counters and configurations. These instances are not started by UFM itself; instead, their operational status is monitored using the UFM Telemetry bring-up tool.
For more information on the supported REST APIs, please refer to UFM Dynamic Telemetry Instances REST API.
The configuration parameters can be found in the gv.cfg configuration file under the DynamicTelemetry section.
Name | Description | Default value
max_instances | Maximum number of simultaneously running UFM Telemetry instances. | 5
new_instance_delay | Delay time between the start of two UFM Telemetry instances, in minutes. | 5
update_discovery_delay | The time to wait before updating the discovery file of each telemetry instance if the fabric has changed, in minutes. | 10
endpoint_timeout | Telemetry endpoint timeout, in seconds. | 5
bringup_timeout | Telemetry bringup tool timeout, in seconds. | 60
initial_exposed_port | Initial port for the available range of ports (range(initial_exposed_port, initial_exposed_port + max_instances)). | 9003
instances_sessions_compatibility_interval | Every instances_sessions_compatibility_interval minutes, UFM verifies compliance between instances and sessions to avoid zombie sessions. If 0 is configured, this process is not activated. | 10
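The port range expression given for initial_exposed_port can be evaluated directly. The short sketch below does so with the default values from the table; the function name exposed_ports is illustrative only.

```python
def exposed_ports(initial_exposed_port: int = 9003, max_instances: int = 5) -> list[int]:
    """Ports available to dynamic telemetry instances, per the documented range."""
    return list(range(initial_exposed_port, initial_exposed_port + max_instances))


print(exposed_ports())  # [9003, 9004, 9005, 9006, 9007]
```

With the defaults, the range covers five ports, one per allowed instance, and the upper bound (9008) is excluded.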
SM Trap Handler Configuration
The SM Trap Handler is the SOAP server that handles traps coming from OpenSM.
There are two configuration values related to this service:
osm_traps_debounce_interval – defines the period during which the service holds incoming traps
osm_traps_throttle_val – once osm_traps_debounce_interval elapses, the service transfers up to osm_traps_throttle_val traps to the Model Main
By default, the SM Trap Handler handles up to 1000 SM traps every 10 seconds.
Setting CPU Affinity on UFM
This feature allows setting the CPU affinity for the major processes of the UFM (such as ModelMain, SM, SHARP, Telemetry).
Pinning each major process to its own isolated CPUs reduces the number of context switches, which increases UFM's efficiency and optimizes performance.
The CPU affinity of these major processes is configured in the following two levels:
Level 1- The major processes initiation.
Level 2- Preceding initiation of the model's main subprocesses which automatically uses the configuration used in level 1 and designates a CPU for each of the sub-processes.
According to user configuration, each process is assigned with affinity.
By default, this feature is disabled. To activate it, set Is_cpu_affinity_enabled to true, check how many CPUs you have on the machine, and set the desired affinity for each process.
For example:
[CPUAffinity]
Is_cpu_affinity_enabled=true
Model_main_cpu_affinity=1-4
Sm_cpu_affinity=5-13
SHARP_cpu_affinity=14-22
Telemetry_cpu_affinity=22-23
The format should be a comma-separated list of CPUs. For example: 0,3,7-11.
ModelMain should have four to five cores. The SM should have as many cores as you can assign. Keep the ModelMain cores and the SM cores separate from each other.
SHARP can be assigned the same affinity as the SM. Telemetry should be assigned three to four CPUs.
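The affinity format (a comma-separated list of CPUs and ranges, such as 0,3,7-11) and the advice to keep the ModelMain and SM cores separate can be checked with a small parser. This is a minimal sketch; the function name parse_cpu_list is hypothetical, and UFM's own parsing may differ in details such as error handling.

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as '0,3,7-11' into a set of CPU indices."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid CPU range: {part}")
            cpus.update(range(start, end + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


model_main = parse_cpu_list("1-4")
sm = parse_cpu_list("5-13")
print(sorted(parse_cpu_list("0,3,7-11")))
print("ModelMain/SM overlap:", sorted(model_main & sm))
```

An empty overlap means the ModelMain and SM cores are isolated from each other. On Linux, a set of this shape is also the kind of value that Python's os.sched_setaffinity accepts as its CPU mask.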
Quality of Service (QoS) Support
InfiniBand Quality of Service (QoS) is disabled by default in the UFM SM configuration file.
To enable it and benefit from its capabilities, set the qos flag to TRUE in the /opt/ufm/files/conf/opensm/opensm.conf file.
Example:
# Enable QoS setup
qos TRUE
The QoS parameter settings should be carefully reviewed before enabling the qos flag. In particular, the sl2vl and VL arbitration mappings should be correctly defined.
For information on Enhanced QoS, see Appendix – Used Ports.
UFM Failover to Another Port
When the UFM Server is connected to the fabric by two or more InfiniBand ports, you can configure UFM Subnet Manager failover to one of the other ports. When a failure is detected on an InfiniBand port or link, failover occurs without stopping the UFM Server or other related UFM services, such as mysql, http, DRBD, and so on. This failover process prevents failure in a standalone setup, and preempts failover in a High Availability setup, thereby reducing downtime and recovery time.
Network Configuration for Failover to IB Port
To enable UFM failover to another port:
Configure bonding between the InfiniBand interfaces to be used for SM failover. In an HA setup, the UFM active server and the UFM standby server can be connected differently; but the bond name must be the same on both servers.
Set the value of fabric_interface to the bond name using the /opt/ufm/scripts/change_fabric_config.sh command, as described in Configuring General Settings in gv.cfg. If ufma_interface is configured for IPoIB, set it to the bond name as well. These changes will take effect only after a UFM restart. For example, if bond0 is configured on the ib0 and ib1 interfaces, set the fabric_interface parameter in gv.cfg to bond0.
If IPoIB is used for UFM Agent, add bond to the ufma_interfaces list as well.
When failure is detected on an InfiniBand port or link, UFM initiates the give-up operation that is defined in the Health configuration file for OpenSM failure. By default:
UFM discovers the other ports in the specified bond and fails over to the first interface that is up (SM failover)
If no interface is up:
In an HA setup, UFM initiates UFM failover
In a standalone setup, UFM does nothing
If the failed link becomes active again, UFM will select this link for the SM only after SM restart.
Configuring Managed Switches Info Persistency
UFM uses a periodic system information-pulling mechanism to query managed switches inventory data. The inventory information is saved in local JSON files for persistency and tracking of managed switches' status.
Upon UFM start up, UFM loads the saved JSON files to present them to the end user via REST API or UFM WebUI.
After UFM startup is completed, UFM pulls all managed switches data and updates the JSON file and the UFM model periodically (the interval is configurable). In addition, the JSON files are part of UFM system dump.
The following parameters allow configuration of the feature via the gv.cfg file:
[SrvMgmt]
# how often UFM should send json requests for sysinfo to switches (in seconds)
systems_poll = 180
# To create UFM model in large setups might take a lot of time.
# This is an initial delay (in minutes) before starting to pull sysinfo from switches.
systems_poll_init_timeout = 5
# to avoid sysinfo dump overloading and multiple writing to host
# switches sysinfo will be dumped to disc in json format every set in this variable
# sysinfo request. If set to 0 - will not be dumped, if set to 1 - will be dumped every sysinfo request
# in this case (as example defined below) dump will be created every fifth sysinfo request, so if system_poll is 180 sec (3 minutes) sysinfo dump to the file will be performed every 15 minutes.
sysinfo_dump_interval = 5
# location of the sysinfo dump file (it is in /opt/ufm/files/logs (it will be part of UFM dump)
sysinfo_dump_file_path = /opt/ufm/files/log/sysinfo.dump
Configuring Partial Switch ASIC Failure Events
UFM can identify a switch ASIC failure by detecting that a pre-defined portion of the switch ports is reported as unhealthy. By default, this threshold is set to 20% of the total switch ports. Thus, UFM triggers the partial switch ASIC event when the number of unhealthy switch ports exceeds 20% of the total switch ports.
You can configure UFM to control Partial Switch ASIC Failure events.
To do so, update the value of the switch_asic_fault_threshold parameter under the UnhealthyPorts section of the gv.cfg file.
For example, with the default threshold of 20%, a switch with 32 ports triggers the partial switch ASIC event once 7 ports are detected as unhealthy.
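The arithmetic behind the 32-port case can be written out as a short sketch: 20% of 32 ports is 6.4, so the count first exceeds the threshold at 7 unhealthy ports. The function name ports_needed_to_trigger is hypothetical, and the sketch assumes "exceeds" means strictly greater than the configured percentage.

```python
def ports_needed_to_trigger(total_ports: int, threshold_percent: int = 20) -> int:
    """Smallest number of unhealthy ports that strictly exceeds the threshold."""
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    return (total_ports * threshold_percent) // 100 + 1


print(ports_needed_to_trigger(32))  # 7
```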
Enabling Network Fast Recovery
To enable the Network Fast Recovery feature, ensure that all switches in the fabric use the following MLNX-OS/firmware versions:
MLNX-OS version 3.10.6004 and up
Quantum firmware versions:
Quantum FW v27.2010.6102 and up
Quantum2 FW v31.2010.6102 and up
Fast recovery is a switch-firmware based facility for isolation and mitigation of link-related issues. This system operates in a distributed manner, where each switch is programmed with a simple set of rule-based triggers (conditions) and corresponding action protocols. These rules permit the switch to promptly react to substandard links within its locality, responding at a very short reaction time - as little as approximately 100 milliseconds. The policy is provided and managed via the UFM and SM channel. Moreover, every autonomous action taken by a switch in the network is reported to the UFM.
The immediate reactions taken by the switch enable SHIELD and pFRN. These mechanisms collaborate to rectify routing within the proximity of the problematic link before it can disrupt transactions at the transport layer. Importantly, this process occurs rapidly, effectively limiting the spreading of congestion to a smaller segment of the network.
To enable the Network Fast Recovery feature, you must activate the appropriate trigger (condition) in the gv.cfg file. This allows you to specify which of the four supported triggers UFM should handle.
Additionally, you can configure how UFM responds when a port becomes unhealthy due to a Network Fast Recovery condition:
no_reset – Monitor only, without taking any action.
physical_reset or logical_reset – Monitor and respond by performing a reset.
The selected behavior will apply to all configured conditions.
As stated in the gv.cfg file, the feature is disabled by default. The supported fields and options are:
[NetworkFastRecovery]
# Fast Recovery configuration.
# Supported values:
# 0: Ignore fast recovery related MADs and configuration (default)
# 1: Disable fast recovery
# 2: Enable fast recovery
fast_recovery_mode = 0
# This will be supported by the Network Fast Recovery.
network_fast_recovery_conditions = SWITCH_DECISION_CREDIT_WATCHDOG,SWITCH_DECISION_RAW_BER,SWITCH_DECISION_EFFECTIVE_BER,SWITCH_DECISION_SYMBOL_BER
# Network Fast Recovery action configuration.
# Supported values:
# no_reset: Only monitor the conditions (default)
# physical_reset : Monitor and react by performing physical reset
# logical_reset : Monitor and react by performing logical reset
network_fast_recovery_action = no_reset
To enable the Network Fast Recovery feature, the value of fast_recovery_mode should be set to 2. For the change to take effect, a restart of UFM Enterprise is required.
Parameter | Description
SWITCH_DECISION_CREDIT_WATCHDOG | The switch decided to close the port due to Credit watchdog
SWITCH_DECISION_RAW_BER | The switch decided to close the port due to High raw errors
SWITCH_DECISION_EFFECTIVE_BER | The switch decided to close the port due to High effective errors (after FEC)
SWITCH_DECISION_SYMBOL_BER | The switch decided to close the port due to High symbol errors (after PLR)
By default, the Network Fast Recovery feature operates in monitoring mode. This means the switch does not reset ports; however, it reports issues related to them. To view these port-related issues, you must deploy the UFM PMC (Packet Monitoring Collector) plugin and use its UI to access the relevant network events.
For more details on the PMC plugin, including deployment instructions and how to view Network Fast Recovery events, please refer to Packet Level Monitoring Collector (PMC) Plugin.
Disabling Rest Roles Access Control
By default, the Rest Roles Access Control feature is enabled. It can be disabled by setting the roles_access_control_enabled flag to false:
[RolesAccessControl]
roles_access_control_enabled = false
Enabling/Disabling Authentication
Kerberos Authentication
By default, Kerberos Authentication is disabled. To enable it, set the kerberos_auth_enabled flag to true. Additionally, provide the required configurations such as kerberos_cred_key_path, kerberos_use_local_name and kerberos_auto_sign_up.
[KerberosAuth]
# This section is responsible for managing kerberos authentication
# Set to true to enable the kerberos auth feature, and set to false to disable it. Default is false.
kerberos_auth_enabled = false
# The path of the keytab file containing credentials for GSSAPI authentication.
kerberos_cred_key_path = /etc/kadm5.keytab
# Set to true to configure the Apache server to map authenticated principal names (which represent different clients) to local usernames,
# and set to false to use the principal names as usernames. Default is true (this value will be reflected in the 'GssapiLocalName' directive in Apache).
kerberos_use_local_name = true
# Set to true to enable auto sign up of users who do not exist in UFM DB. Default is true.
kerberos_auto_sign_up = true
# The default role assigned to created users if they do not exist when 'kerberos_auto_sign_up' is set to true.
kerberos_default_role = System_Admin
kerberos_auth_enabled: By default, Kerberos authentication is disabled. To activate it, set this flag to 'true' and then restart UFM.
kerberos_cred_key_path: The path to the keytab file containing credentials for GSSAPI authentication.
kerberos_use_local_name: Set to true to configure the Apache server to map authenticated principal names (which represent different clients) to local usernames, and set to false to use the principal names as usernames. Default is true (this value will be reflected in the 'GssapiLocalName' directive in Apache).
kerberos_auto_sign_up: For successful authentication via Kerberos, the user must already exist in the UFM database; otherwise, UFM refuses the authentication. If this property is set to 'true', UFM creates the missing users in the UFM DB.
kerberos_default_role: The default role assigned to newly created users when 'kerberos_auto_sign_up' is set to true.
Finally, restart the UFM to use Kerberos authentication.
UFM Authentication Server
By default, UFM Authentication Server is enabled. To disable it, set the auth_service_enabled parameter to false and then restart the UFM service for the change to take effect. Additionally, you can use enable/disable flags for Basic, Session, and Token authentication:
[AuthService]
auth_service_enabled = true
auth_service_interface = 127.0.0.1
auth_service_port = 8087 # the serving port for the authentication server
basic_auth_enabled = true
session_auth_enabled = true
token_auth_enabled = true
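As an illustration, the enabled authentication methods can be read from an [AuthService] section with Python's standard configparser module. This is a minimal sketch and not part of UFM: the sample text mirrors the settings listed above, the helper name enabled_auth_methods is hypothetical, and it assumes that the inline "#" comment after the port value should be stripped when parsing.

```python
import configparser

# Sample text mirroring the [AuthService] settings shown in this section.
SAMPLE = """
[AuthService]
auth_service_enabled = true
auth_service_interface = 127.0.0.1
auth_service_port = 8087 # the serving port for the authentication server
basic_auth_enabled = true
session_auth_enabled = true
token_auth_enabled = true
"""

def enabled_auth_methods(cfg_text):
    """Return the auth methods enabled in an [AuthService] section (hypothetical helper)."""
    # inline_comment_prefixes lets the parser drop the trailing '# ...' comment.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    section = parser["AuthService"]
    # If the authentication server itself is off, no method is available.
    if not section.getboolean("auth_service_enabled", fallback=True):
        return []
    flags = ("basic_auth_enabled", "session_auth_enabled", "token_auth_enabled")
    return [name.replace("_auth_enabled", "") for name in flags
            if section.getboolean(name, fallback=False)]

print(enabled_auth_methods(SAMPLE))
```

The same approach can be pointed at a copy of gv.cfg to review the settings before restarting UFM.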
Azure AD Authentication
By default, Azure AD Authentication is disabled. To enable it, set the azure_auth_enabled flag to true. Additionally, provide the required configurations from the Azure AD Application, such as TENANT_ID, CLIENT_ID and CLIENT_SECRET, which can be found under the "Overview" section of the registered application in the Azure portal. Finally, the UFM Authentication Server must be enabled to use Azure AD Authentication.
[AzureAuth]
azure_auth_enabled = false
# TENANT ID of app registration
TENANT_ID =
# Application (client) ID of app registration
CLIENT_ID =
# Application's generated client secret
CLIENT_SECRET =
Changing Maximum SSL Request Size when Using Client Certificate
Activate Client Certificate Authentication. For more details, refer to the Client Authentication REST API.
In the [Server] section of the gv.cfg file, the max_ssl_request_size option controls the maximum request size when using client certificates:
max_ssl_request_size = 1572864
The value is expressed in bytes. The default is 1,572,864 bytes (1536 KB / 1.5 MB). If the option is not explicitly defined, the system falls back to Apache's value of 131,072 bytes (128 KB).
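The byte values quoted for max_ssl_request_size can be checked with a short conversion. This is only an arithmetic sketch; the constant and function names are illustrative and do not come from UFM.

```python
# Values quoted in this section, in bytes.
DEFAULT_MAX_SSL_REQUEST_SIZE = 1572864  # value shown for max_ssl_request_size
APACHE_FALLBACK_SIZE = 131072           # Apache's value when the option is not defined

def to_kib(size_bytes):
    """Convert a size in bytes to units of 1024 bytes."""
    return size_bytes / 1024

print(to_kib(DEFAULT_MAX_SSL_REQUEST_SIZE))         # 1536.0
print(to_kib(DEFAULT_MAX_SSL_REQUEST_SIZE) / 1024)  # 1.5
print(to_kib(APACHE_FALLBACK_SIZE))                 # 128.0
```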
The Setting Up SSL and CA Certificates in UFM and Changing Maximum SSL Request Size when Using Client Certificate features cannot be used simultaneously. Removing, updating, or configuring will delete the certificates.
Adjusting UFM Configuration Files Based on Fabric Size
This function automates updating the UFM configuration files: a script parses a primary configuration file, large_scale_subnet.cfg, and applies its values to multiple target files, or resets them to default values using small_scale_subnet.cfg.
The instructions below describe how to use a Python script to parse a configuration file (large_scale_subnet.cfg) and update the values of specific parameters in multiple target UFM configuration files (gv.cfg, reports.cfg, opensm.conf, and sharp_am.cfg). The script can operate in two modes:
Large Scale Subnet Mode: Updates the UFM configuration files directly based on the configuration parsed from the large_scale_subnet.cfg file.
Small Scale Subnet (Default) Mode: Sets the UFM configuration files to their default values by parsing the small_scale_subnet.cfg file.
Configuration File and Parameters
The primary configuration file contains all the parameters whose values are to be applied across the multiple UFM configuration files.
Primary Configuration Files
/opt/ufm/files/conf/large_scale_subnet.cfg
/opt/ufm/files/conf/small_scale_subnet.cfg
Target UFM Configuration Files
/opt/ufm/files/conf/gv.cfg
/opt/ufm/files/conf/reports.cfg
/opt/ufm/files/conf/opensm/opensm.conf
/opt/ufm/files/conf/sharp/sharp_am.cfg
Example structure of large_scale_subnet.cfg and small_scale_subnet.cfg:
[GV]
[GV.Server]
# disabled (default) | enabled (configure opensm with multiple GUIDs) | ha_enabled (configure multiport SM with high availability).
multi_port_sm = ha_enabled
# report_events that will determine which trap to send to ufm all/security/none
report_events = security
[GV.FabricAnalysis]
# initial_delay (in minutes) - the initial delay for running fabric analysis for the first time after UFM was started
initial_delay = 10
[GV.logrotate]
#max_files specifies the number of times to rotate a file before it is deleted.
#A count of 0 (zero) means no copies are retained. A count of 10 means ten copies are retained (default is 10)
max_files = 10
[REPORTS]
[REPORTS.FabricHealth]
# Fabric health report timeout
timeout = 1800
[REPORTS.TopologyCompare]
# Topology compare report timeout
timeout = 1800
[REPORTS.FabricAnalysis]
# Fabric analysis report timeout
timeout = 1800
[OPENSM]
#Amount of physical port to handle in one shot
virt_max_ports_in_process = 512
max_op_vls = 2
qos = TRUE
# Single MAD Sl2vl for all ports
use_optimized_slvl = TRUE
# Timeout for long MAD config time. might need to change 1000
long_transaction_timeout = 500
routing_engine = ar_updn
use_ucast_cache = TRUE
root_guid_file = /opt/ufm/files/conf/opensm/root_guid.conf
pgrp_policy_file = /opt/ufm/files/conf/opensm/pgrp_policy.conf
[SHARP]
ib_qpc_sl = 1
fabric_update_interval = 10
lst_file_timeout = 10
lst_file_retries = 30
max_tree_radix = 80
generate_dump_files = TRUE
dynamic_tree_allocation = TRUE
dynamic_tree_algorithm = 1
smx_keepalive_interval = 10
Script Usage Example in the CLI:
Large Scale Subnet Mode:
/opt/ufm/scripts/set_ufm_scale_profile.sh --mode large_scale_subnet --force_update
Small Scale Subnet (Default) Mode:
/opt/ufm/scripts/set_ufm_scale_profile.sh --mode small_scale_subnet
The force_update script parameter adds any parameters found in large_scale_subnet.cfg and small_scale_subnet.cfg that are not present in the UFM configuration files. For example, if a user adds a new parameter called test_param = 500 to the large_scale_subnet.cfg file under the [Server] section and this parameter does not exist in the gv.cfg file, running the script with the --force_update option will add test_param = 500 to the [Server] section of the gv.cfg file.
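The effect of force_update can be sketched with Python's configparser. This is not the code of set_ufm_scale_profile.sh; it is a simplified illustration that assumes the profile and the target use the same section names, and the function name apply_profile is hypothetical.

```python
import configparser

def apply_profile(target, profile, force_update=False):
    """Copy values from profile into target; return the skipped (section, key) pairs."""
    skipped = []
    for section in profile.sections():
        for key, value in profile.items(section):
            exists = target.has_section(section) and target.has_option(section, key)
            if exists:
                # Parameter already present in the target file: update its value.
                target.set(section, key, value)
            elif force_update:
                # Parameter missing from the target file: add it only when forced.
                if not target.has_section(section):
                    target.add_section(section)
                target.set(section, key, value)
            else:
                skipped.append((section, key))
    return skipped

# Stand-ins for gv.cfg and a profile file, using the test_param example from the text.
target = configparser.ConfigParser()
target.read_string("[Server]\nreport_events = all\n")
profile = configparser.ConfigParser()
profile.read_string("[Server]\nreport_events = security\ntest_param = 500\n")

print(apply_profile(target, profile))                     # test_param is skipped
print(apply_profile(target, profile, force_update=True))  # test_param is added
print(target["Server"]["test_param"])
```

Without force_update the unknown parameter is only reported as skipped; with it, the parameter is written into the matching section.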
Expected Output
Large Scale Subnet Mode:
The script reads large_scale_subnet.cfg.
It updates the parameters in the target UFM configuration files based on the parsed data.
It logs messages for any skipped parameters.
Small Scale Subnet (Default) Mode:
The script reads small_scale_subnet.cfg.
It updates the parameters in the target UFM configuration files based on the parsed data.
It logs messages for any skipped parameters, or adds them to the configuration file if force_update is set.
Note: If the script fails, it resets the UFM configuration files to their default values.
Setting Up SSL and CA Certificates in UFM
This feature allows you to set up SSL and CA certificates, issued and signed by a publicly trusted certificate authority (CA), in UFM.
To utilize this feature, please ensure you have the following prerequisites:
UFM Docker image
Certificate files:
A certificate file named server.crt
A key file named server.key
An optional chain of authority file named ca-intermediate.crt
All of these files must be located in the directory /opt/ufm/files/conf/webclient.
Once these prerequisites are met, UFM will utilize the provided certificates.
The Setting Up SSL and CA Certificates in UFM and Changing Maximum SSL Request Size when Using Client Certificate features cannot be used simultaneously. Removing, updating, or configuring will delete the certificates.
Default Bare-Metal Cloud Mode
The Default Bare-Metal Cloud mode facilitates strong networking and tenant isolation across multiple nodes, guided by specific network configurations and event monitoring. It outlines key principles for network membership, event tracking, and component state management to ensure secure and expected system behavior. The feature is disabled by default.
To configure UFM to run in Default Bare-Metal Cloud mode, perform the following:
Change the following flag in the gv.cfg file:
bare_metal_cloud_mode = true
Restart UFM telemetry or restart UFM.
Enabling this flag configures UFM to assign default management network memberships as limited. It also activates additional validations for PKey members, including checks to ensure all HCAs are associated with the same PKey and that there are no duplicate PKey assignments using index 0.
Faster Detection of Fan/PSU Removal on Unmanaged Switches
To enable quicker detection of fan and PSU module removals from unmanaged switches (e.g., MQM9790), update the following settings in the gv.cfg file:
enable_high_freq_fru_check = true
# Time interval between consecutive runs of fabric analysis for unmanaged switches using minimal MADs to detect FAN and PSU module removals.
# Setting a value below 1 minute will disable the feature.
high_freq_fru_interval_min = 2
After making these changes, restart UFM telemetry or the entire UFM service.
Enabling this feature allows UFM to periodically perform lightweight fabric analysis, resulting in faster indication of fan and PSU removals on unmanaged switches.
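The interaction between the two settings can be summarized in a small sketch. It only restates the rule given in the configuration comments (an interval below 1 minute disables the feature); the function name is hypothetical and this is not UFM code.

```python
def high_freq_fru_check_active(enabled, interval_min):
    """Return True when the high-frequency FRU check would run.

    enabled      -- value of enable_high_freq_fru_check
    interval_min -- value of high_freq_fru_interval_min, in minutes
    """
    # An interval below 1 minute disables the feature even if the flag is true.
    return bool(enabled) and interval_min >= 1

print(high_freq_fru_check_active(True, 2))
print(high_freq_fru_check_active(True, 0))
```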
Setting UFM Configurations Without Requiring UFM Restart
This section outlines the UFM, OpenSM, and SHARP configuration parameters that can be modified during runtime without requiring a process restart.
UFM Configuration
The following UFM configuration parameters in the
/opt/ufm/files/conf/gv.cfg file can be updated without restarting the UFM ModelMain process:
Section Name
Parameter Name
Logging
logrotate
DailyReport
Multisubnet
Server
OpenSM Configurations
The following OpenSM parameters in the /opt/ufm/files/conf/opensm/opensm.conf file can be changed without requiring a restart of the OpenSM process:
m_key, sm_key, sa_key, allowed_sm_guids, m_key_lease_period, m_key_protection_level, m_key_lookup, m_key_per_port
sweep_interval, max_wire_smps, max_wire_smps2, max_smps_timeout, max_sa_reports_queued, max_sa_reports_on_wire
sa_etm_allow_untrusted_proxy_requests, sa_etm_allow_untrusted_guidinfo_rec, sa_etm_allow_guidinfo_rec_by_vf, sa_etm_max_num_mcgs, sa_etm_max_num_srvcs, sa_etm_max_num_event_subs
sa_rate_threshold, sa_check_sgid_spoofing, max_msg_fifo_timeout, sm_priority, qos_config_vl_enabled, max_op_vls, max_op_vls_ca, max_op_vls_sw, max_op_vls_rtr
suppress_mc_pkey_traps, force_link_speed, force_link_speed_ext, force_link_speed_ext2, force_link_width, fdr10, support_mepi_speeds, mepi_enabled_speeds
reassign_lids, ignore_other_sm, disable_multicast, subnet_timeout, packet_life_time, vl_stall_count, leaf_vl_stall_count, head_of_queue_lifetime, leaf_head_of_queue_lifetime
local_phy_errors_threshold, overrun_errors_threshold, use_mfttop, sminfo_polling_timeout, polling_retry_number, force_heavy_sweep, port_profile_switch_nodes, sweep_on_trap
routing_engine, enable_queries_during_routing, connect_roots, calculate_missing_routes, max_cas_on_spine, find_roots_color_algorithm, dfp_find_roots_color_algorithm
log_max_size, log_num_backlogs, log_flags, force_log_flush, accum_log_file, no_partition_enforcement, part_enforce, keep_pkey_indexes, sm_assigned_guid
qos, suppress_sl2vl_mad_status_errors, override_create_mcg_sl, port_shifting, scatter_ports, updn_lid_tracking_mode, updn_lid_tracking_converge_routes
updn_lid_tracking_prefer_total_routes, dfp_max_cas_on_spine, max_seq_redisc, aguid_inout_notice, sm_assign_guid_func, mc_primary_root_guid, mc_secondary_root_guid
max_reverse_hops, routing_threads_num, max_threads_per_core, guid_routing_order_no_scatter, use_scatter_for_switch_lid, offsweep_balancing_enabled, offsweep_balancing_window
sa_db_dump, sm_db_dump, torus_config, do_mesh_analysis, exit_on_fatal, honor_guid2lid_file, sm_inactive, babbling_port_policy, drop_subscr_on_report_fail
drop_event_subscriptions, drop_unreachable_event_subscriptions, ipoib_mcgroup_creation_validation, mcgroup_join_validation, use_original_extended_sa_rates_only
max_rate_enum, reports, use_optimized_slvl, use_optimized_port_mask_slvl, fsync_high_avail_files, default_mcg_mtu, default_mcg_rate
qos_max_vls, qos_high_limit, qos_vlarb_high, qos_vlarb_low, qos_sl2vl, qos_ca_max_vls, qos_ca_high_limit, qos_ca_vlarb_high, qos_ca_vlarb_low, qos_ca_sl2vl
qos_sw0_max_vls, qos_sw0_high_limit, qos_sw0_vlarb_high, qos_sw0_vlarb_low, qos_sw0_sl2vl, qos_swe_max_vls, qos_swe_high_limit, qos_swe_vlarb_high, qos_swe_vlarb_low, qos_swe_sl2vl
qos_sw2sw_max_vls, qos_sw2sw_high_limit, qos_sw2sw_vlarb_high, qos_sw2sw_vlarb_low, qos_sw2sw_sl2vl, qos_rtr_max_vls, qos_rtr_high_limit, qos_rtr_vlarb_high, qos_rtr_vlarb_low, qos_rtr_sl2vl
mlnx_congestion_control, cc_key_enable, cc_key_lease_period, cc_key_protect_bit, vs_key_enable, vs_key_lease_period, vs_key_ci_protect_bits
n2n_key_enable, n2n_key_lease_period, n2n_key_protect_bit, key_mgr_seed, enable_quirks, no_clients_rereg, client_rereg_mode, consolidate_ipv6_snm_req
consolidate_ipv4_mask, lash_start_vl, sm_sl, log_prefix, max_msg_fifo_len, max_alt_dr_path_retries, sa_pr_full_world_queries_allowed
hm_ports_health_policy_file, force_heavy_sweep_window, validate_smps, enable_inc_mc_routing, allow_sm_port_reset, port_speed_change_action, port_ext_speed_change_action
port_ext_speed2_change_action, port_mepi_speed_change_action, port_mtu_change_action, port_vl_change_action, port_ame_bit_change_action
support_mlnx_enhanced_link, mlnx_enhanced_link_enable, adaptive_timeout_sl_mask, mepi_cache_enabled, virt_max_ports_in_process, virt_default_hop_limit
enable_virt_rec_ext, improved_lmc_path_distribution, syslog_log_flags, sweep_every_hup_signal, osm_stats_interval, osm_stats_dump_limit, osm_perflog_dump_limit
enable_performance_logging, quasi_ftree_indexing, sharp_enabled, rtr_aguid_enable, rtr_pr_flow_label, rtr_pr_tclass, rtr_pr_sl, rtr_pr_mtu, rtr_pr_rate
aguid_default_hop_limit, verbose_bypass_policy_file, enable_subnet_lst, dor_hyper_cube_mode, additional_gi_supporting_devices, additional_mepi_force_devices
activity_report_subjects, enhanced_qos_vport0_unlimit_default_rl, adv_routing_engine, ar_mode, shield_mode, ar_sl_mask, enable_ar_by_device_cap, enable_ar_group_copy
ar_transport_mask, ar_tree_asymmetric_flow, ar_tree_asymmetric_flow_threshold, ar_tree_asymmetric_flow_threshold_limit, routing_flags, dump_ar
cache_ar_group_id, dfp_down_up_turns_mode, enable_vl_packing, topo_config_enabled, rtr_selection_function, rtr_selection_seed, rtr_selection_algo_parameters
get_mft_tables, hbf_sl_mask, hbf_hash_type, hbf_seed_type, hbf_seed, hbf_hash_fields, hbf_weights, pfrn_sl, pfrn_mask_clear_timeout, pfrn_mask_force_clear_timeout
pfrn_over_router_enabled, tenants_policy_enabled, reply_lid_smps_in_dr, respond_unknown_lid_traps, fabric_mode_profile, issu_mode, issu_timeout, issu_pre_upgrade_time
SHARP Configuration
For parameters found in the /opt/ufm/files/conf/sharp/sharp_am.cfg file, no configuration changes can be applied during runtime. Any updates to this file require restarting the sharp_am process for changes to take effect.
Setting up Telemetry in UFM
Setting up telemetry deploys UFM Telemetry as bare metal on the same machine. Historical data is sent to an SQLite database on the server, and live data becomes available via the UFM UI or REST API.
Enabling UFM Telemetry
The UFM Telemetry feature is enabled by default, and the provider is UFM Telemetry. The user may change the provider via a flag in conf/gv.cfg. The user may also disable the History Telemetry feature in the same section.
[Telemetry]
history_enabled=True
Enabling UFM Telemetry Manager Plugin
UFM Telemetry can be managed via the UFM Telemetry Manager (UTM) Plugin. To enable UTM mode, deploy the plugin, set one or all of the following flags under the [Telemetry] section of /opt/ufm/files/conf/gv.cfg to false, and restart UFM. The flags are shown here with legacy mode still enabled (true):
[Telemetry]
primary_telemetry_legacy_mode = true
secondary_telemetry_legacy_mode = true
dynamic_telemetry_legacy_mode = true
To edit the UFM Telemetry configuration, edit:
/opt/ufm/files/conf/telemetry_defaults/primary_env.cfg for primary,
/opt/ufm/files/conf/secondary_telemetry_defaults/secondary_env.cfg for secondary,
/opt/ufm/files/conf/dynamic_telemetry_defaults/dynamic_env.cfg for dynamic.
Here is a list of available UTM Plugin APIs:
Command                    Method   Description
-------------------------  -------  ------------------------------------------------------------
/status                    GET      Return status of managed telemetry setup
/status.html               GET      Return status in HTML format
/switches                  GET      Return all switches
/hcas                      GET      Return all HCAs
/guids                     GET      Return all GUIDS PORTS pairs
/add_server                GET      Add a new telemetry instance at 'url' added to 'group'.
                                    Query parameters:
                                      url=[target ip, required],
                                      group='default'
/remove_server             GET      Remove paused or running telemetry instance at URL (required query parameter url=URL)
/pause_server              GET      Pause an existing telemetry instance at URL (required query parameter url=URL)
/start_server              GET      Start a paused telemetry instance at URL (required query parameter url=URL)
/restart_server            GET      Restart paused or running telemetry instance at URL (required query parameter url=URL)
                                    Apply new configuration if query parameters are set:
                                      xcset_name= regenerate new config ini file for IB Telemetry
                                      http_port= replace http port
                                      sample_rate= update sample rate
/get_server_xcsets         GET      Get list of available xcsets available on server at URL (required query parameter url=URL)
/deploy_switch_agents      GET      Deploy agent image to switches to run telemetry within it later
                                      ip_list=[required] csv list of target switches IPs or 'all' to deploy to all switches
                                    Examples:
                                      1. deploy to all the managed switches:
                                         curl -X GET 127.0.0.1:8888/deploy_switch_agents?ip_list=all
                                      2. deploy to switches with IPs 127.0.1.1 and 127.0.1.2:
                                         curl -X GET 127.0.0.1:8888/deploy_switch_agents?ip_list=127.0.1.1,127.0.1.2
/remove_switch_agents      GET      Remove deployed switch agents and switch telemetry
                                      ip_list=[required] csv list of target switches IPs or 'all' to remove from all deployed switches
/switch_mon_list           GET      Set IPs of managed switches that will be periodically monitored.
                                      ip_list=[required] csv list of target switches IPs or 'all' to deploy to all switches
                                    Examples:
                                      1. deploy to all the managed switches:
                                         curl -X GET 127.0.0.1:8888/switch_mon_list?ip_list=all
                                      2. deploy to switches with IPs 127.0.1.1 and 127.0.1.2:
                                         curl -X GET 127.0.0.1:8888/switch_mon_list?ip_list=127.0.1.1,127.0.1.2
/managed_switches_status   GET      Get info in JSON-format about managed switches in Distributed Telemetry.
                                    Query parameters:
                                      monitored_only=[1|0] show only monitored switches (the switches that were set via /switch_mon_list endpoint). Default 0.
/start_switch_telemetry    GET      (Re)configure and (re)start switch telemetry. Query parameters:
                                      ip=[target ip], set switch IP to (re)start a single switch telemetry,
                                        otherwise all the managed switches with installed switch agents will (re)start telemetry
                                      sample_rate,
                                      restart_every,
                                      counter_set,
                                      http_port,
                                      dt_udp_handshake,
                                      dt_udp_data_ack,
                                      gnmi
/stop_switch_telemetry     GET      Stop all running switch telemetry instance
                                      ip=[target ip], set switch IP to stop a single switch telemetry,
                                        otherwise all the running switch telemetries will be stopped
/set_switch_creds          POST     Set switch credentials.
                                    Required query parameters:
                                      ip=
                                      user=
                                      pass=
/xcset/all                 GET      Get cached name-hash xcset pairs
/xcset                     GET      Get xcset content by name
/xcset                     DELETE   Delete xcset from cache
/xcset                     POST     Add/overwrite cached xcset
                                    Example:
                                      curl -X POST 0.0.0.0:8888/xcset?name=test --data-binary @path/to/file.xcset
/filter                    POST     Post [x|f]set name to the host telemetry instance. This data will be accessible via /[xc,c,f]set/name info endpoint.
                                    Required args:
                                      name=[required] set name
                                      ext=[required] set extension (fset, cset, xcset)
                                      session_id=[required] session ID of the host telemetry instance
                                    Examples:
                                      curl -X POST 0.0.0.0:8888/filter?name=test&ext=fset&session_id --data-binary @path/to/file.fset
                                      curl -X POST 0.0.0.0:8888/filter?name=test&ext=cset&session_id --data-binary @path/to/file.cset
                                      curl -X POST 0.0.0.0:8888/filter?name=test&ext=xcset&session_id --data-binary @path/to/file.xcset
/host/create_telemetry     POST     Deploy and start internal host telemetry.
                                    Note: works only when telemetry bringup is installed to UTM image
                                    Required query parameters:
                                      hca=
                                      http_port=
                                      sample_rate=
                                    Optional query parameters:
                                      xcset_name= regenerate new config ini file for IB Telemetry
                                      env_params= csv line of env parameters
                                      ibdiag_opts= csv line of ibdiag options
                                      group= set telemetry group name
                                    Optional POST parameters:
                                      env_file path to env file. Format is key=value per line.
                                      Usage:
                                        curl ... --data-binary @path/to/file
/host/remove_telemetry     GET      Stop and remove host telemetry.
                                    Note: works only when telemetry bringup is installed to UTM image
                                    One of the following query params is required:
                                      session_id= to stop by session ID
                                      group= to stop all instances from the group
/host/get_sessions         GET      Get session id to telemetry instance mapping
/help.html                 GET      Print help information in HTML format
/help                      GET      Print help information
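Request URLs for the commands listed above can be composed programmatically. The following is a minimal sketch using only the Python standard library; it builds the URL string and does not send a request. The helper name utm_url is hypothetical, and the http:// scheme is an assumption (the curl examples in the list omit it).

```python
from urllib.parse import urlencode

def utm_url(base, command, **params):
    """Build a UTM plugin request URL (hypothetical helper; only composes the string)."""
    # Keep commas unescaped so csv values such as ip_list stay readable.
    query = urlencode(params, safe=",")
    return f"{base}/{command.lstrip('/')}" + (f"?{query}" if query else "")

# Address and port taken from the curl examples in the list above.
print(utm_url("http://127.0.0.1:8888", "deploy_switch_agents",
              ip_list="127.0.1.1,127.0.1.2"))
print(utm_url("http://127.0.0.1:8888", "/status"))
```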
For example, to update an existing cset in UTM mode, get the session_id using the /host/get_sessions API and update the cset with a POST request:
curl -k -u admin:123456 -X POST "https://localhost/ufmRest/plugin/utm/filter?name=minimal&ext=cset&session_id=92" --data-binary @/opt/ufm/conf/telemetry/prometheus_configs/cset/minimal.cset
Changing Telemetry Bind Address
By default, UFM Telemetry binds to 127.0.0.1. Users can modify this setting for both primary and secondary telemetry instances using the appropriate flags in conf/gv.cfg.
[Telemetry]
primary_ip_bind_addr=0.0.0.0
secondary_ip_bind_addr=0.0.0.0
Changing Telemetry Endpoint Protocol
The default protocol for the Telemetry endpoint is HTTP. However, this can be switched to HTTPS by meeting two conditions:
Ensure that the certificate files, named server.key and server.crt, are present in the following directories:
For bare-metal UFM users: /var/opt/ufm/webclient
For Docker/Appliance UFM users: /opt/ufm/files/conf/webclient
Modify the protocol setting in the configuration file conf/gv.cfg under the [Telemetry] section:
[Telemetry]
prometheus_protocol = https
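The first condition can be checked before changing the protocol. The following is a minimal sketch that only tests for the presence of the two files named above; the helper name missing_cert_files is hypothetical and the check says nothing about whether the certificates are valid.

```python
from pathlib import Path

# File names required for the HTTPS endpoint, as listed in this section.
REQUIRED_CERT_FILES = ("server.key", "server.crt")

def missing_cert_files(cert_dir):
    """Return the required certificate file names that are absent from cert_dir."""
    directory = Path(cert_dir)
    return [name for name in REQUIRED_CERT_FILES if not (directory / name).is_file()]

# /opt/ufm/files/conf/webclient is the Docker/Appliance location;
# bare-metal installations use /var/opt/ufm/webclient instead.
print(missing_cert_files("/opt/ufm/files/conf/webclient"))
```

An empty list means both files are present in the given directory.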
Changing UFM Telemetry Default Configuration
Parameters can be configured in a telemetry configuration file; the changes take effect after restarting UFM or after a failover in HA mode.
The default launch_ibdiagnet_config.ini file is located under /opt/ufm/conf/telemetry_defaults and is copied to the telemetry configuration location (/opt/ufm/conf/telemetry) upon UFM startup.
All values taken from the default file take effect at the deployed configuration file except for the following:
Note that normally the user does not have to do anything and they get two pre-configured instances – one for low frequency and one for higher-frequency sampling of the network.
Value
Description
-
-
The port on which HTTP endpoint is configured
Configures how data is indexed and stored in memory
Configures network watcher to inform ibdiagnet that network topology has changed (as ibdiagnet lacks the ability to re-discover network changes)
Specifies the location of the counterset files, which define the data to be retrieved and the corresponding counter names.
The number of iterations to run before ‘restarting’, i.e. rediscovering fabric.
A file that is ‘touched’ to indicate that an ibdiagnet restart is necessary
The following attributes are configurable via the gv.cfg:
sample_rate (gv.cfg → dashboard_interval) – only if manual_config is set to false
prometheus_port
Supporting Generic Counters Parsing and Display
As of UFM v6.11.0, UFM can support any numeric counters from the HTTP endpoint. The list of supported counters is fetched from all the configured endpoints when UFM starts.
Some of the implemented changes are as follows:
Counter naming – the naming convention for all counters is extracted from the HTTP endpoint. The default cset file is configured as follows:
"Infiniband_LinkIntegrityErrors=^LocalLinkIntegrityErrorsExtended$" to get this name to the UFM.
Counters received as floats should contain an "_f" suffix such as:
Infiniband_CBW_f=^infiniband_CBW$
Attribute units – To see units of a specific counter on the UI graphs, configure the cset file to have the counter returned as "counter_name_u_unit".
Telemetry History:
The SQLite history table (/opt/ufm/files/sqlite/ufm_telemetry.db – telemetry_calculated) contains the new naming convention of the telemetry counters.
In the case of an upgrade, all previously configured columns are renamed following the new naming convention, and then the data is saved. If a new counter that is not in the table needs to be supported, the table is altered upon UFM start.
New counter/cset to fetch – if a new cset/counter needs to be supported after UFM has already started, perform a system restart.
A new API, /UfmRestV2/telemetry/counters, was created for the UI visualization. This API returns a dictionary containing the counters that the UFM supports, based on the fetched URLs and their units (if known).
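The cset naming lines quoted in this section can be taken apart with a few lines of Python. This is an illustrative sketch only: it assumes the text left of "=" is the name reported to UFM and the text right of it is a regular expression matched against the endpoint's counter name, which is an interpretation of the two examples rather than a documented grammar. The function name parse_cset_line is hypothetical.

```python
import re

def parse_cset_line(line):
    """Split 'display_name=regex' and flag the '_f' float suffix (illustrative helper)."""
    name, _, pattern = line.strip().partition("=")
    return {"name": name, "pattern": pattern, "is_float": name.endswith("_f")}

int_counter = parse_cset_line("Infiniband_LinkIntegrityErrors=^LocalLinkIntegrityErrorsExtended$")
float_counter = parse_cset_line("Infiniband_CBW_f=^infiniband_CBW$")

print(int_counter)
print(float_counter)
# Under the assumption above, the pattern selects the endpoint counter by exact name.
print(bool(re.match(float_counter["pattern"], "infiniband_CBW")))
```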
Supporting Multiple Telemetry Instances Fetch
This functionality allows users to establish distinct Telemetry endpoints tailored to their preferences.
Users have the flexibility to set the following aspects:
Specify a list of counters they wish to pull. This can be achieved by selecting from an existing, predefined counters set (cset file) or by defining a new one.
Set the interval at which the data should be pulled.
Upon initiating the Telemetry endpoint, users can access the designated URL to fetch the desired counter data.
To enable this feature, use the flag named "additional_cset_url" under the [Telemetry] section in gv.cfg; it holds the list of additional URLs to be fetched.
Only csv extensions are supported.
Each UFM Telemetry instance run by UFM can support multiple cset (counters set) files in parallel. To have a second cset file fetched by UFM and exposed by the same UFM Telemetry instance, place the new cset file under /opt/ufm/files/conf/telemetry/prometheus_configs/cset/ and configure gv.cfg to fetch its data as described above.
Low-Frequency (Secondary) Telemetry
By default, a second UFM Telemetry instance runs, granting access to an extended set of counters that are not available in the default telemetry session. The default telemetry session is used for the UFM Web UI dashboard and user-defined telemetry views. These additional counters can be accessed via the following API endpoint: http://<UFM_IP>:9002/csv/xcset/low_freq_debug.
It is important to note that these exposed counters are not accessible through UFM's REST APIs. All the configurations for the second telemetry instance can be found under /opt/ufm/files/conf/secondary_telemetry/, and the defaults are located under /opt/ufm/files/conf/secondary_telemetry_defaults/. The second telemetry instance also allows telemetry data to be exposed on disabled ports, although this feature can be disabled if desired.
The relevant flags in the gv.cfg file are as follows:
secondary_telemetry = true (To enable or disable the entire feature)
secondary_endpoint_port = 9002 (The endpoint's exposed port)
secondary_disabled_ports = true (If set to true, secondary telemetry will expose data on disabled ports)
secondary_slvl_support = false (If set to true, low-frequency (secondary) telemetry will collect counters per slvl; the corresponding supported xcset can be found under /opt/ufm/files/conf/secondary_telemetry/prometheus_configs/cset/low_freq_debug_per_slvl.xcset)
The counters that are supported by default, collected, and exposed can be found in /opt/ufm/files/conf/secondary_telemetry/prometheus_configs/cset/low_freq_debug_per_slvl.xcset.
For the list of low-frequency (secondary) telemetry fields and available counters, please refer to Low-Frequency (Secondary) Telemetry Fields.
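The relationship between the flags and the CSV endpoint can be sketched as follows. This is an illustration, not UFM code: it assumes the flags live under the [Telemetry] section of gv.cfg, the helper name secondary_csv_url is hypothetical, and 192.0.2.10 is a placeholder address standing in for <UFM_IP>.

```python
import configparser

# Sample text using the flag names and defaults listed in this section.
SAMPLE = """
[Telemetry]
secondary_telemetry = true
secondary_endpoint_port = 9002
"""

def secondary_csv_url(cfg_text, ufm_ip, xcset="low_freq_debug"):
    """Return the secondary telemetry CSV URL, or None when the feature is disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    telemetry = parser["Telemetry"]
    if not telemetry.getboolean("secondary_telemetry", fallback=True):
        return None
    port = telemetry.getint("secondary_endpoint_port", fallback=9002)
    return f"http://{ufm_ip}:{port}/csv/xcset/{xcset}"

print(secondary_csv_url(SAMPLE, "192.0.2.10"))
```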
Low-Frequency (Secondary) Telemetry - Exposing IPv6 Counters
To allow the low-frequency (secondary) telemetry instance to expose counters on its IPv6 interfaces, perform the following:
Change the following flag in the gv.cfg:
secondary_ip_bind_addr = 0:0:0:0:0:0:0:0
Restart UFM telemetry or restart UFM.
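For reference, the bind value used above is the IPv6 unspecified ("any") address, the IPv6 counterpart of 0.0.0.0. This can be confirmed with Python's standard ipaddress module; the snippet only inspects the address and does not touch UFM.

```python
import ipaddress

# The value assigned to secondary_ip_bind_addr in the step above.
addr = ipaddress.ip_address("0:0:0:0:0:0:0:0")

print(addr.version)         # 6
print(addr.is_unspecified)  # True
print(addr.compressed)      # ::
```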
Stopping Telemetry Endpoint Using CLI Command
To stop only the low-frequency (secondary) telemetry endpoint using the CLI, run the following command:
/etc/init.d/ufmd ufm_telemetry_secondary_stop
Exposing Switch Aggregation Nodes Telemetry
To expose telemetry for switch SHARP aggregation nodes, follow the steps below:
Configure the low-frequency (secondary) telemetry instance. Run:
vi /opt/ufm/files/conf/secondary_telemetry_defaults/launch_ibdiagnet_config.ini
Set the following:
arg_16=--sharp --sharp_opt dsc
plugin_env_CLX_EXPORT_API_SKIP_SHARP_PM_COUNTERS=0
Add the wanted attributes to the default xcset or to a new one:
New xcset –
vi /opt/ufm/files/conf/secondary_telemetry/prometheus_configs/cset/<name of your choice>.xcset
After restarting, query:
curl http://<UFM_IP>:9002/csv/xcset/<chosen_name>
Existing xcset –
vi /opt/ufm/files/conf/secondary_telemetry/prometheus_configs/cset/low_freq_debug.xcset
Add the following attributes:
packet_sent
ack_packet_sent
retry_packet_sent
rnr_event
timeout_event
oos_nack_rcv
rnr_nack_rcv
packet_discard_transport
packet_discard_sharp
aeth_syndrome_ack_packet
hba_sharp_lookup
hba_received_pkts
hba_received_bytes
hba_sent_ack_packets
rcds_sent_packets
hba_sent_ack_bytes
rcds_send_bytes
hba_multi_packet_message_dropped_pkts
hba_multi_packet_message_dropped_bytes
Restart telemetry:
/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start
Exposing Performance Histogram Counters for Egress Queue Depth Indications (Secondary) Telemetry
To enable the secondary telemetry instance to expose performance histogram counters for all VLs, perform the following:
Change the following flag in the gv.cfg file:
queue_depth_indications_all_vls = true
If this flag remains set to false, the secondary telemetry instance will only collect counters for VLs 0 and 1.
Restart UFM telemetry or restart UFM.
After the secondary telemetry instance restarts, you can find the collected counters at:
/opt/ufm/conf/secondary_telemetry/prometheus_configs/cset/low_freq_debug_per_slvl.xcset
Validating UFM Configuration Files
This functionality allows users to validate the supported configuration files against the schema of the current UFM version. This validation runs automatically over all supported UFM configuration files upon UFM start.
The validation is enabled by default. The relevant flag in the gv.cfg file is config_validation_mode. The supported configuration files are gv.cfg and opensm.conf.
The expected values for the parameter:
0: Disabled (no validation).
1: Enabled with warnings (errors are logged, UFM continues).
2: Enabled with exit (validation errors cause UFM to terminate startup).
The default value is 2.
If UFM fails to start due to a validation error, the error details will not appear in the UFM logs. Instead, check the systemctl logs using:
journalctl -u ufm-enterprise
Manual Execution Examples:
To validate a specific configuration file:
/opt/ufm/scripts/validate_ufm_config_files.sh --config-file /opt/ufm/files/conf/opensm/opensm.conf --schema-file /opt/ufm/files/conf/schemas/opensm_schema.json
To validate all configuration files:
/opt/ufm/scripts/validate_ufm_config_files.sh
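The three config_validation_mode values can be summarized as a small decision function. This is a sketch of the behavior described above, not the validator's implementation; the function name startup_action and the returned strings are illustrative.

```python
def startup_action(mode, error_count):
    """Map config_validation_mode and the number of validation errors to an outcome."""
    if mode not in (0, 1, 2):
        raise ValueError(f"unsupported config_validation_mode: {mode}")
    if mode == 0:
        return "skip validation"          # 0: disabled
    if error_count == 0:
        return "continue"                 # nothing to report
    if mode == 1:
        return "log errors and continue"  # 1: enabled with warnings
    return "abort startup"                # 2: enabled with exit

for mode in (0, 1, 2):
    print(mode, startup_action(mode, error_count=3))
```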