Known Issues History

NVIDIA UFM Enterprise User Manual v6.15.1

Ref #

Issue

N/A

Description: Running the UFM Fabric Health Report (via the UFM Web UI or REST API) triggers ibdiagnet to use the SLRG register, which might cause the firmware of some switches and HCAs to hang and leave the HCA ports in the "Init" state.

Keywords:

Discovered in Release: 6.14.0

3538640

Description: Fixed ALM plugin log rotate function.

Keywords: ALM, Plugin, Log rotate

Discovered in Release: 6.13.0

3532191

Description: Fixed UFM hanging (database is locked) after corrective restart of UFM health.

Keywords: Hanging, Database, Locked

Discovered in Release: 6.13.0

3555583

Description: Resolved REST API links' inability to return hostname for computer nodes.

Keywords: REST API, Links, Hostname, Computer Nodes

Discovered in Release: 6.12.1

3549795

Description: Fixed ufm_ha_cluster status to show DRBD sync status.

Keywords: ufm_ha_cluster, DRBD, Sync Status

Discovered in Release: 6.13.0

3549793

Description: Fixed UFM HA installation failure.

Keywords: HA, Installation

Discovered in Release: 6.13.0

3547517

Description: Fixed UFM logs REST API returning empty result when SM logs exist on the disk.

Keywords: Logs, SM logs, Empty

Discovered in Release: 6.11.0

3546178

Description: Fixed SHARP jobs failure when SHARP reservation feature is enabled.

Keywords: SHARP, Jobs, Reservation

Discovered in Release: 6.13.0

3541477

Description: Fixed UFM module temperature alerting on wrong thresholds.

Keywords: Module Temperature, Alert Threshold

Discovered in Release: 6.13.0

3191419

Description: Fixed UFM default session API returning port counter values as NULL.

Keywords: Null, Port Counter, Value, API

Discovered in Release: 6.9.0

3560659

Description: Fixed proper update in [MngNetwork] mtu_limit in gv.cfg when restarting UFM.

Keywords: mtu_limit, gv.cfg, Update, UFM restart

Discovered in Release: 6.13.1

3534374

Description: Fixed configure_ha_nodes.sh failure when deploying UFM 6.13.x HA on Ubuntu 22.04.

Keywords: configure_ha_nodes.sh, HA, Ubuntu22.04

Discovered in Release: 6.13.0

3496853

Description: Fixed daily report not being sent properly.

Keywords: Daily Report, Failure

Discovered in Release: 6.13.0

3469639

Description: Fixed REST RDMA server failure every couple of days, causing inability to retrieve ibdiagnet data.

Keywords: REST RDMA, ibdiagnet

Discovered in Release: 6.12.0

3455767

Description: Fixed incorrect combination of multiple devices in monitoring.

Keywords: Monitoring, Incorrect combination

Discovered in Release: 6.12.0

3511410

Description: Collecting a system dump for a DGX host does not work because the sshpass utility is missing.

Workaround: Install the sshpass utility on the DGX host.

Keywords: System Dump, DGX, sshpass utility

3432385

Description: UFM does not support HDR switch configured with hybrid split mode, where some of the ports are split and some are not.

Workaround: Configure either all or none of the HDR switch ports as split. UFM operates properly in both cases.

Keywords: HDR Switch, Ports, Hybrid Split Mode

3472330

Description: On bare-metal high availability (HA), when initiating a UFM system dump from either the master or standby node, the collection process will not include the HA dumps (pacemaker and DRBD).

Workaround: To extract the HA system dump on bare-metal, run the following command on the master or standby node:

/usr/bin/vsysinfo -S all -e -f /etc/ufm/ufm-ha-sysdump.conf -O /tmp/HA_sysdump

The extracted HA system dump is stored in /tmp/HA_sysdump.gz.tar.

Keywords: UFM System Dump, HA, Bare-Metal

3461658

Description: After the upgrade from UFM Enterprise v6.13.0 GA to UFM Enterprise v6.13.1 FUR, the network fast recovery path in opensm.conf is not automatically updated and remains with a null value (fast_recovery_conf_file (null)).

Workaround: To enable the network fast recovery feature in UFM, set the path of the current fast recovery configuration file (/opt/ufm/files/conf/opensm/fast_recovery.conf) in the opensm.conf file located at /opt/ufm/files/conf/opensm before starting UFM.

Keywords: Network fast recovery, Missing, Configuration

N/A

Description: Enabling a port on a managed switch fails if that port was not disabled persistently. This may occur for ports that were disabled in UFM versions prior to v6.12.0.

Workaround: Set "persistent_port_operation=false" in gv.cfg to use non-persistent (legacy) disabling and enabling of ports. A UFM restart is required.

Keywords: Disable, Enable, Port, Persistent

3346321

Description: Failover to another port (multi-port SM) does not work as expected when UFM is deployed as a Docker container.

Workaround: Failover to another port (multi-port SM) works properly on UFM bare-metal deployments.

Keywords: Failover to another port, Multi-port SM

3348587

Description: Replacement of defective nodes in the HA cluster does not work when the PCS version is 0.9.x.

Workaround: N/A

Keywords: Defected Node, HA Cluster, pcs version

3336769

Description: UFM-HA: In case the back-to-back interface is disabled or disconnected, the HA cluster will enter a split-brain state, and the "ufm_ha_cluster status" command will stop functioning properly.

Workaround: To resolve the issue:

  1. Connect or enable the back-to-back interface

  2. Run

    pcs cluster start --all

  3. Follow the instructions in Split-Brain Recovery in HA Installation.

Keywords: HA, Back-to-back Interface

3361160

Description: Upgrading UFM Enterprise from versions 6.8.0, 6.9.0 and 6.10.0 results in cleanup of UFM historical telemetry database (due to schema change). This means that the new telemetry data will be stored based on the new schema.

Workaround: To preserve the historical telemetry database data while upgrading from UFM version 6.8.0, 6.9.0 and 6.10.0, perform the upgrade in two phases. First, upgrade to UFM v6.11.0, and then upgrade to the latest UFM version (UFM v6.12.0 or newer). It is important to note that the upgrade process may take longer depending on the size of the historical telemetry database.
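The two-phase rule above can be expressed as a small planning helper. This is only an illustrative sketch: the function name plan_upgrade and the constants are hypothetical and not part of UFM, and it assumes the target version is v6.11.0 or newer.

```python
from typing import List

# Versions named in the issue description as affected by the schema change.
AFFECTED_VERSIONS = {"6.8.0", "6.9.0", "6.10.0"}
# Intermediate version named in the workaround.
INTERMEDIATE_VERSION = "6.11.0"


def plan_upgrade(current: str, target: str) -> List[str]:
    """Return the ordered list of versions to upgrade through.

    Hypothetical helper: an affected version goes through the
    intermediate version first, any other version upgrades directly.
    """
    if current in AFFECTED_VERSIONS and target != INTERMEDIATE_VERSION:
        return [INTERMEDIATE_VERSION, target]
    return [target]


if __name__ == "__main__":
    # Two-phase path for an affected version.
    print(plan_upgrade("6.9.0", "6.12.0"))
    # Direct path for a version that is not affected.
    print(plan_upgrade("6.11.0", "6.12.0"))
```

The helper only decides the order of the upgrade steps. It does not run any upgrade and does not estimate how long the database migration takes.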

Keywords: UFM Historical Telemetry Database, Cleanup, Upgrade

3346321

Description: In some cases, when multiport SM is configured in UFM, a failover to the secondary node might be triggered instead of failover to the local available port

Workaround: N/A

Keywords: Multiport SM, Failover, Secondary port

3240664

Description: This software release does not support upgrading the UFM Enterprise version from the latest GA version (v6.11.0). Upgrade is supported from UFM Enterprise v6.9.0 and v6.10.0.

Workaround: N/A

Keywords: UFM Upgrade

3242332

Description: Upgrading MLNX_OFED uninstalls UFM

Workaround: Upgrade UFM to a newer version (v6.11.0 or newer), then upgrade MLNX_OFED

Keywords: MLNX_OFED, Uninstall, UFM

3237353

Description: Upgrading from UFM v6.10 removes crucial MLNX_OFED packages.

Workaround: Reinstall MLNX_OFED/UFM

Keywords: MLNX_OFED, Upgrade, Packages

N/A

Description: Running UFM software with external UFM-SM is no longer supported

Workaround: N/A

Keywords: External UFM-SM

3144732

Description: By default, a managed Ubuntu 22 host will not be able to send system dump (sysdump) to a remote host as it does not include the sshpass utility.

Workaround: In order to allow the UFM to generate system dump from a managed Ubuntu 22 host, install the sshpass utility prior to system dump generation.

Keywords: Ubuntu 22, sysdump, sshpass

3129490

Description: HA uninstall procedure might get stuck on Ubuntu 20.04 due to multipath daemon running on the host.

Workaround: Stop the multipath daemon before running the HA uninstall script on Ubuntu 20.04.

Keywords: HA uninstall, multipath daemon, Ubuntu 20.04

3147196

Description: Running the upgrade procedure on bare metal Ubuntu 18.04 in HA mode might fail.

Workaround: For instructions on how to apply the upgrade for bare metal Ubuntu 18.04, refer to High Availability Upgrade for Ubuntu 18.04.

Keywords: Upgrade, Ubuntu 18.04, Docker Container, failure

3145058

Description: Running upgrade procedure on UFM Docker Container in HA mode might fail.

Workaround: For instructions on how to apply the upgrade for UFM Docker Container in HA, refer to Upgrade Container Procedure.

Keywords: Upgrade, Docker Container, failure

3061449

Description: Upon UFM upgrade, all telemetry configurations are overridden by the telemetry configuration of the new UFM version.

Workaround: If the telemetry configuration was set manually, set it up again after upgrading UFM for the changes to take effect.

Apply the manual telemetry configuration to the following file right after the UFM upgrade: /opt/ufm/conf/telemetry_defaults/launch_ibdiagnet_config.ini.

Keywords: Telemetry, configuration, upgrade, override.

3053455

Description: UFM “Set Node Description” action for unmanaged switches is not supported for Ubuntu18 deployments

Workaround: N/A

Keywords: Set Node Description, Ubuntu18

3053455

Description: UFM Installations are not supported on RHEL8.X or CentOS8.X

Workaround: N/A

Keywords: Install, RHEL8, CentOS8

3052660

Description: UFM monitoring mode does not work.

Workaround: To run UFM in monitoring mode, edit the telemetry configuration file /opt/ufm/conf/telemetry_defaults/launch_ibdiagnet_config.ini:

Search for arg_12 and set it to an empty value: arg_12=

Before starting UFM, also set monitoring_mode = yes in gv.cfg. UFM then runs in monitoring mode after it is restarted.
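The arg_12 edit can also be scripted. The following is a minimal sketch, assuming the file stores the setting as a plain "arg_12=value" line. The function name clear_arg and the sample values are hypothetical, and the sketch works on the file text only so it makes no assumption about section names.

```python
import re


def clear_arg(config_text: str, key: str = "arg_12") -> str:
    """Return config_text with `key` set to an empty value.

    Only lines of the form "<key>=<anything>" are changed; every
    other line is returned untouched.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=).*$", re.MULTILINE
    )
    return pattern.sub(r"\1", config_text)


if __name__ == "__main__":
    # Hypothetical sample content; real values in the file will differ.
    sample = "arg_11=--example_a\narg_12=--example_b value\narg_13=--example_c\n"
    print(clear_arg(sample))

    # To apply the change to the real file (back it up first):
    # path = "/opt/ufm/conf/telemetry_defaults/launch_ibdiagnet_config.ini"
    # with open(path) as f:
    #     text = f.read()
    # with open(path, "w") as f:
    #     f.write(clear_arg(text))
```

Keys that merely start with the same characters, such as arg_120, are not affected because the pattern requires the equals sign directly after the key.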

Keywords: Monitoring, mode

3054340

Description: Setting a non-existent log directory causes UFM to fail to start.

Workaround: Make sure to set a valid (existing) log directory when setting this parameter (gv.cfg → log_dir).
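A pre-flight check along these lines can catch an invalid directory before UFM is started. This is a sketch only: the function name log_dir_is_valid is hypothetical, and the write-permission test is an added assumption beyond the "existing directory" requirement stated above.

```python
import os


def log_dir_is_valid(log_dir: str) -> bool:
    """Return True if log_dir is an existing, writable directory.

    Hypothetical pre-flight check; the caller supplies the value
    that is about to be written to log_dir in gv.cfg.
    """
    return (
        bool(log_dir)
        and os.path.isdir(log_dir)
        and os.access(log_dir, os.W_OK)
    )


if __name__ == "__main__":
    # Example path chosen for illustration only.
    candidate = "/var/log"
    print(candidate, "is valid:", log_dir_is_valid(candidate))
```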

Keywords: Log, Dir, fail, start

-

Description: Restoring HA standby node and configuring UFM HA with external UFM-Subnet Managers are not supported on Ubuntu bare-metal deployments

Workaround: N/A

Keywords: HA standby node, bare-metal

2887364

Description: After upgrading to UFM v6.8, if UFM fails over to the secondary node, getting cable information for a selected port fails.

Workaround: On the secondary UFM node, copy the following files to the /usr/bin/ folder:

  • /usr/flint

  • /usr/flint_ext

  • /usr/mlxcables

  • /usr/mlxcables_ext

  • /usr/mlxlink

  • /usr/mlxlink_ext

Getting cable information on the secondary UFM node should now work.
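The copy step can be scripted as follows. This is a minimal sketch, assuming it runs with sufficient privileges on the secondary node. The function name copy_tools is hypothetical, and the file list is taken from the workaround above.

```python
import shutil
from pathlib import Path

# Source files listed in the workaround.
TOOL_PATHS = [
    "/usr/flint",
    "/usr/flint_ext",
    "/usr/mlxcables",
    "/usr/mlxcables_ext",
    "/usr/mlxlink",
    "/usr/mlxlink_ext",
]


def copy_tools(sources, dest_dir="/usr/bin"):
    """Copy each existing source file into dest_dir.

    Returns (copied, missing): names of the files that were copied and
    paths of the sources that were not found. copy2 keeps the
    permission bits, so executables stay executable.
    """
    dest = Path(dest_dir)
    copied, missing = [], []
    for src in map(Path, sources):
        if src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
        else:
            missing.append(str(src))
    return copied, missing


if __name__ == "__main__":
    done, not_found = copy_tools(TOOL_PATHS)
    print("copied:", done)
    print("missing:", not_found)
```

Reporting missing sources instead of raising an error lets the script finish the remaining copies and makes it clear which tools still need attention.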

Keywords: upgrade, failover, cable information

2784560

Description: Intentionally stopping and restarting the master container, or rebooting the master server, breaks the HA failover option.

Workaround: Manually restart the UFM cluster.

Keywords: UFM Container; Reboot, Failover

2872513

Description: After rebooting the master container, failover is triggered twice (once to the standby container and then back again to the master container).

Workaround: N/A

Keywords: UFM Container, reboot, failover

2863388

Description: Getting cable information for an NDR split port fails.

Workaround: N/A

Keywords: Cable, NDR, Split

N/A

Description: When SM mkey per port is used, several UFM operations might fail (get cable info, get system dump, switch FW upgrade).

Workaround: N/A

Keywords: SM, mkey per port

2702950

Description: An internet connection is required to download and install SQLite on the old container during the software upgrade process.

Workaround: N/A

Keywords: Container; upgrade

2694977

Description: Adding a large number of devices (~1000) to a group or a logical server on a large-scale setup takes ~2 minutes.

Workaround: N/A

Keywords: Add device; group; logical server; large scale

2710613

Description: Periodic topology compare will not report removed nodes if the last topology change included only removed nodes.

Workaround: N/A

Keywords: Topology comparison

2698055

Description: UFM, configured to work with telemetry for collecting historical data, is limited to work only with the configured HCA port. If this port is part of a bond interface and a failure occurs on the port, collection of telemetry data via this port stops.

Workaround: Reconfigure telemetry with the new active port and restart it within UFM.

Keywords: Telemetry; history; bond; failure

2705974

Description: If new ports are added after UFM startup, the default session REST API (GET /ufmRest/monitoring/session/0/data) will not include port statistics for the newly added ports.

Workaround: Reset the main UFM.

  • For UFM standalone – /etc/init.d/ufmd model_restart

  • For UFM HA – /etc/init.d/ufmha model_restart
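To check whether a newly added port appears in the default session data after the restart, the endpoint named in the description can be queried. The sketch below only builds the request. The host name and credentials are placeholders, HTTPS with HTTP basic authentication is an assumption about the deployment, and the function name build_session_data_request is hypothetical. The response format is not described here, so no parsing is shown.

```python
import base64
import urllib.request

# Endpoint path taken from the issue description.
SESSION_DATA_PATH = "/ufmRest/monitoring/session/0/data"


def build_session_data_request(host, username, password):
    """Build a GET request for the default session data.

    Assumes HTTPS and HTTP basic authentication; adjust both to
    match the actual UFM deployment.
    """
    url = f"https://{host}{SESSION_DATA_PATH}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder host and credentials for illustration only.
    req = build_session_data_request("ufm.example.com", "admin", "secret")
    print(req.get_method(), req.full_url)
    # Sending the request requires a reachable UFM server:
    # with urllib.request.urlopen(req) as resp:
    #     body = resp.read()
```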

Keywords: Default session; REST API; missing ports

2714738

Description: Intentionally stopping and restarting the master container, or rebooting the master server, breaks the HA failover option.

Workaround: Manually restart the UFM cluster.

Keywords: UFM Container; Reboot, Failover

Description: UFM, when configured to work with telemetry for collecting historical data, is limited to the configured HCA port. If this port is part of a bond interface and a failure occurs on it, collection of telemetry data via this port stops.

Workaround: If the historical telemetry port is part of a bond and a failure occurs, reconfigure telemetry with a new active port and restart it within UFM.

Keywords: telemetry, history, bond, failure

Discovered in release: 6.7

2459320

Description: Docker upgrade to UFM6.6.1 from UFM6.6.0 is not supported.

Workaround: N/A

Keywords: Docker; upgrade

Discovered in release: 6.6.1

-

Description: SHARP Aggregation Manager over UCX is not supported.

Workaround: N/A

Keywords: UCX; SHARP AM

Discovered in release: 6.6.1

2288038

Description: When the user tries to collect a system dump for a UFM Appliance host, the job completes with an error and the following summary: "Running as a none root user Please switch to root user (super user) and run again."

Workaround: N/A

Keywords: System dump, UFM Appliance host

Discovered in release: 6.5.2

2100564

Description: For modular dual-management switch systems, switch information is not presented correctly if the primary management module fails and the secondary takes over.

Workaround: To avoid corrupted switch information, it is recommended to manually set the virtual IP address (box IP address) for the switch as the managed switch IP address (manual IP address) within UFM.

Keywords: Modular switch, dual-management, virtual IP, box IP

Discovered in release: 6.4.1

2135272

Description: UFM does not support hosts equipped with multiple HCAs of different types (e.g. a host with ConnectX®-3 and ConnectX-4/5/6) if multi-NIC grouping is enabled (i.e. multinic_host_enabled = true).

Workaround: All managed hosts must contain HCAs of the same type (either ConnectX-3 HCAs or ConnectX-4/5/6 HCAs).

Keywords: Multiple HCAs

Discovered in release: 6.4.1

2063266

Description: Firmware upgrade for managed hosts with multiple HCAs is not supported. That is, it is not possible to perform FW upgrade for a specific host HCA.

Workaround: Running software (MLNX_OFED) upgrade on that host will automatically upgrade all the HCAs on this host with the firmware bundled as part of this software package.

Keywords: FW upgrade, multiple HCAs

Discovered in release: 6.4.1

-

Description: Management PKey configuration (e.g. MTU, SL) can be performed only using PKey management interface (via GUI or REST API).

Workaround: N/A

Keywords: PKey, Management PKey, REST API

Discovered in release: 6.4

2092885

Description: UFM Agent is not supported for SLES15 and RHEL8/CentOS8.

Workaround: N/A

Keywords: UFM Agent

Discovered in release: 6.4

-

Description: CentOS 8.0 does not support IPv6.

Workaround: N/A

Keywords: IPv6

Discovered in release: 6.4

1895385

Description: Changes to QoS parameters (mtu, sl and rate_limit) do not take effect unless OpenSM is restarted.

Workaround: N/A

Keywords: QoS, PKey, OpenSM

Discovered in release: 6.3

-

Description: Logical Server Auditing feature is supported on RedHat 7.x operating systems only.

Workaround: N/A

Keywords: Logical Server, auditing, OS

Discovered in release: 5.9

-

Description: Changing the configuration from lossy to lossless requires a device reset.

Workaround: Reboot all relevant devices after changing behavior from lossy to lossless.

Keywords: Lossy configuration

© Copyright 2023, NVIDIA. Last updated on Dec 19, 2023.