NVIDIA UFM Enterprise Appliance Software User Manual v1.8.2

Known Issue History

Ref #

Issue

3773902

Description: In congestion control, the cc-policy.conf file remains unchanged after the container version is upgraded (when the user has made no changes to it).

Keywords: Congestion Control, cc-policy.conf, Upgrade, Container

Workaround: On the host, run the command:

docker exec -it ufm cp /opt/ufm/skeleton/conf/opensm/cc-policy.conf /opt/ufm/files/conf/opensm/cc-policy.conf

Discovered in Release: 1.7.0

3775405

Description: After UFM startup, an empty temporary folder is created in the /tmp folder every 10 minutes (due to the periodic telemetry status check).

Keywords: Empty folder, temporary, /tmp

Workaround: Change instances_sessions_compatibility_interval parameter in gv.cfg to 30 minutes
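The change above can be scripted. The following is a minimal sketch, not an official procedure: the helper name set_cfg_key is hypothetical, and the gv.cfg location /opt/ufm/files/conf/gv.cfg is an assumption based on the other paths in this list. The source does not state the unit or format of the value, so check the existing entry before choosing the new one.

```shell
#!/bin/sh
# set_cfg_key FILE KEY VALUE  (hypothetical helper, not part of UFM)
# Rewrites an existing "KEY = VALUE" line in an INI-style file and keeps
# the previous version as FILE.bak. Returns non-zero if KEY is absent, so
# the key is never appended to the wrong section by accident.
# VALUE must not contain "|" or "&", which are special to this sed call.
set_cfg_key() {
    file=$1 key=$2 value=$3
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" || return 1
    cp -p "$file" "$file.bak"
    sed -E "s|^([[:space:]]*${key}[[:space:]]*=[[:space:]]*).*|\1${value}|" \
        "$file.bak" > "$file"
}

# Example call (path and value format are assumptions, see above):
# set_cfg_key /opt/ufm/files/conf/gv.cfg instances_sessions_compatibility_interval 30
```

Refusing to add a missing key is a deliberate choice: gv.cfg is organized in sections, and a blind append could place the parameter where UFM does not read it.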

Discovered in Release: v1.6.0

3560659

Description: Modifying the mtu_limit parameter for [MngNetwork] in gv.cfg does not accurately reflect changes upon restarting UFM.

Keywords: mtu_limit, MngNetwork, gv.cfg, UFM restart

Workaround: UFM needs to be restarted twice in order for the changes to take effect.

Discovered in Release: v1.6.0

3729822

Description: The Logs API temporarily returns an empty response when the SM log file contains messages from both the previous year (2023) and the current year (2024).

Keywords: Logs API, Empty response, Logs file

Workaround: N/A (the issue resolves automatically once the problematic SM log file, which includes messages from both 2023 and 2024, is rotated)

Discovered in Release: v1.6.0

3699419

Description: After remanufacturing the UFM Enterprise Appliance from an ISO file as described in Appendix - Deploying UFM Appliance from an ISO File, the HA services fail to start when the host is rebooted or power cycled in High-Availability (HA) mode.

Workaround: In the UFM Enterprise Appliance OS shell, run crontab -e and change the following entry:

@reboot /usr/sbin/netplan apply

to:

@reboot sleep 240 && /sbin/ip link set up dev idrac
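The same edit can be applied without an interactive editor. This is a sketch only: the function name rewrite_reboot_entry is hypothetical, and it assumes the original entry appears in the crontab exactly as shown above, with no extra whitespace.

```shell
#!/bin/sh
# rewrite_reboot_entry (hypothetical helper): reads a crontab on stdin and
# writes it to stdout with the @reboot entry replaced as described above.
# All other lines pass through unchanged. "&" is escaped because sed would
# otherwise expand it to the matched text.
rewrite_reboot_entry() {
    sed 's|^@reboot /usr/sbin/netplan apply$|@reboot sleep 240 \&\& /sbin/ip link set up dev idrac|'
}

# Typical use: dump the current crontab, rewrite it, and load the result.
# crontab -l | rewrite_reboot_entry | crontab -
```

Reviewing the output of crontab -l | rewrite_reboot_entry before piping it into crontab - is a reasonable precaution, since the pattern silently leaves the crontab unchanged if the entry differs from the expected text.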

Keywords: Reboot; HA; Power Cycle

Discovered in Release: 1.6.0

N/A

Description: Running the UFM Fabric Health Report (via the UFM Web UI or REST API) triggers ibdiagnet to use the SLRG register, which might cause the firmware of some switches and HCAs to get stuck and leave the HCA ports in the "Init" state.

Keywords: UFM Fabric Health Report; SLRG; Stuckness

Discovered in Release: 1.5.0

3511410

Description: Collecting a system dump for a DGX host does not work because the sshpass utility is missing.

Workaround: Install the sshpass utility on the DGX.

Keywords: System Dump, DGX, sshpass utility

3432385

Description: UFM does not support an HDR switch configured in hybrid split mode, where some of the ports are split and some are not.

Workaround: UFM operates properly when either all or none of the HDR switch ports are configured as split.

Keywords: HDR Switch, Ports, Hybrid Split Mode

3461658

Description: After the upgrade from UFM Enterprise Appliance v1.4.0 GA to UFM Enterprise Appliance v1.4.1 FUR, the network fast recovery path in opensm.conf is not automatically updated and remains with a null value (fast_recovery_conf_file (null)).

Workaround: To enable the network fast recovery feature in UFM, set the path of the current fast recovery configuration file (/opt/ufm/files/conf/opensm/fast_recovery.conf) in the opensm.conf file located at /opt/ufm/files/conf/opensm before starting UFM.

Keywords: Network fast recovery, Missing, Configuration
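A minimal sketch of that edit follows. The function name set_fast_recovery_path is hypothetical, and the sketch assumes the option is stored as a single "fast_recovery_conf_file <value>" line, as the null entry quoted in the description suggests.

```shell
#!/bin/sh
# set_fast_recovery_path CONF PATH  (hypothetical helper, not part of UFM)
# Replaces the value of an existing fast_recovery_conf_file line in an
# opensm.conf-style file and keeps the previous version as CONF.bak.
# Returns non-zero if the option line is not present.
set_fast_recovery_path() {
    conf=$1 path=$2
    grep -Eq '^fast_recovery_conf_file[[:space:]]' "$conf" || return 1
    cp -p "$conf" "$conf.bak"
    sed -E "s|^(fast_recovery_conf_file)[[:space:]].*|\1 ${path}|" \
        "$conf.bak" > "$conf"
}

# Example call, using the paths given in the workaround:
# set_fast_recovery_path /opt/ufm/files/conf/opensm/opensm.conf \
#     /opt/ufm/files/conf/opensm/fast_recovery.conf
```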

N/A

Description: Upgrading the UFM Enterprise Appliance SW while upgrading the UFM Enterprise Appliance OS is not supported.

Workaround: Do not use the --appliance-sw-upgrade flag while upgrading the UFM Enterprise Appliance OS. Instead, upgrade the UFM Enterprise Appliance SW separately, as described in Software Upgrade.

Keywords: SW Upgrade; OS Upgrade, --appliance-sw-upgrade

3473600

Description: The UFM Enterprise service becomes enabled when the UFM Enterprise Appliance SW is upgraded in HA mode.

Workaround: Disable the UFM Enterprise service after the upgrade in HA mode by running the following command:

systemctl disable ufm-enterprise.service

Keywords: SW Upgrade, HA Mode

3361160

Description: Upgrading UFM Enterprise Appliance from versions 1.3.0, 1.2.0 and 1.1.0 results in cleanup of UFM historical telemetry database (due to schema change). This means that the new telemetry data will be stored based on the new schema.

Workaround: To preserve the historical telemetry database data while upgrading from UFM Enterprise Appliance version 1.3.0, 1.2.0 and 1.1.0, perform the upgrade in two phases. First, upgrade to UFM Enterprise Appliance v1.2.0, and then upgrade to the latest UFM version (UFM v1.3.0 or newer). It is important to note that the upgrade process may take longer depending on the size of the historical telemetry database.

Keywords: UFM Historical Telemetry Database, Cleanup, Upgrade

3346321

Description: In some cases, when multiport SM is configured in UFM, a failover to the secondary node might be triggered instead of a failover to the locally available port.

Workaround: N/A

Keywords: Multiport SM, Failover, Secondary port

N/A

Description: Enabling a port on a managed switch fails if that port was not disabled in a persistent way (this may occur in ports that were disabled in previous versions of UFM Enterprise Appliance v1.3.0).

Workaround: Set "persistent_port_operation=false" in gv.cfg to use non-persistent (legacy) disabling or enabling of the port. A UFM restart is required.

Keywords: Disable, Enable, Port, Persistent

3346321

Description: Failover to another port (multi-port SM) does not work as expected when UFM is deployed as a Docker container.

Workaround: Failover to another port (multi-port SM) works properly on UFM bare-metal deployments.

Keywords: Failover to another port, Multi-port SM

348587

Description: Replacement of defective nodes in the HA cluster does not work when the PCS version is 0.9.x.

Workaround: N/A
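To see whether a node falls into the affected range, the installed PCS version can be compared against the 0.9.x series. This is a sketch: the function name pcs_is_affected is hypothetical, and it assumes that pcs --version prints a bare version string such as 0.9.169.

```shell
#!/bin/sh
# pcs_is_affected VERSION  (hypothetical helper)
# Succeeds (exit status 0) only when VERSION belongs to the 0.9.x series.
pcs_is_affected() {
    case $1 in
        0.9.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Typical use on a cluster node:
# if pcs_is_affected "$(pcs --version)"; then
#     echo "PCS 0.9.x detected: node replacement is affected by this issue"
# fi
```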

Keywords: Defected Node, HA Cluster, pcs version

3336769

Description: UFM-HA: If the back-to-back interface is disabled or disconnected, the HA cluster will enter a split-brain state, and the "ufm_ha_cluster status" command will stop functioning properly.

Workaround: To resolve the issue:

  1. Connect or enable the back-to-back interface

  2. Run

    pcs cluster start --all

  3. Follow instructions in Split-Brain Recovery in HA Installation.

Keywords: HA, Back-to-back Interface

N/A

Description: Running UFM software with an external UFM-SM is no longer supported.

Workaround: N/A

Keywords: External UFM-SM

© Copyright 2024, NVIDIA. Last updated on Jun 24, 2024.