NVIDIA UFM Enterprise User Manual v6.14.1
Devices Window

The Devices window shows data pertaining to the physical devices in a tabular format.

devices-window.JPG

Devices Window Data

Data Type

Description

Health

Health of the device reflecting the highest alarm severity. Please refer to the Health States table.

Name

Name of the device

Warning

If UFM Agent is running on a device, the following icon will appear next to the device name:

image2019-6-20_12-15-36.png

GUID

System GUID of the device

Type

Type of the device: switch, node, IB router, or gateway

IP

IP address of the device

Vendor

The vendor of the device

Firmware Version

The firmware version installed on the device

Health States

Icon

Name

Description

image2019-6-16_11-40-21.png

Normal

Information/notification displayed during normal operating state or a normal system event.

image2019-6-16_11-40-26.png

Critical

Critical means that the operation of the system or a system component fails.

image2019-6-16_11-40-32.png

Minor

Minor reflects a problem in the fabric with no failure.

image2019-6-16_11-40-37.png

Warning

Warning reflects a low priority problem in the fabric with no failure. A warning is asserted when an event exceeds a predefined threshold.
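Because the Health column reflects the highest alarm severity, it can be thought of as a maximum over a device's active alarms. The following is a minimal Python sketch of that idea. The ranking Normal < Warning < Minor < Critical and the `device_health` helper are illustrative assumptions, not part of any UFM API.

```python
# Assumed severity ranking, lowest to highest. This ordering is an
# illustration only and is not taken from UFM itself.
SEVERITY_RANK = {"Normal": 0, "Warning": 1, "Minor": 2, "Critical": 3}


def device_health(alarm_severities):
    """Return the highest severity among the alarms, or Normal if there are none."""
    return max(alarm_severities, key=SEVERITY_RANK.__getitem__, default="Normal")


print(device_health(["Warning", "Critical", "Minor"]))
print(device_health([]))
```

A device with no alarms falls back to Normal, which matches the idea of a normal operating state.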

A right-click on the device name displays a list of actions that can be performed on it.

devices-window2.png

Devices Actions

Action

Description

Firmware Upgrade

Perform a firmware upgrade on the selected device

Firmware Reset

Reboot the device. This action is only applicable to unmanaged hosts (servers).

Set Node Description

Configure a description for this node

Collect System Dump

Collect the system dump log for a specific device

Add to Group

Add the selected device to a devices group

Remove from Group

Remove the selected device from a devices group

Suppress Notifications

Suppress all event notifications for the device

Add to Monitor Session

Configure and activate host monitoring

Show in Network Map

Switch to the Zoom In tab in the network map and add the selected device to the filter list

Warning

Collecting a system dump for hosts managed by UFM is available only for hosts that have a valid IPv4 address and have MLNX_OFED installed.

From the Devices table, it is possible to mark devices as healthy or unhealthy using the context menu (right-click).

There are two options for marking a device as unhealthy:

  • Isolate

  • No Discover

device-unhealthy.JPG

image2022-4-28_22-6-52.png

Server: conf/opensm/opensm-health-policy.conf content:

0xe41d2d030003e3b0 34 UNHEALTHY isolate 0xe41d2d030003e3b0 19 UNHEALTHY isolate 0xe41d2d030003e3b0 3 UNHEALTHY isolate 0xe41d2d030003e3b0 26 UNHEALTHY isolate 0xe41d2d030003e3b0 0 UNHEALTHY isolate 0xe41d2d030003e3b0 27 UNHEALTHY isolate 0xe41d2d030003e3b0 7 UNHEALTHY isolate 0xe41d2d030003e3b0 10 UNHEALTHY isolate 0xe41d2d030003e3b0 11 UNHEALTHY isolate 0xe41d2d030003e3b0 22 UNHEALTHY isolate 0xe41d2d030003e3b0 18 UNHEALTHY isolate 0xe41d2d030003e3b0 29 UNHEALTHY isolate 0xe41d2d030003e3b0 8 UNHEALTHY isolate 0xe41d2d030003e3b0 5 UNHEALTHY isolate 0xe41d2d030003e3b0 17 UNHEALTHY isolate 0xe41d2d030003e3b0 23 UNHEALTHY isolate 0xe41d2d030003e3b0 15 UNHEALTHY isolate 0xe41d2d030003e3b0 24 UNHEALTHY isolate 0xe41d2d030003e3b0 2 UNHEALTHY isolate 0xe41d2d030003e3b0 16 UNHEALTHY isolate 0xe41d2d030003e3b0 13 UNHEALTHY isolate 0xe41d2d030003e3b0 14 UNHEALTHY isolate 0xe41d2d030003e3b0 32 UNHEALTHY isolate 0xe41d2d030003e3b0 33 UNHEALTHY isolate 0xe41d2d030003e3b0 35 UNHEALTHY isolate 0xe41d2d030003e3b0 20 UNHEALTHY isolate 0xe41d2d030003e3b0 21 UNHEALTHY isolate 0xe41d2d030003e3b0 28 UNHEALTHY isolate 0xe41d2d030003e3b0 1 UNHEALTHY isolate 0xe41d2d030003e3b0 9 UNHEALTHY isolate 0xe41d2d030003e3b0 4 UNHEALTHY isolate 0xe41d2d030003e3b0 31 UNHEALTHY isolate 0xe41d2d030003e3b0 30 UNHEALTHY isolate 0xe41d2d030003e3b0 36 UNHEALTHY isolate 0xe41d2d030003e3b0 12 UNHEALTHY isolate 0xe41d2d030003e3b0 25 UNHEALTHY isolate 0xe41d2d030003e3b0 6 UNHEALTHY isolate

/opt/ufm/files/log/opensm-unhealthy-ports.dump content:

image2021-11-26_16-31-54.png

device-healthy.JPG

Server /opt/ufm/files/conf/opensm/opensm-health-policy.conf content:

0xe41d2d030003e3b0 15 HEALTHY 0xe41d2d030003e3b0 25 HEALTHY 0xe41d2d030003e3b0 35 HEALTHY 0xe41d2d030003e3b0 0 HEALTHY 0xe41d2d030003e3b0 11 HEALTHY 0xe41d2d030003e3b0 21 HEALTHY 0xe41d2d030003e3b0 28 HEALTHY 0xe41d2d030003e3b0 7 HEALTHY 0xe41d2d030003e3b0 17 HEALTHY 0xe41d2d030003e3b0 14 HEALTHY 0xe41d2d030003e3b0 24 HEALTHY 0xe41d2d030003e3b0 34 HEALTHY 0xe41d2d030003e3b0 3 HEALTHY 0xe41d2d030003e3b0 10 HEALTHY 0xe41d2d030003e3b0 20 HEALTHY 0xe41d2d030003e3b0 31 HEALTHY 0xe41d2d030003e3b0 6 HEALTHY 0xe41d2d030003e3b0 16 HEALTHY 0xe41d2d030003e3b0 27 HEALTHY 0xe41d2d030003e3b0 2 HEALTHY 0xe41d2d030003e3b0 13 HEALTHY 0xe41d2d030003e3b0 23 HEALTHY 0xe41d2d030003e3b0 33 HEALTHY 0xe41d2d030003e3b0 30 HEALTHY 0xe41d2d030003e3b0 9 HEALTHY 0xe41d2d030003e3b0 19 HEALTHY 0xe41d2d030003e3b0 26 HEALTHY 0xe41d2d030003e3b0 36 HEALTHY 0xe41d2d030003e3b0 5 HEALTHY 0xe41d2d030003e3b0 12 HEALTHY 0xe41d2d030003e3b0 22 HEALTHY 0xe41d2d030003e3b0 32 HEALTHY 0xe41d2d030003e3b0 1 HEALTHY 0xe41d2d030003e3b0 8 HEALTHY 0xe41d2d030003e3b0 18 HEALTHY 0xe41d2d030003e3b0 29 HEALTHY 0xe41d2d030003e3b0 4 HEALTHY
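Each entry in the policy file listings above consists of a node GUID, a port number, a state, and (for UNHEALTHY entries) an action. As a rough sketch, such content could be read into records with a few lines of Python. The `PolicyEntry` type and `parse_health_policy` function are hypothetical names introduced here for illustration, and the parser only assumes the token layout visible in the listings.

```python
from collections import namedtuple

# Hypothetical record type for one policy entry.
PolicyEntry = namedtuple("PolicyEntry", "guid port state action")


def parse_health_policy(text):
    """Parse whitespace-separated policy tokens into PolicyEntry records."""
    tokens = text.split()
    entries = []
    i = 0
    while i < len(tokens):
        guid, port, state = tokens[i], int(tokens[i + 1]), tokens[i + 2]
        i += 3
        action = None
        # In the listings above, only UNHEALTHY entries carry an action token.
        if state == "UNHEALTHY" and i < len(tokens) and not tokens[i].startswith("0x"):
            action = tokens[i]
            i += 1
        entries.append(PolicyEntry(guid, port, state, action))
    return entries


sample = "0xe41d2d030003e3b0 34 UNHEALTHY isolate\n0xe41d2d030003e3b0 15 HEALTHY\n"
for entry in parse_health_policy(sample):
    print(entry)
```

Working on tokens rather than lines means the sketch does not depend on how the entries are wrapped in the file.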

/opt/ufm/files/log/opensm-unhealthy-ports.dump content:

# NodeGUID, PortNum, NodeDesc, PeerNodeGUID, PeerPortNum, PeerNodeDesc, {BadCond1, BadCond2, ...}, timestamp

Software/Firmware Upgrade via FTP

Software and firmware upgrade over FTP is enabled by the UFM Agent. UFM invokes the Software/Firmware Upgrade procedure locally on switches or on hosts. The procedure copies the new software/firmware file from the defined storage location and performs the operation on the device. UFM sends the set of attributes required for performing the software/firmware upgrade to the agent.

The attributes are:

  • File Transfer Protocol – default FTP

    • The Software/Firmware upgrade on InfiniScale III ASIC-based switches supports FTP protocol for transmitting files to the local machine.

    • The Software/Firmware upgrade on InfiniScale IV-based switches and hosts supports TFTP and protocols for transmitting files to the local machine.

  • IP address of file-storage server

  • Path to the software/firmware image location
    The software/firmware image files should be placed according to the required structure under the defined image storage location. Please refer to section Directory Structure for Software or Firmware Upgrade Over FTP.

  • File-storage server access credentials (User/Password)

In-Band Firmware Upgrade

You can perform in-band firmware upgrades for externally managed switches and HCAs. This upgrade procedure does not require the UFM Agent or IP connectivity, but it does require current PSID recognition. Please refer to section PSID and Firmware Version In-Band Discovery. This feature requires that the Mellanox Firmware Toolkit (MFT), which is included in the UFM package, is installed on the UFM server. UFM uses flint from the MFT for in-band firmware burning.

Before upgrading, you must create the firmware repository on the UFM server under the directory /opt/ufm/files/userdata/fw/. Create a subdirectory for each PSID and place one firmware image in it. For example:

/opt/ufm/files/userdata/fw/
    MT_0D80110009
        fw-ConnectX2-rel-2_9_1000-MHQH29B-XTR_A1.bin
    MT_0F90110002
        fw-IS4-rel-7_4_2040-MIS5023Q_A1-A5.bin
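Since each PSID subdirectory is expected to hold exactly one firmware image, it can be useful to check the repository before starting an upgrade. The following Python sketch does that. The `check_fw_repository` helper is a hypothetical name, not a UFM tool, and it treats any regular file in a PSID directory as an image.

```python
from pathlib import Path


def check_fw_repository(root):
    """Return {psid: problem} for PSID directories that do not hold exactly one file."""
    problems = {}
    for psid_dir in sorted(Path(root).iterdir()):
        if not psid_dir.is_dir():
            continue  # ignore stray files at the top level
        images = [p for p in psid_dir.iterdir() if p.is_file()]
        if len(images) != 1:
            problems[psid_dir.name] = f"expected 1 image, found {len(images)}"
    return problems
```

On the UFM server this could be called as `check_fw_repository("/opt/ufm/files/userdata/fw/")`. An empty dictionary means every PSID directory holds a single image.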


Directory Structure for Software or Firmware Upgrade Over FTP

Before performing a software or firmware upgrade, you must create the following directory structure for the upgrade image. The path to the <ftp user home>/<path>/ directory should be specified in the upgrade dialog box.

<ftp user home>/<path>/
    InfiniScale3 - For anafa based switches Software/Firmware upgrade images
        voltaire_fw_images.tar – firmware image file
        ibswmpr-<s/w version>.tar – software image file
    InfiniScale4 - For InfiniScale IV based switches Software/Firmware upgrade images
        firmware_2036_4036.tar – Firmware image file
        upgrade_2036_4036.tgz – Software image file
    OFED /* For host SW upgrade*/
        OFED-<OS label>.tar.bz2
    <PSID>* – For host FW upgrade
        fw_update.img

The <PSID> value is extracted from the mstflint command:

mstflint -d <device> q

The device is extracted from the lspci command. For example:

# lspci
06:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
# mstflint -d 06:00.0 q | grep PSID
PSID: VLT0040010001
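When the PSID is needed in a script (for example, to name the per-PSID directory), the `grep PSID` step above can be reproduced in Python. This is only a sketch: `extract_psid` is a hypothetical helper, and it assumes the query output contains a line of the form `PSID: <value>` as in the example above. The output text itself would have to be captured separately, for instance by running `mstflint -d <device> q` on the host.

```python
import re


def extract_psid(query_output):
    """Return the PSID value from `mstflint -d <device> q` output, or None if absent."""
    match = re.search(r"^\s*PSID:\s*(\S+)", query_output, flags=re.MULTILINE)
    return match.group(1) if match else None


print(extract_psid("PSID: VLT0040010001"))
```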


PSID and Firmware Version In-Band Discovery

The device PSID and device firmware version are required for in-band firmware upgrade and for the correct functioning of Subnet Manager plugins, such as Congestion Control Manager and Lossy Configuration Management. For most devices, UFM discovers this information and displays it in the Device Properties pane. The PSID and the firmware version are discovered by the Vendor-specific MAD.

By default, the gv.cfg file value for event_plugin_options is set to (null). This means that the plugin is disabled and opensm does not send MADs to discover the devices' PSID and FW version. Therefore, values for the devices' PSID and FW version are taken from the ibdiagnet output (section NODES_INFO).

The below is an example of the default value:

event_plugin_options = (null)

To enable vendor-specific discovery by opensm, change the value of event_plugin_options in the gv.cfg configuration file to (--vendinfo -m 1), as shown below:

event_plugin_options = --vendinfo -m 1

If the value is set to --vendinfo -m 1, the data is supplied by opensm, and the ibdiagnet output is ignored.
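To confirm which source UFM will use for PSID and firmware version data, the setting can be inspected programmatically. The sketch below uses Python's standard `configparser` and searches every section for the key, because this manual does not state which gv.cfg section holds it. The `[SubnetManager]` section name in the sample text and the `vendinfo_enabled` helper are assumptions made for illustration.

```python
import configparser


def vendinfo_enabled(cfg_text):
    """Return True if event_plugin_options in the given INI text enables --vendinfo."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    for section in parser.sections():
        value = parser.get(section, "event_plugin_options", fallback=None)
        if value is not None:
            return "--vendinfo" in value.split()
    return False


# Sample text only; the section name is a placeholder, not taken from this manual.
sample = "[SubnetManager]\nevent_plugin_options = (null)\n"
print(vendinfo_enabled(sample))
```

A result of False corresponds to the default case, in which the values come from the ibdiagnet output.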

Warning

In some firmware versions, the information above is currently not available.


Switch Management IP Address Discovery

From NVIDIA switch FW version 27.2010.3942 and up, NVIDIA switches support switch management IP address discovery using MADs. This information can be retrieved as part of an ibdiagnet run (ibdiagnet output) and assigned to discovered switches in UFM.

You can choose which IP protocol version is used for the address assigned to the switch: IPv4 or IPv6.

The discovered_switch_ip_protocol key, located in the gv.cfg file in section [FabricAnalysys], is set to 4 by default. This means that the IP address of type IPv4 is assigned to the switch as its management IP address. In case this value is set to 6, the IP address of type IPv6 is assigned to the switch as its management IP address.

After changing the discovered_switch_ip_protocol value in gv.cfg, the UFM Main Model needs to be restarted for the update to take effect. The discovered IP addresses for switches are not persistent in UFM: on every UFM Main Model restart, the management IP address values are reassigned from the ibdiagnet output.
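The effect of the protocol setting can be illustrated with a small Python sketch that picks, from the addresses reported for a switch, the first one matching the configured IP version. The `pick_management_ip` helper and the sample addresses are hypothetical. How UFM actually chooses among several addresses of the same version is not described in this manual.

```python
import ipaddress


def pick_management_ip(candidates, protocol=4):
    """Return the first candidate address whose IP version matches `protocol`, else None."""
    for text in candidates:
        if ipaddress.ip_address(text).version == protocol:
            return text
    return None


# Illustrative addresses only.
discovered = ["fe80::1", "10.0.0.5"]
print(pick_management_ip(discovered, protocol=4))
print(pick_management_ip(discovered, protocol=6))
```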

Upgrading Server Software

The ability to update the server software is applicable only for hosts (servers) with the UFM Agent.

To upgrade the software:

  1. Select a device.

  2. From the right-click menu, select Software Update.

  3. Enter the parameters listed in the following table.

    Parameter

    Description

    Protocol

    Update is performed via FTP protocol

    IP

    Enter the host IP

    Path

    Enter the parent directory of the FTP directory structure for the Upgrade image.
    The path must be a relative path: it must not begin or end with a slash (/).

    User

    Enter the host username

    Password

    Enter the host password

  4. Click Submit to save your changes.
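The rule for the Path parameter (relative, with no leading or trailing slash) is easy to check before submitting the form. A minimal Python sketch, where `is_valid_upgrade_path` is a hypothetical helper and the sample paths are made up:

```python
def is_valid_upgrade_path(path):
    """Return True if the path is non-empty and has no leading or trailing slash."""
    return bool(path) and not path.startswith("/") and not path.endswith("/")


for candidate in ["images/ufm", "/images/ufm", "images/ufm/"]:
    print(candidate, is_valid_upgrade_path(candidate))
```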

Upgrading Firmware

You can upgrade firmware over FTP for hosts and switches that are running the UFM Agent, or you can perform an in-band upgrade for externally managed switches and HCAs.

Before you begin the upgrade, ensure that the new firmware version is in the correct location. For more information, please refer to section In-Band Firmware Upgrade.

To upgrade the firmware:

  1. Select a host or switch.

  2. From the right-click menu, select Firmware Upgrade.

  3. Select protocol In Band.

  4. For upgrade over FTP, enter the parameters listed in the following table.

    Parameter

    Description

    IP

    Enter device IP

    Path

    Enter the parent directory of the FTP directory structure for the Upgrade image.
    The path must be a relative path: it must not begin or end with a slash (/).

    Username

    Enter the host username

    Password

    Enter the host password

  5. Click Submit to save your changes.

    Warning

    The firmware upgrade takes effect only after the host or externally managed switch is restarted.

Upgrade Cables Transceivers Firmware Version

This feature supports burning multiple cable transceiver types on multiple devices using the linkx tool, which is part of flint. Burning must be done from both ends of the cable (switch and HCA/switch).

To upgrade the cable transceivers FW version:

  1. Navigate to the Managed Elements page.

  2. Select the target switches and click the Upgrade Cable Transceivers option.

    cable-upgrade.JPG

  3. A modal is shown containing a list of the active firmware versions for the cables of the selected switches. Beside each version number, a badge shows the number of matched switches:

    image2022-4-15_14-28-38.png

    image2022-4-15_14-29-1.png

  4. After you click Submit, the GUI starts sending the selected binaries to the relevant switches sequentially, and a modal with a progress bar is shown (this modal can be minimized):

    upgrade-cable-trans.JPG

  5. After the action completes successfully, the following message appears at the bottom of the modal: The upgrade cable transceivers completed successfully, do you want to activate it? Clicking Yes runs a new action on all the burned devices to activate the newly uploaded binary image.

  6. Alternatively, to activate burned cable transceivers, go to the Groups page and right-click the predefined group named Devices Pending FW Transceivers Reset, or right-click the upgraded device on the Managed Elements page and select the Activate cable Transceivers action.

    upgrade-cable-trans2.JPG

Selecting a device from the Devices table reveals the Device Information table on the right side of the screen. This table provides information on the device’s ports, cables, groups, events, alarms, inventory, and device access.

device-info-tabs.JPG

General Tab

Provides general information on the selected device.

general-tab.JPG


Ports Tab

This tab provides a list of the ports connected to this device in a tabular format.

image2022-4-28_22-9-12.png

Ports Data

Data Type

Description

Port Number

The number of the port on the device.

Node

The node name/GUID/IP that the port belongs to.

Note that you can choose the node label (name/GUID/IP) using the drop-down menu available above the Ports data table.

Health

Health of the port reflecting the highest alarm severity. Please refer to the Health States table.

State

Indicates whether the port is connected (active or inactive).

LID

The local identifier (LID) of the port.

MTU

Maximum Transmission Unit of the port.

Speed

image2022-4-15_14-33-46.png

Lists the highest value of active, enabled and supported speeds in icons indicating their status:

  • Dark green – active speed

  • Light green – enabled speed

  • Grey – supported yet disabled speed

Width

image2022-4-15_14-35-18.png

Lists the highest value of active, enabled and supported widths in icons indicating their status:

  • Dark green – active width

  • Light green – enabled width

  • Grey – supported yet disabled width

Peer

The GUID of the device the port is connected to.

Peer Port

The name of the port that is connected to this port.

Cables Tab

This tab provides a list of the cables connected to this device in a tabular format.

image2022-4-28_22-9-37.png

Cables Data

Data Type

Description

Basic Information

Health

Health of the cable reflecting the highest alarm severity. Please refer to the Health States table.

Serial Number

Serial number of the cable.

Identifier

Identifier of the cable.

Source Port Information

Source GUID

GUID of the source port the cable is connected to.

Source Port

The number of the source port the cable is connected to.

Destination Port Information

Destination GUID

GUID of the destination port the cable is connected to.

Destination Port

The number of the destination port the cable is connected to.

Advanced Information

Revision

Revision of the cable.

Link Width

The maximum link width of the cable.

Part Number

Part number of the cable.

Technology

The transmitting medium of the cable: copper/optical/etc.

Length

The cable length in meters.

Groups Tab

This tab provides a list of the groups to which the selected device belongs.

image2022-4-28_22-9-57.png

Groups Data

Data Type

Description

Severity

Aggregated severity level of the group (the highest severity level of all group members).

Name

Name of the group.

Description

Description of the group.

Type

Type of the group: General/Rack.

Alarms Tab

This tab provides a list of all UFM alarms related to the selected device.

image2022-4-28_22-10-22.png

Alarms Data

Data Type

Description

Alarms ID

Alarm identifier.

Source

Source object (device/port) on which the alarm was triggered.

Severity

The severity of the alarm.

Description

Description of the alarm.

Date/Time

The time when the alarm was triggered.

Reason

Reason for the alarm.

Count

Number of instances that the alarm occurred on the related source object.

Events Tab

This tab provides a list of the UFM events that are related to the selected device.

image2022-4-28_22-11-1.png

Events Data

Data Type

Description

Severity

Event severity – Info, Warning, Error, Critical or Minor.

Event Name

The name of the event.

Source

The source object (device/port) on which the event was triggered.

Date/Time

The time when the event was triggered.

Category

The category of the event indicated by icons. Hovering over the icon will display the category name.

Description

Description of the event. Full description can be displayed by hovering over the text.

Inventory Tab

This tab provides a list of the device’s modules with information in a tabular format.

Warning

This tab is available for switches only.

image2022-4-28_22-11-25.png

Inventory Data

Data Type

Description

Health

Health of the module reflecting the highest alarm severity. Please refer to the Health States table.

Status

The module status.

Serial Number

Serial number of the module.

Name

Name of the module.

Description

Description of the module.

Type

Type of the module: spine/line/etc.

Firmware Version

Firmware version installed on the module.

Hardware Version

Hardware version of the module.

Temperature

Temperature of the module.

HCAs Tab

This tab provides a list of the device’s HCAs with information in a tabular format.

Warning

This tab is available for hosts only.

image2022-4-28_22-11-48.png

Data Type

Description

Health

Health of the HCA reflecting the highest alarm severity. Please refer to the Health States table.

Name

HCA Index

GUID

HCA GUID

Type

HCA Type

Port GUID

HCA ports GUIDs

PSID

HCA PSID

FW Version

HCA firmware version

Device Access Tab

This tab allows you to manage the access credentials of the selected device for remote access. To set access credentials for the device, a device IP must be set either by installing the UFM Agent on the device or by manually setting the IP under IP Address Settings (both IPv4 and IPv6 addresses are supported).

image2019-9-19_16-7-29.png

Warning

When you manually set the IP address of NVIDIA® Mellanox® InfiniScale IV® and SwitchX® based switches, UFM first validates the new IP before applying it.

To edit your device access credentials

  1. Select the preferred protocol tab:

    • SSH – allows you to define the SSH parameters to open an SSH session on your device (available for nodes and switches)

    • IPMI – allows you to set the IPMI parameters to open an IPMI session on your device for remote power control (available for nodes only)

    • HTTP – allows you to define the HTTP parameters to open an HTTP session on your device (available for switches only)

  2. Click Update to save your changes.

    image2019-9-19_16-9-40.png

Device Access Credentials Parameters

Field

Description

User

Fill in or edit the computer user name.

Password

Enter the device password.

Confirmation

Enter the device password a second time to confirm.

Manual IP

Enter the device IP address (could be IPv4/IPv6).

Port

Enter the port number.

Timeout

Enter the connection timeout (in seconds) for the device specific protocol (SSH/HTTP/IPMI).

Virtual Networking Tab

This tab displays a map containing the HCAs for the selected device, and the ports and virtual ports it is connected to.

Virtual_Netowking_Tab.png


© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.