NVIDIA UFM Enterprise User Manual v6.17.0

Unhealthy Ports Window

The Unhealthy Ports view shows all the unhealthy nodes in the fabric and the OpenSM health policy of the healthy/unhealthy nodes.

After the Subnet Manager examines the behavior of subnet nodes (switches and hosts) and discovers that a node is “unhealthy” according to the conditions specified below, the node is displayed in the Unhealthy Ports window. Once a node is declared “unhealthy”, the Subnet Manager can ignore, report, isolate, or disable the node. The user can control both the actions performed and the conditions that declare a node “unhealthy”. The user can also “clear” nodes that were previously marked as “unhealthy”.

The information is displayed in a tabular form and includes the unhealthy port’s state, source node, source port, source port GUID, peer node, peer port, peer GUID, peer LID, condition, and status time.

Note

This feature requires the OpenSM parameter hm_unhealthy_ports_checks to be set to TRUE (default).

Note

This feature is not available in the "Monitoring Only Mode."

The following are the conditions that would declare a node as “unhealthy”:

  • Reboot - If a node was rebooted more than 10 times during the last 900 seconds

  • Flapping - If several links of the node were found in the Initializing state in 5 out of the 10 previous sweeps

  • Unresponsive - If a port does not respond to one of the SMPs and the MAD status is TIMEOUT in 5 out of the 7 previous SM sweeps

  • Noisy Node - If a node sends more than 250 traps of type 129, 130, or 131 with an interval of less than 60 seconds between consecutive traps

  • Seterr - If a node responds with a bad status to SET SMPs (PortInfo, SwitchInfo, VLArb, SL2VL, or Pkeys)

  • Illegal - If illegal MAD fields are discovered after a check for MADs/fields during receive_process

  • Manual - If the user explicitly requests that the node be marked as unhealthy or healthy

  • Link Level Retransmission (LLR) - If the retransmission-per-second counter exceeds its threshold

All conditions except LLR generate an Unhealthy Port event. LLR generates a High Data Retransmission event.
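
The sweep-based and time-based thresholds listed above can be illustrated with a small sketch. This is not OpenSM's implementation; it only models the counting logic under stated assumptions. The names SweepWindow and rebooted_too_often are hypothetical, and "5 out of 10" is interpreted here as "at least 5 of the last 10 sweeps".

```python
from collections import deque


class SweepWindow:
    """Tracks whether a condition was observed in each of the last `window` sweeps.

    Hypothetical helper for illustration; not part of OpenSM or UFM.
    """

    def __init__(self, hits_required, window):
        self.hits_required = hits_required
        # deque with maxlen drops the oldest sweep result automatically
        self.history = deque(maxlen=window)

    def record(self, observed):
        """Record one sweep result; return True if the threshold is currently met."""
        self.history.append(bool(observed))
        # Assumption: "N out of M" means at least N hits within the window
        return sum(self.history) >= self.hits_required


def rebooted_too_often(reboot_times, now, max_reboots=10, period=900):
    """Return True if more than `max_reboots` reboots fall within the last `period` seconds."""
    recent = [t for t in reboot_times if 0 <= now - t <= period]
    return len(recent) > max_reboots


# Flapping: 5 out of the 10 previous sweeps
flapping = SweepWindow(hits_required=5, window=10)
flapping_results = [flapping.record(seen) for seen in [True, False] * 5]

# Unresponsive: 5 out of the 7 previous sweeps
unresponsive = SweepWindow(hits_required=5, window=7)
unresponsive_results = [
    unresponsive.record(timeout)
    for timeout in [True, True, True, True, False, False, True, False]
]
```

In this model the flapping condition is first met on the ninth sweep, when the fifth positive observation arrives. The unresponsive condition is met on the seventh sweep and stops being met on the eighth, because the oldest timeout has left the 7-sweep window.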

To clear a node from the Unhealthy Ports window, do the following:

  1. Go to the Unhealthy Ports window under Managed Elements.

  2. From the Unhealthy Ports table, right-click the desired port and mark it as healthy.

To mark a node as permanently healthy, do the following:

  1. Open the /opt/ufm/files/conf/health-policy.conf.user_ext file.

  2. Enter the node and the port information and set it as "Healthy."

  3. Run the /opt/ufm/scripts/sync_hm_port_health_policy_conf.sh script.
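
The three steps above could be scripted roughly as follows. This is a minimal sketch, not an NVIDIA-provided tool: the function name add_policy_entry_and_sync is hypothetical, and because this section does not specify the entry syntax of the policy file, the caller must supply a complete, correctly formatted line. The sketch also assumes the sync script takes no arguments.

```python
import subprocess
from pathlib import Path

# Paths taken from the procedure above
USER_POLICY_FILE = "/opt/ufm/files/conf/health-policy.conf.user_ext"
SYNC_SCRIPT = "/opt/ufm/scripts/sync_hm_port_health_policy_conf.sh"


def add_policy_entry_and_sync(entry, policy_file=USER_POLICY_FILE,
                              sync_script=SYNC_SCRIPT, run=subprocess.run):
    """Append `entry` to the user policy file if absent, then run the sync script.

    `entry` must already be a valid policy line; its format is not defined here.
    `run` is injectable so the function can be exercised without the real script.
    """
    path = Path(policy_file)
    existing = path.read_text() if path.exists() else ""
    if entry not in existing.splitlines():
        # Make sure the new entry starts on its own line
        separator = "" if existing == "" or existing.endswith("\n") else "\n"
        with path.open("a") as handle:
            handle.write(f"{separator}{entry}\n")
    # check=True raises CalledProcessError if the script exits with a non-zero status
    return run([sync_script], check=True)
```

Skipping an entry that is already present means that re-running the function does not grow the file, while the sync script still runs each time.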

Note

To control the Partial Switch ASIC Failure event:

The Partial Switch ASIC Failure event is triggered whenever the number of unhealthy ports on a switch exceeds the defined percentage of the total number of switch ports.

The percentage is set by the switch_asic_fault_threshold flag (under the UnhealthyPorts section of the gv.cfg file). Its default value is 20.
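
As a worked example of the threshold arithmetic, the sketch below reads the flag from an inline sample and checks a 40-port switch. It assumes gv.cfg uses an INI-style layout that Python's configparser can parse; the sample text and the function name asic_failure_triggered are illustrative only, and the strict "greater than" comparison follows the word "exceeds" in the description above.

```python
import configparser

# Inline sample standing in for the relevant part of gv.cfg (assumed INI-style layout)
SAMPLE_GV_CFG = """
[UnhealthyPorts]
switch_asic_fault_threshold = 20
"""


def asic_failure_triggered(unhealthy_ports, total_ports, threshold_percent):
    """Return True if unhealthy ports exceed `threshold_percent` of all switch ports."""
    # Integer arithmetic avoids floating-point rounding at the boundary
    return unhealthy_ports * 100 > threshold_percent * total_ports


config = configparser.ConfigParser()
config.read_string(SAMPLE_GV_CFG)
threshold = config.getint("UnhealthyPorts", "switch_asic_fault_threshold")

# On a 40-port switch, 20% corresponds to 8 ports
print(asic_failure_triggered(8, 40, threshold))  # False: 8 ports is exactly 20%
print(asic_failure_triggered(9, 40, threshold))  # True: 9 ports exceeds 20%
```

With the default value of 20, a 40-port switch would therefore need at least 9 unhealthy ports before the event is raised.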

The Unhealthy Ports table can be filtered by connectivity using the dropdown at the top of the table, which includes the following options:

  • All Connectivity

  • Switch to Switch

  • Host to Switch

This view manages the OpenSM health policy for the healthy/unhealthy nodes and ports. The OpenSM health policy is stored in the /opt/ufm/files/conf/opensm/opensm-health-policy.conf file.

The information is displayed in tabular form, with an option to group it either by devices or by ports, and includes the health details of the nodes/ports (GUID, name, and policy [healthy/unhealthy]).

  1. Health Policy by devices

  2. Health Policy by ports

To switch between the above views, click the control button at the top right corner of the table. By default, the devices view is shown.

When you select a policy and right-click it, you can perform the following actions:

  1. Delete the Policy

  2. Mark the selected healthy policies as unhealthy (Isolate/No discover)

  3. Mark the selected unhealthy policies as healthy

To delete all the healthy ports from the health policy, click the 'Delete All Healthy Ports' option at the top right corner of the policy table.

© Copyright 2024, NVIDIA. Last updated on Aug 27, 2024.