Unhealthy Ports Window

NVIDIA UFM-SDN Appliance User Manual v4.11.0

The Unhealthy Ports tab shows all the unhealthy nodes in the fabric.

After the Subnet Manager examines the behavior of subnet nodes (switches and hosts) and discovers that a node is “unhealthy” according to the conditions specified below, the node is displayed in the Unhealthy Ports window. Once a node is declared “unhealthy”, the Subnet Manager can ignore, report, isolate, or disable it. The user can control both the actions performed and the conditions that cause a node to be declared “unhealthy.” The user can also “clear” nodes that were previously marked as “unhealthy.”

The information is displayed in a tabular form and includes the unhealthy port’s state, source node, source port, source port GUID, peer node, peer port, peer GUID, peer LID, condition, and status time.

Warning

The feature requires OpenSM parameter hm_unhealthy_ports_checks to be set to TRUE (default).

Warning

This feature is not available in the "Monitoring Only Mode."

The following are the conditions that would declare a node as “unhealthy”:

  • Reboot - If a node was rebooted more than 10 times during the last 900 seconds

  • Flapping - If several links of the node were found in the Initializing state in 5 out of the 10 previous sweeps

  • Unresponsive - If a port does not respond to one of the SMPs and the MAD status is TIMEOUT in 5 out of the 7 previous SM sweeps

  • Noisy Node - If a node sends more than 250 traps of type 129, 130, or 131 with an interval of less than 60 seconds between consecutive traps

  • Seterr - If a node responds with a bad status to SET SMPs (PortInfo, SwitchInfo, VLArb, SL2VL, or Pkeys)

  • Illegal - If illegal MAD fields are discovered after a check for MADs/fields during receive_process

  • Manual - If the user requests that the node be marked as unhealthy or healthy

  • Link Level Retransmission (LLR) - If the retransmission-per-second counter exceeds its threshold

All conditions except LLR generate an Unhealthy port event; LLR generates a High Data retransmission event.
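Several of the conditions above follow a "seen in k out of the last n sweeps" pattern. The following Python sketch illustrates that pattern only; it is not the Subnet Manager's implementation. The names SweepWindow and event_for are hypothetical, and the thresholds are the ones quoted for the Unresponsive condition.

```python
from collections import deque


class SweepWindow:
    """Reports whether a symptom was seen in at least `threshold`
    of the last `window` sweeps (hypothetical helper)."""

    def __init__(self, threshold: int, window: int) -> None:
        self.threshold = threshold
        # A deque with maxlen discards the oldest sweep automatically.
        self.history = deque(maxlen=window)

    def record(self, symptom_seen: bool) -> bool:
        """Store the result of one sweep and return the current verdict."""
        self.history.append(symptom_seen)
        return sum(self.history) >= self.threshold


def event_for(condition: str) -> str:
    """Map a condition name to the event type described in the text."""
    if condition == "LLR":
        return "High Data retransmission"
    return "Unhealthy port"


# Unresponsive: MAD status TIMEOUT in 5 out of the 7 previous sweeps.
unresponsive = SweepWindow(threshold=5, window=7)
sweeps = [True, True, False, True, True, False, True]
verdicts = [unresponsive.record(seen) for seen in sweeps]

print(verdicts)
print(event_for("Unresponsive"))
```

In this example the port is flagged only on the seventh sweep, when the fifth timeout falls inside the window. A Flapping check could reuse the same helper with threshold=5 and window=10.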

To clear a node from the Unhealthy Ports Tab, do the following:

  1. Go to the Unhealthy Ports window under Managed Elements.

  2. From the Unhealthy Ports table, right-click the desired port and mark it as healthy.

To mark a node as permanently healthy, do the following:

  1. Create a file named opensm-health-policy.conf.user_ext on a remote host.

  2. Enter the node and the port information and set it as "Healthy".

    0x0002c903005dd832 6 Healthy

  3. Import the file.

    ufm-appliance [ mgmt-sa ] (config) # ib sm configuration import opensm-health-policy-user-ext scp://root:123456@192.168.1.3/tmp/health-policy.conf.user_ext

  4. Make the changes effective by running "ib sm opensm-health-policy-merge".

    ufm-appliance [ mgmt-sa ] (config) # ib sm opensm-health-policy-merge
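When many ports need to be marked as healthy, the policy file from step 2 can be generated rather than typed by hand. The Python sketch below assumes the entry format is exactly what the single example line shows (a 64-bit GUID in 0x-prefixed hexadecimal, a port number, and the word Healthy, separated by spaces). The function names are hypothetical, and no other states or fields are assumed.

```python
def format_policy_entry(guid: int, port: int) -> str:
    """Build one policy line in the form shown in step 2:
    <GUID as 16 hex digits> <port number> Healthy"""
    if not 0 <= guid < 2**64:
        raise ValueError("GUID must fit in 64 bits")
    if port < 0:
        raise ValueError("port number must not be negative")
    return f"0x{guid:016x} {port} Healthy"


def build_policy_file(entries: list[tuple[int, int]]) -> str:
    """Join (guid, port) pairs into file content, one entry per line."""
    return "\n".join(format_policy_entry(g, p) for g, p in entries) + "\n"


# The GUID and port below are the ones from the example in step 2.
content = build_policy_file([(0x0002C903005DD832, 6)])
print(content, end="")
```

The resulting text can be saved under the file name from step 1 and then imported and merged as described in steps 3 and 4.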

Warning

To control the Partial Switch ASIC Failure event:

The Partial Switch ASIC Failure event is triggered whenever the number of unhealthy ports exceeds the defined percentage of the total number of switch ports.

The default value of the switch_asic_fault_threshold flag (under the UnhealthyPorts section of the gv.cfg file) is 20.
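The threshold check can be illustrated with a short Python sketch. It assumes that gv.cfg uses INI-style "key = value" syntax and that "exceeds" means strictly greater than the threshold. The sample file content, the function names, and the 36-port switch are illustrative assumptions, not taken from the product.

```python
import configparser

# Assumed INI-style fragment; the section and flag names come from the text.
SAMPLE_GV_CFG = """
[UnhealthyPorts]
switch_asic_fault_threshold = 20
"""


def read_threshold(cfg_text: str) -> float:
    """Return the threshold percentage, falling back to the documented default of 20."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getfloat(
        "UnhealthyPorts", "switch_asic_fault_threshold", fallback=20.0
    )


def is_partial_asic_failure(unhealthy: int, total: int, threshold: float) -> bool:
    """True when the unhealthy share of ports is strictly above the threshold."""
    if total <= 0:
        raise ValueError("total number of ports must be positive")
    return 100.0 * unhealthy / total > threshold


threshold = read_threshold(SAMPLE_GV_CFG)
# Hypothetical 36-port switch: 7 unhealthy ports is about 19.4%, 8 is about 22.2%.
print(is_partial_asic_failure(7, 36, threshold))
print(is_partial_asic_failure(8, 36, threshold))
```

With the default value of 20, the hypothetical 36-port switch would cross the threshold at its eighth unhealthy port.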

The Unhealthy Ports table can be filtered by connectivity using the dropdown at the top of the table, which includes the following options:

  • All Connectivity

  • Switch to Switch

  • Host to Switch

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.