NVIDIA InfiniBand Cluster Operation and Maintenance Guide
1.0

List of Scenarios

This section provides a set of UFM API queries used to gather information about events identified by UFM. The events/scenarios are organized by their likelihood. For each event, comprehensive details are supplied, including a description, relevant fabric state, methods to retrieve the specific UFM event, corresponding event IDs, and suggested courses of action for remediation.

Scenario

Bad Link/Port

Description

A port can drop too many packets

Fabric State

Any fabric state

How to Detect

Via UFM health reports and UFM events/alerts.

UFM Alerts

Event ID    Event Name
915         Critical BER reported
916         High BER reported

Action Required

Review the firmware version of both the device and cable to ensure they are up to date. Fixing a bad link involves a sequence of actions to be followed on both ends of the link. Each of these actions should be executed in sequence, while verifying if the issue has been resolved. These actions are:

  1. Reset the link using NOS/mlx command:

    interface ib <port number> clear counters

    Example:

    interface ib 1/1/1 clear counters

  2. Pull out, clean, and reinsert the fiber.

  3. Consider replacing the transceivers. It is recommended to swap in two spare transceivers, one on each side of the link. If this resolves the issue, determine in a controlled lab environment which of the removed transceivers was responsible for the problem. This minimizes downtime on the operational floor.

  4. Try using a new or spare fiber between the source and destination.

  5. If the problem persists, the switch or NIC is likely malfunctioning. Replace the faulty device with new hardware.

  6. If the issue remains unresolved, contact NVIDIA Networking Support.

A link can be disqualified by a short period of high Bit Error Rate (BER) or by more complex criteria. Confirming that a link is reliable, however, requires a longer monitoring period of approximately 30 minutes.

The complete procedure for fixing a link involves isolating the link, applying manual fixes from the list above, waiting for confirmation that the issue is resolved, and then de-isolating the link:

  1. If the cable temperature drifts considerably or exceeds established thresholds, isolate the link without attempting a repair. Allow the temperature to stabilize before de-isolating.

  2. Identify problematic links using the UFM report of bad links. Alternatively, look for links that are in the INIT state due to firmware AutoDetect, or monitor changes in link counters.

  3. Use the UFM port isolation routine to isolate the link, transitioning it to the INIT state.

  4. Apply the fixes outlined in actions 1 to 6 of the first list above, addressing one issue at a time.

  5. Wait for the Cable Validation tool to report the health of the impacted ports.

  6. De-isolate the link using the UFM port de-isolation routine.

  7. Verify that the link has successfully transitioned to the ACTIVE state.
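The isolate, fix, confirm, and de-isolate cycle described above can be sketched as a small control loop. This is a minimal sketch, not UFM code: the four callables (`fixes`, `isolate`, `deisolate`, `is_healthy`) are hypothetical hooks that an operator would wire to the UFM isolation routines, the manual actions, and the Cable Validation tool. `is_healthy` is assumed to include the monitoring period of approximately 30 minutes.

```python
from typing import Callable, Iterable


def repair_link(
    fixes: Iterable[Callable[[], None]],
    isolate: Callable[[], None],
    deisolate: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Isolate a link, apply fixes one at a time, de-isolate once healthy.

    All four callables are hypothetical hooks supplied by the caller.
    Returns True if the link was confirmed healthy and de-isolated.
    Returns False if every fix was tried without success; in that case
    the link is left isolated.
    """
    isolate()
    for apply_fix in fixes:
        apply_fix()
        # is_healthy() is expected to block for the monitoring period
        # before it reports a result.
        if is_healthy():
            deisolate()
            return True
    return False
```

Applying one fix per iteration keeps the cause identifiable: the fix that precedes the first healthy report is the one that resolved the issue.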

Additional Reading

Refer to Cable Validation Tool and Reports REST API.

{
  "cables": true,
  "symbol_ber_check": true,
  "effective_ber_check": true
}

Scenario

Wrong Connection

Description

A cable that is not connected according to the topology can act as a bottleneck.

Fabric State

Bring-up/maintenance

How to Detect

Schedule the master topology comparison (topodiff) to run once every 3 hours using the Topology Compare REST API.

Turn on the UFM topology compare capability.

UFM Alerts

Event ID    Event Name
1316        Topo Config Subnet Mismatch

Action Required

Fix the misconnection.
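The idea behind a topology comparison can be illustrated with plain set arithmetic. The sketch below is not the UFM topodiff implementation; it assumes each link is described by two hypothetical (node name, port number) endpoints and reports links that are missing from the fabric and links that are present but not in the master topology.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Endpoint = Tuple[str, int]      # (node name, port number), hypothetical format
Link = FrozenSet[Endpoint]      # unordered pair of endpoints


def make_link(a: Endpoint, b: Endpoint) -> Link:
    """Build a link whose identity does not depend on endpoint order."""
    return frozenset((a, b))


def topo_diff(
    expected: Iterable[Link], discovered: Iterable[Link]
) -> Tuple[Set[Link], Set[Link]]:
    """Return (missing, unexpected) links relative to the master topology."""
    expected_set, discovered_set = set(expected), set(discovered)
    missing = expected_set - discovered_set      # planned but not found
    unexpected = discovered_set - expected_set   # found but not planned
    return missing, unexpected
```

A cable plugged into the wrong port typically shows up as one missing link and one unexpected link that share an endpoint.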

Scenario

Host is Hanging

Description

Most commonly caused by a software issue or overload.

Fabric State

Any

How to Detect

The SM alerts about an unresponsive host.

UFM Alerts

Event ID    Event Name
331         Node is Down

Action Required

Fix the stuck host.

Scenario

Low Bandwidth

Description

If disconnected network cables are left unrepaired, the network may lose bandwidth. Application performance can decline for various reasons, but the most common one is the loss of network links.

Fabric State

Bring-up/maintenance

UFM API Query

GET /ufmRest/app/events

How to Detect

Customer complaints. Examine the UFM Congestion dashboard.

UFM Alerts

Event ID    Event Name
122         Congested Bandwidth (%) Threshold Reached
134         T4 Port Congested Bandwidth

Action Required

Assess the network's bandwidth while considering the impact of the disconnected links, as this could explain the decrease in bandwidth. If the calculations align, it is crucial to replace the affected cables. In cases where the figures do not match, a deeper analysis may be necessary, including an analysis of the application's traffic pattern and the utilized transport methods.
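The bandwidth check described above can be approximated with simple arithmetic. The sketch below assumes that all links have the same speed and that traffic is spread evenly across them, which is a simplification; the function names and the 5% tolerance are illustrative choices, not UFM settings.

```python
def expected_bandwidth_fraction(total_links: int, down_links: int) -> float:
    """Fraction of nominal bandwidth left when some equal-speed links are down."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    if not 0 <= down_links <= total_links:
        raise ValueError("down_links must be between 0 and total_links")
    return (total_links - down_links) / total_links


def loss_explained_by_links(
    total_links: int,
    down_links: int,
    measured_fraction: float,
    tolerance: float = 0.05,
) -> bool:
    """True if the measured bandwidth is consistent with the disconnected links.

    measured_fraction is the observed bandwidth divided by the nominal
    bandwidth. The tolerance is a hypothetical allowance for measurement noise.
    """
    expected = expected_bandwidth_fraction(total_links, down_links)
    return measured_fraction >= expected - tolerance
```

For example, with 4 of 32 links down, the expected fraction is 28/32 = 0.875. A measurement of 0.85 is consistent with the lost links, while a measurement of 0.60 points to another cause and calls for the deeper traffic analysis.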

Scenario

SHARP AM Issue

Description

SHARP job failure

Fabric State

Any

UFM API Query

GET /ufmRest/app/events

How to Detect

Review UFM events

UFM Alerts

Event ID    Event Name
1523        Job Start Failed
1524        Job Error
1532        SHARP is not Responding

Action Required

  • Check SHARP AM configuration.

  • Check job scheduler configuration.
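Once the response of GET /ufmRest/app/events has been decoded from JSON, the SHARP events listed above can be picked out with a short filter. The sketch below assumes the response is a list of objects and that the event ID is carried in a field named `event_type`; that field name is an assumption and should be checked against the actual response.

```python
from typing import Any, Dict, Iterable, List, Set

# Event IDs from the SHARP AM Issue scenario.
SHARP_EVENT_IDS: Set[int] = {1523, 1524, 1532}


def filter_events(
    events: Iterable[Dict[str, Any]],
    wanted_ids: Set[int],
    id_field: str = "event_type",  # assumed field name, verify before use
) -> List[Dict[str, Any]]:
    """Return the events whose ID is in wanted_ids.

    IDs are accepted as integers or numeric strings. Events with a
    missing or non-numeric ID are skipped.
    """
    matches = []
    for event in events:
        try:
            event_id = int(event.get(id_field))
        except (TypeError, ValueError):
            continue
        if event_id in wanted_ids:
            matches.append(event)
    return matches
```

The same function works for any scenario in this section by passing a different set of event IDs.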

Scenario

Inadequate Control of Cluster Temperature

Description

The cluster's temperature could escalate rapidly or surpass a designated threshold, potentially risking equipment damage.

Fabric State

Any

How to Detect

UFM temperature alert.

Via REST API, using the following request URL:

POST /ufmRest/fabricValidation/tests/CheckTemperature

For more information, refer to Fabric Validation Tests REST API.

UFM Alerts

Event ID    Event Name
912         Module Temperature High Threshold Reached
919         Cable Temperature High
1400        High Ambient Temperature

Action Required

Fix the fans/cooling/air conditioning systems.
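The temperature check request above could be prepared from a script as follows. This is a sketch under stated assumptions: the host name and credentials are placeholders, and HTTPS with HTTP basic authentication and an empty request body are assumed rather than taken from this guide. The function only builds the request; sending it is left to the caller.

```python
import base64
import urllib.request


def build_check_temperature_request(
    host: str, username: str, password: str
) -> urllib.request.Request:
    """Build (but do not send) the CheckTemperature POST request.

    host, username, and password are placeholders supplied by the caller.
    Basic authentication and an empty body are assumptions.
    """
    url = f"https://{host}/ufmRest/fabricValidation/tests/CheckTemperature"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )
```

The returned object can be passed to `urllib.request.urlopen` once the host, credentials, and certificate handling have been adapted to the local UFM installation.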

Scenario

Non-Responsive Switch

Description

The switch is not responding due to software or hardware issues.

Fabric State

Any

How to Detect

Examine the UFM-generated alerts for such cases.

UFM Alerts

Event ID    Event Name
907         Switch is Down
909         Director Switch is Down
1312        Suspected switch Reboot

Action Required

Isolate and reboot the switch, then de-isolate it. If the problem persists, keep it isolated.

Additional Reading

Refer to Actions REST API.

Scenario

Infinite Switch Reboots due to Switch HW Malfunction

Description

Although unlikely, a switch software or hardware malfunction may lead to repeated switch reboots. If this occurs, packets transmitted through that switch may be lost, and the SDN infrastructure could become busy handling these incidents and reconfiguring the network.

Fabric State

Any

How to Detect

UFM alerts about an unhealthy switch.

UFM Alerts

Event ID    Event Name
907         Switch is Down
909         Director Switch is Down
1312        Suspected switch Reboot

Action Required

Isolate and reboot the switch, then de-isolate it. If the problem persists, keep it isolated.

Additional Reading

Refer to Actions REST API.
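A reboot loop can be told apart from a single reboot by counting how often event 1312 recurs for the same switch within a time window. The sketch below assumes the reboot events have already been reduced to hypothetical (switch name, timestamp in seconds) pairs. The one-hour window and the threshold of three reboots are illustrative values, not UFM defaults.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_reboot_loops(
    reboot_events: Iterable[Tuple[str, float]],
    window_seconds: float = 3600.0,
    min_reboots: int = 3,
) -> List[str]:
    """Return switches with at least min_reboots reboots inside any window."""
    by_switch: Dict[str, List[float]] = defaultdict(list)
    for switch, timestamp in reboot_events:
        by_switch[switch].append(timestamp)

    looping = []
    for switch, stamps in by_switch.items():
        stamps.sort()
        # Slide over every run of min_reboots consecutive timestamps and
        # check whether the run fits inside the window.
        for i in range(len(stamps) - min_reboots + 1):
            if stamps[i + min_reboots - 1] - stamps[i] <= window_seconds:
                looping.append(switch)
                break
    return sorted(looping)
```

Switches returned by this function are candidates for staying isolated rather than being de-isolated after the next reboot.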

Scenario

UFM Server Failover

Description

UFM server failure causes UFM HA failover to standby host.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID    Event Name
602         UFM Server Failover

Action Required

Run a UFM health report and validate that the UFM failover completed successfully. Fix any hardware failures and restore the UFM High Availability (HA) cluster. Collect system dumps and open a support ticket through NVIDIA Networking Support.

Scenario

SM Not Responding

Description

The OpenSM process hangs.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID    Event Name
545         SM is not responding

Action Required

UFM should restart the SM automatically. Run a UFM health report to verify that UFM is able to restart OpenSM. Collect sysdump data and open a support ticket through NVIDIA Networking Support.

Additional Reading

Fabric Validation Tests REST API

Scenario

UFM Management Interface is Down

Description

UFM server InfiniBand management interface is down.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID    Event Name
546         Management interface is down

Action Required

Verify the connectivity of the management interface. Run the "ibstat" command and confirm that it reports "State: Active" and "Physical state: LinkUp". If the issue persists, collect a sysdump and open a support ticket through NVIDIA Networking Support.
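The ibstat check can be scripted by parsing the command output. The sketch below works on captured text rather than running the command itself. The sample is an abbreviated illustration of the ibstat layout with made-up adapter names, and real output contains more fields.

```python
import re
from typing import Dict, List, Optional

# Abbreviated, illustrative sample; adapter names and values are made up.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Polling
"""


def parse_ibstat_ports(output: str) -> List[Dict[str, str]]:
    """Return one dict per port holding its CA name, port number, and fields."""
    ports: List[Dict[str, str]] = []
    current_ca = ""
    current: Optional[Dict[str, str]] = None
    for line in output.splitlines():
        stripped = line.strip()
        ca_match = re.match(r"CA '(.+)'$", stripped)
        port_match = re.match(r"Port (\d+):$", stripped)
        if ca_match:
            # A new adapter starts; CA-level fields are not attached to a port.
            current_ca = ca_match.group(1)
            current = None
        elif port_match:
            current = {"ca": current_ca, "port": port_match.group(1)}
            ports.append(current)
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return ports


def port_is_up(port: Dict[str, str]) -> bool:
    """True if the port reports State: Active and Physical state: LinkUp."""
    return port.get("State") == "Active" and port.get("Physical state") == "LinkUp"
```

On a live server, the text passed to `parse_ibstat_ports` would come from running ibstat, for example through `subprocess.run(["ibstat"], capture_output=True, text=True)`.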

Scenario

Threshold for UFM Server Disk Utilization Exceeded

Description

UFM server disk utilization threshold is reached and UFM is not able to free disk space.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID    Event Name
525         Disk utilization threshold reached

Action Required

Remove third-party data from the UFM server disk. If the problem persists, collect a sysdump and open a support ticket with NVIDIA Networking Support.
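Before and after cleaning the disk, utilization can be checked from a script. The sketch below uses only the Python standard library. The path to inspect and the 90% threshold are placeholders and do not reflect the threshold that UFM itself applies.

```python
import shutil


def disk_utilization_percent(path: str) -> float:
    """Return the used space on the filesystem holding path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def exceeds_threshold(percent: float, threshold_percent: float = 90.0) -> bool:
    """True if utilization is at or above the (hypothetical) threshold."""
    return percent >= threshold_percent
```

Calling `exceeds_threshold(disk_utilization_percent(path))` on the filesystem that holds the UFM data shows whether the cleanup freed enough space.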

Scenario

Duplicated GUIDs

Description

Cards or switches can be accidentally provisioned with duplicated GUIDs. This is a rare case.

Fabric State

Bring-up/maintenance

How to Detect

Subnet Manager detects and reports duplicated GUIDs via UFM

UFM Alerts

Event ID    Event Name
1310        Duplicated node GUID was detected
1311        Duplicated port GUID was detected

Action Required

  • Collect a sysdump and open a support ticket with NVIDIA Networking Support.

  • Turn off the switch/host.
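To decide which switch or host to turn off, it helps to know which devices share a GUID. In UFM, the Subnet Manager performs the detection; the sketch below only illustrates the check on a hypothetical inventory of (device name, GUID) pairs, comparing GUIDs without regard to letter case.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_guids(
    inventory: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each GUID that appears more than once to the devices that use it."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for name, guid in inventory:
        # Normalize case so 0xABC and 0xabc count as the same GUID.
        owners[guid.lower()].append(name)
    return {guid: names for guid, names in owners.items() if len(names) > 1}
```

An empty result means the inventory contains no duplicated GUIDs.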

Additional Reading

Actions REST API
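The event IDs listed for the scenarios in this section can be collected into a lookup table, so that an incoming UFM event ID points directly to the scenarios to consult. The table below is transcribed from the UFM Alerts entries above; the function name is an illustrative choice.

```python
from typing import Dict, List

# Scenario name -> UFM event IDs, as listed in this section.
SCENARIO_EVENT_IDS: Dict[str, List[int]] = {
    "Bad Link/Port": [915, 916],
    "Wrong Connection": [1316],
    "Host is Hanging": [331],
    "Low Bandwidth": [122, 134],
    "SHARP AM Issue": [1523, 1524, 1532],
    "Inadequate Control of Cluster Temperature": [912, 919, 1400],
    "Non-Responsive Switch": [907, 909, 1312],
    "Infinite Switch Reboots due to Switch HW Malfunction": [907, 909, 1312],
    "UFM Server Failover": [602],
    "SM Not Responding": [545],
    "UFM Management Interface is Down": [546],
    "Threshold for UFM Server Disk Utilization Exceeded": [525],
    "Duplicated GUIDs": [1310, 1311],
}


def scenarios_for_event(event_id: int) -> List[str]:
    """Return every scenario that lists event_id, in section order."""
    return [
        name for name, ids in SCENARIO_EVENT_IDS.items() if event_id in ids
    ]
```

Some event IDs map to more than one scenario. For example, 907, 909, and 1312 appear under both switch scenarios, so the function returns a list rather than a single name.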

© Copyright 2023, NVIDIA. Last updated on Dec 18, 2023.