List of Scenarios

This section provides a set of UFM API queries used to gather information about events identified by UFM. The events/scenarios are organized by their likelihood. For each event, comprehensive details are supplied, including a description, relevant fabric state, methods to retrieve the specific UFM event, corresponding event IDs, and suggested courses of action for remediation.

Bad Link/Port

Scenario

Bad Link/Port

Description

A port drops an excessive number of packets.

Fabric State

Any fabric state

How to Detect

Via UFM health reports and UFM events/alerts.

UFM Alerts

Event ID	Event Name
915	Critical BER reported
916	High BER reported

Action Required

Review the firmware version of both the device and the cable to ensure they are up to date. Fixing a bad link involves a sequence of actions on both ends of the link. Perform the actions in order, checking after each one whether the issue has been resolved:

a. Reset the link using the NOS/mlx command:
   interface ib <port number> clear counters
   Example:
   interface ib 1/1/1 clear counters
b. Pull out, clean, and reinsert the fiber.
c. Consider replacing the transceiver. It is recommended to use two spare transceivers, one on each side of the link. If this resolves the issue, determine in a controlled lab environment which of the removed transceivers was responsible for the problem. This minimizes downtime on the operational floor.
d. Try a new or spare fiber between the source and destination.
e. If the problem persists, the switch or NIC is likely malfunctioning. Replace it with new hardware.
f. If the issue remains unresolved, contact NVIDIA Networking Support.

A short period of high Bit Error Rate (BER), or more complex criteria, can disqualify a link. Conversely, confirming that a link is reliable requires a significant monitoring period of approximately 30 minutes.

The complete procedure for fixing a link involves isolating the link, applying manual fixes from the list above, waiting for confirmation of resolution, and then de-isolating the link:

1. If the cable temperature exhibits considerable drift or surpasses established thresholds, isolate the link without attempting a repair. Allow the temperature to stabilize before de-isolating.
2. Identify problematic links using the UFM report of bad links. Alternatively, consider links that are in the INIT state due to firmware AutoDetect, or monitor changes in link counters.
3. Use the UFM port isolation routine to isolate the link, which transitions it to the INIT state.
4. Apply the fixes outlined in actions a to f above, addressing one issue at a time.
5. Wait for the Cable Validation tool to report port health for the impacted ports.
6. De-isolate the link using the UFM port de-isolation routine.
7. Verify that the link has successfully transitioned to the ACTIVE state.
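
The asymmetry between disqualifying and confirming a link can be sketched as follows. This is a minimal illustration, not UFM's actual logic: the function name, the sample format, and the BER threshold are all hypothetical, and only the 30-minute confirmation window comes from the text above.

```python
from typing import Iterable, Tuple

# Hypothetical threshold chosen for illustration; use the limits that apply to your fabric.
BER_THRESHOLD = 1e-12
# Confirmation window of approximately 30 minutes, as described in the text.
CONFIRM_SECONDS = 30 * 60


def link_verdict(samples: Iterable[Tuple[float, float]],
                 ber_threshold: float = BER_THRESHOLD,
                 confirm_seconds: float = CONFIRM_SECONDS) -> str:
    """Classify a link from (timestamp_seconds, ber) samples.

    Returns "bad", "healthy", or "monitoring".
    """
    ordered = sorted(samples)
    if not ordered:
        return "monitoring"
    # A single high-BER sample is enough to disqualify the link.
    if any(ber > ber_threshold for _, ber in ordered):
        return "bad"
    # A clean link is confirmed only after a long enough observation window.
    observed = ordered[-1][0] - ordered[0][0]
    return "healthy" if observed >= confirm_seconds else "monitoring"
```

A link with ten clean minutes of samples stays in "monitoring", while one bad sample at any point returns "bad" regardless of how long the link has been observed.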

Additional Reading

Refer to Cable Validation Tool and Reports REST API

{
    "cables": true,
    "symbol_ber_check": true,
    "effective_ber_check": true
}

Wrong Connection

Scenario

Wrong Connection

Description

A cable that is not connected according to the topology can act as a bottleneck.

Fabric State

Bring-up/maintenance

How to Detect

Using the Topology Compare REST API, schedule a topodiff against the master topology to run once every 3 hours.

Turn on UFM topology compare capability.

UFM Alerts

Event ID	Event Name
1316	Topo Config Subnet Mismatch

Action Required

Fix the misconnection.

Host is Hanging

Scenario

Host is Hanging

Description

Most commonly caused by a software issue or overload.

Fabric State

Any

How to Detect

SM alerts about unresponsive host.

UFM Alerts

Event ID	Event Name
331	Node is Down

Action Required

Fix the stuck host.

Low Bandwidth

Scenario

Low Bandwidth

Description

If disconnected network cables are left unfixed, the network may lose bandwidth. Application performance can decline for various reasons, but the most common one is the loss of network links.

Fabric State

Bring-up/maintenance

UFM API Query

GET /ufmRest/app/events
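
A minimal sketch of how the query above might be prepared and its results narrowed to the event IDs listed for this scenario. The host name, the HTTPS scheme, and the `event_type` field name are assumptions for illustration; check the UFM events REST API reference for the actual response schema.

```python
import urllib.request
from typing import Iterable, List


def build_events_request(host: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for the UFM events endpoint."""
    return urllib.request.Request(f"https://{host}/ufmRest/app/events", method="GET")


def filter_events(events: Iterable[dict],
                  event_ids: Iterable[int],
                  id_field: str = "event_type") -> List[dict]:
    """Keep only the events whose ID is in event_ids.

    id_field is a hypothetical name for the key that carries the event ID.
    IDs are compared as strings so that 122 and "122" both match.
    """
    wanted = {str(event_id) for event_id in event_ids}
    return [event for event in events if str(event.get(id_field)) in wanted]
```

For this scenario the filter would be called with the IDs 122 and 134. The same helper applies to the other scenarios by passing their event IDs instead.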

How to Detect

Customer complaints. Examine the UFM Congestion dashboard.

UFM Alerts

Event ID	Event Name
122	Congested Bandwidth (%) Threshold Reached
134	T4 Port Congested Bandwidth

Action Required

Assess the network's bandwidth while considering the impact of the disconnected links, as this could explain the decrease in bandwidth. If the calculations align, it is crucial to replace the affected cables. In cases where the figures do not match, a deeper analysis may be necessary, including an analysis of the application's traffic pattern and the utilized transport methods.
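
The comparison described above can be sketched as simple arithmetic. This is a deliberately simplified model that assumes all links run at the same speed and that aggregate bandwidth scales linearly with the number of connected links; the function names and the 5% tolerance are hypothetical.

```python
def expected_bandwidth_gbps(total_links: int, down_links: int, per_link_gbps: float) -> float:
    """Aggregate bandwidth that remains once the disconnected links are excluded."""
    if not 0 <= down_links <= total_links:
        raise ValueError("down_links must be between 0 and total_links")
    return (total_links - down_links) * per_link_gbps


def loss_explained_by_links(measured_gbps: float, total_links: int, down_links: int,
                            per_link_gbps: float, tolerance: float = 0.05) -> bool:
    """True if the measured bandwidth is consistent with the missing links alone."""
    expected = expected_bandwidth_gbps(total_links, down_links, per_link_gbps)
    # Allow a small shortfall before concluding that something else is wrong.
    return measured_gbps >= expected * (1 - tolerance)
```

With 16 links of 400 Gb/s and 2 of them disconnected, the expected aggregate is 5600 Gb/s. A measurement of 5500 Gb/s is explained by the missing cables, whereas 4000 Gb/s is not and points to the deeper analysis of traffic patterns and transports.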

SHARP AM Issue

Scenario

SHARP AM Issue

Description

SHARP job failure

Fabric State

Any

UFM API Query

GET /ufmRest/app/events

How to Detect

Review UFM events

UFM Alerts

Event ID	Event Name
1523	Job Start Failed
1524	Job Error
1532	SHARP is not Responding

Action Required

Check SHARP AM configuration.
Check job scheduler configuration.

Inadequate Control of Cluster Temperature

Scenario

Inadequate Control of Cluster Temperature

Description

The cluster's temperature could escalate rapidly or surpass a designated threshold, potentially risking equipment damage.

Fabric State

Any

How to Detect

UFM temperature alert.

Via REST API: Request URL:

POST /ufmRest/fabricValidation/tests/CheckTemperature

For more information, refer to Fabric Validation Tests REST API.
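
A minimal sketch of preparing the request above with the Python standard library. The host name, the use of HTTP basic authentication, and the empty request body are assumptions for illustration; the Fabric Validation Tests REST API reference is the authority on what the endpoint expects and returns.

```python
import base64
import urllib.request


def build_temperature_check_request(host: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts the CheckTemperature test."""
    # Basic authentication is assumed here; adapt to the scheme your UFM server uses.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"https://{host}/ufmRest/fabricValidation/tests/CheckTemperature",
        data=b"",  # assumed empty body
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )
```

The request object can then be passed to `urllib.request.urlopen` from a machine that can reach the UFM server.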

UFM Alerts

Event ID	Event Name
912	Module Temperature High Threshold Reached
919	Cable Temperature High
1400	High Ambient Temperature

Action Required

Fix the fans/cooling/air conditioning systems.

Non-Responsive Switch

Scenario

Non-Responsive Switch

Description

The switch is not responding due to software or hardware issues.

Fabric State

Any

How to Detect

Examine the UFM generated alerts for such cases

UFM Alerts

Event ID	Event Name
907	Switch is Down
909	Director Switch is Down
1312	Suspected switch Reboot

Action Required

Isolate and reboot the switch, then de-isolate it. If the problem persists, keep it isolated.
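
The order of operations can be sketched as a small routine. The four callables are hypothetical placeholders for whatever mechanism performs each step (for example, calls to the Actions REST API); nothing here reflects an actual UFM client library.

```python
from typing import Callable


def recover_switch(isolate: Callable[[], None],
                   reboot: Callable[[], None],
                   is_responsive: Callable[[], bool],
                   deisolate: Callable[[], None]) -> bool:
    """Isolate and reboot a switch, de-isolating it only if it responds afterwards.

    Returns True if the switch was returned to service, False if it stays isolated.
    """
    isolate()
    reboot()
    if is_responsive():
        deisolate()
        return True
    # The problem persists: leave the switch isolated.
    return False
```

Passing the steps in as callables keeps the sequence testable without touching real hardware.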

Additional Reading

Refer to Actions REST API.

Infinite Switch Reboots due to Switch HW Malfunction

Scenario

Infinite Switch Reboots due to Switch HW Malfunction

Description

Although it is unlikely, a switch software or hardware malfunction may lead to repeated switch reboots. Should this occur, the fabric might lose packets transmitted through that particular switch. Consequently, the SDN infrastructure could become busy managing these incidents and reconfiguring the network.

Fabric State

Any

How to Detect

UFM alerts about unhealthy switch

UFM Alerts

Event ID	Event Name
907	Switch is Down
909	Director Switch is Down
1312	Suspected switch Reboot

Action Required

Isolate and reboot the switch, then de-isolate it. If the problem persists, keep it isolated.

Additional Reading

Refer to Actions REST API.

UFM Server Failover

Scenario

UFM Server Failover

Description

UFM server failure causes UFM HA failover to standby host.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID	Event Name
602	UFM Server Failover

Action Required

Run a UFM health report and validate that the UFM failover completed successfully. Fix any hardware failures and restore the UFM High Availability (HA) cluster. Collect system dumps and open a support ticket through NVIDIA Networking Support.

SM Not Responding

Scenario

SM Not Responding

Description

The OpenSM process hangs.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID	Event Name
545	SM is not responding

Action Required

UFM is expected to restart the SM automatically. Run a UFM health report to verify that UFM was able to restart OpenSM. Gather sysdump data and open a support ticket through NVIDIA Networking Support.

Additional Reading

Fabric Validation Tests REST API

UFM Management Interface is Down

Scenario

UFM Management Interface is Down

Description

UFM server InfiniBand management interface is down.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID	Event Name
546	Management interface is down

Action Required

Verify the connectivity of the management interface. Run the "ibstat" command and confirm that the port reports "State: Active" and "Physical state: LinkUp". If the issue persists, collect sysdump information and open a support ticket through NVIDIA Networking Support.
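
A minimal sketch of checking these two fields programmatically, assuming the usual text layout of `ibstat` output (a `CA '<name>'` line, followed by `Port <n>:` blocks containing `State:` and `Physical state:` lines). The function names are hypothetical, and the parser deliberately ignores every other field.

```python
import re
from typing import Dict, List


def parse_ibstat_ports(output: str) -> List[Dict[str, str]]:
    """Extract the CA name, port number, state, and physical state of each port."""
    ports: List[Dict[str, str]] = []
    current_ca = ""
    for line in output.splitlines():
        stripped = line.strip()
        ca_match = re.match(r"CA '([^']+)'", stripped)
        port_match = re.match(r"Port (\d+):$", stripped)
        if ca_match:
            current_ca = ca_match.group(1)
        elif port_match:
            ports.append({"ca": current_ca, "port": port_match.group(1)})
        elif ports and stripped.startswith("State:"):
            ports[-1]["state"] = stripped.split(":", 1)[1].strip()
        elif ports and stripped.startswith("Physical state:"):
            ports[-1]["physical_state"] = stripped.split(":", 1)[1].strip()
    return ports


def port_is_up(port: Dict[str, str]) -> bool:
    """True only when the port is both Active and LinkUp."""
    return port.get("state") == "Active" and port.get("physical_state") == "LinkUp"
```

The text to parse could be obtained with `subprocess.run(["ibstat"], capture_output=True, text=True).stdout` on the UFM server.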

UFM Server Disk Utilization

Scenario

Threshold for UFM Server Disk Utilization Exceeded

Description

UFM server disk utilization threshold is reached and UFM is not able to free disk space.

Fabric State

Any

How to Detect

Monitor UFM events

UFM Alerts

Event ID	Event Name
525	Disk utilization threshold reached

Action Required

Remove third-party data from the UFM server disk. If the problem persists, collect a sysdump and open a support ticket with NVIDIA Networking Support.
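
Before and after the cleanup, disk utilization can be checked independently of UFM. The sketch below uses only the Python standard library; the default path and the 90% threshold are hypothetical and do not reflect the threshold that UFM itself is configured with.

```python
import shutil


def utilization_percent(used_bytes: int, total_bytes: int) -> float:
    """Disk utilization as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def disk_over_threshold(path: str = "/", threshold_percent: float = 90.0) -> bool:
    """True if the filesystem that contains path is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return utilization_percent(usage.used, usage.total) >= threshold_percent
```

Point `path` at the filesystem that holds the UFM data to see whether the cleanup brought utilization back under the limit.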

Duplicated GUIDs

Scenario

Duplicated GUIDs

Description

Cards or switches can be accidentally provisioned with duplicated GUIDs. This is a rare case.

Fabric State

Bring-up/maintenance

How to Detect

Subnet Manager detects and reports duplicated GUIDs via UFM

UFM Alerts

Event ID	Event Name
1310	Duplicated node GUID was detected
1311	Duplicated port GUID was detected

Action Required

Collect a sysdump and open a support ticket with NVIDIA Networking Support.
Turn off the switch/host.

Additional Reading

Actions REST API
