List of Scenarios
This section provides a set of UFM API queries used to gather information about events identified by UFM. The events/scenarios are organized by their likelihood. For each event, comprehensive details are supplied, including a description, relevant fabric state, methods to retrieve the specific UFM event, corresponding event IDs, and suggested courses of action for remediation.
Scenario |
Bad Link/Port |
||||||
Description |
A port drops an excessive number of packets. |
||||||
Fabric State |
Any fabric state |
||||||
How to Detect |
Via UFM health reports and UFM events/alerts |
||||||
UFM Alerts |
|
||||||
Action Required |
Verify that the firmware of both the device and the cable is up to date. Fixing a bad link involves a sequence of actions on both ends of the link. Perform the actions in order, checking after each one whether the issue has been resolved. These actions are:
|
||||||
Additional Reading |
Refer to Cable Validation Tool and Reports REST API { "cables": true, "symbol_ber_check:": true, "effective_ber_check:": true } |
Scenario |
Wrong Connection |
||||
Description |
A cable that is not connected according to the topology can act as a bottleneck. |
||||
Fabric State |
Bring-up/maintenance |
||||
How to Detect |
Turn on the UFM topology compare capability. Using the Topology Compare REST API, set a master topology and schedule topodiff to run once every 3 hours. |
||||
UFM Alerts |
|
||||
Action Required |
Fix Misconnection |
Scenario |
Host is Hanging |
||||
Description |
The host stops responding, most commonly because of a software issue or overload. |
||||
Fabric State |
Any |
||||
How to Detect |
SM alerts about unresponsive host. |
||||
UFM Alerts |
|
||||
Action Required |
Fix stuck host |
Scenario |
Low Bandwidth |
||||||
Description |
In the event that certain network cables remain disconnected and unfixed, the network may lose bandwidth. While application performance decline can occur due to various factors, the most common factor is losing network links. |
||||||
Fabric State |
Bring-up/maintenance |
||||||
UFM API Query |
GET /ufmRest/app/events |
||||||
How to Detect |
Customer complaints. Examine UFM Congestion dashboard. |
||||||
UFM Alerts |
|
||||||
Action Required |
Assess the network's bandwidth while considering the impact of the disconnected links, as this could explain the decrease in bandwidth. If the calculations align, it is crucial to replace the affected cables. In cases where the figures do not match, a deeper analysis may be necessary, including an analysis of the application's traffic pattern and the utilized transport methods. |
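The assessment described above can be sketched as a simple proportional estimate. This is only an illustration: it assumes all links run at the same speed and that traffic is spread evenly across them, which real fabrics may not satisfy. The function names, the tolerance, and the example figures are hypothetical and do not come from UFM.

```python
def expected_bandwidth_fraction(total_links: int, disconnected_links: int) -> float:
    """Fraction of nominal bandwidth left, assuming equal-speed, evenly loaded links."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    if not 0 <= disconnected_links <= total_links:
        raise ValueError("disconnected_links must be between 0 and total_links")
    return (total_links - disconnected_links) / total_links


def loss_explained_by_links(nominal_gbps: float, measured_gbps: float,
                            total_links: int, disconnected_links: int,
                            tolerance: float = 0.05) -> bool:
    """True if the measured bandwidth is within `tolerance` of what the
    missing links alone would predict (hypothetical 5% default)."""
    expected = nominal_gbps * expected_bandwidth_fraction(total_links, disconnected_links)
    return measured_gbps >= expected * (1.0 - tolerance)
```

If `loss_explained_by_links` returns `True`, the disconnected cables plausibly account for the drop and should be replaced. If it returns `False`, the deeper analysis of traffic patterns and transport methods is the next step.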
Scenario |
SHARP AM Issue |
||||||||
Description |
SHARP job failure |
||||||||
Fabric State |
Any |
||||||||
UFM API Query |
GET /ufmRest/app/events |
||||||||
How to Detect |
Review UFM events |
||||||||
UFM Alerts |
|
||||||||
Action Required |
|
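As a rough sketch, the `GET /ufmRest/app/events` query listed above could be issued from Python's standard library as follows. The host name, the use of HTTP basic authentication, and the event field names used for filtering are assumptions made for illustration. Check the UFM REST API documentation for the actual authentication options and response schema.

```python
import base64
import json
import urllib.request

EVENTS_PATH = "/ufmRest/app/events"  # path as given in the table above


def build_events_request(host: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for the events endpoint (basic auth is an assumption)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(f"https://{host}{EVENTS_PATH}", method="GET")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


def filter_events(events: list, field: str, wanted: str) -> list:
    """Keep events whose `field` equals `wanted` (case-insensitive).
    The field names passed in are hypothetical; inspect a real response first."""
    return [e for e in events if str(e.get(field, "")).lower() == wanted.lower()]


def fetch_events(host: str, user: str, password: str, timeout: float = 10.0) -> list:
    """Send the request and decode the JSON body (assumed to be a list of events)."""
    req = build_events_request(host, user, password)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A server with a self-signed certificate would additionally need an explicit SSL context. That is left out here to keep the sketch short.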
Scenario |
Inadequate Control of Cluster Temperature |
||||||||
Description |
The cluster's temperature could escalate rapidly or surpass a designated threshold, potentially risking equipment damage. |
||||||||
Fabric State |
Any |
||||||||
How to Detect |
UFM temperature alert. Via REST API, request URL: POST /ufmRest/fabricValidation/tests/CheckTemperature. For more information, refer to Fabric Validation Tests REST API. |
||||||||
UFM Alerts |
|
||||||||
Action Required |
Fix the fans/cooling/air conditioning systems. |
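The temperature check named under "How to Detect" could be triggered with a short script like the one below. Only the path and the POST method come from this section. The host, the pre-built `Authorization` header value, and the empty request body are assumptions, and the response format is not described here, so the sketch returns the raw status and body without interpreting them.

```python
import urllib.request

# Path as given in the table above.
CHECK_TEMPERATURE_PATH = "/ufmRest/fabricValidation/tests/CheckTemperature"


def build_check_temperature_request(host: str, auth_header: str) -> urllib.request.Request:
    """Build the POST request; an empty body is assumed to be acceptable."""
    req = urllib.request.Request(
        f"https://{host}{CHECK_TEMPERATURE_PATH}", data=b"", method="POST"
    )
    req.add_header("Authorization", auth_header)
    return req


def run_check_temperature(host: str, auth_header: str, timeout: float = 30.0):
    """Send the request and return (HTTP status, raw response text)."""
    req = build_check_temperature_request(host, auth_header)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```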
Scenario |
Non-Responsive Switch |
||||||||
Description |
The switch is not responding due to software or hardware issues. |
||||||||
Fabric State |
Any |
||||||||
How to Detect |
Examine the UFM generated alerts for such cases |
||||||||
UFM Alerts |
|
||||||||
Action Required |
Isolate and reboot the switch, then de-isolate it. If the problem persists, keep it isolated. |
||||||||
Additional Reading |
Refer to Actions REST API. |
Scenario |
Infinite Switch Reboots due to Switch HW Malfunction |
||||||||
Description |
Although unlikely, a switch software or hardware malfunction may cause the switch to reboot repeatedly. When this occurs, packets transmitted through that switch may be lost, and the SDN infrastructure may become busy handling these incidents and reconfiguring the network. |
||||||||
Fabric State |
Any |
||||||||
How to Detect |
UFM alerts about unhealthy switch |
||||||||
UFM Alerts |
|
||||||||
Action Required |
Isolate and reboot the switch, then de-isolate it. If the problem persists, keep it isolated. |
||||||||
Additional Reading |
Refer to Actions REST API. |
Scenario |
UFM Server Failover |
||||
Description |
A UFM server failure causes UFM HA failover to the standby host. |
||||
Fabric State |
Any |
||||
How to Detect |
Monitor UFM events |
||||
UFM Alerts |
|
||||
Action Required |
Run a UFM health report and validate that the UFM failover completed successfully. Fix any hardware failures and restore the UFM High Availability (HA) cluster. Collect system dumps and open a support ticket through NVIDIA Networking Support. |
Scenario |
SM Not Responding |
||||
Description |
The OpenSM process hangs. |
||||
Fabric State |
Any |
||||
How to Detect |
Monitor UFM events |
||||
UFM Alerts |
|
||||
Action Required |
UFM is expected to restart the SM automatically. Run a UFM health report to verify that UFM was able to restart OpenSM. Collect sysdump data and open a support ticket through NVIDIA Networking Support. |
||||
Additional Reading |
Scenario |
UFM Management Interface is Down |
||||||
Description |
UFM server InfiniBand management interface is down. |
||||||
Fabric State |
Any |
||||||
How to Detect |
Monitor UFM events |
||||||
UFM Alerts |
|
||||||
Action Required |
Verify the connectivity of the management interface. Run the "ibstat" command and confirm that it reports "State: Active" and "Physical state: LinkUp". If the issue persists, collect sysdump information and open a support ticket through NVIDIA Networking Support. |
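The `ibstat` check above can be automated with a small parser. The sketch below assumes the usual `ibstat` text layout (a `CA '<name>'` header followed by `Port <n>:` blocks containing `State:` and `Physical state:` lines). The function names are hypothetical, and the parser should be validated against the output of the installed tool.

```python
import re
import subprocess


def parse_ibstat(text: str) -> list:
    """Extract State / Physical state for every port in ibstat-style output."""
    ports = []
    ca_name = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        ca_match = re.match(r"CA '(.+)'", line)
        if ca_match:
            ca_name = ca_match.group(1)
            continue
        port_match = re.match(r"Port (\d+):", line)
        if port_match:
            current = {"ca": ca_name, "port": int(port_match.group(1))}
            ports.append(current)
            continue
        key, sep, value = line.partition(":")
        if current is not None and sep and key in ("State", "Physical state"):
            current[key] = value.strip()
    return ports


def port_is_up(port: dict) -> bool:
    """True only when the port reports State: Active and Physical state: LinkUp."""
    return port.get("State") == "Active" and port.get("Physical state") == "LinkUp"


def read_ibstat() -> str:
    """Run ibstat locally and return its text output."""
    return subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
```

Typical use would be `all(port_is_up(p) for p in parse_ibstat(read_ibstat()))`, restricted to the ports that actually serve as the UFM management interface.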
Scenario |
Threshold for UFM Server Disk Utilization Exceeded |
||||
Description |
UFM server disk utilization threshold is reached and UFM is not able to free disk space. |
||||
Fabric State |
Any |
||||
How to Detect |
Monitor UFM events |
||||
UFM Alerts |
|
||||
Action Required |
Remove third-party data from the UFM server disk. If the problem persists, collect a sysdump and open a support ticket with NVIDIA Networking Support. |
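Before and after cleaning up, disk utilization on the UFM server can be checked independently of UFM with a few lines of Python. This is a generic sketch: the 90% threshold is a hypothetical value, not the threshold UFM itself uses, and the path to check depends on where UFM data is stored on the server.

```python
import shutil


def used_percent(used_bytes: int, total_bytes: int) -> float:
    """Percentage of the disk that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def disk_over_threshold(path: str, threshold_percent: float = 90.0) -> bool:
    """True if the filesystem containing `path` is at or above the threshold.
    The default threshold is an illustrative assumption."""
    usage = shutil.disk_usage(path)
    return used_percent(usage.used, usage.total) >= threshold_percent
```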
Scenario |
Duplicated GUIDs |
||||||
Description |
Cards or switches can be accidentally provisioned with duplicated GUIDs. This is a rare case. |
||||||
Fabric State |
Bring-up/maintenance |
||||||
How to Detect |
Subnet Manager detects and reports duplicated GUIDs via UFM |
||||||
UFM Alerts |
|
||||||
Action Required |
|
||||||
Additional Reading |