date 2025-09-22

Fan-Related Events

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "FAN1/1 speed is out of range, speed=40%, range=[50,100]" Fan speed out of range Collect tech-support and submit NVIDIA support ticket.

Consider the number of faulty fans: more than one faulty fan requires immediate maintenance.

Power-cycle the switch.

If the issue persists, submit an NVIDIA support ticket to replace the fan module.

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "FAN1/1 is not working" Fan status is not okay (status in the hardware)

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "Failed to get actual speed data for FAN1/1" Failed to read the fan's actual speed from the hardware

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "Failed to get target speed data for FAN1/1" Failed to read the fan's target speed from the hardware

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "Failed to get speed tolerance for FAN1/1" Failed to read the fan's speed tolerance from the hardware

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "Failed to get speed status for FAN1/1" Failed to read the fan's speed status from the hardware

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "FAN1/1 is missing" Fan is missing

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "FAN1/1 direction is not aligned with exhaust direction intake" Fan direction is not aligned with other fans

Fan failure HEALTH_NOT_OK WARNING FAN1/1 "Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" Invalid speed

Fan health HEALTH_OK INFORMATIONAL FAN1/1 “HW component goes back to normal” Fan is back to normal state N/A
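The fan guidance above (more than one faulty fan requires immediate maintenance) can be sketched as a small triage helper. This is a minimal sketch, not NVIDIA tooling: the `(event_name, component)` tuple layout and the function names `speed_in_range`, `faulty_fans`, and `needs_immediate_maintenance` are hypothetical, and the regular expression assumes the out-of-range message keeps the exact shape shown in the table.

```python
import re
from typing import Iterable, Optional, Set, Tuple

# Hypothetical event layout for illustration: (event_name, component).
Event = Tuple[str, str]

# Assumes the message shape "speed=40%, range=[50,100]" shown in the table.
_SPEED_RE = re.compile(r"speed=(\d+)%, range=\[(\d+),(\d+)\]")


def speed_in_range(message: str) -> Optional[bool]:
    """Return True/False if the message carries speed data, else None."""
    match = _SPEED_RE.search(message)
    if match is None:
        return None
    speed, low, high = (int(group) for group in match.groups())
    return low <= speed <= high


def faulty_fans(events: Iterable[Event]) -> Set[str]:
    """Return fan components whose most recent event is HEALTH_NOT_OK."""
    latest = {}
    for name, component in events:
        if component.startswith("FAN"):
            latest[component] = name  # later events overwrite earlier ones
    return {fan for fan, name in latest.items() if name == "HEALTH_NOT_OK"}


def needs_immediate_maintenance(events: Iterable[Event]) -> bool:
    """More than one faulty fan calls for immediate maintenance."""
    return len(faulty_fans(events)) > 1
```

The helper treats a later HEALTH_OK event for the same fan as clearing the earlier failure, which is an assumption about event ordering rather than documented behavior.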

ASIC-Related Events

ASIC failure HEALTH_NOT_OK WARNING ASIC-HEALTH Switch ASIC in fatal state ASIC is in fatal state Verify that the switch and compute trays use the correct software and firmware bundle recipe. Collect tech-support and submit NVIDIA support ticket.

Reboot the system.

If the issue persists, power-cycle the system.

ASIC failure SYSTEM_FATAL_DETECTED CRITICAL System System fatal state detected Detect ASIC in fatal

ASIC failure HEALTH_NOT_OK WARNING ASIC1 ASIC1 temperature is too hot, temperature=120, threshold=105 ASIC temp too high Collect tech-support and submit NVIDIA support ticket.

Continue to monitor switch temperature.

ASIC failure SYSTEM_FATAL_REMEDY MAJOR System Restart all syncd-ibv0 dockers ASIC in fatal state; performing reboot of dockers N/A

ASIC failure SYSTEM_FATAL_REMEDY MAJOR System Performing reboot ASIC in fatal state; performing reboot

ASIC health HEALTH_OK INFORMATIONAL ASIC1 “HW component goes back to normal” ASIC1 is back to normal state

ASIC health SYSTEM_FATAL_RECOVERED INFORMATIONAL System System recovered from fatal state Recovered from fatal state
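The fatal-state events above suggest a lifecycle: SYSTEM_FATAL_DETECTED, then SYSTEM_FATAL_REMEDY, then SYSTEM_FATAL_RECOVERED. The following is a minimal sketch of tracking whether a fatal state is still open. It assumes event names arrive in chronological order and that a recovery event clears an earlier detection; the function name `fatal_state_open` is hypothetical.

```python
from typing import Iterable


def fatal_state_open(event_names: Iterable[str]) -> bool:
    """Return True if a fatal state was detected and not yet recovered.

    Assumption: event_names is ordered oldest to newest.
    """
    is_open = False
    for name in event_names:
        if name == "SYSTEM_FATAL_DETECTED":
            is_open = True
        elif name == "SYSTEM_FATAL_RECOVERED":
            is_open = False
        # SYSTEM_FATAL_REMEDY only reports a recovery attempt, so it
        # does not change the state in this sketch.
    return is_open
```

A monitoring script could use such a check to decide whether to escalate (reboot, then power-cycle) or simply record that the system recovered on its own.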

Leakage-Related Events

Leakage LEAKAGE CRITICAL LEAKAGE-1 Leakage detected, inspect for water leakage and consider power down switch tray Detected leakage Collect tech-support and submit NVIDIA support ticket.

For additional instructions, refer to NVONLINE 1115991, chapter "NVIDIA MGX Leak Detection Strategy and Remediation".

NOTE: Relevant only for liquid-cooled systems.

Leakage HEALTH_NOT_OK WARNING LEAKAGE-1 LEAKAGE-1 detected leakage Detected leakage

Voltage-Related Events

Voltage HEALTH_NOT_OK WARNING <Voltage-sensor-name> Sensor voltage is out of range, voltage={}, range=[{},{}] Voltage sensor not in range Collect tech-support and submit NVIDIA support ticket.

Power-cycle the switch. If the issue persists, check the busbar power supply if the sensor is one of the following: HSC-VinDC-In, PDB-1-Conv-In-1, PDB-2-Conv-In-1, PDB-3-Conv-In-1, PDB-4-Conv-In-1. If the issue still persists, consider replacing the system.

Voltage HEALTH_NOT_OK WARNING <Voltage-sensor-name> Sensor status is failed Voltage sensor status in hardware is failed

Voltage HEALTH_OK INFORMATIONAL <Voltage-sensor-name> “HW component goes back to normal” Voltage sensor value is back to normal state N/A
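The voltage guidance above singles out a fixed list of sensors for a busbar power supply check. A minimal sketch of that lookup, together with a parser for the out-of-range message template, might look as follows. The names `BUSBAR_CHECK_SENSORS`, `should_check_busbar`, and `parse_voltage` are hypothetical, and the parser assumes the `{}` placeholders are filled with plain decimal numbers, which the table does not confirm.

```python
import re
from typing import Optional, Tuple

# Sensors for which the table recommends checking the busbar power supply.
BUSBAR_CHECK_SENSORS = frozenset({
    "HSC-VinDC-In",
    "PDB-1-Conv-In-1",
    "PDB-2-Conv-In-1",
    "PDB-3-Conv-In-1",
    "PDB-4-Conv-In-1",
})

_NUM = r"[-+]?\d+(?:\.\d+)?"
# Assumes the template "voltage={}, range=[{},{}]" is filled with decimals.
_VOLTAGE_RE = re.compile(rf"voltage=({_NUM}), range=\[({_NUM}),({_NUM})\]")


def should_check_busbar(sensor_name: str) -> bool:
    """Return True if the sensor is in the busbar-check list."""
    return sensor_name in BUSBAR_CHECK_SENSORS


def parse_voltage(message: str) -> Optional[Tuple[float, float, float]]:
    """Return (voltage, low, high) from the message, or None if absent."""
    match = _VOLTAGE_RE.search(message)
    if match is None:
        return None
    voltage, low, high = (float(group) for group in match.groups())
    return voltage, low, high
```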

Temperature-Related Events

Temperature HEALTH_NOT_OK WARNING <Temp-sensor-name> <Temp-sensor-name> temperature is too hot, temperature={}, threshold={} Temperature too hot Collect tech-support and submit NVIDIA support ticket.

Power-cycle the switch.

If the issue persists, check whether the sensor is Ambient-MNG-Temp. If it is, check the environmental conditions (CDU and DC temperature).

If the issue still persists, consider replacing the system.

Temperature HEALTH_NOT_OK WARNING <Temp-sensor-name> Sensor status is failed Sensor status in hardware is failed

Temperature HEALTH_OK INFORMATIONAL <Temp-sensor-name> “HW component goes back to normal” Temperature sensor value is back to normal state N/A
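The "temperature is too hot" message carries the sensor name, the reading, and the threshold, so a script can extract them and apply the Ambient-MNG-Temp rule from the steps above. This is a minimal sketch that assumes the message follows the template in the table exactly and that values are plain numbers; `parse_too_hot` and `is_environmental_check_needed` are hypothetical names.

```python
import re
from typing import Optional, Tuple

_NUM = r"\d+(?:\.\d+)?"
# Assumes "<sensor> temperature is too hot, temperature={}, threshold={}".
_TOO_HOT_RE = re.compile(
    rf"^(?P<sensor>.+?) temperature is too hot, "
    rf"temperature=(?P<temp>{_NUM}), threshold=(?P<threshold>{_NUM})$"
)


def parse_too_hot(message: str) -> Optional[Tuple[str, float, float]]:
    """Return (sensor, temperature, threshold), or None if no match."""
    match = _TOO_HOT_RE.match(message)
    if match is None:
        return None
    return (
        match.group("sensor"),
        float(match.group("temp")),
        float(match.group("threshold")),
    )


def is_environmental_check_needed(sensor: str) -> bool:
    """The table ties environmental checks to the Ambient-MNG-Temp sensor."""
    return sensor == "Ambient-MNG-Temp"
```

For the ASIC example message listed earlier, `parse_too_hot("ASIC1 temperature is too hot, temperature=120, threshold=105")` yields the sensor `ASIC1` with a reading 15 above its threshold.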

System-Services-Related Events

services HEALTH_NOT_OK WARNING <container-name> Container '<container-name>' is not running Container is not running Collect tech-support and submit NVIDIA support ticket.

services HEALTH_OK INFORMATIONAL <container-name> “Service goes back to normal” Service goes back to normal state N/A

System-Initialization-Related Events

Init flow SYSTEM_STATE_DOWN CRITICAL System System is not ready—one or more services are not up Some services are not up as part of init Collect tech-support and submit NVIDIA support ticket.

If the sensor is Ambient-MNG-Temp:

Check the environmental conditions (CDU, if present, and DC temperature).

If the issue persists, power-cycle the switch.

If the issue still persists, replace the switch.

Init flow SYSTEM_STATE_FAILED MAJOR System System is not ready—one or more services failed Some services failed as part of init

Init flow SYSTEM_STATE_UP INFORMATIONAL System System is ready System finished initialization and is ready N/A

Interface-Related Informational Events

interface INTERFACE_ADMIN_STATUS INFORMATIONAL <interface_name> "Interface admin state is {admin_state}" Informs of admin state change of interface N/A

interface INTERFACE_OPER_STATUS INFORMATIONAL <interface_name> "Interface operational state is {up or down}" Informs of operational state change of interface N/A

interface INTERFACE_LOGICAL_STATE INFORMATIONAL <interface_name> "Interface logical state is {logical_state}" Informs of logical state change of interface N/A

System-Health-Related Events (The events below are summaries that accompany the specific errors detailed above)

system HEALTH_SUMMARY_NOT_OK_CRITICAL CRITICAL System Health status is not ok At least one critical health-not-okay event is present (e.g., leakage) Collect tech-support and submit NVIDIA support ticket.

system HEALTH_SUMMARY_NOT_OK WARNING System Health status is not ok At least one warning-level health-not-okay event is present

system HEALTH_SUMMARY_OK INFORMATIONAL System Health status is ok System health with no issue N/A
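Since the summary events accompany the specific errors, one way to picture the relationship is as a reduction over the severities of the currently open events. The sketch below is an illustration only: the table does not document how the system derives its summary, and treating any non-informational, non-critical severity (such as MAJOR) as "not ok" is an assumption. The function name `health_summary` is hypothetical.

```python
from typing import Iterable


def health_summary(open_severities: Iterable[str]) -> str:
    """Map the severities of open events to a summary event name.

    Assumption: CRITICAL dominates; any other non-INFORMATIONAL
    severity yields the warning-level summary.
    """
    severities = set(open_severities)
    if "CRITICAL" in severities:
        return "HEALTH_SUMMARY_NOT_OK_CRITICAL"
    if severities - {"INFORMATIONAL"}:
        return "HEALTH_SUMMARY_NOT_OK"
    return "HEALTH_SUMMARY_OK"
```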

Transceiver-Related Events

Transceiver failure HEALTH_NOT_OK WARNING sw1 "Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]" Temperature is critically high N/A

Transceiver failure HEALTH_NOT_OK WARNING sw1 "Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]" Temperature is critically low N/A
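The transceiver messages above embed the reading and the violated threshold in a bracketed suffix. A minimal sketch of extracting those values follows; it assumes the bracket format shown in the two example messages, and the function name `parse_transceiver_temp` is hypothetical.

```python
import re
from typing import Optional

_NUM = r"-?\d+(?:\.\d+)?"
# Assumes the suffix "[actual = 100, high threshold = 80]" (or "low").
_BRACKET_RE = re.compile(
    rf"\[actual = ({_NUM}), (high|low) threshold = ({_NUM})\]"
)


def parse_transceiver_temp(message: str) -> Optional[dict]:
    """Return the reading, threshold kind, threshold, and violation flag."""
    match = _BRACKET_RE.search(message)
    if match is None:
        return None
    actual = float(match.group(1))
    kind = match.group(2)
    threshold = float(match.group(3))
    # A high threshold is violated from above, a low threshold from below.
    violated = actual > threshold if kind == "high" else actual < threshold
    return {
        "actual": actual,
        "kind": kind,
        "threshold": threshold,
        "violated": violated,
    }
```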