Event Management
This feature reports specific actions (such as a port disable) and events (such as a port fast-recovery) in a standardized format, each with a description, so that users can monitor system state remotely.
In the current release, the list of reported events is available only through the CLI. In the upcoming release, gNMI will be extended to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.
The following table lists the supported events and their descriptions.
| Resource | Event Description | Severity |
|---|---|---|
| System | System fatal state detected | CRITICAL |
| System | System is not ready - one or more services are not up | CRITICAL |
| System | System is not ready - one or more services failed | MAJOR |
| System | Restart all syncd-ibv0 dockers | MAJOR |
| System | Performing reboot | MAJOR |
| System | Health status is not ok | WARNING |
| System | System is ready | INFORMATIONAL |
| System | System recovered from fatal state | INFORMATIONAL |
| System | Health status is ok | INFORMATIONAL |
| Sensor or service name | <Repeats a message from the system health> | WARNING |
| Sensor or service name | Hardware component goes back to normal / Service goes back to normal | INFORMATIONAL |
| Interface name | Interface admin state is {Up/Down} | INFORMATIONAL |
| Interface name | Interface operational state is {Up/Down} | INFORMATIONAL |
| Interface name | Fast-recovery error event for trigger {trigger_name} was received | INFORMATIONAL |
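As an illustration of how a monitoring script might work with the events listed above, the following minimal sketch models each event as a record and filters by severity. The `Event` class, the `at_least` helper, and the numeric ordering of the severity levels are assumptions made for this example; the source gives only the severity names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering, least to most severe; the source lists the names only.
    INFORMATIONAL = 0
    WARNING = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Event:
    resource: str
    description: str
    severity: Severity


def at_least(events, minimum):
    """Return the events whose severity is `minimum` or higher."""
    return [e for e in events if e.severity >= minimum]


# Sample rows copied from the list of supported events.
EVENTS = [
    Event("System", "System fatal state detected", Severity.CRITICAL),
    Event("System", "Performing reboot", Severity.MAJOR),
    Event("System", "Health status is not ok", Severity.WARNING),
    Event("System", "System is ready", Severity.INFORMATIONAL),
]

urgent = at_least(EVENTS, Severity.MAJOR)
print([e.description for e in urgent])
```

Using an `IntEnum` keeps the comparison `e.severity >= minimum` readable; a different ordering only requires changing the enum values.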
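Each event below is described by the following fields, in this order: Event Category, Event Type ID ("event" in gNMI), Severity, Resource ("component" in gNMI), Text, Failure Reason, and Suggested Repair Flow (where one is given).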
Fan-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 speed is out of range, speed=40%, range=[50,100]" | Fan speed is out of range | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is not working" | Fan status reported by the hardware is not OK | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get actual speed data for FAN1/1" | Failed to get some fan data from the hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get target speed data for FAN1/1" | Failed to get some fan data from the hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed tolerance for FAN1/1" | Failed to get some fan data from the hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed status for FAN1/1" | Failed to get some fan data from the hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is missing" | Fan is missing | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 direction is not aligned with exhaust direction intake" | Fan direction is not aligned with the other fans | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" | Invalid speed | |
| Fan health | HEALTH_OK | INFORMATIONAL | FAN1/1 | "HW component goes back to normal" | Fan is back to normal state | N/A |
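The fan speed event text above carries its values in a `key=value` layout, so a client can recover them with a regular expression. The following is a minimal sketch that assumes the text always matches the sample shown ("speed=40%, range=[50,100]"); the function name `parse_fan_speed_event` is hypothetical and not part of the product.

```python
import re

# Pattern modeled on the sample text:
#   FAN1/1 speed is out of range, speed=40%, range=[50,100]
# The exact layout of real messages is an assumption.
FAN_SPEED_RE = re.compile(
    r"^(?P<fan>\S+) speed is out of range, "
    r"speed=(?P<speed>\d+)%, range=\[(?P<low>\d+),(?P<high>\d+)\]$"
)


def parse_fan_speed_event(text):
    """Return fan name, speed, and allowed range, or None if the text does not match."""
    match = FAN_SPEED_RE.match(text.strip('"'))
    if match is None:
        return None
    return {
        "fan": match["fan"],
        "speed": int(match["speed"]),
        "low": int(match["low"]),
        "high": int(match["high"]),
    }


info = parse_fan_speed_event('"FAN1/1 speed is out of range, speed=40%, range=[50,100]"')
print(info)
```

Texts for the other fan failures (for example "FAN1/1 is missing") do not match this pattern, so the function returns `None` for them.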
ASIC-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| ASIC failure | HEALTH_NOT_OK | WARNING | ASIC-HEALTH | Switch ASIC in fatal state | ASIC is in fatal state | |
| ASIC failure | SYSTEM_FATAL_DETECTED | CRITICAL | System | System fatal state detected | ASIC detected in fatal state | |
| ASIC failure | HEALTH_NOT_OK | WARNING | ASIC1 | ASIC1 temperature is too hot, temperature=120, threshold=105 | ASIC temperature is too high | |
| ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Restart all syncd-ibv0 dockers | ASIC is in fatal state; performing reboot of dockers | N/A |
| ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Performing reboot | ASIC is in fatal state; performing reboot | |
| ASIC health | HEALTH_OK | INFORMATIONAL | ASIC1 | "HW component goes back to normal" | ASIC1 is back to normal state | |
| ASIC health | SYSTEM_FATAL_RECOVERED | INFORMATIONAL | System | System recovered from fatal state | Recovered from fatal state | |
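Leakage-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Leakage | LEAKAGE | CRITICAL | LEAKAGE-1 | Leakage detected, inspect for water leakage and consider power down switch tray | Detected leakage | |
| Leakage | HEALTH_NOT_OK | WARNING | LEAKAGE-1 | LEAKAGE-1 detected leakage | Detected leakage | |

Note: Relevant only for liquid-cooled systems.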
Voltage-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor voltage is out of range, voltage={}, range=[{},{}] | Voltage sensor not in range | |
| Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor status is failed | Voltage sensor status in hardware is failed | |
| Voltage | HEALTH_OK | INFORMATIONAL | <Voltage-sensor-name> | "HW component goes back to normal" | Voltage sensor value is back to normal state | N/A |
Temperature-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | <Temp-sensor-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | |
| Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | Sensor status is failed | Sensor status in hardware is failed | |
| Temperature | HEALTH_OK | INFORMATIONAL | <Temp-sensor-name> | "HW component goes back to normal" | Temperature sensor value is back to normal state | N/A |
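System-Services-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| services | HEALTH_NOT_OK | WARNING | <container-name> | Container '<container-name>' is not running | Container is not running | Collect tech-support and submit NVIDIA support ticket. |
| services | HEALTH_OK | INFORMATIONAL | <container-name> | "Service goes back to normal" | Service goes back to normal state | N/A |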
System-Initialization-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Init flow | SYSTEM_STATE_DOWN | CRITICAL | System | System is not ready - one or more services are not up | Some services are not up as part of init | |
| Init flow | SYSTEM_STATE_FAILED | MAJOR | System | System is not ready - one or more services failed | Some services failed as part of init | |
| Init flow | SYSTEM_STATE_UP | INFORMATIONAL | System | System is ready | System finished initialization and is ready | N/A |
Interface-Related Informational Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| interface | INTERFACE_ADMIN_STATUS | INFORMATIONAL | <interface_name> | "Interface admin state is {admin_state}" | Informs of admin state change of interface | N/A |
| interface | INTERFACE_OPER_STATUS | INFORMATIONAL | <interface_name> | "Interface operational state is {up or down}" | Informs of operational state change of interface | N/A |
| interface | INTERFACE_LOGICAL_STATE | INFORMATIONAL | <interface_name> | "Interface logical state is {logical_state}" | Informs of logical state change of interface | N/A |
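Because each interface event reports a single state change, a client that wants the current state of every interface has to accumulate the events. The sketch below shows one way to do that with a dictionary keyed by interface name. The field names `admin`, `oper`, and `logical`, the helper `apply_event`, and the interface name `sw1p1` are made up for this example; only the Event Type IDs and the text templates come from the list above.

```python
import re

# Event Type ID -> state field it updates (field names are hypothetical).
FIELD_BY_EVENT = {
    "INTERFACE_ADMIN_STATUS": "admin",
    "INTERFACE_OPER_STATUS": "oper",
    "INTERFACE_LOGICAL_STATE": "logical",
}

# Matches the three text templates, e.g. "Interface admin state is up".
STATE_RE = re.compile(
    r"^Interface (?:admin|operational|logical) state is (?P<state>.+)$"
)


def apply_event(table, event_id, resource, text):
    """Update `table` in place from one interface event; return True if applied."""
    field = FIELD_BY_EVENT.get(event_id)
    match = STATE_RE.match(text)
    if field is None or match is None:
        return False
    table.setdefault(resource, {})[field] = match["state"]
    return True


interfaces = {}
apply_event(interfaces, "INTERFACE_ADMIN_STATUS", "sw1p1", "Interface admin state is up")
apply_event(interfaces, "INTERFACE_OPER_STATUS", "sw1p1", "Interface operational state is down")
print(interfaces)
```

Events with an unknown Event Type ID or an unexpected text are ignored rather than raising, so the tracker keeps running if new event types appear.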
System-Health-Related Events
(The events below are summaries that accompany the specific errors detailed above.)
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| system | HEALTH_SUMMARY_NOT_OK_CRITICAL | CRITICAL | System | Health status is not ok | At least one critical health-not-OK event is present (e.g. leakage) | Collect tech-support and submit NVIDIA support ticket. |
| system | HEALTH_SUMMARY_NOT_OK | WARNING | System | Health status is not ok | At least one warning-level health-not-OK event is present | |
| System | HEALTH_SUMMARY_OK | INFORMATIONAL | System | Health status is ok | System health has no issues | N/A |
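The relationship between the individual health events and the summary events above can be sketched as a small selection function. The rule used here (any critical event selects the critical summary, any other active event selects the warning summary, and no active events selects the OK summary) is inferred from the failure reasons listed for the three summary events and is an assumption, not a documented algorithm. The function name `summary_event` is hypothetical.

```python
def summary_event(active_severities):
    """Pick a summary Event Type ID from the severities of the active health problems.

    `active_severities` is an iterable of severity names such as "WARNING"
    or "CRITICAL", one per health problem that is currently active.
    """
    severities = set(active_severities)
    if "CRITICAL" in severities:
        return "HEALTH_SUMMARY_NOT_OK_CRITICAL"
    if severities:
        return "HEALTH_SUMMARY_NOT_OK"
    return "HEALTH_SUMMARY_OK"


print(summary_event(["WARNING", "CRITICAL"]))
print(summary_event(["WARNING"]))
print(summary_event([]))
```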
Transceiver-Related Events
| Event Category | Event Type ID | Severity | Resource | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]" | Temperature is critically high | N/A |
| Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]" | Temperature is critically low | N/A |
| Transceiver health | HEALTH_OK | INFORMATIONAL | sw1 | "HW component goes back to normal" | Transceiver's temperature is good now | N/A |
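The transceiver temperature texts above place the measured value and the violated threshold in a bracketed suffix. A minimal sketch for extracting them follows; it assumes the suffix always has the form shown in the two samples, and the function name `parse_transceiver_temperature` is hypothetical.

```python
import re

# Matches the bracketed suffix in the sample texts, e.g.
#   [actual = 100, high threshold = 80]
#   [actual = 1, low threshold = 5]
THRESHOLD_RE = re.compile(
    r"\[actual = (?P<actual>-?\d+), "
    r"(?P<kind>high|low) threshold = (?P<threshold>-?\d+)\]"
)


def parse_transceiver_temperature(text):
    """Return the actual value, threshold kind, and threshold, or None if absent."""
    match = THRESHOLD_RE.search(text)
    if match is None:
        return None
    actual = int(match["actual"])
    threshold = int(match["threshold"])
    kind = match["kind"]
    # A high threshold is violated from above, a low threshold from below.
    violated = actual > threshold if kind == "high" else actual < threshold
    return {"actual": actual, "kind": kind, "threshold": threshold, "violated": violated}


sample = (
    "Transceiver's temperature is higher than critical threshold "
    "[actual = 100, high threshold = 80]"
)
print(parse_transceiver_temperature(sample))
```

The `violated` flag is recomputed from the parsed numbers, which gives a simple consistency check against the wording of the message.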