Date: 2025-09-08
Event Management
This feature reports specific actions (such as a port disable) and events (such as a port fast-recovery) by broadcasting them in a standardized format with a description. The goal is to simplify remote monitoring of the system state.
In the current release, the list of broadcast events can be accessed only through the CLI. In an upcoming release, gNMI support will be expanded to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.
The following table lists the supported events and their descriptions.

| Resource | Event Description | Severity |
| --- | --- | --- |
| System | System fatal state detected | CRITICAL |
| System | System is not ready - one or more services are not up | CRITICAL |
| System | System is not ready - one or more services failed | MAJOR |
| System | Restart all syncd-ibv0 dockers | MAJOR |
| System | Performing reboot | MAJOR |
| System | Health status is not ok | WARNING |
| System | System is ready | INFORMATIONAL |
| System | System recovered from fatal state | INFORMATIONAL |
| System | Health status is ok | INFORMATIONAL |
| Sensor or service name | <Repeats a message from the system health> | WARNING |
| Sensor or service name | Hardware component goes back to normal / Service goes back to normal | INFORMATIONAL |
| Interface name | Interface admin state is {Up/Down} | INFORMATIONAL |
| Interface name | Interface operational state is {Up/Down} | INFORMATIONAL |
| Interface name | Fast-recovery error event for trigger {trigger_name} was received | INFORMATIONAL |
| System | System reboot occurred / reason: / performed by user: / reboot time: (for a list of reboot causes, see Possible Reboot Causes) | INFORMATIONAL |
| ASIC name | PSC detected failure | MAJOR |
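A monitoring client will often want to act only on the more severe events. The sketch below filters events by severity. The ranking CRITICAL > MAJOR > WARNING > INFORMATIONAL is an assumption (this page lists the severities but does not state an order), and the tuple layout and the helper name `at_least` are purely illustrative.

```python
# Assumed ranking, most to least severe; the page itself does not define an order.
SEVERITY_RANK = {"CRITICAL": 3, "MAJOR": 2, "WARNING": 1, "INFORMATIONAL": 0}


def at_least(events, minimum):
    """Keep (resource, description, severity) tuples at or above `minimum`."""
    floor = SEVERITY_RANK[minimum]
    return [event for event in events if SEVERITY_RANK[event[2]] >= floor]


# A few rows copied from the table of supported events.
events = [
    ("System", "System fatal state detected", "CRITICAL"),
    ("System", "Performing reboot", "MAJOR"),
    ("System", "Health status is not ok", "WARNING"),
    ("System", "System is ready", "INFORMATIONAL"),
]

urgent = at_least(events, "MAJOR")
for resource, description, severity in urgent:
    print(f"{severity}: {resource}: {description}")
```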
Receiving Clearing Events
For most WARNING and CRITICAL events, the system sends a "Clearing" event once the issue is resolved. If a component experiences two or more issues, the system sends an event about the last issue and, once all the issues are resolved, sends a "Clearing" event for the last issue only. Receiving a "Clearing" event therefore means that all issues for the component are resolved.
Example:
PSU2 is out of power
PSU2 is missing or not available
Cleared: PSU2 is missing or not available
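On the receiving side, this behavior means a client only needs to remember the most recent open issue per component and can drop it when a clearing event arrives. The following is a minimal sketch under two assumptions: events reach the client as (component, text) pairs, and a clearing event is recognized by the "Cleared: " prefix seen in the example. The function name `apply_event` is hypothetical.

```python
CLEARED_PREFIX = "Cleared: "  # prefix taken from the example; assumed to be stable


def apply_event(active, component, text):
    """Update `active`, a dict mapping component -> latest open issue text."""
    if text.startswith(CLEARED_PREFIX):
        # A clearing event means every issue for this component is resolved.
        active.pop(component, None)
    else:
        # A new (or resent) issue replaces whatever was recorded before.
        active[component] = text
    return active


active = {}
stream = [
    ("PSU2", "PSU2 is out of power"),
    ("PSU2", "PSU2 is missing or not available"),
    ("PSU2", "Cleared: PSU2 is missing or not available"),
]
for component, text in stream:
    apply_event(active, component, text)

print(active)  # no open issues remain for PSU2
```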
Resending Unresolved Issues
If one of the earlier issues for the component still exists after the last one is resolved, the system resends the event for the issue that still exists.
Example:
PSU2 is out of power
PSU2 is missing or not available
PSU2 is out of power
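The clearing and resending rules can be pictured with a small toy model of the sender. This is only an illustration of the behavior described on this page, not the system's actual implementation. The class and method names are hypothetical, and the case where an earlier issue is resolved while the last one is still open (modeled here as sending nothing) is an assumption.

```python
class ComponentIssues:
    """Toy model of the events sent for a single component."""

    def __init__(self):
        self.open = []             # unresolved issue texts, oldest first
        self.last_reported = None  # text of the most recently sent issue

    def raise_issue(self, text):
        """A new issue appears; the returned text is the event sent."""
        if text not in self.open:
            self.open.append(text)
        self.last_reported = text
        return text

    def resolve_issue(self, text):
        """An issue is resolved; return the event sent, or None."""
        self.open.remove(text)
        if not self.open:
            # Everything is resolved: clear the last reported issue only.
            cleared = f"Cleared: {self.last_reported}"
            self.last_reported = None
            return cleared
        if text == self.last_reported:
            # The last issue is gone but an earlier one remains: resend it.
            self.last_reported = self.open[-1]
            return self.last_reported
        return None  # assumption: nothing is sent in this case


psu2 = ComponentIssues()
sent = [
    psu2.raise_issue("PSU2 is out of power"),
    psu2.raise_issue("PSU2 is missing or not available"),
    psu2.resolve_issue("PSU2 is missing or not available"),
]
print("\n".join(sent))
```

Under these assumptions, the three events produced are the same three lines as in the example.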
Backward Compatibility
Backward compatibility is preserved: when an issue for a component is cleared, the system also sends the generic event. Avoid relying on generic messages, because they will be dropped in future releases.
Example:
Cleared: PSU2/FAN speed is out of range
Hardware component goes back to normal
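A client that wants to stop depending on generic messages ahead of their removal could simply ignore them. The sketch below drops them by exact text match. The set of generic texts is collected from the strings that appear on this page, and treating an exact match as sufficient is an assumption; `drop_generic` is a hypothetical helper name.

```python
# Generic texts as they appear on this page; exact matching is an assumption.
GENERIC_TEXTS = {
    "Hardware component goes back to normal",
    "HW component goes back to normal",
    "Service goes back to normal",
}


def drop_generic(texts):
    """Return the event texts with generic clearing messages removed."""
    return [text for text in texts if text not in GENERIC_TEXTS]


received = [
    "Cleared: PSU2/FAN speed is out of range",
    "Hardware component goes back to normal",
]
kept = drop_generic(received)
print(kept)
```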
System Reboot
After a reboot, the system does not send "Clearing" events for issues reported before the reboot; it assumes that everything is cleared.
Event Category
Event Type ID ("event" in gNMI)
Severity
Resource ("component" in gNMI)
Text
Failure Reason
Suggested Repair Flow
Fan-Related Events
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"FAN1/1 speed is out of range, speed=40%, range=[50,100]"
Fan speed out of range
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"FAN1/1 is not working"
Fan status reported by the hardware is not OK
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"Failed to get actual speed data for FAN1/1"
Failed to retrieve some fan data from the hardware
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"Failed to get target speed data for FAN1/1"
Failed to retrieve some fan data from the hardware
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"Failed to get speed tolerance for FAN1/1"
Failed to retrieve some fan data from the hardware
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"Failed to get speed status for FAN1/1"
Failed to retrieve some fan data from the hardware
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"FAN1/1 is missing"
Fan is missing
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"FAN1/1 direction is not aligned with exhaust direction intake"
Fan direction is not aligned with other fans
Fan failure
HEALTH_NOT_OK
WARNING
FAN1/1
"Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100"
Invalid speed
Fan health
HEALTH_OK
INFORMATIONAL
FAN1/1
“HW component goes back to normal”
Fan is back to normal state
N/A
ASIC-Related Events
ASIC failure
HEALTH_NOT_OK
WARNING
ASIC-HEALTH
Switch ASIC in fatal state
ASIC is in a fatal state
ASIC failure
SYSTEM_FATAL_DETECTED
CRITICAL
System
System fatal state detected
ASIC detected in a fatal state
ASIC failure
HEALTH_NOT_OK
WARNING
ASIC1
ASIC1 temperature is too hot, temperature=120, threshold=105
ASIC temperature is too high
ASIC failure
SYSTEM_FATAL_REMEDY
MAJOR
System
Restart all syncd-ibv0 dockers
ASIC is in a fatal state; performing a restart of the dockers
N/A
ASIC failure
SYSTEM_FATAL_REMEDY
MAJOR
System
Performing reboot
ASIC is in a fatal state; performing a reboot
ASIC health
HEALTH_OK
INFORMATIONAL
ASIC1
“HW component goes back to normal”
ASIC1 is back to normal state
ASIC health
SYSTEM_FATAL_RECOVERED
INFORMATIONAL
System
System recovered from fatal state
Recovered from fatal state
ASIC security irregularity
ASIC_SECURITY_IRREGULARITY
MAJOR
ASIC1
"PSC detected failure"
The system has detected an irregularity in physical monitoring
N/A
Leakage-Related Events
Leakage
LEAKAGE
CRITICAL
LEAKAGE-1
Leakage detected, inspect for water leakage and consider power down switch tray
Detected leakage
NOTE: Relevant only for liquid-cooled-based systems.
Leakage
HEALTH_NOT_OK
WARNING
LEAKAGE-1
LEAKAGE-1 detected leakage
Detected leakage
Voltage-Related Events
Voltage
HEALTH_NOT_OK
WARNING
<Voltage-sensor-name>
Sensor voltage is out of range, voltage={}, range=[{},{}]
Voltage sensor not in range
Voltage
HEALTH_NOT_OK
WARNING
<Voltage-sensor-name>
Sensor status is failed
Voltage sensor status in hardware is failed
Voltage
HEALTH_OK
INFORMATIONAL
<Voltage-sensor-name>
“HW component goes back to normal”
Voltage sensor value is back to normal state
N/A
Temperature-Related Events
Temperature
HEALTH_NOT_OK
WARNING
<Temp-sensor-name>
<Temp-sensor-name> temperature is too hot, temperature={}, threshold={}
Temperature too hot
Temperature
HEALTH_NOT_OK
WARNING
<Temp-sensor-name>
Sensor status is failed
Sensor status in hardware is failed
Temperature
HEALTH_OK
INFORMATIONAL
<Temp-sensor-name>
“HW component goes back to normal”
Temperature sensor value is back to normal state
N/A
System-Services-Related Events
services
HEALTH_NOT_OK
WARNING
<container-name>
Container '<container-name>' is not running
Container is not running
Collect tech-support and submit NVIDIA support ticket.
services
HEALTH_OK
INFORMATIONAL
<container-name>
“Service goes back to normal”
Service goes back to normal state
N/A
services
CPU_USAGE
WARNING
CPU usage x% is above expected threshold y%
CPU usage is larger than the expected usage
services
MEMORY_USAGE
WARNING
Memory usage x% is above expected threshold y%
Memory usage is larger than the expected usage
services
CPU_USAGE
INFORMATIONAL
CPU usage is back to normal
CPU usage is back to normal state
N/A
services
MEMORY_USAGE
INFORMATIONAL
Memory usage is back to normal
Memory usage is back to normal state
N/A
System-Initialization-Related Events
Init flow
SYSTEM_STATE_DOWN
CRITICAL
System
System is not ready - one or more services are not up
Some services are not up as part of init
Init flow
SYSTEM_STATE_FAILED
MAJOR
System
System is not ready - one or more services failed
Some services failed as part of init
Init flow
SYSTEM_STATE_UP
INFORMATIONAL
System
System is ready
System finished initialization and is ready
N/A
Interface-Related Informational Events
interface
INTERFACE_ADMIN_STATUS
INFORMATIONAL
<interface_name>
"Interface admin state is {admin_state}"
Informs of admin state change of interface
N/A
interface
INTERFACE_OPER_STATUS
INFORMATIONAL
<interface_name>
"Interface operational state is {up or down}"
Informs of operational state change of interface
N/A
interface
INTERFACE_LOGICAL_STATE
INFORMATIONAL
<interface_name>
"Interface logical state is {logical_state}"
Informs of logical state change of interface
N/A
System-Health-Related Events
(The events below are summary events that accompany the specific errors detailed above.)
system
HEALTH_SUMMARY_NOT_OK_CRITICAL
CRITICAL
System
Health status is not ok
At least one critical health-not-OK event is present (e.g., leakage)
Collect tech-support and submit NVIDIA support ticket.
system
HEALTH_SUMMARY_NOT_OK
WARNING
System
Health status is not ok
At least one warning-level health-not-OK event is present
system
HEALTH_SUMMARY_OK
INFORMATIONAL
System
Health status is ok
System health has no issues
N/A
Transceiver-Related Events
Transceiver failure
HEALTH_NOT_OK
WARNING
sw1
"Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]"
Temperature is critically high
N/A
Transceiver failure
HEALTH_NOT_OK
WARNING
sw1
"Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]"
Temperature is critically low
N/A
Transceiver health
HEALTH_OK
INFORMATIONAL
sw1
"HW component goes back to normal"
Transceiver temperature is back within the normal range
N/A
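Several event texts in the table above embed measurements, for example the fan speed and ASIC temperature messages. A client that needs the numeric values could extract them with regular expressions, as in the sketch below. The patterns are derived only from the sample texts shown on this page, so the real message formats may differ; the function name `parse_event_text` and the returned dictionary keys are hypothetical.

```python
import re

# Patterns modeled on the sample texts in the table; actual formats may vary.
FAN_SPEED = re.compile(
    r"^(?P<resource>\S+) speed is out of range, "
    r"speed=(?P<speed>\d+)%, range=\[(?P<low>\d+),(?P<high>\d+)\]$"
)
TOO_HOT = re.compile(
    r"^(?P<resource>\S+) temperature is too hot, "
    r"temperature=(?P<temperature>\d+(?:\.\d+)?), "
    r"threshold=(?P<threshold>\d+(?:\.\d+)?)$"
)


def parse_event_text(text):
    """Return the values embedded in an event text, or None if unrecognized."""
    text = text.strip('"')  # some texts are shown wrapped in double quotes
    match = FAN_SPEED.match(text)
    if match:
        return {
            "kind": "fan_speed",
            "resource": match["resource"],
            "speed": int(match["speed"]),
            "low": int(match["low"]),
            "high": int(match["high"]),
        }
    match = TOO_HOT.match(text)
    if match:
        return {
            "kind": "too_hot",
            "resource": match["resource"],
            "temperature": float(match["temperature"]),
            "threshold": float(match["threshold"]),
        }
    return None


fan = parse_event_text('"FAN1/1 speed is out of range, speed=40%, range=[50,100]"')
asic = parse_event_text("ASIC1 temperature is too hot, temperature=120, threshold=105")
print(fan)
print(asic)
```

Texts that match neither pattern return None, so unknown events can still be passed through unparsed.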
Event Category
Event Type ID ("event" in gNMI)
Severity
Resource ("component" in gNMI)
Text
Failure Reason
Suggested Repair Flow
Power-Supply-Unit-Related Events
PSU
HEALTH_NOT_OK
WARNING
<PSU-name>
<PSU-name> is missing—Unpopulated PSU slot
A PSU is expected in the system, but the PSU slot is empty
Insert a PSU or a dummy PSU into the unpopulated PSU slot
PSU
HEALTH_NOT_OK
WARNING
<PSU-name>
<PSU-name> is out of power
PSU is out of power
PSU
HEALTH_NOT_OK
WARNING
<PSU-name>
<PSU-name> temperature is too hot, temperature={}, threshold={}
Temperature too hot
PSU
HEALTH_NOT_OK
WARNING
<PSU-name>
<PSU-name> voltage is out of range, voltage={}, range=[{},{}]
Voltage is out of range
PSU
HEALTH_NOT_OK
WARNING
<PSU-name>
<PSU-name> System power exceeds threshold ({}w)
Power exceeds threshold
PSU
HEALTH_NOT_OK
WARNING
<PSU-name>
<PSU-name> Power supply is not providing power
No power from PSU
PSU
HEALTH_OK
INFORMATIONAL
<PSU-name>
HW component goes back to normal
Health issue was resolved
N/A
PS-Redundancy-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy no-redundancy requires at least <minimal-number-of-PSUs-per-system> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current no-redundancy policy | Insert at least the minimum number of PSUs required. To find the minimum number of PSUs, run: nv show platform ps-redundancy |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy ps-redundant requires at least <minimal-number-of-PSUs-per-system + 1> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current ps-redundant policy | Insert at least one more PSU than the minimum number of PSUs required. To find the minimum number of PSUs, run: nv show platform ps-redundancy |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy grid-redundant requires all <number-of-PSUs-in-the-system> power supplies to be working, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current grid-redundant policy | Insert PSUs into all PSU slots in the system. |
| Power redundancy policy | HEALTH_OK | INFORMATIONAL | N/A | HW component goes back to normal | Health issue was resolved | N/A |
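The three policies differ only in how many working PSUs they require, which can be summarized in a few lines. The sketch below follows the thresholds quoted in the event texts above. The function names and parameters are hypothetical, and the minimal and total PSU counts are assumed to be known already (for example, from the output of nv show platform ps-redundancy).

```python
def required_working_psus(policy, minimal, total):
    """Number of working PSUs the given redundancy policy requires.

    `minimal` is the minimal number of PSUs per system and `total` is the
    number of PSUs in the system; both are supplied by the caller.
    """
    if policy == "no-redundancy":
        return minimal
    if policy == "ps-redundant":
        return minimal + 1
    if policy == "grid-redundant":
        return total
    raise ValueError(f"unknown policy: {policy}")


def is_policy_satisfied(policy, working, minimal, total):
    """True when enough PSUs are working for the policy."""
    return working >= required_working_psus(policy, minimal, total)


# Illustrative numbers only: a system with 4 PSUs and a minimum of 2.
for policy in ("no-redundancy", "ps-redundant", "grid-redundant"):
    needed = required_working_psus(policy, minimal=2, total=4)
    ok = is_policy_satisfied(policy, working=3, minimal=2, total=4)
    print(f"{policy}: needs {needed}, satisfied with 3 working: {ok}")
```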