Event Management
This feature reports specific actions (such as port disable) and events (such as port fast-recovery) in a standardized format, each with a description, so that users can monitor the system state remotely with less effort.
In the current release, the list of broadcast events can be accessed only through the CLI. In the upcoming release, gNMI support will be extended to publish each event remotely to clients. To subscribe to gNMI events, see the gNMI Streaming section.
The following table lists the supported events with their descriptions.
Resource | Event Description | Severity |
System | System fatal state detected | CRITICAL |
System | System is not ready—one or more services are not up | CRITICAL |
System | System is not ready—one or more services failed | MAJOR |
System | Restart all syncd-ibv0 dockers | MAJOR |
System | Performing reboot | MAJOR |
System | Health status is not ok | WARNING |
System | System is ready | INFORMATIONAL |
System | System recovered from fatal state | INFORMATIONAL |
System | Health status is ok | INFORMATIONAL |
Sensor or service name | <Repeats a message from the system health> | WARNING |
Sensor or service name | Hardware component goes back to normal / Service goes back to normal | INFORMATIONAL |
Interface name | Interface admin state is {Up/Down} | INFORMATIONAL |
Interface name | Interface operational state is {Up/Down} | INFORMATIONAL |
Interface name | Fast-recovery error event for trigger {trigger_name} was received | INFORMATIONAL |
System | System reboot occurred reason: performed by user: reboot time: For a list of reboot causes, see Possible Reboot Causes. | INFORMATIONAL |
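When working with the event list, it can help to filter by severity. The following Python sketch is illustrative only: the numeric ranking (CRITICAL above MAJOR above WARNING above INFORMATIONAL) is an assumption of this example, and the `at_least` helper and the (resource, description, severity) tuple layout are hypothetical, not part of the product.

```python
# Assumed ranking for illustration; the system itself does not publish numeric ranks.
SEVERITY_RANK = {"CRITICAL": 3, "MAJOR": 2, "WARNING": 1, "INFORMATIONAL": 0}

def at_least(events, minimum):
    """Return the events whose severity is at or above `minimum`."""
    floor = SEVERITY_RANK[minimum]
    return [e for e in events if SEVERITY_RANK[e[2]] >= floor]

# (resource, description, severity) tuples taken from the table above.
events = [
    ("System", "System is ready", "INFORMATIONAL"),
    ("System", "Performing reboot", "MAJOR"),
    ("System", "Health status is not ok", "WARNING"),
    ("System", "System fatal state detected", "CRITICAL"),
]
urgent = at_least(events, "MAJOR")
```

Here `urgent` keeps only the MAJOR and CRITICAL rows, in their original order.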
Receiving Clearing Events
For most WARNING/CRITICAL events, the system will send a "Clearing" event once the issue is resolved. If the system experiences two or more issues for a component, it will send an event about the last issue, and once all the issues are resolved, it will send a "Clearing" event for the last issue only. Once you receive a "Clearing" event, all issues for the component are resolved.
Example:
PSU2 is out of power
PSU2 is missing or not available
Cleared: PSU2 is missing or not available
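On the receiving side, the rule above means a single "Clearing" event can be treated as resolving every open issue for that component. The sketch below shows one possible way to track this in Python. The `IssueTracker` class is hypothetical, and it assumes the caller passes in only WARNING/CRITICAL events and their "Cleared: ..." counterparts, each with its component name.

```python
from collections import defaultdict

CLEARED_PREFIX = "Cleared: "

class IssueTracker:
    """Tracks unresolved issue texts per component on the receiving side."""

    def __init__(self):
        self.open_issues = defaultdict(list)

    def handle(self, component, text):
        if text.startswith(CLEARED_PREFIX):
            # A Clearing event means all issues for the component are resolved.
            self.open_issues.pop(component, None)
        elif text not in self.open_issues[component]:
            self.open_issues[component].append(text)

    def is_healthy(self, component):
        return not self.open_issues.get(component)

tracker = IssueTracker()
tracker.handle("PSU2", "PSU2 is out of power")
tracker.handle("PSU2", "PSU2 is missing or not available")
unhealthy_before = not tracker.is_healthy("PSU2")
tracker.handle("PSU2", "Cleared: PSU2 is missing or not available")
healthy_after = tracker.is_healthy("PSU2")
```

After the "Cleared" event, PSU2 has no open issues, even though only the last issue was named in it.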
Resending Unresolved Issues
If one of the initial issues for the component still exists after the last one was resolved, the system will resend the issue that still exists.
Example:
PSU2 is out of power
PSU2 is missing or not available
PSU2 is out of power
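The reporting rules for a single component can be modeled as a small state machine. The Python sketch below is a toy model for reasoning about which events to expect, not the system's actual implementation. The `ComponentIssues` class is hypothetical, and it assumes that resolving an issue other than the most recent one sends nothing.

```python
class ComponentIssues:
    """Toy model of the reporting rules for one component."""

    def __init__(self):
        self.active = []  # unresolved issues, oldest first

    def raise_issue(self, text):
        self.active.append(text)
        return text  # each new issue is reported when it occurs

    def resolve_issue(self, text):
        if text not in self.active:
            return None
        was_last = self.active[-1] == text
        self.active.remove(text)
        if not self.active:
            # Everything is resolved: send one Clearing event.
            return f"Cleared: {text}"
        # Assumption: only resolving the most recent issue triggers a resend.
        return self.active[-1] if was_last else None

psu2 = ComponentIssues()
sent = [
    psu2.raise_issue("PSU2 is out of power"),
    psu2.raise_issue("PSU2 is missing or not available"),
    psu2.resolve_issue("PSU2 is missing or not available"),
]
final = psu2.resolve_issue("PSU2 is out of power")
```

In this model, `sent` reproduces the three-event sequence in the example above, and `final` is the single "Cleared" event sent once nothing remains.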
Backward Compatibility
Backward compatibility is preserved: when an issue for a component is cleared, the system also sends the generic event. Avoid relying on generic messages, as they will be dropped in future releases.
Example:
Cleared: PSU2/FAN speed is out of range
Hardware component goes back to normal
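A consumer that wants to stop depending on generic messages can discard them and act on the "Cleared: ..." events only. The following Python sketch shows one way to do this. The `drop_generic` helper is hypothetical, and the set of generic texts is collected from the messages listed in this document, so it may not be exhaustive.

```python
# Generic texts as they appear in this document; the list may be incomplete.
GENERIC_TEXTS = {
    "Hardware component goes back to normal",
    "HW component goes back to normal",
    "Service goes back to normal",
}

def drop_generic(texts):
    """Keep specific events (including "Cleared: ..."); skip generic ones."""
    return [t for t in texts if t.strip('"') not in GENERIC_TEXTS]

received = [
    "Cleared: PSU2/FAN speed is out of range",
    "Hardware component goes back to normal",
]
kept = drop_generic(received)
```

Only the specific "Cleared" event remains in `kept`.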
System Reboot
After a reboot, the system does not send "Clearing" events for issues reported before the reboot; it assumes that everything is cleared.
Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
Fan-Related Events | ||||||
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 speed is out of range, speed=40%, range=[50,100]" | Fan speed out of range | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is not working" | Fan status is not okay (status in the hardware) | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get actual speed data for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get target speed data for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed tolerance for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed status for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is missing" | Fan is missing | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 direction is not aligned with exhaust direction intake" | Fan direction is not aligned with other fans | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" | Invalid speed | |
Fan health | HEALTH_OK | INFORMATIONAL | FAN1/1 | "HW component goes back to normal" | Fan is back to normal state | N/A |
ASIC-Related Events | ||||||
ASIC failure | HEALTH_NOT_OK | WARNING | ASIC-HEALTH | Switch ASIC in fatal state | ASIC in fatal state | |
ASIC failure | SYSTEM_FATAL_DETECTED | CRITICAL | System | System fatal state detected | Detect ASIC in fatal | |
ASIC failure | HEALTH_NOT_OK | WARNING | ASIC1 | ASIC1 temperature is too hot, temperature=120, threshold=105 | ASIC temperature too high | |
ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Restart all syncd-ibv0 dockers | ASIC in fatal state, performing restart of dockers | N/A |
ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Performing reboot | ASIC in fatal state, performing reboot | |
ASIC health | HEALTH_OK | INFORMATIONAL | ASIC1 | "HW component goes back to normal" | ASIC1 is back to normal state | |
ASIC health | SYSTEM_FATAL_RECOVERED | INFORMATIONAL | System | System recovered from fatal state | Recovered from fatal state | |
Leakage-Related Events | ||||||
Leakage | LEAKAGE | CRITICAL | LEAKAGE-1 | Leakage detected, inspect for water leakage and consider power down switch tray | Detected leakage | Note: Relevant only for liquid-cooled systems. |
Leakage | HEALTH_NOT_OK | WARNING | LEAKAGE-1 | LEAKAGE-1 detected leakage | Detected leakage | |
Voltage-Related Events | ||||||
Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor voltage is out of range, voltage={}, range=[{},{}] | Voltage sensor not in range | |
Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor status is failed | Voltage sensor status in hardware is failed | |
Voltage | HEALTH_OK | INFORMATIONAL | <Voltage-sensor-name> | "HW component goes back to normal" | Voltage sensor value is back to normal state | N/A |
Temperature-Related Events | ||||||
Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | <Temp-sensor-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | |
Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | Sensor status is failed | Sensor status in hardware is failed | |
Temperature | HEALTH_OK | INFORMATIONAL | <Temp-sensor-name> | "HW component goes back to normal" | Temperature sensor value is back to normal state | N/A |
System-Services-Related Events | ||||||
services | HEALTH_NOT_OK | WARNING | <container-name> | Container '<container-name>' is not running | Container is not running | Collect tech-support and submit NVIDIA support ticket. |
services | HEALTH_OK | INFORMATIONAL | <container-name> | "Service goes back to normal" | Service goes back to normal state | N/A |
System-Initialization-Related Events | ||||||
Init flow | SYSTEM_STATE_DOWN | CRITICAL | System | System is not ready—one or more services are not up | Some services are not up as part of init | |
Init flow | SYSTEM_STATE_FAILED | MAJOR | System | System is not ready—one or more services failed | Some services failed as part of init | |
Init flow | SYSTEM_STATE_UP | INFORMATIONAL | System | System is ready | System finished initialization and is ready | N/A |
Interface-Related Informational Events | ||||||
interface | INTERFACE_ADMIN_STATUS | INFORMATIONAL | <interface_name> | "Interface admin state is {admin_state}" | Informs of admin state change of interface | N/A |
interface | INTERFACE_OPER_STATUS | INFORMATIONAL | <interface_name> | "Interface operational state is {up or down}" | Informs of operational state change of interface | N/A |
interface | INTERFACE_LOGICAL_STATE | INFORMATIONAL | <interface_name> | "Interface logical state is {logical_state}" | Informs of logical state change of interface | N/A |
System-Health-Related Events (The events below are summaries that accompany the specific errors detailed above) | ||||||
system | HEALTH_SUMMARY_NOT_OK_CRITICAL | CRITICAL | System | Health status is not ok | At least one critical health-not-OK event is present (e.g., leakage) | Collect tech-support and submit NVIDIA support ticket. |
system | HEALTH_SUMMARY_NOT_OK | WARNING | System | Health status is not ok | At least one warning health-not-OK event is present | |
system | HEALTH_SUMMARY_OK | INFORMATIONAL | System | Health status is ok | System health has no issues | N/A |
Transceiver-Related Events | ||||||
Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]" | Temperature is critically high | N/A |
Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]" | Temperature is critically low | N/A |
Transceiver health | HEALTH_OK | INFORMATIONAL | sw1 | "HW component goes back to normal" | Transceiver's temperature is good now | N/A |
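Several event texts in the table above carry measurements in a `key=value` form (for example, `speed=40%, range=[50,100]`). The following Python sketch shows one possible way to extract them. The `parse_params` helper and its regular expression are hypothetical; they handle only the compact `key=value` form and not the spaced `actual = 100` form used by the transceiver messages.

```python
import re

# A value is either a bracketed range or a run of characters up to a comma or space.
PARAM_RE = re.compile(r"(\w+)=(\[[^\]]*\]|[^,\s]+)")

def parse_params(text):
    """Extract key=value pairs from an event text; values are kept as strings."""
    return dict(PARAM_RE.findall(text))

fan = parse_params("FAN1/1 speed is out of range, speed=40%, range=[50,100]")
asic = parse_params("ASIC1 temperature is too hot, temperature=120, threshold=105")
```

Values are left as strings so that units such as `%` are preserved; convert them as needed for the specific event.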
Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
Power-Supply-Unit-Related Events | ||||||
PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> is missing—Unpopulated PSU slot | PSU expected to be in the system, but PSU slot is empty | Insert PSU/dummy PSU to the unpopulated PSU slot |
PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> is out of power | PSU is out of power | |
PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | |
PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> voltage is out of range, voltage={}, range=[{},{}] | Voltage is out of range | |
PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> System power exceeds threshold ({}w) | Power exceeds threshold | |
PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> Power supply is not providing power | No power from PSU | |
PSU | HEALTH_OK | INFORMATIONAL | <PSU-name> | HW component goes back to normal | Health issue was resolved | N/A |
Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
PS-Redundancy-Related Events | ||||||
Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy no-redundancy requires at least <minimal-number-of-PSUs-per-system> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current no-redundancy policy | Insert at least the minimum number of PSUs required. The minimum number of PSUs can be found using the following command: nv show platform ps-redundancy |
Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy ps-redundant requires at least <minimal-number-of-PSUs-per-system + 1> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current ps-redundant policy | Insert at least one more PSU than the minimum number of PSUs required. The minimum number of PSUs can be found using the following command: nv show platform ps-redundancy |
Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy grid-redundant requires all <number-of-PSUs-in-the-system> power supplies to be working, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current grid-redundant policy | Insert PSUs to all PSU slots in the system. |
Power redundancy policy | HEALTH_OK | INFORMATIONAL | N/A | HW component goes back to normal | Health issue was resolved | N/A |
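The power redundancy messages above state both the required and the current number of working PSUs, so the shortfall can be computed from the event text. The following Python sketch is one way to do this. The `missing_psus` helper and its regular expression are hypothetical, and they assume the placeholders in the message templates are rendered as plain integers.

```python
import re

# Matches the three message templates listed in the table above.
POLICY_RE = re.compile(
    r"Power redundancy policy (?P<policy>\S+) requires (?:at least|all) "
    r"(?P<required>\d+) (?:working )?power supplies(?: to be working)?, "
    r"currently system has only (?P<working>\d+)"
)

def missing_psus(text):
    """Return (policy, number of PSUs still needed), or None if no match."""
    m = POLICY_RE.search(text)
    if m is None:
        return None
    return m["policy"], int(m["required"]) - int(m["working"])

result = missing_psus(
    "Power redundancy policy ps-redundant requires at least 3 working "
    "power supplies, currently system has only 2"
)
```

With the sample text, `result` reports that one more working PSU is needed under the ps-redundant policy.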