Event Management
This feature reports specific actions (such as a port disable) and events (such as a port fast-recovery) by broadcasting them in a standardized format together with a description, so that users can monitor the system state remotely more easily.
In the current release, the list of broadcast events can be accessed only through the CLI. In the upcoming release, gNMI will be extended to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.
The following table lists the supported events and their descriptions.
Resource | Event Description | Severity |
System | System fatal state detected | CRITICAL |
System | System is not ready—one or more services are not up | CRITICAL |
System | System is not ready—one or more services failed | MAJOR |
System | Restart all syncd-ibv0 dockers | MAJOR |
System | Performing reboot | MAJOR |
System | Health status is not ok | WARNING |
System | System is ready | INFORMATIONAL |
System | System recovered from fatal state | INFORMATIONAL |
System | Health status is ok | INFORMATIONAL |
Sensor or service name | <Repeats a message from the system health> | WARNING |
Sensor or service name | Hardware component goes back to normal / Service goes back to normal | INFORMATIONAL |
Interface name | Interface admin state is {Up/Down} | INFORMATIONAL |
Interface name | Interface operational state is {Up/Down} | INFORMATIONAL |
Interface name | Fast-recovery error event for trigger {trigger_name} was received | INFORMATIONAL |
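As a rough illustration of how a client might work with these events, the sketch below models an event as a small Python record and filters a list by minimum severity. The `Event` class, the `at_least` helper, and the ranking of the four severity levels (CRITICAL highest, INFORMATIONAL lowest) are assumptions made for this example. The table above names the severity levels but does not define an ordering or a data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ranking; the table names the levels but does not order them.
    INFORMATIONAL = 0
    WARNING = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Event:
    # Hypothetical record mirroring the three table columns.
    resource: str
    description: str
    severity: Severity


def at_least(events, minimum):
    """Return the events whose severity is `minimum` or higher."""
    return [e for e in events if e.severity >= minimum]


# Sample rows copied from the table above.
SAMPLE_EVENTS = [
    Event("System", "System fatal state detected", Severity.CRITICAL),
    Event("System", "Performing reboot", Severity.MAJOR),
    Event("System", "Health status is not ok", Severity.WARNING),
    Event("System", "System is ready", Severity.INFORMATIONAL),
]

# Keep only MAJOR and CRITICAL events.
urgent = at_least(SAMPLE_EVENTS, Severity.MAJOR)
```

With the sample rows, `urgent` holds the "System fatal state detected" and "Performing reboot" events, in that order.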
Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
Fan-Related Events | ||||||
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 speed is out of range, speed=40%, range=[50,100]" | Fan speed out of range | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is not working" | Fan status is not okay (status in the hardware) | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get actual speed data for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get target speed data for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed tolerance for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed status for FAN1/1" | Failed to get some information of fan data from hardware | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is missing" | Fan is missing | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 direction is not aligned with exhaust direction intake" | Fan direction is not aligned with other fans | |
Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" | Invalid speed | |
Fan health | HEALTH_OK | INFORMATIONAL | FAN1/1 | “HW component goes back to normal” | Fan is back to normal state | N/A |
ASIC-Related Events | ||||||
ASIC failure | HEALTH_NOT_OK | WARNING | ASIC-HEALTH | Switch ASIC in fatal state | ASIC in fatal state | |
ASIC failure | SYSTEM_FATAL_DETECTED | CRITICAL | System | System fatal state detected | Detected ASIC in fatal state | |
ASIC failure | HEALTH_NOT_OK | WARNING | ASIC1 | ASIC1 temperature is too hot, temperature=120, threshold=105 | ASIC temperature too high | |
ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Restart all syncd-ibv0 dockers | ASIC in fatal state, performing restart of dockers | N/A |
ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Performing reboot | ASIC in fatal state, performing reboot | |
ASIC health | HEALTH_OK | INFORMATIONAL | ASIC1 | “HW component goes back to normal” | ASIC1 is back to normal state | |
ASIC health | SYSTEM_FATAL_RECOVERED | INFORMATIONAL | System | System recovered from fatal state | Recovered from fatal state | |
Leakage-Related Events | ||||||
Leakage | LEAKAGE | CRITICAL | LEAKAGE-1 | Leakage detected, inspect for water leakage and consider power down switch tray | Detected leakage | Note: Relevant only for liquid-cooled systems. |
Leakage | HEALTH_NOT_OK | WARNING | LEAKAGE-1 | LEAKAGE-1 detected leakage | Detected leakage | |
Voltage-Related Events | ||||||
Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor voltage is out of range, voltage={}, range=[{},{}] | Voltage sensor not in range | |
Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor status is failed | Voltage sensor status in hardware is failed | |
Voltage | HEALTH_OK | INFORMATIONAL | <Voltage-sensor-name> | “HW component goes back to normal” | Voltage sensor value is back to normal state | N/A |
Temperature-Related Events | ||||||
Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | <Temp-sensor-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | |
Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | Sensor status is failed | Sensor status in hardware is failed | |
Temperature | HEALTH_OK | INFORMATIONAL | <Temp-sensor-name> | “HW component goes back to normal” | Temperature sensor value is back to normal state | N/A |
System-Services-Related Events | ||||||
services | HEALTH_NOT_OK | WARNING | <container-name> | Container '<container-name>' is not running | Container is not running | Collect tech-support and submit NVIDIA support ticket. |
services | HEALTH_OK | INFORMATIONAL | <container-name> | “Service goes back to normal” | Service goes back to normal state | N/A |
System-Initialization-Related Events | ||||||
Init flow | SYSTEM_STATE_DOWN | CRITICAL | System | System is not ready—one or more services are not up | Some services are not up as part of init | |
Init flow | SYSTEM_STATE_FAILED | MAJOR | System | System is not ready—one or more services failed | Some services failed as part of init | |
Init flow | SYSTEM_STATE_UP | INFORMATIONAL | System | System is ready | System finished initialization and is ready | N/A |
Interface-Related Informational Events | ||||||
interface | INTERFACE_ADMIN_STATUS | INFORMATIONAL | <interface_name> | "Interface admin state is {admin_state}" | Informs of admin state change of interface | N/A |
interface | INTERFACE_OPER_STATUS | INFORMATIONAL | <interface_name> | "Interface operational state is {up or down}" | Informs of operational state change of interface | N/A |
interface | INTERFACE_LOGICAL_STATE | INFORMATIONAL | <interface_name> | "Interface logical state is {logical_state}" | Informs of logical state change of interface | N/A |
System-Health-Related Events (the events below are summaries that accompany the specific errors detailed above) | ||||||
system | HEALTH_SUMMARY_NOT_OK_CRITICAL | CRITICAL | System | Health status is not ok | At least one critical health-not-okay event is present (e.g., leakage) | Collect tech-support and submit NVIDIA support ticket. |
system | HEALTH_SUMMARY_NOT_OK | WARNING | System | Health status is not ok | At least one warning-level health-not-okay event is present | |
system | HEALTH_SUMMARY_OK | INFORMATIONAL | System | Health status is ok | System health has no issues | N/A |
Transceiver-Related Events | ||||||
Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]" | Temperature is critically high | N/A |
Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]" | Temperature is critically low | N/A |
Transceiver health | HEALTH_OK | INFORMATIONAL | sw1 | "HW component goes back to normal" | Transceiver's temperature is good now | N/A |
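The HEALTH_NOT_OK and HEALTH_OK rows above suggest one way a monitoring client could keep track of which resources are currently unhealthy: record a resource when a HEALTH_NOT_OK event arrives, and clear it when a HEALTH_OK event arrives for the same resource. The following is a minimal sketch under that assumption. The `HealthTracker` class and the dictionary shape of an event (keys `event`, `component`, and `text`, loosely following the "event" and "component" gNMI names in the table header) are hypothetical and are not taken from the product's API.

```python
class HealthTracker:
    """Tracks resources with outstanding HEALTH_NOT_OK events (illustrative only)."""

    def __init__(self):
        # Maps component name -> list of texts of unresolved faults.
        self.active = {}

    def handle(self, event):
        kind = event["event"]
        component = event["component"]
        if kind == "HEALTH_NOT_OK":
            self.active.setdefault(component, []).append(event["text"])
        elif kind == "HEALTH_OK":
            # Assumption: one HEALTH_OK clears every open fault on the component.
            self.active.pop(component, None)
        # Other event types (for example INTERFACE_OPER_STATUS) are ignored here.

    def unhealthy(self):
        """Return the names of components with open faults, sorted."""
        return sorted(self.active)


# Sample events built from rows in the table above.
tracker = HealthTracker()
tracker.handle({"event": "HEALTH_NOT_OK", "component": "FAN1/1",
                "text": "FAN1/1 is missing"})
tracker.handle({"event": "HEALTH_NOT_OK", "component": "ASIC1",
                "text": "ASIC1 temperature is too hot, temperature=120, threshold=105"})
tracker.handle({"event": "HEALTH_OK", "component": "FAN1/1",
                "text": "HW component goes back to normal"})
```

After these three events, `tracker.unhealthy()` lists only `ASIC1`, because the HEALTH_OK event cleared the fan fault. Whether a single HEALTH_OK event really covers all earlier faults on a component is not stated in the table, so a real client may need finer-grained matching.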