Event Management

This feature broadcasts specific actions (such as port disable) and events (such as port fast-recovery) in a standardized format, together with a description of each. The goal is to simplify remote monitoring of system state.

Note

In the current release, the list of broadcast events can be accessed only through the CLI. In an upcoming release, gNMI will be extended to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.

Supported Events

The following table lists the supported events and their descriptions.

Resource	Event Description	Severity
System	System fatal state detected	CRITICAL
System	System is not ready—one or more services are not up	CRITICAL
System	System is not ready—one or more services failed	MAJOR
System	Restart all syncd-ibv0 dockers	MAJOR
System	Performing reboot	MAJOR
System	Health status is not ok	WARNING
System	System is ready	INFORMATIONAL
System	System recovered from fatal state	INFORMATIONAL
System	Health status is ok	INFORMATIONAL
Sensor or service name	<Repeats a message from the system health>	WARNING
Sensor or service name	Hardware component goes back to normal / Service goes back to normal	INFORMATIONAL
Interface name	Interface admin state is {Up/Down}	INFORMATIONAL
Interface name	Interface operational state is {Up/Down}	INFORMATIONAL
Interface name	Fast-recovery error event for trigger {trigger_name} was received	INFORMATIONAL
System	System reboot occurred reason: `<reboot cause>` performed by user: `<user who performed reboot>` reboot time: `<time when reboot occurred>` For a list of reboot causes, see Possible Reboot Causes.	INFORMATIONAL
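
The severity column can be used to filter events on the receiving side. The following is a minimal sketch, not part of the product: the dictionary layout of an event and the function name `at_least` are hypothetical, and the ranking of the four levels (INFORMATIONAL lowest, CRITICAL highest) is an assumption, since the table names the levels but does not rank them.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering; the table lists these levels without ranking them.
    INFORMATIONAL = 0
    WARNING = 1
    MAJOR = 2
    CRITICAL = 3


def at_least(events, threshold):
    """Return the events whose severity is at or above `threshold`."""
    return [e for e in events if Severity[e["severity"]] >= threshold]


# Sample rows taken from the table above (hypothetical dict layout).
events = [
    {"resource": "System", "text": "System fatal state detected", "severity": "CRITICAL"},
    {"resource": "System", "text": "Performing reboot", "severity": "MAJOR"},
    {"resource": "System", "text": "Health status is not ok", "severity": "WARNING"},
    {"resource": "System", "text": "System is ready", "severity": "INFORMATIONAL"},
]

urgent = at_least(events, Severity.MAJOR)
print([e["text"] for e in urgent])
```

With the sample rows, only the CRITICAL and MAJOR events pass the `Severity.MAJOR` threshold.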

Clearing Events

Receiving Clearing Events

For most WARNING/CRITICAL events, the system sends a "Clearing" event once the issue is resolved. If a component experiences two or more issues, the system sends an event for the most recent issue. Once all the issues are resolved, it sends a "Clearing" event for the most recent issue only. A "Clearing" event therefore means that all issues for the component are resolved.

Example:

PSU2 is out of power
PSU2 is missing or not available
Cleared: PSU2 is missing or not available

Resending Unresolved Issues

If an earlier issue for the component still exists after the most recent one is resolved, the system resends the event for the issue that still exists.

Example:

PSU2 is out of power
PSU2 is missing or not available
PSU2 is out of power
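
The clearing and resending behavior described above suggests that a receiver only needs to remember the latest unresolved issue per component. The sketch below illustrates this under stated assumptions: the class `IssueTracker` is hypothetical, the component name is assumed to arrive separately from the text (as the "component" field does in gNMI), and the `Cleared: ` prefix is inferred from the examples rather than from a formal specification.

```python
CLEARED_PREFIX = "Cleared: "  # assumption: inferred from the examples above


class IssueTracker:
    """Tracks the latest unresolved issue per component (hypothetical helper)."""

    def __init__(self):
        self.active = {}  # component name -> text of the latest open issue

    def handle(self, component, text):
        if text.startswith(CLEARED_PREFIX):
            # A "Clearing" event means all issues for the component are resolved.
            self.active.pop(component, None)
        else:
            # A new or resent issue replaces whatever was recorded before.
            self.active[component] = text


# First example: both issues are resolved, one "Clearing" event arrives.
cleared = IssueTracker()
for text in ("PSU2 is out of power",
             "PSU2 is missing or not available",
             "Cleared: PSU2 is missing or not available"):
    cleared.handle("PSU2", text)

# Second example: the earlier issue still exists and is resent.
resent = IssueTracker()
for text in ("PSU2 is out of power",
             "PSU2 is missing or not available",
             "PSU2 is out of power"):
    resent.handle("PSU2", text)

print(cleared.active)
print(resent.active)
```

After the first sequence the tracker holds no open issue for PSU2; after the second it holds "PSU2 is out of power".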

Backward Compatibility

Backward compatibility is preserved: when an issue for a component is cleared, the system also sends the generic event. Avoid relying on the generic messages, as they will be dropped in future releases.

Example:

Cleared: PSU2/FAN speed is out of range
Hardware component goes back to normal
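
A client that wants to stop depending on the generic messages could filter them out and act only on the specific "Cleared:" events. The following is a minimal sketch; the function name `drop_generic` is hypothetical, and the set of generic texts is collected from the tables on this page, so it may not be exhaustive.

```python
# Generic texts as they appear in the tables on this page (may be incomplete).
GENERIC_TEXTS = {
    "Hardware component goes back to normal",
    "HW component goes back to normal",
    "Service goes back to normal",
}


def drop_generic(texts):
    """Return the event texts with the generic clearing messages removed."""
    return [t for t in texts if t.strip() not in GENERIC_TEXTS]


received = [
    "Cleared: PSU2/FAN speed is out of range",
    "Hardware component goes back to normal",
]
print(drop_generic(received))
```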

System Reboot

After a reboot, the system does not send "Clearing" events for issues raised before the reboot. All pre-reboot issues are assumed to be cleared.

Detailed Table of Events

Event Category	Event Type ID ("event" in gNMI)	Severity	Resource ("component" in gNMI)	Text	Failure Reason	Suggested Repair Flow
Fan-Related Events
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"FAN1/1 speed is out of range, speed=40%, range=[50,100]"	Fan speed out of range	Collect tech-support and submit NVIDIA support ticket. Consider the number of faulty fans: more than one faulty fan requires immediate maintenance. Power-cycle the switch. If the issue persists, submit an NVIDIA support ticket to replace the fan module.
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"FAN1/1 is not working"	Fan status is not okay (status in the hardware)
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"Failed to get actual speed data for FAN1/1"	Failed to get some information of fan data from hardware
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"Failed to get target speed data for FAN1/1"	Failed to get some information of fan data from hardware
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"Failed to get speed tolerance for FAN1/1"	Failed to get some information of fan data from hardware
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"Failed to get speed status for FAN1/1"	Failed to get some information of fan data from hardware
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"FAN1/1 is missing"	Fan is missing
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"FAN1/1 direction is not aligned with exhaust direction intake"	Fan direction is not aligned with other fans
Fan failure	HEALTH_NOT_OK	WARNING	FAN1/1	"Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100"	Invalid speed
Fan health	HEALTH_OK	INFORMATIONAL	FAN1/1	“HW component goes back to normal”	Fan is back to normal state	N/A
ASIC-Related Events
ASIC failure	HEALTH_NOT_OK	WARNING	ASIC-HEALTH	Switch ASIC in fatal state	ASIC in fatal state	Collect tech-support and submit NVIDIA support ticket. Reboot the system. If the issue persists, power-cycle the system.
ASIC failure	SYSTEM_FATAL_DETECTED	CRITICAL	System	System fatal state detected	ASIC detected in fatal state
ASIC failure	HEALTH_NOT_OK	WARNING	ASIC1	ASIC1 temperature is too hot, temperature=120, threshold=105	ASIC temperature too high	Collect tech-support and submit NVIDIA support ticket. Continue to monitor switch temperature.
ASIC failure	SYSTEM_FATAL_REMEDY	MAJOR	System	Restart all syncd-ibv0 dockers	ASIC in fatal state, performing reboot of dockers	N/A
ASIC failure	SYSTEM_FATAL_REMEDY	MAJOR	System	Performing reboot	ASIC in fatal state, performing reboot
ASIC health	HEALTH_OK	INFORMATIONAL	ASIC1	“HW component goes back to normal”	ASIC1 is back to normal state
ASIC health	SYSTEM_FATAL_RECOVERED	INFORMATIONAL	System	System recovered from fatal state	Recovered from fatal state
Leakage-Related Events
Leakage	LEAKAGE	CRITICAL	LEAKAGE-1	Leakage detected, inspect for water leakage and consider power down switch tray	Detected leakage	Collect tech-support and submit NVIDIA support ticket. For additional instructions, refer to NVONLINE 1115991, chapter "NVIDIA MGX Leak Detection Strategy and Remediation". Note: Relevant only for liquid-cooled systems.
Leakage	HEALTH_NOT_OK	WARNING	LEAKAGE-1	LEAKAGE-1  detected leakage	Detected leakage
Voltage-Related Events
Voltage	HEALTH_NOT_OK	WARNING	<Voltage-sensor-name>	Sensor voltage is out of range, voltage={}, range=[{},{}]	Voltage sensor not in range	Collect tech-support and submit NVIDIA support ticket. Power cycle the switch. If persists, consider replacing the system.
Voltage	HEALTH_NOT_OK	WARNING	<Voltage-sensor-name>	Sensor status is failed	Voltage sensor status in hardware is failed
Voltage	HEALTH_OK	INFORMATIONAL	<Voltage-sensor-name>	“HW component goes back to normal”	Voltage sensor value is back to normal state	N/A
Temperature-Related Events
Temperature	HEALTH_NOT_OK	WARNING	<Temp-sensor-name>	<Temp-sensor-name> temperature is too hot, temperature={}, threshold={}	Temperature too hot	Collect tech-support and submit NVIDIA support ticket. Power-cycle the switch. If the issue persists, check whether the sensor is Ambient-MNG-Temp. If it is, check the environmental conditions (CDU and DC temperature). If the issue still persists, consider replacing the system.
Temperature	HEALTH_NOT_OK	WARNING	<Temp-sensor-name>	Sensor status is failed	Sensor status in hardware is failed
Temperature	HEALTH_OK	INFORMATIONAL	<Temp-sensor-name>	“HW component goes back to normal”	Temperature sensor value is back to normal state	N/A
System-Services-Related Events
services	HEALTH_NOT_OK	WARNING	<container-name>	Container '<container-name>' is not running	Container is not running	Collect tech-support and submit NVIDIA support ticket.
services	HEALTH_OK	INFORMATIONAL	<container-name>	“Service goes back to normal”	Service goes back to normal state	N/A
System-Initialization-Related Events
Init flow	SYSTEM_STATE_DOWN	CRITICAL	System	System is not ready—one or more services are not up	Some services are not up as part of init	Collect tech-support and submit NVIDIA support ticket. If sensor is Ambient-MNG-Temp: Check environmental conditions (CDU (if exists) and DC temperature). If persists, power-cycle the switch. If still persists, replace the switch.
Init flow	SYSTEM_STATE_FAILED	MAJOR	System	System is not ready—one or more services failed	Some services failed as part of init
Init flow	SYSTEM_STATE_UP	INFORMATIONAL	System	System is ready	System finished initialization and is ready	N/A
Interface-Related Informational Events
interface	INTERFACE_ADMIN_STATUS	INFORMATIONAL	<interface_name>	"Interface admin state is {admin_state}"	Informs of admin state change of interface	N/A
interface	INTERFACE_OPER_STATUS	INFORMATIONAL	<interface_name>	"Interface operational state is {up or down}"	Informs of operational state change of interface	N/A
interface	INTERFACE_LOGICAL_STATE	INFORMATIONAL	<interface_name>	"Interface logical state is {logical_state}"	Informs of logical state change of interface	N/A
System-Health-Related Events (The events below are summaries that accompany the specific errors detailed above)
System	HEALTH_SUMMARY_NOT_OK_CRITICAL	CRITICAL	System	Health status is not ok	At least one critical health-not-OK event is active (e.g., leakage)	Collect tech-support and submit NVIDIA support ticket.
System	HEALTH_SUMMARY_NOT_OK	WARNING	System	Health status is not ok	At least one warning-level health-not-OK event is active	Collect tech-support and submit NVIDIA support ticket.
System	HEALTH_SUMMARY_OK	INFORMATIONAL	System	Health status is ok	System health with no issue	N/A
Transceiver-Related Events
Transceiver failure	HEALTH_NOT_OK	WARNING	sw1	"Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]"	Temperature is critically high	N/A
Transceiver failure	HEALTH_NOT_OK	WARNING	sw1	"Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]"	Temperature is critically low	N/A
Transceiver health	HEALTH_OK	INFORMATIONAL	sw1	"HW component goes back to normal"	Transceiver's temperature is good now	N/A
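
Several of the event texts in the table above embed measurements as `key=value` pairs, for example `speed=40%, range=[50,100]`. The sketch below shows one way a client might extract them. The function name `parse_details` and the regular expression are illustrative assumptions, not a documented format. The sketch does not handle the transceiver texts, which use spaces around `=` and multi-word keys.

```python
import re

# A value is either a bracketed range or a run of characters up to the next
# comma or whitespace. This pattern is an assumption based on the sample texts.
_PAIR = re.compile(r"(\w+)=(\[[^\]]*\]|[^,\s]+)")


def parse_details(text):
    """Return the key=value pairs found in an event text as a dict of strings."""
    return dict(_PAIR.findall(text))


fan = parse_details("FAN1/1 speed is out of range, speed=40%, range=[50,100]")
asic = parse_details("ASIC1 temperature is too hot, temperature=120, threshold=105")
print(fan)
print(asic)
```

Values are kept as strings because the samples mix plain numbers, percentages, and ranges; converting them is left to the caller.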

Event Category	Event Type ID ("event" in gNMI)	Severity	Resource ("component" in gNMI)	Text	Failure Reason	Suggested Repair Flow
Power-Supply-Unit-Related Events
PSU	HEALTH_NOT_OK	WARNING	<PSU-name>	<PSU-name> is missing—Unpopulated PSU slot	PSU expected to be in the system, but PSU slot is empty	Insert PSU/dummy PSU to the unpopulated PSU slot
PSU	HEALTH_NOT_OK	WARNING	<PSU-name>	<PSU-name> is out of power	PSU is out of power
PSU	HEALTH_NOT_OK	WARNING	<PSU-name>	<PSU-name> temperature is too hot, temperature={}, threshold={}	Temperature too hot
PSU	HEALTH_NOT_OK	WARNING	<PSU-name>	<PSU-name> voltage is out of range, voltage={}, range=[{},{}]	Voltage is out of range
PSU	HEALTH_NOT_OK	WARNING	<PSU-name>	<PSU-name> System power exceeds threshold ({}w)	Power exceeds threshold
PSU	HEALTH_NOT_OK	WARNING	<PSU-name>	<PSU-name> Power supply is not providing power	No power from PSU
PSU	HEALTH_OK	INFORMATIONAL	<PSU-name>	HW component goes back to normal	Health issue was resolved	N/A

Event Category	Event Type ID ("event" in gNMI)	Severity	Resource ("component" in gNMI)	Text	Failure Reason	Suggested Repair Flow
PS-Redundancy-Related Events
Power redundancy policy	HEALTH_NOT_OK	WARNING	N/A	Power redundancy policy no-redundancy requires at least <minimal-number-of-PSUs-per-system> working power supplies, currently system has only <number-of-working-PSUs>	Insufficient number of PSUs relative to the current no-redundancy policy	Insert at least the minimum number of PSUs required. To find the minimum number of PSUs, run: nv show platform ps-redundancy
Power redundancy policy	HEALTH_NOT_OK	WARNING	N/A	Power redundancy policy ps-redundant requires at least <minimal-number-of-PSUs-per-system + 1> working power supplies, currently system has only <number-of-working-PSUs>	Insufficient number of PSUs relative to the current ps-redundant policy	Insert at least one more PSU than the minimum number of PSUs required. To find the minimum number of PSUs, run: nv show platform ps-redundancy
Power redundancy policy	HEALTH_NOT_OK	WARNING	N/A	Power redundancy policy grid-redundant requires all <number-of-PSUs-in-the-system> power supplies to be working, currently system has only <number-of-working-PSUs>	Insufficient number of PSUs relative to the current grid-redundant policy	Insert PSUs to all PSU slots in the system.
Power redundancy policy	HEALTH_OK	INFORMATIONAL	N/A	HW component goes back to normal	Health issue was resolved	N/A
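
The three policy rows above imply a simple rule for how many working PSUs each policy requires. The following sketch restates that rule in code. The function names and parameters are hypothetical, and the minimal and total PSU counts are inputs that would come from the platform (for example, from the output of nv show platform ps-redundancy), not values defined here.

```python
def required_working_psus(policy, minimal_psus, total_psus):
    """Return the number of working PSUs the given policy requires."""
    if policy == "no-redundancy":
        return minimal_psus          # at least the per-system minimum
    if policy == "ps-redundant":
        return minimal_psus + 1      # one more than the minimum
    if policy == "grid-redundant":
        return total_psus            # every PSU in the system must work
    raise ValueError(f"unknown policy: {policy}")


def policy_satisfied(policy, working_psus, minimal_psus, total_psus):
    """True if the number of working PSUs meets the policy requirement."""
    return working_psus >= required_working_psus(policy, minimal_psus, total_psus)


# Example values only: a system with 4 PSU slots and a minimum of 2 PSUs.
print(required_working_psus("ps-redundant", minimal_psus=2, total_psus=4))
print(policy_satisfied("grid-redundant", working_psus=3, minimal_psus=2, total_psus=4))
```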

Event Management Commands