NVIDIA NVOS User Manual for NVLink Switches v25.02.2141

Event Management

This feature reports specific actions (such as a port disable) and events (such as a port fast-recovery) by broadcasting them in a standardized format, each with a description. The goal is to simplify remote monitoring of system state.

Note

In the current release, the list of broadcast events can be accessed only through the CLI. In an upcoming release, gNMI support will be extended to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.

The following table lists the supported events with their descriptions.

| Resource               | Event Description                                                     | Severity      |
|------------------------|-----------------------------------------------------------------------|---------------|
| System                 | System fatal state detected                                           | CRITICAL      |
| System                 | System is not ready—one or more services are not up                   | CRITICAL      |
| System                 | System is not ready—one or more services failed                       | MAJOR         |
| System                 | Restart all syncd-ibv0 dockers                                        | MAJOR         |
| System                 | Performing reboot                                                     | MAJOR         |
| System                 | Health status is not ok                                               | WARNING       |
| System                 | System is ready                                                       | INFORMATIONAL |
| System                 | System recovered from fatal state                                     | INFORMATIONAL |
| System                 | Health status is ok                                                   | INFORMATIONAL |
| Sensor or service name | <Repeats a message from the system health>                            | WARNING       |
| Sensor or service name | Hardware component goes back to normal / Service goes back to normal  | INFORMATIONAL |
| Interface name         | Interface admin state is {Up/Down}                                    | INFORMATIONAL |
| Interface name         | Interface operational state is {Up/Down}                              | INFORMATIONAL |
| Interface name         | Fast-recovery error event for trigger {trigger_name} was received     | INFORMATIONAL |
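
When triaging events, the severity levels in the table above can be treated as an ordered scale. The following is a minimal Python sketch, not part of NVOS: it assumes the events have already been collected as (resource, description, severity) records, and it assumes the ordering CRITICAL > MAJOR > WARNING > INFORMATIONAL, which is inferred from the names and not stated by the manual. The names `Event`, `SEVERITY_RANK`, and `at_least` are hypothetical.

```python
from typing import NamedTuple

# Assumed ranking, inferred from the severity names; the manual does not define an order.
SEVERITY_RANK = {"INFORMATIONAL": 0, "WARNING": 1, "MAJOR": 2, "CRITICAL": 3}


class Event(NamedTuple):
    """One row of the supported-events table (hypothetical container)."""
    resource: str
    description: str
    severity: str


def at_least(events, minimum):
    """Return events at or above `minimum` severity, most severe first."""
    floor = SEVERITY_RANK[minimum]
    selected = [e for e in events if SEVERITY_RANK[e.severity] >= floor]
    return sorted(selected, key=lambda e: SEVERITY_RANK[e.severity], reverse=True)


# Sample records copied from the table above.
sample = [
    Event("System", "System is ready", "INFORMATIONAL"),
    Event("System", "Health status is not ok", "WARNING"),
    Event("System", "Performing reboot", "MAJOR"),
    Event("System", "System fatal state detected", "CRITICAL"),
]

for event in at_least(sample, "WARNING"):
    print(f"{event.severity:<13} {event.resource}: {event.description}")
```

Keeping the ranking in a single dictionary means a different ordering can be substituted in one place if the assumed order does not match a given deployment's policy.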

Event Category

Event Type ID ("event" in gNMI)

Severity

Resource ("component" in gNMI)

Text

Failure Reason

Suggested Repair Flow

Fan-Related Events

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"FAN1/1 speed is out of range, speed=40%, range=[50,100]"

Fan speed out of range

  • Collect tech-support and submit NVIDIA support ticket.

  • Check how many fans are faulty: more than one faulty fan requires immediate maintenance.

  • Power-cycle the switch.

  • If the issue persists, submit an NVIDIA support ticket to replace the fan module.

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"FAN1/1 is not working"

Fan status is not okay (status in the hardware)

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"Failed to get actual speed data for FAN1/1"

Failed to read part of the fan data from hardware

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"Failed to get target speed data for FAN1/1"

Failed to read part of the fan data from hardware

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"Failed to get speed tolerance for FAN1/1"

Failed to read part of the fan data from hardware

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"Failed to get speed status for FAN1/1"

Failed to read part of the fan data from hardware

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"FAN1/1 is missing"

Fan is missing

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"FAN1/1 direction is not aligned with exhaust direction intake"

Fan direction is not aligned with other fans

Fan failure

HEALTH_NOT_OK

WARNING

FAN1/1

"Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100"

Invalid speed

Fan health

HEALTH_OK

INFORMATIONAL

FAN1/1

“HW component goes back to normal”

Fan is back to normal state

N/A

ASIC-Related Events

ASIC failure

HEALTH_NOT_OK

WARNING

ASIC-HEALTH

Switch ASIC in fatal state

ASIC is in fatal state

  • Verify that the switch and compute trays run the correct software and firmware bundle recipe.

  • Collect tech-support and submit NVIDIA support ticket.

  • Reboot the system.

  • If the issue persists, power-cycle the system.

ASIC failure

SYSTEM_FATAL_DETECTED

CRITICAL

System

System fatal state detected

ASIC detected in fatal state

ASIC failure

HEALTH_NOT_OK

WARNING

ASIC1

ASIC1 temperature is too hot, temperature=120, threshold=105

ASIC temperature is too high

  • Collect tech-support and submit NVIDIA support ticket.

  • Continue to monitor switch temperature.

ASIC failure

SYSTEM_FATAL_REMEDY

MAJOR

System

Restart all syncd-ibv0 dockers

ASIC in fatal state; performing reboot of dockers

N/A

ASIC failure

SYSTEM_FATAL_REMEDY

MAJOR

System

Performing reboot

ASIC in fatal state; performing reboot

ASIC health

HEALTH_OK

INFORMATIONAL

ASIC1

“HW component goes back to normal”

ASIC1 is back to normal state

ASIC health

SYSTEM_FATAL_RECOVERED

INFORMATIONAL

System

System recovered from fatal state

Recovered from fatal state

Leakage-Related Events

Leakage

LEAKAGE

CRITICAL

LEAKAGE-1

Leakage detected, inspect for water leakage and consider power down switch tray

Detected leakage

  • Collect tech-support and submit NVIDIA support ticket.

  • For additional instructions, refer to NVONLINE 1115991, chapter "NVIDIA MGX Leak Detection Strategy and Remediation".

Note

Relevant only for liquid-cooled systems.

Leakage

HEALTH_NOT_OK

WARNING

LEAKAGE-1

LEAKAGE-1 detected leakage

Detected leakage

Voltage-Related Events

Voltage

HEALTH_NOT_OK

WARNING

<Voltage-sensor-name>

Sensor voltage is out of range, voltage={}, range=[{},{}]

Voltage sensor not in range

  • Collect tech-support and submit NVIDIA support ticket.

  • Power-cycle the switch.

  • If the issue persists and the sensor is one of the following, check the busbar power supply: HSC-VinDC-In, PDB-1-Conv-In-1, PDB-2-Conv-In-1, PDB-3-Conv-In-1, PDB-4-Conv-In-1.

  • If the issue still persists, consider replacing the system.

Voltage

HEALTH_NOT_OK

WARNING

<Voltage-sensor-name>

Sensor status is failed

Voltage sensor status in hardware is failed

Voltage

HEALTH_OK

INFORMATIONAL

<Voltage-sensor-name>

“HW component goes back to normal”

Voltage sensor value is back to normal state

N/A

Temperature-Related Events

Temperature

HEALTH_NOT_OK

WARNING

<Temp-sensor-name>

<Temp-sensor-name> temperature is too hot, temperature={}, threshold={}

Temperature too hot

  • Collect tech-support and submit NVIDIA support ticket.

  • Power-cycle the switch.

  • If the issue persists and the sensor is Ambient-MNG-Temp, check the environmental conditions (CDU and DC temperature).

  • If the issue still persists, consider replacing the system.

Temperature

HEALTH_NOT_OK

WARNING

<Temp-sensor-name>

Sensor status is failed

Sensor status in hardware is failed

Temperature

HEALTH_OK

INFORMATIONAL

<Temp-sensor-name>

“HW component goes back to normal”

Temperature sensor value is back to normal state

N/A

System-Services-Related Events

services

HEALTH_NOT_OK

WARNING

<container-name>

Container '<container-name>' is not running

Container is not running

Collect tech-support and submit NVIDIA support ticket.

services

HEALTH_OK

INFORMATIONAL

<container-name>

“Service goes back to normal”

Service goes back to normal state

N/A

System-Initialization-Related Events

Init flow

SYSTEM_STATE_DOWN

CRITICAL

System

System is not ready—one or more services are not up

Some services are not up as part of init

  • Collect tech-support and submit NVIDIA support ticket.

  • If the sensor is Ambient-MNG-Temp:

  • Check environmental conditions (CDU, if present, and DC temperature).

  • If the issue persists, power-cycle the switch.

  • If the issue still persists, replace the switch.

Init flow

SYSTEM_STATE_FAILED

MAJOR

System

System is not ready—one or more services failed

Some services failed as part of init

Init flow

SYSTEM_STATE_UP

INFORMATIONAL

System

System is ready

System finished initialization and is ready

N/A

Interface-Related Informational Events

interface

INTERFACE_ADMIN_STATUS

INFORMATIONAL

<interface_name>

"Interface admin state is {admin_state}"

Informs of admin state change of interface

N/A

interface

INTERFACE_OPER_STATUS

INFORMATIONAL

<interface_name>

"Interface operational state is {up or down}"

Informs of operational state change of interface

N/A

interface

INTERFACE_LOGICAL_STATE

INFORMATIONAL

<interface_name>

"Interface logical state is {logical_state}"

Informs of logical state change of interface

N/A

System-Health-Related Events

(The events below are summary events that accompany the specific errors detailed above.)

system

HEALTH_SUMMARY_NOT_OK_CRITICAL

CRITICAL

System

Health status is not ok

At least one critical health-not-okay event is active (e.g. leakage)

Collect tech-support and submit NVIDIA support ticket.

system

HEALTH_SUMMARY_NOT_OK

WARNING

System

Health status is not ok

At least one warning-level health-not-okay event is active

System

HEALTH_SUMMARY_OK

INFORMATIONAL

System

Health status is ok

System health with no issue

N/A

Transceiver-Related Events

Transceiver failure

HEALTH_NOT_OK

WARNING

sw1

"Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]"

Temperature is critically high

N/A

Transceiver failure

HEALTH_NOT_OK

WARNING

sw1

"Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]"

Temperature is critically low

N/A

Transceiver health

HEALTH_OK

INFORMATIONAL

sw1

"HW component goes back to normal"

Transceiver temperature is back within the normal range

N/A
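
Several Text values in the table above embed details as key=value pairs, for example "FAN1/1 speed is out of range, speed=40%, range=[50,100]" and "ASIC1 temperature is too hot, temperature=120, threshold=105". The following Python sketch shows one way a monitoring script might extract those details. It is an illustration under stated assumptions, not an NVOS API: the helper names `parse_details` and `is_out_of_range` are hypothetical, the regular expression is inferred only from the sample messages shown here, and it does not handle the bracketed transceiver format ("[actual = 100, high threshold = 80]").

```python
import re

# A value is either a bracketed list such as [50,100] or a run of
# characters up to the next comma or whitespace. Pattern inferred from
# the sample messages in this document only.
_PAIR = re.compile(r"(\w+)=(\[[^\]]*\]|[^,\s]+)")


def parse_details(text):
    """Extract key=value details from an event Text string into a dict.

    Scalar values are kept as raw strings; bracketed values become a
    list of strings so the caller decides how to convert them.
    """
    details = {}
    for key, raw in _PAIR.findall(text):
        if raw.startswith("["):
            details[key] = [part.strip() for part in raw[1:-1].split(",")]
        else:
            details[key] = raw
    return details


def is_out_of_range(details, key):
    """Check a numeric detail against the accompanying `range` detail."""
    value = float(details[key].rstrip("%"))
    low, high = (float(bound) for bound in details["range"])
    return not (low <= value <= high)


fan_text = "FAN1/1 speed is out of range, speed=40%, range=[50,100]"
fan = parse_details(fan_text)
print(fan)
print(is_out_of_range(fan, "speed"))
```

Values are left as strings on purpose: messages such as "Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" contain values that are not valid numbers, so numeric conversion is better done by the caller where a failure can be handled.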

© Copyright 2025, NVIDIA. Last updated on Apr 23, 2025.