Date: 2025-09-08
NVIDIA NVOS User Manual for InfiniBand Switches v25.02.5002
Event Management

The event management feature reports specific actions (such as port disable) and events (such as port fast-recovery) in a standardized format, together with a description, so that users can monitor the system state remotely.

Note

In the current release, the list of broadcast events can be accessed only through the CLI. In an upcoming release, gNMI will be extended to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.

Supported Events

The following table lists the supported events with their descriptions.

| Resource | Event Description | Severity |
| --- | --- | --- |
| System | System fatal state detected | CRITICAL |
| System | System is not ready—one or more services are not up | CRITICAL |
| System | System is not ready—one or more services failed | MAJOR |
| System | Restart all syncd-ibv0 dockers | MAJOR |
| System | Performing reboot | MAJOR |
| System | Health status is not ok | WARNING |
| System | System is ready | INFORMATIONAL |
| System | System recovered from fatal state | INFORMATIONAL |
| System | Health status is ok | INFORMATIONAL |
| Sensor or service name | <Repeats a message from the system health> | WARNING |
| Sensor or service name | Hardware component goes back to normal / Service goes back to normal | INFORMATIONAL |
| Interface name | Interface admin state is {Up/Down} | INFORMATIONAL |
| Interface name | Interface operational state is {Up/Down} | INFORMATIONAL |
| Interface name | Fast-recovery error event for trigger {trigger_name} was received | INFORMATIONAL |
| System | System reboot occurred; reason: <reboot cause>; performed by user: <user who performed reboot>; reboot time: <time when reboot occurred>. For a list of reboot causes, see Possible Reboot Causes. | INFORMATIONAL |
| ASIC name | PSC detected failure | MAJOR |
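
When these events are consumed from a script, it can help to rank the four severities so that alerts can be filtered by a minimum level. The sketch below assumes the ordering CRITICAL > MAJOR > WARNING > INFORMATIONAL, which is inferred from the order of the table and is not stated by the manual. The `Event` tuple and the `at_least` helper are illustrative names, not part of NVOS.

```python
from enum import IntEnum
from typing import NamedTuple


class Severity(IntEnum):
    # Assumed ranking, inferred from the order of the table above.
    INFORMATIONAL = 0
    WARNING = 1
    MAJOR = 2
    CRITICAL = 3


class Event(NamedTuple):
    resource: str
    description: str
    severity: Severity


def at_least(events, minimum):
    """Return the events whose severity is at or above `minimum`."""
    return [e for e in events if e.severity >= minimum]


events = [
    Event("System", "System fatal state detected", Severity.CRITICAL),
    Event("System", "Performing reboot", Severity.MAJOR),
    Event("System", "Health status is not ok", Severity.WARNING),
    Event("System", "System is ready", Severity.INFORMATIONAL),
]

urgent = at_least(events, Severity.MAJOR)
print([e.description for e in urgent])
```

With the sample list above, only the CRITICAL and MAJOR events remain in `urgent`.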

Clearing Events

Receiving Clearing Events

For most WARNING and CRITICAL events, the system sends a "Clearing" event once the issue is resolved. If a component experiences two or more issues, the system sends an event for the most recent issue; once all issues are resolved, it sends a "Clearing" event for that most recent issue only. Receiving a "Clearing" event therefore means that all issues for the component are resolved.

Example:

  • PSU2 is out of power

  • PSU2 is missing or not available

  • Cleared: PSU2 is missing or not available
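
On the receiving side, this rule means a client only needs to remember the latest open issue per component and can drop it when a "Clearing" event arrives. The following is a minimal sketch of such a client-side tracker. It assumes events arrive as a (component, text) pair and that clearing events start with the prefix `Cleared: ` as in the example above; the class and method names are hypothetical.

```python
CLEARED_PREFIX = "Cleared: "  # prefix taken from the example above


class ComponentHealth:
    """Tracks the latest open issue per component from a stream of events."""

    def __init__(self):
        self.open_issue = {}  # component name -> text of the latest issue

    def handle(self, component, text):
        if text.startswith(CLEARED_PREFIX):
            # A "Clearing" event means every issue for the component is resolved.
            self.open_issue.pop(component, None)
        else:
            # Only the most recent issue is reported, so overwrite the old one.
            self.open_issue[component] = text

    def is_healthy(self, component):
        return component not in self.open_issue


tracker = ComponentHealth()
tracker.handle("PSU2", "PSU2 is out of power")
tracker.handle("PSU2", "PSU2 is missing or not available")
print(tracker.is_healthy("PSU2"))  # False: an issue is still open
tracker.handle("PSU2", "Cleared: PSU2 is missing or not available")
print(tracker.is_healthy("PSU2"))  # True: the clearing event resolves all issues
```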

Resending Unresolved Issues

If one of the earlier issues for the component still exists after the most recent one is resolved, the system resends the issue that still exists.

Example:

  • PSU2 is out of power

  • PSU2 is missing or not available

  • PSU2 is out of power
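
The clearing and resending rules can be illustrated with a toy model of a single component. This is only a sketch of the behavior described in this section, not the NVOS implementation: the class name, the method names, and the choice to send nothing when an issue other than the last reported one is resolved first are all assumptions.

```python
class IssueReporter:
    """Toy model of the per-component reporting rules described above."""

    def __init__(self):
        self.active = []       # unresolved issues, oldest first
        self.last_sent = None  # text of the most recently sent issue
        self.sent = []         # every event emitted, in order

    def _send(self, text):
        self.sent.append(text)

    def raise_issue(self, issue):
        if issue not in self.active:
            self.active.append(issue)
        self.last_sent = issue
        self._send(issue)

    def resolve(self, issue):
        if issue not in self.active:
            return
        self.active.remove(issue)
        if not self.active:
            # Everything is resolved: clear only the last reported issue.
            self._send("Cleared: " + self.last_sent)
            self.last_sent = None
        elif issue == self.last_sent:
            # The last reported issue is gone but an earlier one remains: resend it.
            self.last_sent = self.active[-1]
            self._send(self.last_sent)


psu2 = IssueReporter()
psu2.raise_issue("PSU2 is out of power")
psu2.raise_issue("PSU2 is missing or not available")
psu2.resolve("PSU2 is missing or not available")
print(psu2.sent)
```

Run as written, the model emits the same three events as the example above: the two issues in order, followed by a resend of "PSU2 is out of power".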

Backward Compatibility

For backward compatibility, when an issue for a component is cleared, the system also sends the generic event. Avoid relying on generic messages, as they will be dropped in future releases.

Example:

  • Cleared: PSU2/FAN speed is out of range

  • Hardware component goes back to normal
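
A client that wants to be ready for the removal of generic messages can filter them out and rely on the specific "Cleared:" events only. The sketch below assumes the generic texts are exactly the ones quoted in this document; the function name and the list-of-strings input are illustrative.

```python
# Generic texts quoted in this document; treated here as the legacy messages.
GENERIC_TEXTS = {
    "Hardware component goes back to normal",
    "HW component goes back to normal",
    "Service goes back to normal",
}


def drop_generic(texts):
    """Keep specific events and discard the generic back-to-normal messages."""
    return [t for t in texts if t.strip('"') not in GENERIC_TEXTS]


received = [
    "Cleared: PSU2/FAN speed is out of range",
    "Hardware component goes back to normal",
]
print(drop_generic(received))
```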

System Reboot

After a reboot, the system does not send "Clearing" events for issues raised before the reboot; all pre-boot issues are assumed to be cleared.

Detailed Table of Events

Fan-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 speed is out of range, speed=40%, range=[50,100]" | Fan speed out of range | • Collect tech-support and submit NVIDIA support ticket. • Consider the number of faulty fans: more than one faulty fan requires immediate maintenance. • Power-cycle the switch. • If the issue persists, submit NVIDIA support ticket to replace the fan module. |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is not working" | Fan status in hardware is not OK | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get actual speed data for FAN1/1" | Failed to get some fan data from hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get target speed data for FAN1/1" | Failed to get some fan data from hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed tolerance for FAN1/1" | Failed to get some fan data from hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Failed to get speed status for FAN1/1" | Failed to get some fan data from hardware | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 is missing" | Fan is missing | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "FAN1/1 direction is not aligned with exhaust direction intake" | Fan direction is not aligned with other fans | |
| Fan failure | HEALTH_NOT_OK | WARNING | FAN1/1 | "Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" | Invalid speed | |
| Fan health | HEALTH_OK | INFORMATIONAL | FAN1/1 | "HW component goes back to normal" | Fan is back to normal state | N/A |

ASIC-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| ASIC failure | HEALTH_NOT_OK | WARNING | ASIC-HEALTH | Switch ASIC in fatal state | ASIC in fatal state | • Collect tech-support and submit NVIDIA support ticket. • Reboot the system. • If the issue persists, power-cycle the system. |
| ASIC failure | SYSTEM_FATAL_DETECTED | CRITICAL | System | System fatal state detected | ASIC detected in fatal state | |
| ASIC failure | HEALTH_NOT_OK | WARNING | ASIC1 | ASIC1 temperature is too hot, temperature=120, threshold=105 | ASIC temperature too high | • Collect tech-support and submit NVIDIA support ticket. • Continue to monitor switch temperature. |
| ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Restart all syncd-ibv0 dockers | ASIC in fatal state, performing reboot of dockers | N/A |
| ASIC failure | SYSTEM_FATAL_REMEDY | MAJOR | System | Performing reboot | ASIC in fatal state, performing reboot | |
| ASIC health | HEALTH_OK | INFORMATIONAL | ASIC1 | "HW component goes back to normal" | ASIC1 is back to normal state | |
| ASIC health | SYSTEM_FATAL_RECOVERED | INFORMATIONAL | System | System recovered from fatal state | Recovered from fatal state | |
| ASIC security irregularity | ASIC_SECURITY_IRREGULARITY | MAJOR | ASIC1 | "PSC detected failure" | The system has detected an irregularity in physical monitoring | N/A |

Leakage-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Leakage | LEAKAGE | CRITICAL | LEAKAGE-1 | Leakage detected, inspect for water leakage and consider power down switch tray | Detected leakage | • Collect tech-support and submit NVIDIA support ticket. • For additional instructions, refer to NVONLINE 1115991, chapter "NVIDIA MGX Leak Detection Strategy and Remediation". NOTE: Relevant only for liquid-cooled systems. |
| Leakage | HEALTH_NOT_OK | WARNING | LEAKAGE-1 | LEAKAGE-1 detected leakage | Detected leakage | |

Voltage-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor voltage is out of range, voltage={}, range=[{},{}] | Voltage sensor not in range | • Collect tech-support and submit NVIDIA support ticket. • Power-cycle the switch. • If the issue persists, consider replacing the system. |
| Voltage | HEALTH_NOT_OK | WARNING | <Voltage-sensor-name> | Sensor status is failed | Voltage sensor status in hardware is failed | |
| Voltage | HEALTH_OK | INFORMATIONAL | <Voltage-sensor-name> | "HW component goes back to normal" | Voltage sensor value is back to normal state | N/A |

Temperature-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | <Temp-sensor-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | • Collect tech-support and submit NVIDIA support ticket. • Power-cycle the switch. • If the issue persists, check whether the sensor is Ambient-MNG-Temp. If it is, check the environmental conditions (CDU and DC temperature). • If the issue persists, consider replacing the system. |
| Temperature | HEALTH_NOT_OK | WARNING | <Temp-sensor-name> | Sensor status is failed | Sensor status in hardware is failed | |
| Temperature | HEALTH_OK | INFORMATIONAL | <Temp-sensor-name> | "HW component goes back to normal" | Temperature sensor value is back to normal state | N/A |

System-Services-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Services | HEALTH_NOT_OK | WARNING | <container-name> | Container '<container-name>' is not running | Container is not running | Collect tech-support and submit NVIDIA support ticket. |
| Services | HEALTH_OK | INFORMATIONAL | <container-name> | "Service goes back to normal" | Service goes back to normal state | N/A |
| Services | CPU_USAGE | WARNING | | CPU usage x% is above expected threshold y% | CPU usage is higher than expected | • Collect tech-support and submit NVIDIA support ticket. |
| Services | MEMORY_USAGE | WARNING | | Memory usage x% is above expected threshold y% | Memory usage is higher than expected | • Container will be restarted automatically. • If the issue persists, collect tech-support and submit NVIDIA support ticket. |
| Services | CPU_USAGE | INFORMATIONAL | | CPU usage is back to normal | CPU usage is back to normal state | N/A |
| Services | MEMORY_USAGE | INFORMATIONAL | | Memory usage is back to normal | Memory usage is back to normal state | N/A |

System-Initialization-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Init flow | SYSTEM_STATE_DOWN | CRITICAL | System | System is not ready—one or more services are not up | Some services are not up as part of init | • Collect tech-support and submit NVIDIA support ticket. • If the sensor is Ambient-MNG-Temp, check environmental conditions (CDU, if present, and DC temperature). • If the issue persists, power-cycle the switch. • If the issue still persists, replace the switch. |
| Init flow | SYSTEM_STATE_FAILED | MAJOR | System | System is not ready—one or more services failed | Some services failed as part of init | |
| Init flow | SYSTEM_STATE_UP | INFORMATIONAL | System | System is ready | System finished initialization and is ready | N/A |

Interface-Related Informational Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Interface | INTERFACE_ADMIN_STATUS | INFORMATIONAL | <interface_name> | "Interface admin state is {admin_state}" | Informs of admin state change of interface | N/A |
| Interface | INTERFACE_OPER_STATUS | INFORMATIONAL | <interface_name> | "Interface operational state is {up or down}" | Informs of operational state change of interface | N/A |
| Interface | INTERFACE_LOGICAL_STATE | INFORMATIONAL | <interface_name> | "Interface logical state is {logical_state}" | Informs of logical state change of interface | N/A |

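System-Health-Related Events

(The events below are summary events that accompany the specific errors detailed above.)

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| System | HEALTH_SUMMARY_NOT_OK_CRITICAL | CRITICAL | System | Health status is not ok | At least one critical health-not-OK event exists (e.g., leakage) | Collect tech-support and submit NVIDIA support ticket. |
| System | HEALTH_SUMMARY_NOT_OK | WARNING | System | Health status is not ok | At least one warning-level health-not-OK event exists | |
| System | HEALTH_SUMMARY_OK | INFORMATIONAL | System | Health status is ok | System health has no issues | N/A |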

Transceiver-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]" | Temperature is critically high | N/A |
| Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]" | Temperature is critically low | N/A |
| Transceiver health | HEALTH_OK | INFORMATIONAL | sw1 | "HW component goes back to normal" | Transceiver's temperature is back within range | N/A |
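
Several of the event texts above embed measured values as `name=value` pairs, for example `speed=40%, range=[50,100]`. A monitoring script may want those numbers rather than the raw string. The sketch below is one way to extract them, assuming the texts follow the sample formats shown in the table. The function name and regular expression are illustrative, and the pattern does not cover the transceiver format (`[actual = 100, high threshold = 80]`).

```python
import re

# Matches "name=value" pairs where the value is a number (optionally with %)
# or a bracketed numeric range such as [50,100].
PAIR = re.compile(r"(\w+)=(\[[\d.,\s-]+\]|-?\d+(?:\.\d+)?%?)")


def parse_measurements(text):
    """Extract name=value pairs from an event text into a dict of floats."""
    result = {}
    for name, raw in PAIR.findall(text):
        if raw.startswith("["):
            # Bracketed range: split on commas and convert each bound.
            result[name] = [float(x) for x in raw[1:-1].split(",")]
        else:
            # Plain number, possibly followed by a percent sign.
            result[name] = float(raw.rstrip("%"))
    return result


fan = parse_measurements("FAN1/1 speed is out of range, speed=40%, range=[50,100]")
asic = parse_measurements("ASIC1 temperature is too hot, temperature=120, threshold=105")
print(fan)
print(asic)
```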

Power-Supply-Unit-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> is missing—Unpopulated PSU slot | PSU expected to be in the system, but PSU slot is empty | Insert PSU/dummy PSU to the unpopulated PSU slot |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> is out of power | PSU is out of power | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> voltage is out of range, voltage={}, range=[{},{}] | Voltage is out of range | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> System power exceeds threshold ({}w) | Power exceeds threshold | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> Power supply is not providing power | No power from PSU | |
| PSU | HEALTH_OK | INFORMATIONAL | <PSU-name> | HW component goes back to normal | Health issue was resolved | N/A |

PS-Redundancy-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
| --- | --- | --- | --- | --- | --- | --- |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy no-redundancy requires at least <minimal-number-of-PSUs-per-system> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current no-redundancy policy | Insert at least the minimal number of PSUs required. The minimal number of PSUs can be found using the following command: nv show platform ps-redundancy |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy ps-redundant requires at least <minimal-number-of-PSUs-per-system + 1> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current ps-redundant policy | Insert at least one more PSU than the minimal number of PSUs required. The minimal number of PSUs can be found using the following command: nv show platform ps-redundancy |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy grid-redundant requires all <number-of-PSUs-in-the-system> power supplies to be working, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current grid-redundant policy | Insert PSUs to all PSU slots in the system. |
| Power redundancy policy | HEALTH_OK | INFORMATIONAL | N/A | HW component goes back to normal | Health issue was resolved | N/A |
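
The three policy texts above imply a simple rule for how many working PSUs each policy requires: the minimal number for no-redundancy, one more than the minimal number for ps-redundant, and every PSU in the system for grid-redundant. The sketch below encodes that rule as read from the event texts. The function names and the sample numbers are hypothetical, and on a real system the minimal number would come from `nv show platform ps-redundancy`.

```python
def required_working_psus(policy, minimal, total):
    """Return how many working PSUs a policy requires, per the event texts above."""
    if policy == "no-redundancy":
        return minimal
    if policy == "ps-redundant":
        return minimal + 1
    if policy == "grid-redundant":
        return total
    raise ValueError(f"unknown policy: {policy}")


def policy_satisfied(policy, minimal, total, working):
    """True when the number of working PSUs meets the policy requirement."""
    return working >= required_working_psus(policy, minimal, total)


# Hypothetical system: 4 PSU slots, minimal number 2, 3 PSUs currently working.
print(policy_satisfied("no-redundancy", 2, 4, 3))   # True
print(policy_satisfied("ps-redundant", 2, 4, 3))    # True
print(policy_satisfied("grid-redundant", 2, 4, 3))  # False
```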

Event Management Commands
