NVIDIA NVOS User Manual for NVLink Switches v25.02.2141

Health Monitoring

NVOS includes a health daemon that collects health events in the system and monitors various components, including hardware such as fans, power supply units, and leakage sensors, as well as failing Docker containers.

The health daemon runs every 3 seconds and analyzes these components. If it detects an issue, the issue appears in the output of nv show system health, and the daemon publishes a health event in gNMI.
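As an illustration of the cadence described above, an external script could poll the same command at the same interval and report only state changes. The following is a minimal sketch, not part of NVOS: the function names (read_health, is_healthy, watch_health) are hypothetical, and it assumes the nv CLI is on the PATH and that the text output contains a "status" row as in the examples on this page.

```python
import subprocess
import time

POLL_INTERVAL_S = 3  # same cadence as the health daemon described above


def read_health():
    """Run the CLI command from this page and return its text output."""
    result = subprocess.run(
        ["nv", "show", "system", "health"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def is_healthy(output):
    """Return True only if the 'status' row reads exactly 'OK'."""
    for line in output.splitlines():
        fields = line.split()
        # 'status-led' is a different row, so match the first field exactly.
        if fields and fields[0] == "status":
            return fields[1:] == ["OK"]
    return False  # no status row found: treat as unhealthy


def watch_health(reader=read_health, interval=POLL_INTERVAL_S,
                 max_polls=None, sleep=time.sleep):
    """Yield (poll_index, healthy) each time the health state changes."""
    previous = None
    count = 0
    while max_polls is None or count < max_polls:
        healthy = is_healthy(reader())
        if healthy != previous:
            yield count, healthy
            previous = healthy
        count += 1
        if max_polls is None or count < max_polls:
            sleep(interval)


# Demo with canned outputs instead of a real switch (hypothetical data).
canned = iter(["status      OK\n", "status      OK\n", "status      Not OK\n"])
changes = list(watch_health(reader=lambda: next(canned),
                            max_polls=3, sleep=lambda s: None))
print(changes)
```

On a real switch, calling watch_health() with its defaults would invoke the CLI every 3 seconds. Polling faster than the daemon's own interval would not surface issues any sooner.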

The health daemon monitors two main areas: Service and Hardware.

The health daemon ensures the continuous operation of essential system services. It monitors:

  • Critical Dockers and Services: Verifies that all vital dockers and services are running

  • Event Alerts: Raises alerts if any essential services are not functioning in a normal, healthy state

For hardware, the health daemon monitors:

  • Leakage Sensors: Detects any potential fluid leaks

  • Temperature Sensors: Monitors the temperature of all hardware components to prevent overheating

  • Voltage Sensors: Tracks voltage levels across hardware components to ensure proper functionality

  • Fan Speeds: Checks the speed of system fans to maintain optimal airflow and cooling

  • Power Supply Units (PSUs): Monitors the status of power supply units for stability

  • ASIC Health Status: Verifies the health of ASIC components to prevent processing issues

Example output for a healthy system:

admin@nvos:~$ nv show system health
            operational  applied
----------  -----------  -------
status      OK
status-led  green

Health issues
================
No Data
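If this output needs to be consumed by a script, the summary rows can be extracted from the captured text. The sketch below assumes the multi-line layout the CLI prints on a terminal (one attribute per line, columns separated by two or more spaces). The function name parse_health_summary is hypothetical and not part of NVOS.

```python
import re


def parse_health_summary(text):
    """Return the 'status' and 'status-led' rows printed before 'Health issues'."""
    summary = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Health issues"):
            break  # the issue table is a separate section
        # Split on runs of 2+ spaces so values with single spaces
        # (for example "Not OK") stay intact.
        parts = re.split(r"\s{2,}", stripped)
        if len(parts) >= 2 and parts[0] in ("status", "status-led"):
            summary[parts[0]] = parts[1]
    return summary


# Sample text copied from the healthy-system example on this page.
SAMPLE_HEALTHY = """\
            operational  applied
----------  -----------  -------
status      OK
status-led  green

Health issues
================
No Data
"""

summary = parse_health_summary(SAMPLE_HEALTHY)
print(summary["status"], summary["status-led"])
```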

Example output for a faulty system:

admin@nvos:~$ nv show system health
            operational  applied
----------  -----------  -------
status      Not OK
status-led  amber

Health issues
================
Component                     Status information
----------------------------  ----------------------------------------------------------
LEAKAGE-2                     detected leakage
PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
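The "Health issues" table can likewise be turned into structured records, for example to forward them to an alerting system. This is a sketch under assumptions: the table is taken to have two columns (Component and Status information) separated by two or more spaces, and numeric readings are taken to appear as key=value pairs, as in the example above. The names parse_health_issues and extract_readings are hypothetical.

```python
import re


def parse_health_issues(text):
    """Return (component, status_information) pairs from the issues table."""
    issues = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_table:
            # Rows begin after the column header line.
            if stripped.startswith("Component"):
                in_table = True
            continue
        # Skip blank lines and the dashed rule under the headers.
        if not stripped or set(stripped) <= {"-", " "}:
            continue
        parts = re.split(r"\s{2,}", stripped, maxsplit=1)
        if len(parts) == 2:
            issues.append((parts[0], parts[1]))
    return issues


def extract_readings(info):
    """Pull numeric key=value pairs out of a status string."""
    pairs = re.findall(r"(\w+)=([-+]?\d+(?:\.\d+)?)", info)
    return {key: float(value) for key, value in pairs}


# Sample text copied from the faulty-system example on this page.
SAMPLE_FAULTY = """\
Health issues
================
Component                     Status information
----------------------------  ----------------------------------------------------------
LEAKAGE-2                     detected leakage
PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
"""

for component, info in parse_health_issues(SAMPLE_FAULTY):
    print(component, "|", info, "|", extract_readings(info))
```

A healthy system prints "No Data" instead of a table, in which case parse_health_issues returns an empty list.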

© Copyright 2025, NVIDIA. Last updated on Apr 23, 2025.