Health Monitoring
NVOS includes a health daemon that collects health events in the system and monitors its components. These include hardware such as fans, power supply units, and leakage sensors, as well as failing Docker containers.
The health daemon runs every 3 seconds and analyzes the components. If it detects an issue, the issue appears in the output of nv show system health, and the daemon publishes a health event over gNMI.
The health daemon monitors two main areas: services and hardware.
The health daemon ensures the continuous operation of essential system services. It monitors:
Critical Dockers and Services: Verifies that all vital dockers and services are running
Event Alerts: Raises alerts if any essential services are not functioning in a normal, healthy state
For hardware, the health daemon monitors:
Leakage Sensors: Detects any potential fluid leaks
Temperature Sensors: Monitors the temperature of all hardware components to prevent overheating
Voltage Sensors: Tracks voltage levels across hardware components to ensure proper functionality
Fan Speeds: Checks the speed of system fans to maintain optimal airflow and cooling
Power Supply Units (PSUs): Monitors the status of power supply units for stability
ASIC Health Status: Verifies the health of ASIC components to prevent processing issues
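The threshold-style checks in the list above can be illustrated with a short Python sketch. This is not the NVOS implementation. The names SensorReading, check_temperature, and collect_issues are hypothetical, the "ASIC Temp" reading is made up, and the strict greater-than comparison is an assumption. The message format is modeled on the sample output for a faulty system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SensorReading:
    """Hypothetical container for one sensor sample (not an NVOS type)."""
    component: str
    value: float
    threshold: float


def check_temperature(reading: SensorReading) -> Optional[str]:
    """Return an issue description if the reading exceeds its threshold, else None.

    Assumption: a reading is unhealthy only when it is strictly above the threshold.
    """
    if reading.value > reading.threshold:
        return (f"temperature is too hot, temperature={reading.value}, "
                f"threshold={reading.threshold}")
    return None


def collect_issues(readings):
    """Run the check over all readings and keep only components with an issue."""
    issues = {}
    for reading in readings:
        message = check_temperature(reading)
        if message is not None:
            issues[reading.component] = message
    return issues


readings = [
    SensorReading("PMIC 1 Temp", 2008.0, 125.0),
    SensorReading("ASIC Temp", 60.0, 105.0),  # made-up healthy reading
]
issues = collect_issues(readings)
summary = "OK" if not issues else "Not OK"
print(summary, issues)
```

In this sketch, a single failing component is enough to turn the overall summary from OK to Not OK, which mirrors how the status field behaves in the example outputs.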
Example output for a healthy system:
admin@nvos:~$ nv show system health
            operational  applied
----------  -----------  -------
status      OK
status-led  green

Health issues
================
No Data
Example output for a faulty system:
admin@nvos:~$ nv show system health
            operational  applied
----------  -----------  -------
status      Not OK
status-led  amber

Health issues
================
Component                     Status information
----------------------------  ----------------------------------------------------------
LEAKAGE-2                     detected leakage
PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
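For scripted checks, the text output of nv show system health could be parsed along the following lines. This is a minimal sketch, not an NVOS tool. The function name parse_health is hypothetical, and the parser assumes that the CLI separates columns with two or more spaces. The sample text is embedded in the script so that it runs without access to a switch.

```python
import re

# Sample text modeled on the faulty-system output; column widths are illustrative.
SAMPLE = """\
            operational  applied
----------  -----------  -------
status      Not OK
status-led  amber

Health issues
================
Component    Status information
-----------  -----------------------------------------------------------
LEAKAGE-2    detected leakage
PMIC 1 Temp  temperature is too hot, temperature=2008.0, threshold=125.0
"""


def parse_health(text):
    """Parse health output text into a dict.

    Assumption: columns are separated by two or more spaces.
    """
    result = {"status": None, "status-led": None, "issues": {}}
    in_issues = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) <= set("-= "):
            continue  # skip blank lines and ruler lines
        if line == "Health issues":
            in_issues = True
            continue
        fields = re.split(r"\s{2,}", line)
        if not in_issues:
            # Summary section: keep only the status and status-led rows.
            if fields[0] in ("status", "status-led") and len(fields) > 1:
                result[fields[0]] = fields[1]
        elif line != "No Data" and fields[0] != "Component" and len(fields) > 1:
            # Issues section: one row per faulty component.
            result["issues"][fields[0]] = fields[1]
    return result


health = parse_health(SAMPLE)
print(health["status"], health["status-led"])
for component, info in health["issues"].items():
    print(f"{component}: {info}")
```

On a healthy system the issues section contains only No Data, so the returned issues dictionary is empty.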