NVIDIA NVOS User Manual for NVLink Switches v25.02.2141

Health Monitoring

NVOS includes a health daemon that collects health events and monitors system components. These include hardware such as fans, power supply units, and leakage sensors, as well as docker containers that have failed.

The health daemon runs every 3 seconds and analyzes each monitored component. If it detects an issue, the issue appears in the output of nv show system health, and the daemon publishes a health event in gNMI.
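The polling behavior described above can be pictured with a small conceptual sketch. This is not the NVOS implementation: the `Check` signature, `run_cycle`, `monitor`, and the `publish` callback are all hypothetical names chosen for illustration. Only the 3-second interval comes from this section.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical check signature: returns None when the component is healthy,
# or a short status message when it is not.
Check = Callable[[], Optional[str]]

POLL_INTERVAL_SECONDS = 3  # interval stated in this section


def run_cycle(checks: Dict[str, Check]) -> Dict[str, str]:
    """Run every check once; return {component: status information} for failures."""
    issues: Dict[str, str] = {}
    for component, check in checks.items():
        message = check()
        if message is not None:
            issues[component] = message
    return issues


def monitor(
    checks: Dict[str, Check],
    publish: Callable[[Dict[str, str]], None],
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run `cycles` polling passes, handing each pass's issues to `publish`."""
    for _ in range(cycles):
        publish(run_cycle(checks))
        sleep(POLL_INTERVAL_SECONDS)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. In this model, `publish` stands in for whatever surfaces the issues to the CLI and to gNMI subscribers.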

The health daemon monitors two main categories: services and hardware.

Service Monitoring

The health daemon ensures the continuous operation of essential system services. It monitors:

Critical Dockers and Services: Verifies that all vital dockers and services are running

Hardware Monitoring

Leakage Sensors: Detects any potential fluid leaks
Temperature Sensors: Monitors the temperature of all hardware components to prevent overheating
Voltage Sensors: Tracks voltage levels across hardware components to ensure proper functionality
Fan Speeds: Checks the speed of system fans to maintain optimal airflow and cooling
Power Supply Units (PSUs): Monitors the status of power supply units for stability
ASIC Health Status: Verifies the health of ASIC components to prevent processing issues
Disk: Checks the health and free space of the disk
CPU: Monitors the CPU utilization and temperature
Transceivers: Monitors the status of the transceivers connected to the system

Output Examples

Example output for a healthy system:

admin@nvos:~$ nv show system health
            operational  applied
----------  -----------  -------
status      OK
status-led  green

Health issues
================
No Data

Example output for a faulty system:

admin@nvos:~$ nv show system health
            operational  applied
----------  -----------  -------
status      Not OK
status-led  amber

Health issues
================
    Component                     Status information
    ----------------------------  ----------------------------------------------------------
    LEAKAGE-2                     detected leakage
    PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
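
For scripted checks, the text output shown above can be split into its summary fields and issue rows. The following is a minimal sketch. It assumes the layout matches the examples in this section exactly, with columns separated by two or more spaces. The function names `parse_health` and `readings` are hypothetical, and a structured output format, where available, would be more robust than parsing text.

```python
import re
from typing import Dict, List, Tuple


def parse_health(text: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Split `nv show system health` text into summary fields and issue rows."""
    summary: Dict[str, str] = {}
    issues: List[Tuple[str, str]] = []
    in_issues = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) <= set("-= "):
            continue  # skip blank lines and separator rules
        if line == "Health issues":
            in_issues = True
            continue
        # Assumption: columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) != 2:
            continue  # prompt line, "No Data", or anything without columns
        if in_issues:
            if parts[0] != "Component":  # skip the table header row
                issues.append((parts[0], parts[1]))
        elif parts[0] in ("status", "status-led"):
            summary[parts[0]] = parts[1]
    return summary, issues


def readings(info: str) -> Dict[str, float]:
    """Extract numeric key=value pairs such as temperature=2008.0."""
    pairs = re.findall(r"(\w+)=(-?\d+(?:\.\d+)?)", info)
    return {key: float(value) for key, value in pairs}
```

Calling `parse_health` on the faulty example yields a summary with `status` set to `Not OK` and two issue rows. Passing the second row's status information to `readings` returns the reported temperature and threshold as floats, which a script could compare or log.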