NVIDIA NVOS User Manual for InfiniBand Switches v25.02.3000

Health Monitoring

NVOS includes a health daemon that collects health events and monitors system components. These include hardware components such as fans, power supply units, and leakage sensors, as well as docker containers that fail.

The health daemon runs every 3 seconds and analyzes each monitored component. If it detects an issue, the issue appears in the output of nv show system health, and the daemon publishes a health event over gNMI.

The health daemon monitors two main areas: services and hardware.

Service Monitoring

The health daemon ensures the continuous operation of essential system services. It monitors:

  • Critical Dockers and Services: Verifies that all vital docker containers and services are running

Hardware Monitoring

  • Leakage Sensors: Detects any potential fluid leaks

  • Temperature Sensors: Monitors the temperature of all hardware components to prevent overheating

  • Voltage Sensors: Tracks voltage levels across hardware components to ensure proper functionality

  • Fan Speeds: Checks the speed of system fans to maintain optimal airflow and cooling

  • Power Supply Units (PSUs): Monitors the status of power supply units for stability

  • ASIC Health Status: Verifies the health of ASIC components to prevent processing issues

  • Disk: Checks the health and free space of the disk

  • CPU: Monitors the CPU utilization and temperature

  • Transceivers: Monitors the status of the transceivers connected to the system

Output Examples

Example output for a healthy system:

admin@nvos:~$ nv show system health
            operational  applied  
----------  -----------  -------  
status      OK
status-led  green
 
 
 
Health issues
================
No Data

Example output for a faulty system:

admin@nvos:~$ nv show system health
            operational  applied  
----------  -----------  ------- 
status      Not OK
status-led  amber
 
 
 
Health issues
================
    Component                     Status information
    ----------------------------  ----------------------------------------------------------
    LEAKAGE-2                     detected leakage
    PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
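
For scripted checks, the text output shown above can be turned into a small data structure. The following is a minimal sketch, not an NVIDIA-provided tool: the function name parse_health and the SAMPLE string are hypothetical, and the parsing rules (columns separated by two or more spaces, a "Health issues" heading, "No Data" when there are no issues) are assumptions inferred only from the two example outputs on this page.

```python
import re

# Sample text copied from the faulty-system example on this page.
SAMPLE = """\
            operational  applied
----------  -----------  -------
status      Not OK
status-led  amber

Health issues
================
    Component                     Status information
    ----------------------------  ----------------------------------------------------------
    LEAKAGE-2                     detected leakage
    PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
"""


def parse_health(text):
    """Parse `nv show system health` text output into a dict.

    Assumes the layout shown in the examples on this page; other
    NVOS versions may format the output differently.
    """
    result = {"status": None, "status-led": None, "issues": []}
    in_issues = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ruler lines made only of '-', '=' or spaces.
        if not line or set(line) <= set("-= "):
            continue
        if line == "Health issues":
            in_issues = True
            continue
        # Columns are assumed to be separated by two or more spaces.
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if not in_issues:
            if len(parts) == 2 and parts[0] in ("status", "status-led"):
                result[parts[0]] = parts[1]
        else:
            if line == "No Data" or line.startswith("Component"):
                continue
            if len(parts) == 2:
                result["issues"].append(
                    {"component": parts[0], "info": parts[1]}
                )
    return result


if __name__ == "__main__":
    report = parse_health(SAMPLE)
    print(report["status"], report["status-led"])
    for issue in report["issues"]:
        print(issue["component"], "->", issue["info"])
```

A monitoring script could feed the captured CLI output to parse_health and raise an alert whenever the status field is anything other than OK or the issues list is non-empty.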

Health Monitoring Commands
© Copyright 2025, NVIDIA. Last updated on Sep 1, 2025.