Introduction

NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center.

For DGX Servers, it includes active health monitoring, system alerts, and log generation.
For DGX Station, is it limited to using the CLI to check the health of the system and obtain diagnostic information.

NVSM can be used as a standalone utility from the command line by system administrators.

The following is a high level diagram of the NVSM framework, showing the NVSM API services at the heart of the framework, the DGX System Health Monitors (DSHM) responsible for monitoring the health of key system components, and the NVSM CLI for user control.

Note

DSHM functionality is not supported on DGX Station.

Verifying NVSM API Services

nvsm_api_gateway is part of the DGX OS image and is launched by systemd when DGX boots. The following are the services running under NVSM-APIS.

nvsm_api_gateway-plugin-environment

nvsm_api_gateway-mqtt

nvsm_api_gateway-plugin-memory

nvsm_api_gateway

nvsm_api_gateway-selwatcher

You can verify if each NVSM-APIS service is up and running using the ‘systemctl’ command. For example, the following command verifies the memory service.

$ sudo systemctl status nvsm_api_gateway-plugin-memory

You can also view all the NVSM-APIS services and their status with the following command.

$ sudo systemctl status -all nvsm_api_gateway*

Configurable DSHM Features

DSHM contains the following features that you can configure using the NVSM CLI:

Health Monitor Alerts
Health Monitor Policies

Health Monitor Alerts

Alerts are events of significance that require attention. When a health monitor detects such an event in the subsystem that it monitors, it generates an alert to inform the user. The default behavior is to log the alerts in persistent storage as well as to send an E-mail notification to registered users. Refer to the section Using the NVSM CLI for details about configuring users for receiving alert E-mail notifications.

Each alert has a ‘state’. An active alert can be in a ‘critical’ or ‘warning’ state. Here, ‘critical’ implies an event that needs immediate action, and ‘warning’ implies an event that needs user attention. When the alerting condition is removed, the alert state changes to ‘cleared’. Details of how to view the generated alerts recorded in the database are available in the section Using the NVSM CLI.

DSHM Alert List

The following table describes each DSHM alert ID.

Message and details	Alert ID	Component ID	Severity	Recommended Action
Unsupported drive configuration in {} (Drive(s) missing or Drive placement not in recommended slots. Drives found at these slots: {}. Supported Configurations are: {})	NV-DRIVE-01	<drive slot>	Critical	See Recommended Action A below.
System entered degraded mode, drive in {} is reporting an error. (Media errors detected in drive.)	NV-DRIVE-02	<drive slot>	Warning	See Recommended Action A below.
System entered degraded mode, drive in {} is reporting an error. (IO errors detected in drive.)	NV-DRIVE-03	<drive slot>	Warning	See Recommended Action A below.
System entered degraded mode, drive in {} is reporting an error. (NVMe controller failure detected in drive.)	NV-DRIVE-04	<drive slot>	Critical	See Recommended Action A below.
System entered degraded mode, drive in {} is reporting an error. (Available capacity percentage below critical threshold for drive.)	NV-DRIVE-05	<drive slot>	Critical	See Recommended Action A below.
System entered degraded mode, drive in {} is reporting an error.	NV-DRIVE-06	<drive slot>	Critical	See Recommended Action A below.
System entered degraded mode, drive in {} is reporting an error. (Drive used percentage above critical threshold for drive.)	NV-DRIVE-07	<drive slot>	Warning	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. ({RAID-0/RAID-1/Partition} is corrupted.)	NV-VOL-01	NA	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Power supply module has failed.)	NV-PSU-01	<PSU#> where # is the PSU number.	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (operating temperature exceeds the thermal specifications of the component.)	NV-PSU-02	<PSU#> where # is the PSU number.	Warning	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Input to the PSU is missing)	NV-PSU-03	<PSU#> where # is the PSU number.	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Input voltage is out of range for the Power Supply Module) (Input voltage is out of range for the Power Supply Module)	NV-PSU-04	<PSU#> where # is the PSU number.	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (PSU is missing)	NV-PSU-05	<PSU#> where # is the PSU number.	Warning	See Recommended Action A below.
Failures in Power supply modules have been detected. (System is operating in degraded performance mode.)	NV-PSU-06		Warning	Rectify the issues observed on the PSUs. Then see Recommended Action A below.
Failures in Power supply modules have been detected. (System is in power failed state)	NV-PSU-07		Critical	Rectify the issues observed on the PSUs. Then see Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Operating temperature exceeds the thermal specifications of the component.)	NV-PDB-01	<PDB#> where # is the PDB number	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Fan speed reading has fallen below the expected speed setting.)	NV-FAN-01	<FAN#_F> or <FAN#_R> where # is the fan module number. F is for front fan. R is for rear fan.	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Fan readings are inaccessible.)	NV-FAN-02	<FAN#_F> or <FAN#_R> where # is the fan module number. F is for front fan. R is for rear fan.	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (An unrecoverable CPU Internal error has occurred.)	NV-CPU-01	<CPU#> where # is the CPU socket number (CPU0 or CPU1)	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component.)	NV-CPU-02	<CPU#> where # is the CPU socket number (CPU0 or CPU1)	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Processor socket temperature exceeded the thermal specifications of the component.)	NV-CPU-03		Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Processor socket temperature exceeded the thermal specifications of the component.)	NV-CPU-04		Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Uncorrectable error is reported).	NV-DIMM-01	<CPU#_DIMM_@$> where # = (1, 2) @ = (A, B, C, D, E, F) $ = (1, 2)	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Correctable errors reported exceeds the configured threshold.)	NV-DIMM-02	<CPU#_DIMM_@$> where # = (1, 2) @ = (A, B, C, D, E, F) $ = (1, 2)	Warning	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (Unrecoverable error is observed on the DIMM, specific details of the error are unavailable.)	NV-DIMM-03	<CPU#_DIMM_@$> where # = (1, 2) @ = (A, B, C, D, E, F) $ = (1, 2)	Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting an error. (DIMM presence is not expected in this slot, please verify the DIMM details.)	NV-DIMM-04			See Recommended Action A below.
System entered degraded mode, GPU is reporting an error (Critical error has been reported by the GPU.)	NV-GPU-01		Critical	See Recommended Action A below.
GPU{} power Limits are not configured correctly (Expected limits (Power: 200000W, CLock: 1597MHz), Actual limits (Power: 200000W, Clock: 1163MHz).)	NV-GPU-02		Critical	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (Link speed degradation observed between { BDF1, BDF2}, expected link speed is {} actual link speed is {})	NV-PCI-01		Warning	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (Link width degradation observed between {BDF1, BDF2},, expected link width is {} actual link width is {})	NV-PCI-02		Warning	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (Correctable errors reported on {BDF}.)	NV-PCI-03		Warning	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (UnCorrectable errors reported on {BDF})	NV-PCI-04		Critical	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (Device is missing on {BDF})	NV-PCI-05		Critical	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (Device Error Reporting is disabled on {BDF} for {})	NV-PCI-06		Critical	See Recommended Action A below.
System entered degraded mode, {ID} is reporting an error. (Device is disabled on {BDF})	NV-PCI-07		Critical	See Recommended Action A below.
System entered degraded mode, Storge Controller {} is reporting an error. (Storage controller state not optimal.)	NV-CONTROLLER-01		Warning	See Recommended Action A below.
System entered degraded mode, Storge Controller {} is reporting an error. (Storage controller port connectivity issues detected.)	NV-CONTROLLER-02		Warning	See Recommended Action A below.
System entered degraded mode, Storage Controller {} is reporting an error. (Storage controller phy link speed not set to optimal.)	NV-CONTROLLER-03		Warning	See Recommended Action A below.
Link error on {}. (Network Link is down)	NV-NET-01		Warning	See Recommended Action B below.
Network traffic errors observed on {}. (Rx collision rate of {}, has crossed threshold value of {} on {}network port.)	NV-NET-02		Warning	See Recommended Action B below.
Network traffic errors observed on {}. (Tx collision rate of {}, has crossed threshold value of {} on {}network port.)	NV-NET-03		Warning	See Recommended Action B below.
Network traffic errors observed on {}. (CRC error rate of {}, has crossed threshold value of {} on {}network port.)	NV-NET-04		Critical	See Recommended Action B below.
{} is reporting an error. ({}Network port is disabled.)	NV-NET-05		Critical	See Recommended Action B below.
Ethernet interface error on port {}. ({}Ethernet health check failing with Online NVRAM test failure.)	NV-ETH-01		Critical	See Recommended Action B below.
Ethernet interface configuration error on {}. (MAC address is missing on the Ethernet interface of {}.)	NV-ETH-02		Critical	See Recommended Action B below.
IB driver error. (HCA self test reports IB driver initialization failure.)	NV-IB-01		Critical	See Recommended Action C below.
Counter errors on IB port {} ({}HCA self test on IB port reports counter error.)	NV-IB-02		Critical	See Recommended Action B below.
Configuration error on IB port {}. (GUID is missing on {}HCA.)	NV-IB-03		Critical	See Recommended Action D below.
System entered degraded mode, {} is reporting a fatal error (Critical error has been reported by the NVSwitch Id {} with SXID error {})	NV-NVSWITCH-01		Critical	See Recommended Action A below.
System entered degraded mode, {} is reporting a non fatal error (Critical error has been reported by the NVSwitch Id {} with SXID error {})	NV-NVSWITCH-02		Warning	See Recommended Action A below.

Recommended Actions

(A)

Run ‘sudo nvsm dump health’
Open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/
Attach this notification and the nvsysinfo log file from /tmp/nvsm-health- <hostname>-<timestamp>.tar.xz

(B)

Check the physical link connection
Open a case with NVIDIA Enterprise Support at https://nvid.nvidia.com/dashboard/

(C)

Check OFED installation troubleshooting
Open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/

(D)

Check the status of the Subnet Manager
Open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/

Health Monitor Policies

Users can tune certain aspects of health monitor behavior using health monitor policies. This includes details such as email related configuration for alert notification, selectively disabling devices to be monitored, etc. Details of the supported policies and how to configure them using the CLI are provided in the section Using the NVSM CLI.

Verifying the Installation

Before using NVSM, you can verify the installation to make sure all the services are present.

Verifying DSHM Services

Health monitors are part of the DGX OS image and launched by systemd when DGX boots. You can verify if all the DSHM services are up and running using the systemctl command. Below is an example of verifying whether the environmental DSHM service is functional.

$ sudo systemctl status nvsm-plugin-monitor-environment

nvsm-plugin-monitor-environment.service - Environmental DSHM service.
  Loaded: loaded (/user/lib/systemd/sysem/nvsm-plugin-monitor-environment.service; enabled; vendor preset; enabled)
  Active: active (running) since Tues 2018-09-11 15:12:06 PDT: 3h 1min ago
Main PID: 2540 (plugin-monitor-environment)
   Tasks: 1 (limit 12287)
  CGroup: /system.slice/nvsm-plugin-monitor-environment.service
          |_2540 /user/bin/python /usr/bin/plugin-monitor-environment

Other modules can be verified using similar commands:

To verify the storage module:

$ sudo systemctl status nvsm-plugin-monitor-storage

To verify the system module:

$ sudo systemctl status nvsm-plugin-monitor-system

To verify the environment module

$ sudo systemctl status nvsm-plugin-monitor-environment

Verifying NVSM Services

NVSM is part of the DGX OS image and is launched by systemd when DGX boots. The following are the services running under NVSM.

nvsm-plugin-environment

nvsm-mqtt

nvms-plugin-memory

nvsm

nvsm-selwatcher

You can verify if each NVSM service is up and running using the ‘systemctl’ command. For example, the following command verifies the memory service.

$ sudo systemctl status nvsm-plugin-memory

You can also view all the NVSM services and their status with the following command.

$ sudo systemctl status -all nvsm*