1. Overview
nvidia-healthmon is the system administrator's and cluster manager's tool for detecting and troubleshooting common problems affecting NVIDIA® Tesla™ GPUs in a high performance computing environment. nvidia-healthmon contains limited hardware diagnostic capabilities, and focuses on software and system configuration issues.
1.1. nvidia-healthmon Goals
- Discover common problems that affect a GPU's ability to run a compute job, including:
- Software configuration issues
- System configuration issues
- System assembly issues, like loose cables
- A limited number of hardware issues
- Provide troubleshooting help
- Easily integrate into Cluster Scheduler and Cluster Management applications
- Reduce downtime and failed GPU jobs
1.2. Dependencies
This version of nvidia-healthmon depends on the NVIDIA Developer Display Driver r340 or later, available from the CUDA download page: http://developer.nvidia.com/cuda-downloads
1.3. Supported Products
nvidia-healthmon supports Tesla GPUs running on Linux (bare metal) operating systems.
NVIDIA® Tesla™ Line:
- All Fermi architecture GPUs
- All Kepler architecture GPUs
1.4. Deprecation Notice
Beginning with this release, NVIDIA® has begun shipping an improved health and diagnostic tool called the NVIDIA Validation Suite®, or NVVS for short. It is included in this GDK alongside Healthmon. The initial version of NVVS is very similar to Healthmon in functionality, but built on a more robust framework. Future versions of NVVS will see substantial new features and tests.
Accordingly, the current version of Healthmon is being deprecated in this release. It is still fully supported and will remain so for this driver release cycle. Please begin looking at the associated NVVS material to plan ahead for that transition. If there are any questions or concerns, please feel free to contact cudatools@nvidia.com.
2. Running nvidia-healthmon
Once unpackaged, nvidia-healthmon can be run from the command line.
user@hostname $ nvidia-healthmon
When no arguments are supplied, nvidia-healthmon will run with the default behavior on all supported GPUs.
2.1. Listing GPUs
nvidia-healthmon is able to list the GPUs installed on the system. This is useful to determine the PCI bus ID or device index needed in the next section.
user@hostname $ nvidia-healthmon -L
For extended GPU information see the nvidia-smi tool:
user@hostname $ man nvidia-smi
user@hostname $ nvidia-smi -q
2.2. Targeting a Specific GPU
nvidia-healthmon can target a single GPU or a set of GPUs. To target a specific GPU, run nvidia-healthmon using the -i or --id flag with the identifier of the GPU to be targeted. Valid identifiers are:
- A device index
- A PCI bus ID
- A GPU chip UUID
- A GPU board serial number
user@hostname $ nvidia-healthmon -i 0
user@hostname $ nvidia-healthmon -i 0000:02:00.0
2.3. Modes
The default mode is quick mode. In quick mode, a subset of tests is run to quickly detect common problems. The -q or --quick flag allows quick mode to be explicitly requested.
The other mode is extended mode. In extended mode all available tests will be run.
To run extended diagnostics, run nvidia-healthmon using the -e or --extended flag.
user@hostname $ nvidia-healthmon --extended
For more information about these modes see the nvidia-healthmon Best Practices Guide.
2.4. Log File Output
By default nvidia-healthmon reports information to standard output. To redirect output to a file, use the -l or --log-file flag. When a log file is used, only errors in command line parsing are printed to the console.
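For example, the following writes the results of an extended mode run to a log file (the path is illustrative):
user@hostname $ nvidia-healthmon -e -l /tmp/nvidia-healthmon-report.txt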
2.5. Verbose Output
The default output of nvidia-healthmon does not report values for the various metrics it has collected. The verbose flags, -v or --verbose, can be used to print values such as the pinned memory bandwidth and CUDA device query information. Additionally, verbose mode will provide information about why tests were skipped.
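For example, a verbose extended mode run:
user@hostname $ nvidia-healthmon -e -v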
2.6. Other Flags
For a more complete list of the available flags, run nvidia-healthmon with the -h/--help or -H/--verbose-help flags.
user@hostname $ nvidia-healthmon -h
3. Interpreting the Result
3.1. nvidia-healthmon Exit Code
nvidia-healthmon terminates once it completes executing diagnostics on all specified devices. An exit code of zero is used when nvidia-healthmon runs successfully. A non-zero exit code indicates that there was a problem with the nvidia-healthmon run. This may be due to problems running the diagnostic, such as a missing configuration file or invalid command line arguments, or due to problems with the local machine that nvidia-healthmon has detected. The output of the application must be read to determine what the exact problem was.
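Because problems are signaled through the exit code, a wrapper can test it and react. The following is a minimal sketch, assuming a bash shell; the log file path is illustrative:

#!/bin/bash
# Minimal sketch: run nvidia-healthmon and act on its exit code.
LOG=/tmp/nvidia-healthmon-report.txt    # illustrative path

nvidia-healthmon -e -l "$LOG"
if [ $? -ne 0 ]; then
    # Non-zero exit code: either the run itself failed or a problem
    # was detected on this machine. Read the log for the exact cause.
    echo "nvidia-healthmon reported a problem; see $LOG" >&2
    exit 1
fi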
An example of a successful run of nvidia-healthmon.
user@hostname $ nvidia-healthmon -q
Loading Config: SUCCESS
Global Tests
    NVML Sanity: SUCCESS
    Tesla Devices Count: SUCCESS
Global Test Results: 2 success, 0 errors, 0 warnings, 0 did not run

-----------------------------------------------------------
GPU 0000:04:00.0 #1 : Tesla C2075 (Serial: 0425912072221)
    NVML Sanity: SUCCESS
    InfoROM: SKIPPED
    GEMINI InfoROM: SKIPPED
    ECC: SUCCESS
    CUDA Sanity
        Result: SUCCESS
    PCIe Maximum Link Generation: SUCCESS
    PCIe Maximum Link Width: SUCCESS
    PCI Seating: SUCCESS
    PCI Bandwidth: SKIPPED
Device Results: 6 success, 0 errors, 0 warnings, 3 did not run

System Results: 8 success, 0 errors, 0 warnings, 3 did not run
In the above example three of the tests did not run.
- The InfoROM test is an extended-mode-only test, so it is skipped in quick mode. To run the InfoROM test, extended mode must be used.
- The GEMINI InfoROM test was skipped because the Tesla C2075 has only a single GPU on the board.
- The PCIe bandwidth test was not run because the configuration file does not specify the bandwidth supported by the system. To run the bandwidth test, the configuration file must be edited.
An example of a failing run of nvidia-healthmon.
user@hostname $ nvidia-healthmon -q
Loading Config: SUCCESS
Global Tests
    NVML Sanity: SUCCESS
    Tesla Devices Count: SUCCESS
Global Test Results: 2 success, 0 errors, 0 warnings, 0 did not run

-----------------------------------------------------------
GPU 0000:04:00.0 #1: Tesla C2075 (Serial: 0425912072221)
    NVML Sanity: SUCCESS
    InfoROM: SKIPPED
    GEMINI InfoROM: SKIPPED
    ECC: SUCCESS
    CUDA Sanity
        Result: SUCCESS
    PCIe Maximum Link Generation: SUCCESS
    PCIe Maximum Link Width: SUCCESS
    PCI Seating
        ERROR: After enabling maximum performance mode, the current PCIe Link Width (8) does not match the expected maximum PCIe Link Width (16). This can indicate that this GPU is improperly seated.
        An issue was detected with this GPU. This issue is usually caused by poor connection between the GPU and the system.
        * Run 'nvidia-smi -q'. In some cases this will report a poorly connected power cable.
        * Power down your system.
        * Check that all power connectors are firmly attached (some GPUs require two power cables attached).
        * Check that the power cable is not damaged. Symptoms of damaged power cables include exposed wiring and kinks (sharply creased region).
        * Check that the power cable is attached to a working power supply.
        * Rerun these diagnostics, using the same GPU, on a system that is known to be working. A variety of system issues can cause diagnostic failure.
        * Restart your system and install the latest NVIDIA display driver.
        * Contact your OEM provider, to run further system diagnostics.
        * Run 'nvidia-bug-report.sh' as the root user.
        * Run 'nvidia-healthmon -v -e -l nvidia-healthmon-report.txt --debug'
        * Provide the files nvidia-bug-report.log.gz, nvidia-healthmon-report.dump, and nvidia-healthmon-report.txt to your OEM to assist your OEM in resolving this issue.
        Result: ERROR
    PCI Bandwidth: SKIPPED
Device Results: 5 success, 1 errors, 0 warnings, 3 did not run

System Results: 7 success, 1 errors, 0 warnings, 3 did not run
WARNING: One or more tests didn't run. Read the output for details.
ERROR: One or more tests failed. Read the output for details.
In the above example, nvidia-healthmon detected a problem with how the GPU was inserted into the system. nvidia-healthmon exited with a non-zero exit code. Additionally, the output provides a user readable description of what went wrong and a list of the steps the customer can take to solve the problem.
4. Configuring nvidia-healthmon
While nvidia-healthmon will work out of the box without additional configuration, it is possible to configure the behavior and enable optional features. nvidia-healthmon is packaged with a sample configuration file, nvidia-healthmon.conf. This configuration file can be used to enable optional tests. By default, a test that is missing configuration information will be skipped.
The configuration file used can be specified on the command line. When not specified, nvidia-healthmon checks the NVHEALTHMON_CONF environment variable for the full path (including file name) of the nvidia-healthmon.conf file, followed by the current working directory, and lastly the default RPM/DEB installation directory (/etc/nvidia-healthmon/nvidia-healthmon.conf).
Specify the configuration file using the following:
user@hostname $ nvidia-healthmon -c /path/to/your/nvidia-healthmon.conf
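Alternatively, the NVHEALTHMON_CONF environment variable can be set to the full path of the configuration file (the path below is illustrative):
user@hostname $ export NVHEALTHMON_CONF=/etc/nvidia-healthmon/nvidia-healthmon.conf
user@hostname $ nvidia-healthmon -e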
4.1. Configuration File Contents
The configuration file uses the standard INI format.
There are two types of sections in the file: the global section and GPU name sections. The global section contains expected system configuration information; for example, devices.tesla.count describes the number of GPUs on the system. Any section whose name is not global is a GPU name section. A GPU name section contains information about all GPUs with the given name; for example, the section [Tesla M2090] contains the configuration for all Tesla M2090 GPUs. The two types of sections support different key value pairs. For instance, the key devices.tesla.count is only valid in the global section. The nvidia-healthmon.conf file that nvidia-healthmon is packaged with contains the valid key value pairs for each section, along with a brief description of each.
Sample System Configuration File with 4 Devices shows a sample configuration file for a system that contains 4 devices that are either Tesla™ C2075s or Tesla C2070s.
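That sample listing is not reproduced here; a minimal sketch of such a configuration, using only keys described in the next section and illustrative values, might look like the following (consult the packaged nvidia-healthmon.conf for the authoritative syntax and defaults):

[global]
; Expect 4 Tesla brand GPUs to be visible through NVML (illustrative).
devices.tesla.count = 4

[Tesla C2075]
; Host-to-GPU bandwidth thresholds in MB/s and a die temperature
; warning level in degrees Celsius (all values illustrative).
bandwidth.warn = 4000
bandwidth.min = 3000
temperature.warn = 95

[Tesla C2070]
bandwidth.warn = 4000
bandwidth.min = 3000
temperature.warn = 95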
4.2. Summary of Valid Keys
Keys valid in the global section:

| Key | Description |
|---|---|
| devices.tesla.count | Enable this setting to ensure that the expected number of Tesla brand GPUs is detected by the NVML library. This count only includes Tesla brand GPUs that the nvidia-healthmon process has sufficient permission to access. If this setting is not configured, checks that require it will be skipped. |
| drivers.blacklist | Checks the system for drivers that are known to cause issues with NVIDIA hardware, drivers, and software. If nvidia-healthmon detects any blacklisted drivers it will not execute further tests. You may add or remove drivers on this list at your own risk. If this setting is not configured, checks that require it will be skipped. |
| display.gpudirect | Enable this setting to display the GPUDirect communication matrix between all P2P/RDMA devices on the system. |
| version.driver | Set this to the expected NVIDIA driver version; a mismatch causes the test to fail. |
| rdma.enable | Set this to activate RDMA testing features and display relevant RDMA information in the GPUDirect communication matrix (through display.gpudirect). See the rdma.* keys below. |
Keys valid in a GPU name section:

| Key | Description |
|---|---|
| bandwidth.warn | If the bandwidth from the host to the GPU, or from the GPU to the host, is below this value (in MB/s), nvidia-healthmon will generate a warning. If this setting is not configured, checks that require it will be skipped. |
| bandwidth.min | If the bandwidth from the host to the GPU, or from the GPU to the host, is below this value (in MB/s), nvidia-healthmon will generate an error. If this setting is not configured, checks that require it will be skipped. |
| peer.bandwidth.warn | If peer-to-peer access is supported and the bandwidth from one GPU to another GPU is below this value (in MB/s), nvidia-healthmon will generate a warning. If peer-to-peer access is not supported, the bandwidth test is still run, but no comparison to the minimum bandwidth is done. |
| peer.bandwidth.min | If peer-to-peer access is supported and the bandwidth from one GPU to another GPU is below this value (in MB/s), nvidia-healthmon will generate an error. If peer-to-peer access is not supported, the bandwidth test is still run, but no comparison to the minimum bandwidth is done. |
| pci.gen | Compare the maximum PCIe link generation for the PCIe link closest to the GPU chip against the value specified here. An error will be generated if there is a mismatch. If this setting is not configured, checks that require it will be skipped. For a board that contains multiple GPU chips, this value reflects the PCIe link generation between the GPU chip and an on-board PCIe switch; for single-GPU boards it reflects the link generation between the GPU chip and the PCIe slot the GPU is connected to. Note that additional PCIe links upstream from the GPU may have a different link generation; those links are not considered here. Because the PCIe link generation depends on the non-GPU side of the link, knowledge of the system's capability is required to set the correct expectation. Consequently the packaged configuration file disables this test for each GPU by default. |
| pci.width | Compare the maximum PCIe link width for the PCIe link closest to the GPU chip against the value specified here. An error will be generated if there is a mismatch. If this setting is not configured, checks that require it will be skipped. For a board that contains multiple GPU chips, this value reflects the PCIe link width between the GPU chip and an on-board PCIe switch; for single-GPU boards it reflects the link width between the GPU chip and the PCIe slot the GPU is connected to. Note that additional PCIe links upstream from the GPU may have a different link width; those links are not considered here. Because the PCIe link width depends on the non-GPU side of the link, knowledge of the system's capability is required to set the correct expectation. Consequently the packaged configuration file disables this test for each GPU by default. |
| temperature.warn | Compare the current GPU die temperature to a warning level in degrees Celsius. A warning will be generated if the current temperature is at or above the warning level. Note that the desired temperature may vary based on the cooling system used. If this setting is not configured, checks that require it will be skipped. |
| rdma.mlxdev, rdma.mlxport | Perform an RDMA test to measure the bandwidth between a Mellanox InfiniBand NIC and a GPU. The tests are activated via the rdma.enable key in the global section. rdma.mlxdev is the Mellanox device ID given by ibstat; rdma.mlxport is the port number on which to perform the test. Though this is strictly a loopback test, the port must be active and the link up for the test to work correctly. |
| rdma.bandwidth.warn | If RDMA access is supported and the RDMA unidirectional bandwidth is below this value (in MB/s), nvidia-healthmon will generate a warning. RDMA bandwidths are highly dependent on the exact host bridge version and architecture. This is an optional parameter. |
| rdma.bandwidth.min | If RDMA access is supported and the RDMA unidirectional bandwidth is below this value (in MB/s), nvidia-healthmon will generate an error. RDMA bandwidths are highly dependent on the exact host bridge version and architecture. This is an optional parameter. |
| version.vbios | Check that a device's VBIOS version is an expected value. This can be a single value or a comma-separated list of values. |
| ecc.check_state | Set this to check the ECC state of a device and consider a mismatch with this parameter to be a failure. If this parameter is provided, it must be one of the following: 0 = expect ECC to be disabled on this device; 1 = expect ECC to be enabled on this device. If this parameter is provided and a device does not support getting its ECC state, it will be considered a test failure. |
5. GPUDirect
5.1. GPUDirect Testing
nvidia-healthmon can assist in the testing and optimization of GPUDirect communications. Enable the GPUDirect communication matrix using the display.gpudirect global key. Peer-to-peer (P2P) tests will automatically be conducted if there exists more than one GPUDirect-capable card in the system.
Please note that the enumeration of GPUs in the matrix is according to NVML. For the RDMA tests to run, the following criteria must be met (a quick verification sketch follows this list):
- The InfiniBand driver must be loaded and the IB link must be up
- The nv_peer_mem driver must be loaded. This driver, and instructions for compiling it, can be found on the Mellanox OFED page: http://www.mellanox.com/page/products_dyn?product_family=116
- The OFED libibverbs libraries must be installed and in the current library search path (i.e., LD_LIBRARY_PATH)
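The following commands offer a quick way to verify these prerequisites before running nvidia-healthmon; they use standard Linux and OFED tools, and the module and library names shown are the usual ones:

# Check that the InfiniBand core and nv_peer_mem kernel modules are loaded.
lsmod | grep -E 'ib_core|nv_peer_mem'

# Check that the Mellanox device and port (matching rdma.mlxdev and
# rdma.mlxport in the configuration file) report an active link.
ibstat

# Check that the OFED libibverbs library is visible to the loader.
ldconfig -p | grep libibverbs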
6. Use Cases
While nvidia-healthmon is primarily targeted at clusters of NVIDIA® Tesla™ GPUs, it can also be used in workstations without a cluster manager.
6.1. Run nvidia-healthmon after System Provisioning
After a system is provisioned, nvidia-healthmon can be run on the node to ensure that the node is correctly configured and able to run a GPU job. In this use case an extended mode run of nvidia-healthmon will try to deliver the most comprehensive system health check.
6.2. Run nvidia-healthmon before a Job
Before the cluster scheduler starts a job on a node, a quick mode run of nvidia-healthmon can check that the node is ready to run a GPU job. When nvidia-healthmon reports a problem, the scheduler can mark the node as down.
6.3. Periodic Health Check
Analogous to periodically scanning for viruses, nvidia-healthmon can be run periodically in extended mode. Again, when nvidia-healthmon reports a problem the scheduler can mark the node as down.
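One way to schedule this is a cron entry; the sketch below assumes a system-wide cron.d file, with an illustrative schedule, install path, and log location:

# /etc/cron.d/nvidia-healthmon (illustrative): run an extended mode
# health check every Sunday at 03:00 as root and log the results.
0 3 * * 0  root  /usr/bin/nvidia-healthmon -e -l /var/log/nvidia-healthmon-weekly.log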
6.4. After Job Failure
When a GPU job fails, the extended mode run of nvidia-healthmon can help troubleshoot the problem.
6.5. Interfacing with Other Services
nvidia-healthmon can be run by a wrapper script that handles any reported warnings and errors and forwards them to other services. NVIDIA suggests using syslog to log reported issues on the system. Similarly, an SNMP trap can be sent to notify hosts over the network of these issues.
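A minimal sketch of such a wrapper, assuming a bash shell and the standard logger utility for syslog (an SNMP trap would be sent at the point marked in the comment):

#!/bin/bash
# Illustrative wrapper: run nvidia-healthmon and forward problems to syslog.
LOG=/tmp/nvidia-healthmon-report.txt    # illustrative path

if ! nvidia-healthmon -e -l "$LOG"; then
    # Report the failure and the log location through syslog.
    logger -t nvidia-healthmon -p daemon.err \
        "nvidia-healthmon reported a problem; see $LOG"
    # An SNMP trap could be sent here to notify remote hosts.
    exit 1
fi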
6.6. Troubleshooting Problems
nvidia-healthmon's troubleshooting report is designed to cover common problems and will often suggest a number of possible solutions. These troubleshooting steps should be tackled from the top down, as the most likely solution is listed first.
6.6.1. Save Log Files
NVIDIA recommends that the log files from failing nvidia-healthmon runs be saved. Saving log files ensures that data about intermittent problems is not lost.
Notices
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.