DCGM Diagnostics
Overview
The NVIDIA Validation Suite (NVVS) is now called DCGM Diagnostics. As of DCGM v1.5, running NVVS as a standalone utility is deprecated, and all of its functionality (including command-line options) is available via the DCGM command-line utility (dcgmi). For brevity, the rest of this document may use DCGM Diagnostics and NVVS interchangeably.
DCGM Diagnostic Goals
DCGM Diagnostics are designed to:
Provide a system-level tool, in production environments, to assess cluster readiness levels before a workload is deployed.
Facilitate multiple run modes:
Interactive via an administrator or user in plain text.
Scripted via another tool with easily parseable output.
Provide multiple test timeframes to facilitate different preparedness or failure conditions:
Level 1 tests to use as a readiness metric
Level 2 tests to use as an epilogue on failure
Level 3 and Level 4 tests to be run by an administrator as post-mortem
Integrate the following concepts into a single tool: discovery of deployment, system software, and hardware configuration issues; basic diagnostics; integration issues; and relative system performance.
Deployment and Software Issues
NVML library access and versioning
CUDA library access and versioning
Software conflicts
Hardware Issues and Diagnostics
Pending Page Retirements
PCIe interface checks
NVLink interface checks
Framebuffer and memory checks
Compute engine checks
Integration Issues
PCIe replay counter checks
Topological limitations
Permissions, driver, and cgroups checks
Basic power and thermal constraint checks
Stress Checks
Power and thermal stress
Throughput stress
Constant relative system performance
Maximum relative system performance
Memory Bandwidth
Provide troubleshooting help
Easily integrate into Cluster Scheduler and Cluster Management applications
Reduce downtime and failed GPU jobs
Beyond the Scope of the DCGM Diagnostics
DCGM Diagnostics are not designed to:
Provide comprehensive hardware diagnostics
Actively fix problems
Replace the field diagnosis tools. Please refer to http://docs.nvidia.com/deploy/hw-field-diag/index.html for that process.
Facilitate any RMA process. Please refer to http://docs.nvidia.com/deploy/rma-process/index.html for those procedures.
Run Levels and Tests
The following table describes which tests are run at each Level in DCGM Diagnostics.
| Plugin | r1 (Short, seconds) | r2 (Medium, < 2 mins) | r3 (Long, < 30 mins) | r4 (Extra Long, 1-2 hours) |
|---|---|---|---|---|
| Software | Yes | Yes | Yes | Yes |
| PCIe + NVLink | | Yes | Yes | Yes |
| GPU Memory | | Yes | Yes | Yes |
| Memory Bandwidth | | Yes | Yes | Yes |
| Diagnostics | | | Yes | Yes |
| Targeted Stress | | | Yes | Yes |
| Targeted Power | | | Yes | Yes |
| Memory Stress | | | | Yes |
| Input EDPp | | | | Yes |
Getting Started with DCGM Diagnostics
Command Line options
The various command line options are designed to control general execution parameters, whereas detailed changes to execution behavior are contained within the configuration files detailed in the next section.
The following table lists the various options supported by DCGM Diagnostics:
| Short option | Long option | Parameter | Description |
|---|---|---|---|
| -g | --group | groupId | The device group ID to query. |
| | --host | IP/FQDN | Connects to the specified IP or fully-qualified domain name. To connect to a host engine that was started with -d (unix socket), prefix the unix socket filename with unix://. |
| -h | --help | | Displays usage information and exits. |
| -r | --run | diag | Run a diagnostic. (Note: higher-numbered tests include all beneath.) Specific tests to run may be specified by name, and multiple tests may be specified as a comma-separated list. For example, the command dcgmi diag -r "pcie,diagnostic" would run the PCIe and Diagnostic tests together. |
| -p | --parameters | test_name.variable_name=variable_value | Test parameters to set for this run. |
| -c | --configfile | full/path/to/config/file | Path to the configuration file. |
| -f | --fakeGpuList | fakeGpuList | A comma-separated list of the fake GPUs on which the diagnostic should run. For internal/testing use only. Cannot be used with -g or -i. |
| -i | --gpuList | gpuList | A comma-separated list of the GPUs on which the diagnostic should run. Cannot be used with -g or -f. |
| -v | --verbose | | Show information and warnings for each test. |
| | --statsonfail | | Only output the statistics files if there was a failure. |
| | --debugLogFile | file | Specify the log file to which debug information is written (see the -d option). |
| | --statspath | plugin statistics path | Write the plugin statistics to the given path rather than the current directory. |
| -d | --debugLevel | debug level | Debug level (one of NONE, FATAL, ERROR, WARN, INFO, DEBUG, VERB). Default: DEBUG. The log file can be specified by the --debugLogFile parameter. |
| -j | --json | | Print the output in JSON format. |
| | --clocksevent-mask | mask | Specify which clocks event reasons should be ignored. You can provide a comma-separated list of reasons. For example, specifying 'HW_SLOWDOWN,SW_THERMAL' would ignore the HW_SLOWDOWN and SW_THERMAL reasons. Alternatively, you can specify the integer value of the ignore bitmask. For the bitmask, multiple reasons may be specified by the sum of their bit masks. For example, specifying '40' would ignore the HW_SLOWDOWN and SW_THERMAL reasons. Valid clocks event reasons and their corresponding bitmasks (given in parentheses) are: HW_SLOWDOWN (8), SW_THERMAL (32), HW_THERMAL (64), and HW_POWER_BRAKE (128). |
| | --throttle-mask | mask | Deprecated: please use --clocksevent-mask instead. |
| | --fail-early | | Enable early failure checks for the Targeted Power, Targeted Stress, and Diagnostic tests. When enabled, these tests check for a failure once every 5 seconds (can be modified by the --check-interval parameter) while the test is running, instead of a single check performed after the test is complete. Disabled by default. |
| | --check-interval | check interval | Specify the interval (in seconds) at which the early failure checks should occur for the Targeted Power, Targeted Stress, SM Stress, and Diagnostic tests when early failure checks are enabled. Default is once every 5 seconds. The interval must be between 1 and 300. |
| | --iterations | iterations | Specify the number of iterations of the diagnostic to run consecutively. (Must be greater than 0.) |
| | -- | | Ignores the rest of the labeled arguments following this flag. |
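For example, one possible combination of the options above (option spellings as reconstructed in the table) runs the level 3 suite on GPUs 0 and 1, with early failure checks every 10 seconds and JSON output:
$ dcgmi diag -r 3 -i 0,1 --fail-early --check-interval 10 -j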
Configuration File
The DCGM Diagnostics (dcgmi diag) configuration file is a YAML-formatted text file controlling the various tests and their execution parameters.
The general format of the configuration file is shown below:
version:
spec: dcgm-diag-v1
skus:
  - name: GPU-name
    id: GPU part number
    test_name1:
      test_parameter1: value
      test_parameter2: value
    test_name2:
      test_parameter1: value
      test_parameter2: value
A standard configuration file for the H100 looks like the following:
version: "@CMAKE_PROJECT_VERSION@"
spec: dcgm-diag-v1
skus:
- name: H100 80GB PCIe
id: 2331
targeted_power:
is_allowed: true
starting_matrix_dim: 1024.0
target_power: 350.0
use_dgemm: false
targeted_stress:
is_allowed: true
use_dgemm: false
target_stress: 15375
sm_stress:
is_allowed: true
target_stress: 15375.0
use_dgemm: false
pcie:
is_allowed: true
h2d_d2h_single_pinned:
min_pci_generation: 3.0
min_pci_width: 16.0
h2d_d2h_single_unpinned:
min_pci_generation: 3.0
min_pci_width: 16.0
memory:
is_allowed: true
l1cache_size_kb_per_sm: 192.0
diagnostic:
is_allowed: true
matrix_dim: 8192.0
memory_bandwidth:
is_allowed: true
minimum_bandwidth: 1230000
pulse_test:
is_allowed: true
Usage Examples
Custom Configuration File
The default configuration file can be overridden using the -c option.
$ dcgmi diag -r 2 -c custom-diag-tests.yaml
where desired tests and parameters are included in the custom-diag-tests.yaml file.
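For instance, a minimal custom-diag-tests.yaml might enable only the Targeted Power test with a lower power target. In this sketch, the SKU name, id, and values are placeholders following the format shown in the previous section:

version: "custom"
spec: dcgm-diag-v1
skus:
  - name: Example GPU        # placeholder SKU name
    id: 0000                 # placeholder part number
    targeted_power:
      is_allowed: true
      target_power: 300.0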
Tests and Parameters
Specific tests and parameters can be directly specified when running diagnostics:
$ dcgmi diag -r targeted_power -p targeted_power.target_power=300.0
Iterations
DCGM also supports running test suites in loops using the --iterations option, which allows for increasing the runtime duration of the tests.
$ dcgmi diag -r pcie --iterations 3
Logging
By default, DCGM emits debugging information into logs that are stored under /var/log/nvidia-dcgm/nvvs.log.
DCGM also provides a JSON output of the results of the tests, which allows for processing by various tools.
$ dcgmi diag -r pcie -j
...
{
"category": "Integration",
"tests": [
{
"name": "pcie",
"results": [
{
"entity_group": "GPU",
"entity_group_id": 1,
"entity_id": 0,
"info": [
"GPU to Host bandwidth:\t\t13.53 GB/s",
"Host to GPU bandwidth:\t\t12.05 GB/s",
"bidirectional bandwidth:\t23.69 GB/s",
"GPU to Host latency:\t\t0.791 us",
"Host to GPU latency:\t\t1.201 us",
"bidirectional latency:\t\t1.468 us"
],
"status": "Pass"
}
],
"test_summary": {
"status": "Pass"
}
}
]
}
...
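For scripted runs, the JSON above can be post-processed programmatically. Below is a minimal sketch in Python; the traversal assumes a category object shaped exactly like the fragment shown, and the full dcgmi diag -j document may wrap one or more such objects, so adjust the traversal for your DCGM version:

import json
import sys

def failing_entities(category):
    """Yield (test_name, entity_id, status) for every non-passing result."""
    for test in category.get("tests", []):
        for result in test.get("results", []):
            if result.get("status") != "Pass":
                yield test["name"], result.get("entity_id"), result.get("status")

if __name__ == "__main__":
    # e.g. feed the "Integration" category object shown above via stdin
    category = json.load(sys.stdin)
    for name, entity, status in failing_entities(category):
        print(f"test={name} entity_id={entity} status={status}")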
Overview of Plugins
The NVIDIA Validation Suite consists of a series of plugins that are each designed to accomplish a different goal.
Deployment Plugin
The deployment plugin’s purpose is to verify the compute environment is ready to run CUDA applications and is able to load the NVML library.
Preconditions
LD_LIBRARY_PATH must include the path to the CUDA libraries, which for version X.Y of CUDA is normally /usr/local/cuda-X.Y/lib64. This can be set by running export LD_LIBRARY_PATH=/usr/local/cuda-X.Y/lib64
The Linux nouveau driver must not be running, and should be blacklisted since it will conflict with the NVIDIA driver.
Configuration Parameters
None at this time.
Stat Outputs
None at this time.
Failure
The plugin will fail if:
The corresponding device nodes for the target GPU(s) are being blocked by the operating system (e.g. cgroups) or exist without read/write permissions for the current user.
The NVML library libnvidia-ml.so cannot be loaded.
The CUDA runtime libraries cannot be loaded.
The nouveau driver is found to be loaded.
Any pages are pending retirement on the target GPU(s).
Any row remaps are pending, or any row remappings have failed, on the target GPU(s).
Any other graphics processes are running on the target GPU(s) while the plugin runs.
Diagnostic Plugin
Overview
The Diagnostic plugin is part of the level 3 tests. It performs large matrix multiplications while copying data to various addresses in the frame buffer and checking that the data can be written and read correctly. By default, it alternates running these multiplications at all available precisions among 64-, 32-, and 16-bit. It also walks the frame buffer, writing values to different addresses and verifying that the values are written and read correctly.
Test Description
This process will stress the GPU by having it draw a large amount of power and provide a high level of throughput for five minutes (by default). During this process, the GPU will be monitored for all standard errors (XIDs, temperature violations, uncorrectable memory errors, etc.) as well as the correctness of data being written and read.
Supported Parameters
The following table lists the global parameters for the diagnostic plugin:
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| max_sbe_errors | Double | Blank | This is the threshold beyond which SBEs are treated as errors. |
| test_duration | Double | 180.0 | This is the time in seconds that the test should run. |
| use_doubles | String | False | This indicates doubles should be used instead of floats. |
| temperature_max | Double | 30.0 | This is the maximum temperature in degrees allowed during the test. |
| is_allowed | Bool | False | This is whether the specified test is allowed to run. |
| matrix_dim | Double | 2048.0 | This is the starting dimension of the matrix used for S/Dgemm. |
| precision | String | Half Single Double | This is the precision to use: half, single, or double. |
| gflops_tolerance_pcnt | Double | 0.0 | This is the percent of the mean below which GFLOPs are treated as errors. |
Sample Commands
Run a quick diagnostic (running just this plugin by name):
$ dcgmi diag -r diagnostic
Run the diagnostic for 5 minutes:
$ dcgmi diag -r 3 -p diagnostic.test_duration=300.0
Run the diagnostic, stopping if max temperature exceeds 28 degrees:
$ dcgmi diag -r 3 -p diagnostic.temperature_max=28.0
Run the diagnostic, with a smaller starting dimension for matrix operations:
$ dcgmi diag -r 3 -p diagnostic.matrix_dim=1024.0
Run the diagnostic, reporting an error if a GPU reports gflops not within 60% of the mean gflops across all GPUs:
$ dcgmi diag -r 3 -p diagnostic.gflops_tolerance_pcnt=0.60
Run the diagnostic, using double precision:
$ dcgmi diag -r 3 -p diagnostic.precision=double
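Multiple parameters can be combined in a single -p argument as a quoted, semicolon-separated list (the same pattern used by the memtest and pulse test examples later in this document), for example:
$ dcgmi diag -r 3 -p "diagnostic.test_duration=300.0;diagnostic.matrix_dim=1024.0"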
Failure Conditions
The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
PCIe - GPU Bandwidth Plugin
Overview
The PCIe plugin’s purpose is to stress the communication from the host to the GPUs as well as among the GPUs on the system. It checks for p2p (peer-to-peer) correctness, any errors or replays while writing the data, and can be used to measure the bandwidth and latency to and from the GPUs and the host.
Preconditions
None
Sub tests
The plugin consists of several self-tests that each measure a different aspect of bandwidth or latency. Each subtest has either a pinned/unpinned pair or a p2p enabled/p2p disabled pair of identical tests. Pinned/unpinned tests use either pinned or unpinned memory when copying data between the host and the GPUs. P2p enabled/p2p disabled tests enable or disable GPUs writing and reading to and from each other directly rather than through the PCIe bus.
This plugin will use NVLink to communicate between GPUs when possible; otherwise, communication between GPUs will occur over PCIe.
Each subtest is represented with a tag that is used both for specifying configuration parameters for the subtest and for outputting stats for the subtest.
| Sub Test Tag | Pinned/Unpinned or P2P Enabled/P2P Disabled | Description |
|---|---|---|
| h2d_d2h_single_pinned | Pinned | Device <-> Host Bandwidth, one GPU at a time |
| h2d_d2h_single_unpinned | Unpinned | Device <-> Host Bandwidth, one GPU at a time |
| h2d_d2h_latency_pinned | Pinned | Device <-> Host Latency, one GPU at a time |
| h2d_d2h_latency_unpinned | Unpinned | Device <-> Host Latency, one GPU at a time |
| p2p_bw_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, one GPU pair at a time |
| p2p_bw_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, one GPU pair at a time |
| p2p_bw_concurrent_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1 |
| p2p_bw_concurrent_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1 |
| 1d_exch_bw_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, concurrently, with every GPU sending either to the GPU with the next-higher index (l2r) or to the GPU with the next-lower index (r2l) |
| 1d_exch_bw_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, concurrently, with every GPU sending either to the GPU with the next-higher index (l2r) or to the GPU with the next-lower index (r2l) |
| p2p_latency_p2p_enabled | P2P Enabled | Device <-> Device Latency, one GPU pair at a time |
| p2p_latency_p2p_disabled | P2P Disabled | Device <-> Device Latency, one GPU pair at a time |
The following table lists the global parameters for the PCIe plugin.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| test_pinned | Bool | True | Include subtests that test using pinned memory. |
| test_unpinned | Bool | True | Include subtests that test using unpinned memory. |
| test_p2p_on | Bool | True | Run relevant subtests with peer-to-peer (P2P) memory transfers between GPUs enabled. |
| test_p2p_off | Bool | True | Run relevant subtests with peer-to-peer (P2P) memory transfers between GPUs disabled. |
| max_pcie_replays | Float | 80.0 | Maximum number of PCIe replays to allow per GPU for the duration of this plugin. This is based on an expected replay rate of less than 8 per minute for PCIe Gen 3.0, assuming this plugin will run for less than a minute and allowing 10x as many replays before failure. |
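For example, to skip the unpinned-memory subtests using the test_unpinned parameter above (a sketch following the documented -p syntax):
$ dcgmi diag -r pcie -p pcie.test_unpinned=false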
The following table lists the parameters to specific subtests for the PCIe plugin.
| Parameter Name | Default | Sub Tests | Description |
|---|---|---|---|
| min_bandwidth | 0 | h2d_d2h_single_pinned, h2d_d2h_single_unpinned, h2d_d2h_concurrent_pinned, h2d_d2h_concurrent_unpinned | Minimum bandwidth in GB/s that must be reached for this subtest to pass. |
| max_latency | 100,000 | h2d_d2h_latency_pinned, h2d_d2h_latency_unpinned | Latency in microseconds that cannot be exceeded for this subtest to pass. |
| min_pci_generation | 1.0 | h2d_d2h_single_pinned, h2d_d2h_single_unpinned | Minimum allowed PCIe generation that the GPU must be at or exceed for this subtest to pass. |
| min_pci_width | 1.0 | h2d_d2h_single_pinned, h2d_d2h_single_unpinned | Minimum allowed PCIe width that the GPU must be at or exceed for this subtest to pass. For example, 16x = 16.0. |
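Subtest parameters are set in the configuration file, nested under the plugin's entry for a SKU, as in the H100 example earlier. A minimal sketch, where the threshold values below are illustrative placeholders:

pcie:
  is_allowed: true
  h2d_d2h_single_pinned:
    min_bandwidth: 10.0
  h2d_d2h_latency_pinned:
    max_latency: 50000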
Memtest Diagnostic
Overview
Beginning with 2.4.0, DCGM diagnostics support an additional level 4 diagnostic (-r 4). The first of these additional diagnostics is memtest. Similar to memtest86, the DCGM memtest exercises GPU memory with various test patterns. Each pattern is given a separate test and can be enabled or disabled by administrators.
Test Descriptions
Note
Test runtimes refer to average seconds per single iteration on a single A100 40GB GPU.
Test0 [Walking 1 bit] - This test changes one bit at a time in memory to see if it goes to a different memory location. It is designed to test the address wires. Runtime: ~3 seconds.
Test1 [Address check] - Each memory location is filled with its own address, followed by a check to see if the value in each memory location still agrees with the address. Runtime: < 1 second.
Test2 [Moving inversions, ones&zeros] - This test uses the moving inversions algorithm from memtest86 with patterns of all ones and zeros. Runtime: ~4 seconds.
Test3 [Moving inversions, 8 bit pat] - Same as test 1 but uses an 8-bit wide pattern of "walking" ones and zeros. Runtime: ~4 seconds.
Test4 [Moving inversions, random pattern] - Same algorithm as test 1, but the data pattern is a random number and its complement. A total of 60 patterns are used. The random number sequence is different with each pass, so multiple passes can increase effectiveness. Runtime: ~2 seconds.
Test5 [Block move, 64 moves] - This test moves blocks of memory. Memory is initialized with shifting patterns that are inverted every 8 bytes, then these blocks of memory are moved around. After the moves are completed, the data patterns are checked. Runtime: ~1 second.
Test6 [Moving inversions, 32 bit pat] - This is a variation of the moving inversions algorithm that shifts the data pattern left one bit for each successive address. To use all possible data patterns, 32 passes are made during the test. Runtime: ~155 seconds.
Test7 [Random number sequence] - A 1MB block of memory is initialized with random patterns. These patterns and their complements are used in moving inversion tests with the rest of memory. Runtime: ~2 seconds.
Test8 [Modulo 20, random pattern] - A random pattern is generated. This pattern is used to set every 20th memory location, and the rest of the memory locations are set to the complement of the pattern. This is repeated 20 times, and each time the memory location used to set the pattern is shifted right. Runtime: ~10 seconds.
Test9 [Bit fade test, 2 patterns] - The bit fade test initializes all memory with a pattern and then sleeps for 1 minute. Memory is then examined to see if any memory bits have changed. All-ones and all-zeros patterns are used. Runtime: ~244 seconds.
Test10 [Memory stress] - A random pattern is generated, and a large kernel is launched to set all memory to the pattern. A new read-and-write kernel is launched immediately after the previous write kernel to check if there are any errors in memory and to set the memory to the complement. This process is repeated 1000 times for one pattern. The kernel is written so as to achieve the maximum bandwidth between global memory and the GPU. Runtime: ~6 seconds.
Note
By default Test7 and Test10 alternate for a period of 10 minutes. If any errors are detected the diagnostic will fail.
Supported Parameters
| Parameter | Syntax | Default |
|---|---|---|
| test0 | boolean | false |
| test1 | boolean | false |
| test2 | boolean | false |
| test3 | boolean | false |
| test4 | boolean | false |
| test5 | boolean | false |
| test6 | boolean | false |
| test7 | boolean | true |
| test8 | boolean | false |
| test9 | boolean | false |
| test10 | boolean | true |
| test_duration | seconds | 600 |
Sample Commands
Run test7 and test10 for 10 minutes (this is the default):
$ dcgmi diag -r 4
Run each test serially for 1 hour then display results:
$ dcgmi diag -r 4 \
-p memtest.test0=true\;memtest.test1=true\;memtest.test2=true\;memtest.test3=true\;memtest.test4=true\;memtest.test5=true\;memtest.test6=true\;memtest.test7=true\;memtest.test8=true\;memtest.test9=true\;memtest.test10=true\;memtest.test_duration=3600
Run test0 for one minute 10 times, displaying the results each minute:
$ dcgmi diag \
--iterations 10 \
-r 4 \
-p memtest.test0=true\;memtest.test7=false\;memtest.test10=false\;memtest.test_duration=60
Pulse Test Diagnostic
Overview
The Pulse Test is part of the new level 4 tests. It fluctuates power usage to create spikes in the current flowing to the board, to ensure that the power supply is fully functional and can handle wide fluctuations in current.
Test Description
By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified, by measuring with oscilloscopes, to create worst-case scenario failures.
The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPUs is synchronized to create extra stress on the power supply.
| Parameter | Description | Default |
|---|---|---|
| test_duration | Seconds to spend on an iteration. This is not the exact amount of time the test will take. | 60 |
| patterns | Specify a comma-separated list of pattern indices the pulse test should use. Valid indices depend on the type of SKU. Hopper: 0-22; Ampere/Volta/Ada: 0-20. | All |
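For example, a hedged invocation that shortens each iteration and restricts the test to pattern index 0 (assuming a single index is a valid patterns value):
$ dcgmi diag -r pulse_test -p "pulse_test.test_duration=30;pulse_test.patterns=0"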
Note
In some cases with DCGM 2.4 and DCGM 3.0, users may encounter the following issue with running the Pulse test:
| Pulse Test | Fail - All |
| Warning    | GPU 0 There was an internal error during the test: 'The pulse test exited with non-zero status 1', GPU 0 There was an internal error during the test: 'The pulse test reported the error: Exception raised during execution: Failed opening file ubergemm.log for writing: Permission denied terminate called after throwing an instance of 'boost::wrapexcept<boost::property_tree::xml_parser::xml_parser_error>' what(): result.xml: cannot open file ' |
When running GPU diagnostics, DCGM by default drops privileges and uses an unprivileged service account to run the diagnostics. If the service account does not have write access to the directory where diagnostics are run, users may encounter this issue. To summarize, the issue happens when both of these conditions are true:
The nvidia-dcgm service is active and the nv-hostengine process is running (and no changes have been made to DCGM's default install configuration).
The user attempts to run dcgmi diag -r 4. In this case, dcgmi diag connects to the running nv-hostengine (which was started by default under /root), and thus the Pulse test is unable to create any logs.
This issue will be fixed in a future release of DCGM. In the meantime, users can do either of the following to work around the issue:
Option 1: Stop the nvidia-dcgm service before running the pulse_test.

1. Stop the nvidia-dcgm service:

$ sudo systemctl stop nvidia-dcgm

2. Now run the pulse_test:

$ dcgmi diag -r pulse_test

3. Restart the nvidia-dcgm service once the diagnostics are completed:

$ sudo systemctl restart nvidia-dcgm

Option 2: Edit the systemd unit service file to include a WorkingDirectory option, so that the service is started in a location writable by the nvidia-dcgm user (be sure that the directory shown in the example below, /tmp/dcgm-temp, is created):

[Service]
...
WorkingDirectory=/tmp/dcgm-temp
ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
...

Then reload the systemd configuration and start the nvidia-dcgm service:

$ sudo systemctl daemon-reload
$ sudo systemctl start nvidia-dcgm
Sample Commands
Run the entire diagnostic suite, including the pulse test:
$ dcgmi diag -r 4
Run just the pulse test:
$ dcgmi diag -r pulse_test
Run just the pulse test, but at a lower frequency:
$ dcgmi diag -r pulse_test -p pulse_test.freq0=3000
Run just the pulse test at a lower frequency and for a shorter time:
$ dcgmi diag -r pulse_test -p "pulse_test.freq0=5000;pulse_test.test_duration=180"
Failure Conditions
The pulse test will fail if the power supply unit cannot handle the spikes in the current.
It will also fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.
Extended Utility Diagnostics (EUD)
Starting with DCGM 3.1, the Extended Utility Diagnostics, or EUD, is available as a new plugin. Once installed, it’s available as a separate suite of tests and is also included in levels 3 and 4 of DCGM’s diagnostics. EUD provides various tests that perform the following checks on the GPU subsystems:
Confirmation of the numerical processing engines in the GPU
Integrity of data transfers to and from the GPU
Coverage of the full onboard memory address space that is available to CUDA programs
Supported Products
EUD supports the following GPU products:
NVIDIA V100 PCIe
NVIDIA V100 SXM2 (PG503-0201 and PG503-0203)
NVIDIA V100 SXM3
NVIDIA A100-SXM-40GB
NVIDIA A100-SXM-80GB
NVIDIA A100-PCIe-80GB
NVIDIA A100 OAM (PG509-0200 and PG509-0210)
NVIDIA A800 SXM
NVIDIA A800-PCIe-80GB
NVIDIA H100-PCIe-80GB
NVIDIA H100-SXM-80GB (HGX H100)
NVIDIA H100-SXM-96GB
NVIDIA HGX H100 4-GPU 64GB
NVIDIA HGX H100 4-GPU 80GB
NVIDIA HGX H100 4-GPU 94GB
NVIDIA HGX H800 4-GPU 80GB
NVIDIA L40
NVIDIA L40S
NVIDIA L4
The EUD is only supported on the R525 and later driver branches. Support for other products and driver branches will be added in a future release.
Included Tests
The EUD supports six different test suites targeting different types of GPU functionality:
Compute : The compute test suite focuses primarily on tests that run Matrix Multiply instructions on the GPU using different numerical representations (integers, doubles, etc.). The tests are generally run in two different ways:
A static constant workload to generate consistent and stable power draw
A pulsing workload
In addition to the Matrix Multiply tests, there are also several miscellaneous tests that focus on exercising other functionality related to compute (e.g. instruction test, compute video test, etc.)
Graphics : The graphics test suite focuses on testing the 2D and 3D rendering engines of the GPU
Memory : The memory test suite validates the GPU memory interface. The tests in the memory suite validate that the GPU memory can function without any errors both in normal operation and under various types of stress.
High Speed Input/Output (HSIO) : The HSIO test suite is focused on validating NVLink and PCIe functionality, focusing primarily on data transfer testing.
Miscellaneous : The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are board specific tests which validate non-GPU related items on the board like voltage regulator programming or board configuration.
Default : The default test suite runs when no other test suites are explicitly specified and will run one or more tests from each of the other test suites
Getting Started with EUD
Note
The following pre-requisites apply when using the EUD:
The EUD version must match the NVIDIA driver version installed on the system. When the NVIDIA driver is updated, the EUD must also be updated to the corresponding version.
Multi-Instance GPU (MIG) mode should be disabled prior to running the EUD. To disable MIG mode, refer to the nvidia-smi man page: $ nvidia-smi mig --help
GPU telemetry (either via the NVML/DCGM APIs or with nvidia-smi dmon / dcgmi dmon) should not be used when running the EUD. The EUD interacts heavily with the driver, and contention will impact testing and may cause timeouts.
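For example, MIG mode can typically be disabled on GPU 0 as follows (shown for illustration; a GPU reset or system reboot may be required for the change to take effect):
$ sudo nvidia-smi -i 0 -mig 0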
Supported Deployment Models
The EUD is only supported in the following deployment models:
| Deployment Model | Description | Supported |
|---|---|---|
| Bare-metal | Running directly on the system with no abstraction layer (i.e. VM, containerization, etc.) | Yes |
| Passthrough virtualization (aka "full passthrough") | When both GPU and NVSwitch are "passed through" to a VM. The VM has exclusive access to the devices. | |
| Shared NVSwitch | GPUs are passed through to the VM, but the NVSwitch is owned by a service VM. | |
Installing the EUD packages
Install the NVIDIA EUD package using the appropriate package manager for your Linux distribution.

In this release, the EUD binaries are available via a Linux archive. Follow these steps to get started:

1. Extract the archive under /usr.
2. Change ownership and group to root:

$ sudo chown -R root /usr/share/nvidia \
  && sudo chgrp -R root /usr/share/nvidia

3. Now proceed to run the EUD.

On Ubuntu:

1. Install the local repo package:

$ sudo dpkg -i nvidia-diagnostic-local-repo-ubuntu2204-525.125.06-mode1_1.0-1_amd64.deb

2. Copy the keyring file to the correct location; the exact copy command will be in the output of the dpkg command:

$ sudo cp /var/nvidia-diagnostic-local-repo-ubuntu2204-525.125.06-mode1/nvidia-diagnostic-local-D95A57C6-keyring.gpg /usr/share/keyrings

3. Install the diagnostic. The version of the diagnostic will match the major version specified in the local repo file:

$ sudo apt update
$ sudo apt install nvidia-diagnostic-525

Using yum: install the local repo file and then install the diagnostic. The version of the diagnostic will match the major version specified in the local repo file:

$ sudo rpm -i nvidia-diagnostic-local-repo-rhel8-525.125.06-mode1-1.0-1.x86_64.rpm
$ sudo yum install nvidia-diagnostic-525

Using dnf: install the local repo file and then install the diagnostic. The version of the diagnostic will match the major version specified in the local repo RPM file:

$ sudo rpm -i nvidia-diagnostic-local-repo-rhel8-525.125.06-mode1-1.0-1.x86_64.rpm
$ sudo dnf install nvidia-diagnostic-525

Using zypper: install the local repo file and then install the diagnostic. The version of the diagnostic will match the major version specified in the local repo RPM file:

$ sudo rpm -i nvidia-diagnostic-local-repo-rhel8-525.125.06-mode1-1.0-1.x86_64.rpm
$ sudo zypper install nvidia-diagnostic-525
The files for the EUD should be installed under /usr/share/nvidia/diagnostic/
Running the EUD
On supported GPU products, DCGM by default runs the EUD as part of levels 3 and 4, with two separate EUD test profiles:

Within run level 3 (dcgmi diag -r 3), the run time of the EUD test is less than 5 mins (at least one test from each of the test suites is run)
Within run level 4 (dcgmi diag -r 4), the run time of the EUD test is ~20 mins (all the test suites are run)

Note

The times provided above are the estimated runtimes of just the EUD test. The total runtime of -r 3 or -r 4 will be longer, as these levels include other tests.

By default, the EUD reports an error for the first failing test and stops. See run_on_error for details.

The EUD may also be run separately from the DCGM run levels via dcgmi diag -r eud, which runs the same set of tests as level 3.
Customization options
The EUD supports optional command-line arguments that can be specified during the run.
For example, to run the memory and compute tests:
$ dcgmi diag -r eud -p "eud.passthrough_args='run_tests=compute,memory'"
The -r eud option supports the following arguments:

| Option | Description |
|---|---|
| | Enables the full EUD test profile (of ~20 mins) and also enables customization of tests with the eud.passthrough_args option. |
| eud.passthrough_args | Allows additional controls on the EUD diagnostic tests. See the table later in this document. |
The table below provides the additional control arguments supported for eud.passthrough_args:

| Option | Description |
|---|---|
| | Test the nth device. |
| | Specify a unique log file name other than the default. In the name, %r will evaluate to the test status (one of PASS or FAIL), and %s will evaluate to the serial number of one of the devices under test. By default, the logs are created under /tmp/dcgm (see Logging below). |
| | Specify a single GPU to be tested, where w, x, y and z are hexadecimal numbers. |
| | Specify a subset of GPUs to be tested, where w, x, y and z are hexadecimal numbers. Each GPU address is comma-separated. |
| run_tests | Specify a subset of tests to be run. Each test is comma-separated and names one of the test suites described earlier. When not specified, at least one test from each of the individual test suites is run. |
| | Overrides the default NVSwitch topology file to use. When this argument is not present, a system topology file installed by the driver is used. |
| run_on_error | When this argument is present, the EUD will keep running even if an error occurs in any of the requested tests, and will report errors for all the failing tests only after completing all the requested tests. If the option is not specified, the EUD reports an error for the first failing test and stops. |
| | When this option is present, all testing of the NVLink interface will be skipped. |
Logging
By default, DCGM logs the runs of the EUD under /tmp/dcgm, where two files are generated:

dcgm_eud.log - This plain text file contains a stdout log of the EUD test run
dcgm_eud.mle - This binary file contains the results of the EUD tests

The MLE file can be decoded to JSON-formatted output by running the mla binary under /usr/share/nvidia/diagnostic. See the documentation titled "MLA JSON Report Decoding" in that directory for more information on the options and generated reports.
CPU Extended Utility Diagnostics (CPU EUD)
Starting with DCGM 3.3.7, the CPU Extended Utility Diagnostics, or CPU EUD, is available as a new test. Once installed, it is available as a separate suite of tests. The CPU EUD allows administrators to test for and report potential problems in the system.
Supported Products
CPU EUD supports the following NVIDIA products:
NVIDIA Grace CPU
Included Tests
The CPU EUD includes test suites targeting various aspects of CPU functionality:

CPU : The CPU test suite focuses on several critical areas to ensure the reliability and performance of the CPU. This includes tests designed to verify data correctness, monitor error counts, and validate CPU performance under different conditions.
Memory : The memory test suite for the CPU EUD validates the CPU memory interface. The tests run on both local and remote NUMA memory nodes, utilizing the full size of memory to ensure memory can function without errors and with high performance.
C2C / Clink : Leverages the remote memory test to saturate the C2C / Clink bus.
PCIe : The PCIe test suite validates the PCIe interface by checking link capabilities and ensuring stable performance, including the ability to retrain links while maintaining optimal operation between host and device.
Miscellaneous : The miscellaneous test suite runs tests that don't fit into any of the other categories. Most of the tests in this category are system-specific tests which validate the configuration and functionality of both CPU (e.g., CPU socket number, CPU number, CPU max/min MHz) and memory hardware to ensure the components are correctly identified and operational.
Note
By default, the CPU EUD will run one or more tests from each of the other test suites if not specified otherwise.
Getting Started with CPU EUD
Installing the CPU EUD packages
Install the NVIDIA CPU EUD package using the appropriate package manager for your Linux distribution.

On Ubuntu:

1. Check the installed packages and remove all those shown in the output:

$ dpkg -l | grep cpueud
// Example:
// ii cpueud-535 535.169-1 arm64 NVIDIA End-User cpueud
// ii cpueud-local-tegra-repo-ubuntu2204-535.169-mode1 1.0-1 arm64 cpueud-local-tegra repository configuration files
$ sudo dpkg --purge <found packages>
// Example:
// $ sudo dpkg --purge cpueud-local-tegra-repo-ubuntu2204-535.169-mode1
// $ sudo dpkg --purge cpueud-535

2. Install the local repo package:

$ sudo dpkg -i cpueud-local-tegra-repo-ubuntu2204-$VERSION-mode1_1.0-1_arm64.deb
// $VERSION: The version number of the package you are installing. Replace $VERSION with the actual version number of the package you're using.
// Example:
// $ sudo dpkg -i cpueud-local-tegra-repo-ubuntu2204-535.169-mode1_1.0-1_arm64.deb

3. Copy the keyring file to the correct location; the exact copy command will be in the output of the dpkg command:

$ sudo cp /var/cpueud-local-tegra-repo-ubuntu2204-535.169/cpueud-local-tegra-FFCE45E1-keyring.gpg /usr/share/keyrings/

4. Update apt and use it to install cpueud:

$ sudo apt-get update
$ sudo apt-get install cpueud

On RHEL:

1. Check the installed packages and remove all those shown in the output:

$ sudo dnf list installed | grep cpueud
// Example:
// cpueud-535.aarch64 535.169-1 @cpueud-local-tegra-rhel9-535.169-mode1
// cpueud-local-tegra-repo-rhel9-535.169-mode1.aarch64 1.0-1 @@System
$ sudo rpm -e <found packages>
$ sudo dnf remove <found packages>
// Example:
// $ sudo rpm -e cpueud-535.aarch64
// $ sudo dnf remove cpueud-535.aarch64
// $ sudo rpm -e cpueud-local-tegra-repo-rhel9-535.169-mode1.aarch64
// $ sudo dnf remove cpueud-local-tegra-repo-rhel9-535.169-mode1.aarch64

2. Install the local repo file, then install the diagnostic. Ensure the diagnostic version matches the major version specified in the local repo RPM file:

$ sudo yum install libxcrypt-compat
$ sudo rpm -i cpueud-local-tegra-repo-rhel9-$VERSION-mode1-1.0-1.aarch64.rpm
// $VERSION: The version number of the package you are installing. Replace $VERSION with the actual version number of the package you're using.
// Example:
// $ sudo rpm -i cpueud-local-tegra-repo-rhel9-535.169-mode1-1.0-1.aarch64.rpm
$ sudo dnf install cpueud
The files for the EUD should be installed under /usr/share/nvidia/cpu/diagnostic/
Running the CPU EUD
Syntax
# dcgmi diag -r cpu_eud [options]
Logging
By default, DCGM logs the runs of the CPU EUD under /var/log/nvidia-dcgm/, where three files are generated:

dcgm_cpu_eud_stdout.txt - This plain text file contains a stdout log of the CPU EUD test run
dcgm_cpu_eud_stderr.txt - This plain text file contains a stderr log of the CPU EUD test run
dcgm_cpu_eud.log - This file is an encrypted log of the CPU EUD test run

You can also specify cpu_eud.tmp_dir to set the directory where you want to store the log files.
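For example, a sketch redirecting the logs via the cpu_eud.tmp_dir parameter mentioned above (the directory path is a placeholder):
# dcgmi diag -r cpu_eud -p cpu_eud.tmp_dir=/tmp/dcgm-cpu-eud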
Command Usage
Default
To obtain the results in tabular format, use the following command:
# dcgmi diag -r cpu_eud
Example Output
Pass case
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 4.0.0 |
| Number of CPUs Detected | 1 |
| CPU EUD Test Version | eud.535.161 |
+----- Hardware ----------+------------------------------------------------+
| cpu_eud | Pass |
| | CPU0: Pass |
+---------------------------+------------------------------------------------+
Failure case
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 4.0.0 |
| Number of CPUs Detected | 1 |
| CPU EUD Test Version | eud.535.161 |
+----- Hardware ----------+------------------------------------------------+
| cpu_eud | Fail |
| | CPU0: Fail |
| Warning: CPU0 | Error : bad command line argument |
+---------------------------+------------------------------------------------+
JSON Output
To obtain the results in JSON format, use the following command:
# dcgmi diag -r cpu_eud -j
JSON schema for each element of the tests array:
{
"$schema": "http://json-schema.org/schema#",
"type": "object",
"properties": {
"name": {
"type": "string"
},
"results": {
"type": "array",
"items": {
"type": "object",
"properties": {
"entity_group": {
"type": "string"
},
"entity_group_id": {
"type": "integer"
},
"entity_id": {
"type": "integer"
},
"status": {
"type": "string"
},
"info": {
"type": "array",
"items": {
"type": "string"
}
},
"warnings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"error_category": {
"type": "integer"
},
"error_id": {
"type": "integer"
},
"error_severity": {
"type": "integer"
},
"warning": {
"type": "string"
}
},
"required": [
"error_category",
"error_id",
"error_severity",
"warning"
]
}
}
},
"required": [
"entity_group",
"entity_group_id",
"entity_id",
"status"
]
}
},
"test_summary": {
"type": "object",
"properties": {
"status": {
"type": "string"
},
"info": {
"type": "array",
"items": {
"type": "string"
}
},
"warnings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"error_category": {
"type": "integer"
},
"error_id": {
"type": "integer"
},
"error_severity": {
"type": "integer"
},
"warning": {
"type": "string"
}
},
"required": [
"error_category",
"error_id",
"error_severity",
"warning"
]
}
}
},
"required": [
"status"
]
}
},
"required": [
"name",
"results",
"test_summary"
]
}
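If you consume this output programmatically, the schema above can be checked mechanically. Below is a minimal sketch using the third-party jsonschema Python package; the file names cpu_eud_test_schema.json and cpu_eud_output.json are placeholders for the schema shown above and a captured dcgmi diag -r cpu_eud -j result:

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Load the schema shown above (saved to a file) and a captured result.
with open("cpu_eud_test_schema.json") as f:
    schema = json.load(f)

with open("cpu_eud_output.json") as f:
    output = json.load(f)

# Validate each element of the "tests" array against the schema.
for test in output["tests"]:
    try:
        validate(instance=test, schema=schema)
        print(f"{test['name']}: schema OK")
    except ValidationError as err:
        print(f"{test['name']}: schema violation: {err.message}")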
Example Output
Pass case
{
"category": "Hardware",
"tests": [
{
"name": "cpu_eud",
"results": [
{
"entity_group": "CPU",
"entity_group_id": 7,
"entity_id": 0,
"status": "Pass"
},
{
"entity_group": "CPU",
"entity_group_id": 7,
"entity_id": 1,
"status": "Skip"
}
],
"test_summary": {
"status": "Pass"
}
}
]
}
Failure case
{
"category": "Hardware",
"tests": [
{
"name": "cpu_eud",
"results": [
{
"entity_group": "CPU",
"entity_group_id": 7,
"entity_id": 0,
"status": "Fail",
"warnings": [
{
"error_category": 7,
"error_id": 95,
"error_severity": 2,
"warning": "Error : bad command line argument"
}
]
},
{
"entity_group": "CPU",
"entity_group_id": 7,
"entity_id": 1,
"status": "Skip"
}
],
"test_summary": {
"status": "Fail"
}
}
]
}
Automating Responses to DCGM Diagnostic Failures
Overview
Automating workflows based on DCGM diagnostics can enable sites to handle GPU errors more efficiently. Additional data for determining the severity of errors and potential next steps is available using either the API or by parsing the JSON returned on the CLI. Besides simply reporting human-readable strings describing which errors occurred during the diagnostic, each error also includes a specific ID, severity, and category that can be useful when deciding how to handle the failure.
The latest versions of these enums can be found in dcgm_errors.h.
| Error Category Enum | Value |
|---|---|
| DCGM_FR_EC_NONE | 0 |
| DCGM_FR_EC_PERF_THRESHOLD | 1 |
| DCGM_FR_EC_PERF_VIOLATION | 2 |
| DCGM_FR_EC_SOFTWARE_CONFIG | 3 |
| DCGM_FR_EC_SOFTWARE_LIBRARY | 4 |
| DCGM_FR_EC_SOFTWARE_XID | 5 |
| DCGM_FR_EC_SOFTWARE_CUDA | 6 |
| DCGM_FR_EC_SOFTWARE_EUD | 7 |
| DCGM_FR_EC_SOFTWARE_OTHER | 8 |
| DCGM_FR_EC_HARDWARE_THERMAL | 9 |
| DCGM_FR_EC_HARDWARE_MEMORY | 10 |
| DCGM_FR_EC_HARDWARE_NVLINK | 11 |
| DCGM_FR_EC_HARDWARE_NVSWITCH | 12 |
| DCGM_FR_EC_HARDWARE_PCIE | 13 |
| DCGM_FR_EC_HARDWARE_POWER | 14 |
| DCGM_FR_EC_HARDWARE_OTHER | 15 |
| DCGM_FR_EC_INTERNAL_OTHER | 16 |
| Error Severity Enum | Value |
|---|---|
| DCGM_ERROR_NONE | 0 |
| DCGM_ERROR_MONITOR | 1 |
| DCGM_ERROR_ISOLATE | 2 |
| DCGM_ERROR_UNKNOWN | 3 |
| DCGM_ERROR_TRIAGE | 4 |
| DCGM_ERROR_CONFIG | 5 |
| DCGM_ERROR_RESET | 6 |
| Error Enum | Value | Severity | Category |
|---|---|---|---|
| DCGM_FR_OK | 0 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_UNKNOWN | 1 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_UNRECOGNIZED | 2 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_PCI_REPLAY_RATE | 3 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_VOLATILE_DBE_DETECTED | 4 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_VOLATILE_SBE_DETECTED | 5 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PENDING_PAGE_RETIREMENTS | 6 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_RETIRED_PAGES_LIMIT | 7 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_RETIRED_PAGES_DBE_LIMIT | 8 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CORRUPT_INFOROM | 9 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_CLOCKS_EVENT_THERMAL | 10 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_POWER_UNREADABLE | 11 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_CLOCKS_EVENT_POWER | 12 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_NVLINK_ERROR_THRESHOLD | 13 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVLINK_DOWN | 14 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVSWITCH_FATAL_ERROR | 15 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NVSWITCH_NON_FATAL_ERROR | 16 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NVSWITCH_DOWN | 17 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NO_ACCESS_TO_FILE | 18 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_NVML_API | 19 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_DEVICE_COUNT_MISMATCH | 20 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_BAD_PARAMETER | 21 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CANNOT_OPEN_LIB | 22 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_DENYLISTED_DRIVER | 23 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_NVML_LIB_BAD | 24 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_GRAPHICS_PROCESSES | 25 | DCGM_ERROR_RESET | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_HOSTENGINE_CONN | 26 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_QUERY | 27 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_BAD_CUDA_ENV | 28 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_PERSISTENCE_MODE | 29 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_LOW_BANDWIDTH | 30 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_HIGH_LATENCY | 31 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_CANNOT_GET_FIELD_TAG | 32 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_VIOLATION | 33 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_FIELD_THRESHOLD | 34 | DCGM_ERROR_MONITOR | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_FIELD_VIOLATION_DBL | 35 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_FIELD_THRESHOLD_DBL | 36 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_UNSUPPORTED_FIELD_TYPE | 37 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_THRESHOLD_TS | 38 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_FIELD_THRESHOLD_TS_DBL | 39 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_THERMAL_VIOLATIONS | 40 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_THERMAL_VIOLATIONS_TS | 41 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_TEMP_VIOLATION | 42 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_CLOCKS_EVENT_VIOLATION | 43 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_INTERNAL | 44 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_PCIE_GENERATION | 45 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_PCIE_WIDTH | 46 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_ABORTED | 47 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_TEST_DISABLED | 48 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_CANNOT_GET_STAT | 49 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_STRESS_LEVEL | 50 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_CUDA_API | 51 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_FAULTY_MEMORY | 52 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CANNOT_SET_WATCHES | 53 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CUDA_UNBOUND | 54 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_ECC_DISABLED | 55 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_ALLOC | 56 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_CUDA_DBE | 57 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_MISMATCH | 58 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CUDA_DEVICE | 59 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_ECC_UNSUPPORTED | 60 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_ECC_PENDING | 61 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_BANDWIDTH | 62 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_TARGET_POWER | 63 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_API_FAIL | 64 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_API_FAIL_GPU | 65 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_CUDA_CONTEXT | 66 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_DCGM_API | 67 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CONCURRENT_GPUS | 68 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_TOO_MANY_ERRORS | 69 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_NVLINK_CRC_ERROR_THRESHOLD | 70 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVLINK_ERROR_CRITICAL | 71 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_ENFORCED_POWER_LIMIT | 72 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_MEMORY_ALLOC_HOST | 73 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_GPU_OP_MODE | 74 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_NO_MEMORY_CLOCKS | 75 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_NO_GRAPHICS_CLOCKS | 76 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_HAD_TO_RESTORE_STATE | 77 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_L1TAG_UNSUPPORTED | 78 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_L1TAG_MISCOMPARE | 79 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_ROW_REMAP_FAILURE | 80 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_UNCONTAINED_ERROR | 81 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_XID |
| DCGM_FR_EMPTY_GPU_LIST | 82 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_DBE_PENDING_PAGE_RETIREMENTS | 83 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_UNCORRECTABLE_ROW_REMAP | 84 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PENDING_ROW_REMAP | 85 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_BROKEN_P2P_MEMORY_DEVICE | 86 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_BROKEN_P2P_WRITER_DEVICE | 87 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_NVSWITCH_NVLINK_DOWN | 88 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_EUD_BINARY_PERMISSIONS | 89 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_NON_ROOT_USER | 90 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_SPAWN_FAILURE | 91 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_TIMEOUT | 92 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_ZOMBIE | 93 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_NON_ZERO_EXIT_CODE | 94 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_TEST_FAILED | 95 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_FILE_CREATE_PERMISSIONS | 96 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_PAUSE_RESUME_FAILED | 97 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_PCIE_H_REPLAY_VIOLATION | 98 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_GPU_EXPECTED_NVLINKS_UP | 99 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVSWITCH_EXPECTED_NVLINKS_UP | 100 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_XID_ERROR | 101 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_XID |
| DCGM_FR_SBE_VIOLATION | 102 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_DBE_VIOLATION | 103 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PCIE_REPLAY_VIOLATION | 104 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_SBE_THRESHOLD_VIOLATION | 105 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_DBE_THRESHOLD_VIOLATION | 106 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PCIE_REPLAY_THRESHOLD_VIOLATION | 107 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_CUDA_FM_NOT_INITIALIZED | 108 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_SXID_ERROR | 109 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_XID |
These relationships are codified in dcgm_errors.c.
In general, DCGM has high confidence that errors with the ISOLATE and RESET severities should be handled immediately. Other severities may require more site-specific analysis, a re-run of the diagnostic, or a scanning of DCGM and system logs to determine the best course of action. Gathering and recording the failure types and rates over time can give datacenters insight into the best way to automate handling of GPU diagnostic errors.
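As a starting point, the sketch below (Python, illustrative only) walks a parsed category object from dcgmi diag -j, in the shape of the earlier JSON examples, and separates warnings whose severity is ISOLATE (2) or RESET (6), per the severity table above, from everything else:

import json
import sys

# Severity values from the table above (see dcgm_errors.h for the latest).
DCGM_ERROR_ISOLATE = 2
DCGM_ERROR_RESET = 6

def classify(category):
    """Split warnings into those DCGM says to act on now vs. review later."""
    act_now, review = [], []
    for test in category.get("tests", []):
        for result in test.get("results", []):
            for w in result.get("warnings", []):
                target = (
                    act_now
                    if w.get("error_severity") in (DCGM_ERROR_ISOLATE, DCGM_ERROR_RESET)
                    else review
                )
                target.append((test["name"], result.get("entity_id"), w.get("warning")))
    return act_now, review

if __name__ == "__main__":
    # e.g. feed one category object (like the "Hardware" examples above) via stdin
    act_now, review = classify(json.load(sys.stdin))
    for name, entity, warning in act_now:
        print(f"ISOLATE/RESET entity {entity} ({name}): {warning}")
    for name, entity, warning in review:
        print(f"review entity {entity} ({name}): {warning}")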