DCGM Diagnostics

Overview

The NVIDIA Validation Suite (NVVS) is now called DCGM Diagnostics. As of DCGM v1.5, running NVVS as a standalone utility is deprecated; all of its functionality (including command-line options) is available via the DCGM command-line utility (dcgmi). For brevity, the rest of this document may use DCGM Diagnostics and NVVS interchangeably.

DCGM Diagnostic Goals

DCGM Diagnostics are designed to:

  1. Provide a system-level tool, in production environments, to assess cluster readiness levels before a workload is deployed.

  2. Facilitate multiple run modes:

    • Interactive via an administrator or user in plain text.

    • Scripted via another tool with easily parseable output.

  3. Provide multiple test timeframes to facilitate different preparedness or failure conditions:

    • Level 1 tests to use as a readiness metric

    • Level 2 tests to use as an epilogue on failure

    • Level 3 and Level 4 tests to be run by an administrator as post-mortem

  4. Integrate the following concepts into a single tool to discover deployment, system software and hardware configuration issues, basic diagnostics, integration issues, and relative system performance.

    • Deployment and Software Issues

      • NVML library access and versioning

      • CUDA library access and versioning

      • Software conflicts

    • Hardware Issues and Diagnostics

      • Pending Page Retirements

      • PCIe interface checks

      • NVLink interface checks

      • Framebuffer and memory checks

      • Compute engine checks

    • Integration Issues

      • PCIe replay counter checks

      • Topological limitations

      • Permissions, driver, and cgroups checks

      • Basic power and thermal constraint checks

    • Stress Checks

      • Power and thermal stress

      • Throughput stress

      • Constant relative system performance

      • Maximum relative system performance

      • Memory Bandwidth

  5. Provide troubleshooting help

  6. Easily integrate into Cluster Scheduler and Cluster Management applications

  7. Reduce downtime and failed GPU jobs

Beyond the Scope of the DCGM Diagnostics

DCGM Diagnostics are not designed to:

  1. Provide comprehensive hardware diagnostics

  2. Actively fix problems

  3. Replace the field diagnosis tools. Please refer to http://docs.nvidia.com/deploy/hw-field-diag/index.html for that process.

  4. Facilitate any RMA process. Please refer to http://docs.nvidia.com/deploy/rma-process/index.html for those procedures.

Run Levels and Tests

The following table describes which tests are run at each Level in DCGM Diagnostics.

| Plugin | Test name | r1 (Short) (seconds) | r2 (Medium) (< 2 mins) | r3 (Long) (< 30 mins) | r4 (Extra Long) (1-2 hours) |
|---|---|---|---|---|---|
| Software | software | Yes | Yes | Yes | Yes |
| PCIe + NVLink | pcie | | Yes | Yes | Yes |
| GPU Memory | memory | | Yes | Yes | Yes |
| Memory Bandwidth | memory_bandwidth | | Yes | Yes | Yes |
| Diagnostics | diagnostic | | | Yes | Yes |
| Targeted Stress | targeted_stress | | | Yes | Yes |
| Targeted Power | targeted_power | | | Yes | Yes |
| Memory Stress | memtest | | | | Yes |
| Input EDPp | pulse | | | | Yes |

Getting Started with DCGM Diagnostics

Command Line options

The various command line options are designed to control general execution parameters, whereas detailed changes to execution behavior are contained within the configuration files detailed in the next section.

The following command-line options are supported by DCGM Diagnostics:

-g, --group <groupId>
  The device group ID to query.

--host <IP/FQDN>
  Connects to the specified IP or fully-qualified domain name. To connect to a host engine that was started with -d (unix socket), prefix the unix socket filename with unix://. [default = localhost]

-h, --help
  Displays usage information and exits.

-r, --run <diag>
  Run a diagnostic. (Note: higher run levels include all tests from the levels beneath.)

    • 1 - Quick (System Validation)

    • 2 - Medium (Extended System Validation)

    • 3 - Long (System HW Diagnostics)

    • 4 - Extended (Longer-running System HW Diagnostics)

  Specific tests to run may be specified by name, and multiple tests may be specified as a comma-separated list. For example, the command:

  dcgmi diag -r "pcie,diagnostic"

  would run the PCIe and Diagnostic tests together.

-p, --parameters <test_name.variable_name=variable_value>
  Test parameters to set for this run.

-c, --configfile <full/path/to/config/file>
  Path to the configuration file.

-f, --fakeGpuList <fakeGpuList>
  A comma-separated list of the fake GPUs on which the diagnostic should run. For internal/testing use only. Cannot be used with -g/-i.

-i, --gpuList <gpuList>
  A comma-separated list of the GPUs on which the diagnostic should run. Cannot be used with -g.

-v, --verbose
  Show information and warnings for each test.

--statsonfail
  Only output the statistics files if there was a failure.

--debugLogFile <debug file>
  Specify the file to which debug information is written.

--statspath <plugin statistics path>
  Write the plugin statistics to the given path rather than the current directory.

-d, --debugLevel <debug level>
  Debug level (one of NONE, FATAL, ERROR, WARN, INFO, DEBUG, VERB). Default: DEBUG. The log file can be specified with the --debugLogFile parameter.

-j, --json
  Print the output in JSON format.

--throttle-mask
  Specify which throttling reasons should be ignored. You can provide a comma-separated list of reasons, for example 'HW_SLOWDOWN,SW_THERMAL'. Alternatively, you can specify the integer value of the ignore bitmask; multiple reasons may be combined by summing their bitmasks. For example, specifying '40' would ignore the HW_SLOWDOWN and SW_THERMAL throttling reasons (8 + 32 = 40). Valid throttling reasons and their corresponding bitmasks (given in parentheses) are:

    • HW_SLOWDOWN (8)

    • SW_THERMAL (32)

    • HW_THERMAL (64)

    • HW_POWER_BRAKE (128)

  (See the example after this list.)

--fail-early
  Enable early failure checks for the Targeted Power, Targeted Stress, and Diagnostic tests. When enabled, these tests check for a failure once every 5 seconds (can be modified by the --check-interval parameter) while the test is running, instead of a single check performed after the test is complete. Disabled by default.

--check-interval <failure check interval>
  Specify the interval (in seconds) at which the early failure checks should occur for the Targeted Power, Targeted Stress, SM Stress, and Diagnostic tests when early failure checks are enabled. Default is once every 5 seconds. The interval must be between 1 and 300.

--iterations <iterations>
  Specify a number of iterations of the diagnostic to run consecutively. (Must be greater than 0.)

--ignore_rest
  Ignores the rest of the labeled arguments following this flag.
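
As an illustration combining several of the options above (a sketch only; adjust the run level, GPU list, and values for your environment), the following two invocations are equivalent ways to run level 3 on GPUs 0 and 1 with early failure checks, JSON output, and the HW_SLOWDOWN and SW_THERMAL throttling reasons ignored by name or by bitmask:

$ dcgmi diag -r 3 -i 0,1 --throttle-mask HW_SLOWDOWN,SW_THERMAL --fail-early --check-interval 3 -j

$ dcgmi diag -r 3 -i 0,1 --throttle-mask 40 --fail-early --check-interval 3 -j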

Configuration File

The DCGM Diagnostics (dcgmi diag) configuration file is a YAML-formatted text file controlling the various tests and their execution parameters.

The general format of the configuration file is shown below:

version:
spec: dcgm-diag-v1
skus:
  - name: GPU-name
    id: GPU part number
    test_name1:
      test_parameter1: value
      test_parameter2: value
    test_name2:
      test_parameter1: value
      test_parameter2: value

A standard configuration file for H100 would look like the following:

version: "@CMAKE_PROJECT_VERSION@"
spec: dcgm-diag-v1
skus:
  - name: H100 80GB PCIe
    id: 2331
    targeted_power:
      is_allowed: true
      starting_matrix_dim: 1024.0
      target_power: 350.0
      use_dgemm: false
    targeted_stress:
      is_allowed: true
      use_dgemm: false
      target_stress: 15375
    sm_stress:
      is_allowed: true
      target_stress: 15375.0
      use_dgemm: false
    pcie:
      is_allowed: true
      h2d_d2h_single_pinned:
        min_pci_generation: 3.0
        min_pci_width: 16.0
      h2d_d2h_single_unpinned:
        min_pci_generation: 3.0
        min_pci_width: 16.0
    memory:
      is_allowed: true
      l1cache_size_kb_per_sm: 192.0
    diagnostic:
      is_allowed: true
      matrix_dim: 8192.0
    memory_bandwidth:
      is_allowed: true
      minimum_bandwidth: 1230000
    pulse_test:
      is_allowed: true

Usage Examples

Custom Configuration File

The default configuration file can be overridden using the -c option.

$ dcgmi diag -r 2 -c custom-diag-tests.yaml

where desired tests and parameters are included in the custom-diag-tests.yaml file.

Tests and Parameters

Specific tests and parameters can be directly specified when running diagnostics:

$ dcgmi diag -r targeted_power -p targeted_power.target_power=300.0

Iterations

DCGM also supports running test suites in a loop using the --iterations option. Using this option allows for increasing the runtime duration of the tests.

$ dcgmi diag -r pcie --iterations 3

Logging

By default, DCGM emits debugging information into a log file stored at /var/log/nvidia-dcgm/nvvs.log.

DCGM also provides a JSON output of the results of the tests, which allows for processing by various tools.

$ dcgmi diag -r pcie -j
...
{
     "category" : "Integration",
     "tests" :
     [
             {
                     "name" : "PCIe",
                     "results" :
                     [
                             {
                                     "gpu_ids" : "0",
                                     "info" : "GPU 0 GPU 0 GPU to Host bandwidth:\t\t20.39 GB/s, GPU 0 GPU 0 Host to GPU bandwidth:\t\t27.99 GB/s, GPU 0 GPU 0 bidirectional bandwidth:\t24.79 GB/s, GPU 0 GPU 0 GPU to Host latency:\t\t1.482 us, GPU 0 GPU 0 Host to GPU latency:\t\t1.546 us, GPU 0 GPU 0 bidirectional latency:\t\t2.963 us",
                                     "status" : "Pass"
                             }
                     ]
             }
     ]
}

Overview of Plugins

The NVIDIA Validation Suite consists of a series of plugins that are each designed to accomplish a different goal.

Deployment Plugin

The deployment plugin’s purpose is to verify the compute environment is ready to run CUDA applications and is able to load the NVML library.

Preconditions

  • LD_LIBRARY_PATH must include the path to the CUDA libraries, which for version X.Y of CUDA is normally /usr/local/cuda-X.Y/lib64, which can be set by running export LD_LIBRARY_PATH=/usr/local/cuda-X.Y/lib64

  • The Linux nouveau driver must not be running, and should be blacklisted since it will conflict with the NVIDIA driver
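
For example, on many Linux distributions the nouveau driver can be blacklisted with a modprobe configuration similar to the following (the file name and initramfs tooling vary by distribution; this is a sketch rather than a prescribed procedure):

$ echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
$ sudo update-initramfs -u    # Debian/Ubuntu; use 'sudo dracut --force' on RHEL-based distributions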

Configuration Parameters

None at this time.

Stat Outputs

None at this time.

Failure

The plugin will fail if:

  • The corresponding device nodes for the target GPU(s) are being blocked by the operating system (e.g. cgroups) or exist without r/w permissions for the current user.

  • The NVML library libnvidia-ml.so cannot be loaded

  • The CUDA runtime libraries cannot be loaded

  • The nouveau driver is found to be loaded

  • Any pages are pending retirement on the target GPU(s)

  • Any row remaps are pending or any row remapping failures have occurred on the target GPU(s)

  • Any other graphics processes are running on the target GPU(s) while the plugin runs

Diagnostic Plugin

Overview

The Diagnostic plugin is part of the level 3 tests. It performs large matrix multiplications while copying data to various addresses in the frame buffer and checking that the data can be written and read correctly.

This test performs large matrix multiplications; by default, it alternates among all available precisions (64-, 32-, and 16-bit). It also walks the frame buffer, writing values to different addresses and making sure that the values are written and read correctly.

Test Description

This process will stress the GPU by having it draw a large amount of power and provide a high level of throughput for five minutes (by default). During this process, the GPU will be monitored for all standard errors (XIDs, temperature violations, uncorrectable memory errors, etc.) as well as the correctness of data being written and read.

Supported Parameters

The following table lists the global parameters for the diagnostic plugin:

| Parameter Name | Type | Default | Description |
|---|---|---|---|
| max_sbe_errors | Double | Blank | The threshold beyond which SBEs are treated as errors. |
| test_duration | Double | 180.0 | The time in seconds that the test should run. |
| use_doubles | String | False | Indicates that doubles should be used instead of floats. |
| temperature_max | Double | 30.0 | The maximum temperature in degrees allowed during the test. |
| is_allowed | Bool | False | Whether the specified test is allowed to run. |
| matrix_dim | Double | 2048.0 | The starting dimension of the matrix used for S/Dgemm. |
| precision | String | Half Single Double | The precision to use: half, single, or double. |
| gflops_tolerance_pcnt | Double | 0.0 | The percent of mean below which gflops are treated as errors. |

Sample Commands

Run a quick diagnostic:
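
One plausible form, consistent with the other samples in this section, is to run level 3 (which includes this plugin) with its default parameters:

$ dcgmi diag -r 3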

Run the diagnostic for 5 minutes:

$ dcgmi diag -r 3 -p diagnostic.test_duration=300.0

Run the diagnostic, stopping if max temperature exceeds 28 degrees:

$ dcgmi diag -r 3 -p diagnostic.temperature_max=28.0

Run the diagnostic, with a smaller starting dimension for matrix operations:

$ dcgmi diag -r 3 -p diagnostic.matrix_dim=1024.0

Run the diagnostic, reporting an error if a GPU's gflops fall below 60% of the mean gflops across all GPUs:

$ dcgmi diag -r 3 -p diagnostic.gflops_tolerance_pcnt=0.60

Run the diagnostic, using double precision:

$ dcgmi diag -r 3 -p diagnostic.precision=double

Failure Conditions

  • The test will fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.

PCIe - GPU Bandwidth Plugin

The GPU bandwidth plugin’s purpose is to measure the bandwidth and latency to and from the GPUs and the host.

Preconditions

None

Sub tests

The plugin consists of several sub-tests, each measuring a different aspect of bandwidth or latency. Each sub-test has either a pinned/unpinned pair or a P2P enabled/P2P disabled pair of identical tests. Pinned/unpinned tests use either pinned or unpinned memory when copying data between the host and the GPUs.

This plugin will use NVLink to communicate between GPUs when possible. Otherwise, communication between GPUs will occur over PCIe.

Each sub-test is represented by a tag that is used both for specifying configuration parameters for the sub-test and for outputting stats for the sub-test. P2P enabled/P2P disabled tests enable or disable GPUs on the same card talking to each other directly rather than through the PCIe bus.

| Sub Test Tag | Pinned/Unpinned or P2P Enabled/P2P Disabled | Description |
|---|---|---|
| h2d_d2h_single_pinned | Pinned | Device <-> Host Bandwidth, one GPU at a time |
| h2d_d2h_single_unpinned | Unpinned | Device <-> Host Bandwidth, one GPU at a time |
| h2d_d2h_latency_pinned | Pinned | Device <-> Host Latency, one GPU at a time |
| h2d_d2h_latency_unpinned | Unpinned | Device <-> Host Latency, one GPU at a time |
| p2p_bw_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, one GPU pair at a time |
| p2p_bw_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, one GPU pair at a time |
| p2p_bw_concurrent_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1 |
| p2p_bw_concurrent_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, concurrently, focusing on bandwidth between GPUs likely to be directly connected to each other -> for each (index / 2) and (index / 2)+1 |
| 1d_exch_bw_p2p_enabled | P2P Enabled | Device <-> Device bandwidth, concurrently, with every GPU either sending to the GPU with the next higher index (l2r) or to the GPU with the next lower index (r2l) |
| 1d_exch_bw_p2p_disabled | P2P Disabled | Device <-> Device bandwidth, concurrently, with every GPU either sending to the GPU with the next higher index (l2r) or to the GPU with the next lower index (r2l) |
| p2p_latency_p2p_enabled | P2P Enabled | Device <-> Device Latency, one GPU pair at a time |
| p2p_latency_p2p_disabled | P2P Disabled | Device <-> Device Latency, one GPU pair at a time |

The following table lists the global parameters for the PCIe plugin.

| Parameter Name | Type | Default | Description |
|---|---|---|---|
| test_pinned | Bool | True | Include subtests that test using pinned memory. |
| test_unpinned | Bool | True | Include subtests that test using unpinned memory. |
| test_p2p_on | Bool | True | Include subtests that require peer-to-peer (P2P) memory transfers between cards to occur. |
| test_p2p_off | Bool | True | Include subtests that do not require peer-to-peer (P2P) memory transfers between cards to occur. |
| max_pcie_replays | Float | 80.0 | Maximum number of PCIe replays to allow per GPU for the duration of this plugin. This is based on an expected replay rate of less than 8 per minute for PCIe Gen 3.0, assuming this plugin will run for less than a minute and allowing 10x as many replays before failure. |

The following table lists the parameters to specific subtests for the PCIe plugin.

| Parameter Name | Default | Sub Tests | Description |
|---|---|---|---|
| min_bandwidth | 0 | h2d_d2h_single_pinned, h2d_d2h_single_unpinned, h2d_d2h_concurrent_pinned, h2d_d2h_concurrent_unpinned | Minimum bandwidth in GB/s that must be reached for this sub-test to pass. |
| max_latency | 100,000 | h2d_d2h_latency_pinned, h2d_d2h_latency_unpinned | Latency in microseconds that cannot be exceeded for this sub-test to pass. |
| min_pci_generation | 1.0 | h2d_d2h_single_pinned, h2d_d2h_single_unpinned | Minimum allowed PCI generation that the GPU must be at or exceed for this sub-test to pass. |
| min_pci_width | 1.0 | h2d_d2h_single_pinned, h2d_d2h_single_unpinned | Minimum allowed PCI width that the GPU must be at or exceed for this sub-test to pass. For example, 16x = 16.0. |
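
For example (a sketch using the -p syntax described earlier; the values are illustrative), the unpinned subtests can be skipped and the PCIe replay threshold relaxed in a single run:

$ dcgmi diag -r pcie -p "pcie.test_unpinned=false;pcie.max_pcie_replays=160"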

Memtest Diagnostic

Overview

Beginning with DCGM 2.4.0, DCGM Diagnostics support an additional level 4 run (-r 4). The first of these additional diagnostics is memtest. Similar to memtest86, the DCGM memtest will exercise GPU memory with various test patterns. Each pattern is given a separate test and can be enabled or disabled by administrators.

Test Descriptions

Note

Test runtimes refer to average seconds per single iteration on a single A100 40GB GPU.

Test0 [Walking 1 bit] - This test changes one bit at a time in memory to see if it goes to a different memory location. It is designed to test the address wires. Runtime: ~3 seconds.

Test1 [Address check] - Each memory location is filled with its own address, followed by a check to see if the value in each memory location still agrees with the address. Runtime: < 1 second.

Test 2 [Moving inversions, ones&zeros] - This test uses the moving inversions algorithm from memtest86 with patterns of all ones and zeros. Runtime: ~4 seconds.

Test 3 [Moving inversions, 8 bit pat] - Same as test 1 but uses an 8-bit wide pattern of “walking” ones and zeros. Runtime: ~4 seconds.

Test 4 [Moving inversions, random pattern] - Same algorithm as test 1, but the data pattern is a random number and its complement. A total of 60 patterns are used. The random number sequence is different with each pass, so multiple passes can increase effectiveness. Runtime: ~2 seconds.

Test 5 [Block move, 64 moves] - This test moves blocks of memory. Memory is initialized with shifting patterns that are inverted every 8 bytes. Then these blocks of memory are moved around. After the moves are completed the data patterns are checked. Runtime: ~1 second.

Test 6 [Moving inversions, 32 bit pat] - This is a variation of the moving inversions algorithm that shifts the data pattern left one bit for each successive address. To use all possible data patterns 32 passes are made during the test. Runtime: ~155 seconds.

Test 7 [Random number sequence] - A 1MB block of memory is initialized with random patterns. These patterns and their complements are used in moving inversion tests with rest of memory. Runtime: ~2 seconds.

Test 8 [Modulo 20, random pattern] - A random pattern is generated. This pattern is used to set every 20th memory location in memory. The rest of the memory locations are set to the complement of the pattern. This is repeated 20 times, and each time the memory location used to set the pattern is shifted right. Runtime: ~10 seconds.

Test 9 [Bit fade test, 2 patterns] - The bit fade test initializes all memory with a pattern and then sleeps for 1 minute. Then memory is examined to see if any memory bits have changed. All ones and all zero patterns are used. Runtime: ~244 seconds.

Test10 [Memory stress] - A random pattern is generated and a large kernel is launched to set all memory to the pattern. A new read-and-write kernel is launched immediately after the previous write kernel to check if there are any errors in memory and to set the memory to the complement. This process is repeated 1000 times for one pattern. The kernel is written so as to achieve the maximum bandwidth between the global memory and the GPU. Runtime: ~6 seconds.

Note

By default, Test7 and Test10 alternate for a period of 10 minutes. If any errors are detected, the diagnostic will fail.

Supported Parameters

| Parameter | Syntax | Default |
|---|---|---|
| test0 | boolean | false |
| test1 | boolean | false |
| test2 | boolean | false |
| test3 | boolean | false |
| test4 | boolean | false |
| test5 | boolean | false |
| test6 | boolean | false |
| test7 | boolean | true |
| test8 | boolean | false |
| test9 | boolean | false |
| test10 | boolean | true |
| test_duration | seconds | 600 |

Sample Commands

Run test7 and test10 for 10 minutes (this is the default):

$ dcgmi diag -r 4

Run each test serially for 1 hour then display results:

$ dcgmi diag -r 4 \
   -p memtest.test0=true\;memtest.test1=true\;memtest.test2=true\;memtest.test3=true\;memtest.test4=true\;memtest.test5=true\;memtest.test6=true\;memtest.test7=true\;memtest.test8=true\;memtest.test9=true\;memtest.test10=true\;memtest.test_duration=3600

Run test0 for one minute 10 times, displaying the results each minute:

$ dcgmi diag \
   --iterations 10 \
   -r 4 \
   -p memtest.test0=true\;memtest.test7=false\;memtest.test10=false\;memtest.test_duration=60

Pulse Test Diagnostic

Overview

The Pulse Test is part of the new level 4 tests. The pulse test is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.

Test Description

By default, the test runs kernels with high transiency in order to create spikes in the current delivered to the GPU. The default parameters have been verified, by oscilloscope measurements, to create worst-case current-spike scenarios.

The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPUs is synchronized to create extra stress on the power supply.


Supported Parameters

| Parameter | Description | Default |
|---|---|---|
| test_duration | Seconds to spend on an iteration. This is not the exact amount of time the test will take. | 60 |
| patterns | Specify a comma-separated list of pattern indices the pulse test should use. Valid indices depend on the type of SKU. Hopper: 0-22; Ampere/Volta/Ada: 0-20. | All |

Note

In some cases with DCGM 2.4 and DCGM 3.0, users may encounter the following issue with running the Pulse test:

| Pulse Test | Fail - All |
| Warning | GPU 0There was an internal error during the t |
| | est: 'The pulse test exited with non-zero sta |
| | tus 1', GPU 0There was an internal error duri |
| | ng the test: 'The pulse test reported the err |
| | or: Exception raised during execution: Faile |
| | d opening file ubergemm.log for writing: Perm |
| | ission denied terminate called after throwing |
| | an instance of 'boost::wrapexcept<boost::pro |
| | perty_tree::xml_parser::xml_parser_error>' |
| | what(): result.xml: cannot open file ' |

When running GPU diagnostics, by default, DCGM drops privileges and uses an unprivileged service account to run the diagnostics. If the service account does not have write access to the directory where diagnostics are run, then users may encounter this issue. To summarize, the issue happens when both these conditions are true:

  1. The nvidia-dcgm service is active and the nv-hostengine process is running (and no changes have been made to DCGM’s default install configurations)

  2. The user attempts to run dcgmi diag -r 4. In this case, dcgmi diag connects to the running nv-hostengine (which was started by default under /root), and thus the Pulse test is unable to create any logs.

This issue will be fixed in a future release of DCGM. In the meantime, users can do either of the following to work around the issue:

  1. Stop the nvidia-dcgm service before running the pulse_test

    $ sudo systemctl stop nvidia-dcgm
    

    Now run the pulse_test:

    $ dcgmi diag -r pulse_test
    

    Restart the nvidia-dcgm service once the diagnostics are completed:

    $ sudo systemctl restart nvidia-dcgm
    
  2. Edit the systemd unit service file to include a WorkingDirectory option, so that the service is started in a location writable by the nvidia-dcgm user (be sure that the directory used in the example below, /tmp/dcgm-temp, is created):

    [Service]
    
     ...
    
     WorkingDirectory=/tmp/dcgm-temp
     ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
    
     ...
    

    Reload the systemd configuration and start the nvidia-dcgm service:

    $ sudo systemctl daemon-reload
    
    $ sudo systemctl start nvidia-dcgm
    

Sample Commands

Run the entire diagnostic suite, including the pulse test:

$ dcgmi diag -r 4

Run just the pulse test:

$ dcgmi diag -r pulse_test

Run just the pulse test, but at a lower frequency:

$ dcgmi diag -r pulse_test -p pulse_test.freq0=3000

Run just the pulse test at a lower frequency and for a shorter time:

$ dcgmi diag -r pulse_test -p "pulse_test.freq0=5000;pulse_test.test_duration=180"

Failure Conditions

  • The pulse test will fail if the power supply unit cannot handle the spikes in the current.

  • It will also fail if unrecoverable memory errors, temperature violations, or XIDs occur during the test.

Extended Utility Diagnostics (EUD)

Starting with DCGM 3.1, the Extended Utility Diagnostics, or EUD, is available as a new plugin. Once installed, it is available as a separate suite of tests and is also included in levels 3 and 4 of DCGM’s diagnostics. EUD provides various tests that perform the following checks on the GPU subsystems:

  • Confirmation of the numerical processing engines in the GPU

  • Integrity of data transfers to and from the GPU

  • Coverage of the full onboard memory address space that is available to CUDA programs

Supported Products

EUD supports the following GPU products:

  • NVIDIA V100 PCIe

  • NVIDIA V100 SXM2 (PG503-0201 and PG503-0203)

  • NVIDIA V100 SXM3

  • NVIDIA A100-SXM-40GB

  • NVIDIA A100-SXM-80GB

  • NVIDIA A100-PCIe-80GB

  • NVIDIA A100 OAM (PG509-0200 and PG509-0210)

  • NVIDIA A800 SXM

  • NVIDIA A800-PCIe-80GB

  • NVIDIA H100-PCIe-80GB

  • NVIDIA H100-SXM-80GB (HGX H100)

  • NVIDIA H100-SXM-96GB

  • NVIDIA HGX H100 4-GPU 64GB

  • NVIDIA HGX H100 4-GPU 80GB

  • NVIDIA HGX H100 4-GPU 94GB

  • NVIDIA HGX H800 4-GPU 80GB

  • NVIDIA L40

  • NVIDIA L40S

  • NVIDIA L4

The EUD is only supported on the R525 and later driver branches. Support for other products and driver branches will be added in a future release.

Included Tests

The EUD supports six different test suites targeting different types of GPU functionality:

  • Compute : The compute test suite focuses primarily on tests which run Matrix Multiply instructions on the GPU using different numerical representations (integers, doubles, etc.). The tests are generally run in two different ways:

    • A static constant workload to generate consistent and stable power draw

    • A pulsing workload

    In addition to the Matrix Multiply tests there are also several miscellaneous tests which focus on exercising other functionality related to compute (e.g. instruction test, compute video test, etc.)

  • Graphics : The graphics test suite focuses on testing the 2D and 3D rendering engines of the GPU

  • Memory : The memory test suite validates the GPU memory interface. The tests in the memory suite validate that the GPU memory can function without any errors both in normal operation and under various types of stress.

  • High Speed Input/Output (HSIO) : The HSIO test suite validates NVLink and PCIe functionality, focusing primarily on data transfer testing

  • Miscellaneous : The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are board specific tests which validate non-GPU related items on the board like voltage regulator programming or board configuration.

  • Default : The default test suite runs when no other test suites are explicitly specified and will run one or more tests from each of the other test suites

Getting Started with EUD

Note

The following pre-requisites apply when using the EUD:

  • The EUD version must match the NVIDIA driver version installed on the system. When the NVIDIA driver is updated, the EUD must also be updated to the corresponding version.

  • Multi-Instance GPU (MIG) mode should be disabled prior to running the EUD (see the example after this list). To disable MIG mode, refer to the nvidia-smi man page:

    $ nvidia-smi mig --help
    
  • If the DCGM agent (nv-hostengine) is running, then stop the DCGM agent (nv-hostengine) or ensure that the service was started with privileges. This can be achieved by modifying the systemd service file (under /usr/lib/systemd/system/nvidia-dcgm.service) to not start nv-hostengine with the unprivileged nvidia-dcgm service account.

    $ sudo systemctl stop nvidia-dcgm
    
  • Any GPU telemetry (either via the NVML/DCGM APIs or with nvidia-smi dmon / dcgmi dmon) should not be used when running the EUD. The EUD interacts heavily with the driver, and contention will impact testing and may cause timeouts.
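
For the MIG prerequisite above, MIG mode can typically be disabled on all GPUs with the command below (a GPU reset or system reboot may be required for the change to take effect):

$ sudo nvidia-smi -mig 0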

Supported Deployment Models

The EUD is only supported in the following deployment models:

| Deployment Model | Description | Supported |
|---|---|---|
| Bare-metal | Running directly on the system with no abstraction layer (i.e., VM, containerization, etc.) | Yes |
| Passthrough virtualization (aka “full passthrough”) | Both the GPU and NVSwitch are “passed through” to a VM; the VM has exclusive access to the devices | Single-tenant VM: Yes; Multi-tenant VM: Yes, no NVLink; Execution from host: No |
| Shared NVSwitch | GPUs are passed through to the VM, but the NVSwitch is owned by a service VM | Execution from VM: Yes; Execution from service VM: No; Execution from host: No |

Installing the EUD packages

Install the NVIDIA EUD package using the appropriate package manager of the Linux distribution flavor.

In this release, the EUD binaries are available via a Linux archive. Follow these steps to get started:

  • Extract the archive under /usr (see the example after these steps)

  • Change ownership and group to root

    $ sudo chown -R root /usr/share/nvidia \
       && sudo chgrp -R root /usr/share/nvidia
    
  • Now proceed to run the EUD

The files for the EUD should be installed under /usr/share/nvidia/diagnostic/
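
For the extraction step above, assuming the archive is a gzip-compressed tarball (the archive name below is a placeholder; the actual name and compression format may differ), the command could look like:

$ sudo tar -xzf dcgm-eud-<version>.tar.gz -C /usr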

Running the EUD

On supported GPU products, by default, DCGM will run the EUD as part of run levels 3 and 4, with two separate EUD test profiles:

  1. Within run level 3 (dcgmi diag -r 3), the run time of the EUD test is less than 5 mins (it runs at least one test from each of the test suites)

  2. Within run level 4 (dcgmi diag -r 4), the run time of the EUD test is ~20 mins (all the test suites are run)

Note

The times provided above are the estimated runtimes of just the EUD test. The total runtime of -r 3 or -r 4 would be longer as they include other tests.

By default, the EUD will report an error for the first failing test and stop. See run_on_error for details.

The EUD may also be run separately from the DCGM run levels via dcgmi diag -r eud, which runs the same set of tests as level 3.

Customization options

The EUD supports optional command-line arguments that can be specified during the run.

For example, to run the memory and compute tests:

$ dcgmi diag -r eud -p "eud.passthrough_args='run_tests=compute,memory'"
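
Similarly, assuming eud.suite_level is passed through -p in the same way (this exact form is an assumption rather than something stated here), the full ~20-minute EUD profile could be requested with:

$ dcgmi diag -r eud -p eud.suite_level=4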

The -r eud option supports the following arguments:

| Option | Description |
|---|---|
| eud.tmp_dir | The directory where the EUD stdout/stderr and log files (dcgm_eud_stdout.txt, dcgm_eud_stderr.txt, dcgm_eud.log, dcgm_eud.mle) will be written. The default directory location is /tmp. |
| eud.suite_level=4 | Enables the full EUD test profile (of ~20 mins) and also enables customization of tests with the run_tests parameter. When this option is not specified, the default EUD test profile (~5 mins) is used. |
| eud.passthrough_args | Allows additional controls on the EUD diagnostic tests. See the table later in this document. |

The table below provides the additional control arguments supported for eud.passthrough_args:

Logging

By default, DCGM logs the runs of EUD under /tmp/dcgm where two files are generated:

  • dcgm_eud.log - This plain text file contains a stdout log of the EUD test run

  • dcgm_eud.mle - This binary file contains the results of the EUD tests

The MLE file can be decoded to a JSON format output by running the mla binary under /usr/share/nvidia/diagnostic. See the documentation titled “MLA JSON Report Decoding” in that directory for more information on the options and generated reports.

Automating Responses to DCGM Diagnostic Failures

Overview

Automating workflows based on DCGM diagnostics can enable sites to handle GPU errors more efficiently. Additional data for determining the severity of errors and potential next steps is available using either the API or by parsing the JSON returned on the CLI. Besides simply reporting human readable strings of which errors occurred during the diagnostic, each error also includes a specific ID, Severity, and Category that can be useful when deciding how to handle the failure.

The latest versions of these enums can be found in dcgm_errors.h.

Error Category Enum

| Enum | Value |
|---|---|
| DCGM_FR_EC_NONE | 0 |
| DCGM_FR_EC_PERF_THRESHOLD | 1 |
| DCGM_FR_EC_PERF_VIOLATION | 2 |
| DCGM_FR_EC_SOFTWARE_CONFIG | 3 |
| DCGM_FR_EC_SOFTWARE_LIBRARY | 4 |
| DCGM_FR_EC_SOFTWARE_XID | 5 |
| DCGM_FR_EC_SOFTWARE_CUDA | 6 |
| DCGM_FR_EC_SOFTWARE_EUD | 7 |
| DCGM_FR_EC_SOFTWARE_OTHER | 8 |
| DCGM_FR_EC_HARDWARE_THERMAL | 9 |
| DCGM_FR_EC_HARDWARE_MEMORY | 10 |
| DCGM_FR_EC_HARDWARE_NVLINK | 11 |
| DCGM_FR_EC_HARDWARE_NVSWITCH | 12 |
| DCGM_FR_EC_HARDWARE_PCIE | 13 |
| DCGM_FR_EC_HARDWARE_POWER | 14 |
| DCGM_FR_EC_HARDWARE_OTHER | 15 |
| DCGM_FR_EC_INTERNAL_OTHER | 16 |

Error Severity Enum

| Enum | Value |
|---|---|
| DCGM_ERROR_NONE | 0 |
| DCGM_ERROR_MONITOR | 1 |
| DCGM_ERROR_ISOLATE | 2 |
| DCGM_ERROR_UNKNOWN | 3 |
| DCGM_ERROR_TRIAGE | 4 |
| DCGM_ERROR_CONFIG | 5 |
| DCGM_ERROR_RESET | 6 |

Error Enum

| Error | Value | Severity | Category |
|---|---|---|---|
| DCGM_FR_OK | 0 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_UNKNOWN | 1 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_UNRECOGNIZED | 2 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_PCI_REPLAY_RATE | 3 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_VOLATILE_DBE_DETECTED | 4 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_VOLATILE_SBE_DETECTED | 5 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PENDING_PAGE_RETIREMENTS | 6 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_RETIRED_PAGES_LIMIT | 7 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_RETIRED_PAGES_DBE_LIMIT | 8 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CORRUPT_INFOROM | 9 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_CLOCK_THROTTLE_THERMAL | 10 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_POWER_UNREADABLE | 11 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_CLOCK_THROTTLE_POWER | 12 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_NVLINK_ERROR_THRESHOLD | 13 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVLINK_DOWN | 14 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVSWITCH_FATAL_ERROR | 15 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NVSWITCH_NON_FATAL_ERROR | 16 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NVSWITCH_DOWN | 17 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NO_ACCESS_TO_FILE | 18 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_NVML_API | 19 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_DEVICE_COUNT_MISMATCH | 20 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_BAD_PARAMETER | 21 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CANNOT_OPEN_LIB | 22 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_DENYLISTED_DRIVER | 23 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_NVML_LIB_BAD | 24 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_GRAPHICS_PROCESSES | 25 | DCGM_ERROR_RESET | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_HOSTENGINE_CONN | 26 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_QUERY | 27 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_BAD_CUDA_ENV | 28 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_PERSISTENCE_MODE | 29 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_LOW_BANDWIDTH | 30 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_HIGH_LATENCY | 31 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_CANNOT_GET_FIELD_TAG | 32 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_VIOLATION | 33 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_FIELD_THRESHOLD | 34 | DCGM_ERROR_MONITOR | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_FIELD_VIOLATION_DBL | 35 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_FIELD_THRESHOLD_DBL | 36 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_UNSUPPORTED_FIELD_TYPE | 37 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_THRESHOLD_TS | 38 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_FIELD_THRESHOLD_TS_DBL | 39 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_THERMAL_VIOLATIONS | 40 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_THERMAL_VIOLATIONS_TS | 41 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_TEMP_VIOLATION | 42 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_THROTTLING_VIOLATION | 43 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_INTERNAL | 44 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_PCIE_GENERATION | 45 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_PCIE_WIDTH | 46 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_ABORTED | 47 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_TEST_DISABLED | 48 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_CANNOT_GET_STAT | 49 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_STRESS_LEVEL | 50 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_CUDA_API | 51 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_FAULTY_MEMORY | 52 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CANNOT_SET_WATCHES | 53 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CUDA_UNBOUND | 54 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_ECC_DISABLED | 55 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_ALLOC | 56 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_CUDA_DBE | 57 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_MISMATCH | 58 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CUDA_DEVICE | 59 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_ECC_UNSUPPORTED | 60 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_ECC_PENDING | 61 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_BANDWIDTH | 62 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_TARGET_POWER | 63 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_API_FAIL | 64 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_API_FAIL_GPU | 65 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_CUDA_CONTEXT | 66 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_DCGM_API | 67 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CONCURRENT_GPUS | 68 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_TOO_MANY_ERRORS | 69 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_NVLINK_CRC_ERROR_THRESHOLD | 70 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVLINK_ERROR_CRITICAL | 71 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_ENFORCED_POWER_LIMIT | 72 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_MEMORY_ALLOC_HOST | 73 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_GPU_OP_MODE | 74 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_NO_MEMORY_CLOCKS | 75 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_NO_GRAPHICS_CLOCKS | 76 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_HAD_TO_RESTORE_STATE | 77 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_L1TAG_UNSUPPORTED | 78 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_L1TAG_MISCOMPARE | 79 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_ROW_REMAP_FAILURE | 80 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_UNCONTAINED_ERROR | 81 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_XID |
| DCGM_FR_EMPTY_GPU_LIST | 82 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_DBE_PENDING_PAGE_RETIREMENTS | 83 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_UNCORRECTABLE_ROW_REMAP | 84 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PENDING_ROW_REMAP | 85 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_BROKEN_P2P_MEMORY_DEVICE | 86 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_BROKEN_P2P_WRITER_DEVICE | 87 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_NVSWITCH_NVLINK_DOWN | 88 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_EUD_BINARY_PERMISSIONS | 89 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_NON_ROOT_USER | 90 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_SPAWN_FAILURE | 91 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_TIMEOUT | 92 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_ZOMBIE | 93 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_NON_ZERO_EXIT_CODE | 94 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_TEST_FAILED | 95 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_FILE_CREATE_PERMISSIONS | 96 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_PAUSE_RESUME_FAILED | 97 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_PCIE_H_REPLAY_VIOLATION | 98 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_GPU_EXPECTED_NVLINKS_UP | 99 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVSWITCH_EXPECTED_NVLINKS_UP | 100 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_XID_ERROR | 101 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_XID |
| DCGM_FR_SBE_VIOLATION | 102 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_DBE_VIOLATION | 103 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PCIE_REPLAY_VIOLATION | 104 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_SBE_THRESHOLD_VIOLATION | 105 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_DBE_THRESHOLD_VIOLATION | 106 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PCIE_REPLAY_THRESHOLD_VIOLATION | 107 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_CUDA_FM_NOT_INITIALIZED | 108 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_SXID_ERROR | 109 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_XID |

These relationships are codified in dcgm_errors.c.

In general, DCGM has high confidence that errors with the ISOLATE and RESET severities should be handled immediately. Other severities may require more site-specific analysis, a re-run of the diagnostic, or a scanning of DCGM and system logs to determine the best course of action. Gathering and recording the failure types and rates over time can give datacenters insight into the best way to automate handling of GPU diagnostic errors.
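
As a minimal sketch of such automation (assuming the JSON layout shown in the Logging section earlier; the exact field names that carry the error ID, severity, and category can vary by DCGM version, so inspect the output on your own system first), failing results can be pulled out of the JSON with a standard tool such as jq and then mapped against the enums above:

$ dcgmi diag -r 3 -j > diag.json
$ jq '.. | objects | select(.status? and .status != "Pass")' diag.json

A scheduler hook could then treat entries whose severity maps to DCGM_ERROR_ISOLATE (2) or DCGM_ERROR_RESET (6) as grounds for draining the node, while logging the rest for later triage.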