Extended Utility Diagnostics (EUD)

Starting with DCGM 3.1, the Extended Utility Diagnostics, or EUD, is available as a new plugin. Once installed, it’s available as a separate suite of tests and is also included in levels 3 and 4 of DCGM’s diagnostics. EUD provides various tests that perform the following checks on the GPU subsystems:

  • Confirmation of the numerical processing engines in the GPU

  • Integrity of data transfers to and from the GPU

  • Coverage of the full onboard memory address space that is available to CUDA programs

Supported Products

EUD supports the following GPU products:

  • NVIDIA V100 PCIe

  • NVIDIA V100 SXM2 (PG503-0201 and PG503-0203)

  • NVIDIA V100 SXM3

  • NVIDIA A100-SXM-40GB

  • NVIDIA A100-SXM-80GB

  • NVIDIA A100-PCIe-80GB

  • NVIDIA A100 OAM (PG509-0200 and PG509-0210)

  • NVIDIA A800 SXM

  • NVIDIA A800-PCIe-80GB

  • NVIDIA H100-PCIe-80GB

  • NVIDIA H100-SXM-80GB (HGX H100)

  • NVIDIA H100-SXM-96GB

  • NVIDIA HGX H100 4-GPU 64GB

  • NVIDIA HGX H100 4-GPU 80GB

  • NVIDIA HGX H100 4-GPU 94GB

  • NVIDIA HGX H800 4-GPU 80GB

  • NVIDIA L40

  • NVIDIA L40S

  • NVIDIA L4

The EUD is only supported on the R525 and later driver branches. Support for other products and driver branches will be added in a future release.

Included Tests

The EUD supports six different test suites targeting different types of GPU functionality:

  • Compute : The compute test suite focuses primarily on tests that run Matrix Multiply instructions on the GPU using different numerical representations (integers, doubles, etc.). The tests are generally run in two different ways:

    • A static constant workload to generate consistent and stable power draw

    • A pulsing workload

    In addition to the Matrix Multiply tests, there are several miscellaneous tests that exercise other compute-related functionality (e.g. instruction test, compute video test, etc.).

  • Graphics : The graphics test suite focuses on testing the 2D and 3D rendering engines of the GPU

  • Memory : The memory test suite validates the GPU memory interface. The tests in the memory suite validate that the GPU memory can function without any errors both in normal operation and under various types of stress.

  • High Speed Input/Output (HSIO) : The HSIO test suite validates NVLink and PCIe functionality, focusing primarily on data transfer testing

  • Miscellaneous : The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are board specific tests which validate non-GPU related items on the board like voltage regulator programming or board configuration.

  • Default : The default test suite runs when no other test suites are explicitly specified and will run one or more tests from each of the other test suites

Getting Started with EUD

Note

The following prerequisites apply when using the EUD:

  • The EUD version must match the NVIDIA driver version installed on the system. When the NVIDIA driver is updated, the EUD must also be updated to the corresponding version.

  • Multi-Instance GPU (MIG) mode should be disabled prior to running the EUD. To disable MIG mode, refer to the nvidia-smi man page:

    $ nvidia-smi mig --help
    
  • GPU telemetry (either via NVML/DCGM APIs or with nvidia-smi dmon / dcgmi dmon) should not be used while the EUD is running. The EUD interacts heavily with the driver, and contention will impact testing and may cause timeouts.
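
The MIG prerequisite can be checked before a run. The following is a minimal sketch, assuming that nvidia-smi's mig.mode.current query field prints one value per GPU (Enabled, Disabled, or [N/A]); the helper name mig_enabled_count is hypothetical and is not part of DCGM or the EUD.

```shell
#!/bin/sh
# Count the GPUs that report MIG mode as enabled.
# Reads one MIG mode value per line on stdin, as printed by:
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
# (hypothetical helper, not part of DCGM or the EUD)
mig_enabled_count() {
    # grep -c prints the match count; '|| true' hides its non-zero
    # exit status when the count is 0.
    grep -c '^Enabled' || true
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    count=$(nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | mig_enabled_count)
    if [ "$count" -gt 0 ]; then
        echo "MIG mode is enabled on $count GPU(s); disable it before running the EUD" >&2
    else
        echo "MIG mode is not enabled on any GPU"
    fi
fi
```

If the check reports enabled GPUs, MIG mode is typically disabled with nvidia-smi -mig 0 (run as root); the nvidia-smi help output referenced above remains the authoritative source for the exact procedure.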

Supported Deployment Models

The EUD is only supported in the following deployment models:

Deployment Model

Description

Supported

Bare-metal

Running directly on the system with no abstraction layer (e.g. VM, containerization, etc.)

Yes

Passthrough virtualization (aka “full passthrough”)

Both the GPU and the NVSwitch are “passed through” to a VM. The VM has exclusive access to the devices

  • Single tenant VM : Yes

  • Multi tenant VM : Yes, without NVLink

  • Execution from host : No

Shared NVSwitch

GPUs are passed through to the VM, but the NVSwitch is owned by a service VM

  • Execution from VM : Yes

  • Execution from service VM: No

  • Execution from host : No

Installing the EUD packages

Install the NVIDIA EUD package using the appropriate package manager of the Linux distribution flavor.

In this release, the EUD binaries are available via a Linux archive. Follow these steps to get started:

  • Extract the archive under /usr

  • Change ownership and group to root

    $ sudo chown -R root /usr/share/nvidia \
       && sudo chgrp -R root /usr/share/nvidia
    
  • Now proceed to run the EUD

The files for the EUD should be installed under /usr/share/nvidia/diagnostic/
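
After the steps above, the result can be sanity-checked. The following is a minimal sketch, assuming a Linux system with GNU stat; the helper name owned_by_root is hypothetical. It only confirms that the install directory exists and is owned by user and group root, not that the EUD version matches the driver.

```shell
#!/bin/sh
# Return success when the given path exists and is owned by
# uid 0 / gid 0 (root). Uses GNU stat's -c option, which is an
# assumption about the target Linux system.
# (hypothetical helper, not part of DCGM or the EUD)
owned_by_root() {
    [ -e "$1" ] && [ "$(stat -c '%u:%g' "$1")" = "0:0" ]
}

EUD_DIR=/usr/share/nvidia/diagnostic   # install location given above

if owned_by_root "$EUD_DIR"; then
    echo "EUD directory present and owned by root: $EUD_DIR"
else
    echo "EUD directory missing or not owned by root: $EUD_DIR" >&2
fi
```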

Running the EUD

On supported GPU products, DCGM runs the EUD by default as part of run levels 3 and 4, using two separate EUD test profiles:

  1. Within run level 3 (dcgmi diag -r 3), the EUD test takes less than 5 minutes and runs at least one test from each of the test suites.

  2. Within run level 4 (dcgmi diag -r 4), the EUD test takes about 20 minutes and runs all the test suites.

Note

The times provided above are the estimated runtimes of just the EUD test. The total runtime of -r 3 or -r 4 would be longer as they include other tests.

By default, the EUD reports an error for the first failing test and stops. See run_on_error for details.

The EUD may also be run separately from the DCGM run levels via dcgmi diag -r eud, which runs the same set of tests as level 3.
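
The three ways of invoking the EUD described above can be summarized as follows. This is a small sketch that only prints the command lines and does not execute them; the helper name eud_command and the profile labels (level3, level4, eud-only) are hypothetical.

```shell
#!/bin/sh
# Map a profile label to the dcgmi command line described above.
# (hypothetical helper; it prints the command and runs nothing)
eud_command() {
    case "$1" in
        level3)   echo "dcgmi diag -r 3" ;;    # EUD portion: under 5 minutes
        level4)   echo "dcgmi diag -r 4" ;;    # EUD portion: about 20 minutes
        eud-only) echo "dcgmi diag -r eud" ;;  # EUD alone, same tests as level 3
        *) echo "unknown profile: $1" >&2; return 1 ;;
    esac
}

eud_command eud-only
```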

Customization options

The EUD supports optional command-line arguments that can be specified during the run.

For example, to run the memory and compute tests:

$ dcgmi diag -r eud -p "eud.passthrough_args='run_tests=compute,memory'"

The -r eud option supports the following arguments:

Option

Description

eud.suite_level=4

Enables the full EUD test profile (~20 minutes) and also enables customization of tests with the run_tests parameter. When this option is not specified, the default EUD test profile (~5 minutes) is used.

eud.passthrough_args

Allows additional controls on the EUD diagnostic tests. See the table later in this document.
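
The two options can be combined in a single -p value. The following is a minimal sketch that prints the resulting command rather than running it. It assumes that dcgmi accepts multiple -p parameters joined with a semicolon, which should be confirmed against dcgmi diag --help; the helper name eud_params is hypothetical.

```shell
#!/bin/sh
# Build the value for dcgmi's -p option from EUD parameters.
# Assumption: multiple -p parameters are joined with ';'.
# (hypothetical helper, not part of DCGM or the EUD)
eud_params() {
    suite_level=$1   # e.g. 4 for the full profile; empty to omit
    run_tests=$2     # e.g. "compute,memory"; empty to omit
    params=""
    if [ -n "$suite_level" ]; then
        params="eud.suite_level=$suite_level"
    fi
    if [ -n "$run_tests" ]; then
        if [ -n "$params" ]; then
            params="$params;"
        fi
        params="${params}eud.passthrough_args='run_tests=$run_tests'"
    fi
    printf '%s\n' "$params"
}

# Print (rather than execute) the resulting command line.
printf 'dcgmi diag -r eud -p "%s"\n' "$(eud_params 4 compute,memory)"
```

Printing the command first makes it easy to review the quoting, since the passthrough arguments are wrapped in single quotes inside the double-quoted -p value.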

The table below provides the additional control arguments supported for eud.passthrough_args:

Option

Description

device=<n>

Test the nth device

logfilename=<filename-path>[_%r_%s]

Specify a unique log file name other than the default. Example: logfilename=/var/log/mylogfile.log

%r evaluates to the test status and is one of PASS or FAIL. %s evaluates to the serial number of one of the devices under test. Example: logfilename=mylogfile_%r_%s.log

By default, the logs are created under /usr/share/nvidia/diagnostic with the prefix fielddiag.

pciid=<w:x:y.z>

Specify a single GPU to be tested, where w, x, y and z are hexadecimal numbers.

  • w: PCI domain (required)

  • x: PCI bus

  • y: device

  • z: function

Example: pciid=0:2:0.0

pci_devices=<w:x:y.z>,…

Specify a subset of GPUs to be tested, where w, x, y and z are hexadecimal numbers. Each GPU address is comma-separated. Example: pci_devices=0002:03:00.0,0003:04:00.0,0004:05:00.0

run_tests=<test>,…

Specify a subset of tests to be run. Each test is comma-separated and is one of the following:

  • misc : Miscellaneous board level tests

  • memory : GPU memory validation

  • graphics : 3D engine validation

  • compute : Compute engine validation

  • hsio : High Speed I/O (e.g. NVLink) validation

  • all : Runs all tests (equivalent of run_tests=misc,memory,graphics,compute,hsio)

When not specified, at least one test from each of the individual test suites is run.

Example: run_tests=misc,compute

topology_file=<file>

Overrides the default NVSwitch topology file to use. When this argument is not present, a system topology file (installed by the driver into /usr/share/nvidia/nvswitch/) will be used, if present.

Example: topology_file=<path-to-topology-file>

run_on_error

When this argument is present, the EUD keeps running even if an error occurs in any of the requested tests, and reports errors for all failing tests only after completing all the requested tests.

If the option is not specified, the EUD reports an error for the first failing test and stops.

skip_nvlink

When this option is present, all testing of the NVLink interface is skipped.
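
Because a mistyped passthrough argument is only discovered once the EUD starts, it can help to validate values before building the command. The following is a minimal sketch; the helper names valid_run_tests and valid_pci_address are hypothetical, and the checks only mirror the value formats listed in the table above. The command is printed, not executed.

```shell
#!/bin/sh
# Succeed only if every comma-separated name is one of the suites
# listed for run_tests. (hypothetical helper)
valid_run_tests() {
    [ -n "$1" ] || return 1
    rest=$1,
    while [ -n "$rest" ]; do
        t=${rest%%,*}     # first name in the remaining list
        rest=${rest#*,}   # drop that name and its comma
        case "$t" in
            misc|memory|graphics|compute|hsio|all) ;;
            *) echo "unknown test suite: '$t'" >&2; return 1 ;;
        esac
    done
    return 0
}

# Succeed if the argument looks like a hexadecimal w:x:y.z address,
# the form used by the pci_devices example. (hypothetical helper)
valid_pci_address() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9A-Fa-f]+:[0-9A-Fa-f]+:[0-9A-Fa-f]+\.[0-9A-Fa-f]+$'
}

tests="misc,compute"
if valid_run_tests "$tests"; then
    printf 'dcgmi diag -r eud -p "%s"\n' "eud.passthrough_args='run_tests=$tests'"
fi
```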

Logging

By default, DCGM logs the runs of EUD under /tmp/dcgm where two files are generated:

  • dcgm_eud.log - This plain text file contains a stdout log of the EUD test run

  • dcgm_eud.mle - This binary file contains the results of the EUD tests

The MLE file can be decoded to a JSON format output by running the mla binary under /usr/share/nvidia/diagnostic. See the documentation titled “MLA JSON Report Decoding” in that directory for more information on the options and generated reports.
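
After a run, the presence of both output files can be confirmed before decoding. The following is a minimal sketch, assuming the default /tmp/dcgm location given above; the helper name eud_logs_present is hypothetical. It does not invoke the mla binary, whose options are described in the documentation shipped alongside it.

```shell
#!/bin/sh
# Report which of the two expected EUD output files exist in a log
# directory and return non-zero if either is missing.
# (hypothetical helper, not part of DCGM or the EUD)
eud_logs_present() {
    dir=${1:-/tmp/dcgm}   # default location given above
    rc=0
    for f in dcgm_eud.log dcgm_eud.mle; do
        if [ -f "$dir/$f" ]; then
            echo "found: $dir/$f"
        else
            echo "missing: $dir/$f"
            rc=1
        fi
    done
    return "$rc"
}

eud_logs_present /tmp/dcgm || echo "run the EUD first, or check the log directory" >&2
```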