Extended Utility Diagnostics (EUD)

Starting with DCGM 3.1, the Extended Utility Diagnostics, or EUD, is available as a new plugin. Once installed, it’s available as a separate suite of tests and is also included in levels 3 and 4 of DCGM’s diagnostics. EUD provides various tests that perform the following checks on the GPU subsystems:

Confirmation of the numerical processing engines in the GPU
Integrity of data transfers to and from the GPU
Coverage of the full onboard memory address space that is available to CUDA programs

Supported Products

EUD supports the following GPU products:

NVIDIA V100 PCIe
NVIDIA V100 SXM2 (PG503-0201 and PG503-0203)
NVIDIA V100 SXM3
NVIDIA A100-SXM-40GB
NVIDIA A100-SXM-80GB
NVIDIA A100-PCIe-80GB
NVIDIA A100 OAM (PG509-0200 and PG509-0210)
NVIDIA A800 SXM
NVIDIA A800-PCIe-80GB
NVIDIA H100-PCIe-80GB
NVIDIA H100-SXM-80GB (HGX H100)
NVIDIA H100-SXM-96GB
NVIDIA HGX H100 4-GPU 64GB
NVIDIA HGX H100 4-GPU 80GB
NVIDIA HGX H100 4-GPU 94GB
NVIDIA HGX H800 4-GPU 80GB
NVIDIA L40
NVIDIA L40S
NVIDIA L4

The EUD is only supported on the R525 and later driver branches. Support for other products and driver branches will be added in a future release.

Included Tests

The EUD supports six different test suites targeting different types of GPU functionality:

Compute : The compute test suite focuses primarily on tests that run Matrix Multiply instructions on the GPU using different numerical representations (integers, doubles, etc.). The tests are generally run in two different ways:
- A static constant workload to generate consistent and stable power draw
- A pulsing workload
In addition to the Matrix Multiply tests, there are also several miscellaneous tests that exercise other compute-related functionality (e.g. instruction test, compute video test).
Graphics : The graphics test suite focuses on testing the 2D and 3D rendering engines of the GPU.
Memory : The memory test suite validates the GPU memory interface. Its tests check that the GPU memory functions without errors both in normal operation and under various types of stress.
High Speed Input/Output (HSIO) : The HSIO test suite validates NVLink and PCIe functionality, focusing primarily on data transfer testing.
Miscellaneous : The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are board-specific tests that validate non-GPU items on the board, such as voltage regulator programming or board configuration.
Default : The default test suite runs when no other test suites are explicitly specified, and it runs one or more tests from each of the other test suites.

Getting Started with EUD

Note

The following prerequisites apply when using the EUD:

The EUD version must match the NVIDIA driver version installed on the system. When the NVIDIA driver is updated, the EUD must also be updated to the corresponding version.
Multi-Instance GPU (MIG) mode should be disabled prior to running the EUD. To disable MIG mode, refer to the nvidia-smi man page:
```
$ nvidia-smi mig --help
```
GPU telemetry (whether via the NVML/DCGM APIs or with nvidia-smi dmon / dcgmi dmon) should not be collected while the EUD is running. The EUD interacts heavily with the driver, and contention will impact testing and may cause timeouts.
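
The MIG prerequisite can be checked before a run. The following is a minimal sketch, not part of the EUD itself: the helper name `count_mig_enabled` is hypothetical, and it assumes that `nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader` prints one mode per GPU (for example `Enabled` or `Disabled`).

```shell
# Count GPUs whose MIG mode is reported as "Enabled".
# Reads one mode per line on stdin, in the form assumed to be printed by:
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
count_mig_enabled() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c '^Enabled' || true
}

# Typical use on a system with the NVIDIA driver installed (not executed here):
#   n=$(nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | count_mig_enabled)
#   [ "$n" -eq 0 ] || echo "MIG is enabled on $n GPU(s); disable it before running the EUD"
```

Keeping the parsing in a function that reads stdin means it can be exercised with sample text on a machine that has no GPU.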

Supported Deployment Models

The EUD is only supported in the following deployment models:

Deployment Model	Description	Supported
Bare-metal	Running directly on the system with no abstraction layer (e.g. VM or containerization)	Yes
Passthrough virtualization (aka “full passthrough”)	Both the GPUs and the NVSwitches are “passed through” to a VM. The VM has exclusive access to the devices	Single-tenant VM: Yes; multi-tenant VM: Yes, without NVLink; execution from host: No
Shared NVSwitch	GPUs are passed through to the VM, but the NVSwitches are owned by a service VM	Execution from VM: Yes; execution from service VM: No; execution from host: No

Installing the EUD packages

Install the NVIDIA EUD package using the package manager for your Linux distribution.

Linux archive:
In this release, the EUD binaries are available via a Linux archive. Follow these steps to get started:
Extract the archive under /usr
Change ownership and group to root
$ sudo chown -R root /usr/share/nvidia \
   && sudo chgrp -R root /usr/share/nvidia
Now proceed to run the EUD

dpkg and apt (Ubuntu packages):
Install the local repo package
$ sudo dpkg -i nvidia-diagnostic-local-repo-ubuntu2204-525.125.06-mode1_1.0-1_amd64.deb
Copy the keyring file to the correct location. The exact copy command is printed in the output of the dpkg command.
$ sudo cp /var/nvidia-diagnostic-local-repo-ubuntu2204-525.125.06-mode1/nvidia-diagnostic-local-D95A57C6-keyring.gpg /usr/share/keyrings
Install the diagnostic. The version of the diagnostic will match the major version specified in the local repo file.
$ sudo apt update
$ sudo apt install nvidia-diagnostic-525

rpm and yum:
Install the local repo file and then install the diagnostic. The version of the diagnostic will match the major version specified in the local repo rpm file.
$ sudo rpm -i nvidia-diagnostic-local-repo-rhel8-525.125.06-mode1-1.0-1.x86_64.rpm
$ sudo yum install nvidia-diagnostic-525

rpm and dnf:
Install the local repo file and then install the diagnostic. The version of the diagnostic will match the major version specified in the local repo rpm file.
$ sudo rpm -i nvidia-diagnostic-local-repo-rhel8-525.125.06-mode1-1.0-1.x86_64.rpm
$ sudo dnf install nvidia-diagnostic-525

rpm and zypper:
Install the local repo file and then install the diagnostic. The version of the diagnostic will match the major version specified in the local repo rpm file.
$ sudo rpm -i nvidia-diagnostic-local-repo-rhel8-525.125.06-mode1-1.0-1.x86_64.rpm
$ sudo zypper install nvidia-diagnostic-525

The files for the EUD should be installed under /usr/share/nvidia/diagnostic/
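
Because the EUD version must match the installed driver, it can help to derive the package name from the driver version instead of typing it by hand. The sketch below uses a hypothetical helper, `eud_package_for_driver`, and assumes that packages follow the `nvidia-diagnostic-<major>` naming seen in the commands above; that pattern is not confirmed for other driver branches.

```shell
# Derive the EUD package name from a driver version string such as "525.125.06".
# Assumption: packages are named nvidia-diagnostic-<major>, as in the install commands.
eud_package_for_driver() {
    driver_version=$1
    major=${driver_version%%.*}   # keep everything before the first dot
    printf 'nvidia-diagnostic-%s\n' "$major"
}

# Prints the package name for the driver version used in the examples above.
eud_package_for_driver 525.125.06

# On a system with the driver installed, the version could be read with (not executed here):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
```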

Running the EUD

On supported GPU products, DCGM by default runs the EUD as part of run levels 3 and 4, using a separate EUD test profile for each:

Within run level 3 (dcgmi diag -r 3), the EUD test runs in less than 5 minutes (at least one test from each of the test suites is run)
Within run level 4 (dcgmi diag -r 4), the EUD test runs in about 20 minutes (all the test suites are run)

Note

The times provided above are the estimated runtimes of just the EUD test. The total runtime of -r 3 or -r 4 would be longer as they include other tests.

By default, the EUD reports an error for the first failing test and then stops. See run_on_error for details.

The EUD may also be run separately from the DCGM run levels via dcgmi diag -r eud, which runs the same set of tests as level 3.
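
The three ways of invoking the EUD described in this section can be summarized in a small wrapper. This is an illustrative sketch only: the function name `eud_diag_command` and the profile labels `short`, `full`, and `standalone` are hypothetical, and the function prints the command instead of executing it.

```shell
# Print (do not execute) the dcgmi command for a given EUD profile.
# The profile labels are hypothetical names used only in this sketch.
eud_diag_command() {
    case "$1" in
        short)      echo "dcgmi diag -r 3" ;;    # EUD portion takes under 5 minutes
        full)       echo "dcgmi diag -r 4" ;;    # EUD portion takes about 20 minutes
        standalone) echo "dcgmi diag -r eud" ;;  # EUD only, same tests as level 3
        *)          echo "unknown profile: $1" >&2; return 1 ;;
    esac
}

eud_diag_command standalone
```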

Customization options

The EUD supports optional command-line arguments that can be specified during the run.

For example, to run the memory and compute tests:

$ dcgmi diag -r eud -p "eud.passthrough_args='run_tests=compute,memory'"

The -r eud option supports the following arguments:

Option	Description
`eud.suite_level=4`	Enables the full EUD test profile (about 20 minutes) and also enables customization of tests with the `run_tests` parameter. When this option is not specified, the default EUD test profile (about 5 minutes) is used.
`eud.passthrough_args`	Allows additional control over the EUD diagnostic tests. The supported arguments are listed in the next table.

The table below provides the additional control arguments supported for eud.passthrough_args:

Option	Description
`device=<n>`	Test the nth device
`logfilename=<filename-path>[_%r_%s]`	Specify a unique log file name other than the default. Example: `logfilename=/var/log/mylogfile.log`. `%r` evaluates to the test status (PASS or FAIL), and `%s` evaluates to the serial number of one of the devices under test. Example: `logfilename=mylogfile_%r_%s.log`. By default, the logs are created under `/usr/share/nvidia/diagnostic` with the prefix `fielddiag`.
`pciid=<w:x:y.z>`	Specify a single GPU to be tested, where w, x, y and z are hexadecimal numbers: w is the PCI domain (required), x is the PCI bus, y is the device, and z is the function. Example: `pciid=0:2:0.0`
`pci_devices=<w:x:y.z>,…`	Specify a subset of GPUs to be tested, where w, x, y and z are hexadecimal numbers. Each GPU address is comma-separated. Example: `pci_devices=0002:03:00.0,0003:04:00.0,0004:05:00.0`
`run_tests=<test>,…`	Specify a subset of tests to be run. Tests are comma-separated, and each is one of the following: `misc` (miscellaneous board-level tests), `memory` (GPU memory validation), `graphics` (3D engine validation), `compute` (compute engine validation), `hsio` (High Speed I/O validation, e.g. NVLink), or `all` (runs all tests; equivalent to `run_tests=misc,memory,graphics,compute,hsio`). When not specified, at least one test from each of the individual test suites is run. Example: `run_tests=misc,compute`
`topology_file=<file>`	Overrides the default NVSwitch topology file. When this argument is not present, a system topology file (installed by the driver into `/usr/share/nvidia/nvswitch/`) is used, if present. Example: `topology_file=/path/to/topology_file`
`run_on_error`	When this argument is present, the EUD keeps running even if an error occurs in any of the requested tests, and reports the errors for all failing tests only after completing all the requested tests. If the option is not specified, the EUD reports an error for the first failing test and stops.
`skip_nvlink`	When this option is present, all testing of the NVLink interface is skipped.
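
A mistyped suite name in `run_tests` is easy to miss, so a script can validate the list before building the command. The following sketch uses a hypothetical helper, `eud_run_tests_param`, which accepts only the suite names listed in the table above and produces a parameter in the same form as the earlier `run_tests=compute,memory` example. It prints the command instead of running it.

```shell
# Build the -p parameter for a run_tests selection, rejecting unknown suite names.
# Input: a comma-separated list such as "compute,memory".
eud_run_tests_param() {
    list=$1
    [ -n "$list" ] || { echo "no test suites given" >&2; return 1; }
    old_ifs=$IFS
    IFS=,                       # split the list on commas
    for suite in $list; do
        case "$suite" in
            misc|memory|graphics|compute|hsio|all) ;;
            *) IFS=$old_ifs; echo "unknown test suite: $suite" >&2; return 1 ;;
        esac
    done
    IFS=$old_ifs
    printf "eud.passthrough_args='run_tests=%s'\n" "$list"
}

# Print the full command rather than executing it.
echo "dcgmi diag -r eud -p \"$(eud_run_tests_param compute,memory)\""
```

The sketch deliberately handles only `run_tests`; how several passthrough arguments are combined in one string is not covered here.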

Logging

By default, DCGM logs the runs of EUD under /tmp/dcgm where two files are generated:

dcgm_eud.log - This plain text file contains a stdout log of the EUD test run
dcgm_eud.mle - This binary file contains the results of the EUD tests

The MLE file can be decoded to a JSON format output by running the mla binary under /usr/share/nvidia/diagnostic. See the documentation titled “MLA JSON Report Decoding” in that directory for more information on the options and generated reports.
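
After a run, a quick check that both output files were produced can save a failed decoding step later. This is a minimal sketch with a hypothetical helper name, `eud_log_status`; it only tests for the two file names given above and does not invoke the mla binary, whose options are described in its own documentation.

```shell
# Report which of the expected EUD output files exist in a log directory.
# The directory defaults to /tmp/dcgm, the default location named above.
eud_log_status() {
    dir=${1:-/tmp/dcgm}
    for name in dcgm_eud.log dcgm_eud.mle; do
        if [ -f "$dir/$name" ]; then
            echo "$name: present"
        else
            echo "$name: missing"
        fi
    done
}

eud_log_status
```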