Extended Utility Diagnostics (EUD)

Starting with DCGM 3.1, the Extended Utility Diagnostics, or EUD, is available as a new plugin. Once installed, it’s available as a separate suite of tests and is also included in levels 3 and 4 of DCGM’s diagnostics. EUD provides various tests that perform the following checks on the GPU subsystems:

  • Confirmation of the numerical processing engines in the GPU

  • Integrity of data transfers to and from the GPU

  • Coverage of the full onboard memory address space that is available to CUDA programs

Supported Products

EUD supports the following GPU products:

  • NVIDIA V100 PCIe

  • NVIDIA V100 SXM2 (PG503-0201 and PG503-0203)

  • NVIDIA V100 SXM3

  • NVIDIA A100-SXM-40GB

  • NVIDIA A100-SXM-80GB

  • NVIDIA A100-PCIe-80GB

  • NVIDIA A100 OAM (PG509-0200 and PG509-0210)

  • NVIDIA A800 SXM

  • NVIDIA A800-PCIe-80GB

  • NVIDIA H100-PCIe-80GB

  • NVIDIA H100-SXM-80GB (HGX H100)

  • NVIDIA H100-SXM-96GB

  • NVIDIA HGX H100 4-GPU 64GB

  • NVIDIA HGX H100 4-GPU 80GB

  • NVIDIA HGX H100 4-GPU 94GB

  • NVIDIA HGX H800 4-GPU 80GB

  • NVIDIA L40

  • NVIDIA L40S

  • NVIDIA L4

The EUD is only supported on the R525 and later driver branches. Support for other products and driver branches will be added in a future release.
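
The driver-branch requirement can be checked before installing the EUD. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper name driver_branch_ok is hypothetical and not part of DCGM or the EUD.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): succeeds when the given driver
# version string belongs to the R525 branch or a later one.
driver_branch_ok() {
    local version="$1"
    local major="${version%%.*}"      # "525.60.13" -> "525"
    # Reject empty or non-numeric input.
    case "$major" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$major" -ge 525 ]
}

# Query the installed driver only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if driver_branch_ok "$installed"; then
        echo "Driver $installed: R525 or later"
    else
        echo "Driver $installed: older than R525, EUD not supported"
    fi
fi
```

This only checks the branch number; it does not confirm that the GPU product itself is on the supported list above.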

Included Tests

The EUD supports six different test suites targeting different types of GPU functionality:

  • Compute : The compute test suite focuses primarily on tests that run Matrix Multiply instructions on the GPU using different numerical representations (integers, doubles, etc.). The tests are generally run in two different ways:

    • A static constant workload to generate consistent and stable power draw

    • A pulsing workload

    In addition to the Matrix Multiply tests, there are also several miscellaneous tests that exercise other compute-related functionality (for example, the instruction test and the compute video test).

  • Graphics : The graphics test suite focuses on testing the 2D and 3D rendering engines of the GPU

  • Memory : The memory test suite validates the GPU memory interface. The tests in the memory suite validate that the GPU memory can function without any errors both in normal operation and under various types of stress.

  • High Speed Input/Output (HSIO) : The HSIO test suite validates NVLink and PCIe functionality, focusing primarily on data transfer testing.

  • Miscellaneous : The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are board specific tests which validate non-GPU related items on the board like voltage regulator programming or board configuration.

  • Default : The default test suite runs when no other test suites are explicitly specified and will run one or more tests from each of the other test suites

Getting Started with EUD

Note

The following pre-requisites apply when using the EUD:

  • The EUD version must match the NVIDIA driver version installed on the system. When the NVIDIA driver is updated, the EUD must also be updated to the corresponding version.

  • Multi-Instance GPU (MIG) mode should be disabled prior to running the EUD. For the commands that disable MIG mode, refer to the nvidia-smi man page or its built-in help:

    $ nvidia-smi mig --help
    
  • If the DCGM agent (nv-hostengine) is running, either stop it or ensure that the service was started with privileges. The latter can be achieved by modifying the systemd service file (/usr/lib/systemd/system/nvidia-dcgm.service) so that nv-hostengine is not started with the unprivileged nvidia-dcgm service account. To stop the agent:

    $ sudo systemctl stop nvidia-dcgm
    
  • GPU telemetry (either via the NVML/DCGM APIs or with nvidia-smi dmon / dcgmi dmon) should not be used while the EUD is running. The EUD interacts heavily with the driver, and contention will impact testing and may cause timeouts.
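
Two of the prerequisites above can be checked from a shell before starting a run. The following is a rough sketch, assuming nvidia-smi and systemctl are available; the helper name any_mig_enabled is hypothetical, and the check does not cover the EUD/driver version match or running telemetry tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one MIG mode per line on stdin and succeeds
# if any GPU reports "Enabled".
any_mig_enabled() {
    grep -q '^Enabled'
}

# MIG check: nvidia-smi prints Enabled, Disabled, or [N/A] per GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | any_mig_enabled; then
        echo "MIG mode is enabled on at least one GPU; disable it before running the EUD"
    fi
fi

# DCGM agent check: warn if the nvidia-dcgm service is currently active.
if command -v systemctl >/dev/null 2>&1 && systemctl is-active --quiet nvidia-dcgm; then
    echo "nvidia-dcgm is running; stop it or make sure it was started with privileges"
fi
```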

Supported Deployment Models

The EUD is only supported in the following deployment models:

| Deployment Model | Description | Supported |
| --- | --- | --- |
| Bare-metal | Running directly on the system with no abstraction layer (such as a VM or container) | Yes |
| Passthrough virtualization (aka “full passthrough”) | Both the GPU and NVSwitch are “passed through” to a VM, which has exclusive access to the devices | Single-tenant VM: Yes. Multi-tenant VM: Yes, without NVLink. Execution from host: No |
| Shared NVSwitch | GPUs are passed through to the VM, but the NVSwitch is owned by a service VM | Execution from VM: Yes. Execution from service VM: No. Execution from host: No |

Installing the EUD packages

Install the NVIDIA EUD package using the appropriate package manager of the Linux distribution flavor.

In this release, the EUD binaries are available via a Linux archive. Follow these steps to get started:

  • Extract the archive under /usr

  • Change ownership and group to root

    $ sudo chown -R root /usr/share/nvidia \
       && sudo chgrp -R root /usr/share/nvidia
    
  • Now proceed to run the EUD

The files for the EUD should be installed under /usr/share/nvidia/diagnostic/
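
After extraction, the install location and its ownership can be inspected as follows. This is a small sketch, assuming a Linux system with GNU stat; the helper name eud_installed is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the directory exists and is not empty.
eud_installed() {
    local dir="${1:-/usr/share/nvidia/diagnostic}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if eud_installed; then
    # GNU stat: print owner and group, which should both be root.
    stat -c 'owner=%U group=%G' /usr/share/nvidia/diagnostic
else
    echo "EUD files not found under /usr/share/nvidia/diagnostic"
fi
```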

Running the EUD

On supported GPU products, DCGM by default runs the EUD as part of run levels 3 and 4, using two separate EUD test profiles:

  1. Within run level 3 (dcgmi diag -r 3), the run time of the EUD test is less than 5 mins (runs at least one test from each of the subtest suites)

  2. Within run level 4 (dcgmi diag -r 4), the run time of the EUD test is ~20 mins (all the test suites are run)

Note

The times provided above are the estimated runtimes of just the EUD test. The total runtime of -r 3 or -r 4 would be longer as they include other tests.

By default, the EUD reports an error for the first failing test and then stops. See run_on_error for details.

The EUD may also be run separately from the DCGM run levels via dcgmi diag -r eud, which runs the same set of tests as level 3.
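
The three ways of invoking the EUD described above can be summarized in a small lookup. This is only an illustrative sketch: the helper name eud_profile is hypothetical, and it prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): prints the dcgmi command for a
# profile together with the EUD runtime estimate quoted in this section.
eud_profile() {
    case "$1" in
        eud) echo "dcgmi diag -r eud (EUD only; under 5 minutes)" ;;
        3)   echo "dcgmi diag -r 3 (run level 3; EUD portion under 5 minutes)" ;;
        4)   echo "dcgmi diag -r 4 (run level 4; EUD portion about 20 minutes)" ;;
        *)   echo "unknown profile: $1" >&2; return 1 ;;
    esac
}

# List all three options for review.
for profile in eud 3 4; do
    eud_profile "$profile"
done
```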

Customization options

The EUD supports optional command-line arguments that can be specified during the run.

For example, to run the memory and compute tests:

$ dcgmi diag -r eud -p "eud.passthrough_args='run_tests=compute,memory'"

The -r eud option supports the following arguments:

| Option | Description |
| --- | --- |
| eud.tmp_dir | The directory where the EUD stdout/stderr and log files named dcgm_eud_stdout.txt, dcgm_eud_stderr.txt, dcgm_eud.log, dcgm_eud.mle will be written. The default directory location is /tmp |
| eud.suite_level=4 | Enables the full EUD test profile (~20 mins) and also enables customization of tests with the run_tests parameter. When this option is not specified, the default EUD test profile (~5 mins) is used. |
| eud.passthrough_args | Allows additional controls on the EUD diagnostic tests. See the table later in this document. |
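
Several of these options may be needed in a single run. The sketch below builds one -p value from multiple options; it assumes that dcgmi accepts several parameters in one -p string separated by semicolons, which should be confirmed against the dcgmi diag help. The helper name join_params and the directory /var/tmp/eud are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: joins its arguments with ';' into one -p value.
# Assumption: dcgmi accepts several parameters in one -p string separated by ';'.
join_params() {
    local IFS=';'
    printf '%s\n' "$*"
}

params="$(join_params \
    "eud.suite_level=4" \
    "eud.tmp_dir=/var/tmp/eud" \
    "eud.passthrough_args='run_tests=compute,memory'")"

# Print the command instead of running it, so it can be reviewed first.
echo "dcgmi diag -r eud -p \"$params\""
```

Printing the command first makes it easier to check the quoting around eud.passthrough_args before anything is executed.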

The table below provides the additional control arguments supported for eud.passthrough_args:

Logging

By default, DCGM logs the runs of EUD under /tmp/dcgm where two files are generated:

  • dcgm_eud.log - This plain text file contains a stdout log of the EUD test run

  • dcgm_eud.mle - This binary file contains the results of the EUD tests

The MLE file can be decoded to a JSON format output by running the mla binary under /usr/share/nvidia/diagnostic. See the documentation titled “MLA JSON Report Decoding” in that directory for more information on the options and generated reports.
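
After a run, it can be useful to confirm that both log files were produced before attempting to decode the MLE file. The following is a minimal sketch; the helper name eud_logs is hypothetical, and the default directory is the /tmp/dcgm location mentioned above (pass a different directory if eud.tmp_dir was set).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports which of the two EUD log files exist in a directory.
eud_logs() {
    local dir="${1:-/tmp/dcgm}"
    local name
    for name in dcgm_eud.log dcgm_eud.mle; do
        if [ -f "$dir/$name" ]; then
            echo "found   $dir/$name"
        else
            echo "missing $dir/$name"
        fi
    done
}

eud_logs
```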