CPU Extended Utility Diagnostics (CPU EUD)

Starting with DCGM 3.3.7, the CPU Extended Utility Diagnostics (CPU EUD) is available as a separate suite of tests once its package is installed. Run through dcgmi diag, the CPU EUD allows administrators to test for and report potential problems in the system.

Supported Products

CPU EUD supports the following NVIDIA products:

NVIDIA Grace CPU

Included Tests

The CPU EUD supports five test suites targeting various aspects of CPU functionality:

CPU
The CPU test suite focuses on several critical areas to ensure the reliability and performance of the CPU. This includes tests designed to verify data correctness, monitor error counts, and validate CPU performance under different conditions.
Memory
The memory test suite for CPU EUD validates the CPU memory interface. The tests run against both local and remote NUMA memory nodes and use the full memory size to confirm that memory operates without errors and at high performance.
C2C / Clink
Leverages the remote memory tests to saturate the C2C / Clink bus.
PCIe
The PCIe test suite validates the PCIe interface by checking link capabilities and ensuring stable performance, including the ability to retrain links while maintaining optimal operation between host and device.
Miscellaneous
The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are system-specific tests which validate the configuration and functionality of both CPU (e.g., CPU socket number, CPU number, CPU max/min MHz) and memory hardware to ensure the components are correctly identified and operational.

Note

Unless otherwise specified, the CPU EUD runs one or more tests from each of the test suites listed above.

Getting Started with CPU EUD

Installing the CPU EUD packages

Install the NVIDIA CPU EUD package using the package manager appropriate for your Linux distribution.

On Ubuntu, check for previously installed packages and remove all those shown in the output:

$ dpkg -l | grep cpueud
// Example:
// ii  cpueud-535                                        535.169-1  arm64  NVIDIA End-User cpueud
// ii  cpueud-local-tegra-repo-ubuntu2204-535.169-mode1  1.0-1      arm64  cpueud-local-tegra repository configuration files

$ sudo dpkg --purge <found packages>
// Example:
// $ sudo dpkg --purge cpueud-local-tegra-repo-ubuntu2204-535.169-mode1
// $ sudo dpkg --purge cpueud-535

Install the local repo package:

$ sudo dpkg -i cpueud-local-tegra-repo-ubuntu2204-$VERSION-mode1_1.0-1_arm64.deb
// $VERSION: The version number of the package you are installing. Replace $VERSION with the actual version number of the package you're using.
// Example:
// $ sudo dpkg -i cpueud-local-tegra-repo-ubuntu2204-535.169-mode1_1.0-1_arm64.deb

Copy the keyring file to the correct location. The exact copy command is printed in the output of the dpkg command.

$ sudo cp /var/cpueud-local-tegra-repo-ubuntu2204-535.169/cpueud-local-tegra-FFCE45E1-keyring.gpg /usr/share/keyrings/

Update the apt package index, then use apt-get to install cpueud:

$ sudo apt-get update
$ sudo apt-get install cpueud

On RHEL, check for previously installed packages and remove all those shown in the output:

$ sudo dnf list installed | grep cpueud
// Example:
// cpueud-535.aarch64                                  535.169-1                @cpueud-local-tegra-rhel9-535.169-mode1
// cpueud-local-tegra-repo-rhel9-535.169-mode1.aarch64 1.0-1                    @@System

$ sudo rpm -e <found packages>
$ sudo dnf remove <found packages>
// Example:
// $ sudo rpm -e cpueud-535.aarch64
// $ sudo dnf remove cpueud-535.aarch64
// $ sudo rpm -e cpueud-local-tegra-repo-rhel9-535.169-mode1.aarch64
// $ sudo dnf remove cpueud-local-tegra-repo-rhel9-535.169-mode1.aarch64

Install the local repo file, then install the diagnostic. Ensure the diagnostic version matches the major version specified in the local repo RPM file.

$ sudo yum install libxcrypt-compat
$ sudo rpm -i cpueud-local-tegra-repo-rhel9-$VERSION-mode1-1.0-1.aarch64.rpm
// $VERSION: The version number of the package you are installing. Replace $VERSION with the actual version number of the package you're using.
// Example:
// $ sudo rpm -i cpueud-local-tegra-repo-rhel9-535.169-mode1-1.0-1.aarch64.rpm

$ sudo dnf install cpueud

After installation, the EUD files should be located under /usr/share/nvidia/cpu/diagnostic/

Running the CPU EUD

Run Levels and Tests

The duration and comprehensiveness of a CPU EUD run can be varied by choosing a different diagnostic run level. The following table shows which tests run at each DCGM diagnostic run level.

Plugin    Test name        r1 (Short)   r2 (Medium)   r3 (Long)    r4 (Extra Long)
                           Seconds      < 2 mins      < 30 mins    1-2 hours
CPU EUD   `Opportunistic`  -            -             Yes          -
CPU EUD   `RmaFull`        -            -             -            Yes

Syntax

# dcgmi diag -r cpu_eud [options]

Running dcgmi diag with the -r cpu_eud parameter instead of a run level such as -r 3 runs the default CPU EUD tests, which are the RmaFull tests.
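
As a rough illustration of the syntax above, the following Python sketch assembles the dcgmi invocation as an argument list. The helper names build_cpu_eud_command and run_cpu_eud are hypothetical; only the -r cpu_eud parameter and the -j flag described under JSON Output are used. Nothing is executed until run_cpu_eud is called, which assumes a system with DCGM and the CPU EUD installed.

```python
import subprocess


def build_cpu_eud_command(json_output=False):
    """Return the argument list for a CPU EUD run (hypothetical helper)."""
    cmd = ["dcgmi", "diag", "-r", "cpu_eud"]
    if json_output:
        cmd.append("-j")  # request JSON instead of the default table
    return cmd


def run_cpu_eud(json_output=False):
    """Run the diagnostic and return its stdout; requires DCGM to be installed."""
    completed = subprocess.run(
        build_cpu_eud_command(json_output),
        capture_output=True,
        text=True,
        check=False,  # keep the output even if dcgmi exits with a non-zero code
    )
    return completed.stdout


print(build_cpu_eud_command(json_output=True))
# prints ['dcgmi', 'diag', '-r', 'cpu_eud', '-j']
```

Passing the arguments as a list rather than a single string avoids shell quoting issues if further options are appended later.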

Logging

By default, DCGM logs EUD runs under /var/log/nvidia-dcgm/, where three files are generated:

dcgm_cpu_eud_stdout.txt - A plain-text file containing the stdout log of the CPU EUD test run
dcgm_cpu_eud_stderr.txt - A plain-text file containing the stderr log of the CPU EUD test run
dcgm_cpu_eud.log - An encrypted log of the CPU EUD test run

You can also set the cpu_eud.tmp_dir parameter to choose the directory where the log files are stored.
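
When collecting logs after a run, it can help to check that all three files are present. The sketch below is a minimal example using the default directory and file names listed above. The function names are hypothetical, and it assumes that a directory set through cpu_eud.tmp_dir would contain files with the same names.

```python
from pathlib import Path

DEFAULT_LOG_DIR = Path("/var/log/nvidia-dcgm")
LOG_FILE_NAMES = (
    "dcgm_cpu_eud_stdout.txt",  # plain-text stdout log
    "dcgm_cpu_eud_stderr.txt",  # plain-text stderr log
    "dcgm_cpu_eud.log",         # encrypted log
)


def expected_log_files(log_dir=DEFAULT_LOG_DIR):
    """Return the paths where the three CPU EUD log files are expected."""
    return [Path(log_dir) / name for name in LOG_FILE_NAMES]


def missing_log_files(log_dir=DEFAULT_LOG_DIR):
    """Return the expected log files that do not exist in log_dir."""
    return [path for path in expected_log_files(log_dir) if not path.is_file()]


for path in missing_log_files():
    print(f"missing: {path}")
```

An empty result from missing_log_files means all three files were found. Note that dcgm_cpu_eud.log is encrypted, so only the two text files can be read directly.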

Command Usage

Default

To obtain the results in tabular format, use the following command:

# dcgmi diag -r cpu_eud

Example Output

Pass case

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.0.0                                          |
| Number of CPUs Detected   | 1                                              |
| CPU EUD Test Version      | eud.535.161                                    |
+-----  Hardware  ----------+------------------------------------------------+
| cpu_eud                   | Pass                                           |
|                           | CPU0: Pass                                     |
+---------------------------+------------------------------------------------+

Failure case

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.0.0                                          |
| Number of CPUs Detected   | 1                                              |
| CPU EUD Test Version      | eud.535.161                                    |
+-----  Hardware  ----------+------------------------------------------------+
| cpu_eud                   | Fail                                           |
|                           | CPU0: Fail                                     |
| Warning: CPU0             | Error : bad command line argument              |
+---------------------------+------------------------------------------------+

JSON Output

To obtain the results in JSON format, use the following command:

# dcgmi diag -r cpu_eud -j

JSON schema for the element in tests

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "entity_group": {
            "type": "string"
          },
          "entity_group_id": {
            "type": "integer"
          },
          "entity_id": {
            "type": "integer"
          },
          "status": {
            "type": "string"
          },
          "info": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "warnings": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "error_category": {
                  "type": "integer"
                },
                "error_id": {
                  "type": "integer"
                },
                "error_severity": {
                  "type": "integer"
                },
                "warning": {
                  "type": "string"
                }
              },
              "required": [
                "error_category",
                "error_id",
                "error_severity",
                "warning"
              ]
            }
          }
        },
        "required": [
          "entity_group",
          "entity_group_id",
          "entity_id",
          "status"
        ]
      }
    },
    "test_summary": {
      "type": "object",
      "properties": {
        "status": {
          "type": "string"
        },
        "info": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "warnings": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "error_category": {
                "type": "integer"
              },
              "error_id": {
                "type": "integer"
              },
              "error_severity": {
                "type": "integer"
              },
              "warning": {
                "type": "string"
              }
            },
            "required": [
              "error_category",
              "error_id",
              "error_severity",
              "warning"
            ]
          }
        }
      },
      "required": [
        "status"
      ]
    }
  },
  "required": [
    "name",
    "results",
    "test_summary"
  ]
}
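
The schema can be checked with any JSON Schema validator. For environments without one, the following hand-written Python sketch checks the required fields and their types for one element of tests. It is a partial check under stated assumptions: it does not validate the optional info arrays, it assumes every element is a JSON object, and the function names are hypothetical.

```python
def _is_int(value):
    # bool is a subclass of int in Python, but JSON booleans are not integers
    return isinstance(value, int) and not isinstance(value, bool)


def _check_warnings(warnings, where, problems):
    """Append a message for each malformed entry in a warnings array."""
    for i, item in enumerate(warnings):
        for key in ("error_category", "error_id", "error_severity"):
            if not _is_int(item.get(key)):
                problems.append(f"{where}.warnings[{i}].{key}: integer required")
        if not isinstance(item.get("warning"), str):
            problems.append(f"{where}.warnings[{i}].warning: string required")


def check_test_element(test):
    """Return a list of schema problems; an empty list means none were found."""
    problems = []
    if not isinstance(test.get("name"), str):
        problems.append("name: string required")
    results = test.get("results")
    if not isinstance(results, list):
        problems.append("results: array required")
        results = []
    for i, result in enumerate(results):
        where = f"results[{i}]"
        for key in ("entity_group", "status"):
            if not isinstance(result.get(key), str):
                problems.append(f"{where}.{key}: string required")
        for key in ("entity_group_id", "entity_id"):
            if not _is_int(result.get(key)):
                problems.append(f"{where}.{key}: integer required")
        _check_warnings(result.get("warnings", []), where, problems)
    summary = test.get("test_summary")
    if not isinstance(summary, dict):
        problems.append("test_summary: object required")
    else:
        if not isinstance(summary.get("status"), str):
            problems.append("test_summary.status: string required")
        _check_warnings(summary.get("warnings", []), "test_summary", problems)
    return problems


example = {
    "name": "cpu_eud",
    "results": [
        {"entity_group": "CPU", "entity_group_id": 7, "entity_id": 0, "status": "Pass"}
    ],
    "test_summary": {"status": "Pass"},
}
print(check_test_element(example))  # prints []
```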

Example Output

Pass case

{
  "category": "Hardware",
  "tests": [
    {
      "name": "cpu_eud",
      "results": [
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 0,
          "status": "Pass"
        },
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 1,
          "status": "Skip"
        }
      ],
      "test_summary": {
        "status": "Pass"
      }
    }
  ]
}

Failure case

{
  "category": "Hardware",
  "tests": [
    {
      "name": "cpu_eud",
      "results": [
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 0,
          "status": "Fail",
          "warnings": [
            {
              "error_category": 7,
              "error_id": 95,
              "error_severity": 2,
              "warning": "Error : bad command line argument"
            }
          ]
        },
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 1,
          "status": "Skip"
        }
      ],
      "test_summary": {
        "status": "Fail"
      }
    }
  ]
}
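
To act on the JSON output in a script, the failing entities and their warning messages can be extracted from an object shaped like the examples above. The following Python sketch is one way to do this. The function name failed_entities is hypothetical, the sample is an abbreviated copy of the failure case, and the code assumes it is given a single category object with a tests array, as shown in this section.

```python
import json


def failed_entities(category):
    """Return (test name, entity_id, warning messages) for each failed result."""
    failures = []
    for test in category.get("tests", []):
        for result in test.get("results", []):
            if result.get("status") == "Fail":
                messages = [w["warning"] for w in result.get("warnings", [])]
                failures.append((test["name"], result["entity_id"], messages))
    return failures


sample = json.loads("""
{
  "category": "Hardware",
  "tests": [
    {
      "name": "cpu_eud",
      "results": [
        {
          "entity_group": "CPU", "entity_group_id": 7, "entity_id": 0,
          "status": "Fail",
          "warnings": [
            {"error_category": 7, "error_id": 95, "error_severity": 2,
             "warning": "Error : bad command line argument"}
          ]
        },
        {"entity_group": "CPU", "entity_group_id": 7, "entity_id": 1,
         "status": "Skip"}
      ],
      "test_summary": {"status": "Fail"}
    }
  ]
}
""")

print(failed_entities(sample))
# prints [('cpu_eud', 0, ['Error : bad command line argument'])]
```

Results with a status of Skip or Pass are ignored, so an empty list indicates that no entity reported a failure.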