CPU Extended Utility Diagnostics (CPU EUD)

Starting with DCGM 3.3.7, the CPU Extended Utility Diagnostics (CPU EUD) is available as a separately installed suite of tests. Run through dcgmi diag, the CPU EUD allows administrators to test for and report potential problems in the system.

Supported Products

CPU EUD supports the following NVIDIA products:

  • NVIDIA Grace CPU

Included Tests

The CPU EUD supports five test suites, each targeting a different aspect of CPU functionality:

  • CPU

    The CPU test suite focuses on several critical areas to ensure the reliability and performance of the CPU. This includes tests designed to verify data correctness, monitor error counts, and validate CPU performance under different conditions.

  • Memory

    The memory test suite for CPU EUD validates the CPU memory interface. The tests run against both local and remote NUMA memory nodes and use the full size of memory to verify that it operates without errors and at high performance.

  • C2C / Clink

    The C2C / Clink test uses the remote memory test to saturate the C2C / Clink bus.

  • PCIe

    The PCIe test suite validates the PCIe interface by checking link capabilities and ensuring stable performance, including the ability to retrain links while maintaining optimal operation between host and device.

  • Miscellaneous

    The miscellaneous test suite runs tests that don’t fit into any of the other categories. Most of the tests in this category are system-specific tests which validate the configuration and functionality of both CPU (e.g., CPU socket number, CPU number, CPU max/min MHz) and memory hardware to ensure the components are correctly identified and operational.

Note

Unless specified otherwise, the CPU EUD runs one or more tests from each of the test suites listed above.

Getting Started with CPU EUD

Installing the CPU EUD packages

Install the NVIDIA CPU EUD package using the package manager appropriate for your Linux distribution. The steps below use dpkg and apt-get with the Ubuntu 22.04 arm64 packages.

  • Check for previously installed CPU EUD packages and purge every package shown in the output.

    $ dpkg -l | grep cpueud
    # Example output:
    # ii  cpueud-535                                        535.169-1  arm64  NVIDIA End-User cpueud
    # ii  cpueud-local-tegra-repo-ubuntu2204-535.169-mode1  1.0-1      arm64  cpueud-local-tegra repository configuration files
    
    $ sudo dpkg --purge <found packages>
    # Example:
    # $ sudo dpkg --purge cpueud-local-tegra-repo-ubuntu2204-535.169-mode1
    # $ sudo dpkg --purge cpueud-535
    
  • Install the local repo package.

    $ sudo dpkg -i cpueud-local-tegra-repo-ubuntu2204-$VERSION-mode1_1.0-1_arm64.deb
    # Replace $VERSION with the version number of the package you are installing.
    # Example:
    # $ sudo dpkg -i cpueud-local-tegra-repo-ubuntu2204-535.169-mode1_1.0-1_arm64.deb
    
  • Copy the keyring file to the correct location. The exact copy command is printed in the output of the dpkg command.

    $ sudo cp /var/cpueud-local-tegra-repo-ubuntu2204-535.169/cpueud-local-tegra-FFCE45E1-keyring.gpg /usr/share/keyrings/
    
  • Update the apt package index, then use apt-get to install cpueud.

    $ sudo apt-get update
    $ sudo apt-get install cpueud
    

The files for the EUD should be installed under /usr/share/nvidia/cpu/diagnostic/
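
The first and last steps above lend themselves to a small script. The following is a minimal sketch, assuming a Debian-based system; the function names list_cpueud_packages and check_cpueud_install are hypothetical helpers introduced here for illustration and are not part of DCGM or the CPU EUD package.

```shell
#!/bin/sh
# Illustrative helpers only; neither function ships with DCGM or the CPU EUD.

# Read `dpkg -l` style output on stdin and print the name (second column)
# of every package whose name contains "cpueud".
list_cpueud_packages() {
    awk '$2 ~ /cpueud/ { print $2 }'
}

# Return 0 if the given directory exists and is not empty, non-zero otherwise.
# Defaults to the install path documented above.
check_cpueud_install() {
    dir="${1:-/usr/share/nvidia/cpu/diagnostic}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

For example, `dpkg -l | list_cpueud_packages` prints the package names that are candidates for `sudo dpkg --purge`, and `check_cpueud_install && echo "CPU EUD files found"` confirms that the install directory is populated.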

Running the CPU EUD

Run Levels and Tests

The duration and thoroughness of a CPU EUD run depend on the diagnostic run level you choose. The following table describes which tests are run at each level in DCGM diagnostics.

Run level          Duration
r1 (Short)         Seconds
r2 (Medium)        < 2 mins
r3 (Long)          < 30 mins
r4 (Extra Long)    1-2 hours

Plugin     Test name        Included
CPU EUD    Opportunistic    Yes
CPU EUD    RmaFull          Yes

Syntax

# dcgmi diag -r cpu_eud [options]

Running dcgmi diag with the -r cpu_eud parameter instead of a run level such as -r 3 runs the default CPU EUD tests, which are the RmaFull tests.

Logging

By default, DCGM writes the logs of each CPU EUD run to /var/log/nvidia-dcgm/, where three files are generated:

  • dcgm_cpu_eud_stdout.txt - A plain-text file containing the stdout log of the CPU EUD test run

  • dcgm_cpu_eud_stderr.txt - A plain-text file containing the stderr log of the CPU EUD test run

  • dcgm_cpu_eud.log - An encrypted log of the CPU EUD test run

You can also specify cpu_eud.tmp_dir to set the directory where you want to store the log files.
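
As a sketch of how the log directory might be overridden: dcgmi diag accepts test parameters through its -p option in the form test_name.variable_name=value, and the example below assumes cpu_eud.tmp_dir is passed the same way. The helper name run_cpu_eud, the DCGMI override variable, and the path /tmp/cpu_eud_logs are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: run the CPU EUD with its log files written to the
# directory given as $1. Assumes cpu_eud.tmp_dir is accepted through the
# -p (parameters) option of `dcgmi diag`.
# The DCGMI variable is an illustrative hook for substituting the binary.
run_cpu_eud() {
    log_dir="$1"
    # Create the directory first so the diagnostic has somewhere to write.
    mkdir -p "$log_dir" || return 1
    "${DCGMI:-dcgmi}" diag -r cpu_eud -p "cpu_eud.tmp_dir=$log_dir"
}
```

A call such as `run_cpu_eud /tmp/cpu_eud_logs` would then be expected to leave the three log files in that directory rather than in /var/log/nvidia-dcgm/.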

Command Usage

Default

To obtain the results in tabular format, use the following command:

# dcgmi diag -r cpu_eud

Example Output

  • Pass case

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.0.0                                          |
| Number of CPUs Detected   | 1                                              |
| CPU EUD Test Version      | eud.535.161                                    |
+-----  Hardware  ----------+------------------------------------------------+
| cpu_eud                   | Pass                                           |
|                           | CPU0: Pass                                     |
+---------------------------+------------------------------------------------+

  • Failure case

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.0.0                                          |
| Number of CPUs Detected   | 1                                              |
| CPU EUD Test Version      | eud.535.161                                    |
+-----  Hardware  ----------+------------------------------------------------+
| cpu_eud                   | Fail                                           |
|                           | CPU0: Fail                                     |
| Warning: CPU0             | Error : bad command line argument              |
+---------------------------+------------------------------------------------+
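
When only the tabular output is available, the per-CPU result lines can be extracted with standard text tools. The following is a minimal sketch that assumes the table layout shown above; the function name cpu_eud_failed_cpus is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the tabular `dcgmi diag -r cpu_eud` output on
# stdin and print the identifier of every CPU reported as "Fail".
# Assumes result rows look like "|        | CPU0: Fail        |", as in the
# example output above; rows in any other shape are ignored.
cpu_eud_failed_cpus() {
    sed -n 's/^|[^|]*|[[:space:]]*\(CPU[0-9][0-9]*\):[[:space:]]*Fail[[:space:]]*|.*$/\1/p'
}
```

For example, `dcgmi diag -r cpu_eud | cpu_eud_failed_cpus` prints one CPU identifier per failing entity and nothing when every CPU passes.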

JSON Output

To obtain the results in JSON format, use the following command:

# dcgmi diag -r cpu_eud -j

JSON schema for each element of the tests array

{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "entity_group": {
            "type": "string"
          },
          "entity_group_id": {
            "type": "integer"
          },
          "entity_id": {
            "type": "integer"
          },
          "status": {
            "type": "string"
          },
          "info": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "warnings": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "error_category": {
                  "type": "integer"
                },
                "error_id": {
                  "type": "integer"
                },
                "error_severity": {
                  "type": "integer"
                },
                "warning": {
                  "type": "string"
                }
              },
              "required": [
                "error_category",
                "error_id",
                "error_severity",
                "warning"
              ]
            }
          }
        },
        "required": [
          "entity_group",
          "entity_group_id",
          "entity_id",
          "status"
        ]
      }
    },
    "test_summary": {
      "type": "object",
      "properties": {
        "status": {
          "type": "string"
        },
        "info": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "warnings": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "error_category": {
                "type": "integer"
              },
              "error_id": {
                "type": "integer"
              },
              "error_severity": {
                "type": "integer"
              },
              "warning": {
                "type": "string"
              }
            },
            "required": [
              "error_category",
              "error_id",
              "error_severity",
              "warning"
            ]
          }
        }
      },
      "required": [
        "status"
      ]
    }
  },
  "required": [
    "name",
    "results",
    "test_summary"
  ]
}

Example Output

  • Pass case

{
  "category": "Hardware",
  "tests": [
    {
      "name": "cpu_eud",
      "results": [
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 0,
          "status": "Pass"
        },
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 1,
          "status": "Skip"
        }
      ],
      "test_summary": {
        "status": "Pass"
      }
    }
  ]
}

  • Failure case

{
  "category": "Hardware",
  "tests": [
    {
      "name": "cpu_eud",
      "results": [
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 0,
          "status": "Fail",
          "warnings": [
            {
              "error_category": 7,
              "error_id": 95,
              "error_severity": 2,
              "warning": "Error : bad command line argument"
            }
          ]
        },
        {
          "entity_group": "CPU",
          "entity_group_id": 7,
          "entity_id": 1,
          "status": "Skip"
        }
      ],
      "test_summary": {
        "status": "Fail"
      }
    }
  ]
}
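
The JSON form is easier to consume from scripts. The following sketch uses jq, which is an assumption (it is a separate tool and is not shipped with DCGM), and it operates on a single category object shaped like the examples above. The full output of dcgmi diag -j may wrap this object in additional levels, in which case the path would need adjusting. The function name cpu_eud_failures is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read one category object (shaped like the examples
# above) on stdin and print "<entity_group><entity_id>: <warnings>" for
# every cpu_eud result whose status is "Fail". Requires jq.
cpu_eud_failures() {
    jq -r '
        .tests[]
        | select(.name == "cpu_eud")
        | .results[]
        | select(.status == "Fail")
        # "warnings" is optional in the schema, so tolerate its absence.
        | "\(.entity_group)\(.entity_id): \([.warnings[]?.warning] | join("; "))"
    '
}
```

Entities with a status of Pass or Skip produce no output, so an empty result means that no cpu_eud entity failed.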