Error Injection

Overview

DCGM includes an error injection framework that allows users to simulate GPU errors and exercise the error-handling behavior of the DCGM APIs.

Error Injection Workflow

The basic workflow of injecting GPU errors with the framework is as follows:

  1. Start the nv-hostengine daemon

  2. Enable monitoring through DCGM (either using policies or health-watches)

  3. Determine an injection value for the targeted GPU error

  4. Inject the error using dcgmi test --inject

  5. Verify that DCGM reports the injected GPU errors
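As a rough sketch, steps 2 through 4 can be scripted. The Python helpers below only assemble the dcgmi command lines used in the examples later in this section. They do not run anything. The function names and the set_arg parameter are illustrative and not part of DCGM, and the sketch assumes that nv-hostengine is already running and that dcgmi is on the PATH when the commands are eventually executed.

```python
import shlex


def policy_set_command(condition_flag, threshold=None, set_arg="0,0"):
    """Build `dcgmi policy --set ...` for one policy condition.

    condition_flag is a flag taken from the examples in this section
    (for example "-T", "-p" or "-e"); set_arg is copied verbatim from them.
    """
    cmd = ["dcgmi", "policy", "--set", set_arg, condition_flag]
    if threshold is not None:
        cmd.append(str(threshold))
    return cmd


def policy_register_command():
    """Build `dcgmi policy --reg`, which listens for violations."""
    return ["dcgmi", "policy", "--reg"]


def inject_command(gpu_id, field_id, value):
    """Build `dcgmi test --inject` for one GPU, field ID and value."""
    return ["dcgmi", "test", "--inject",
            "--gpuid", str(gpu_id), "-f", str(field_id), "-v", str(value)]


# Print the command lines for a PCIe replay injection (field ID 202).
workflow = [
    policy_set_command("-p"),
    policy_register_command(),   # run this one in its own terminal
    inject_command(0, 202, 99999),
]
for cmd in workflow:
    print(shlex.join(cmd))
```

Returning argument lists rather than strings keeps the commands safe to pass to subprocess.run without a shell, should the commands be executed from a script.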

Field Identifiers

A full list of field identifiers is available here: https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_fields.h

Use these identifiers to look up the integer field IDs passed to the commands in the following sections.
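One way to look up IDs is to scan a local copy of the header for its definitions. The sketch below assumes the header declares fields as `#define DCGM_FI_... <number>` lines. The two sample lines are written from memory to match the IDs 202 and 319 used in this section, so the macro names should be checked against the real header. The parse_field_ids helper is hypothetical.

```python
import re

# Matches lines such as: #define DCGM_FI_SOME_FIELD 123
# (assumed layout of dcgm_fields.h; only plain decimal values are matched)
_DEFINE_RE = re.compile(r"^\s*#define\s+(DCGM_FI_\w+)\s+(\d+)\b", re.MULTILINE)


def parse_field_ids(header_text):
    """Return a {macro_name: integer_id} mapping from header text."""
    return {name: int(number) for name, number in _DEFINE_RE.findall(header_text)}


# Illustrative excerpt, not copied from the header; verify names before use.
SAMPLE_HEADER = """
#define DCGM_FI_DEV_PCIE_REPLAY_COUNTER 202
#define DCGM_FI_DEV_ECC_DBE_VOL_DEV 319
"""

field_ids = parse_field_ids(SAMPLE_HEADER)
for name, field_id in sorted(field_ids.items(), key=lambda item: item[1]):
    print(field_id, name)
```

In practice, header_text would come from reading a downloaded copy of dcgm_fields.h instead of the inline sample.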

Common errors that can be simulated with the injection framework include:

  • PCIe Replay Errors

  • ECC Errors (single double-bit error or multiple co-located single-bit errors)

  • Power Excursions

  • Thermal Excursions

  • XID Errors

  • NVLink Errors

Examples with dcgmi

In these examples, we demonstrate how DCGM can be used to detect various GPU error scenarios.

Thermal Violation

This example demonstrates an excursion above the specified GPU thermal threshold.

Use a DCGM policy to watch for violations of a target temperature threshold of 50 °C.

In one “listening” terminal, set the policy:

$ dcgmi policy --set 0,0 -T 50
Policy successfully set.

Then register with DCGM to listen for violations:

$ dcgmi policy --reg
Listening for violations.

In another “application” terminal, launch a workload. The provided DCGM CUDA load generator can be used for this purpose. For this example, launch an FP16 GEMM on the GPU:

$ dcgmproftester11 --no-dcgm-validation -t 1004 -d 30

Back in the “listening” terminal, DCGM reports thermal violations as the GPU temperature rises under the compute load:

Timestamp: Wed Sep 21 22:23:18 2022
The maximum thermal limit has violated policy manager values.
Temperature: 56
Timestamp: Wed Sep 21 22:23:28 2022
The maximum thermal limit has violated policy manager values.
Temperature: 60
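When the listener output is captured to a file or pipe, it can be turned into structured records for automated checks. The following is a minimal sketch that assumes every violation has the three-line shape shown above (a Timestamp line, a message line, then a "Name: value" line). The parse_violations helper is hypothetical and not part of DCGM, and the output format may differ between DCGM versions.

```python
def parse_violations(text):
    """Parse `dcgmi policy --reg` output into a list of dicts."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("Timestamp: "):
            # A Timestamp line starts a new violation record.
            current = {
                "timestamp": line[len("Timestamp: "):],
                "message": None,
                "values": {},
            }
            records.append(current)
        elif current is None:
            # Text before the first record, e.g. "Listening for violations."
            continue
        elif current["message"] is None:
            current["message"] = line
        elif ": " in line:
            key, value = line.split(": ", 1)
            # Keep numeric readings as integers, anything else as text.
            current["values"][key] = int(value) if value.isdigit() else value
    return records


SAMPLE_OUTPUT = """\
Listening for violations.
Timestamp: Wed Sep 21 22:23:18 2022
The maximum thermal limit has violated policy manager values.
Temperature: 56
Timestamp: Wed Sep 21 22:23:28 2022
The maximum thermal limit has violated policy manager values.
Temperature: 60
"""

for record in parse_violations(SAMPLE_OUTPUT):
    print(record["timestamp"], record["values"])
```

Timestamps are kept as strings here to avoid locale-dependent date parsing.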

PCIe Replay Errors

This example demonstrates injection of PCIe replay errors.

In the “listening” terminal:

$ dcgmi policy --set 0,0 -p
Policy successfully set.

$ dcgmi policy --reg
Listening for violations.

In another terminal, inject a contrived value of 99999:

$ dcgmi test --inject --gpuid 0 -f 202 -v 99999
Successfully injected field info.

In the “listening” terminal, DCGM reports the PCIe replay violation:

Listening for violations.
Timestamp: Thu Sep 22 01:30:34 2022
A PCIe replay event has violated policy manager values.
PCIe replay count: 99999

The same violation can also be observed when using DCGM health watches:

$ dcgmi health -c
+---------------------------+----------------------------------------------------------+
| Health Monitor Report                                                                |
+===========================+==========================================================+
| Overall Health            | Warning                                                  |
| GPU                       |                                                          |
| -> 0                      | Warning                                                  |
|    -> Errors              |                                                          |
|       -> PCIe system      | Warning                                                  |
|                           | Detected more than 8 PCIe replays per minute for GPU 0   |
|                           | : 99999 Reconnect PCIe card. Run system side PCIE        |
|                           | diagnostic utilities to verify hops off the GPU board    |
|                           | If issue is on the board, run the field diagnostic.      |
+---------------------------+----------------------------------------------------------+
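For scripted checks, the health report table can be reduced to name, status and message fields. The sketch below assumes the two-column layout shown above, where a row with an empty left cell continues the message of the previous row. The parse_health_report helper is hypothetical, and the sample is an abridged, re-wrapped copy of the report above.

```python
def parse_health_report(text):
    """Parse a `dcgmi health -c` table into a list of entry dicts."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith("|"):
            continue  # border rows start with "+"
        parts = line.strip("|").split("|", 1)
        if len(parts) != 2:
            continue  # single-cell title row
        name, detail = parts[0].strip(), parts[1].strip()
        if name:
            if name.startswith("->"):
                name = name[2:].strip()  # drop the tree marker
            entries.append({"name": name, "status": detail, "message": ""})
        elif entries and detail:
            # Empty left cell: continuation of the previous entry's message.
            last = entries[-1]
            last["message"] = (last["message"] + " " + detail).strip()
    return entries


SAMPLE_REPORT = """\
+----------------------+------------------------------------------+
| Health Monitor Report                                           |
+======================+==========================================+
| Overall Health       | Warning                                  |
| GPU                  |                                          |
| -> 0                 | Warning                                  |
|    -> Errors         |                                          |
|       -> PCIe system | Warning                                  |
|                      | Detected more than 8 PCIe replays per    |
|                      | minute for GPU 0 : 99999                 |
+----------------------+------------------------------------------+
"""

for entry in parse_health_report(SAMPLE_REPORT):
    if entry["status"] == "Warning":
        print(entry["name"], "->", entry["message"] or "(no message)")
```

The tree nesting is flattened in this sketch; a fuller version would track the indentation of each "->" marker to recover which GPU an error belongs to.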

ECC Errors

This example demonstrates injection of double-bit errors (DBEs).

In the “listening” terminal:

$ dcgmi policy --set 0,0 -e
Policy successfully set.

$ dcgmi policy --reg
Listening for violations.

In another terminal, inject a value of 4:

$ dcgmi test --inject --gpuid 0 -f 319 -v 4
Successfully injected field info.

In the “listening” terminal, DCGM reports the ECC errors:

Timestamp: Thu Sep 22 04:44:12 2022
A double-bit ECC error has violated policy manager values.
DBE error count: 4

API Examples

An example of how to inject values programmatically can be found in the following Python file:

https://github.com/NVIDIA/DCGM/blob/master/testing/python3/tests/test_injection.py