Automating Responses to DCGM Diagnostic Failures
Overview
Automating workflows based on DCGM diagnostics can help sites handle GPU errors more efficiently. Additional data for determining the severity of an error and potential next steps is available either through the API or by parsing the JSON returned by the CLI. Besides a human-readable string describing what went wrong, each error reported by the diagnostic includes a specific ID, Severity, and Category that can be useful when deciding how to handle the failure.
The latest versions of these enums can be found in dcgm_errors.h.
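As a rough illustration of the CLI route, the sketch below walks a parsed JSON document and collects every object that carries the three fields. The key names `error_id`, `error_severity`, and `error_category`, the helper `collect_errors`, and the sample document are all assumptions made for illustration; check them against the JSON your DCGM version actually emits.

```python
import json

# Assumed key names; verify against the JSON produced by your DCGM version.
ERROR_KEYS = ("error_id", "error_severity", "error_category")


def collect_errors(node):
    """Recursively gather every dict that carries all three error fields."""
    found = []
    if isinstance(node, dict):
        if all(key in node for key in ERROR_KEYS):
            found.append({key: int(node[key]) for key in ERROR_KEYS})
        # Keep descending: errors may be nested at any depth.
        for value in node.values():
            found.extend(collect_errors(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_errors(item))
    return found


# Hypothetical sample, not captured from a real diagnostic run.
SAMPLE = """
{
  "tests": [
    {"name": "memory", "results": [
      {"gpu_id": "0", "status": "Fail", "warnings": [
        {"warning": "example message", "error_id": 52,
         "error_severity": 2, "error_category": 10}
      ]}
    ]},
    {"name": "pcie", "results": [{"gpu_id": "0", "status": "Pass"}]}
  ]
}
"""

errors = collect_errors(json.loads(SAMPLE))
print(errors)
```

Walking the whole tree rather than indexing a fixed path keeps the sketch independent of the exact nesting, which may differ between DCGM releases.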

| Error Category Enum | Value |
|---|---|
| DCGM_FR_EC_NONE | 0 |
| DCGM_FR_EC_PERF_THRESHOLD | 1 |
| DCGM_FR_EC_PERF_VIOLATION | 2 |
| DCGM_FR_EC_SOFTWARE_CONFIG | 3 |
| DCGM_FR_EC_SOFTWARE_LIBRARY | 4 |
| DCGM_FR_EC_SOFTWARE_XID | 5 |
| DCGM_FR_EC_SOFTWARE_CUDA | 6 |
| DCGM_FR_EC_SOFTWARE_EUD | 7 |
| DCGM_FR_EC_SOFTWARE_OTHER | 8 |
| DCGM_FR_EC_HARDWARE_THERMAL | 9 |
| DCGM_FR_EC_HARDWARE_MEMORY | 10 |
| DCGM_FR_EC_HARDWARE_NVLINK | 11 |
| DCGM_FR_EC_HARDWARE_NVSWITCH | 12 |
| DCGM_FR_EC_HARDWARE_PCIE | 13 |
| DCGM_FR_EC_HARDWARE_POWER | 14 |
| DCGM_FR_EC_HARDWARE_OTHER | 15 |
| DCGM_FR_EC_INTERNAL_OTHER | 16 |


| Error Severity Enum | Value |
|---|---|
| DCGM_ERROR_NONE | 0 |
| DCGM_ERROR_MONITOR | 1 |
| DCGM_ERROR_ISOLATE | 2 |
| DCGM_ERROR_UNKNOWN | 3 |
| DCGM_ERROR_TRIAGE | 4 |
| DCGM_ERROR_CONFIG | 5 |
| DCGM_ERROR_RESET | 6 |

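Automation that receives the numeric values usually needs to turn them back into names. A minimal sketch, assuming the values in the two enum tables above: the class names `Severity` and `Category` and the helper `decode` are illustrative, the member names are the table names with the `DCGM_ERROR_` and `DCGM_FR_EC_` prefixes dropped, and dcgm_errors.h remains the authoritative source.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Local copy of the Error Severity Enum table."""
    NONE = 0
    MONITOR = 1
    ISOLATE = 2
    UNKNOWN = 3
    TRIAGE = 4
    CONFIG = 5
    RESET = 6


class Category(IntEnum):
    """Local copy of the Error Category Enum table."""
    NONE = 0
    PERF_THRESHOLD = 1
    PERF_VIOLATION = 2
    SOFTWARE_CONFIG = 3
    SOFTWARE_LIBRARY = 4
    SOFTWARE_XID = 5
    SOFTWARE_CUDA = 6
    SOFTWARE_EUD = 7
    SOFTWARE_OTHER = 8
    HARDWARE_THERMAL = 9
    HARDWARE_MEMORY = 10
    HARDWARE_NVLINK = 11
    HARDWARE_NVSWITCH = 12
    HARDWARE_PCIE = 13
    HARDWARE_POWER = 14
    HARDWARE_OTHER = 15
    INTERNAL_OTHER = 16


def name_of(enum_cls, value):
    """Return the member name, or a placeholder for values this copy lacks."""
    try:
        return enum_cls(value).name
    except ValueError:
        return f"UNRECOGNIZED({value})"


def decode(severity_value, category_value):
    """Translate a (severity, category) pair of integers into names."""
    return name_of(Severity, severity_value), name_of(Category, category_value)


print(decode(2, 10))  # ('ISOLATE', 'HARDWARE_MEMORY')
print(decode(99, 3))  # ('UNRECOGNIZED(99)', 'SOFTWARE_CONFIG')
```

The fallback for unrecognized values matters because a newer header may define values that a local copy of the enums does not yet know about.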

| Error Enum | Value | Severity | Category |
|---|---|---|---|
| DCGM_FR_OK | 0 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_UNKNOWN | 1 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_UNRECOGNIZED | 2 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_NONE |
| DCGM_FR_PCI_REPLAY_RATE | 3 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_VOLATILE_DBE_DETECTED | 4 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_VOLATILE_SBE_DETECTED | 5 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PENDING_PAGE_RETIREMENTS | 6 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_RETIRED_PAGES_LIMIT | 7 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_RETIRED_PAGES_DBE_LIMIT | 8 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CORRUPT_INFOROM | 9 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_CLOCKS_EVENT_THERMAL | 10 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_POWER_UNREADABLE | 11 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_CLOCKS_EVENT_POWER | 12 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_NVLINK_ERROR_THRESHOLD | 13 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVLINK_DOWN | 14 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVSWITCH_FATAL_ERROR | 15 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NVSWITCH_NON_FATAL_ERROR | 16 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NVSWITCH_DOWN | 17 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_NVSWITCH |
| DCGM_FR_NO_ACCESS_TO_FILE | 18 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_NVML_API | 19 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_DEVICE_COUNT_MISMATCH | 20 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_BAD_PARAMETER | 21 | DCGM_ERROR_UNKNOWN | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CANNOT_OPEN_LIB | 22 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_DENYLISTED_DRIVER | 23 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_NVML_LIB_BAD | 24 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_LIBRARY |
| DCGM_FR_GRAPHICS_PROCESSES | 25 | DCGM_ERROR_RESET | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_HOSTENGINE_CONN | 26 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_QUERY | 27 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_BAD_CUDA_ENV | 28 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_PERSISTENCE_MODE | 29 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_LOW_BANDWIDTH | 30 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_HIGH_LATENCY | 31 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_CANNOT_GET_FIELD_TAG | 32 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_VIOLATION | 33 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_FIELD_THRESHOLD | 34 | DCGM_ERROR_MONITOR | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_FIELD_VIOLATION_DBL | 35 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_FIELD_THRESHOLD_DBL | 36 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_VIOLATION |
| DCGM_FR_UNSUPPORTED_FIELD_TYPE | 37 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_FIELD_THRESHOLD_TS | 38 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_FIELD_THRESHOLD_TS_DBL | 39 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_THERMAL_VIOLATIONS | 40 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_THERMAL_VIOLATIONS_TS | 41 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_TEMP_VIOLATION | 42 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_THERMAL |
| DCGM_FR_CLOCKS_EVENT_VIOLATION | 43 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_INTERNAL | 44 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_PCIE_GENERATION | 45 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_PCIE_WIDTH | 46 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_ABORTED | 47 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_TEST_DISABLED | 48 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_CANNOT_GET_STAT | 49 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_STRESS_LEVEL | 50 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_CUDA_API | 51 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_FAULTY_MEMORY | 52 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CANNOT_SET_WATCHES | 53 | DCGM_ERROR_MONITOR | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CUDA_UNBOUND | 54 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_ECC_DISABLED | 55 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_ALLOC | 56 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_CUDA_DBE | 57 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_MISMATCH | 58 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_CUDA_DEVICE | 59 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_ECC_UNSUPPORTED | 60 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_ECC_PENDING | 61 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_MEMORY_BANDWIDTH | 62 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_PERF_THRESHOLD |
| DCGM_FR_TARGET_POWER | 63 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_API_FAIL | 64 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_API_FAIL_GPU | 65 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_CUDA_CONTEXT | 66 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_DCGM_API | 67 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_CONCURRENT_GPUS | 68 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_TOO_MANY_ERRORS | 69 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_NVLINK_CRC_ERROR_THRESHOLD | 70 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVLINK_ERROR_CRITICAL | 71 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_ENFORCED_POWER_LIMIT | 72 | DCGM_ERROR_CONFIG | DCGM_FR_EC_HARDWARE_POWER |
| DCGM_FR_MEMORY_ALLOC_HOST | 73 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_GPU_OP_MODE | 74 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_NO_MEMORY_CLOCKS | 75 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_NO_GRAPHICS_CLOCKS | 76 | DCGM_ERROR_MONITOR | DCGM_FR_EC_HARDWARE_OTHER |
| DCGM_FR_HAD_TO_RESTORE_STATE | 77 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_L1TAG_UNSUPPORTED | 78 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_OTHER |
| DCGM_FR_L1TAG_MISCOMPARE | 79 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_ROW_REMAP_FAILURE | 80 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_UNCONTAINED_ERROR | 81 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_XID |
| DCGM_FR_EMPTY_GPU_LIST | 82 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_DBE_PENDING_PAGE_RETIREMENTS | 83 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_UNCORRECTABLE_ROW_REMAP | 84 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PENDING_ROW_REMAP | 85 | DCGM_ERROR_RESET | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_BROKEN_P2P_MEMORY_DEVICE | 86 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_BROKEN_P2P_WRITER_DEVICE | 87 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_NVSWITCH_NVLINK_DOWN | 88 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_EUD_BINARY_PERMISSIONS | 89 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_NON_ROOT_USER | 90 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_SPAWN_FAILURE | 91 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_TIMEOUT | 92 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_ZOMBIE | 93 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_NON_ZERO_EXIT_CODE | 94 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_EUD_TEST_FAILED | 95 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_EUD |
| DCGM_FR_FILE_CREATE_PERMISSIONS | 96 | DCGM_ERROR_CONFIG | DCGM_FR_EC_SOFTWARE_CONFIG |
| DCGM_FR_PAUSE_RESUME_FAILED | 97 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_INTERNAL_OTHER |
| DCGM_FR_PCIE_H_REPLAY_VIOLATION | 98 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_GPU_EXPECTED_NVLINKS_UP | 99 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_NVSWITCH_EXPECTED_NVLINKS_UP | 100 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_HARDWARE_NVLINK |
| DCGM_FR_XID_ERROR | 101 | DCGM_ERROR_TRIAGE | DCGM_FR_EC_SOFTWARE_XID |
| DCGM_FR_SBE_VIOLATION | 102 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_DBE_VIOLATION | 103 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PCIE_REPLAY_VIOLATION | 104 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_SBE_THRESHOLD_VIOLATION | 105 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_DBE_THRESHOLD_VIOLATION | 106 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_MEMORY |
| DCGM_FR_PCIE_REPLAY_THRESHOLD_VIOLATION | 107 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_HARDWARE_PCIE |
| DCGM_FR_CUDA_FM_NOT_INITIALIZED | 108 | DCGM_ERROR_MONITOR | DCGM_FR_EC_SOFTWARE_CUDA |
| DCGM_FR_SXID_ERROR | 109 | DCGM_ERROR_ISOLATE | DCGM_FR_EC_SOFTWARE_XID |

These relationships are codified in dcgm_errors.c.
In general, DCGM has high confidence that errors with the ISOLATE and RESET severities should be handled immediately. Other severities may require more site-specific analysis, a re-run of the diagnostic, or a scan of the DCGM and system logs to determine the best course of action. Gathering and recording failure types and rates over time can give datacenters insight into the best way to automate the handling of GPU diagnostic errors.
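The guidance above can be sketched as a small dispatch step plus a running tally. This is only one possible shape: the action labels, the function names `plan_action` and `summarize`, and the record layout with `error_id` and `error_severity` keys are hypothetical, and a real site would substitute its own scheduler or ticketing hooks.

```python
from collections import Counter

# Severity values taken from the Error Severity Enum table.
ISOLATE = 2
RESET = 6


def plan_action(severity):
    """Map a severity value to a hypothetical site action label."""
    if severity == ISOLATE:
        return "drain-and-isolate"
    if severity == RESET:
        return "reset-gpu"
    # Every other severity is left for site-specific analysis.
    return "manual-review"


def summarize(errors):
    """Tally planned actions and how often each error ID was seen."""
    actions = Counter(plan_action(e["error_severity"]) for e in errors)
    by_id = Counter(e["error_id"] for e in errors)
    return actions, by_id


# Hypothetical observations; IDs and severities follow the Error Enum table.
observed = [
    {"error_id": 52, "error_severity": 2},  # DCGM_FR_FAULTY_MEMORY
    {"error_id": 85, "error_severity": 6},  # DCGM_FR_PENDING_ROW_REMAP
    {"error_id": 10, "error_severity": 1},  # DCGM_FR_CLOCKS_EVENT_THERMAL
    {"error_id": 52, "error_severity": 2},  # DCGM_FR_FAULTY_MEMORY again
]

actions, by_id = summarize(observed)
print(dict(actions))
print(by_id.most_common(1))  # [(52, 2)]
```

Persisting the two counters per node or per GPU over time is one way to build the failure-rate history that informs later automation decisions.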