Automating Responses to DCGM Diagnostic Failures

Overview

Automating workflows based on DCGM diagnostics can help sites handle GPU errors more efficiently. Additional data for determining the severity of errors and potential next steps is available either through the API or by parsing the JSON returned on the CLI. In addition to human-readable strings describing which errors occurred during the diagnostic, each error includes a specific ID, Severity, and Category that can be useful when deciding how to handle the failure.
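As a minimal sketch of the JSON-parsing route, the Python below walks a parsed JSON document and collects every object that carries an ID, severity, and category. The key names `error_id`, `error_severity`, and `error_category`, and the shape of the sample document, are assumptions made for illustration; check them against the JSON your DCGM version actually emits.

```python
import json
from typing import Any, Iterator

# Assumed key names for the per-error fields; verify against real output.
ERROR_KEYS = ("error_id", "error_severity", "error_category")


def iter_errors(node: Any) -> Iterator[dict]:
    """Yield every dict in the parsed JSON that carries all three error fields."""
    if isinstance(node, dict):
        if all(key in node for key in ERROR_KEYS):
            yield node
        for value in node.values():
            yield from iter_errors(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_errors(item)


# Hypothetical sample, not real diagnostic output.
sample = json.loads("""
{"tests": [{"name": "memory", "results": [{"status": "Fail",
  "warnings": [{"error_id": 52, "error_severity": 2, "error_category": 10,
                "warning": "example message"}]}]}]}
""")

errors = list(iter_errors(sample))
for err in errors:
    print(err["error_id"], err["error_severity"], err["error_category"])
```

Walking the tree recursively avoids hard-coding the nesting of the output, which may differ between DCGM releases.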

The latest versions of these enums can be found in dcgm_errors.h.

Error Category Enum           Value
DCGM_FR_EC_NONE               0
DCGM_FR_EC_PERF_THRESHOLD     1
DCGM_FR_EC_PERF_VIOLATION     2
DCGM_FR_EC_SOFTWARE_CONFIG    3
DCGM_FR_EC_SOFTWARE_LIBRARY   4
DCGM_FR_EC_SOFTWARE_XID       5
DCGM_FR_EC_SOFTWARE_CUDA      6
DCGM_FR_EC_SOFTWARE_EUD       7
DCGM_FR_EC_SOFTWARE_OTHER     8
DCGM_FR_EC_HARDWARE_THERMAL   9
DCGM_FR_EC_HARDWARE_MEMORY    10
DCGM_FR_EC_HARDWARE_NVLINK    11
DCGM_FR_EC_HARDWARE_NVSWITCH  12
DCGM_FR_EC_HARDWARE_PCIE      13
DCGM_FR_EC_HARDWARE_POWER     14
DCGM_FR_EC_HARDWARE_OTHER     15
DCGM_FR_EC_INTERNAL_OTHER     16
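For scripts that consume the numeric category, one option is to mirror the enum in Python and derive a coarse family (hardware, software, and so on) from the name. This is a sketch: the class `ErrorCategory` and the helper `category_family` are hypothetical names, and the values are copied from the table above, so they should be re-checked against dcgm_errors.h for the DCGM version in use.

```python
from enum import IntEnum


class ErrorCategory(IntEnum):
    """Python mirror of the category values listed above (DCGM_FR_EC_ prefix dropped)."""
    NONE = 0
    PERF_THRESHOLD = 1
    PERF_VIOLATION = 2
    SOFTWARE_CONFIG = 3
    SOFTWARE_LIBRARY = 4
    SOFTWARE_XID = 5
    SOFTWARE_CUDA = 6
    SOFTWARE_EUD = 7
    SOFTWARE_OTHER = 8
    HARDWARE_THERMAL = 9
    HARDWARE_MEMORY = 10
    HARDWARE_NVLINK = 11
    HARDWARE_NVSWITCH = 12
    HARDWARE_PCIE = 13
    HARDWARE_POWER = 14
    HARDWARE_OTHER = 15
    INTERNAL_OTHER = 16


def category_family(value: int) -> str:
    """Return the leading word of the category name, e.g. HARDWARE or SOFTWARE."""
    return ErrorCategory(value).name.split("_", 1)[0]


print(category_family(10))  # HARDWARE
print(category_family(6))   # SOFTWARE
```

Grouping by family can be a convenient first routing decision, for example sending hardware categories to a repair queue and software categories to configuration review.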

Error Severity Enum  Value
DCGM_ERROR_NONE      0
DCGM_ERROR_MONITOR   1
DCGM_ERROR_ISOLATE   2
DCGM_ERROR_UNKNOWN   3
DCGM_ERROR_TRIAGE    4
DCGM_ERROR_CONFIG    5
DCGM_ERROR_RESET     6

Error Enum                               Value  Severity            Category
DCGM_FR_OK                               0      DCGM_ERROR_UNKNOWN  DCGM_FR_EC_NONE
DCGM_FR_UNKNOWN                          1      DCGM_ERROR_UNKNOWN  DCGM_FR_EC_NONE
DCGM_FR_UNRECOGNIZED                     2      DCGM_ERROR_UNKNOWN  DCGM_FR_EC_NONE
DCGM_FR_PCI_REPLAY_RATE                  3      DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_VOLATILE_DBE_DETECTED            4      DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_VOLATILE_SBE_DETECTED            5      DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_PENDING_PAGE_RETIREMENTS         6      DCGM_ERROR_RESET    DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_RETIRED_PAGES_LIMIT              7      DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_RETIRED_PAGES_DBE_LIMIT          8      DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_CORRUPT_INFOROM                  9      DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_OTHER
DCGM_FR_CLOCKS_EVENT_THERMAL             10     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_THERMAL
DCGM_FR_POWER_UNREADABLE                 11     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_POWER
DCGM_FR_CLOCKS_EVENT_POWER               12     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_POWER
DCGM_FR_NVLINK_ERROR_THRESHOLD           13     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_NVLINK_DOWN                      14     DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_NVSWITCH_FATAL_ERROR             15     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_NVSWITCH
DCGM_FR_NVSWITCH_NON_FATAL_ERROR         16     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_NVSWITCH
DCGM_FR_NVSWITCH_DOWN                    17     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_NVSWITCH
DCGM_FR_NO_ACCESS_TO_FILE                18     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_NVML_API                         19     DCGM_ERROR_MONITOR  DCGM_FR_EC_SOFTWARE_LIBRARY
DCGM_FR_DEVICE_COUNT_MISMATCH            20     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_BAD_PARAMETER                    21     DCGM_ERROR_UNKNOWN  DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_CANNOT_OPEN_LIB                  22     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_LIBRARY
DCGM_FR_DENYLISTED_DRIVER                23     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_CONFIG
DCGM_FR_NVML_LIB_BAD                     24     DCGM_ERROR_ISOLATE  DCGM_FR_EC_SOFTWARE_LIBRARY
DCGM_FR_GRAPHICS_PROCESSES               25     DCGM_ERROR_RESET    DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_HOSTENGINE_CONN                  26     DCGM_ERROR_MONITOR  DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_FIELD_QUERY                      27     DCGM_ERROR_MONITOR  DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_BAD_CUDA_ENV                     28     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_CUDA
DCGM_FR_PERSISTENCE_MODE                 29     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_LOW_BANDWIDTH                    30     DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_HIGH_LATENCY                     31     DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_CANNOT_GET_FIELD_TAG             32     DCGM_ERROR_MONITOR  DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_FIELD_VIOLATION                  33     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_OTHER
DCGM_FR_FIELD_THRESHOLD                  34     DCGM_ERROR_MONITOR  DCGM_FR_EC_PERF_VIOLATION
DCGM_FR_FIELD_VIOLATION_DBL              35     DCGM_ERROR_ISOLATE  DCGM_FR_EC_PERF_VIOLATION
DCGM_FR_FIELD_THRESHOLD_DBL              36     DCGM_ERROR_ISOLATE  DCGM_FR_EC_PERF_VIOLATION
DCGM_FR_UNSUPPORTED_FIELD_TYPE           37     DCGM_ERROR_TRIAGE   DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_FIELD_THRESHOLD_TS               38     DCGM_ERROR_ISOLATE  DCGM_FR_EC_PERF_THRESHOLD
DCGM_FR_FIELD_THRESHOLD_TS_DBL           39     DCGM_ERROR_ISOLATE  DCGM_FR_EC_PERF_THRESHOLD
DCGM_FR_THERMAL_VIOLATIONS               40     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_THERMAL
DCGM_FR_THERMAL_VIOLATIONS_TS            41     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_THERMAL
DCGM_FR_TEMP_VIOLATION                   42     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_THERMAL
DCGM_FR_CLOCKS_EVENT_VIOLATION           43     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_OTHER
DCGM_FR_INTERNAL                         44     DCGM_ERROR_TRIAGE   DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_PCIE_GENERATION                  45     DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_PCIE_WIDTH                       46     DCGM_ERROR_CONFIG   DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_ABORTED                          47     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_TEST_DISABLED                    48     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_CONFIG
DCGM_FR_CANNOT_GET_STAT                  49     DCGM_ERROR_TRIAGE   DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_STRESS_LEVEL                     50     DCGM_ERROR_TRIAGE   DCGM_FR_EC_PERF_THRESHOLD
DCGM_FR_CUDA_API                         51     DCGM_ERROR_MONITOR  DCGM_FR_EC_SOFTWARE_CUDA
DCGM_FR_FAULTY_MEMORY                    52     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_CANNOT_SET_WATCHES               53     DCGM_ERROR_MONITOR  DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_CUDA_UNBOUND                     54     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_CUDA
DCGM_FR_ECC_DISABLED                     55     DCGM_ERROR_CONFIG   DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_MEMORY_ALLOC                     56     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_CUDA_DBE                         57     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_MEMORY_MISMATCH                  58     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_CUDA_DEVICE                      59     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_CUDA
DCGM_FR_ECC_UNSUPPORTED                  60     DCGM_ERROR_CONFIG   DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_ECC_PENDING                      61     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_MEMORY_BANDWIDTH                 62     DCGM_ERROR_TRIAGE   DCGM_FR_EC_PERF_THRESHOLD
DCGM_FR_TARGET_POWER                     63     DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_POWER
DCGM_FR_API_FAIL                         64     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_API_FAIL_GPU                     65     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_CUDA_CONTEXT                     66     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_CUDA
DCGM_FR_DCGM_API                         67     DCGM_ERROR_TRIAGE   DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_CONCURRENT_GPUS                  68     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_CONFIG
DCGM_FR_TOO_MANY_ERRORS                  69     DCGM_ERROR_MONITOR  DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_NVLINK_CRC_ERROR_THRESHOLD       70     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_NVLINK_ERROR_CRITICAL            71     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_ENFORCED_POWER_LIMIT             72     DCGM_ERROR_CONFIG   DCGM_FR_EC_HARDWARE_POWER
DCGM_FR_MEMORY_ALLOC_HOST                73     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_GPU_OP_MODE                      74     DCGM_ERROR_MONITOR  DCGM_FR_EC_SOFTWARE_CONFIG
DCGM_FR_NO_MEMORY_CLOCKS                 75     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_NO_GRAPHICS_CLOCKS               76     DCGM_ERROR_MONITOR  DCGM_FR_EC_HARDWARE_OTHER
DCGM_FR_HAD_TO_RESTORE_STATE             77     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_L1TAG_UNSUPPORTED                78     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_OTHER
DCGM_FR_L1TAG_MISCOMPARE                 79     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_ROW_REMAP_FAILURE                80     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_UNCONTAINED_ERROR                81     DCGM_ERROR_ISOLATE  DCGM_FR_EC_SOFTWARE_XID
DCGM_FR_EMPTY_GPU_LIST                   82     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_CONFIG
DCGM_FR_DBE_PENDING_PAGE_RETIREMENTS     83     DCGM_ERROR_RESET    DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_UNCORRECTABLE_ROW_REMAP          84     DCGM_ERROR_RESET    DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_PENDING_ROW_REMAP                85     DCGM_ERROR_RESET    DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_BROKEN_P2P_MEMORY_DEVICE         86     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_BROKEN_P2P_WRITER_DEVICE         87     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_NVSWITCH_NVLINK_DOWN             88     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_EUD_BINARY_PERMISSIONS           89     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_EUD_NON_ROOT_USER                90     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_EUD_SPAWN_FAILURE                91     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_EUD_TIMEOUT                      92     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_EUD_ZOMBIE                       93     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_EUD_NON_ZERO_EXIT_CODE           94     DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_EUD_TEST_FAILED                  95     DCGM_ERROR_ISOLATE  DCGM_FR_EC_SOFTWARE_EUD
DCGM_FR_FILE_CREATE_PERMISSIONS          96     DCGM_ERROR_CONFIG   DCGM_FR_EC_SOFTWARE_CONFIG
DCGM_FR_PAUSE_RESUME_FAILED              97     DCGM_ERROR_TRIAGE   DCGM_FR_EC_INTERNAL_OTHER
DCGM_FR_PCIE_H_REPLAY_VIOLATION          98     DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_GPU_EXPECTED_NVLINKS_UP          99     DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_NVSWITCH_EXPECTED_NVLINKS_UP     100    DCGM_ERROR_TRIAGE   DCGM_FR_EC_HARDWARE_NVLINK
DCGM_FR_XID_ERROR                        101    DCGM_ERROR_TRIAGE   DCGM_FR_EC_SOFTWARE_XID
DCGM_FR_SBE_VIOLATION                    102    DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_DBE_VIOLATION                    103    DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_PCIE_REPLAY_VIOLATION            104    DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_SBE_THRESHOLD_VIOLATION          105    DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_DBE_THRESHOLD_VIOLATION          106    DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_MEMORY
DCGM_FR_PCIE_REPLAY_THRESHOLD_VIOLATION  107    DCGM_ERROR_ISOLATE  DCGM_FR_EC_HARDWARE_PCIE
DCGM_FR_CUDA_FM_NOT_INITIALIZED          108    DCGM_ERROR_MONITOR  DCGM_FR_EC_SOFTWARE_CUDA
DCGM_FR_SXID_ERROR                       109    DCGM_ERROR_ISOLATE  DCGM_FR_EC_SOFTWARE_XID

These relationships are codified in dcgm_errors.c.
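A script that only sees a numeric error ID can recover the name, severity, and category from a local copy of this mapping. The sketch below holds a four-row excerpt of the table; `ERROR_TABLE` and `describe` are hypothetical names, and a real tool would populate the full table from the dcgm_errors.h and dcgm_errors.c shipped with the installed DCGM version rather than from a hand-copied subset.

```python
from typing import Dict, Tuple

# Excerpt only: error ID -> (error name, severity, category), copied from the table above.
ERROR_TABLE: Dict[int, Tuple[str, str, str]] = {
    3: ("DCGM_FR_PCI_REPLAY_RATE", "DCGM_ERROR_ISOLATE", "DCGM_FR_EC_HARDWARE_PCIE"),
    6: ("DCGM_FR_PENDING_PAGE_RETIREMENTS", "DCGM_ERROR_RESET", "DCGM_FR_EC_HARDWARE_MEMORY"),
    28: ("DCGM_FR_BAD_CUDA_ENV", "DCGM_ERROR_CONFIG", "DCGM_FR_EC_SOFTWARE_CUDA"),
    95: ("DCGM_FR_EUD_TEST_FAILED", "DCGM_ERROR_ISOLATE", "DCGM_FR_EC_SOFTWARE_EUD"),
}


def describe(error_id: int) -> str:
    """Return a one-line description, or a fallback for IDs missing from the excerpt."""
    entry = ERROR_TABLE.get(error_id)
    if entry is None:
        return f"error {error_id}: not in this excerpt"
    name, severity, category = entry
    return f"{name}: severity={severity}, category={category}"


print(describe(6))
print(describe(9999))
```

The explicit fallback matters because newer DCGM releases may report IDs that an older local copy of the table does not know about.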

In general, DCGM has high confidence that errors with the ISOLATE and RESET severities should be handled immediately. Other severities may require more site-specific analysis, a re-run of the diagnostic, or a scan of DCGM and system logs to determine the best course of action. Gathering and recording the failure types and rates over time can give datacenters insight into the best way to automate handling of GPU diagnostic errors.
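The guidance above can be sketched as a small triage policy: treat ISOLATE and RESET as needing immediate action, send everything else to review, and keep a running tally of what was seen. The names `ErrorSeverity`, `next_step`, and the step labels `act-now`, `review`, and `ignore` are hypothetical; the `observed` list is made-up sample input, and the actions a site takes for each label are site-specific.

```python
from collections import Counter
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Python mirror of the severity values listed earlier (DCGM_ERROR_ prefix dropped)."""
    NONE = 0
    MONITOR = 1
    ISOLATE = 2
    UNKNOWN = 3
    TRIAGE = 4
    CONFIG = 5
    RESET = 6


# Severities the text says should be handled immediately.
IMMEDIATE = {ErrorSeverity.ISOLATE, ErrorSeverity.RESET}


def next_step(severity: int) -> str:
    """Map a numeric severity to a coarse, site-defined next step."""
    sev = ErrorSeverity(severity)
    if sev in IMMEDIATE:
        return "act-now"
    if sev is ErrorSeverity.NONE:
        return "ignore"
    return "review"


# Hypothetical (error_id, severity) pairs collected from several diagnostic runs.
observed = [(52, 2), (6, 6), (10, 1), (52, 2)]

# Count how often each (error_id, next step) pair occurs.
tally = Counter((error_id, next_step(sev)) for error_id, sev in observed)
for (error_id, step), count in sorted(tally.items()):
    print(error_id, step, count)
```

Persisting a tally like this over time is one simple way to build the failure-rate history that informs later automation decisions.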