Analyzing Xid Errors with the Xid Catalog

On Volta and older GPUs, see Xid and SXid Errors with the Xid Catalog for older GPUs.

For Ampere and newer GPUs (including PCIe form-factor GPUs), a catalog of possible Xid events is available in the tables below. You can also download the spreadsheet below:

Xid Catalog Reference

Table 1 Xids

Type (XID)

Code

Mnemonic

Description

Applies to A100

Applies to H100

Applies to B100

Applies to GB200

Resolution Bucket (Immediate Action)

Resolution Bucket (Investigatory Action)

Xid 154 linkage

Trigger Conditions

XID

1

ROBUST_CHANNEL_FIFO_ERROR_FIFO_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

2

ROBUST_CHANNEL_FIFO_ERROR_SW_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

3

ROBUST_CHANNEL_FIFO_ERROR_UNK_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

4

ROBUST_CHANNEL_FIFO_ERROR_CHANNEL_BUSY

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

5

ROBUST_CHANNEL_FIFO_ERROR_RUNOUT_OVERFLOW

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

6

ROBUST_CHANNEL_FIFO_ERROR_PARSE_ERR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

7

ROBUST_CHANNEL_FIFO_ERROR_PTE_ERR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

8

ROBUST_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT

GPU stopped processing

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

9

ROBUST_CHANNEL_GR_ERROR_INSTANCE

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

10

ROBUST_CHANNEL_GR_ERROR_SINGLE_STEP

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

11

ROBUST_CHANNEL_GR_ERROR_MISSING_HW

Invalid or corrupted push buffer stream

YES

YES

YES

YES

RESTART_APP

CHECK_APP/CUDA

XID

12

ROBUST_CHANNEL_GR_ERROR_SW_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

13

ROBUST_CHANNEL_GR_EXCEPTION / ROBUST_CHANNEL_GR_ERROR_SW_NOTIFY

Graphics Engine Exception

YES

YES

YES

YES

RESTART_APP

WORKFLOW_XID_13

This event is logged for general user application faults. Typically this is an out-of-bounds error where the user has walked past the end of an array, but could also be an illegal instruction, illegal register, or other case.

In rare cases, it’s possible for a hardware failure or system software bugs to materialize as XID 13.

When this event is logged, NVIDIA recommends the following:
1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or
2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug.

NOTE: The Compute Sanitizer memcheck tool instruments the running application and reports which line of code performed the illegal read.

XID

14

ROBUST_CHANNEL_FAKE_ERROR

Unused

YES

YES

YES

YES

IGNORE

CONTACT_SUPPORT

Fake or injected error from userspace

XID

15

ROBUST_CHANNEL_SCANLINE_TIMEOUT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

16

ROBUST_CHANNEL_VBLANK_CALLBACK_TIMEOUT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

CONTACT_SUPPORT

N/A; Unused

XID

17

ROBUST_CHANNEL_PARAMETER_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

18

ROBUST_CHANNEL_BUS_MASTER_TIMEOUT_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

19

ROBUST_CHANNEL_DISP_MISSED_NOTIFIER

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

20

ROBUST_CHANNEL_MPEG_ERROR_SW_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

21

ROBUST_CHANNEL_ME_ERROR_SW_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

22

ROBUST_CHANNEL_VP_ERROR_SW_METHOD

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

23

ROBUST_CHANNEL_RC_LOGGING_ENABLED

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

24

ROBUST_CHANNEL_GR_SEMAPHORE_TIMEOUT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

25

ROBUST_CHANNEL_GR_ILLEGAL_NOTIFY

Invalid or illegal push buffer stream

YES

YES

YES

YES

RESTART_APP

CHECK_APP/CUDA

XID

26

ROBUST_CHANNEL_FIFO_ERROR_FBISTATE_TIMEOUT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

27

ROBUST_CHANNEL_VP_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

28

ROBUST_CHANNEL_VP2_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

29

ROBUST_CHANNEL_BSP_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

30

ROBUST_CHANNEL_BAD_ADDR_ACCESS

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

31

ROBUST_CHANNEL_FIFO_ERROR_MMU_ERR_FLT

GPU memory page fault

YES

YES

YES

YES

RESTART_APP

WORKFLOW_XID_31

This event is logged when a fault is reported by the MMU, such as when an illegal address access is made by an applicable unit on the chip. Typically these are application-level bugs, but can also be driver bugs or hardware bugs.

When this event is logged, NVIDIA recommends the following:
1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or
2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug.

NOTE: The Compute Sanitizer memcheck tool instruments the running application and reports which line of code performed the illegal read.

XID

32

ROBUST_CHANNEL_PBDMA_ERROR

Invalid or corrupted push buffer stream

YES

YES

YES

YES

RESTART_APP

CHECK_APP/CUDA

This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions.

XID

33

ROBUST_CHANNEL_SEC_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

34

ROBUST_CHANNEL_MSVLD_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

35

ROBUST_CHANNEL_MSPDEC_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

36

ROBUST_CHANNEL_MSPPP_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

37

ROBUST_CHANNEL_FECS_ERR_UNIMP_FIRMWARE_METHOD

Driver firmware error

YES

YES

YES

YES

IGNORE

CHECK_APP/CUDA

XID

38

ROBUST_CHANNEL_FECS_ERR_WATCHDOG_TIMEOUT

Driver firmware error

YES

YES

YES

YES

IGNORE

CONTACT_SUPPORT

XID

39

ROBUST_CHANNEL_CE0_ERROR

Copy Engine Exception

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

40

ROBUST_CHANNEL_CE1_ERROR

Copy Engine Exception

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

41

ROBUST_CHANNEL_CE2_ERROR

Copy Engine Exception

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

42

ROBUST_CHANNEL_VIC_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

43

ROBUST_CHANNEL_RESETCHANNEL_VERIF_ERROR

GPU stopped processing

YES

YES

YES

YES

IGNORE

CONTACT_SUPPORT

This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state.

In most cases, this is not indicative of a driver bug but rather a user application error.

XID

44

ROBUST_CHANNEL_GR_FAULT_DURING_CTXSW

Graphics Engine fault during context switch

YES

YES

YES

YES

IGNORE

CONTACT_SUPPORT

XID

45

ROBUST_CHANNEL_PREEMPTIVE_REMOVAL

Preemptive cleanup due to previous errors. Most likely seen when running multiple CUDA applications and hitting a DBE

YES

YES

YES

YES

WORKFLOW_XID_45

Solo: RESTART_FM Not Solo: IGNORE (follow other Xid)

This event is logged when the user application aborts and the kernel driver tears down the application running on the GPU. Ctrl-C, GPU resets, and SIGKILL are all examples where the application is aborted and this event is created.

In many cases, this is not indicative of a bug but rather a user or system action.

XID

46

ROBUST_CHANNEL_GPU_TIMEOUT_ERROR

GPU stopped processing

YES

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

XID

47

ROBUST_CHANNEL_NVENC0_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

48

ROBUST_CHANNEL_GPU_ECC_DBE

Double Bit ECC Error

YES

YES

YES

YES

WORKFLOW_XID_48

WORKFLOW_XID_48

CUDA 12.7; GPU driver R565

This event is logged when the GPU detects that an uncorrectable error occurs on the GPU. This is also reported back to the user application. A GPU reset or node reboot is needed to clear this error.

The tool nvidia-smi can provide a summary of ECC errors.

If the ECC error is reported for SRAM (excludes “framebuffer”), check for SRAM DBE thresholds and follow RMA flow if exceeded - (nvidia-smi <sram_threshold_exceeded> or NSM Msg Type 0x3, Cmd Code 0x7D, bit 0). If flag is set, run field diag.

XID

49

SILENT_RUNNING_CONSTANT_LEVEL_SET_BY_REGISTRY

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

50

SILENT_RUNNING_LEVEL_TRANSITION_DUE_TO_RC_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

51

SILENT_RUNNING_STRESS_TEST_FAILURE

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

52

SILENT_RUNNING_LEVEL_TRANS_DUE_TO_TEMP_RISE

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

53

SILENT_RUNNING_TEMP_REDUCED_CLOCKING

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

54

SILENT_RUNNING_PWR_REDUCED_CLOCKING

Auxiliary power is not connected to the GPU board

YES

YES

YES

NO

CHECK_MECHANICALS

CONTACT_SUPPORT

XID

55

SILENT_RUNNING_TEMPERATURE_READ_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

56

DISPLAY_CHANNEL_EXCEPTION

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

57

FB_LINK_TRAINING_FAILURE_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

58

FB_MEMORY_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

59

PMU_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

60

ROBUST_CHANNEL_SEC2_ERROR

Video processor exception

YES

YES

YES

YES

RESTART_APP

INVESTIGATE_SW

XID

61

PMU_BREAKPOINT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

62

PMU_HALT_ERROR

Internal micro-controller halt (newer drivers)

YES

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

CUDA 12.7; GPU driver R565

XID

63

INFOROM_DRAM_RETIREMENT_EVENT

GPU memory remapping event

YES

YES

YES

YES

IGNORE

IGNORE

CUDA 12.7; GPU driver R565

These events are logged when the GPU handles ECC memory errors on the GPU.

On GPUs that support row remapping, starting with NVIDIA® Ampere architecture GPUs, these events provide details on row remapper activity. For more information on row remapper Xids, refer to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping.

On earlier GPUs that support dynamic page retirement, these events provide details on dynamic page retirement activity. For more information on dynamic page retirement Xids, refer to https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.

XID

64

INFOROM_DRAM_RETIREMENT_FAILURE

GPU memory remapping failure

YES

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

CUDA 12.7; GPU driver R565

These events are logged when the GPU handles ECC memory errors on the GPU.

On GPUs that support row remapping, starting with NVIDIA® Ampere architecture GPUs, these events provide details on row remapper activity. For more information on row remapper Xids, refer to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping.

On earlier GPUs that support dynamic page retirement, these events provide details on dynamic page retirement activity. For more information on dynamic page retirement Xids, refer to https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.

XID

65

ROBUST_CHANNEL_NVENC1_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

66

ROBUST_CHANNEL_FECS_ERR_REG_ACCESS_VIOLATION

Illegal access by driver

YES

YES

YES

YES

IGNORE

INVESTIGATE_SW

XID

67

ROBUST_CHANNEL_FECS_ERR_VERIF_VIOLATION

Illegal access by driver

YES

YES

YES

YES

IGNORE

CONTACT_SUPPORT

XID

68

ROBUST_CHANNEL_NVDEC0_ERROR

NVDEC0 Exception

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

69

ROBUST_CHANNEL_GR_CLASS_ERROR

Graphics Engine class error

YES

YES

YES

YES

RESTART_APP

CHECK_APP/CUDA

XID

70

ROBUST_CHANNEL_CE3_ERROR

CE3: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

71

ROBUST_CHANNEL_CE4_ERROR

CE4: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

72

ROBUST_CHANNEL_CE5_ERROR

CE5: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

73

ROBUST_CHANNEL_NVENC2_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

74

NVLINK_ERROR

NVLINK Error

YES

YES

NO

NO

WORKFLOW_NVLINK_ERR

CONTACT_SUPPORT

CUDA 12.7; GPU driver R565

This event is logged when the GPU detects a problem with a connection from the GPU to another GPU or NVSwitch over NVLink. A GPU reset or node reboot is needed to clear this error.

This event may indicate a hardware failure with the link itself, or may indicate a problem with the device at the remote end of the link. For example, if a GPU fails, another GPU connected to it over NVLink may report an Xid 74 simply because the link went down as a result.

The nvidia-smi nvlink command can provide additional details on NVLink errors, and connection information on the links.

If this error is seen repeatedly and GPU reset or node reboot fails to clear the condition, contact your hardware vendor for support.

XID

75

ROBUST_CHANNEL_CE6_ERROR

CE6: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

76

ROBUST_CHANNEL_CE7_ERROR

CE7: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

77

ROBUST_CHANNEL_CE8_ERROR

CE8: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

78

VGPU_START_ERROR

vGPU Start Error

YES

YES

YES

YES

UPDATE_SWFW

UPDATE_SWFW

XID

79

ROBUST_CHANNEL_GPU_HAS_FALLEN_OFF_THE_BUS

GPU has fallen off the bus

YES

YES

YES

YES

RESTART_BM

CONTACT_SUPPORT

CUDA 12.7; GPU driver R565

This event is logged when the GPU driver attempts to access the GPU over its PCI Express connection and finds that the GPU is not accessible.

This event is often caused by hardware failures on the PCI Express link causing the GPU to be inaccessible due to the link being brought down. Reviewing system event logs and kernel PCI event logs may provide additional indications of the source of the link failures.

This event may also be caused by failing GPU hardware or other driver issues.

XID

80

PBDMA_PUSHBUFFER_CRC_MISMATCH

Corrupted data sent to GPU

YES

YES

NO

NO

RESTART_APP

CHECK_APP/CUDA

XID

81

ROBUST_CHANNEL_VGA_SUBSYSTEM_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

82

ROBUST_CHANNEL_NVJPG0_ERROR

NVJPG0 Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

83

ROBUST_CHANNEL_NVDEC1_ERROR

NVDEC1 Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

84

ROBUST_CHANNEL_NVDEC2_ERROR

NVDEC2 Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

85

ROBUST_CHANNEL_CE9_ERROR

CE9: Unknown Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

86

ROBUST_CHANNEL_OFA0_ERROR

OFA Exception

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

87

NVTELEMETRY_DRIVER_REPORT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

88

ROBUST_CHANNEL_NVDEC3_ERROR

NVDEC3 Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

89

ROBUST_CHANNEL_NVDEC4_ERROR

NVDEC4 Error

YES

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

90

LTC_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

91

RESERVED_XID

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

92

EXCESSIVE_SBE_INTERRUPTS

High single-bit ECC error rate

YES

YES

YES

YES

IGNORE

CONTACT_SUPPORT

XID

93

INFOROM_ERASE_LIMIT_EXCEEDED

Non-fatal violation of provisioned InfoROM wear limit

YES

NO

NO

NO

IGNORE

CONTACT_SUPPORT

This event is logged when the GPU driver fails to update the InfoROM due to a violation of the provisioned InfoROM wear limit that was set for the GPU with NVFlash using nvflash --elsessionstart.

In most cases this is not indicative of a driver or flash failure, but rather the intentional use of the InfoROM wear protection feature as set by NVFlash.

Recovery steps: The GPU can be recovered from Xid 93 by clearing the InfoROM erase limit using ./nvflash --elsessionclear. If clearing the limit using nvflash doesn’t help, report the issue to NVIDIA.

XID

94

ROBUST_CHANNEL_CONTAINED_ERROR

Contained memory error

YES

YES

YES

YES

RESTART_APP

IGNORE (sympathetic)

CUDA 12.7; GPU driver R565

These events (94/95) are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA A100 GPUs.

For Xid 94, these errors are contained to one application, and the application that encountered this error must be restarted. All other applications running at the time of the Xid are unaffected. It is recommended to reset the GPU when convenient. Applications can continue to be run until the reset can be performed.

One possible cause of containment errors is the handling of ECC memory errors. Review the NVIDIA GPU Memory Error Management manual: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping for coverage of ECC-triggered containment errors.

Xid 45 will be seen in relation to this error.

XID

95

ROBUST_CHANNEL_UNCONTAINED_ERROR

Uncontained memory error

YES

YES

YES

YES

RESET_GPU

IGNORE (sympathetic)

CUDA 12.7; GPU driver R565

These events (94/95) are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA® A100 GPUs.

For Xid 95, these errors affect multiple applications, and the affected GPU must be reset before applications can restart. Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU reset capabilities and limitations.

One possible cause of containment errors is the handling of ECC memory errors. Review the NVIDIA GPU Memory Error Management manual: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping for coverage of ECC-triggered containment errors.

Xid 45 will be seen in relation to this error.

XID

96

ROBUST_CHANNEL_NVDEC5_ERROR

NVDEC5 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

97

ROBUST_CHANNEL_NVDEC6_ERROR

NVDEC6 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

98

ROBUST_CHANNEL_NVDEC7_ERROR

NVDEC7 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

99

ROBUST_CHANNEL_NVJPG1_ERROR

NVJPG1 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

100

ROBUST_CHANNEL_NVJPG2_ERROR

NVJPG2 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

101

ROBUST_CHANNEL_NVJPG3_ERROR

NVJPG3 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

102

ROBUST_CHANNEL_NVJPG4_ERROR

NVJPG4 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

103

ROBUST_CHANNEL_NVJPG5_ERROR

NVJPG5 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

104

ROBUST_CHANNEL_NVJPG6_ERROR

NVJPG6 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

105

ROBUST_CHANNEL_NVJPG7_ERROR

NVJPG7 Error

NO

YES

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

106

SMBPBI_TEST_MESSAGE

SMBPBI Test Message

YES

YES

YES

YES

IGNORE

IGNORE

XID

107

SMBPBI_TEST_MESSAGE_SILENT

SMBPBI Test Message Silent

YES

YES

YES

YES

IGNORE

IGNORE

XID

108

NVLINK_REMOTE_TRANSLATION_ERROR

Unused

YES

YES

YES

YES

IGNORE

XID_137_FLOW

N/A; Unused

XID

109

ROBUST_CHANNEL_CTXSW_TIMEOUT_ERROR

Context Switch Timeout Error

YES

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

CUDA 12.7; GPU driver R570

XID

110

SEC_FAULT_ERROR

Security Fault Error

NO

YES

YES

YES

RESET_GPU

INVESTIGATE_SW

CUDA 12.7; GPU driver R565

This event should be uncommon unless there is a hardware failure. To recover, revert any recent system hardware modifications and cold reset the system. If this fails to correct the issue, contact your hardware vendor for assistance.

XID

111

BUNDLE_ERROR_EVENT

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

112

DISP_SUPERVISOR_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

113

DP_LT_FAILURE

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

114

HEAD_RG_UNDERFLOW

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

115

CORE_CHANNEL_REGS

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

116

WINDOW_CHANNEL_REGS

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

117

CURSOR_CHANNEL_REGS

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

118

HEAD_REGS

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

119

GSP_RPC_TIMEOUT

GSP RPC Timeout

YES

YES

YES

YES

RESET_GPU

INVESTIGATE_SW

These events (119/120) may be logged when an error occurs in code running on the GSP core of the GPU and/or a timeout occurs while waiting for the GSP core of the GPU to respond to an RPC message. A GPU reset or node power cycle may be needed if the error persists. If this problem reoccurs after a power cycle, follow the NVIDIA GPU Debug Guidelines document for additional debugging steps.

XID

120

GSP_ERROR

GSP Error

YES

YES

YES

YES

RESET_GPU

INVESTIGATE_SW

CUDA 12.7; GPU driver R565

These events (119/120) may be logged when an error occurs in code running on the GSP core of the GPU and/or a timeout occurs while waiting for the GSP core of the GPU to respond to an RPC message. A GPU reset or node power cycle may be needed if the error persists. If this problem reoccurs after a power cycle, follow the NVIDIA GPU Debug Guidelines document for additional debugging steps.

XID

121

C2C_ERROR

C2C Error

NO

NO

NO

YES

IGNORE

CONTACT_SUPPORT

This event may occur when the GPU driver has observed corrected errors on the C2C NVLink connection to a Grace CPU. These errors are corrected by the system and have no operational impact. Resetting the GPU at an available service window will allow the GPU to retrain the link. NOTE: repeat errors may be reported; VBIOS 97.00.90.00.00 may provide some relief from that condition

XID

122

SPI_PMU_RPC_READ_FAIL

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

123

SPI_PMU_RPC_WRITE_FAIL

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

124

SPI_PMU_RPC_ERASE_FAIL

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

125

INFOROM_FS_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

126

ROBUST_CHANNEL_CE10_ERROR

CE10: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

127

ROBUST_CHANNEL_CE11_ERROR

CE11: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

128

ROBUST_CHANNEL_CE12_ERROR

CE12: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

129

ROBUST_CHANNEL_CE13_ERROR

CE13: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

130

ROBUST_CHANNEL_CE14_ERROR

CE14: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

131

ROBUST_CHANNEL_CE15_ERROR

CE15: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

132

ROBUST_CHANNEL_CE16_ERROR

CE16: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

133

ROBUST_CHANNEL_CE17_ERROR

CE17: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

134

ROBUST_CHANNEL_CE18_ERROR

CE18: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

135

ROBUST_CHANNEL_CE19_ERROR

CE19: Unknown Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

136

ALI_TRAINING_FAIL

Link Training Failed

NO

YES

NO

NO

RESET_GPU

INVESTIGATE_LINK_SI

CUDA 12.7; GPU driver R565

XID

137

NVLINK_PRIV_ERR

NVLink Privilege Error

YES

YES

YES

YES

IGNORE

XID_137_FLOW

This event is logged when a fault is reported by the remote MMU, such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs, but can also be driver bugs or hardware bugs.

When this event is logged, NVIDIA recommends the following:
1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or
2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug.

XID

138

ROBUST_CHANNEL_DLA_ERROR

Unused

NO

NO

NO

NO

CONTACT_SUPPORT

N/A; Unused

XID

139

ROBUST_CHANNEL_OFA1_ERROR

OFA1 Error

NO

NO

YES

YES

RESTART_APP

CONTACT_SUPPORT

XID

140

UNRECOVERABLE_ECC_ERROR_ESCAPE

ECC Unrecovered Error

YES

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

This event may occur when the GPU driver has observed uncorrectable errors in GPU memory, in such a way as to interrupt the GPU driver’s ability to mark the pages for dynamic page offlining or row remapping. Reset the GPU, and if the problem persists, contact your hardware vendor for support.

XID

141

ROBUST_CHANNEL_FAST_PATH_ERROR

CUDA Fast Path Error

NO

YES

YES

YES

IGNORE

CONTACT_SUPPORT

XID

142

ROBUST_CHANNEL_NVENC3_ERROR

NVENC3 Error

NO

NO

NO

YES

CONTACT_SUPPORT

XID

143

GPU_INIT_ERROR

GPU Initialization Error

NO

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

CUDA 12.9; GPU driver R575

XID

144

NVLINK_SAW_ERROR

NVLINK: SAW Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

145

NVLINK_RLW_ERROR

NVLINK: RLW Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

146

NVLINK_TLW_ERROR

NVLINK: TLW Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

147

NVLINK_TREX_ERROR

NVLINK: TREX Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

148

NVLINK_NVLPW_CTRL_ERROR

NVLINK: NVLPW_CTRL Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

149

NVLINK_NETIR_ERROR

NVLINK: NETIR Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

150

NVLINK_MSE_ERROR

NVLINK: MSE Error

NO

NO

YES

YES

WORKFLOW_NVLINK5_ERR

WORKFLOW_NVLINK5_ERR

CUDA 12.7; GPU driver R565

XID

151

ROBUST_CHANNEL_KEY_ROTATION_ERROR

Key rotation Error

NO

YES

YES

YES

RESTART_VM

CONTACT_SUPPORT

XID

152

ROBUST_CHANNEL_DLA_SMMU_ERROR

DLA SMMU Error

NO

NO

NO

NO

IGNORE

CONTACT_SUPPORT

XID

153

ROBUST_CHANNEL_DLA_TIMEOUT

DLA timeout Error

NO

NO

NO

NO

IGNORE

CONTACT_SUPPORT

XID

154

GPU_RECOVERY_ACTION_CHANGED

GPU Recovery Action Changed

YES

YES

YES

YES

XID_154

N/A Informational only regarding another Xid

Xid 154 will be seen in conjunction with other Xids and summarizes the recovery action required for other Xids. The string will be similar to “Xid 154 GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)”, where the expected values of the text are: “None”, “Drain P2P”, “Drain and Reset”, “GPU Reset Required”, “Node Reboot Required”.

XID

155

NVLINK_SW_DEFINED_ERROR

NVLINK: SW Defined Error

NO

NO

YES

YES

RESET_GPU

INVESTIGATE_SW_USER

CUDA 12.7; GPU driver R565

Link down events which are flagged as “intentional” (including transitions to SLEEP) will trigger this Xid

XID

156

RESOURCE_RETIREMENT_EVENT

Resource Retirement Event

NO

YES

YES

YES

RESET_GPU

IGNORE

CUDA 12.7; GPU driver R565

XID

157

RESOURCE_RETIREMENT_FAILURE

Resource Retirement Failure

NO

YES

YES

YES

IGNORE

CONTACT_SUPPORT

No repairs are possible due to lack of resources. You may still run workloads or applications, but may experience the same Xid again.

XID

158

GPU_FATAL_TIMEOUT

GPU Fatal Timeout

YES

YES

YES

YES

RESET_GPU

CONTACT_SUPPORT

yes; support with Xid introduction

XID

159

ROBUST_CHANNEL_CHI_NON_DATA_ERROR

CHI Non-Data Error

NO

NO

YES

YES

CHECK_UVM

SYMPATHETIC_REPORT_SOLO

yes; support with Xid introduction

May be seen on any C2C link-connected GPU.

XID

160

CHANNEL_RETIREMENT_EVENT

Channel Retirement Event

NO

NO

YES

YES

IGNORE

INVESTIGATE_SW

CUDA 12.9; GPU driver R575

XID

161

CHANNEL_RETIREMENT_FAILURE

Channel Retirement Failure

NO

NO

YES

YES

IGNORE

INVESTIGATE_SW

CUDA 12.9; GPU driver R575

XID

162

PSHC_REENGAGED

Power Smoothing HW Circuitry capability reengaged

NO

NO

YES

YES

XID

163

PSHC_DISENGAGED

Power Smoothing HW Circuitry capability disengaged

NO

NO

YES

YES

No GPU reset required. If power smoothing functionality is desired, the customer needs to resolve the thermal events. If disabled due to timeout, reload the driver or reset the GPU.

XID

164

PSHC_LOW_LIFETIME

Power Smoothing HW Circuitry low lifetime reached

NO

NO

YES

YES

Monitor power swings and expect to replace GPUs if power smoothing is desired. Power smoothing functionality will be disabled soon. Investigate if power swings are acceptable, and if not, take action.

XID

165

PSHC_ZERO_LIFETIME

Power Smoothing HW Circuitry lifetime exhausted

NO

NO

YES

YES

Replace GPUs if power swings are not acceptable, and power smoothing is desired. Power smoothing will be disabled by the driver and power swings will occur. Analyze datacenter infrastructure to ensure ability to absorb power swings.

XID

166

NVLINK_SECURE_CRYPTO_ERR

CC traffic seen prior to link properly being configured for encrypted traffic

NO

NO

YES

YES

Applicable to CC (confidential computing) mode only.

XID

167

PCIE_FATAL_TIMEOUT

PCIE_FATAL_TIMEOUT

NO

YES

YES

YES

XID

168

REDUCED_GPU_MEMORY_CAPACITY

Errors found in WPR (write protected region)

YES

YES

YES

YES

Should only be seen when ECC is disabled. Either ECC should be enabled (to enable row-remapping) or boot re-attempted with shifted WPR.

XID

169

SEC2_HALT_ERROR

Internal micro-controller halt

NO

YES

YES

YES

XID

170

NVLINK_SECURE_OTHER

Interrupt seen in CC mode

NO

NO

YES

YES

Applicable to CC (confidential computing) mode only.

XID

171

UNCORRECTABLE_DRAM_ERROR

Logged in addition to Xid 48; provides more detail on the fault to differentiate DRAM from SRAM errors

YES

YES

YES

YES

XID

172

UNCORRECTABLE_SRAM_ERROR

Logged in addition to Xid 48; provides more detail on the fault to differentiate DRAM from SRAM errors

YES

YES

YES

YES
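The Xid 154 row of Table 1 quotes a summary string that names the previous and the new recovery action. As a minimal sketch, assuming the log text contains that phrase in the same shape as the quoted sample, the transition could be extracted as shown here. The pattern, the function name `parse_xid_154`, and the returned dictionary keys are illustrative assumptions, not part of any NVIDIA tool; real driver log lines may carry extra prefixes or differ in detail, so the pattern may need adjusting.

```python
import re

# Recovery action names quoted in the Xid 154 row of Table 1.
KNOWN_ACTIONS = {
    "None",
    "Drain P2P",
    "Drain and Reset",
    "GPU Reset Required",
    "Node Reboot Required",
}

# Assumed message shape, modeled on the sample string in the Xid 154 row.
# Only the "GPU recovery action changed ..." phrase is matched, so any
# prefix before it is ignored.
RECOVERY_PATTERN = re.compile(
    r"GPU recovery action changed from "
    r"(0x[0-9a-fA-F]+) \(([^)]+)\) to (0x[0-9a-fA-F]+) \(([^)]+)\)"
)


def parse_xid_154(message):
    """Return the old and new recovery action, or None if nothing matches."""
    match = RECOVERY_PATTERN.search(message)
    if match is None:
        return None
    old_code, old_name, new_code, new_name = match.groups()
    return {
        "old_code": int(old_code, 16),
        "old_name": old_name,
        "new_code": int(new_code, 16),
        "new_name": new_name,
        # False means the text did not use one of the names listed in the table.
        "new_name_is_known": new_name in KNOWN_ACTIONS,
    }


# Sample string taken from the Xid 154 row.
sample = (
    "Xid 154 GPU recovery action changed from 0x0 (None) "
    "to 0x2 (Node Reboot Required)"
)
result = parse_xid_154(sample)
print(result["new_code"], result["new_name"])
```

The sketch deliberately does not map numeric codes to names on its own, because the table only shows the pairing for the two values in the sample string; it reports whatever name the log line carries and flags names that are not in the listed set.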

Table 2 Xid 144-150 Decode

Xid

Subcode (V1, drivers <R575: IntrInfo[9:5]; V2, drivers >=R575: IntrInfo[6:0])

(V1(<R575)) IntrInfo decode for Data Center Recovery Action IntrInfo (binary; “-” user decode)

(V2(>=R575)) IntrInfo decode for Data Center Recovery Action IntrInfo (binary; “-” user decode)

Error Status (hex)

Resolution Bucket (Data Center Recovery Action)

(V1(<R575)) Decode for action 2

(V2(>=R575)) Decode for action 2

Action 2

Resolution Bucket (Investigatory Action)

Severity (for items with ‘*’ please see Customer User Guide tab)

HW/SW

Local/Remote (for items with ‘*’ please see Customer User Guide tab)

144

SAW_MVB

------000000----------0000100001

------000000-------------0000001

0x00000001

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none

144

SAW_MVB

------000000----------0000100001

------000000-------------0000001

0x00000002

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

144

SAW_MVB

------000000----------0000100001

------000000-------------0000001

0x00000004

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

144

SAW_MVB

------000000----------0000100001

------000000-------------0000001

0x00000008

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: XID 48; Remote: Will lead to poison or Xid94****

144

SAW_MVB

------000000----------0000100001

------000000-------------0000001

0x00000010

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

144

SAW_MVB

------000000----------0000100001

------000000-------------0000001

0x00000020

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_CTRL

------000000----------0001100010

------000000-------------0000011

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000001

XID_154_EVAL

CONTACT_SUPPORT

Non-fatal

SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000002

XID_154_EVAL

CONTACT_SUPPORT

Non-fatal

SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000004

XID_154_EVAL

CHECK_NVLINK_FAILURE_FLOW

Non-fatal

SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000008

XID_154_EVAL

CHECK_NVLINK_FAILURE_FLOW

Non-fatal

SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000010

XID_154_EVAL

CHECK_NVLINK_FAILURE_FLOW

Non-fatal

SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000020

XID_154_EVAL

CHECK_NVLINK_FAILURE_FLOW

Non-fatal

SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000040

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000080

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000100

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x00000200

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_REMAP

——000000———-0010000010

——000000————-0000100

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

145

RLW_RSPCOL

——000000———-0010100010

——000000————-0000101

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_RSPCOL

——000000———-0010100010

——000000————-0000101

0x00000002

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_RSPCOL

——000000———-0010100010

——000000————-0000101

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

145

RLW_RXPIPE

——000000———00011000010

——000000————00000110

0x00000001

IGNORE

——000000———10011000010

——000000————10000110

RESET_GPU

CONTACT_SUPPORT

Non-fatal*

SW

Local: none; Remote: PacketLoss***

145

RLW_RXPIPE

——000000———00011000010

——000000————00000110

0x00000002

IGNORE

——000000———10011000010

——000000————10000110

RESET_GPU

CONTACT_SUPPORT

Non-fatal*

SW

Local: none; Remote: PacketLoss***

145

RLW_RXPIPE

——000000———00011000010

——000000————00000110

0x00000004

IGNORE

——000000———10011000010

——000000————10000110

RESET_GPU

CONTACT_SUPPORT

Non-fatal*

SW

Local: PacketLoss***; Remote: none

145

RLW_RXPIPE

——000000———-0011000010

——000000————-0000110

0x00000008

IGNORE

CONTACT_SUPPORT

Non-fatal

HW/SW

Local: none; Remote: none

145

RLW_RXPIPE

——000000———-0011000010

——000000————-0000110

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x00000001

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x00000002

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x00000004

XID_154_EVAL

IGNORE

Non-fatal

HW/SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x00000008

XID_154_EVAL

IGNORE

Non-fatal

HW/SW

Local: XC/AppCrash (Xid 45); Remote: none

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x00000010

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x00000020

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_SRC_TRACK

——000000———-0011100010

——000000————-0000111

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

145

RLW_TAGSTATE

——000000———-0100000010

——000000————-0001000

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_TAGSTATE

——000000———-0100000010

——000000————-0001000

0x00000002

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_TAGSTATE

——000000———-0100000010

——000000————-0001000

0x00010000

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

145

RLW_TAGSTATE

——000000———-0100000010

——000000————-0001000

0x00020000

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_TAGSTATE

——000000———-0100000010

——000000————-0001000

0x00100000

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)

145

RLW_TAGSTATE

——000000———-0100000010

——000000————-0001000

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

146

TLW_CTRL

——000000———-0100100011

——000000————-0001001

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

146

TLW_CTRL

——000000———-0100100011

——000000————-0001001

0x00000002

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: XID 48; Remote: Will lead to poison or Xid94****

146

TLW_CTRL

——000000———-0100100011

——000000————-0001001

0x00000004

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible)

146

TLW_CTRL

——000000———-0100100011

——000000————-0001001

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

146

TLW_RX/TLW_RX_PIPE0

——000000———-0101000011

——000000————-0001010

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

146

TLW_RX/TLW_RX_PIPE0

——000000———-0101000011

——000000————-0001010

0x00000002

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none

146

TLW_RX/TLW_RX_PIPE0

——000000———-0101000011

——000000————-0001010

0x00000004

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible)

146

TLW_RX/TLW_RX_PIPE0

——000000———-0101000011

——000000————-0001010

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

146

TLW_RX/TLW_RX_PIPE1

——000000———-0101000011

——000000————-0001011

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

146

TLW_RX/TLW_RX_PIPE1

——000000———-0101000011

——000000————-0001011

0x00000002

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none

146

TLW_RX/TLW_RX_PIPE1

——000000———-0101000011

——000000————-0001011

0x00000004

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible)

146

TLW_RX/TLW_RX_PIPE1

——000000———-0101000011

——000000————-0001011

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

146

TLW_TX/TLW_TX_PIPE0

——000000———-0101100011

——000000————-0001100

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

146

TLW_TX/TLW_TX_PIPE0

——000000———-0101100011

——000000————-0001100

0x00000002

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none

146

TLW_TX/TLW_TX_PIPE0

——000000———-0101100011

——000000————-0001100

0x00000004

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible)

146

TLW_TX/TLW_TX_PIPE0

——000000———-0101100011

——000000————-0001100

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

146

TLW_TX/TLW_TX_PIPE1

——000000———-0101100011

——000000————-0001101

0x00000001

IGNORE

IGNORE

Non-fatal

HW

Local: none; Remote: none

146

TLW_TX/TLW_TX_PIPE1

——000000———-0101100011

——000000————-0001101

0x00000002

IGNORE

CONTACT_SUPPORT

Non-fatal

HW

Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none

146

TLW_TX/TLW_TX_PIPE1

——000000———-0101100011

——000000————-0001101

0x00000004

RESET_GPU

CONTACT_SUPPORT

Fatal

HW

Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible)

146

TLW_TX/TLW_TX_PIPE1

——000000———-0101100011

——000000————-0001101

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

147

TREX

——000000———-0110000100

——000000————-0001110

0x00000001

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

NOTE: not in production code, so should not be experienced

147

TREX

——000000———-0110000100

——000000————-0001110

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

148

NVLPW_CTRL/NVLPW

——000000———-0000000101

——000000————-0001111

0x80000000

IGNORE

CONTACT_SUPPORT

Non-fatal

SW

Local: none; Remote: none

149

NETIR/NETIR_INT

——000000———-0000000110

——000000————-0011000

RESET_GPU

SYMPATHETIC_REPORT_SOLO

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——000000———-0111000110

——000000————-0010001

RESET_GPU

SYMPATHETIC_REPORT_SOLO

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——000001———-0111000110

——000001————-0010001

RESET_GPU

REPORT_ISSUE (if seen >1 per day)

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——000010———-0111000110

——000010————-0010001

RESET_GPU

INVESTIGATE_LINK_SI

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——000100———-0111000110

——000100————-0010001

RESET_GPU

INVESTIGATE_LINK_SI

Link Fatal

HW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——001010———-0111000110

——001010————-0010001

RESET_GPU

INVESTIGATE_LINK_SI

Link Fatal

HW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——001111———-0111000110

——001111————-0010001

RESET_GPU

INVESTIGATE_SW_USER

Link Fatal

SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——010000———-0111000110

——010000————-0010001

RESET_GPU

INVESTIGATE_SW_USER_LINK_SI

Link Fatal

SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——010001———-0111000110

——010001————-0010001

RESET_GPU

INVESTIGATE_SW_USER

Link Fatal

SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——010010———-0111000110

——010010————-0010001

RESET_GPU

INVESTIGATE_SW_USER_LINK_SI

Link Fatal

SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——010101———-0111000110

——010101————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——010110———-0111000110

——010110————-0010001

RESET_GPU

INVESTIGATE_SW_USER

Link Fatal

SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100000———-0111000110

——100000————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed); Other end of link: source of link fatal

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100001———-0111000110

——100001————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed); Other end of link: source of link fatal

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100010———-0111000110

——100010————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed); Other end of link: source of link fatal

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100011———-0111000110

——100011————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed); Other end of link: source of link fatal

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100100———-0111000110

——100100————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed); Other end of link: source of link fatal

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100101———-0111000110

——100101————-0010001

RESET_GPU

INVESTIGATE_PEER_DEVICE

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed); Other end of link: source of link fatal

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——100110———-0111000110

——100110————-0010001

IGNORE

INVESTIGATE_SW/USER

Link Fatal

HW/SW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——101000———-0111000110

——101000————-0010001

IGNORE

INVESTIGATE_HOST

Fatal

SW

Local: fatal; Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——101010———-0111000110

——101010————-0010001

RESET_GPU

INVESTIGATE_LINK_SI_AND_CABLES

Link Fatal?

HW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_LINK_EVT/NETIR_LINK_DOWN

——101011———-0111000110

——101011————-0010001

RESET_GPU

INVESTIGATE_LINK_SI_AND_CABLES

Link Fatal?

HW

Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed)

149

NETIR_BER_EVENT

——000000———-1000100110

——000000————-0010011

0x00000000

IGNORE

INVESTIGATE_LINK_SI_AND_CABLES

Non-fatal

HW

Local: none; Remote: none

149

NETIR_BER_EVENT

——000000———-1000100110

——000000————-0010011

0x00000001

IGNORE

INVESTIGATE_LINK_SI_AND_CABLES

Non-fatal

HW

Local: none; Remote: none

149

NETIR_BER_EVENT

——000000———-1000100110

——000000————-0010011

0x00000002

IGNORE

INVESTIGATE_LINK_SI_AND_CABLES

Non-fatal

HW

Local: none; Remote: none

149

NETIR_BER_EVENT

——000000———-1000100110

——000000————-0010011

0x00000003

IGNORE

INVESTIGATE_LINK_SI_AND_CABLES

Non-fatal

HW

Local: none; Remote: none

149

NETIR_MFDE_EVENT

——000000———-1001000110

——000000————-0010100

0x00000001

RESET_GPU

CONTACT_SUPPORT

Fatal**

HW/SW

Local: fatal; Remote: PacketLoss***(possible/delayed)

149

NETIR_MFDE_EVENT

——000000———-1001000110

——000000————-0010100

0x00000003

IGNORE

IGNORE

Non-fatal

NA

Local: none; Remote: none

149

NETIR_MFDE_EVENT

——000000———-1001000110

——000000————-0010100

0x00000004

RESET_GPU

CONTACT_SUPPORT

Fatal**

HW/SW

Local: fatal; Remote: PacketLoss***(possible/delayed)

149

NETIR_MFDE_EVENT

——000000———-1001000110

——000000————-0010100

0x00000005

RESET_GPU

CONTACT_SUPPORT

Fatal**

HW/SW

Local: fatal; Remote: PacketLoss***(possible/delayed)

149

NETIR_MFDE_EVENT

——000000———-1001000110

——000000————-0010100

0x00000007

RESET_GPU

CONTACT_SUPPORT

Fatal**

HW/SW

Local: fatal; Remote: PacketLoss***(possible/delayed)

150

MSE Degraded

——000000———-0000000000

——000000————-0000000

0x00000000/0xFFFFFFFF

RESET_GPU

CONTACT_SUPPORT

Fatal

FW

Local: Fatal; Remote: None

150

MSE_WATCHDOG

——000000———-0000000000

——000000————-0000000

0x00000000

RESET_GPU

CONTACT_SUPPORT

Fatal

FW

Local: Fatal; Remote: None
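
Once the Xid number, subcode, and Error Status are known, reading Table 2 amounts to a row lookup. The following is a minimal sketch of that lookup, assuming an exact match on the Error Status value; only the six Xid 144 SAW_MVB rows are transcribed, and the names `TABLE2_ROWS` and `lookup_row` are hypothetical, chosen for illustration.

```python
# Subset of Table 2, transcribed from the Xid 144 SAW_MVB rows above.
# Key: (Xid, subcode, error status)
# Value: (data center recovery action, investigatory action, severity)
TABLE2_ROWS = {
    (144, "SAW_MVB", 0x00000001): ("IGNORE", "CONTACT_SUPPORT", "Non-fatal"),
    (144, "SAW_MVB", 0x00000002): ("RESET_GPU", "CONTACT_SUPPORT", "Fatal"),
    (144, "SAW_MVB", 0x00000004): ("IGNORE", "IGNORE", "Non-fatal"),
    (144, "SAW_MVB", 0x00000008): ("IGNORE", "CONTACT_SUPPORT", "Non-fatal"),
    (144, "SAW_MVB", 0x00000010): ("RESET_GPU", "CONTACT_SUPPORT", "Fatal"),
    (144, "SAW_MVB", 0x00000020): ("IGNORE", "IGNORE", "Non-fatal"),
}


def lookup_row(xid, subcode, error_status):
    """Return the transcribed row, or None when the combination is not listed.

    Assumption: the Error Status is compared for exact equality with the
    value in the table; how to treat unlisted values is left to the caller.
    """
    return TABLE2_ROWS.get((xid, subcode, error_status))


print(lookup_row(144, "SAW_MVB", 0x00000002))
print(lookup_row(144, "SAW_MVB", 0x00000040))
```

Returning None for an unlisted combination keeps the sketch from guessing at an action that the table does not state.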

Table 3 Resolution Buckets#

Guidance Class

Resolution Action

CONTACT_SUPPORT

Please contact your support organization for further investigation.

RESTART_APP

The application should be restarted

RESET_GPU or RESTART_BM is not deemed necessary.

IGNORE

No Action required

WORKFLOW_XID_45

Solo: RESTART_FM Not Solo: IGNORE (follow guidance in other Xid)

RESET_GPU

Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU Reset capabilities & limitations

RESTART_BM is not deemed necessary.

WORKFLOW_XID_48

Data Center Recovery Action: Solo: RESET_GPU. With Xid 63 or 64: DRAIN_AND_RESET

Investigatory Action: Solo: RUN_FIELDDIAG. Not Solo: the string in the error message identifies the unit that was impacted. FB: follow the Xid 63/64 guidance. All other SRAM: check the SRAM Error Threshold flag (nvidia-smi <sram_threshold_exceeded>, or NSM Msg Type 0x3, Cmd Code 0x7D, bit 0); if set, RUN_FIELDDIAG

CHECK_MECHANICALS

Check to ensure that device seating and all applicable connections to it are secure.

WORKFLOW_NVLINK_ERR

Extract the hex strings from the Xid error message. There should be seven fields in the Xid; unused fields are expected to be 0x0 rather than a full DWORD of 0’s. The first, third, fourth and fifth registers are valid for Hopper-based products. Evaluate the populated registers. If bits other than those specifically outlined below are seen, please report a bug.
First register:
Bits 0, 23, 30: Can be safely ignored.
Bits 1, 20: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Bits 4 or 5: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug.
Bits 8, 9, 12, 16, 17, 24, 28: Could possibly be a HW issue. Check link mechanical connections and re-seat if a field resolution is required. Run diags if the issue persists. If the issue persists and diagnostics have passed, please report a bug.
Bits 21 or 22: Marginal channel SI issue. If other errors accompany this Xid, follow the resolution for those first. Otherwise, check link mechanical connections. Run Field Diags and report a bug.
Bits 27, 29: If seen repeatedly, please report a bug.
Third register:
Bits 0, 1, 2, 6: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug.
Bit 13: Not expected to be seen in production. If seen, please report a bug.
Bits 16, 19: If seen repeatedly, please run Field Diags and report a bug.
Bits 17, 18: If seen repeatedly, please report a bug.
Fourth register:
Bits 16, 17: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Bit 18: These are generally sympathetic or secondary errors, though a reset of the fabric is required. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Fifth register:
Bits 18, 19, 21, 22, 24, 25, 27, 28: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug.
Bits 20, 23, 26, 29: These errors represent a threshold of ECC errors being exceeded. There was no uncorrectable error at this time. Continue operation. If desired, Field Diags can be run to check for link integrity.

UPDATE_SWFW

Update Firmware and Software to latest versions

XID 78: basic issues will keep vGPU functionality from operating and must be resolved before proceeding:
1. The guest driver version is incompatible with the host driver. In this case the error string should be “Guest driver is incompatible with host driver”.
2. The vGPU type is not compatible with the guest OS type/GPU type. For example, the user is trying to use a compute profile on an old Maxwell GPU on a Windows guest. In this case the error string should be “vGPU type is not supported”.

RESTART_BM

Restart bare metal, system should be restarted

WORKFLOW_NVLINK5_ERR

Please see the “XID 144-150 Decode” (was “Customer Doc 144-150”) tab for further guidance in evaluating these Xids. The Xid message must be decoded as follows to determine the resolution action.
Format of the Xid error message: Xid (PCI:0000:BB:DF): <Xid Number> <sub component> <fatal vs nonfatal> <Crosscontain> <injected> <link> (<intrInfo> <errorStatus> <errorDebugData[0]> <errorDebugData[1]> <errorDebugData[2]> <errorDebugData[3]> <errorDebugData[4]>)
From the above message, <intrInfo> and <errorStatus> must be decoded and evaluated using “XID 144-150 Decode” to derive the final resolution.

RESTART_VM

VM owning the affected GPU must be restarted

RESET_GPU or RESTART_BM is not deemed necessary.

XID_154

Follow XID 154 reported guidance

CHECK_UVM

If UVM/vGPU is being utilized, RESET_GPU; otherwise IGNORE

CHECK_APP/CUDA

Issue likely caused by an application passing bad data or utilizing incorrect methods in communications with GPU. Some errors will contain PID that can be used to identify source of the problem. If determined to be a driver issue then REPORT_ISSUE

WORKFLOW_XID_13

Repeat TPC and GPC, diff SMs: RUN_DCGMEUD (possible HW issue); if pass, RUN_FIELDDIAGS
Repeat TPC and GPC, single SM: RUN_DCGMEUD (possible HW issue); if pass, RUN_FIELDDIAGS
Solo, no burst: CHECK_APP/CUDA
Not Repeat TPC and GPC: CHECK_APP/CUDA
Non-prod environment: CHECK_APP/CUDA
If known good APP and Solo: REPORT_ISSUE

WORKFLOW_XID_31

Multiple runs are needed to establish a pattern.
Repeat MMU faults to same GPU (via PCI-ID): RUN_DCGMEUD (possible HW issue); if pass, RUN_FIELDDIAGS
Repeat MMU faults to diff GPU (via PCI-ID): CHECK_APP/CUDA
Solo, no burst: CHECK_APP/CUDA
If known good APP: REPORT_ISSUE

Solo: RESTART_FM Not Solo: IGNORE (follow other Xid)

Solo: RESTART_FM Not Solo: IGNORE (follow other Xid)

INVESTIGATE_SW

There is a problem with either user or NVIDIA software that needs to be investigated further. In many cases the user software may be making calls to illegal areas, poorly structured commands or other issues. This may also be a problem with NVIDIA software in which case an issue should be reported. In many cases there may be a PID that could be tracked back to the offending, originating entity.

IGNORE (sympathetic)

This is a sympathetic error that is expected to be seen with other conditions. Resolution for the other errors should be undertaken first. If this error was seen independently, or if all other resolutions are not sufficient, then REPORT_ISSUE.

XID_137_FLOW

This event is logged when a fault is reported by the remote MMU, such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs. When this event is logged, NVIDIA recommends the following:
• Run the application in cuda-gdb or cuda-memcheck. Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read.
or
• Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb.
File a bug if the previous two come back inconclusive to eliminate other possible causes.

INVESTIGATE_LINK_SI

Refer to GB200 Resiliency Service Flow for appropriate Access or Trunk link telemetry and investigation methods.

N/A Informational only regarding another Xid

Other issues need to be addressed. This Xid is informational only and is always expected to be seen with another Xid requiring a recovery action.

INVESTIGATE_SW_USER

Investigate SW or user initiated if unexpected

SYMPATHETIC_REPORT_SOLO

This is a sympathetic error that is expected to be seen with other conditions. Resolution for the other errors should be undertaken first. If this error was seen independently, or if all other resolutions are not sufficient, then REPORT_ISSUE.

XID_154_EVAL

If an XID 154 is seen along with this error, take that action. If no XID 154 present, RESTART_APP

CHECK_NVLINK_FAILURE_FLOW

Check telemetry to see if any link went down in the partition within the last 30 seconds. If so, this Xid can be ignored, as it was likely a by-product of the other conditions, which should be investigated.

If no link down indication is present, REPORT_ISSUE.

REPORT_ISSUE (if seen >1 per day)

REPORT_ISSUE (if seen >1 per day)

INVESTIGATE_SW_USER_LINK_SI

Investigate software or user intervention if not expected; additionally, follow INVESTIGATE_LINK_SI if needed

INVESTIGATE_PEER_DEVICE

A peer device experienced the issue as decoded in the Xid 144-150 table (Xid 149 in particular). Based upon the note column:
Received_TS1: will be seen when peer link down reason is unknown
Peer_side_down_to_sleep_state: investigate peer software and users if unexpected
Peer_side_down_to_disable_state: investigate peer software and users if unexpected
Peer_side_down_to_disable_and_port_lock: investigate peer software and users if unexpected
Peer_side_down_due_to_thermal_event: check switch cooling
Peer_side_down_due_to_force_event: investigate peer software and users if unexpected
Peer_side_down_due_to_reset_event: investigate peer software and users if unexpected

INVESTIGATE_HOST

Check other logs as this is likely a secondary indicator of some action or fault (may be OOB).

INVESTIGATE_LINK_SI_AND_CABLES

A more general fault that could be cable, temperature, transceiver or seating condition. Refer to GB200 Resiliency Service Flow for appropriate Access or Trunk link telemetry and investigation methods.

Table 4 Customer User Guide#

Sheet Name

Column Name

Description

XIDs

Type XID

Identifies Xid entries

Code

The Xid number

Mnemonic

String to identify the condition.

Description

More descriptive identifier for the condition (“Unused” could mean Code is deprecated or V100 or earlier)

Applies to <project>

Signifies if the Code is supported on this particular product.

Resolution Bucket (Immediate Action)

Intended to reflect the action that is immediately needed in order to recover the system and get it back into service.

Resolution Bucket (Investigatory Action)

Intended to reflect the action that is needed to investigate the fault further to try and avoid the condition occurring again. This may require FieldDiags (to check for HW issues), investigation of SI, software investigation or other steps.

Xid 154 linkage

Represents if the Code is also expected to trigger an Xid 154 condition representing the derived Data Center resolution.

Trigger Conditions

Description of when this condition may be seen or more details on possible actions to undertake.

XID 144-150 Decode

Xid

Xid number associated with the particular row. Each Xid represents a function of NVLink operation.

Subcode

The subsystem of the NVLink function. This is also presented in plain text in the Xid message (for example, NETIR_LINK_EVT). If the text string differs between revisions, the two entries are separated by a “/” (V1(<R575)/V2(>=R575)). This field is encoded in the following IntrInfo bits: V1(<R575): IntrInfo[9:5]; V2(>=R575): IntrInfo[6:0].

(V1(<R575)) IntrInfo decode for Data Center Recovery Action

Bitmask of IntrInfo for V1 messages. IntrInfo is the first register presented in the parentheses. Requires conversion of hexadecimal value to binary and applying the mask below. “-” bits are for optional user decode.

(V2(>=R575)) IntrInfo decode for Data Center Recovery Action

Bitmask of IntrInfo for V2 messages. IntrInfo is the first register presented in the parentheses. Requires conversion of hexadecimal value to binary and applying the mask below. “-” bits are for optional user decode.

Error Status (hex)

Error Status value represented by the second register presented in the parentheses.

Resolution Bucket (Immediate Action)

Intended to reflect the action that is immediately needed in order to recover the system and get it back into service.

(V1(<R575)) Decode for action 2

If needed, this will be the V1 IntrInfo decode required to undertake Action 2.

(V2(>=R575)) Decode for action 2

If needed, this will be the V2 IntrInfo decode required to undertake Action 2.

Action 2

Same meaning as Resolution Bucket (Immediate Action) above, applied when the “Decode for action 2” encoding matches.

Resolution Bucket (Investigatory Action)

Intended to reflect the action that is needed to investigate the fault further to try and avoid the condition occurring again. This may require FieldDiags (to check for HW issues), investigation of SI, software investigation or other steps.

Severity

Severity of the condition; can be Link Fatal, Fatal (GPU), or Non-fatal.
- GPU fatal will cause all links to go down and all app channels to be RC’ed. May cause Packet Loss conditions.
- Link fatal puts the GPU in a “drain and reset recommended” state until jobs are drained. After job drain, the GPU is put in a “reset required” state so no new jobs can be launched.
NOTE:
* is for promotable errors that could be non-fatal or fatal, and “Action 2” would apply.
** while these are generally expected to be fatal, severity will be present and there are possible paths where this may not occur.

HW/SW

Whether the condition is generally HW, SW, or FW related. Some conditions cannot be uniquely classified.

Local/Remote

The impacts of the condition on the local GPU as well as on remote GPU(s) that are interconnected.
NOTE:
- Applies to Xid 144-148, 150.
- Xid 149 will all be impacts to a local device (even if caused by a peer_side_down_* condition).
- XC represents “Cross Contain”.
*** Packet Loss may present as a Xid 145 RLW_SRC_TRACK; V1 IntrInfo: ——000000———-0011100010; V2 IntrInfo: ——000000————-0000111; ErrStatus 0x00000004 or 0x00000008.
**** Xid94 represents consumption of poisoned memory; Xid 48 represents ECC/DBE errors.

Guidance Classes

Guidance Class

A resolution bucket assigned to a particular type of action.

Resolution Action

Steps to be taken to resolve the error that occurred.
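
The IntrInfo decode columns described in the “XID 144-150 Decode” entries above call for converting the hexadecimal IntrInfo value to binary and comparing it against a mask in which “-” marks optional bits. The sketch below illustrates one way to do that. It assumes each mask is 32 positions wide with the most significant bit first, and that each long dash in the rendered table stands for three “-” positions. That reading is inferred from the character counts, which total 32 for both the V1 and V2 columns, and is not stated by the catalog. The function names are hypothetical.

```python
def matches_mask(intr_info, mask):
    """Return True if intr_info agrees with every 0/1 position of the mask.

    mask is a 32-character string of '0', '1' and '-' (MSB first);
    '-' positions are ignored.
    """
    if len(mask) != 32 or set(mask) - set("01-"):
        raise ValueError("mask must be 32 characters drawn from '0', '1', '-'")
    bits = format(intr_info & 0xFFFFFFFF, "032b")
    return all(m == "-" or m == b for m, b in zip(mask, bits))


def v1_subcode(intr_info):
    """Subcode field for V1 (<R575) messages: IntrInfo[9:5]."""
    return (intr_info >> 5) & 0x1F


def v2_subcode(intr_info):
    """Subcode field for V2 (>=R575) messages: IntrInfo[6:0]."""
    return intr_info & 0x7F


# Xid 144 SAW_MVB masks, rebuilt under the three-hyphens-per-dash assumption.
SAW_MVB_V1 = "-" * 6 + "000000" + "-" * 10 + "0000100001"
SAW_MVB_V2 = "-" * 6 + "000000" + "-" * 13 + "0000001"

intr_info = 0x00000021  # hypothetical V1 IntrInfo value, for illustration only
print(matches_mask(intr_info, SAW_MVB_V1), v1_subcode(intr_info))
```

A string mask is used here instead of precomputed integer masks so that the code can be checked by eye against the table rows.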

This catalog provides a detailed reference for each possible Xid, including its cause and the actions to take. The reference guide is also available for download as a spreadsheet.

The catalog is presented as a spreadsheet, with several sheets of information.

For a given Xid, use the following procedure to walk through the correct actions to take in handling the Xid.

Step 1: Determine Xid Code#

Determine the Xid Code from the Xid Message.

Each Xid message contains a single code, which follows the colon after the GPU identifier. In the following examples, the Xid Codes are 14 and 79 respectively.

[...] NVRM: Xid (0000:03:00): 14, Channel 00000001
[...] NVRM: Xid (PCI:0000:5a:00): 79, GPU has fallen off the bus.
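
Extracting the code from log lines such as those above can be automated. The following is a minimal sketch, assuming the line layout matches the two samples shown (an optional `PCI:` prefix inside the parentheses, then a colon and the decimal code). The regular expression and the `parse_xid` helper are illustrative and are not part of any NVIDIA tool.

```python
import re

# Pattern inferred from the two sample lines above; other driver versions
# may format the message differently.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?")


def parse_xid(line):
    """Return the GPU identifier, Xid code and trailing text, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "gpu": m.group(1),
        "code": int(m.group(2)),
        "detail": (m.group(3) or "").strip(),
    }


samples = [
    "[...] NVRM: Xid (0000:03:00): 14, Channel 00000001",
    "[...] NVRM: Xid (PCI:0000:5a:00): 79, GPU has fallen off the bus.",
]
for s in samples:
    print(parse_xid(s))
```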

Step 2: Review Xid Classification from the Xid Catalog#

In the Xid Catalog, open the “Xids” sheet and find the row whose “Code” matches the Xid Code from Step 1.

For example, for Xid 79:

_images/example-Xid79.png

For each row, the catalog provides a brief description of the Xid in the “Description” column, as well as applicability to different revisions of GPU in the “Applies to” columns.

Note that some Xid codes are deprecated on more recent GPU models. These Xids are listed as “Unused” for the description, indicating they may be deprecated and applicable to V100 or earlier GPUs.

Step 3: Determine Data Center and Investigatory Actions#

The Xid Catalog provides two different actions for handling an Xid.

Immediate Action:

The “Resolution Bucket - Immediate Action” column in the Xid Catalog provides an immediate action that should be performed to recover a system after an Xid is observed. It is intended as an automatable action that administrators can use to recover the system from the Xid error and ready it for new applications.

Investigatory Action:

The immediate (data center) action is intended to recover the system. In some cases, however, a persistent failure causes the Xid to reoccur. A more detailed investigation into the cause is then required, which helps identify underlying hardware, firmware, or software failures that need longer-term corrective action.

If the issue reoccurs, or is not expected, the Investigatory Action column provides guidance on actions to take to investigate the issue.

Step 4: Determine Resolution Steps#

Both the Data Center Action and Investigatory Action columns give a short Resolution Bucket name that stands for a set of common actions, which may be shared by different Xid codes. The actual steps to take for each bucket are defined on the Fault Resolution Buckets worksheet.

For example, if the data center action indicates that the fault resolution bucket is “RESET_GPU”, the row for RESET_GPU in the Fault Resolution Buckets worksheet provides guidance on the exact actions to take.

_images/fault-resolution-bucket.png

Similarly, the resolution steps for investigatory actions are presented on the same worksheet.
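
Steps 2 through 4 can be sketched as two table lookups: Xid code to resolution bucket, then bucket to resolution steps. The example below is illustrative only. It copies three rows (Xid 8, 11 and 13) from the Xids table and two entries from the Resolution Buckets table. The `resolve` function and its `recurring` flag are hypothetical names, and using recurrence to pick the investigatory column is a simplification of the guidance in Step 3.

```python
# Code -> (immediate action, investigatory action), from the Xids table.
XID_ROWS = {
    8: ("RESTART_APP", "CONTACT_SUPPORT"),
    11: ("RESTART_APP", "CHECK_APP/CUDA"),
    13: ("RESTART_APP", "WORKFLOW_XID_13"),
}

# Guidance class -> resolution action, from the Resolution Buckets table.
BUCKETS = {
    "RESTART_APP": "The application should be restarted",
    "CONTACT_SUPPORT": "Please contact your support organization for further investigation.",
}


def resolve(code, recurring=False):
    """Return (bucket, steps) for an Xid code, or None if it is not listed.

    recurring=False selects the immediate action; recurring=True selects
    the investigatory action (a simplification made for this sketch).
    """
    row = XID_ROWS.get(code)
    if row is None:
        return None
    bucket = row[1] if recurring else row[0]
    steps = BUCKETS.get(bucket, "See the Resolution Buckets table for " + bucket)
    return bucket, steps


print(resolve(13))
print(resolve(13, recurring=True))
```

Keeping the bucket name separate from its step text mirrors the catalog layout, where many Xid rows share one resolution bucket.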