Analyzing Xid Errors with the Xid Catalog
On Volta and older GPUs, see Xid and SXid Errors with the Xid Catalog for older GPUs.
For Ampere and newer GPUs (including PCIe form-factor GPUs), a catalog of possible Xid events is available in the tables below. The catalog can also be downloaded as a spreadsheet.
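Several rows in the catalog carry an Xid 154 linkage, and the Xid 154 row quotes the message that reports the required recovery action. As a minimal sketch, assuming log text shaped like that quoted example, the following Python extracts the old and new recovery actions. The function name `parse_recovery_action` and the regular expression are illustrative assumptions, not part of any NVIDIA tool, and real driver messages may carry different surrounding text.

```python
import re

# Assumed message shape, taken from the example quoted in the Xid 154 row:
#   "... GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)"
_RECOVERY_RE = re.compile(
    r"GPU recovery action changed from "
    r"(0x[0-9a-fA-F]+) \(([^)]+)\) to (0x[0-9a-fA-F]+) \(([^)]+)\)"
)


def parse_recovery_action(message):
    """Return the old and new recovery actions, or None if the text does not match."""
    match = _RECOVERY_RE.search(message)
    if match is None:
        return None
    old_code, old_name, new_code, new_name = match.groups()
    return {
        "old": (int(old_code, 16), old_name),
        "new": (int(new_code, 16), new_name),
    }


example = (
    "Xid 154 GPU recovery action changed from 0x0 (None) "
    "to 0x2 (Node Reboot Required)"
)
print(parse_recovery_action(example))
# {'old': (0, 'None'), 'new': (2, 'Node Reboot Required')}
```

Returning `None` for non-matching text keeps the helper safe to run over every log line; a caller would then act only on the `new` action.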
Type (XID) |
Code |
Mnemonic |
Description |
Applies to A100 |
Applies to H100 |
Applies to B100 |
Applies to GB200 |
Resolution Bucket (Immediate Action) |
Resolution Bucket (Investigatory Action) |
Xid 154 linkage |
Trigger Conditions |
---|---|---|---|---|---|---|---|---|---|---|---|
XID |
1 |
ROBUST_CHANNEL_FIFO_ERROR_FIFO_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
2 |
ROBUST_CHANNEL_FIFO_ERROR_SW_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
3 |
ROBUST_CHANNEL_FIFO_ERROR_UNK_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
4 |
ROBUST_CHANNEL_FIFO_ERROR_CHANNEL_BUSY |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
5 |
ROBUST_CHANNEL_FIFO_ERROR_RUNOUT_OVERFLOW |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
6 |
ROBUST_CHANNEL_FIFO_ERROR_PARSE_ERR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
7 |
ROBUST_CHANNEL_FIFO_ERROR_PTE_ERR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
8 |
ROBUST_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT |
GPU stopped processing |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
9 |
ROBUST_CHANNEL_GR_ERROR_INSTANCE |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
10 |
ROBUST_CHANNEL_GR_ERROR_SINGLE_STEP |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
11 |
ROBUST_CHANNEL_GR_ERROR_MISSING_HW |
Invalid or corrupted push buffer stream |
YES |
YES |
YES |
YES |
RESTART_APP |
CHECK_APP/CUDA |
||
XID |
12 |
ROBUST_CHANNEL_GR_ERROR_SW_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
13 |
ROBUST_CHANNEL_GR_EXCEPTION / ROBUST_CHANNEL_GR_ERROR_SW_NOTIFY |
Graphics Engine Exception |
YES |
YES |
YES |
YES |
RESTART_APP |
WORKFLOW_XID_13 |
This event is logged for general user application faults. Typically this is an out-of-bounds error where the user has walked past the end of an array, but it could also be an illegal instruction, illegal register, or other case. In rare cases, a hardware failure or system software bug can surface as Xid 13. When this event is logged, NVIDIA recommends the following: 1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or 2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or 3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug. NOTE: The Compute Sanitizer memcheck tool instruments the running application and reports which line of code performed the illegal read. |
|
XID |
14 |
ROBUST_CHANNEL_FAKE_ERROR |
Unused |
YES |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
Fake or injected error from userspace |
|
XID |
15 |
ROBUST_CHANNEL_SCANLINE_TIMEOUT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
16 |
ROBUST_CHANNEL_VBLANK_CALLBACK_TIMEOUT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
CONTACT_SUPPORT |
N/A; Unused |
|
XID |
17 |
ROBUST_CHANNEL_PARAMETER_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
18 |
ROBUST_CHANNEL_BUS_MASTER_TIMEOUT_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
19 |
ROBUST_CHANNEL_DISP_MISSED_NOTIFIER |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
20 |
ROBUST_CHANNEL_MPEG_ERROR_SW_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
21 |
ROBUST_CHANNEL_ME_ERROR_SW_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
22 |
ROBUST_CHANNEL_VP_ERROR_SW_METHOD |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
23 |
ROBUST_CHANNEL_RC_LOGGING_ENABLED |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
24 |
ROBUST_CHANNEL_GR_SEMAPHORE_TIMEOUT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
25 |
ROBUST_CHANNEL_GR_ILLEGAL_NOTIFY |
Invalid or illegal push buffer stream |
YES |
YES |
YES |
YES |
RESTART_APP |
CHECK_APP/CUDA |
||
XID |
26 |
ROBUST_CHANNEL_FIFO_ERROR_FBISTATE_TIMEOUT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
27 |
ROBUST_CHANNEL_VP_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
28 |
ROBUST_CHANNEL_VP2_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
29 |
ROBUST_CHANNEL_BSP_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
30 |
ROBUST_CHANNEL_BAD_ADDR_ACCESS |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
31 |
ROBUST_CHANNEL_FIFO_ERROR_MMU_ERR_FLT |
GPU memory page fault |
YES |
YES |
YES |
YES |
RESTART_APP |
WORKFLOW_XID_31 |
This event is logged when a fault is reported by the MMU, such as when an illegal address access is made by an applicable unit on the chip. Typically these are application-level bugs, but they can also be driver bugs or hardware bugs. When this event is logged, NVIDIA recommends the following: 1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or 2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or 3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug. NOTE: The Compute Sanitizer memcheck tool instruments the running application and reports which line of code performed the illegal read. |
|
XID |
32 |
ROBUST_CHANNEL_PBDMA_ERROR |
Invalid or corrupted push buffer stream |
YES |
YES |
YES |
YES |
RESTART_APP |
CHECK_APP/CUDA |
This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions. |
|
XID |
33 |
ROBUST_CHANNEL_SEC_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
34 |
ROBUST_CHANNEL_MSVLD_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
35 |
ROBUST_CHANNEL_MSPDEC_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
36 |
ROBUST_CHANNEL_MSPPP_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
37 |
ROBUST_CHANNEL_FECS_ERR_UNIMP_FIRMWARE_METHOD |
Driver firmware error |
YES |
YES |
YES |
YES |
IGNORE |
CHECK_APP/CUDA |
||
XID |
38 |
ROBUST_CHANNEL_FECS_ERR_WATCHDOG_TIMEOUT |
Driver firmware error |
YES |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
||
XID |
39 |
ROBUST_CHANNEL_CE0_ERROR |
Copy Engine Exception |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
40 |
ROBUST_CHANNEL_CE1_ERROR |
Copy Engine Exception |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
41 |
ROBUST_CHANNEL_CE2_ERROR |
Copy Engine Exception |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
42 |
ROBUST_CHANNEL_VIC_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
43 |
ROBUST_CHANNEL_RESETCHANNEL_VERIF_ERROR |
GPU stopped processing |
YES |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state. In most cases, this is not indicative of a driver bug but rather a user application error. |
|
XID |
44 |
ROBUST_CHANNEL_GR_FAULT_DURING_CTXSW |
Graphics Engine fault during context switch |
YES |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
||
XID |
45 |
ROBUST_CHANNEL_PREEMPTIVE_REMOVAL |
Preemptive cleanup, due to previous errors. Most likely seen when running multiple CUDA applications and hitting a DBE |
YES |
YES |
YES |
YES |
WORKFLOW_XID_45 |
Solo: RESTART_FM; Not Solo: IGNORE (follow other Xid) |
This event is logged when the user application aborts and the kernel driver tears down the GPU application running on the GPU. Control-C, GPU resets, and SIGKILL are all examples where the application is aborted and this event is created. In many cases, this is not indicative of a bug but rather a user or system action. |
|
XID |
46 |
ROBUST_CHANNEL_GPU_TIMEOUT_ERROR |
GPU stopped processing |
YES |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
||
XID |
47 |
ROBUST_CHANNEL_NVENC0_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
48 |
ROBUST_CHANNEL_GPU_ECC_DBE |
Double Bit ECC Error |
YES |
YES |
YES |
YES |
WORKFLOW_XID_48 |
WORKFLOW_XID_48 |
CUDA 12.7; GPU driver R565 |
This event is logged when the GPU detects an uncorrectable error on the GPU. This is also reported back to the user application. A GPU reset or node reboot is needed to clear this error. The tool nvidia-smi can provide a summary of ECC errors. If the ECC error is reported for SRAM (excludes “framebuffer”), check for SRAM DBE thresholds and follow the RMA flow if exceeded (nvidia-smi <sram_threshold_exceeded> or NSM Msg Type 0x3, Cmd Code 0x7D, bit 0). If the flag is set, run field diag. |
XID |
49 |
SILENT_RUNNING_CONSTANT_LEVEL_SET_BY_REGISTRY |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
50 |
SILENT_RUNNING_LEVEL_TRANSITION_DUE_TO_RC_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
51 |
SILENT_RUNNING_STRESS_TEST_FAILURE |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
52 |
SILENT_RUNNING_LEVEL_TRANS_DUE_TO_TEMP_RISE |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
53 |
SILENT_RUNNING_TEMP_REDUCED_CLOCKING |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
54 |
SILENT_RUNNING_PWR_REDUCED_CLOCKING |
Auxiliary power is not connected to the GPU board |
YES |
YES |
YES |
NO |
CHECK_MECHANICALS |
CONTACT_SUPPORT |
||
XID |
55 |
SILENT_RUNNING_TEMPERATURE_READ_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
56 |
DISPLAY_CHANNEL_EXCEPTION |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
57 |
FB_LINK_TRAINING_FAILURE_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
58 |
FB_MEMORY_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
59 |
PMU_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
60 |
ROBUST_CHANNEL_SEC2_ERROR |
Video processor exception |
YES |
YES |
YES |
YES |
RESTART_APP |
INVESTIGATE_SW |
||
XID |
61 |
PMU_BREAKPOINT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
62 |
PMU_HALT_ERROR |
Internal micro-controller halt (newer drivers) |
YES |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
CUDA 12.7; GPU driver R565 |
|
XID |
63 |
INFOROM_DRAM_RETIREMENT_EVENT |
GPU memory remapping event |
YES |
YES |
YES |
YES |
IGNORE |
IGNORE |
CUDA 12.7; GPU driver R565 |
These events are logged when the GPU handles ECC memory errors on the GPU. On GPUs that support row remapping, starting with NVIDIA® Ampere architecture GPUs, these events provide details on row remapper activity. For more information on row remapper Xids, refer to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. On earlier GPUs that support dynamic page retirement, these events provide details on dynamic page retirement activity. For more information on dynamic page retirement Xids, refer to https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html. |
XID |
64 |
INFOROM_DRAM_RETIREMENT_FAILURE |
GPU memory remapping failure |
YES |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
CUDA 12.7; GPU driver R565 |
These events are logged when the GPU handles ECC memory errors on the GPU. On GPUs that support row remapping, starting with NVIDIA® Ampere architecture GPUs, these events provide details on row remapper activity. For more information on row remapper Xids, refer to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. On earlier GPUs that support dynamic page retirement, these events provide details on dynamic page retirement activity. For more information on dynamic page retirement Xids, refer to https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html. |
XID |
65 |
ROBUST_CHANNEL_NVENC1_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
66 |
ROBUST_CHANNEL_FECS_ERR_REG_ACCESS_VIOLATION |
Illegal access by driver |
YES |
YES |
YES |
YES |
IGNORE |
INVESTIGATE_SW |
||
XID |
67 |
ROBUST_CHANNEL_FECS_ERR_VERIF_VIOLATION |
Illegal access by driver |
YES |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
||
XID |
68 |
ROBUST_CHANNEL_NVDEC0_ERROR |
NVDEC0 Exception |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
69 |
ROBUST_CHANNEL_GR_CLASS_ERROR |
Graphics Engine class error |
YES |
YES |
YES |
YES |
RESTART_APP |
CHECK_APP/CUDA |
||
XID |
70 |
ROBUST_CHANNEL_CE3_ERROR |
CE3: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
71 |
ROBUST_CHANNEL_CE4_ERROR |
CE4: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
72 |
ROBUST_CHANNEL_CE5_ERROR |
CE5: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
73 |
ROBUST_CHANNEL_NVENC2_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
74 |
NVLINK_ERROR |
NVLINK Error |
YES |
YES |
NO |
NO |
WORKFLOW_NVLINK_ERR |
CONTACT_SUPPORT |
CUDA 12.7; GPU driver R565 |
This event is logged when the GPU detects a problem with a connection from the GPU to another GPU or NVSwitch over NVLink. A GPU reset or node reboot is needed to clear this error. This event may indicate a hardware failure with the link itself, or may indicate a problem with the device at the remote end of the link. For example, if a GPU fails, another GPU connected to it over NVLink may report an Xid 74 simply because the link went down as a result. The nvidia-smi nvlink command can provide additional details on NVLink errors, and connection information on the links. If this error is seen repeatedly and GPU reset or node reboot fails to clear the condition, contact your hardware vendor for support. |
XID |
75 |
ROBUST_CHANNEL_CE6_ERROR |
CE6: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
76 |
ROBUST_CHANNEL_CE7_ERROR |
CE7: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
77 |
ROBUST_CHANNEL_CE8_ERROR |
CE8: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
78 |
VGPU_START_ERROR |
vGPU Start Error |
YES |
YES |
YES |
YES |
UPDATE_SWFW |
UPDATE_SWFW |
||
XID |
79 |
ROBUST_CHANNEL_GPU_HAS_FALLEN_OFF_THE_BUS |
GPU has fallen off the bus |
YES |
YES |
YES |
YES |
RESTART_BM |
CONTACT_SUPPORT |
CUDA 12.7; GPU driver R565 |
This event is logged when the GPU driver attempts to access the GPU over its PCI Express connection and finds that the GPU is not accessible. This event is often caused by hardware failures on the PCI Express link causing the GPU to be inaccessible due to the link being brought down. Reviewing system event logs and kernel PCI event logs may provide additional indications of the source of the link failures. This event may also be caused by failing GPU hardware or other driver issues. |
XID |
80 |
PBDMA_PUSHBUFFER_CRC_MISMATCH |
Corrupted data sent to GPU |
YES |
YES |
NO |
NO |
RESTART_APP |
CHECK_APP/CUDA |
||
XID |
81 |
ROBUST_CHANNEL_VGA_SUBSYSTEM_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
82 |
ROBUST_CHANNEL_NVJPG0_ERROR |
NVJPG0 Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
83 |
ROBUST_CHANNEL_NVDEC1_ERROR |
NVDEC1 Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
84 |
ROBUST_CHANNEL_NVDEC2_ERROR |
NVDEC2 Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
85 |
ROBUST_CHANNEL_CE9_ERROR |
CE9: Unknown Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
86 |
ROBUST_CHANNEL_OFA0_ERROR |
OFA Exception |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
87 |
NVTELEMETRY_DRIVER_REPORT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
88 |
ROBUST_CHANNEL_NVDEC3_ERROR |
NVDEC3 Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
89 |
ROBUST_CHANNEL_NVDEC4_ERROR |
NVDEC4 Error |
YES |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
90 |
LTC_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
91 |
RESERVED_XID |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
92 |
EXCESSIVE_SBE_INTERRUPTS |
High single-bit ECC error rate |
YES |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
||
XID |
93 |
INFOROM_ERASE_LIMIT_EXCEEDED |
Non-fatal violation of provisioned InfoROM wear limit |
YES |
NO |
NO |
NO |
IGNORE |
CONTACT_SUPPORT |
This event is logged when the GPU driver fails to update the InfoROM due to violation of the provisioned InfoROM wear limit that was set for the GPU with NVFlash using nvflash --elsessionstart. In most cases this is not indicative of a driver or flash failure, but rather the intentional use of the InfoROM wear protection feature as set by NVFlash. Recovery steps: The GPU can be recovered from Xid 93 by clearing the InfoROM erase limit using ./nvflash --elsessionclear. If clearing the limit using nvflash doesn’t help, report the issue to NVIDIA. |
|
XID |
94 |
ROBUST_CHANNEL_CONTAINED_ERROR |
Contained memory error |
YES |
YES |
YES |
YES |
RESTART_APP |
IGNORE (sympathetic) |
CUDA 12.7; GPU driver R565 |
These events (94/95) are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA A100 GPUs. For Xid 94, these errors are contained to one application, and the application that encountered this error must be restarted. All other applications running at the time of the Xid are unaffected. It is recommended to reset the GPU when convenient. Applications can continue to be run until the reset can be performed. One possible cause of containment errors is the handling of ECC memory errors. Review the NVIDIA GPU Memory Error Management manual: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping for coverage of ECC-triggered containment errors. Xid 45 will be seen in relation to this error. |
XID |
95 |
ROBUST_CHANNEL_UNCONTAINED_ERROR |
Uncontained memory error |
YES |
YES |
YES |
YES |
RESET_GPU |
IGNORE (sympathetic) |
CUDA 12.7; GPU driver R565 |
These events (94/95) are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA® A100 GPUs. For Xid 95, these errors affect multiple applications, and the affected GPU must be reset before applications can restart. Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU reset capabilities and limitations. One possible cause of containment errors is the handling of ECC memory errors. Review the NVIDIA GPU Memory Error Management manual: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping for coverage of ECC-triggered containment errors. Xid 45 will be seen in relation to this error. |
XID |
96 |
ROBUST_CHANNEL_NVDEC5_ERROR |
NVDEC5 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
97 |
ROBUST_CHANNEL_NVDEC6_ERROR |
NVDEC6 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
98 |
ROBUST_CHANNEL_NVDEC7_ERROR |
NVDEC7 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
99 |
ROBUST_CHANNEL_NVJPG1_ERROR |
NVJPG1 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
100 |
ROBUST_CHANNEL_NVJPG2_ERROR |
NVJPG2 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
101 |
ROBUST_CHANNEL_NVJPG3_ERROR |
NVJPG3 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
102 |
ROBUST_CHANNEL_NVJPG4_ERROR |
NVJPG4 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
103 |
ROBUST_CHANNEL_NVJPG5_ERROR |
NVJPG5 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
104 |
ROBUST_CHANNEL_NVJPG6_ERROR |
NVJPG6 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
105 |
ROBUST_CHANNEL_NVJPG7_ERROR |
NVJPG7 Error |
NO |
YES |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
106 |
SMBPBI_TEST_MESSAGE |
SMBPBI Test Message |
YES |
YES |
YES |
YES |
IGNORE |
IGNORE |
||
XID |
107 |
SMBPBI_TEST_MESSAGE_SILENT |
SMBPBI Test Message Silent |
YES |
YES |
YES |
YES |
IGNORE |
IGNORE |
||
XID |
108 |
NVLINK_REMOTE_TRANSLATION_ERROR |
Unused |
YES |
YES |
YES |
YES |
IGNORE |
XID_137_FLOW |
N/A; Unused |
|
XID |
109 |
ROBUST_CHANNEL_CTXSW_TIMEOUT_ERROR |
Context Switch Timeout Error |
YES |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
CUDA 12.7; GPU driver R570 |
|
XID |
110 |
SEC_FAULT_ERROR |
Security Fault Error |
NO |
YES |
YES |
YES |
RESET_GPU |
INVESTIGATE_SW |
CUDA 12.7; GPU driver R565 |
This event should be uncommon unless there is a hardware failure. To recover, revert any recent system hardware modifications and cold reset the system. If this fails to correct the issue, contact your hardware vendor for assistance. |
XID |
111 |
BUNDLE_ERROR_EVENT |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
112 |
DISP_SUPERVISOR_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
113 |
DP_LT_FAILURE |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
114 |
HEAD_RG_UNDERFLOW |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
115 |
CORE_CHANNEL_REGS |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
116 |
WINDOW_CHANNEL_REGS |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
117 |
CURSOR_CHANNEL_REGS |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
118 |
HEAD_REGS |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
119 |
GSP_RPC_TIMEOUT |
GSP RPC Timeout |
YES |
YES |
YES |
YES |
RESET_GPU |
INVESTIGATE_SW |
These events (119/120) may be logged when an error occurs in code running on the GSP core of the GPU and/or a timeout occurs while waiting for the GSP core of the GPU to respond to an RPC message. A GPU reset or node power cycle may be needed if the error persists. If this problem reoccurs after a power cycle, follow the NVIDIA GPU Debug Guidelines document for additional debugging steps. |
|
XID |
120 |
GSP_ERROR |
GSP Error |
YES |
YES |
YES |
YES |
RESET_GPU |
INVESTIGATE_SW |
CUDA 12.7; GPU driver R565 |
These events (119/120) may be logged when an error occurs in code running on the GSP core of the GPU and/or a timeout occurs while waiting for the GSP core of the GPU to respond to an RPC message. A GPU reset or node power cycle may be needed if the error persists. If this problem reoccurs after a power cycle, follow the NVIDIA GPU Debug Guidelines document for additional debugging steps. |
XID |
121 |
C2C_ERROR |
C2C Error |
NO |
NO |
NO |
YES |
IGNORE |
CONTACT_SUPPORT |
This event may occur when the GPU driver has observed corrected errors on the C2C NVLink connection to a Grace CPU. These errors are corrected by the system and have no operational impact. Resetting the GPU at an available service window will allow the GPU to retrain the link. NOTE: Repeated errors may be reported; VBIOS 97.00.90.00.00 may provide some relief from that condition. |
|
XID |
122 |
SPI_PMU_RPC_READ_FAIL |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
123 |
SPI_PMU_RPC_WRITE_FAIL |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
124 |
SPI_PMU_RPC_ERASE_FAIL |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
125 |
INFOROM_FS_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
126 |
ROBUST_CHANNEL_CE10_ERROR |
CE10: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
127 |
ROBUST_CHANNEL_CE11_ERROR |
CE11: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
128 |
ROBUST_CHANNEL_CE12_ERROR |
CE12: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
129 |
ROBUST_CHANNEL_CE13_ERROR |
CE13: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
130 |
ROBUST_CHANNEL_CE14_ERROR |
CE14: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
131 |
ROBUST_CHANNEL_CE15_ERROR |
CE15: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
132 |
ROBUST_CHANNEL_CE16_ERROR |
CE16: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
133 |
ROBUST_CHANNEL_CE17_ERROR |
CE17: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
134 |
ROBUST_CHANNEL_CE18_ERROR |
CE18: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
135 |
ROBUST_CHANNEL_CE19_ERROR |
CE19: Unknown Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
136 |
ALI_TRAINING_FAIL |
Link Training Failed |
NO |
YES |
NO |
NO |
RESET_GPU |
INVESTIGATE_LINK_SI |
CUDA 12.7; GPU driver R565 |
|
XID |
137 |
NVLINK_PRIV_ERR |
NVLink Privilege Error |
YES |
YES |
YES |
YES |
IGNORE |
XID_137_FLOW |
This event is logged when a fault is reported by the remote MMU, such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs, but can also be driver bugs or hardware bugs. When this event is logged, NVIDIA recommends the following: 1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or 2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or 3. File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug. |
|
XID |
138 |
ROBUST_CHANNEL_DLA_ERROR |
Unused |
NO |
NO |
NO |
NO |
CONTACT_SUPPORT |
N/A; Unused |
||
XID |
139 |
ROBUST_CHANNEL_OFA1_ERROR |
OFA1 Error |
NO |
NO |
YES |
YES |
RESTART_APP |
CONTACT_SUPPORT |
||
XID |
140 |
UNRECOVERABLE_ECC_ERROR_ESCAPE |
ECC Unrecovered Error |
YES |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
This event may occur when the GPU driver has observed uncorrectable errors in GPU memory, in such a way as to interrupt the GPU driver’s ability to mark the pages for dynamic page offlining or row remapping. Reset the GPU, and if the problem persists, contact your hardware vendor for support. |
|
XID |
141 |
ROBUST_CHANNEL_FAST_PATH_ERROR |
CUDA Fast Path Error |
NO |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
||
XID |
142 |
ROBUST_CHANNEL_NVENC3_ERROR |
NVENC3 Error |
NO |
NO |
NO |
YES |
CONTACT_SUPPORT |
|||
XID |
143 |
GPU_INIT_ERROR |
GPU Initialization Error |
NO |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
CUDA 12.9; GPU driver R575 |
|
XID |
144 |
NVLINK_SAW_ERROR |
NVLINK: SAW Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
145 |
NVLINK_RLW_ERROR |
NVLINK: RLW Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
146 |
NVLINK_TLW_ERROR |
NVLINK: TLW Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
147 |
NVLINK_TREX_ERROR |
NVLINK: TREX Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
148 |
NVLINK_NVLPW_CTRL_ERROR |
NVLINK: NVLPW_CTRL Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
149 |
NVLINK_NETIR_ERROR |
NVLINK: NETIR Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
150 |
NVLINK_MSE_ERROR |
NVLINK: MSE Error |
NO |
NO |
YES |
YES |
WORKFLOW_NVLINK5_ERR |
WORKFLOW_NVLINK5_ERR |
CUDA 12.7; GPU driver R565 |
|
XID |
151 |
ROBUST_CHANNEL_KEY_ROTATION_ERROR |
Key rotation Error |
NO |
YES |
YES |
YES |
RESTART_VM |
CONTACT_SUPPORT |
||
XID |
152 |
ROBUST_CHANNEL_DLA_SMMU_ERROR |
DLA SMMU Error |
NO |
NO |
NO |
NO |
IGNORE |
CONTACT_SUPPORT |
||
XID |
153 |
ROBUST_CHANNEL_DLA_TIMEOUT |
DLA timeout Error |
NO |
NO |
NO |
NO |
IGNORE |
CONTACT_SUPPORT |
||
XID |
154 |
GPU_RECOVERY_ACTION_CHANGED |
GPU Recovery Action Changed |
YES |
YES |
YES |
YES |
XID_154 |
N/A Informational only regarding another Xid |
Xid 154 will be seen in conjunction with other Xids and summarizes the recovery action required for other Xids. The string will be similar to “Xid 154 GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)”, where the expected values of the text are: “None”, “Drain P2P”, “Drain and Reset”, “GPU Reset Required”, “Node Reboot Required”. |
|
XID |
155 |
NVLINK_SW_DEFINED_ERROR |
NVLINK: SW Defined Error |
NO |
NO |
YES |
YES |
RESET_GPU |
INVESTIGATE_SW_USER |
CUDA 12.7; GPU driver R565 |
Link down events which are flagged as “intentional” (including transitions to SLEEP) will trigger this Xid |
XID |
156 |
RESOURCE_RETIREMENT_EVENT |
Resource Retirement Event |
NO |
YES |
YES |
YES |
RESET_GPU |
IGNORE |
CUDA 12.7; GPU driver R565 |
|
XID |
157 |
RESOURCE_RETIREMENT_FAILURE |
Resource Retirement Failure |
NO |
YES |
YES |
YES |
IGNORE |
CONTACT_SUPPORT |
No repairs are possible due to lack of resources. You may still run workloads or applications, but may experience the same Xid again. |
|
XID |
158 |
GPU_FATAL_TIMEOUT |
GPU Fatal Timeout |
YES |
YES |
YES |
YES |
RESET_GPU |
CONTACT_SUPPORT |
yes; support with Xid introduction |
|
XID |
159 |
ROBUST_CHANNEL_CHI_NON_DATA_ERROR |
CHI Non-Data Error |
NO |
NO |
YES |
YES |
CHECK_UVM |
SYMPATHETIC_REPORT_SOLO |
yes; support with Xid introduction |
May be seen on any C2C link-connected GPU. |
XID |
160 |
CHANNEL_RETIREMENT_EVENT |
Channel Retirement Event |
NO |
NO |
YES |
YES |
IGNORE |
INVESTIGATE_SW |
CUDA 12.9; GPU driver R575 |
|
XID |
161 |
CHANNEL_RETIREMENT_FAILURE |
Channel Retirement Failure |
NO |
NO |
YES |
YES |
IGNORE |
INVESTIGATE_SW |
CUDA 12.9; GPU driver R575 |
|
XID |
162 |
PSHC_REENGAGED |
Power Smoothing HW Circuitry capability reengaged |
NO |
NO |
YES |
YES |
||||
XID |
163 |
PSHC_DISENGAGED |
Power Smoothing HW Circuitry capability disengaged |
NO |
NO |
YES |
YES |
No GPU reset required. If power smoothing functionality is desired, the customer needs to resolve the thermal events. If disabled due to timeout, reload the driver or reset the GPU. |
|||
XID |
164 |
PSHC_LOW_LIFETIME |
Power Smoothing HW Circuitry low lifetime reached |
NO |
NO |
YES |
YES |
Monitor power swings and expect to replace GPUs if power smoothing is desired. Power smoothing functionality will be disabled soon. Investigate if power swings are acceptable, and if not, take action. |
|||
XID |
165 |
PSHC_ZERO_LIFETIME |
Power Smoothing HW Circuitry lifetime exhausted |
NO |
NO |
YES |
YES |
Replace GPUs if power swings are not acceptable, and power smoothing is desired. Power smoothing will be disabled by the driver and power swings will occur. Analyze datacenter infrastructure to ensure ability to absorb power swings. |
|||
XID |
166 |
NVLINK_SECURE_CRYPTO_ERR |
CC traffic seen prior to link properly being configured for encrypted traffic |
NO |
NO |
YES |
YES |
Applicable to CC (confidential computing) mode only. |
|||
XID |
167 |
PCIE_FATAL_TIMEOUT |
PCIE_FATAL_TIMEOUT |
NO |
YES |
YES |
YES |
||||
XID |
168 |
REDUCED_GPU_MEMORY_CAPACITY |
Errors found in WPR (write protected region) |
YES |
YES |
YES |
YES |
Should only be seen when ECC is disabled. Either ECC should be enabled (to enable row-remapping) or boot re-attempted with shifted WPR. |
|||
XID |
169 |
SEC2_HALT_ERROR |
Internal micro-controller halt |
NO |
YES |
YES |
YES |
||||
XID |
170 |
NVLINK_SECURE_OTHER |
Interrupt seen in CC mode |
NO |
NO |
YES |
YES |
Applicable to CC (confidential computing) mode only. |
|||
XID |
171 |
UNCORRECTABLE_DRAM_ERROR |
Supplements Xid 48 with further details on the fault, to differentiate DRAM from SRAM errors |
YES |
YES |
YES |
YES |
||||
XID |
172 |
UNCORRECTABLE_SRAM_ERROR |
Supplements Xid 48 with further details on the fault, to differentiate DRAM from SRAM errors |
YES |
YES |
YES |
YES |
Xid |
Subcode V1(<R575)/V2(>=R575); V1(<R575): IntrInfo[9:5]; V2(>=R575): IntrInfo[6:0] |
(V1(<R575)) IntrInfo decode for Data Center Recovery Action IntrInfo (binary; “-” user decode) |
(V2(>=R575)) IntrInfo decode for Data Center Recovery Action IntrInfo (binary; “-” user decode) |
Error Status (hex) |
Resolution Bucket (Data Center Recovery Action) |
(V1(<R575)) Decode for action 2 |
(V2(>=R575)) Decode for action 2 |
Action 2 |
Resolution Bucket (Investigatory Action) |
Severity (for items with ‘*’ please see Customer User Guide tab) |
HW/SW |
Local/Remote (for items with ‘*’ please see Customer User Guide tab) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
144 |
SAW_MVB |
------000000----------0000100001 |
------000000-------------0000001 |
0x00000001 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none |
|||
144 |
SAW_MVB |
------000000----------0000100001 |
------000000-------------0000001 |
0x00000002 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
144 |
SAW_MVB |
------000000----------0000100001 |
------000000-------------0000001 |
0x00000004 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
144 |
SAW_MVB |
------000000----------0000100001 |
------000000-------------0000001 |
0x00000008 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: XID 48; Remote: Will lead to poison or Xid94**** |
|||
144 |
SAW_MVB |
------000000----------0000100001 |
------000000-------------0000001 |
0x00000010 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
144 |
SAW_MVB |
------000000----------0000100001 |
------000000-------------0000001 |
0x00000020 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_CTRL |
------000000----------0001100010 |
------000000-------------0000011 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000001 |
XID_154_EVAL |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000002 |
XID_154_EVAL |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000004 |
XID_154_EVAL |
CHECK_NVLINK_FAILURE_FLOW |
Non-fatal |
SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000008 |
XID_154_EVAL |
CHECK_NVLINK_FAILURE_FLOW |
Non-fatal |
SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000010 |
XID_154_EVAL |
CHECK_NVLINK_FAILURE_FLOW |
Non-fatal |
SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000020 |
XID_154_EVAL |
CHECK_NVLINK_FAILURE_FLOW |
Non-fatal |
SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000040 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000080 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000100 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x00000200 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_REMAP |
------000000----------0010000010 |
------000000-------------0000100 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
145 |
RLW_RSPCOL |
------000000----------0010100010 |
------000000-------------0000101 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_RSPCOL |
------000000----------0010100010 |
------000000-------------0000101 |
0x00000002 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_RSPCOL |
------000000----------0010100010 |
------000000-------------0000101 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
145 |
RLW_RXPIPE |
------000000---------00011000010 |
------000000------------00000110 |
0x00000001 |
IGNORE |
------000000---------10011000010 |
------000000------------10000110 |
RESET_GPU |
CONTACT_SUPPORT |
Non-fatal* |
SW |
Local: none; Remote: PacketLoss*** |
145 |
RLW_RXPIPE |
------000000---------00011000010 |
------000000------------00000110 |
0x00000002 |
IGNORE |
------000000---------10011000010 |
------000000------------10000110 |
RESET_GPU |
CONTACT_SUPPORT |
Non-fatal* |
SW |
Local: none; Remote: PacketLoss*** |
145 |
RLW_RXPIPE |
------000000---------00011000010 |
------000000------------00000110 |
0x00000004 |
IGNORE |
------000000---------10011000010 |
------000000------------10000110 |
RESET_GPU |
CONTACT_SUPPORT |
Non-fatal* |
SW |
Local: PacketLoss***; Remote: none |
145 |
RLW_RXPIPE |
------000000----------0011000010 |
------000000-------------0000110 |
0x00000008 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW/SW |
Local: none; Remote: none |
|||
145 |
RLW_RXPIPE |
------000000----------0011000010 |
------000000-------------0000110 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x00000001 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x00000002 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x00000004 |
XID_154_EVAL |
IGNORE |
Non-fatal |
HW/SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x00000008 |
XID_154_EVAL |
IGNORE |
Non-fatal |
HW/SW |
Local: XC/AppCrash (Xid 45); Remote: none |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x00000010 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x00000020 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_SRC_TRACK |
------000000----------0011100010 |
------000000-------------0000111 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
145 |
RLW_TAGSTATE |
------000000----------0100000010 |
------000000-------------0001000 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_TAGSTATE |
------000000----------0100000010 |
------000000-------------0001000 |
0x00000002 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_TAGSTATE |
------000000----------0100000010 |
------000000-------------0001000 |
0x00010000 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
145 |
RLW_TAGSTATE |
------000000----------0100000010 |
------000000-------------0001000 |
0x00020000 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_TAGSTATE |
------000000----------0100000010 |
------000000-------------0001000 |
0x00100000 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible) |
|||
145 |
RLW_TAGSTATE |
------000000----------0100000010 |
------000000-------------0001000 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
146 |
TLW_CTRL |
------000000----------0100100011 |
------000000-------------0001001 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
146 |
TLW_CTRL |
------000000----------0100100011 |
------000000-------------0001001 |
0x00000002 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: XID 48; Remote: Will lead to poison or Xid94**** |
|||
146 |
TLW_CTRL |
------000000----------0100100011 |
------000000-------------0001001 |
0x00000004 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible) |
|||
146 |
TLW_CTRL |
------000000----------0100100011 |
------000000-------------0001001 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
146 |
TLW_RX/TLW_RX_PIPE0 |
------000000----------0101000011 |
------000000-------------0001010 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
146 |
TLW_RX/TLW_RX_PIPE0 |
------000000----------0101000011 |
------000000-------------0001010 |
0x00000002 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none |
|||
146 |
TLW_RX/TLW_RX_PIPE0 |
------000000----------0101000011 |
------000000-------------0001010 |
0x00000004 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible) |
|||
146 |
TLW_RX/TLW_RX_PIPE0 |
------000000----------0101000011 |
------000000-------------0001010 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
146 |
TLW_RX/TLW_RX_PIPE1 |
------000000----------0101000011 |
------000000-------------0001011 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
146 |
TLW_RX/TLW_RX_PIPE1 |
------000000----------0101000011 |
------000000-------------0001011 |
0x00000002 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none |
|||
146 |
TLW_RX/TLW_RX_PIPE1 |
------000000----------0101000011 |
------000000-------------0001011 |
0x00000004 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible) |
|||
146 |
TLW_RX/TLW_RX_PIPE1 |
------000000----------0101000011 |
------000000-------------0001011 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
146 |
TLW_TX/TLW_TX_PIPE0 |
------000000----------0101100011 |
------000000-------------0001100 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
146 |
TLW_TX/TLW_TX_PIPE0 |
------000000----------0101100011 |
------000000-------------0001100 |
0x00000002 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none |
|||
146 |
TLW_TX/TLW_TX_PIPE0 |
------000000----------0101100011 |
------000000-------------0001100 |
0x00000004 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible) |
|||
146 |
TLW_TX/TLW_TX_PIPE0 |
------000000----------0101100011 |
------000000-------------0001100 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
146 |
TLW_TX/TLW_TX_PIPE1 |
------000000----------0101100011 |
------000000-------------0001101 |
0x00000001 |
IGNORE |
IGNORE |
Non-fatal |
HW |
Local: none; Remote: none |
|||
146 |
TLW_TX/TLW_TX_PIPE1 |
------000000----------0101100011 |
------000000-------------0001101 |
0x00000002 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
HW |
Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none |
|||
146 |
TLW_TX/TLW_TX_PIPE1 |
------000000----------0101100011 |
------000000-------------0001101 |
0x00000004 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
HW |
Local: XID 48, PacketLoss***(possible); Remote: PacketLoss***(possible) |
|||
146 |
TLW_TX/TLW_TX_PIPE1 |
------000000----------0101100011 |
------000000-------------0001101 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
147 |
TREX |
------000000----------0110000100 |
------000000-------------0001110 |
0x00000001 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
NOTE: not in production code, so should not be experienced |
|||
147 |
TREX |
------000000----------0110000100 |
------000000-------------0001110 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
148 |
NVLPW_CTRL/NVLPW |
------000000----------0000000101 |
------000000-------------0001111 |
0x80000000 |
IGNORE |
CONTACT_SUPPORT |
Non-fatal |
SW |
Local: none; Remote: none |
|||
149 |
NETIR/NETIR_INT |
------000000----------0000000110 |
------000000-------------0011000 |
RESET_GPU |
SYMPATHETIC_REPORT_SOLO |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------000000----------0111000110 |
------000000-------------0010001 |
RESET_GPU |
SYMPATHETIC_REPORT_SOLO |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------000001----------0111000110 |
------000001-------------0010001 |
RESET_GPU |
REPORT_ISSUE (if seen >1 per day) |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------000010----------0111000110 |
------000010-------------0010001 |
RESET_GPU |
INVESTIGATE_LINK_SI |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------000100----------0111000110 |
------000100-------------0010001 |
RESET_GPU |
INVESTIGATE_LINK_SI |
Link Fatal |
HW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------001010----------0111000110 |
------001010-------------0010001 |
RESET_GPU |
INVESTIGATE_LINK_SI |
Link Fatal |
HW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------001111----------0111000110 |
------001111-------------0010001 |
RESET_GPU |
INVESTIGATE_SW_USER |
Link Fatal |
SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------010000----------0111000110 |
------010000-------------0010001 |
RESET_GPU |
INVESTIGATE_SW_USER_LINK_SI |
Link Fatal |
SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------010001----------0111000110 |
------010001-------------0010001 |
RESET_GPU |
INVESTIGATE_SW_USER |
Link Fatal |
SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------010010----------0111000110 |
------010010-------------0010001 |
RESET_GPU |
INVESTIGATE_SW_USER_LINK_SI |
Link Fatal |
SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------010101----------0111000110 |
------010101-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------010110----------0111000110 |
------010110-------------0010001 |
RESET_GPU |
INVESTIGATE_SW_USER |
Link Fatal |
SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100000----------0111000110 |
------100000-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed). Other end of link: source of link fatal |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100001----------0111000110 |
------100001-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed). Other end of link: source of link fatal |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100010----------0111000110 |
------100010-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed). Other end of link: source of link fatal |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100011----------0111000110 |
------100011-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed). Other end of link: source of link fatal |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100100----------0111000110 |
------100100-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed). Other end of link: source of link fatal |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100101----------0111000110 |
------100101-------------0010001 |
RESET_GPU |
INVESTIGATE_PEER_DEVICE |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed). Other end of link: source of link fatal |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------100110----------0111000110 |
------100110-------------0010001 |
IGNORE |
INVESTIGATE_SW/USER |
Link Fatal |
HW/SW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------101000----------0111000110 |
------101000-------------0010001 |
IGNORE |
INVESTIGATE_HOST |
Fatal |
SW |
Local: fatal; Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------101010----------0111000110 |
------101010-------------0010001 |
RESET_GPU |
INVESTIGATE_LINK_SI_AND_CABLES |
Link Fatal? |
HW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_LINK_EVT/NETIR_LINK_DOWN |
------101011----------0111000110 |
------101011-------------0010001 |
RESET_GPU |
INVESTIGATE_LINK_SI_AND_CABLES |
Link Fatal? |
HW |
Local: PacketLoss***(possible/delayed); Remote: PacketLoss***(possible/delayed) |
||||
149 |
NETIR_BER_EVENT |
------000000----------1000100110 |
------000000-------------0010011 |
0x00000000 |
IGNORE |
INVESTIGATE_LINK_SI_AND_CABLES |
Non-fatal |
HW |
Local: none; Remote: none |
|||
149 |
NETIR_BER_EVENT |
------000000----------1000100110 |
------000000-------------0010011 |
0x00000001 |
IGNORE |
INVESTIGATE_LINK_SI_AND_CABLES |
Non-fatal |
HW |
Local: none; Remote: none |
|||
149 |
NETIR_BER_EVENT |
------000000----------1000100110 |
------000000-------------0010011 |
0x00000002 |
IGNORE |
INVESTIGATE_LINK_SI_AND_CABLES |
Non-fatal |
HW |
Local: none; Remote: none |
|||
149 |
NETIR_BER_EVENT |
------000000----------1000100110 |
------000000-------------0010011 |
0x00000003 |
IGNORE |
INVESTIGATE_LINK_SI_AND_CABLES |
Non-fatal |
HW |
Local: none; Remote: none |
|||
149 |
NETIR_MFDE_EVENT |
------000000----------1001000110 |
------000000-------------0010100 |
0x00000001 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal** |
HW/SW |
Local: fatal; Remote: PacketLoss***(possible/delayed) |
|||
149 |
NETIR_MFDE_EVENT |
------000000----------1001000110 |
------000000-------------0010100 |
0x00000003 |
IGNORE |
IGNORE |
Non-fatal |
NA |
Local: none; Remote: none |
|||
149 |
NETIR_MFDE_EVENT |
------000000----------1001000110 |
------000000-------------0010100 |
0x00000004 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal** |
HW/SW |
Local: fatal; Remote: PacketLoss***(possible/delayed) |
|||
149 |
NETIR_MFDE_EVENT |
------000000----------1001000110 |
------000000-------------0010100 |
0x00000005 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal** |
HW/SW |
Local: fatal; Remote: PacketLoss***(possible/delayed) |
|||
149 |
NETIR_MFDE_EVENT |
------000000----------1001000110 |
------000000-------------0010100 |
0x00000007 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal** |
HW/SW |
Local: fatal; Remote: PacketLoss***(possible/delayed) |
|||
150 |
MSE Degraded |
------000000----------0000000000 |
------000000-------------0000000 |
0x00000000/0xFFFFFFFF |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
FW |
Local: Fatal; Remote: None |
|||
150 |
MSE_WATCHDOG |
------000000----------0000000000 |
------000000-------------0000000 |
0x00000000 |
RESET_GPU |
CONTACT_SUPPORT |
Fatal |
FW |
Local: Fatal; Remote: None |
Guidance Class |
Resolution Action |
---|---|
CONTACT_SUPPORT |
Please contact your support organization for further investigation. |
RESTART_APP |
The application should be restarted. RESET_GPU or RESTART_BM is not deemed necessary. |
IGNORE |
No Action required |
WORKFLOW_XID_45 |
Solo: RESTART_FM Not Solo: IGNORE (follow guidance in other Xid) |
RESET_GPU |
Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU reset capabilities and limitations. RESTART_BM is not deemed necessary. |
WORKFLOW_XID_48 |
Data Center Recovery Action: Solo: RESET_GPU. With Xid 63 or 64: DRAIN_AND_RESET. Investigatory Action: Solo: RUN_FIELDDIAG. Not Solo: the string in the error message identifies the impacted unit. FB: follow Xid 63/64 guidance. All other SRAM: check the SRAM Error Threshold flag (nvidia-smi <sram_threshold_exceeded>, or NSM Msg Type 0x3, Cmd Code 0x7D, bit 0). If set, RUN_FIELDDIAG. |
CHECK_MECHANICALS |
Check to ensure that device seating and all applicable connections to it are secure. |
WORKFLOW_NVLINK_ERR |
Extract the hex strings from the Xid error message. Note that there should be seven fields in the Xid. Unused fields are expected to be 0x0 rather than a full DWORD of zeros. The first, third, fourth and fifth registers are valid for Hopper-based products. Evaluate the populated registers. If bits other than those specifically outlined below are seen, please report a bug. First register: Bits 0, 23, 30: Can be safely ignored. Bits 1, 20: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug. Bits 4 or 5: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug. Bits 8, 9, 12, 16, 17, 24, 28: Could possibly be a HW issue. Check link mechanical connections and re-seat if a field resolution is required. Run diags if the issue persists. If the issue persists and diagnostics have passed, please report a bug. Bits 21 or 22: Marginal channel SI issue. If other errors accompany this Xid, follow the resolution for those first. Otherwise, check link mechanical connections. Run Field Diags and report a bug. Bits 27, 29: If seen repeatedly, please report a bug. Third register: Bits 0, 1, 2, 6: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug. Bit 13: Not expected to be seen in production. If seen, please report a bug. Bits 16, 19: If seen repeatedly, please run Field Diags and report a bug. Bits 17, 18: If seen repeatedly, please report a bug. Fourth register: Bits 16, 17: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug. Bit 18: These are generally sympathetic or secondary errors, though a reset of the fabric is required. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Fifth register: Bits 18, 19, 21, 22, 24, 25, 27, 28: Likely HW issue with ECC/Parity. If seen more than 2 times on the same link, report a bug. Bits 20, 23, 26, 29: These errors represent a threshold of ECC errors being exceeded. There was no uncorrectable error at this time. Continue operation. If desired, Field Diags can be run to check for link integrity. |
UPDATE_SWFW |
Update firmware and software to the latest versions. XID 78: basic issues keep vGPU functionality from operating and must be resolved before proceeding. 1. The guest driver version is incompatible with the host driver. In this case the error string should be “Guest driver is incompatible with host driver”. 2. The vGPU type is not compatible with the guest OS type or GPU type. For example, the user is trying to use a compute profile on an old Maxwell GPU with a Windows guest. In this case the error string should be “vGPU type is not supported”. |
RESTART_BM |
Restart bare metal: the system should be restarted. |
WORKFLOW_NVLINK5_ERR |
Please see the “XID 144-150 Decode” (was “Customer Doc 144-150”) tab for further guidance in evaluating these Xids. To determine the resolution action, the XID message must be decoded as follows. Format of the XID error message: Xid (PCI:0000:BB:DF): <Xid Number> <sub component> <fatal vs nonfatal> <Crosscontain> <injected> <link> (<intrInfo> <errorStatus> <errorDebugData[0]> <errorDebugData[1]> <errorDebugData[2]> <errorDebugData[3]> <errorDebugData[4]>) From the above message, <intrInfo> and <errorStatus> must be decoded and evaluated using “XID 144-150 Decode” to derive the final resolution. |
RESTART_VM |
The VM owning the affected GPU must be restarted. RESET_GPU or RESTART_BM is not deemed necessary. |
XID_154 |
Follow XID 154 reported guidance |
CHECK_UVM |
If UVM/vGPU is being utilized, RESET_GPU; otherwise IGNORE |
CHECK_APP/CUDA |
The issue is likely caused by an application passing bad data or using incorrect methods to communicate with the GPU. Some errors contain a PID that can be used to identify the source of the problem. If determined to be a driver issue, then REPORT_ISSUE. |
WORKFLOW_XID_13 |
Repeat TPC and GPC, diff SMs: RUN_DCGMEUD (possible HW issue); if pass RUN_FIELDDIAGS. Repeat TPC and GPC, single SM: RUN_DCGMEUD (possible HW issue); if pass RUN_FIELDDIAGS. Solo, no burst: CHECK_APP/CUDA. Not Repeat TPC and GPC: CHECK_APP/CUDA. Non-prod environment: CHECK_APP/CUDA. If known good APP and Solo: REPORT_ISSUE. |
WORKFLOW_XID_31 |
Multiple runs are needed to establish a pattern. Repeat MMU faults to same GPU (via PCI-ID): RUN_DCGMEUD (possible HW issue); if pass RUN_FIELDDIAGS. Repeat MMU faults to diff GPU (via PCI-ID): CHECK_APP/CUDA. Solo, no burst: CHECK_APP/CUDA. If known good APP: REPORT_ISSUE. |
Solo: RESTART_FM Not Solo: IGNORE (follow other Xid) |
Solo: RESTART_FM Not Solo: IGNORE (follow other Xid) |
INVESTIGATE_SW |
There is a problem with either user or NVIDIA software that needs to be investigated further. In many cases the user software may be accessing illegal areas, issuing poorly structured commands, or have other issues. This may also be a problem with NVIDIA software, in which case an issue should be reported. In many cases there may be a PID that can be traced back to the originating entity. |
IGNORE (sympathetic) |
This is a sympathetic error that is expected to be seen with other conditions. Resolution for the other errors should be undertaken first. If this error was seen independently or all other resolutions aren’t sufficient, then REPORT_ISSUE. |
XID_137_FLOW |
This event is logged when a fault is reported by the remote MMU, such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs. When this event is logged, NVIDIA recommends one of the following: • Run the application in cuda-gdb or cuda-memcheck. Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read. • Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb. If both approaches come back inconclusive, file a bug to eliminate other possible causes. |
INVESTIGATE_LINK_SI |
Refer to GB200 Resiliency Service Flow for appropriate Access or Trunk link telemetry and investigation methods. |
N/A Informational only regarding another Xid |
Other issues need to be addressed. This Xid is informational only and is always expected to be seen with another Xid requiring a recovery action. |
INVESTIGATE_SW_USER |
Investigate SW or user initiated if unexpected |
SYMPATHETIC_REPORT_SOLO |
This is a sympathetic error that is expected to be seen with other conditions. Resolution for the other errors should be undertaken first. If this error was seen independently or all other resolutions aren’t sufficient, then REPORT_ISSUE. |
XID_154_EVAL |
If an XID 154 is seen along with this error, take the action that XID 154 reports. If no XID 154 is present, RESTART_APP. |
CHECK_NVLINK_FAILURE_FLOW |
Check telemetry to see if any link went down in the partition within the last 30 seconds. If so, this Xid can be ignored, as it was likely a by-product of the other conditions, which should be investigated. If no link down indication is present, REPORT_ISSUE. |
REPORT_ISSUE (if seen >1 per day) |
REPORT_ISSUE (if seen >1 per day) |
INVESTIGATE_SW_USER_LINK_SI |
Investigate software or user intervention if not expected; additionally, follow INVESTIGATE_LINK_SI if needed |
INVESTIGATE_PEER_DEVICE |
A peer device experienced the issue as decoded in the Xid 144-150 table (Xid 149 in particular). Based upon the note column: Received_TS1: will be seen when peer link down reason is unknown. Peer_side_down_to_sleep_state: investigate peer software and users if unexpected. Peer_side_down_to_disable_state: investigate peer software and users if unexpected. Peer_side_down_to_disable_and_port_lock: investigate peer software and users if unexpected. Peer_side_down_due_to_thermal_event: check switch cooling. Peer_side_down_due_to_force_event: investigate peer software and users if unexpected. Peer_side_down_due_to_reset_event: investigate peer software and users if unexpected. |
INVESTIGATE_HOST |
Check other logs as this is likely a secondary indicator of some action or fault (may be OOB). |
INVESTIGATE_LINK_SI_AND_CABLES |
A more general fault that could be cable, temperature, transceiver or seating condition. Refer to GB200 Resiliency Service Flow for appropriate Access or Trunk link telemetry and investigation methods. |
INVESTIGATE_SW_USER |
Investigate SW or user initiated if unexpected |
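The bit-by-bit evaluation described under WORKFLOW_NVLINK_ERR can be sketched in code. The following is a minimal Python illustration for the first register only, assuming the register has already been read from the Xid message as an integer. The helper names (`set_bits`, `classify_first_register`) and the group labels are hypothetical and chosen for this example; only the bit numbers come from the table above.

```python
from typing import Dict, List

# Bit groups for the first register, transcribed from the WORKFLOW_NVLINK_ERR row.
# The group names are illustrative labels, not NVIDIA terminology.
FIRST_REGISTER_GROUPS = {
    "ignore": {0, 23, 30},
    "sympathetic": {1, 20},
    "ecc_parity": {4, 5},
    "possible_hw": {8, 9, 12, 16, 17, 24, 28},
    "marginal_si": {21, 22},
    "report_if_repeated": {27, 29},
}


def set_bits(value: int) -> List[int]:
    """Return the positions of all set bits in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


def classify_first_register(value: int) -> Dict[str, List[int]]:
    """Group the set bits of the first register by guidance category."""
    result: Dict[str, List[int]] = {}
    for bit in set_bits(value):
        # Bits outside the documented groups fall under "unlisted";
        # the guidance for those is to report a bug.
        label = "unlisted"
        for name, bits in FIRST_REGISTER_GROUPS.items():
            if bit in bits:
                label = name
                break
        result.setdefault(label, []).append(bit)
    return result


# Bits 0 and 4 set: one ignorable bit and one ECC/parity bit.
print(classify_first_register(0x00000011))  # {'ignore': [0], 'ecc_parity': [4]}
```

The third, fourth and fifth registers could be handled the same way with their own bit groups. The per-group follow-up (repeat counts, accompanying Xids) still has to be applied by the operator.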
Sheet Name |
Column Name |
Description |
---|---|---|
XIDs |
Type XID |
Identifies Xid entries |
Code |
The Xid number |
|
Mnemonic |
String to identify the condition. |
|
Description |
More descriptive identifier for the condition (“Unused” could mean Code is deprecated or V100 or earlier) |
|
Applies to <project> |
Signifies if the Code is supported on this particular product. |
|
Resolution Bucket (Immediate Action) |
Intended to reflect the action that is immediately needed in order to recover the system and get it back into service. |
|
Resolution Bucket (Investigatory Action) |
Intended to reflect the action that is needed to investigate the fault further to try and avoid the condition occurring again. This may require FieldDiags (to check for HW issues), investigation of SI, software investigation or other steps. |
|
Xid 154 linkage |
Represents if the Code is also expected to trigger an Xid 154 condition representing the derived Data Center resolution. |
|
Trigger Conditions |
Description of when this condition may be seen or more details on possible actions to undertake. |
|
XID 144-150 Decode |
Xid |
Xid number associated with the particular row. Each Xid represents a function of NVLink operation. |
Subcode |
The subsystem of the NVLink function. This is also presented in plain text in the Xid message (ex: NETIR_LINK_EVT). If the text string differs between revisions, the two entries are separated by a “/” (V1(<R575)/V2(>=R575)). This field is encoded in the following IntrInfo bits: V1(<R575): IntrInfo[9:5]; V2(>=R575): IntrInfo[6:0]. |
|
(V1(<R575)) IntrInfo decode for Data Center Recovery Action |
Bitmask of IntrInfo for V1 messages. IntrInfo is the first register presented in the parentheses. Requires conversion of hexadecimal value to binary and applying the mask below. “-” bits are for optional user decode. |
|
(V2(>=R575)) IntrInfo decode for Data Center Recovery Action |
Bitmask of IntrInfo for V2 messages. IntrInfo is the first register presented in the parentheses. Requires conversion of hexadecimal value to binary and applying the mask below. “-” bits are for optional user decode. |
|
Error Status (hex) |
Error Status value represented by the second register presented in the parentheses. |
|
Resolution Bucket (Immediate Action) |
Intended to reflect the action that is immediately needed in order to recover the system and get it back into service. |
|
(V1(<R575)) Decode for action 2 |
If needed, this will be the V1 IntrInfo decode required to undertake Action 2. |
|
(V2(>=R575)) Decode for action 2 |
If needed, this will be the V2 IntrInfo decode required to undertake Action 2. |
|
Action 2 |
Same as Resolution Bucket (Immediate Action) above, but applies when IntrInfo matches the “Decode for action 2” encoding. |
|
Resolution Bucket (Investigatory Action) |
Intended to reflect the action that is needed to investigate the fault further to try and avoid the condition occurring again. This may require FieldDiags (to check for HW issues), investigation of SI, software investigation or other steps. |
|
Severity |
Severity of the condition: Link Fatal, Fatal (GPU), or non-fatal. - GPU fatal causes all links to go down and all app channels to be RC’ed; it may cause Packet Loss conditions. - Link fatal puts the GPU in a “drain and reset recommended” state until jobs are drained. After job drain, the GPU is put in a “reset required” state so no new jobs can be launched. NOTE: * marks promotable errors that could be non-fatal or fatal, and for which “Action 2” would apply. ** While these are generally expected to be fatal, severity will be present and there are possible paths where this may not occur. |
|
HW/SW |
Whether the condition is generally HW, SW, or FW related. Some conditions cannot be uniquely classified. |
|
Local/Remote |
The impact of the condition on the local GPU as well as on the interconnected remote GPU(s). NOTE: - Applies to Xid 144-148, 150. - For Xid 149, all impacts are to the local device (even if caused by a peer_side_down_* condition). - XC represents “Cross Contain”. * Packet Loss may present as a Xid 145 RLW_SRC_TRACK; V1 IntrInfo: ------000000----------0011100010; V2 IntrInfo: ------000000-------------0000111; ErrStatus 0x00000004 or 0x00000008. ** Xid 94 represents consumption of poisoned memory; Xid 48 represents ECC/DBE errors. |
|
Guidance Classes |
Guidance Class |
A resolution bucket assigned to a particular type of action. |
Resolution Action |
Steps to be taken to resolve the error that occurred. |
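The IntrInfo decode described in the XID 144-150 Decode table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `subcode` and `matches_mask` are hypothetical, the mask strings are assumed to be 32 characters wide with the leftmost character corresponding to bit 31, and the example masks are the Packet Loss (Xid 145 RLW_SRC_TRACK) patterns from the Local/Remote note, written with plain hyphens for the optional bits.

```python
def subcode(intr_info: int, v2: bool) -> int:
    """Extract the Subcode field from IntrInfo.

    V2 (>=R575) uses IntrInfo[6:0]; V1 (<R575) uses IntrInfo[9:5].
    """
    if v2:
        return intr_info & 0x7F          # bits 6..0
    return (intr_info >> 5) & 0x1F       # bits 9..5


def matches_mask(intr_info: int, mask: str) -> bool:
    """Compare IntrInfo against a catalog-style bitmask string.

    Assumption: the mask is 32 characters, most significant bit first.
    '0' and '1' must match the corresponding bit; '-' is ignored.
    """
    if len(mask) != 32:
        raise ValueError("expected a 32-character mask")
    for position, symbol in enumerate(mask):
        if symbol == "-":
            continue
        bit = (intr_info >> (31 - position)) & 1
        if bit != int(symbol):
            return False
    return True


# Packet Loss masks from the Local/Remote note (hyphens are don't-care bits).
V1_PACKET_LOSS = "------000000----------0011100010"
V2_PACKET_LOSS = "------000000-------------0000111"
```

For instance, a V1 IntrInfo value of 0x000000E2 matches `V1_PACKET_LOSS`, and a V2 value of 0x00000007 matches `V2_PACKET_LOSS`. The field widths come from the Subcode row above; the bit ordering of the mask string is an assumption that should be checked against the spreadsheet.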
This catalog provides a detailed reference for each possible Xid, including its cause and the actions to take. You can also download the reference guide as a spreadsheet here.
The catalog is presented as a spreadsheet, with several sheets of information.
For a given Xid, use the following procedure to determine the correct actions to take.
Step 1: Determine Xid Code#
Determine the Xid Code from the Xid Message.
Each Xid message contains a single code, following a colon after the GPU identifier. In the following examples, the Xid Codes are 14, 22013 and 79 respectively.
[...] NVRM: Xid (0000:03:00): 14, Channel 00000001
[...] NVRM: Xid (PCI:0000:5a:00): 79, GPU has fallen off the bus.
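As a minimal sketch of Step 1, the Xid Code can be pulled out of a log line with a regular expression. The pattern below is an assumption inferred only from the sample messages above (a device identifier in parentheses, a colon, then the numeric code); the function name `parse_xid` is hypothetical, and real driver messages may carry additional fields.

```python
import re

# Assumed message shape, based on the samples above:
#   NVRM: Xid (<device>): <code>, <details>
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)(?:,\s*(?P<details>.*))?"
)


def parse_xid(line: str):
    """Return the device, Xid Code and trailing details, or None if no Xid is found."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "code": int(match.group("code")),
        "details": (match.group("details") or "").strip(),
    }
```

Calling `parse_xid` on the second sample line yields code 79 for device `PCI:0000:5a:00`; lines that do not contain an Xid message return `None`.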
Step 2: Review Xid Classification from the Xid Catalog#
In the Xid Catalog, open the “Xids” sheet and find the row with a matching “Code” to the Xid Code from Step 1.
For example, for Xid 79:

For each row, the catalog provides a brief description of the Xid in the “Description” column, as well as applicability to different revisions of GPU in the “Applies to” columns.
Note that some Xid codes are deprecated on more recent GPU models. These Xids are listed as “Unused” for the description, indicating they may be deprecated and applicable to V100 or earlier GPUs.
Step 3: Determine Data Center and Investigatory Actions#
The Xid Catalog provides two different actions for handling an Xid.
Immediate Action:
The “Resolution Bucket (Immediate Action)” column in the Xid Catalog provides the action to perform immediately after an Xid is observed. It is intended as an automatable action that administrators can use to recover the system from the Xid error and ready it for new applications.
Investigatory Action:
The data center action is intended to recover the system. In some cases, however, a persistent failure causes the Xid to recur. A more detailed investigation into the cause is then required, which helps identify underlying hardware, firmware, or software failures that need longer-term corrective action.
If the issue recurs, or is not expected, the Investigatory Action column provides guidance on how to investigate it.
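The two-tier handling described in Step 3 can be sketched as a small lookup. The two rows below are copied from the “Xids” sheet (Xid 8 and Xid 11); the `CATALOG` dictionary, the `choose_buckets` function, and the `recurrence_threshold` parameter are hypothetical names, and the threshold is an illustrative site policy rather than anything the catalog prescribes.

```python
# Two rows copied from the "Xids" sheet; a real tool would load the full spreadsheet.
CATALOG = {
    8: {
        "mnemonic": "ROBUST_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT",
        "immediate": "RESTART_APP",
        "investigatory": "CONTACT_SUPPORT",
    },
    11: {
        "mnemonic": "ROBUST_CHANNEL_GR_ERROR_MISSING_HW",
        "immediate": "RESTART_APP",
        "investigatory": "CHECK_APP/CUDA",
    },
}


def choose_buckets(code: int, times_seen: int, recurrence_threshold: int = 2):
    """Return the resolution buckets to act on for an observed Xid.

    The immediate bucket is always returned. The investigatory bucket is
    added once the Xid has been seen `recurrence_threshold` times or more
    (an assumed policy for deciding when an issue "recurs").
    Returns None when the code is not in the loaded catalog rows.
    """
    row = CATALOG.get(code)
    if row is None:
        return None
    buckets = [row["immediate"]]
    if times_seen >= recurrence_threshold:
        buckets.append(row["investigatory"])
    return buckets
```

With these rows, a first occurrence of Xid 8 maps to `RESTART_APP` only, while a repeated occurrence also surfaces `CONTACT_SUPPORT`. The bucket names are then resolved to concrete steps as described in Step 4.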
Step 4: Determine Resolution Steps#
Both the Data Center Action and Investigatory Action columns contain a short Resolution Bucket name that summarizes common actions shared by different Xid codes. The actual steps for each bucket are defined on the Fault Resolution Buckets worksheet.
For example, if the data center action indicates that the fault resolution bucket is “RESET_GPU”, the row for RESET_GPU in the Fault Resolution Buckets worksheet provides guidance on the exact actions to take.

Similarly, the resolution steps for investigatory actions are presented on the same worksheet.