Analyzing Xid Errors with the Xid Catalog

On Volta and older GPUs, see Xid and SXid Errors with the Xid Catalog for older GPUs.

For Ampere and newer GPUs (including PCIe form-factor GPUs), a catalog of possible Xid events is available in the tables below. You can also download the spreadsheet below:

Xid Catalog Reference

Xids

Table 1 Xids
Type (XID)	Code	Mnemonic	Description	Applies to A100	Applies to H100	Applies to B100	Applies to GB200	Resolution Bucket (Immediate Action)	Resolution Bucket (Investigatory Action)	Xid 154 linkage	Trigger Conditions
XID	1	ROBUST_CHANNEL_FIFO_ERROR_FIFO_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	2	ROBUST_CHANNEL_FIFO_ERROR_SW_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	3	ROBUST_CHANNEL_FIFO_ERROR_UNK_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	4	ROBUST_CHANNEL_FIFO_ERROR_CHANNEL_BUSY	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	5	ROBUST_CHANNEL_FIFO_ERROR_RUNOUT_OVERFLOW	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	6	ROBUST_CHANNEL_FIFO_ERROR_PARSE_ERR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	7	ROBUST_CHANNEL_FIFO_ERROR_PTE_ERR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	8	ROBUST_CHANNEL_FIFO_ERROR_IDLE_TIMEOUT	GPU stopped processing	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	9	ROBUST_CHANNEL_GR_ERROR_INSTANCE	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	10	ROBUST_CHANNEL_GR_ERROR_SINGLE_STEP	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	11	ROBUST_CHANNEL_GR_ERROR_MISSING_HW	Invalid or corrupted push buffer stream	YES	YES	YES	YES	RESTART_APP	CHECK_APP/CUDA
XID	12	ROBUST_CHANNEL_GR_ERROR_SW_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	13	ROBUST_CHANNEL_GR_EXCEPTION / ROBUST_CHANNEL_GR_ERROR_SW_NOTIFY	Graphics Engine Exception	YES	YES	YES	YES	RESTART_APP	WORKFLOW_XID_13		This event is logged for general user application faults. Typically this is an out-of-bounds error where the user has walked past the end of an array, but it could also be an illegal instruction, illegal register, or other case. In rare cases, a hardware failure or system software bug can surface as Xid 13. When this event is logged, NVIDIA recommends the following: 1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or 2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or 3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug. NOTE: The Compute Sanitizer memcheck tool instruments the running application and reports which line of code performed the illegal read.
XID	14	ROBUST_CHANNEL_FAKE_ERROR	Unused	YES	YES	YES	YES	IGNORE	CONTACT_SUPPORT		Fake or injected error from userspace
XID	15	ROBUST_CHANNEL_SCANLINE_TIMEOUT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	16	ROBUST_CHANNEL_VBLANK_CALLBACK_TIMEOUT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT	CONTACT_SUPPORT		N/A; Unused
XID	17	ROBUST_CHANNEL_PARAMETER_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	18	ROBUST_CHANNEL_BUS_MASTER_TIMEOUT_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	19	ROBUST_CHANNEL_DISP_MISSED_NOTIFIER	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	20	ROBUST_CHANNEL_MPEG_ERROR_SW_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	21	ROBUST_CHANNEL_ME_ERROR_SW_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	22	ROBUST_CHANNEL_VP_ERROR_SW_METHOD	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	23	ROBUST_CHANNEL_RC_LOGGING_ENABLED	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	24	ROBUST_CHANNEL_GR_SEMAPHORE_TIMEOUT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	25	ROBUST_CHANNEL_GR_ILLEGAL_NOTIFY	Invalid or illegal push buffer stream	YES	YES	YES	YES	RESTART_APP	CHECK_APP/CUDA
XID	26	ROBUST_CHANNEL_FIFO_ERROR_FBISTATE_TIMEOUT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	27	ROBUST_CHANNEL_VP_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	28	ROBUST_CHANNEL_VP2_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	29	ROBUST_CHANNEL_BSP_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	30	ROBUST_CHANNEL_BAD_ADDR_ACCESS	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	31	ROBUST_CHANNEL_FIFO_ERROR_MMU_ERR_FLT	GPU memory page fault	YES	YES	YES	YES	RESTART_APP	WORKFLOW_XID_31		This event is logged when a fault is reported by the MMU, such as when an illegal address access is made by an applicable unit on the chip. Typically these are application-level bugs, but they can also be driver bugs or hardware bugs. When this event is logged, NVIDIA recommends the following: 1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or 2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or 3. File a bug if the previous two come back inconclusive, to rule out a potential NVIDIA driver or hardware bug. NOTE: The Compute Sanitizer memcheck tool instruments the running application and reports which line of code performed the illegal read.
XID	32	ROBUST_CHANNEL_PBDMA_ERROR	Invalid or corrupted push buffer stream	YES	YES	YES	YES	RESTART_APP	CHECK_APP/CUDA		This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions.
XID	33	ROBUST_CHANNEL_SEC_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	34	ROBUST_CHANNEL_MSVLD_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	35	ROBUST_CHANNEL_MSPDEC_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	36	ROBUST_CHANNEL_MSPPP_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	37	ROBUST_CHANNEL_FECS_ERR_UNIMP_FIRMWARE_METHOD	Driver firmware error	YES	YES	YES	YES	IGNORE	CHECK_APP/CUDA
XID	38	ROBUST_CHANNEL_FECS_ERR_WATCHDOG_TIMEOUT	Driver firmware error	YES	YES	YES	YES	IGNORE	CONTACT_SUPPORT
XID	39	ROBUST_CHANNEL_CE0_ERROR	Copy Engine Exception	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	40	ROBUST_CHANNEL_CE1_ERROR	Copy Engine Exception	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	41	ROBUST_CHANNEL_CE2_ERROR	Copy Engine Exception	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	42	ROBUST_CHANNEL_VIC_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	43	ROBUST_CHANNEL_RESETCHANNEL_VERIF_ERROR	GPU stopped processing	YES	YES	YES	YES	IGNORE	CONTACT_SUPPORT		This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state. In most cases, this is not indicative of a driver bug but rather a user application error.
XID	44	ROBUST_CHANNEL_GR_FAULT_DURING_CTXSW	Graphics Engine fault during context switch	YES	YES	YES	YES	IGNORE	CONTACT_SUPPORT
XID	45	ROBUST_CHANNEL_PREEMPTIVE_REMOVAL	Preemptive cleanup, due to previous errors – Most likely to see when running multiple cuda applications and hitting a DBE	YES	YES	YES	YES	WORKFLOW_XID_45	Solo: RESTART_FM Not Solo: IGNORE (follow other Xid)		This event is logged when the user application aborts and the kernel driver tears down the GPU application running on the GPU. Control-C, GPU resets, sigkill are all examples where the application is aborted and this event is created. In many cases, this is not indicative of a bug but rather a user or system action.
XID	46	ROBUST_CHANNEL_GPU_TIMEOUT_ERROR	GPU stopped processing	YES	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT
XID	47	ROBUST_CHANNEL_NVENC0_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	48	ROBUST_CHANNEL_GPU_ECC_DBE	Double Bit ECC Error	YES	YES	YES	YES	WORKFLOW_XID_48	WORKFLOW_XID_48	CUDA 12.7; GPU driver R565	This event is logged when the GPU detects that an uncorrectable error occurs on the GPU. This is also reported back to the user application. A GPU reset or node reboot is needed to clear this error. The tool nvidia-smi can provide a summary of ECC errors. If the ECC error is reported for SRAM (excludes “framebuffer”), check for SRAM DBE thresholds and follow RMA flow if exceeded - (nvidia-smi <sram_threshold_exceeded> or NSM Msg Type 0x3, Cmd Code 0x7D, bit 0). If flag is set, run field diag.
XID	49	SILENT_RUNNING_CONSTANT_LEVEL_SET_BY_REGISTRY	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	50	SILENT_RUNNING_LEVEL_TRANSITION_DUE_TO_RC_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	51	SILENT_RUNNING_STRESS_TEST_FAILURE	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	52	SILENT_RUNNING_LEVEL_TRANS_DUE_TO_TEMP_RISE	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	53	SILENT_RUNNING_TEMP_REDUCED_CLOCKING	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	54	SILENT_RUNNING_PWR_REDUCED_CLOCKING	Auxiliary power is not connected to the GPU board	YES	YES	YES	NO	CHECK_MECHANICALS	CONTACT_SUPPORT
XID	55	SILENT_RUNNING_TEMPERATURE_READ_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	56	DISPLAY_CHANNEL_EXCEPTION	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	57	FB_LINK_TRAINING_FAILURE_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	58	FB_MEMORY_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	59	PMU_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	60	ROBUST_CHANNEL_SEC2_ERROR	Video processor exception	YES	YES	YES	YES	RESTART_APP	INVESTIGATE_SW
XID	61	PMU_BREAKPOINT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	62	PMU_HALT_ERROR	Internal micro-controller halt (newer drivers)	YES	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT	CUDA 12.7; GPU driver R565
XID	63	INFOROM_DRAM_RETIREMENT_EVENT	GPU memory remapping event	YES	YES	YES	YES	IGNORE	IGNORE	CUDA 12.7; GPU driver R565	These events are logged when the GPU handles ECC memory errors on the GPU. On GPUs that support row remapping, starting with NVIDIA® Ampere architecture GPUs, these events provide details on row remapper activity. For more information on row remapper Xids, refer to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. On earlier GPUs that support dynamic page retirement, these events provide details on dynamic page retirement activity. For more information on dynamic page retirement Xids, refer to https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.
XID	64	INFOROM_DRAM_RETIREMENT_FAILURE	GPU memory remapping failure	YES	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT	CUDA 12.7; GPU driver R565	These events are logged when the GPU handles ECC memory errors on the GPU. On GPUs that support row remapping, starting with NVIDIA® Ampere architecture GPUs, these events provide details on row remapper activity. For more information on row remapper Xids, refer to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. On earlier GPUs that support dynamic page retirement, these events provide details on dynamic page retirement activity. For more information on dynamic page retirement Xids, refer to https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.
XID	65	ROBUST_CHANNEL_NVENC1_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	66	ROBUST_CHANNEL_FECS_ERR_REG_ACCESS_VIOLATION	Illegal access by driver	YES	YES	YES	YES	IGNORE	INVESTIGATE_SW
XID	67	ROBUST_CHANNEL_FECS_ERR_VERIF_VIOLATION	Illegal access by driver	YES	YES	YES	YES	IGNORE	CONTACT_SUPPORT
XID	68	ROBUST_CHANNEL_NVDEC0_ERROR	NVDEC0 Exception	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	69	ROBUST_CHANNEL_GR_CLASS_ERROR	Graphics Engine class error	YES	YES	YES	YES	RESTART_APP	CHECK_APP/CUDA
XID	70	ROBUST_CHANNEL_CE3_ERROR	CE3: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	71	ROBUST_CHANNEL_CE4_ERROR	CE4: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	72	ROBUST_CHANNEL_CE5_ERROR	CE5: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	73	ROBUST_CHANNEL_NVENC2_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	74	NVLINK_ERROR	NVLINK Error	YES	YES	NO	NO	WORKFLOW_NVLINK_ERR	CONTACT_SUPPORT	CUDA 12.7; GPU driver R565	This event is logged when the GPU detects a problem with a connection from the GPU to another GPU or NVSwitch over NVLink. A GPU reset or node reboot is needed to clear this error. This event may indicate a hardware failure with the link itself, or may indicate a problem with the device at the remote end of the link. For example, if a GPU fails, another GPU connected to it over NVLink may report an Xid 74 simply because the link went down as a result. The nvidia-smi nvlink command can provide additional details on NVLink errors, and connection information on the links. If this error is seen repeatedly and GPU reset or node reboot fails to clear the condition, contact your hardware vendor for support.
XID	75	ROBUST_CHANNEL_CE6_ERROR	CE6: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	76	ROBUST_CHANNEL_CE7_ERROR	CE7: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	77	ROBUST_CHANNEL_CE8_ERROR	CE8: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	78	VGPU_START_ERROR	vGPU Start Error	YES	YES	YES	YES	UPDATE_SWFW	UPDATE_SWFW
XID	79	ROBUST_CHANNEL_GPU_HAS_FALLEN_OFF_THE_BUS	GPU has fallen off the bus	YES	YES	YES	YES	RESTART_BM	CONTACT_SUPPORT	CUDA 12.7; GPU driver R565	This event is logged when the GPU driver attempts to access the GPU over its PCI Express connection and finds that the GPU is not accessible. This event is often caused by hardware failures on the PCI Express link causing the GPU to be inaccessible due to the link being brought down. Reviewing system event logs and kernel PCI event logs may provide additional indications of the source of the link failures. This event may also be caused by failing GPU hardware or other driver issues.
XID	80	PBDMA_PUSHBUFFER_CRC_MISMATCH	Corrupted data sent to GPU	YES	YES	NO	NO	RESTART_APP	CHECK_APP/CUDA
XID	81	ROBUST_CHANNEL_VGA_SUBSYSTEM_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	82	ROBUST_CHANNEL_NVJPG0_ERROR	NVJPG0 Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	83	ROBUST_CHANNEL_NVDEC1_ERROR	NVDEC1 Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	84	ROBUST_CHANNEL_NVDEC2_ERROR	NVDEC2 Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	85	ROBUST_CHANNEL_CE9_ERROR	CE9: Unknown Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	86	ROBUST_CHANNEL_OFA0_ERROR	OFA Exception	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	87	NVTELEMETRY_DRIVER_REPORT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	88	ROBUST_CHANNEL_NVDEC3_ERROR	NVDEC3 Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	89	ROBUST_CHANNEL_NVDEC4_ERROR	NVDEC4 Error	YES	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	90	LTC_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	91	RESERVED_XID	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	92	EXCESSIVE_SBE_INTERRUPTS	High single-bit ECC error rate	YES	YES	YES	YES	IGNORE	CONTACT_SUPPORT
XID	93	INFOROM_ERASE_LIMIT_EXCEEDED	Non-fatal violation of provisioned InfoROM wear limit	YES	NO	NO	NO	IGNORE	CONTACT_SUPPORT		This event is logged when the GPU driver fails to update the InfoROM due to a violation of the provisioned InfoROM wear limit that was set for the GPU with NVFlash using nvflash --elsessionstart. In most cases this is not indicative of a driver or flash failure, but rather the intentional use of the InfoROM wear protection feature as set by NVFlash. Recovery steps: The GPU can be recovered from Xid 93 by clearing the InfoROM erase limit using ./nvflash --elsessionclear. If clearing the limit using nvflash doesn’t help, report the issue to NVIDIA.
XID	94	ROBUST_CHANNEL_CONTAINED_ERROR	Contained memory error	YES	YES	YES	YES	RESTART_APP	IGNORE (sympathetic)	CUDA 12.7; GPU driver R565	These events (94/95) are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA A100 GPUs. For Xid 94, these errors are contained to one application, and the application that encountered this error must be restarted. All other applications running at the time of the Xid are unaffected. It is recommended to reset the GPU when convenient. Applications can continue to be run until the reset can be performed. One possible cause of containment errors is the handling of ECC memory errors. Review the NVIDIA GPU Memory Error Management manual: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping for coverage of ECC-triggered containment errors. Xid 45 will be seen in relation to this error.
XID	95	ROBUST_CHANNEL_UNCONTAINED_ERROR	Uncontained memory error	YES	YES	YES	YES	RESET_GPU	IGNORE (sympathetic)	CUDA 12.7; GPU driver R565	These events (94/95) are logged when GPU drivers handle errors in GPUs that support error containment, starting with NVIDIA® A100 GPUs. For Xid 95, these errors affect multiple applications, and the affected GPU must be reset before applications can restart. Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU reset capabilities and limitations. One possible cause of containment errors is the handling of ECC memory errors. Review the NVIDIA GPU Memory Error Management manual: https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping for coverage of ECC-triggered containment errors. Xid 45 will be seen in relation to this error.
XID	96	ROBUST_CHANNEL_NVDEC5_ERROR	NVDEC5 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	97	ROBUST_CHANNEL_NVDEC6_ERROR	NVDEC6 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	98	ROBUST_CHANNEL_NVDEC7_ERROR	NVDEC7 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	99	ROBUST_CHANNEL_NVJPG1_ERROR	NVJPG1 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	100	ROBUST_CHANNEL_NVJPG2_ERROR	NVJPG2 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	101	ROBUST_CHANNEL_NVJPG3_ERROR	NVJPG3 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	102	ROBUST_CHANNEL_NVJPG4_ERROR	NVJPG4 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	103	ROBUST_CHANNEL_NVJPG5_ERROR	NVJPG5 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	104	ROBUST_CHANNEL_NVJPG6_ERROR	NVJPG6 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	105	ROBUST_CHANNEL_NVJPG7_ERROR	NVJPG7 Error	NO	YES	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	106	SMBPBI_TEST_MESSAGE	SMBPBI Test Message	YES	YES	YES	YES	IGNORE	IGNORE
XID	107	SMBPBI_TEST_MESSAGE_SILENT	SMBPBI Test Message Silent	YES	YES	YES	YES	IGNORE	IGNORE
XID	108	NVLINK_REMOTE_TRANSLATION_ERROR	Unused	YES	YES	YES	YES	IGNORE	XID_137_FLOW		N/A; Unused
XID	109	ROBUST_CHANNEL_CTXSW_TIMEOUT_ERROR	Context Switch Timeout Error	YES	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT	CUDA 12.7; GPU driver R570
XID	110	SEC_FAULT_ERROR	Security Fault Error	NO	YES	YES	YES	RESET_GPU	INVESTIGATE_SW	CUDA 12.7; GPU driver R565	This event should be uncommon unless there is a hardware failure. To recover, revert any recent system hardware modifications and cold reset the system. If this fails to correct the issue, contact your hardware vendor for assistance.
XID	111	BUNDLE_ERROR_EVENT	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	112	DISP_SUPERVISOR_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	113	DP_LT_FAILURE	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	114	HEAD_RG_UNDERFLOW	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	115	CORE_CHANNEL_REGS	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	116	WINDOW_CHANNEL_REGS	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	117	CURSOR_CHANNEL_REGS	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	118	HEAD_REGS	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	119	GSP_RPC_TIMEOUT	GSP RPC Timeout	YES	YES	YES	YES	RESET_GPU	INVESTIGATE_SW		These events (119/120) may be logged when an error occurs in code running on the GSP core of the GPU and/or a timeout occurs while waiting for the GSP core of the GPU to respond to an RPC message. A GPU reset or node power cycle may be needed if the error persists. If this problem reoccurs after a power cycle, follow the NVIDIA GPU Debug Guidelines document for additional debugging steps.
XID	120	GSP_ERROR	GSP Error	YES	YES	YES	YES	RESET_GPU	INVESTIGATE_SW	CUDA 12.7; GPU driver R565	These events (119/120) may be logged when an error occurs in code running on the GSP core of the GPU and/or a timeout occurs while waiting for the GSP core of the GPU to respond to an RPC message. A GPU reset or node power cycle may be needed if the error persists. If this problem reoccurs after a power cycle, follow the NVIDIA GPU Debug Guidelines document for additional debugging steps.
XID	121	C2C_ERROR	C2C Error	NO	NO	NO	YES	IGNORE	CONTACT_SUPPORT		This event may occur when the GPU driver has observed corrected errors on the C2C NVLink connection to a Grace CPU. These errors are corrected by the system and have no operational impact. Resetting the GPU at an available service window will allow the GPU to retrain the link. NOTE: repeat errors may be reported; VBIOS 97.00.90.00.00 may provide some relief from that condition
XID	122	SPI_PMU_RPC_READ_FAIL	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	123	SPI_PMU_RPC_WRITE_FAIL	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	124	SPI_PMU_RPC_ERASE_FAIL	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	125	INFOROM_FS_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	126	ROBUST_CHANNEL_CE10_ERROR	CE10: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	127	ROBUST_CHANNEL_CE11_ERROR	CE11: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	128	ROBUST_CHANNEL_CE12_ERROR	CE12: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	129	ROBUST_CHANNEL_CE13_ERROR	CE13: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	130	ROBUST_CHANNEL_CE14_ERROR	CE14: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	131	ROBUST_CHANNEL_CE15_ERROR	CE15: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	132	ROBUST_CHANNEL_CE16_ERROR	CE16: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	133	ROBUST_CHANNEL_CE17_ERROR	CE17: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	134	ROBUST_CHANNEL_CE18_ERROR	CE18: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	135	ROBUST_CHANNEL_CE19_ERROR	CE19: Unknown Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	136	ALI_TRAINING_FAIL	Link Training Failed	NO	YES	NO	NO	RESET_GPU	INVESTIGATE_LINK_SI	CUDA 12.7; GPU driver R565
XID	137	NVLINK_PRIV_ERR	NVLink Privilege Error	YES	YES	YES	YES	IGNORE	XID_137_FLOW		This event is logged when a fault is reported by the remote MMU, such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs, but can also be driver bugs or hardware bugs. When this event is logged, NVIDIA recommends the following: 1. Run the application in cuda-gdb or the Compute Sanitizer memcheck tool, or 2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or 3. File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.
XID	138	ROBUST_CHANNEL_DLA_ERROR	Unused	NO	NO	NO	NO	CONTACT_SUPPORT			N/A; Unused
XID	139	ROBUST_CHANNEL_OFA1_ERROR	OFA1 Error	NO	NO	YES	YES	RESTART_APP	CONTACT_SUPPORT
XID	140	UNRECOVERABLE_ECC_ERROR_ESCAPE	ECC Unrecovered Error	YES	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT		This event may occur when the GPU driver has observed uncorrectable errors in GPU memory, in such a way as to interrupt the GPU driver’s ability to mark the pages for dynamic page offlining or row remapping. Reset the GPU, and if the problem persists, contact your hardware vendor for support.
XID	141	ROBUST_CHANNEL_FAST_PATH_ERROR	CUDA Fast Path Error	NO	YES	YES	YES	IGNORE	CONTACT_SUPPORT
XID	142	ROBUST_CHANNEL_NVENC3_ERROR	NVENC3 Error	NO	NO	NO	YES	CONTACT_SUPPORT
XID	143	GPU_INIT_ERROR	GPU Initialization Error	NO	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT	CUDA 12.9; GPU driver R575
XID	144	NVLINK_SAW_ERROR	NVLINK: SAW Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	145	NVLINK_RLW_ERROR	NVLINK: RLW Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	146	NVLINK_TLW_ERROR	NVLINK: TLW Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	147	NVLINK_TREX_ERROR	NVLINK: TREX Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	148	NVLINK_NVLPW_CTRL_ERROR	NVLINK: NVLPW_CTRL Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	149	NVLINK_NETIR_ERROR	NVLINK: NETIR Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	150	NVLINK_MSE_ERROR	NVLINK: MSE Error	NO	NO	YES	YES	WORKFLOW_NVLINK5_ERR	WORKFLOW_NVLINK5_ERR	CUDA 12.7; GPU driver R565
XID	151	ROBUST_CHANNEL_KEY_ROTATION_ERROR	Key rotation Error	NO	YES	YES	YES	RESTART_VM	CONTACT_SUPPORT
XID	152	ROBUST_CHANNEL_DLA_SMMU_ERROR	DLA SMMU Error	NO	NO	NO	NO	IGNORE	CONTACT_SUPPORT
XID	153	ROBUST_CHANNEL_DLA_TIMEOUT	DLA timeout Error	NO	NO	NO	NO	IGNORE	CONTACT_SUPPORT
XID	154	GPU_RECOVERY_ACTION_CHANGED	GPU Recovery Action Changed	YES	YES	YES	YES	XID_154	N/A Informational only regarding another Xid		Xid 154 will be seen in conjunction with other Xids and summarizes the recovery action required for other Xids. The string will be similar to “Xid 154 GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)”, where the expected values of the text are: “None”, “Drain P2P”, “Drain and Reset”, “GPU Reset Required”, “Node Reboot Required”.
XID	155	NVLINK_SW_DEFINED_ERROR	NVLINK: SW Defined Error	NO	NO	YES	YES	RESET_GPU	INVESTIGATE_SW_USER	CUDA 12.7; GPU driver R565	Link down events which are flagged as “intentional” (including transitions to SLEEP) will trigger this Xid
XID	156	RESOURCE_RETIREMENT_EVENT	Resource Retirement Event	NO	YES	YES	YES	RESET_GPU	IGNORE	CUDA 12.7; GPU driver R565
XID	157	RESOURCE_RETIREMENT_FAILURE	Resource Retirement Failure	NO	YES	YES	YES	IGNORE	CONTACT_SUPPORT		No repairs are possible due to a lack of resources. You may still run workloads or applications, but you may experience the same Xid again.
XID	158	GPU_FATAL_TIMEOUT	GPU Fatal Timeout	YES	YES	YES	YES	RESET_GPU	CONTACT_SUPPORT	yes; support with Xid introduction
XID	159	ROBUST_CHANNEL_CHI_NON_DATA_ERROR	CHI Non-Data Error	NO	NO	YES	YES	CHECK_UVM	SYMPATHETIC_REPORT_SOLO	yes; support with Xid introduction	May be seen on any C2C link-connected GPU.
XID	160	CHANNEL_RETIREMENT_EVENT	Channel Retirement Event	NO	NO	YES	YES	IGNORE	INVESTIGATE_SW	CUDA 12.9; GPU driver R575
XID	161	CHANNEL_RETIREMENT_FAILURE	Channel Retirement Failure	NO	NO	YES	YES	IGNORE	INVESTIGATE_SW	CUDA 12.9; GPU driver R575
XID	162	PSHC_REENGAGED	Power Smoothing HW Circuitry capability reengaged	NO	NO	YES	YES
XID	163	PSHC_DISENGAGED	Power Smoothing HW Circuitry capability disengaged	NO	NO	YES	YES				No GPU reset required. If power smoothing functionality is desired, the customer needs to resolve the thermal events. If disabled due to timeout, reload the driver or reset the GPU.
XID	164	PSHC_LOW_LIFETIME	Power Smoothing HW Circuitry low lifetime reached	NO	NO	YES	YES				Monitor power swings and expect to replace GPUs if power smoothing is desired. Power smoothing functionality will be disabled soon. Investigate if power swings are acceptable, and if not, take action.
XID	165	PSHC_ZERO_LIFETIME	Power Smoothing HW Circuitry lifetime exhausted	NO	NO	YES	YES				Replace GPUs if power swings are not acceptable, and power smoothing is desired. Power smoothing will be disabled by the driver and power swings will occur. Analyze datacenter infrastructure to ensure ability to absorb power swings.
XID	166	NVLINK_SECURE_CRYPTO_ERR	CC traffic seen prior to link properly being configured for encrypted traffic	NO	NO	YES	YES				Applicable to CC (confidential computing) mode only.
XID	167	PCIE_FATAL_TIMEOUT	PCIE_FATAL_TIMEOUT	NO	YES	YES	YES
XID	168	REDUCED_GPU_MEMORY_CAPACITY	Errors found in WPR (write protected region)	YES	YES	YES	YES				Should only be seen when ECC is disabled. Either ECC should be enabled (to enable row-remapping) or boot re-attempted with shifted WPR.
XID	169	SEC2_HALT_ERROR	Internal micro-controller halt	NO	YES	YES	YES
XID	170	NVLINK_SECURE_OTHER	Interrupt seen in CC mode	NO	NO	YES	YES				Applicable to CC (confidential computing) mode only.
XID	171	UNCORRECTABLE_DRAM_ERROR	Additional to Xid 48 providing more details on particulars of fault to differentiate DRAM/SRAM	YES	YES	YES	YES
XID	172	UNCORRECTABLE_SRAM_ERROR	Additional to Xid 48 providing more details on particulars of fault to differentiate DRAM/SRAM	YES	YES	YES	YES
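
As a rough illustration of how Table 1 can drive first-pass triage, the Python sketch below extracts Xid codes from kernel log lines and looks up the immediate-action bucket. The codes, mnemonics, and buckets are copied from a few rows of the table. The log line shape ("NVRM: Xid (PCI:...): <code>, ..."), the function name triage, and the XID_CATALOG dictionary are assumptions made for this example and are not defined by the catalog.

```python
import re

# Small subset of Table 1: Xid code -> (mnemonic, immediate action bucket).
# Extend this from the full catalog as needed.
XID_CATALOG = {
    13: ("ROBUST_CHANNEL_GR_EXCEPTION", "RESTART_APP"),
    31: ("ROBUST_CHANNEL_FIFO_ERROR_MMU_ERR_FLT", "RESTART_APP"),
    48: ("ROBUST_CHANNEL_GPU_ECC_DBE", "WORKFLOW_XID_48"),
    79: ("ROBUST_CHANNEL_GPU_HAS_FALLEN_OFF_THE_BUS", "RESTART_BM"),
    94: ("ROBUST_CHANNEL_CONTAINED_ERROR", "RESTART_APP"),
    95: ("ROBUST_CHANNEL_UNCONTAINED_ERROR", "RESET_GPU"),
}

# Assumed log shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_LINE = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def triage(log_lines):
    """Return (device, code, mnemonic, immediate_action) for each Xid line.

    Codes missing from XID_CATALOG are reported with mnemonic "UNKNOWN"
    and an action of None so the caller can consult the full table.
    """
    results = []
    for line in log_lines:
        match = XID_LINE.search(line)
        if match is None:
            continue  # not an Xid line
        code = int(match.group("code"))
        mnemonic, action = XID_CATALOG.get(code, ("UNKNOWN", None))
        results.append((match.group("device"), code, mnemonic, action))
    return results


if __name__ == "__main__":
    sample = ["NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."]
    for entry in triage(sample):
        print(entry)
```

A lookup of this kind only covers the immediate-action column. The investigatory bucket and the trigger conditions still need to be read from the corresponding table row.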

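The Xid 154 row in Table 1 describes a message that names the previous and new recovery action. The sketch below shows one way such a message could be parsed in Python, assuming the text follows the example string given in that row. The function name parse_xid_154 and the returned dictionary keys are hypothetical. Only the action names listed in the table are treated as recognized, and no numeric codes beyond those in the example string are assumed.

```python
import re

# Action names listed in the Xid 154 row of Table 1.
KNOWN_ACTIONS = {
    "None",
    "Drain P2P",
    "Drain and Reset",
    "GPU Reset Required",
    "Node Reboot Required",
}

# Matches "... GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)"
XID_154 = re.compile(
    r"GPU recovery action changed from (0x[0-9a-fA-F]+) \(([^)]*)\)"
    r" to (0x[0-9a-fA-F]+) \(([^)]*)\)"
)


def parse_xid_154(message):
    """Parse an Xid 154 message; return None if the text does not match."""
    match = XID_154.search(message)
    if match is None:
        return None
    old_code, old_text, new_code, new_text = match.groups()
    return {
        "from_code": int(old_code, 16),
        "from": old_text,
        "to_code": int(new_code, 16),
        "to": new_text,
        # False means the text is not one of the values listed in the table.
        "recognized": new_text in KNOWN_ACTIONS,
    }


if __name__ == "__main__":
    msg = "Xid 154 GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)"
    print(parse_xid_154(msg))
```

Matching on the action text rather than the hex value keeps the sketch within what the table states, since the table lists the expected strings but not a complete code-to-string mapping.
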
Xid 144-150 Decode

Table 2 Xid 144-150 Decode
Xid	Subcode (V1, <R575: IntrInfo[9:5]; V2, >=R575: IntrInfo[6:0])	V1 (<R575) IntrInfo decode for Data Center Recovery Action (binary; “-” user decode)	V2 (>=R575) IntrInfo decode for Data Center Recovery Action (binary; “-” user decode)	Error Status (hex)	Resolution Bucket (Data Center Recovery Action)	V1 (<R575) Decode for Action 2	V2 (>=R575) Decode for Action 2	Action 2	Resolution Bucket (Investigatory Action)	Severity (for items with ‘*’ please see Customer User Guide tab)	HW/SW	Local/Remote (for items with ‘*’ please see Customer User Guide tab)
144	SAW_MVB	------000000----------0000100001	------000000-------------0000001	0x00000001	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none
144	SAW_MVB	------000000----------0000100001	------000000-------------0000001	0x00000002	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
144	SAW_MVB	------000000----------0000100001	------000000-------------0000001	0x00000004	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
144	SAW_MVB	------000000----------0000100001	------000000-------------0000001	0x00000008	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: XID 48; Remote: Will lead to poison or Xid94****
144	SAW_MVB	------000000----------0000100001	------000000-------------0000001	0x00000010	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
144	SAW_MVB	------000000----------0000100001	------000000-------------0000001	0x00000020	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_CTRL	------000000----------0001100010	------000000-------------0000011	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000001	XID_154_EVAL				CONTACT_SUPPORT	Non-fatal	SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000002	XID_154_EVAL				CONTACT_SUPPORT	Non-fatal	SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000004	XID_154_EVAL				CHECK_NVLINK_FAILURE_FLOW	Non-fatal	SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000008	XID_154_EVAL				CHECK_NVLINK_FAILURE_FLOW	Non-fatal	SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000010	XID_154_EVAL				CHECK_NVLINK_FAILURE_FLOW	Non-fatal	SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000020	XID_154_EVAL				CHECK_NVLINK_FAILURE_FLOW	Non-fatal	SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000040	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000080	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000100	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x00000200	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_REMAP	------000000----------0010000010	------000000-------------0000100	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
145	RLW_RSPCOL	------000000----------0010100010	------000000-------------0000101	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_RSPCOL	------000000----------0010100010	------000000-------------0000101	0x00000002	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_RSPCOL	------000000----------0010100010	------000000-------------0000101	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
145	RLW_RXPIPE	------000000---------00011000010	------000000------------00000110	0x00000001	IGNORE	------000000---------10011000010	------000000------------10000110	RESET_GPU	CONTACT_SUPPORT	Non-fatal*	SW	Local: none; Remote: PacketLoss***
145	RLW_RXPIPE	------000000---------00011000010	------000000------------00000110	0x00000002	IGNORE	------000000---------10011000010	------000000------------10000110	RESET_GPU	CONTACT_SUPPORT	Non-fatal*	SW	Local: none; Remote: PacketLoss***
145	RLW_RXPIPE	------000000---------00011000010	------000000------------00000110	0x00000004	IGNORE	------000000---------10011000010	------000000------------10000110	RESET_GPU	CONTACT_SUPPORT	Non-fatal*	SW	Local: PacketLoss***; Remote: none
145	RLW_RXPIPE	------000000----------0011000010	------000000-------------0000110	0x00000008	IGNORE				CONTACT_SUPPORT	Non-fatal	HW/SW	Local: none; Remote: none
145	RLW_RXPIPE	------000000----------0011000010	------000000-------------0000110	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
145	RLW_SRC_TRACK	------000000----------0011100010	------000000-------------0000111	0x00000001	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_SRC_TRACK	------000000----------0011100010	------000000-------------0000111	0x00000002	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_SRC_TRACK	------000000----------0011100010	------000000-------------0000111	0x00000004	XID_154_EVAL				IGNORE	Non-fatal	HW/SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_SRC_TRACK	------000000----------0011100010	------000000-------------0000111	0x00000008	XID_154_EVAL				IGNORE	Non-fatal	HW/SW	Local: XC/AppCrash (Xid 45); Remote: none
145	RLW_SRC_TRACK	------000000----------0011100010	------000000-------------0000111	0x00000010	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_SRC_TRACK	——000000———-0011100010	——000000————-0000111	0x00000020	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_SRC_TRACK	——000000———-0011100010	——000000————-0000111	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
145	RLW_TAGSTATE	——000000———-0100000010	——000000————-0001000	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_TAGSTATE	——000000———-0100000010	——000000————-0001000	0x00000002	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_TAGSTATE	——000000———-0100000010	——000000————-0001000	0x00010000	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
145	RLW_TAGSTATE	——000000———-0100000010	——000000————-0001000	0x00020000	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_TAGSTATE	——000000———-0100000010	——000000————-0001000	0x00100000	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: Xid 48, AppCrash (Xid 45); Remote: PacketLoss***(possible)
145	RLW_TAGSTATE	——000000———-0100000010	——000000————-0001000	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
146	TLW_CTRL	——000000———-0100100011	——000000————-0001001	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
146	TLW_CTRL	——000000———-0100100011	——000000————-0001001	0x00000002	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: XID 48; Remote: Will lead to poison or Xid94****
146	TLW_CTRL	——000000———-0100100011	——000000————-0001001	0x00000004	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: XID 48, PacketLoss*(possible); Remote: PacketLoss*(possible)
146	TLW_CTRL	——000000———-0100100011	——000000————-0001001	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
146	TLW_RX/TLW_RX_PIPE0	——000000———-0101000011	——000000————-0001010	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
146	TLW_RX/TLW_RX_PIPE0	——000000———-0101000011	——000000————-0001010	0x00000002	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none
146	TLW_RX/TLW_RX_PIPE0	——000000———-0101000011	——000000————-0001010	0x00000004	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: XID 48, PacketLoss*(possible); Remote: PacketLoss*(possible)
146	TLW_RX/TLW_RX_PIPE0	——000000———-0101000011	——000000————-0001010	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
146	TLW_RX/TLW_RX_PIPE1	——000000———-0101000011	——000000————-0001011	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
146	TLW_RX/TLW_RX_PIPE1	——000000———-0101000011	——000000————-0001011	0x00000002	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none
146	TLW_RX/TLW_RX_PIPE1	——000000———-0101000011	——000000————-0001011	0x00000004	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: XID 48, PacketLoss*(possible); Remote: PacketLoss*(possible)
146	TLW_RX/TLW_RX_PIPE1	——000000———-0101000011	——000000————-0001011	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
146	TLW_TX/TLW_TX_PIPE0	——000000———-0101100011	——000000————-0001100	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
146	TLW_TX/TLW_TX_PIPE0	——000000———-0101100011	——000000————-0001100	0x00000002	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none
146	TLW_TX/TLW_TX_PIPE0	——000000———-0101100011	——000000————-0001100	0x00000004	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: XID 48, PacketLoss*(possible); Remote: PacketLoss*(possible)
146	TLW_TX/TLW_TX_PIPE0	——000000———-0101100011	——000000————-0001100	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
146	TLW_TX/TLW_TX_PIPE1	——000000———-0101100011	——000000————-0001101	0x00000001	IGNORE				IGNORE	Non-fatal	HW	Local: none; Remote: none
146	TLW_TX/TLW_TX_PIPE1	——000000———-0101100011	——000000————-0001101	0x00000002	IGNORE				CONTACT_SUPPORT	Non-fatal	HW	Local: Will lead to Xid 48. Will lead to poison or Xid94/95****; Remote: none
146	TLW_TX/TLW_TX_PIPE1	——000000———-0101100011	——000000————-0001101	0x00000004	RESET_GPU				CONTACT_SUPPORT	Fatal	HW	Local: XID 48, PacketLoss*(possible); Remote: PacketLoss*(possible)
146	TLW_TX/TLW_TX_PIPE1	——000000———-0101100011	——000000————-0001101	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
147	TREX	——000000———-0110000100	——000000————-0001110	0x00000001	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	NOTE: not in production code, so should not be experienced
147	TREX	——000000———-0110000100	——000000————-0001110	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
148	NVLPW_CTRL/NVLPW	——000000———-0000000101	——000000————-0001111	0x80000000	IGNORE				CONTACT_SUPPORT	Non-fatal	SW	Local: none; Remote: none
149	NETIR/NETIR_INT	——000000———-0000000110	——000000————-0011000		RESET_GPU				SYMPATHETIC_REPORT_SOLO	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——000000———-0111000110	——000000————-0010001		RESET_GPU				SYMPATHETIC_REPORT_SOLO	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——000001———-0111000110	——000001————-0010001		RESET_GPU				REPORT_ISSUE (if seen >1 per day)	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——000010———-0111000110	——000010————-0010001		RESET_GPU				INVESTIGATE_LINK_SI	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——000100———-0111000110	——000100————-0010001		RESET_GPU				INVESTIGATE_LINK_SI	Link Fatal	HW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——001010———-0111000110	——001010————-0010001		RESET_GPU				INVESTIGATE_LINK_SI	Link Fatal	HW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——001111———-0111000110	——001111————-0010001		RESET_GPU				INVESTIGATE_SW_USER	Link Fatal	SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——010000———-0111000110	——010000————-0010001		RESET_GPU				INVESTIGATE_SW_USER_LINK_SI	Link Fatal	SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——010001———-0111000110	——010001————-0010001		RESET_GPU				INVESTIGATE_SW_USER	Link Fatal	SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——010010———-0111000110	——010010————-0010001		RESET_GPU				INVESTIGATE_SW_USER_LINK_SI	Link Fatal	SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——010101———-0111000110	——010101————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——010110———-0111000110	——010110————-0010001		RESET_GPU				INVESTIGATE_SW_USER	Link Fatal	SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100000———-0111000110	——100000————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed); Other end of link: source of link fatal
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100001———-0111000110	——100001————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed); Other end of link: source of link fatal
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100010———-0111000110	——100010————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed); Other end of link: source of link fatal
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100011———-0111000110	——100011————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed); Other end of link: source of link fatal
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100100———-0111000110	——100100————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed); Other end of link: source of link fatal
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100101———-0111000110	——100101————-0010001		RESET_GPU				INVESTIGATE_PEER_DEVICE	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed); Other end of link: source of link fatal
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——100110———-0111000110	——100110————-0010001		IGNORE				INVESTIGATE_SW/USER	Link Fatal	HW/SW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——101000———-0111000110	——101000————-0010001		IGNORE				INVESTIGATE_HOST	Fatal	SW	Local: fatal; Remote: PacketLoss***(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——101010———-0111000110	——101010————-0010001		RESET_GPU				INVESTIGATE_LINK_SI_AND_CABLES	Link Fatal?	HW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_LINK_EVT/NETIR_LINK_DOWN	——101011———-0111000110	——101011————-0010001		RESET_GPU				INVESTIGATE_LINK_SI_AND_CABLES	Link Fatal?	HW	Local: PacketLoss*(possible/delayed); Remote: PacketLoss*(possible/delayed)
149	NETIR_BER_EVENT	——000000———-1000100110	——000000————-0010011	0x00000000	IGNORE				INVESTIGATE_LINK_SI_AND_CABLES	Non-fatal	HW	Local: none; Remote: none
149	NETIR_BER_EVENT	——000000———-1000100110	——000000————-0010011	0x00000001	IGNORE				INVESTIGATE_LINK_SI_AND_CABLES	Non-fatal	HW	Local: none; Remote: none
149	NETIR_BER_EVENT	——000000———-1000100110	——000000————-0010011	0x00000002	IGNORE				INVESTIGATE_LINK_SI_AND_CABLES	Non-fatal	HW	Local: none; Remote: none
149	NETIR_BER_EVENT	——000000———-1000100110	——000000————-0010011	0x00000003	IGNORE				INVESTIGATE_LINK_SI_AND_CABLES	Non-fatal	HW	Local: none; Remote: none
149	NETIR_MFDE_EVENT	——000000———-1001000110	——000000————-0010100	0x00000001	RESET_GPU				CONTACT_SUPPORT	Fatal**	HW/SW	Local: fatal; Remote: PacketLoss***(possible/delayed)
149	NETIR_MFDE_EVENT	——000000———-1001000110	——000000————-0010100	0x00000003	IGNORE				IGNORE	Non-fatal	NA	Local: none; Remote: none
149	NETIR_MFDE_EVENT	——000000———-1001000110	——000000————-0010100	0x00000004	RESET_GPU				CONTACT_SUPPORT	Fatal**	HW/SW	Local: fatal; Remote: PacketLoss***(possible/delayed)
149	NETIR_MFDE_EVENT	——000000———-1001000110	——000000————-0010100	0x00000005	RESET_GPU				CONTACT_SUPPORT	Fatal**	HW/SW	Local: fatal; Remote: PacketLoss***(possible/delayed)
149	NETIR_MFDE_EVENT	——000000———-1001000110	——000000————-0010100	0x00000007	RESET_GPU				CONTACT_SUPPORT	Fatal**	HW/SW	Local: fatal; Remote: PacketLoss***(possible/delayed)
150	MSE Degraded	——000000———-0000000000	——000000————-0000000	0x00000000/0xFFFFFFFF	RESET_GPU				CONTACT_SUPPORT	Fatal	FW	Local: Fatal; Remote: None
150	MSE_WATCHDOG	——000000———-0000000000	——000000————-0000000	0x00000000	RESET_GPU				CONTACT_SUPPORT	Fatal	FW	Local: Fatal; Remote: None

Resolution Buckets

Table 3 Resolution Buckets#
Guidance Class	Resolution Action
CONTACT_SUPPORT	Please contact your support organization for further investigation.
RESTART_APP	The application should be restarted. RESET_GPU or RESTART_BM is not deemed necessary.
IGNORE	No action required.
WORKFLOW_XID_45	Solo: RESTART_FM Not Solo: IGNORE (follow guidance in other Xid)
RESET_GPU	Refer to https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html for GPU Reset capabilities and limitations. RESTART_BM is not deemed necessary.
WORKFLOW_XID_48	Data Center Recovery Action: Solo: RESET_GPU. With Xid 63 or 64: DRAIN_AND_RESET. Investigatory Action: Solo: RUN_FIELDDIAG. Not Solo: The string in the error indicates which unit was impacted. FB: follow Xid 63/64 guidance. All other SRAM: check the SRAM Error Threshold flag (nvidia-smi <sram_threshold_exceeded> or NSM Msg Type 0x3, Cmd Code 0x7D, bit 0). If set, RUN_FIELDDIAG.
CHECK_MECHANICALS	Check to ensure that device seating and all applicable connections to it are secure.
WORKFLOW_NVLINK_ERR	Extract the hex strings from the Xid error message. Note that there should be seven fields in the Xid. Unused fields are expected to be 0x0 rather than a full DWORD of 0’s. The first, third, fourth and fifth registers are valid for Hopper-based products. Evaluate the populated registers. If bits other than those specifically outlined below are seen, please report a bug. First register: Bits 0, 23, 30: Can be safely ignored. Bits 1, 20: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug. Bits 4 or 5: Likely HW issue with ECC/Parity -> If seen more than 2 times on the same link, report a bug. Bits 8, 9, 12, 16, 17, 24, 28: Could possibly be a HW issue: Check link mechanical connections and re-seat if a field resolution is required. Run diags if the issue persists. If the issue persists and diagnostics have passed, please report a bug. Bits 21 or 22: Marginal channel SI issue. If other errors accompany this Xid, follow the resolution for those first. Otherwise, check link mechanical connections. Run Field Diags and report a bug. Bits 27, 29: If seen repeatedly, please report a bug. Third register: Bits 0, 1, 2, 6: Likely HW issue with ECC/Parity -> If seen more than 2 times on the same link, report a bug. Bit 13: Not expected to be seen in production. If seen, please report a bug. Bits 16, 19: If seen repeatedly, please run Field Diags and report a bug. Bits 17, 18: If seen repeatedly, please report a bug. Fourth register: Bits 16, 17: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug. Bit 18: These are generally sympathetic or secondary errors, though a reset of the fabric is required. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug. 
Fifth register: Bits 18, 19, 21, 22, 24, 25, 27, 28: Likely HW issue with ECC/Parity -> If seen more than 2 times on the same link, report a bug. Bits 20, 23, 26, 29: These errors represent a threshold of ECC errors being exceeded. There was no uncorrectable error at this time. Continue operation. If desired, Field Diags can be run to check for link integrity.
UPDATE_SWFW	Update firmware and software to the latest versions. XID 78: basic issues will keep vGPU functionality from operating and must be resolved to progress: 1. Guest driver version is incompatible with the host driver. In this case the error string should be “Guest driver is incompatible with host driver”. 2. This vGPU type is not compatible with the guest OS type/GPU type. For example, the user is trying to use a compute profile on an old Maxwell GPU on a Windows guest. In this case the error string should be “vGPU type is not supported”.
RESTART_BM	Restart bare metal: the system should be restarted.
WORKFLOW_NVLINK5_ERR	Please see the “XID 144-150 Decode” (was “Customer Doc 144-150”) tab for further guidance in evaluating these Xids. These errors need decoding of the Xid message as follows to determine the resolution action. Format of the Xid error message: Xid (PCI:0000:BB:DF): <Xid Number> <sub component> <fatal vs nonfatal> <Crosscontain> <injected> <link> (<intrInfo> <errorStatus> <errorDebugData[0]> <errorDebugData[1]> <errorDebugData[2]> <errorDebugData[3]> <errorDebugData[4]>) From the above message, <intrInfo> and <errorStatus> must be decoded and evaluated using “XID 144-150 Decode” to derive the final resolution.
RESTART_VM	The VM owning the affected GPU must be restarted. RESET_GPU or RESTART_BM is not deemed necessary.
XID_154	Follow XID 154 reported guidance
CHECK_UVM	If UVM/vGPU is being utilized, RESET_GPU; otherwise IGNORE
CHECK_APP/CUDA	Issue likely caused by an application passing bad data or utilizing incorrect methods in communications with GPU. Some errors will contain PID that can be used to identify source of the problem. If determined to be a driver issue then REPORT_ISSUE
WORKFLOW_XID_13	Repeat TPC and GPC, diff SMs: RUN_DCGMEUD (possible HW issue); if pass RUN_FIELDDIAGS Repeat TPC and GPC, single SM: RUN_DCGMEUD (possible HW issue); if pass RUN_FIELDDIAGS Solo, no burst: CHECK_APP/CUDA Not Repeat TPC and GPC: CHECK_APP/CUDA Non-prod environment: CHECK_APP/CUDA If known good APP and Solo: REPORT_ISSUE
WORKFLOW_XID_31	Multiple runs needed to establish pattern Repeat MMU faults to same GPU (via PCI-ID): RUN_DCGMEUD (possible HW issue); if pass RUN_FIELDDIAGS Repeat MMU faults to diff GPU (via PCI-ID): CHECK_APP/CUDA Solo, no burst: CHECK_APP/CUDA If known good APP: REPORT_ISSUE
Solo: RESTART_FM Not Solo: IGNORE (follow other Xid)	Solo: RESTART_FM Not Solo: IGNORE (follow other Xid)
INVESTIGATE_SW	There is a problem with either user or NVIDIA software that needs to be investigated further. In many cases the user software may be making calls to illegal areas, poorly structured commands or other issues. This may also be a problem with NVIDIA software in which case an issue should be reported. In many cases there may be a PID that could be tracked back to the offending, originating entity.
IGNORE (sympathetic)	This is a sympathetic error that is expected to be seen with other conditions. Resolution for the other errors should be undertaken first. If this error was seen independently or all other resolutions aren’t sufficient, then REPORT_ISSUE.
XID_137_FLOW	This event is logged when a fault is reported by the remote MMU, such as when an illegal NVLink peer-to-peer access is made by an applicable unit on the chip. Typically these are application-level bugs. When this event is logged, NVIDIA recommends the following: • Run the application in cuda-gdb or cuda-memcheck • Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read. or • Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb File a bug if the previous two come back inconclusive to eliminate other possible causes.
INVESTIGATE_LINK_SI	Refer to GB200 Resiliency Service Flow for appropriate Access or Trunk link telemetry and investigation methods.
N/A Informational only regarding another Xid	Other issues need to be addressed. This Xid is informational only and is always expected to be seen with another Xid requiring a recovery action.
INVESTIGATE_SW_USER	Investigate SW or user initiated if unexpected
SYMPATHETIC_REPORT_SOLO	This is a sympathetic error that is expected to be seen with other conditions. Resolution for the other errors should be undertaken first. If this error was seen independently or all other resolutions aren’t sufficient, then REPORT_ISSUE.
XID_154_EVAL	If an XID 154 is seen along with this error, take that action. If no XID 154 present, RESTART_APP
CHECK_NVLINK_FAILURE_FLOW	Check telemetry to see if any link went down in the partition within the last 30 seconds. If so, this Xid can be ignored, as it was likely a by-product of the other conditions, which should be investigated. If no link-down indication is present, REPORT_ISSUE.
REPORT_ISSUE (if seen >1 per day)	REPORT_ISSUE (if seen >1 per day)
INVESTIGATE_SW_USER_LINK_SI	Investigate software or user intervention if not expected; additionally, follow INVESTIGATE_LINK_SI if needed.
INVESTIGATE_PEER_DEVICE	A peer device experienced the issue as decoded in the Xid 144-150 table (Xid 149 in particular). Based upon the note column: Received_TS1: will be seen when the peer link down reason is unknown. Peer_side_down_to_sleep_state: investigate peer software and users if unexpected. Peer_side_down_to_disable_state: investigate peer software and users if unexpected. Peer_side_down_to_disable_and_port_lock: investigate peer software and users if unexpected. Peer_side_down_due_to_thermal_event: check switch cooling. Peer_side_down_due_to_force_event: investigate peer software and users if unexpected. Peer_side_down_due_to_reset_event: investigate peer software and users if unexpected.
INVESTIGATE_HOST	Check other logs as this is likely a secondary indicator of some action or fault (may be OOB).
INVESTIGATE_LINK_SI_AND_CABLES	A more general fault that could be cable, temperature, transceiver or seating condition. Refer to GB200 Resiliency Service Flow for appropriate Access or Trunk link telemetry and investigation methods.
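
The first-register guidance in the WORKFLOW_NVLINK_ERR row can be expressed as a small lookup. The following is a minimal sketch, not an NVIDIA tool: the names `FIRST_REGISTER_GUIDANCE` and `decode_first_register` are hypothetical, the guidance strings are abbreviated paraphrases of the row, and it assumes that bit 0 is the least significant bit of the register value.

```python
# Bit groups for the first register, transcribed from the WORKFLOW_NVLINK_ERR row.
# Guidance strings are shortened paraphrases; consult the row for the full wording.
FIRST_REGISTER_GUIDANCE = [
    ({0, 23, 30}, "can be safely ignored"),
    ({1, 20}, "sympathetic/secondary: follow other errors; report a bug if solo"),
    ({4, 5}, "likely ECC/parity HW issue: report a bug if seen more than 2 times on the same link"),
    ({8, 9, 12, 16, 17, 24, 28}, "possible HW issue: check link mechanical connections, run diags"),
    ({21, 22}, "marginal channel SI issue: follow other errors first, else check connections"),
    ({27, 29}, "report a bug if seen repeatedly"),
]


def decode_first_register(value):
    """Return a list of (bit, guidance) pairs for every set bit in a 32-bit value.

    Assumption: bit 0 is the least significant bit.
    """
    findings = []
    for bit in range(32):
        if not (value >> bit) & 1:
            continue
        for bits, guidance in FIRST_REGISTER_GUIDANCE:
            if bit in bits:
                findings.append((bit, guidance))
                break
        else:
            # The row asks for a bug report when unlisted bits are seen.
            findings.append((bit, "not listed: report a bug"))
    return findings


if __name__ == "__main__":
    # Illustrative value with bits 0 and 4 set; not taken from a real log.
    for bit, guidance in decode_first_register(0x00000011):
        print(f"bit {bit}: {guidance}")
```

The third, fourth, and fifth registers could be handled the same way with their own bit groups from the row.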

Customer User Guide

Table 4 Customer User Guide#
Sheet Name	Column Name	Description
XIDs	Type XID	Identifies Xid entries
	Code	The Xid number
	Mnemonic	String to identify the condition.
	Description	More descriptive identifier for the condition (“Unused” could mean Code is deprecated or V100 or earlier)
	Applies to <project>	Signifies if the Code is supported on this particular product.
	Resolution Bucket (Immediate Action)	Intended to reflect the action that is immediately needed in order to recover the system and get it back into service.
	Resolution Bucket (Investigatory Action)	Intended to reflect the action that is needed to investigate the fault further to try and avoid the condition occurring again. This may require FieldDiags (to check for HW issues), investigation of SI, software investigation or other steps.
	Xid 154 linkage	Represents if the Code is also expected to trigger an Xid 154 condition representing the derived Data Center resolution.
	Trigger Conditions	Description of when this condition may be seen or more details on possible actions to undertake.
XID 144-150 Decode	Xid	Xid number associated with the particular row. Each Xid represents a function of NVLink operation.
	Subcode	The subsystem of the NVLink function. This is also presented in plain text in the Xid message (ex: NETIR_LINK_EVT). If the text string differs between revisions, the two entries are separated by a “/” (V1(<R575)/V2(>=R575)). This field is encoded in the following IntrInfo bits: V1(<R575): IntrInfo[9:5]; V2(>=R575): IntrInfo[6:0].
	(V1(<R575)) IntrInfo decode for Data Center Recovery Action	Bitmask of IntrInfo for V1 messages. IntrInfo is the first register presented in the parentheses. Requires conversion of hexadecimal value to binary and applying the mask below. “-” bits are for optional user decode.
	(V2(>=R575)) IntrInfo decode for Data Center Recovery Action	Bitmask of IntrInfo for V2 messages. IntrInfo is the first register presented in the parentheses. Requires conversion of hexadecimal value to binary and applying the mask below. “-” bits are for optional user decode.
	Error Status (hex)	Error Status value represented by the second register presented in the parentheses.
	Resolution Bucket (Immediate Action)	Intended to reflect the action that is immediately needed in order to recover the system and get it back into service.
	(V1(<R575)) Decode for action 2	If needed, this will be the V1 IntrInfo decode required to undertake Action 2.
	(V2(>=R575)) Decode for action 2	If needed, this will be the V2 IntrInfo decode required to undertake Action 2.
	Action 2	Similar to Resolution Bucket (Immediate Action) above for the Decode for action 2 encoding
	Resolution Bucket (Investigatory Action)	Intended to reflect the action that is needed to investigate the fault further to try and avoid the condition occurring again. This may require FieldDiags (to check for HW issues), investigation of SI, software investigation or other steps.
	Severity	Severity of the condition; can be Link Fatal, Fatal (GPU), or Non-fatal. - GPU fatal will cause all links to go down and all app channels to be RC’ed. May cause Packet Loss conditions. - Link fatal puts the GPU in a “drain and reset recommended” state until jobs are drained. After job drain, the GPU is put in a “reset required” state so no new jobs can be launched. NOTE: * is for promotable errors that could be non-fatal or fatal and “Action 2” would apply. ** While these are generally expected to be fatal, severity will be present and there are possible paths where this may not occur.
	HW/SW	Is the condition generally HW, SW, or FW related. Some conditions can not be uniquely classified.
	Local/Remote	The impacts of the condition on the local GPU as well as on interconnected remote GPU(s). NOTE: - Applies to Xid 144-148, 150. - For Xid 149, all impacts are to the local device (even if caused by a peer_side_down_* condition). - XC represents “Cross Contain”. * Packet Loss may present as an Xid 145 RLW_SRC_TRACK; V1 IntrInfo: ——000000———-0011100010; V2 IntrInfo: ——000000————-0000111; ErrStatus 0x00000004 or 0x00000008. ** Xid94 represents consumption of poisoned memory; Xid 48 represents ECC/DBE errors.
Guidance Classes	Guidance Class	A resolution bucket assigned to a particular type of action.
	Resolution Action	Steps to be taken to resolve the error that occurred.
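
The Subcode row above states where the subcode sits inside IntrInfo: bits [9:5] for V1 messages (before R575) and bits [6:0] for V2 messages (R575 and later). The following is a minimal sketch of that extraction. The function name `intr_info_subcode` and the `driver_branch` parameter are hypothetical; the Xid message itself does not carry the driver branch, so the caller is assumed to know it. The sample values are illustrative, chosen to match the low-order bits shown in the NETIR_LINK_EVT rows, and assume the mask strings are written most significant bit first.

```python
def intr_info_subcode(intr_info, driver_branch):
    """Extract the subcode field from an IntrInfo register value.

    V1 (driver branch < 575): IntrInfo[9:5]
    V2 (driver branch >= 575): IntrInfo[6:0]
    `driver_branch` is an assumption of this sketch (e.g. 570 or 575).
    """
    if driver_branch < 575:
        return (intr_info >> 5) & 0x1F  # bits 9..5
    return intr_info & 0x7F  # bits 6..0


if __name__ == "__main__":
    # IntrInfo is printed as hexadecimal in the Xid message, so convert it first.
    v1_value = int("0x000001C6", 16)  # low ten bits: 0111000110
    v2_value = int("0x00000011", 16)  # low seven bits: 0010001
    print(intr_info_subcode(v1_value, 570))
    print(intr_info_subcode(v2_value, 575))
```

The remaining IntrInfo bits still have to be compared against the bitmask columns of the decode table to pick the correct row; this sketch covers only the subcode field.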

This catalog provides a detailed reference for each possible Xid, including its cause and the actions to take. You can also download the reference guide as a spreadsheet here.

The catalog is presented as a spreadsheet, with several sheets of information.

For a given Xid, use the following procedure to walk through the correct actions to take in handling the Xid.

Step 1: Determine Xid Code#

Determine the Xid Code from the Xid Message.

Each Xid message contains a single code, which follows the colon after the GPU identifier. In the following examples, the Xid Codes are 14 and 79, respectively.

[...] NVRM: Xid (0000:03:00): 14, Channel 00000001
[...] NVRM: Xid (PCI:0000:5a:00): 79, GPU has fallen off the bus.
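
Extracting the code can be automated with a regular expression. The following is a minimal sketch that assumes every message follows the `NVRM: Xid (<GPU identifier>): <code>` shape of the two examples above; the names `XID_RE` and `parse_xid` are hypothetical, and other log layouts may need a different pattern.

```python
import re

# Matches the "NVRM: Xid (<GPU identifier>): <code>" prefix seen in the examples.
XID_RE = re.compile(r"NVRM: Xid \((?P<gpu>[^)]+)\): (?P<code>\d+)")


def parse_xid(line):
    """Return (gpu_identifier, xid_code), or None if the line is not an Xid message."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group("gpu"), int(match.group("code"))


examples = [
    "[...] NVRM: Xid (0000:03:00): 14, Channel 00000001",
    "[...] NVRM: Xid (PCI:0000:5a:00): 79, GPU has fallen off the bus.",
]

if __name__ == "__main__":
    for line in examples:
        print(parse_xid(line))
```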

Step 2: Review Xid Classification from the Xid Catalog#

In the Xid Catalog, open the “Xids” sheet and find the row with a matching “Code” to the Xid Code from Step 1.

For example, for Xid 79, find the row whose “Code” is 79.

For each row, the catalog provides a brief description of the Xid in the “Description” column, as well as applicability to different revisions of GPU in the “Applies to” columns.

Note that some Xid codes are deprecated on more recent GPU models. These Xids are listed as “Unused” for the description, indicating they may be deprecated and applicable to V100 or earlier GPUs.

Step 3: Determine Data Center and Investigatory Actions#

The Xid Catalog provides two different actions for handling an Xid.

Immediate Action:

The “Resolution Bucket (Immediate Action)” column in the Xid Catalog provides the immediate (data center) action that should be performed to recover a system after an Xid is observed. It is intended as an automatable action that administrators can run as automated recovery, returning the system to a state where it is ready for new applications.

Investigatory Action:

The immediate (data center) action is intended to recover the system. Where there is a persistent failure, however, the Xid will reoccur, and a more detailed investigation into the cause is required. That investigation helps identify whether underlying hardware, firmware, or software failures need longer-term corrective action.

If the issue reoccurs, or is not expected, the Investigatory Action column provides guidance on actions to take to investigate the issue.

Step 4: Determine Resolution Steps#

Both the Data Center (Immediate) Action and Investigatory Action columns give a short Resolution Bucket name that summarizes common actions shared by different Xid codes. The actual steps to take for each bucket are defined on the Fault Resolution Buckets worksheet.

For example, if the data center action indicates the fault resolution bucket “RESET_GPU”, the row for RESET_GPU in the Fault Resolution Buckets worksheet provides guidance on the exact actions to take.

Similarly, the resolution steps for investigatory actions are presented on the same worksheet.
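
Steps 2 through 4 amount to two table lookups: Xid code to resolution buckets, then bucket to resolution text. The following is a minimal sketch of that flow, not an NVIDIA tool. The dictionaries hold only three rows transcribed from the “Xids” table, the bucket text is a shortened paraphrase of the Resolution Buckets table, and the names `XID_CATALOG`, `RESOLUTION_BUCKETS`, and `resolve` are hypothetical. A real tool would load the full spreadsheet instead.

```python
# Xid code -> (immediate action bucket, investigatory action bucket).
# Three rows transcribed from the "Xids" table for illustration only.
XID_CATALOG = {
    8: ("RESTART_APP", "CONTACT_SUPPORT"),
    11: ("RESTART_APP", "CHECK_APP/CUDA"),
    13: ("RESTART_APP", "WORKFLOW_XID_13"),
}

# Bucket -> resolution text, paraphrased and shortened from the Resolution Buckets table.
RESOLUTION_BUCKETS = {
    "RESTART_APP": "Restart the application; RESET_GPU or RESTART_BM is not deemed necessary.",
    "CONTACT_SUPPORT": "Contact your support organization for further investigation.",
    "CHECK_APP/CUDA": "Likely an application passing bad data; use the PID, if present, to find the source.",
    "WORKFLOW_XID_13": "Follow the WORKFLOW_XID_13 row of the Resolution Buckets table.",
}


def resolve(xid_code):
    """Return both buckets and their resolution text, or None for an unknown code."""
    row = XID_CATALOG.get(xid_code)
    if row is None:
        return None
    immediate, investigatory = row
    return {
        "immediate": (immediate, RESOLUTION_BUCKETS[immediate]),
        "investigatory": (investigatory, RESOLUTION_BUCKETS[investigatory]),
    }


if __name__ == "__main__":
    result = resolve(13)
    print("Immediate:", result["immediate"])
    print("Investigatory:", result["investigatory"])
```

Keeping the two tables separate mirrors the spreadsheet layout, so a change to a bucket's resolution steps applies to every Xid that references it.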