Introduction
This document explains what Xid messages are, and is intended to assist system administrators, developers, and FAEs in understanding the meaning behind these messages as an aid in analyzing and resolving GPU-related problems.
What Is an Xid Message
The Xid message is an error report from the NVIDIA driver that is printed to the operating system's kernel log or event log. Xid messages indicate that a general GPU error occurred, most often due to the driver programming the GPU incorrectly or to corruption of the commands sent to the GPU. The messages can be indicative of a hardware problem, an NVIDIA software problem, or a user application problem.
These messages provide diagnostic information that can be used by both users and NVIDIA to aid in debugging reported problems.
The meaning of each message is consistent across driver versions.
How to Use Xid Messages
Xid messages are intended to be used as debugging guides. Because many problems can have multiple possible root causes, it’s not always feasible to understand each issue from the Xid value alone.
For example, an Xid error might indicate that a user program tried to access invalid memory. In theory, however, memory corruption due to PCIe or frame buffer (“FB”) problems could corrupt any command and thus cause almost any error. Generally, the Xid classifications listed below should be used as a starting point for further investigation of each problem.
Working with Xid Errors
Viewing Xid Error Messages
Under Linux, the Xid error messages are placed in /var/log/messages.
Grep for “NVRM: Xid” to find all the Xid messages.
The following is an example of an Xid string:
[…] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7
[…] NVRM: Xid (0000:03:00): 14, Channel 00000001
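- The first Xid in the log file is preceded by a line that contains the GPU GUID and device IDs.
In the above example, GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7 is the GUID and 0000:03:00 is the device ID.
The GUID is a globally unique, immutable identifier for each GPU.
- Each subsequent Xid line contains the device ID, the Xid error, and information about the Xid.
In the above example, 0000:03:00 is the device ID, 14 is the Xid error, and Channel 00000001 is the information specific to that Xid.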
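The line format described above can be picked apart programmatically. The following is a minimal Python sketch, not an NVIDIA tool: the regular expression is modeled only on the example line shown in this section, and the names `XID_LINE`, `XidEvent`, and `parse_xid_lines` are hypothetical. Driver output may carry additional fields, so the Xid-specific text is kept unparsed.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumption: the pattern follows the example line in this section only.
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<xid>\d+)(?:, (?P<info>.*))?"
)


class XidEvent(NamedTuple):
    device_id: str  # device ID, e.g. "0000:03:00"
    xid: int        # Xid error number
    info: str       # Xid-specific information, left unparsed


def parse_xid_lines(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield one XidEvent for every line that contains an Xid report."""
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            info = (match["info"] or "").strip()
            yield XidEvent(match["device"], int(match["xid"]), info)


sample = [
    "[...] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7",
    "[...] NVRM: Xid (0000:03:00): 14, Channel 00000001",
]
events = list(parse_xid_lines(sample))
print(events)
```

To scan a real log, an open file object (for example, one opened on /var/log/messages) can be passed in place of `sample`, since the function accepts any iterable of lines.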
Tools That Provide Additional Information About Xid Errors
NVIDIA provides two additional tools that may be helpful when dealing with Xid errors.
nvidia-smi is a command-line program that installs with the NVIDIA driver. It reports basic monitoring and configuration data about each GPU in the system. nvidia-smi can list ECC error counts (Xid 48) and indicate if a power cable is unplugged (Xid 54), among other things. Please see the nvidia-smi man page for more information. Run ‘nvidia-smi -q’ for basic output.
NVIDIA Validation Suite (NVVS) is a health checking tool that is provided as part of the GPU Deployment Kit, located at https://developer.nvidia.com/gpu-deployment-kit. NVVS can check for basic GPU health, including the presence of ECC errors, PCIe problems, bandwidth issues, and general problems with running CUDA programs. NVVS documentation is included in the GPU Deployment Kit.
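When nvidia-smi output needs to be collected from a script, a small wrapper can help. The following Python sketch is illustrative only: `build_query_command` and `run_query` are hypothetical helper names, and the `-d ECC` filter is assumed to be supported by the installed nvidia-smi version (check the man page to confirm). The wrapper returns `None` when nvidia-smi is not on the PATH, so it can also run on machines without the driver.

```python
import shutil
import subprocess
from typing import List, Optional


def build_query_command(display: Optional[str] = None) -> List[str]:
    """Build an nvidia-smi query command, optionally limited to one section."""
    command = ["nvidia-smi", "-q"]
    if display:
        # Assumption: -d restricts the query output to the named section.
        command += ["-d", display]
    return command


def run_query(display: Optional[str] = None) -> Optional[str]:
    """Return nvidia-smi output, or None if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        build_query_command(display), capture_output=True, text=True
    )
    return result.stdout if result.returncode == 0 else None


report = run_query("ECC")
print(report if report is not None else "nvidia-smi is not available here")
```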
Analyzing Xid Errors
The following table lists the recommended actions to take for various issues encountered.
Issue | Recommended Action |
---|---|
Suspected User Programming Issues | Run the debugger tools. See the cuda-memcheck and cuda-gdb docs at http://docs.nvidia.com/cuda/index.html |
Suspected Hardware Problems | Contact the hardware vendor. They can run through their hardware diagnostic process. |
Suspected Driver Problems | File a bug with NVIDIA. |
Xid Error Listing
The following table lists the Xid errors along with the potential causes for each.
XID | Failure | Causes | ||||||
---|---|---|---|---|---|---|---|---|
HW Error | Driver Error | User App Error | System Memory Corruption | Bus Error | Thermal Issue | FB Corruption | ||
1 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
|||
2 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
|||
3 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
|||
4 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
|||
GPU semaphore timeout |
X |
X |
X |
X |
X |
|||
5 |
Unused |
|||||||
6 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
|||
7 |
Invalid or corrupted push buffer address |
X |
X |
X |
||||
8 |
GPU stopped processing |
X |
X |
X |
X |
|||
9 |
Driver error programming GPU |
X |
||||||
10 |
Unused |
|||||||
11 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
|||
12 |
Driver error handling GPU exception |
X |
||||||
13 |
Graphics Engine Exception |
X |
X |
X |
X |
X |
X |
|
14 |
Unused |
|||||||
15 |
Unused |
|||||||
16 |
Display engine hung |
X |
||||||
17 |
Unused |
|||||||
18 |
Bus mastering disabled in PCI Config Space |
X |
||||||
19 |
Display Engine error |
X |
||||||
20 |
Invalid or corrupted Mpeg push buffer |
X |
X |
X |
X |
|||
21 |
Invalid or corrupted Motion Estimation push buffer |
X |
X |
X |
X |
|||
22 |
Invalid or corrupted Video Processor push buffer |
X |
X |
X |
X |
|||
23 |
Unused |
|||||||
24 |
GPU semaphore timeout |
X |
X |
X |
X |
X |
X |
|
25 |
Invalid or illegal push buffer stream |
X |
X |
X |
X |
X |
||
26 |
Framebuffer timeout |
X |
||||||
27 |
Video processor exception |
X |
||||||
28 |
Video processor exception |
X |
||||||
29 |
Video processor exception |
X |
||||||
30 |
GPU semaphore access error |
X |
||||||
31 |
GPU memory page fault |
X |
X |
|||||
32 |
Invalid or corrupted push buffer stream |
X |
X |
X |
X |
X |
||
33 |
Internal micro-controller error |
X |
||||||
34 |
Video processor exception |
X |
||||||
35 |
Video processor exception |
X |
||||||
36 |
Video processor exception |
X |
||||||
37 |
Driver firmware error |
X |
X |
X |
||||
38 |
Driver firmware error |
X |
||||||
39 |
Unused |
|||||||
40 |
Unused |
|||||||
41 |
Unused |
|||||||
42 |
Video processor exception |
X |
||||||
43 |
GPU stopped processing |
X |
X |
|||||
44 |
Graphics Engine fault during context switch |
X |
||||||
45 |
Preemptive cleanup, due to previous errors -- Most likely to see when running multiple cuda applications and hitting a DBE |
X |
||||||
46 |
GPU stopped processing |
X |
||||||
47 |
Video processor exception |
X |
||||||
48 |
Double Bit ECC Error |
X |
||||||
49 |
Unused |
|||||||
50 |
Unused |
|||||||
51 |
Unused |
|||||||
52 |
Unused |
|||||||
53 |
Unused |
|||||||
54 |
Auxiliary power is not connected to the GPU board |
|||||||
55 |
Unused |
|||||||
56 |
Display Engine error |
X |
X |
|||||
57 |
Error programming video memory interface |
X |
X |
X |
||||
58 |
Unstable video memory interface detected |
X |
X |
|||||
EDC error – clarified in printout |
X |
|||||||
59 |
Internal micro-controller error (older drivers) |
X |
||||||
60 |
Video processor exception |
X |
||||||
61 |
Internal micro-controller breakpoint/warning (newer drivers) |
|||||||
62 |
Internal micro-controller halt (newer drivers) |
X |
X |
X |
||||
63 |
ECC page retirement recording event |
X |
X |
X |
||||
64 |
ECC page retirement recording failure |
X |
X |
|||||
65 |
Video processor exception |
X |
X |
|||||
66 |
Illegal access by driver |
X |
X |
|||||
67 |
Illegal access by driver |
X |
X |
|||||
68 |
Video processor exception |
X |
X |
|||||
69 |
Graphics Engine class error |
X |
X |
|||||
70 |
CE3: Unknown Error |
X |
X |
|||||
71 |
CE4: Unknown Error |
X |
X |
|||||
72 |
CE5: Unknown Error |
X |
X |
|||||
73 |
NVENC2 Error |
X |
X |
|||||
74 |
NVLINK Error |
X |
X |
X |
||||
75 |
Reserved |
|||||||
76 |
Reserved |
|||||||
77 |
Reserved |
|||||||
78 |
vGPU Start Error |
X |
||||||
79 |
GPU has fallen off the bus |
X |
X |
X |
X |
X |
||
80 |
Corrupted data sent to GPU |
X |
X |
X |
X |
X |
||
81 |
VGA Subsystem Error |
X |
||||||
82 |
Reserved |
|||||||
83 |
Reserved |
|||||||
84 |
Reserved |
|||||||
85 |
Reserved |
|||||||
86 |
Reserved |
|||||||
87 |
Reserved |
|||||||
88 |
Reserved |
|||||||
89 |
Reserved |
|||||||
90 |
Reserved |
|||||||
91 |
Reserved |
|||||||
92 |
High single-bit ECC error rate |
X |
X |
Common XID Errors
This section provides more information on some common Xid errors.
XID 13: GR: SW Notify Error
This event is logged for general user application faults. Typically this is an out-of-bounds error where the user has walked past the end of an array, but could also be an illegal instruction, illegal register, or other case.
In rare cases, a hardware failure or a system software bug can also surface as XID 13.
When this event is logged, NVIDIA recommends the following:
- Run the application in cuda-gdb or cuda-memcheck, or
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
- File a bug if the previous two steps are inconclusive, to rule out a potential NVIDIA driver or hardware bug.
Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read.
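The second recommendation, running with CUDA_DEVICE_WAITS_ON_EXCEPTION=1, can be scripted. The following Python sketch shows one possible way to launch an application with that variable set while leaving the launcher's own environment untouched. The helper names `env_waiting_on_exception` and `launch` are hypothetical, and the child process here is a stand-in that only echoes the variable; a real CUDA application command would take its place.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional


def env_waiting_on_exception(
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the wait-on-exception flag set."""
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_WAITS_ON_EXCEPTION"] = "1"
    return env


def launch(command: List[str]) -> subprocess.Popen:
    """Start the application without blocking, so cuda-gdb can attach later."""
    return subprocess.Popen(command, env=env_waiting_on_exception())


# Stand-in for a real CUDA application: a child that echoes the variable.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_DEVICE_WAITS_ON_EXCEPTION'])",
]
process = launch(child)
process.wait()
```

Using `Popen` rather than a blocking call keeps the launcher free to report the process ID, which is what a later cuda-gdb attach needs.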
XID 31: Fifo: MMU Error
This event is logged when a fault is reported by the MMU, such as when an illegal address access is made by an applicable unit on the chip. Typically these are application-level bugs, but they can also be driver bugs or hardware bugs.
When this event is logged, NVIDIA recommends the following:
- Run the application in cuda-gdb or cuda-memcheck, or
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
- File a bug if the previous two steps are inconclusive, to rule out a potential NVIDIA driver or hardware bug.
Note: The cuda-memcheck tool instruments the running application and reports which line of code performed the illegal read.
XID 32: PBDMA Error
This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions.
XID 43: RESET CHANNEL VERIF ERROR
This event is logged when a user application hits a software induced fault and must terminate. The GPU remains in a healthy state.
In most cases, this is not indicative of a driver bug but rather a user application error.
XID 45: OS: Preemptive Channel Removal
This event is logged when the user application aborts and the kernel driver tears down the GPU application running on the GPU. Ctrl-C, GPU resets, and SIGKILL are all examples of cases where the application is aborted and this event is created.
In many cases, this is not indicative of a bug but rather a user or system action.
XID 48: DBE (Double Bit Error) ECC Error
This event is logged when the GPU detects that an uncorrectable error has occurred on the GPU. This is also reported back to the user application. A GPU reset or node reboot is needed to clear this error.
The tool nvidia-smi can provide a summary of ECC errors. See “Tools That Provide Additional Information About Xid Errors”.
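For triage scripts, the guidance in this section can be condensed into a lookup table. The following Python sketch is one possible encoding, not an official classification: the names come from the headings in this section, the `first_step` strings are paraphrases of the text above, and `XidNote`, `COMMON_XIDS`, and `summarize` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class XidNote(NamedTuple):
    name: str        # heading used in this section
    first_step: str  # paraphrase of the guidance in this section


COMMON_XIDS: Dict[int, XidNote] = {
    13: XidNote("GR: SW Notify Error", "debug with cuda-gdb or cuda-memcheck"),
    31: XidNote("Fifo: MMU Error", "debug with cuda-gdb or cuda-memcheck"),
    32: XidNote("PBDMA Error", "investigate PCI quality issues"),
    43: XidNote("RESET CHANNEL VERIF ERROR", "review the user application"),
    45: XidNote("OS: Preemptive Channel Removal", "check for an application abort"),
    48: XidNote("DBE (Double Bit Error) ECC Error", "reset the GPU or reboot the node"),
}


def summarize(xids: Iterable[int]) -> Dict[int, str]:
    """Count each Xid number and attach the note for the ones listed above."""
    summary: Dict[int, str] = {}
    for xid, count in Counter(xids).items():
        note = COMMON_XIDS.get(xid)
        if note is None:
            summary[xid] = f"{count}x not covered in this section"
        else:
            summary[xid] = f"{count}x {note.name}: {note.first_step}"
    return summary


for xid, text in sorted(summarize([13, 13, 48, 79]).items()):
    print(f"Xid {xid}: {text}")
```

Xid numbers that are not described in this section are reported as such rather than guessed at, which keeps the summary a starting point for investigation and not a diagnosis.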
Notices
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.