Working with Xid Errors#

Viewing Xid Messages#

Under Linux, Xid error messages are written to the kernel log buffer. Depending on the Linux distribution, they are typically also captured by the system journal and flushed to files such as /var/log/messages or /var/log/syslog.

Grep for “NVRM: Xid” to find all the Xid messages.
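
Depending on how a distribution stores kernel logs, any of the following commands can be used for this search; the log file paths shown are common defaults and may differ on your system:

    # Search the kernel ring buffer directly
    sudo dmesg | grep "NVRM: Xid"

    # Search kernel messages captured by the systemd journal
    sudo journalctl -k | grep "NVRM: Xid"

    # Search log files that the kernel log is flushed to (paths vary by distribution)
    grep "NVRM: Xid" /var/log/syslog /var/log/messages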

The following is an example of an Xid string:

[...] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7
[...] NVRM: Xid (0000:03:00): 14, Channel 00000001
  • The first Xid in the log file is preceded by a line that contains the GPU GUID and device IDs. In the above example, “GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7” is the GUID, a globally unique, immutable identifier for each GPU.

  • Each subsequent Xid line contains the device ID, the Xid error identifier, and data specific to that Xid error. In the above example:

    • “0000:03:00” is the PCI domain, bus, and device ID of the GPU that reported the error.

    • “14” is the Xid error identifier.

    • “Channel 00000001” is data specific to that Xid error.
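
As an illustrative sketch (not part of the NVIDIA tooling), these fields can be extracted from the kernel log and tallied. The sed expression below assumes the exact “NVRM: Xid (<device ID>): <number>, ...” format shown above and may need adjusting for other driver versions:

    # Count Xid occurrences per device ID and Xid number (illustrative only)
    sudo dmesg | sed -n 's/.*NVRM: Xid (\([^)]*\)): *\([0-9]*\).*/\1 Xid \2/p' | sort | uniq -c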

Tools that Provide Additional Information about Xid Errors#

NVIDIA provides three additional tools that may be helpful when dealing with Xid errors.

  • nvidia-smi is a command-line program that installs with the NVIDIA driver. It reports basic monitoring and configuration data about each GPU in the system. nvidia-smi can list ECC error counts (Xid 48), indicate if a power cable is unplugged (Xid 54), or provide any applicable GPU Recovery Action (Xid 154), among other things. Please see the nvidia-smi man page for more information. Run nvidia-smi -q for basic output. Example nvidia-smi and DCGM commands are shown after this list.

  • NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies including power and clock management. DCGM diagnostics is a health checking tool that can check for basic GPU health, including the presence of ECC errors, PCIe problems, bandwidth issues, and general problems with running CUDA programs.

    DCGM is documented and downloadable at https://developer.nvidia.com/dcgm

  • nvidia-bug-report.sh is a script that installs with the NVIDIA driver. It collects debug logs and command outputs from the system, including kernel logs and logs collected by the NVIDIA driver itself. The command should be run as root:

    sudo nvidia-bug-report.sh
    

    The output of this tool is a single compressed text file, nvidia-bug-report.log.gz, that can be included when reporting problems to NVIDIA.

    nvidia-bug-report.sh will typically run quickly, but in rare cases it may run slowly; allow up to one hour for it to complete. If the command appears to hang, rerun it with the following additional arguments:

    nvidia-bug-report.sh --safe-mode --extra-system-data
    

    This collects an alternative set of logs in a way that is intended to avoid common causes of hangs during debug collection.
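
As a quick, non-exhaustive sketch of how the first two tools above might be invoked (the GPU index and DCGM run level below are arbitrary examples; consult each tool's documentation for the options relevant to your issue):

    # Detailed per-GPU query, limited to ECC status and error counts
    nvidia-smi -q -d ECC

    # Full query for a single GPU (index 0 chosen arbitrarily)
    nvidia-smi -q -i 0

    # Quick DCGM health diagnostic (run level 1); requires DCGM to be installed
    dcgmi diag -r 1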

Analyzing Xid Errors#

The following lists the recommended actions to take for various issues encountered.

  • Suspected user programming issues: Run the debugger tools. Refer to the Compute Sanitizer “memcheck” tool and CUDA-GDB documentation.

  • Suspected hardware problems: Contact the hardware vendor, who can run through their hardware diagnostic process.

  • Suspected driver problems: File a bug with NVIDIA, including the output of nvidia-bug-report.sh. Refer to the GPU Debug Guidelines document for guidance on gathering additional information to provide to NVIDIA and on troubleshooting common Xid causes.
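
For suspected user programming issues, a minimal sketch of running the debugger tools follows; “./my_app” is a placeholder for your own CUDA application:

    # Check the application for memory errors with the Compute Sanitizer memcheck tool
    compute-sanitizer --tool memcheck ./my_app

    # Debug the application interactively with CUDA-GDB
    cuda-gdb ./my_app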