Debugging Issues

Script to Capture NVIDIA Bug Report After XID 119 and 120 on Host

Hosts may report XID 119 and XID 120 errors. To capture the debug logs needed to troubleshoot them, start the provided script after a reboot and before any workload starts, so that it is already running when one of these XIDs occurs.

  • XID 119: Indicates that the GPU System Processor (GSP) has entered a hang state.

  • XID 120: Indicates a GSP crash.

The script monitors XIDs (e.g., 119, 120) and captures additional debug logs for effective troubleshooting.

Next Steps

  1. Driver Setup: Install vGPU driver version 16.6 or 17.2, or NVIDIA AI Enterprise (NV-AIE) 5.0 or 4.2, on hosts with Hopper or Ada architecture GPUs.

  2. Run Script:

    1. Execute the script after rebooting and before starting workloads.

    2. Specify XIDs to monitor (e.g., 119, 120):

      • Linux: ./nvidia-xid-monitor-linux.sh 119,120

      • ESXi: ./nvidia-xid-monitor-vmware.sh 119,120

    3. When a monitored XID occurs, the script generates a bug report and saves it with a timestamp.

  3. Wait for Logs:

    1. After XID 119 or 120 occurs, avoid rebooting, shutting down, or migrating VMs until the script completes log generation.
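Before expecting a bug report from the script, it can help to confirm that an XID 119 or 120 event was actually logged. The following is a minimal sketch, not part of the NVIDIA monitor scripts: `filter_xid` is a hypothetical helper, and it assumes the driver writes XIDs to the kernel log in the common `NVRM: Xid (PCI:...): <number>,` form.

```shell
#!/bin/sh
# filter_xid: hypothetical helper, not part of the NVIDIA monitor scripts.
# Reads kernel log lines on stdin and prints those that report one of the
# given XIDs. Assumes the common "NVRM: Xid (PCI:...): <number>," format.
filter_xid() {
    # Turn a list such as "119,120" into the alternation "119|120".
    xids=$(printf '%s' "$1" | tr ',' '|')
    grep -E "NVRM: Xid \([^)]*\): (${xids}),"
}

# Example (reading the kernel log may require root privileges):
#   dmesg | filter_xid 119,120
```

The XID list uses the same comma-separated form that is passed to the monitor scripts, so the same argument can be reused for both.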

For more detailed instructions and additional information, see the full article.

Generate a Log File for Support

When troubleshooting vGPU-related issues, providing the correct logs and diagnostic information is essential for faster resolution.

Next Steps

  1. Generate an NVIDIA bug report from the host and attach the generated log file when reaching out for support.

  2. Collect system information from a Windows VM with msinfo32, save the report as a .nfo file, and attach it when requesting support.

  3. Use NVIDIA SMI commands for debugging:

    1. Monitor GPU usage by application: nvidia-smi vgpu -p

    2. List frame buffer capture (FBC) sessions: nvidia-smi vgpu -fs

    3. Check encoder session usage (should be minimal for vGPU workloads): nvidia-smi vgpu -es

  4. Attach the nvidia-bug-report from the host and the msinfo32 report from the Windows VM when reporting an issue. This ensures a faster and more accurate diagnosis.
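Because a support request needs both the host bug report and the Windows VM report, packing them into one archive reduces the chance of sending an incomplete set. The sketch below is one possible way to do that. `bundle_support_files` is a hypothetical helper, and the file names in the example are placeholders rather than names guaranteed by any NVIDIA tool.

```shell
#!/bin/sh
# bundle_support_files: hypothetical helper that packs the given files into
# a single gzip-compressed tar archive for attaching to a support request.
# Usage: bundle_support_files ARCHIVE FILE...
bundle_support_files() {
    archive=$1
    shift
    # Refuse to build an incomplete bundle if any input file is missing.
    for f in "$@"; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done
    tar -czf "$archive" "$@"
}

# Example (file names are placeholders):
#   bundle_support_files "support-$(date +%Y%m%d-%H%M%S).tar.gz" \
#       nvidia-bug-report.log.gz vm-sysinfo.nfo
```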

Capturing Debug Logs for the nvidia-topologyd Service

The nvidia-topologyd service runs on the vGPU host and generates a virtual PCIe topology XML file at /var/run/nvidia-topologyd/virtualTopology.xml. This file contains information about which vGPU is attached to which physical NUMA node on the server, and is used by applications running inside a VM to be aware of PCIe device placement.

The service generates the XML file at system startup and then goes into an inactive (dead) state. This is normal behavior — the service does not run continuously because the physical GPU topology does not change after startup.

When NVIDIA Support requests debug logs from this service, use the following procedure:

Next Steps

  1. Verify the current state of the service:

    /usr/bin/systemctl status nvidia-topologyd.service
    

    An inactive (dead) status with a status=0/SUCCESS exit code is expected and indicates normal operation.

  2. To generate debug output, set the logging level to 2 in the configuration file /etc/nvidia/nvidia-topologyd.conf.

  3. Restart the VM or restart the nvidia-topologyd service.

  4. Capture the debug output from /var/log/messages and provide it to NVIDIA Support along with the standard nvidia-bug-report from the host.
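The last step can be sketched as a small filter over the system log. This is a hedged example: `topologyd_log_lines` is a hypothetical helper, and it assumes that the service tags its messages with the string `nvidia-topologyd`, which should be checked against the actual log output.

```shell
#!/bin/sh
# topologyd_log_lines: hypothetical helper. Prints the lines of a
# syslog-style file that mention nvidia-topologyd so that they can be saved
# and sent to NVIDIA Support. Defaults to /var/log/messages.
topologyd_log_lines() {
    logfile=${1:-/var/log/messages}
    # -F matches the service name as a fixed string, not as a pattern.
    grep -F 'nvidia-topologyd' "$logfile"
}

# Example, after raising the logging level in the configuration file:
#   /usr/bin/systemctl restart nvidia-topologyd.service
#   topologyd_log_lines > nvidia-topologyd-debug.log
```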

Note

For GPU Operator deployments, the nvidia-topologyd service is started as a daemon by the driver container itself, not through systemd. In this case, the nvidia-topologyd.conf configuration must be provided as a ConfigMap using the driver.virtualTopology.config parameter.

For more detailed instructions and additional information, see the full article.

Ada Lovelace GPU Server Crashes with XID Errors 119 and 120 When Multiple VMs Are Booted or Shut Down Simultaneously

When multiple VMs on a host with an Ada Lovelace GPU are simultaneously booted or shut down, XID errors 119 and 120 appear in the log files on the hypervisor host, potentially followed by a server crash. The root cause is a memory handling error condition in the NVIDIA Virtual GPU Manager. This issue was resolved in NVIDIA vGPU software 16.6 and 17.2.
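To check whether an installed release already contains the fix, the release number can be compared against the fixed versions named above. The following is a minimal sketch: `vgpu_has_fix` is a hypothetical helper, it relies on `sort -V` (available in GNU coreutils and BusyBox), and it treats branches 19 and 20 as fixed because this section lists them as not affected.

```shell
#!/bin/sh
# vgpu_has_fix: hypothetical helper. Returns 0 if the given vGPU software
# release (for example "16.5") contains the fix, 1 if it does not, and 2 if
# the branch is not one named in this section.
vgpu_has_fix() {
    release=$1
    branch=${release%%.*}
    case $branch in
        16) fixed=16.6 ;;
        17) fixed=17.2 ;;
        19|20) return 0 ;;   # listed in this section as not affected
        *) return 2 ;;
    esac
    # Version sort places the lower release first; the fix is present when
    # the fixed release is not newer than the installed one.
    lowest=$(printf '%s\n%s\n' "$fixed" "$release" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

# Example:
#   vgpu_has_fix 16.5 || echo "upgrade the Virtual GPU Manager"
```

Version sort is used instead of a plain string or numeric comparison so that a release such as 16.10 is correctly ordered after 16.6.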

Next Steps

  1. Upgrade the NVIDIA Virtual GPU Manager to the latest release in the branch that you are using. For currently supported vGPU software releases:

    • vGPU branch 16: upgrade to 16.6 or later

    • vGPU branches 19 and 20: not affected, because the fix was already incorporated before those releases