vGPU Configuration Issues

vGPU VM Startup Issue on vSphere – “Out of Resources” Error

When attempting to start a GPU-enabled VM on vSphere, the VM fails to start and only displays a generic “out of resources” error. This issue can occur due to various configuration or resource allocation problems.

Next Steps

  1. Review available GPU resources to ensure sufficient capacity.

  2. Verify that the hypervisor properly assigns and recognizes the GPU.

  3. Follow this checklist for a step-by-step resolution.
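As a rough aid for the capacity review in step 1, the arithmetic can be sketched as below. This is a minimal sketch, not an official sizing method: it assumes every vGPU on the physical GPU uses the same profile, and it ignores any per-profile instance caps the vGPU manager may enforce. The function names and the example sizes are illustrative assumptions.

```python
def max_vgpus_per_gpu(total_fb_mb: int, profile_fb_mb: int) -> int:
    """Upper bound on same-profile vGPUs that fit in one GPU's frame buffer.

    Assumption: all vGPUs on the GPU use one profile size; per-profile
    instance limits are not modelled here.
    """
    if total_fb_mb <= 0 or profile_fb_mb <= 0:
        raise ValueError("frame buffer sizes must be positive")
    return total_fb_mb // profile_fb_mb


def has_capacity(total_fb_mb: int, profile_fb_mb: int, running_vgpus: int) -> bool:
    """Return True if one more vGPU of this profile would still fit."""
    return running_vgpus < max_vgpus_per_gpu(total_fb_mb, profile_fb_mb)


if __name__ == "__main__":
    # Illustrative numbers: a 48 GB GPU (49152 MB) and an 8 GB profile (8192 MB).
    print(max_vgpus_per_gpu(49152, 8192))
    print(has_capacity(49152, 8192, running_vgpus=6))
```

If `has_capacity` returns False for the host that vSphere selected, the generic error may simply reflect a GPU that is already full for that profile.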

vGPU VM Startup Issue on KVM Hypervisor

When attempting to start a sixth vGPU VM on a KVM hypervisor with an SR-IOV capable GPU, the VM fails to start and hangs. This occurs because PCIe Alternate Routing ID (ARI) is disabled in the System BIOS, which causes virtual devices beyond the fifth to be marked as “rev ff” instead of “rev a1”.

Next Steps

  1. Access the System BIOS.

  2. Enable PCIe Alternate Routing ID (ARI).

For more detailed instructions and additional information, visit the full article here.
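To confirm the symptom before changing the BIOS, the “rev ff” devices can be picked out of `lspci` output. The sketch below assumes the text was captured with something like `lspci -d 10de:` (10de is the NVIDIA PCI vendor ID) and that each line ends with a `(rev xx)` field. The function name and the sample addresses and device IDs are illustrative, not taken from a real host.

```python
import re
from typing import List

# Matches lines such as:
#   41:00.5 3D controller: NVIDIA Corporation Device 2235 (rev ff)
_LSPCI_LINE = re.compile(
    r"^(?P<addr>\S+)\s+.*NVIDIA.*\(rev (?P<rev>[0-9a-f]{2})\)\s*$",
    re.IGNORECASE,
)


def find_rev_ff_devices(lspci_output: str) -> List[str]:
    """Return the PCI addresses of NVIDIA devices reported as 'rev ff'."""
    affected = []
    for line in lspci_output.splitlines():
        match = _LSPCI_LINE.match(line.strip())
        if match and match.group("rev").lower() == "ff":
            affected.append(match.group("addr"))
    return affected


if __name__ == "__main__":
    # Illustrative sample; real output comes from running lspci on the host.
    sample = (
        "41:00.4 3D controller: NVIDIA Corporation Device 2235 (rev a1)\n"
        "41:00.5 3D controller: NVIDIA Corporation Device 2235 (rev ff)\n"
    )
    print(find_rev_ff_devices(sample))
```

An empty list after enabling ARI and rebooting would suggest that the virtual devices are being enumerated correctly again.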

Sharing a Single GPU Across Multiple VMs

GPU passthrough cannot assign a single physical GPU to multiple VMs because it does not support one-to-many relationships. NVIDIA vGPU is the usual way to share an entire GPU across VMs. Where a multi-tenant environment requires hardware partitioning and strict resource isolation, use vGPU within MIG partitions so that multiple VMs share a single GPU while remaining isolated from one another. MIG-backed vGPUs are not yet available, and no GPUs actively supported with the vGPU Graphics line of products are available.

Next Steps

  1. Use vGPU profiles to share the entire GPU across VMs.

  2. If multi-tenancy is required, consider enabling MIG mode to partition the GPU into isolated slices. For details, refer to the MIG documentation.

Mixing Different GPUs in a Single Node

Combining GPUs of different architectures in the same node, such as Ampere-based and Ada-based GPUs, is unsupported because their Resource Manager (RM) software and hardware differ. NVIDIA’s mixed-size vGPU mode allows different vGPU profiles on the same GPU, but it does not enable mixing different GPU architectures within a single node.

Next Steps

  1. Use a single GPU architecture per host to ensure compatibility with the vGPU manager.

  2. To test multiple architectures, separate them across different nodes.
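A host inventory can be checked for mixed architectures with a small lookup, sketched below. It assumes the GPU names come from `nvidia-smi --query-gpu=name --format=csv,noheader`, one name per line. The lookup table is deliberately partial and illustrative, and names it does not know are reported separately rather than guessed.

```python
from typing import Dict, Iterable, Set

# Partial, illustrative mapping of product names to architectures.
# Extend it for the boards actually present in your environment.
ARCHITECTURE_BY_NAME: Dict[str, str] = {
    "NVIDIA A40": "Ampere",
    "NVIDIA A16": "Ampere",
    "NVIDIA RTX A6000": "Ampere",
    "NVIDIA L4": "Ada",
    "NVIDIA L40": "Ada",
}


def architectures_in_host(gpu_names: Iterable[str]) -> Set[str]:
    """Return the set of architectures found; unknown names are kept distinct."""
    found = set()
    for name in gpu_names:
        name = name.strip()
        if not name:
            continue
        found.add(ARCHITECTURE_BY_NAME.get(name, "unknown: " + name))
    return found


def is_mixed(gpu_names: Iterable[str]) -> bool:
    """True if more than one architecture (or an unknown board) is mixed in."""
    return len(architectures_in_host(gpu_names)) > 1


if __name__ == "__main__":
    print(is_mixed(["NVIDIA A40", "NVIDIA L40"]))
```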

vGPU VM Fails to Start on AMD EPYC Platform When AER Is Disabled in BIOS

On servers using AMD EPYC processors, a VM configured with a vGPU profile that supports SR-IOV may fail to start on an Ubuntu KVM hypervisor. The error is caused by the ‘Enable AER Cap’ setting being disabled in the server BIOS. The following error may appear in the logs:

error: Failed to start domain instance-00001876
error: internal error: qemu unexpectedly closed the monitor:
2023-05-08T08:56:23.279918Z qemu-system-x86_64: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/6bd25631-d901-43d1-be80-c1e3283bb32b,display=off,bus=pci.0,addr=0x7: vfio 6bd25631-d901-43d1-be80-c1e3283bb32b: failed to setup container for group 116: Failed to set iommu for container: Invalid argument

Next Steps

  1. Access the server BIOS settings.

  2. Enable the following settings:

    • Enable AER Cap (set to Auto or Enabled)

    • IOMMU

    • PCIe ARI Support

    • PCIe 10 Bit Tag Support

    • Above 4G Decoding

    • SR-IOV Support

  3. Save the BIOS settings and reboot the server.

For more detailed instructions and additional information, visit the full article here.
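When many hosts need checking, the error signature shown above can be searched for in collected log text. The sketch below is a hypothetical helper, not part of libvirt or QEMU. It matches only the `failed to setup container ... Failed to set iommu for container` message quoted in this section and returns the mdev UUID and IOMMU group it names.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the error message quoted in this section.
_IOMMU_FAILURE = re.compile(
    r"vfio (?P<uuid>[0-9a-f-]{36}): failed to setup container for group "
    r"(?P<group>\d+): Failed to set iommu for container"
)


def find_iommu_container_failure(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (mdev UUID, IOMMU group) for the first matching error, else None."""
    match = _IOMMU_FAILURE.search(log_text)
    if match is None:
        return None
    return match.group("uuid"), int(match.group("group"))


if __name__ == "__main__":
    line = (
        "vfio 6bd25631-d901-43d1-be80-c1e3283bb32b: failed to setup container "
        "for group 116: Failed to set iommu for container: Invalid argument"
    )
    print(find_iommu_container_failure(line))
```

A match on an AMD EPYC host is a hint to review the BIOS settings listed above; the message alone does not prove that the AER setting is the cause.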

vGPU VM Fails to Power On with NVIDIA RTX A6000 or NVIDIA A40

When configuring a VM to use an NVIDIA RTX A6000 or NVIDIA A40 GPU for virtual GPU, the VM may fail to power on if the card is not in the correct display mode, or if the SR-IOV BIOS setting is disabled on the hypervisor host. Two distinct error messages may be displayed:

  • If the card is not in the correct display mode: Error: Internal Error: Can't determine the number of VFs for PCI

  • If the card is in the correct display mode but the SR-IOV BIOS setting is disabled: Error: Failed, Starting VM name Internal error: Subprocess exited with unexpected code 1; stdout = [ Enabling VFs on 0000:86:00.0 ]; stderr = [ /usr/lib/nvidia/sriov-manage: line 204: echo: write error: Cannot allocate memory]

Next Steps

  1. If the card is not in the correct display mode, use the displaymodeselector tool to switch the card from display-enabled mode to displayless mode. The displaymodeselector tool is available from the NVIDIA Display Mode Selector Tool page on the NVIDIA Developer website.

  2. If the card is in the correct display mode but the SR-IOV BIOS setting is disabled, enable the VT-d/IOMMU and SR-IOV BIOS settings on the hypervisor host, then reboot the host.

  3. Ensure that all other prerequisites for using NVIDIA vGPU are met.

For more detailed instructions and additional information, visit the full article here.
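On a Linux KVM host, the kernel exposes SR-IOV state for each PCI device through the standard sysfs attributes `sriov_totalvfs` and `sriov_numvfs`. The sketch below reads both for a given PCI address. Linking the result to either error above is an assumption rather than documented behavior, and the `sysfs_root` parameter exists only so the function can be exercised against a test directory.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_vf_counts(
    pci_addr: str, sysfs_root: str = "/sys/bus/pci/devices"
) -> Optional[Tuple[int, int]]:
    """Return (total VFs supported, VFs currently enabled) for a PCI device.

    Returns None when the SR-IOV attributes are missing or unreadable,
    which may mean the device does not currently expose SR-IOV.
    """
    device_dir = Path(sysfs_root) / pci_addr
    try:
        total = int((device_dir / "sriov_totalvfs").read_text().strip())
        enabled = int((device_dir / "sriov_numvfs").read_text().strip())
    except (OSError, ValueError):
        return None
    return total, enabled


if __name__ == "__main__":
    # 0000:86:00.0 is the address that appears in the error message above.
    print(read_vf_counts("0000:86:00.0"))
```

A result of None, or a total greater than zero with no VFs enabled, is only a starting point for the display-mode and BIOS checks described in the steps above.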