vGPU Configuration Issues
When attempting to start a GPU-enabled VM on vSphere, the VM fails to start and displays only a generic "out of resources" error. The error typically masks a more specific problem, such as insufficient GPU capacity for the requested vGPU profile or a GPU that the hypervisor has not correctly assigned or recognized.
Next Steps
Review available GPU resources to ensure sufficient capacity for the requested vGPU profile (a scripted check is sketched below).
Verify that the hypervisor properly assigns and recognizes the GPU.
Follow this checklist for a step-by-step resolution.
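The following Python sketch wraps the capacity checks above. It assumes shell access to a host where the NVIDIA vGPU manager is installed, so that nvidia-smi is on the path (on ESXi, run it in an SSH session on the host):

```python
import subprocess

def run(cmd):
    # Run a host command and return its stdout.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Framebuffer per physical GPU: a vGPU profile can only be placed on a GPU
# with enough unused framebuffer for that profile.
print(run(["nvidia-smi",
           "--query-gpu=index,name,memory.total,memory.used",
           "--format=csv"]))

# Detailed per-vGPU state as reported by the vGPU manager.
print(run(["nvidia-smi", "vgpu", "-q"]))
```

If memory.used plus the requested profile's framebuffer exceeds memory.total on every GPU in the host, the "out of resources" error is expected and capacity must be freed or added.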
When attempting to start a sixth vGPU VM on a KVM hypervisor with an SR-IOV-capable GPU, the VM hangs and fails to start. This occurs because PCIe Alternative Routing-ID Interpretation (ARI) is disabled in the system BIOS, so the virtual functions beyond the fifth fail to enumerate and are reported as "rev ff" instead of "rev a1."
Next Steps
Access the System BIOS.
Enable PCIe Alternative Routing-ID Interpretation (ARI), then save and reboot the host; the check sketched below verifies that all virtual functions now enumerate correctly.
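As a minimal verification sketch, the following Python snippet scans lspci output on a Linux KVM host for NVIDIA functions still reporting "rev ff" (the string matching is deliberately simple and illustrative):

```python
import subprocess

# SR-IOV virtual functions that failed to enumerate show up in lspci with
# "rev ff" instead of the GPU's real revision (e.g. "rev a1").
out = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
bad = [line for line in out.stdout.splitlines()
       if "NVIDIA" in line and "rev ff" in line]
if bad:
    print("Functions that failed to enumerate:")
    print("\n".join(bad))
else:
    print("All NVIDIA PCIe functions enumerated correctly.")
```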
Users looking to assign a single physical GPU to multiple VMs may face limitations with GPU passthrough, which only supports a one-to-one mapping between a GPU and a VM. NVIDIA vGPU is the usual solution for sharing an entire GPU across VMs. If hardware partitioning and strict resource isolation are required in a multi-tenant environment, vGPUs can be placed within MIG partitions, allowing multiple VMs to share a single GPU efficiently while maintaining resource isolation. Note, however, that MIG-backed vGPUs are not yet available for this use case: none of the GPUs actively supported by the vGPU graphics line of products support MIG.
Next Steps
Use vGPU profiles to share the entire GPU across VMs.
If strict isolation between tenants is required and the GPU supports it, consider enabling MIG mode to partition the GPU into isolated slices (a sketch follows this list). For details, refer to the MIG documentation.
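A minimal sketch of the MIG workflow on a MIG-capable GPU, assuming root access and an idle GPU; the GPU index is illustrative:

```python
import subprocess

GPU_INDEX = "0"  # illustrative: the MIG-capable GPU to partition

# Enable MIG mode (requires root; the GPU must be idle, and a GPU reset or
# host reboot may be needed before the mode change takes effect).
subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "-mig", "1"], check=True)

# List the GPU instance profiles this card supports; instances are then
# created with "nvidia-smi mig -cgi <profile-ids>" as described in the
# MIG documentation.
out = subprocess.run(["nvidia-smi", "mig", "-lgip"],
                     capture_output=True, text=True, check=True)
print(out.stdout)
```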
Combining GPUs of different architectures in the same node, such as Ampere- and Ada-based GPUs, is unsupported because the Resource Manager (RM) software and hardware differ between architectures. While NVIDIA's mixed-size vGPU mode allows different vGPU profiles on the same GPU, it does not allow mixing GPU architectures within a single node.
Next Steps
Use a single GPU architecture per host to ensure compatibility with the vGPU manager; the check sketched below flags hosts with mixed GPU models.
To test multiple architectures, separate them across different nodes.
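A quick homogeneity check in Python, assuming nvidia-smi is available on the host; distinct model names are used here as a simple proxy for distinct architectures:

```python
import subprocess

# List the model name of every GPU on this host; more than one distinct
# model is a strong hint that architectures are mixed.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
models = {line.strip() for line in out.stdout.splitlines() if line.strip()}
if len(models) > 1:
    raise SystemExit(f"Mixed GPU models on this host: {sorted(models)}")
print(f"Single GPU model on this host: {models.pop()}")
```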