vGPU Deployment#
This guide covers deploying NIM LLM in virtualized GPU environments using NVIDIA vGPU technology. vGPU enables multiple virtual machines to share a single physical GPU, making it a common deployment pattern in enterprise data centers and cloud environments.
Overview#
NVIDIA vGPU software creates virtual GPUs that can be shared across multiple virtual machines (VMs). NIM LLM supports deployment on vGPU-enabled VMs, allowing enterprises to run LLM inference workloads in their existing virtualized infrastructure.
When to use vGPU deployment:
Your organization uses VMware vSphere, Red Hat KVM, Citrix Hypervisor, or other supported hypervisors with NVIDIA vGPU.
You need to share GPU resources across multiple workloads or tenants.
Your IT policy requires VM-based isolation rather than container-only deployment.
You are deploying in an NVIDIA AI Enterprise environment.
Prerequisites#
Hardware#
NVIDIA GPU that supports vGPU (refer to the NVIDIA vGPU supported products)
Sufficient GPU frame buffer allocated to the VM for the target model (refer to Memory and frame buffer on vGPU in Troubleshooting)
Software#
NVIDIA vGPU software installed and licensed on the hypervisor host
NVIDIA vGPU guest driver installed inside the VM
NVIDIA Container Toolkit 1.14+ installed inside the VM
Docker 24.0+ or another OCI container runtime installed inside the VM
Linux guest OS (refer to the NVIDIA vGPU guest OS support matrix)
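The software checks above can be scripted from inside the VM. A minimal sketch, assuming a POSIX shell; nvidia-ctk is the CLI that ships with the NVIDIA Container Toolkit:

```shell
# Report whether each prerequisite is installed inside the guest VM.
if command -v nvidia-smi >/dev/null 2>&1; then
  driver_status="present ($(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo unknown))"
else
  driver_status="missing"
fi

if command -v docker >/dev/null 2>&1; then
  docker_status="present ($(docker --version))"
else
  docker_status="missing"
fi

if command -v nvidia-ctk >/dev/null 2>&1; then
  toolkit_status="present ($(nvidia-ctk --version 2>/dev/null | head -n 1))"
else
  toolkit_status="missing"
fi

echo "vGPU guest driver:        $driver_status"
echo "Docker:                   $docker_status"
echo "NVIDIA Container Toolkit: $toolkit_status"
```

Any component reported as missing must be installed before continuing.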
Verify vGPU Inside the VM#
Confirm that the vGPU is visible inside the VM:
nvidia-smi
Expected output shows a virtual GPU with allocated frame buffer:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.xx.xx              Driver Version: 560.xx.xx      CUDA Version: 12.x     |
|-------------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|===========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB-...        On  | 00000000:00:10.0 Off |                    0 |
| N/A   35C    P0             45W /  N/A  |      0MiB / 40960MiB |      0%      Default |
+-------------------------------------------+----------------------+----------------------+
Verify that the NVIDIA Container Toolkit is configured:
docker run --rm --gpus=all nvidia/cuda:12.6.3-base-ubuntu24.04 nvidia-smi
After you verify the vGPU and the container toolkit, deploy NIM on the VM by following the standard Quickstart instructions.
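As a reference point, a quickstart-style launch from inside the VM looks roughly like the following sketch. The image name, cache path, and port are illustrative examples, and NGC_API_KEY is assumed to be an NGC credential with NIM access:

```shell
export NGC_API_KEY=<your-ngc-api-key>   # assumption: an NGC key with NIM entitlement

docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest   # example image; pick one sized for your vGPU profile
```

Choose a model that fits the frame buffer of your vGPU profile (see Memory and frame buffer on vGPU below).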
Troubleshooting#
Memory and frame buffer on vGPU#
On vGPU, the usable GPU frame buffer inside the VM is less than the configured vGPU profile size because the vGPU software reserves a portion for its own use. For exact usable frame buffer per profile, refer to the NVIDIA vGPU documentation.
Choosing a vGPU profile: Select a vGPU profile (for example, A100-40C or H100-80C) that provides enough frame buffer for the model you intend to serve. As a guideline:
| Model Size | Approximate GPU Memory Required | Example vGPU Profiles |
|---|---|---|
| 7B–8B (FP16/BF16) | ~16 GB | A100-20C, H100-20C or larger |
| 7B–8B (FP8) | ~10 GB | A100-10C, H100-10C or larger |
| 13B (FP16/BF16) | ~28 GB | A100-40C, H100-40C or larger |
| 70B (FP8) | ~70 GB | Full GPU profile (A100-40C x2 or A100-80C) |
These are rough estimates. Actual memory usage depends on the model, quantization, context length, and KV cache configuration. Use list-model-profiles inside the VM to see which profiles are compatible with the available GPU memory.
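As a back-of-the-envelope check before choosing a profile, you can estimate the weight and KV cache memory yourself. The sketch below assumes a Llama-3-8B-like architecture (32 layers, 8 KV heads, head dimension 128); substitute your model's actual values:

```shell
# Rough memory estimate: weights + KV cache (all architecture values are assumptions)
PARAMS_B=8          # model parameters, in billions
BYTES_PER_PARAM=2   # FP16/BF16; use 1 for FP8
LAYERS=32
KV_HEADS=8
HEAD_DIM=128
CTX_LEN=8192        # maximum context length

# Weights: parameters (billions) * bytes per parameter, in GB
WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
KV_BYTES_PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_PARAM ))
KV_GB_PER_SEQ=$(( KV_BYTES_PER_TOKEN * CTX_LEN / 1024 / 1024 / 1024 ))

echo "weights: ~${WEIGHTS_GB} GB, KV cache: ~${KV_GB_PER_SEQ} GB per max-length sequence"
```

The result (roughly 16 GB of weights plus about 1 GB of KV cache per max-length sequence for this example) excludes activations and runtime overhead, so treat it as a lower bound when comparing against profile sizes.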
GPU memory utilization: In vGPU environments, NIM may automatically lower the default GPU memory utilization (--gpu-memory-utilization) to account for the reduced available frame buffer. If the memory estimator detects a shared-memory environment (vGPU or UMA), it sets the value to the greater of 0.5 or the estimated minimum required fraction.
You can override this using the passthrough environment variable:
docker run --gpus=all \
-e NIM_PASSTHROUGH_ARGS="--gpu-memory-utilization 0.85" \
...
Note
Setting GPU memory utilization too high in a vGPU environment can cause out-of-memory errors because the vGPU software reserves frame buffer that is not visible to the application. Start with a conservative value and increase gradually.
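One way to choose a starting value is to derive it from the profile size and your estimated model footprint. A sketch, where the 2 GB reservation is an illustrative assumption; check the NVIDIA vGPU documentation for the actual reserved frame buffer of your profile:

```shell
PROFILE_GB=40     # configured vGPU profile size (e.g., A100-40C)
RESERVED_GB=2     # assumed vGPU software reservation; verify against NVIDIA docs
REQUIRED_GB=28    # estimated model + KV cache footprint

UTILIZATION=$(awk -v req="$REQUIRED_GB" -v prof="$PROFILE_GB" -v res="$RESERVED_GB" \
  'BEGIN {
     cap = (prof - res) / prof   # never claim the frame buffer the vGPU software reserves
     u = req / prof              # fraction of visible memory the model is estimated to need
     if (u > cap) u = cap
     printf "%.2f", u
   }')
echo "suggested --gpu-memory-utilization: $UTILIZATION"
```

For these example numbers the suggestion is 0.70; raise it only after confirming the model loads and serves without out-of-memory errors.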
Profile Selection Reports Insufficient Memory#
If list-model-profiles shows all profiles as “Incompatible with system,” the vGPU profile does not have enough frame buffer for any available model profile.
Resolution:
Increase the vGPU profile size (allocate more frame buffer to the VM) at the hypervisor level.
Use a smaller model or a more aggressively quantized profile (FP8, MXFP4).
Decrease the maximum model length to lower the KV cache memory requirement.
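To lower the KV cache requirement, the maximum model length can typically be capped at launch time. A sketch, assuming the NIM_MAX_MODEL_LEN environment variable documented for NIM LLM; the value 8192 is an example:

```shell
# Cap the context length to shrink the KV cache footprint (value is an example)
docker run --gpus=all \
  -e NIM_MAX_MODEL_LEN=8192 \
  ...
```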
vGPU Not Detected#
If nvidia-smi does not show a GPU inside the VM:
Verify that the NVIDIA vGPU software is installed and licensed on the hypervisor host.
Verify that a vGPU profile is assigned to the VM in the hypervisor management console.
Verify that the NVIDIA vGPU guest driver is installed inside the VM.
Check the hypervisor logs for vGPU allocation errors.
Refer to the NVIDIA vGPU troubleshooting guide for detailed diagnostics.
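On a licensed vGPU guest, the query output of nvidia-smi includes license fields, which makes for a quick first diagnostic. A minimal sketch; the exact field names vary by vGPU software release:

```shell
# First-pass vGPU diagnostic inside the guest VM.
if command -v nvidia-smi >/dev/null 2>&1; then
  status="driver-present"
  # On a vGPU guest, "nvidia-smi -q" reports licensed-product and license-status fields.
  nvidia-smi -q 2>/dev/null | grep -i "license" \
    || echo "no license fields reported (check the guest driver and the license server)"
else
  status="driver-missing"
  echo "nvidia-smi not found: the vGPU guest driver is not installed or not on PATH"
fi
```

If the driver is present but the GPU is still missing, move on to the hypervisor-side checks above.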