vGPU Deployment#
NVIDIA vGPU lets multiple virtual machines (VMs) share a single physical GPU, which makes it a common deployment pattern in enterprise data centers and cloud environments. NIM LLM supports deployment on vGPU-enabled VMs, so you can run LLM inference workloads in existing virtualized infrastructure.
vGPU deployment is a good fit in the following situations:
Your organization uses VMware vSphere, Red Hat KVM, Citrix Hypervisor, or other supported hypervisors with NVIDIA vGPU.
You need to share GPU resources across multiple workloads or tenants.
Your IT policy requires VM-based isolation rather than container-only deployment.
You are deploying in an NVIDIA AI Enterprise environment.
Prerequisites#
Before deploying NIM LLM on vGPU, ensure the following hardware and software prerequisites are met.
Hardware#
Prepare the following hardware:
An NVIDIA GPU that supports vGPU (refer to the NVIDIA vGPU supported products).
A VM with sufficient GPU frame buffer allocated for the target model (refer to Memory and frame buffer on vGPU in Troubleshooting).
Software#
Install or configure the following software:
NVIDIA vGPU software, installed and licensed on the hypervisor host
NVIDIA vGPU guest driver, installed inside the VM
NVIDIA Container Toolkit 1.14+, installed inside the VM
Docker 24.0+ or another OCI container runtime, installed inside the VM
Linux guest OS (refer to the NVIDIA vGPU guest OS support matrix)
Verify vGPU Inside the VM#
Confirm that the vGPU is visible inside the VM:
nvidia-smi
Expected output shows a virtual GPU with allocated frame buffer:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.xx.xx              Driver Version: 560.xx.xx      CUDA Version: 12.x     |
|-------------------------------------------+----------------------+----------------------+
| GPU  Name                   Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf            Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|===========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB-...       On   |  00000000:00:10.0 Off |                   0 |
| N/A   35C    P0               45W /  N/A  |     0MiB / 40960MiB  |      0%      Default |
+-------------------------------------------+----------------------+----------------------+
Verify that the NVIDIA Container Toolkit is configured:
docker run --rm --gpus=all nvidia/cuda:12.6.3-base-ubuntu24.04 nvidia-smi
After you verify the vGPU and the container toolkit, deploy NIM on the VM by following the standard Quickstart instructions.
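As a sketch, the launch command inside the VM follows the same pattern as the Quickstart on bare metal. The image name below is an example placeholder for the model you intend to serve, and `NGC_API_KEY` must hold a valid NGC API key:

```shell
# Export your NGC API key so the container can pull model artifacts.
export NGC_API_KEY=<your-ngc-api-key>

# Launch NIM on the vGPU-backed VM. The image tag is an example;
# substitute the NIM container for your target model.
docker run -it --rm --gpus=all \
  -e NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```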
Troubleshooting#
Memory and frame buffer on vGPU#
On vGPU, the usable GPU frame buffer inside the VM is less than the configured vGPU profile size because the vGPU software reserves a portion for its own use. For exact usable frame buffer per profile, refer to the NVIDIA vGPU documentation.
Choosing a vGPU profile — Select a vGPU profile (for example, A100-40C or H100-80C) that provides enough frame buffer for the model you intend to serve. As a guideline:
| Model Size | Approximate GPU Memory Required | Example vGPU Profiles |
|---|---|---|
| 7B–8B (FP16/BF16) | ~16 GB | A100-20C, H100-20C or larger |
| 7B–8B (FP8) | ~10 GB | A100-10C, H100-10C or larger |
| 13B (FP16/BF16) | ~28 GB | A100-40C, H100-40C or larger |
| 70B (FP8) | ~70 GB | Full GPU profile (A100-40C x2 or A100-80C) |
These are rough estimates. Actual memory usage depends on the model, quantization, context length, and KV cache configuration. Run `list-model-profiles` inside the VM to see which profiles are compatible with the available GPU memory.
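The guideline figures above can be approximated with simple arithmetic: weight memory is roughly parameters times bytes per parameter, plus headroom for the KV cache and activations. This is an illustration only, not a NIM tool, and the 20% headroom factor is an assumption that varies with context length:

```shell
# Back-of-the-envelope frame buffer estimate (illustrative only).
params_b=8        # model size in billions of parameters
bytes_per_param=2 # 2 for FP16/BF16, 1 for FP8

# Weight memory in GB: parameters (billions) x bytes per parameter.
weights_gb=$((params_b * bytes_per_param))

# Add ~20% headroom for KV cache and activations, rounding up
# (integer shell arithmetic).
total_gb=$(( (weights_gb * 12 + 9) / 10 ))

echo "~${total_gb} GB frame buffer needed; choose a vGPU profile with more than this"
```

For an 8B FP16 model this yields roughly 20 GB, consistent with the A100-20C row in the table.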
GPU memory utilization — In vGPU environments, NIM may automatically lower the default GPU memory utilization (--gpu-memory-utilization) to account for the reduced available frame buffer. If the memory estimator detects a shared-memory environment (vGPU or UMA), it sets the value to the greater of 0.5 or the estimated minimum required fraction.
You can override this using the passthrough environment variable:
docker run --gpus=all \
-e NIM_PASSTHROUGH_ARGS="--gpu-memory-utilization 0.85" \
...
Note
Setting GPU memory utilization too high in a vGPU environment can cause out-of-memory errors because the vGPU software reserves frame buffer that is not visible to the application. Start with a conservative value and increase gradually.
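Before raising the utilization value, it can help to check how much frame buffer the guest driver actually reports. Both query fields below are standard `nvidia-smi` options:

```shell
# Report total, used, and free frame buffer as seen inside the VM.
# The free value is what NIM can realistically draw from.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```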
Profile Selection Reports Insufficient Memory#
If `list-model-profiles` shows all profiles as “Incompatible with system,” the vGPU profile does not have enough frame buffer for any available model profile.
Use the following resolutions:
Increase the vGPU profile size (allocate more frame buffer to the VM) at the hypervisor level.
Use a smaller model or a more aggressively quantized profile (FP8, MXFP4).
Decrease the maximum model length to lower the KV cache memory requirement.
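For the last resolution, one way to lower the maximum model length is through the same passthrough mechanism shown earlier. This is a sketch that assumes the serving backend accepts a vLLM-style `--max-model-len` flag; the value 8192 is an example:

```shell
# Reduce the maximum sequence length to shrink the KV cache
# memory requirement (example value; flag assumes a vLLM-style backend).
docker run --gpus=all \
  -e NIM_PASSTHROUGH_ARGS="--max-model-len 8192" \
  ...
```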
vGPU Not Detected#
If nvidia-smi does not show a GPU inside the VM:
Verify that the NVIDIA vGPU software is installed and licensed on the hypervisor host.
Verify that a vGPU profile is assigned to the VM in the hypervisor management console.
Verify that the NVIDIA vGPU guest driver is installed inside the VM.
Check the hypervisor logs for vGPU allocation errors.
Refer to the NVIDIA vGPU troubleshooting guide for detailed diagnostics.
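When the GPU is visible but behaves abnormally, it is also worth confirming that the guest driver obtained a vGPU license, since an unlicensed vGPU runs with degraded performance. The query below uses standard `nvidia-smi` output; the exact section names depend on the driver version:

```shell
# Show the licensing section of the full device query; look for
# "License Status: Licensed" in the output.
nvidia-smi -q | grep -i -A 2 "license"
```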
Additional Resources#
Refer to the following resources for more information: