vGPU Deployment#

NVIDIA vGPU lets multiple virtual machines (VMs) share a single physical GPU, which makes it a common deployment pattern in enterprise data centers and cloud environments. NIM LLM supports deployment on vGPU-enabled VMs, so you can run LLM inference workloads in existing virtualized infrastructure.

vGPU deployment is a good fit in the following situations:

  • Your organization uses VMware vSphere, Red Hat KVM, Citrix Hypervisor, or other supported hypervisors with NVIDIA vGPU.

  • You need to share GPU resources across multiple workloads or tenants.

  • Your IT policy requires VM-based isolation rather than container-only deployment.

  • You are deploying in an NVIDIA AI Enterprise environment.

Prerequisites#

Before deploying NIM LLM on vGPU, ensure the following hardware and software prerequisites are met.

Hardware#

Prepare the following hardware:

Software#

Install or configure the following software:

Verify vGPU Inside the VM#

Confirm that the vGPU is visible inside the VM:

nvidia-smi

Expected output shows a virtual GPU with allocated frame buffer:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.xx.xx    Driver Version: 560.xx.xx    CUDA Version: 12.x                 |
|-------------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M   | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap   |         Memory-Usage | GPU-Util  Compute M. |
|===========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB-...      On    | 00000000:00:10.0 Off |                    0 |
| N/A   35C    P0              45W /  N/A   |      0MiB / 40960MiB |      0%      Default |
+-------------------------------------------+----------------------+----------------------+

Verify that the NVIDIA Container Toolkit is configured:

docker run --rm --gpus=all nvidia/cuda:12.6.3-base-ubuntu24.04 nvidia-smi

After you verify the vGPU and the container toolkit, deploy NIM on the VM by following the standard Quickstart instructions.

Troubleshooting#

Memory and Frame Buffer on vGPU#

On vGPU, the usable GPU frame buffer inside the VM is less than the configured vGPU profile size because the vGPU software reserves a portion for its own use. For exact usable frame buffer per profile, refer to the NVIDIA vGPU documentation.

Choosing a vGPU profile — Select a vGPU profile (for example A100-40C, H100-80C) that provides enough frame buffer for the model you intend to serve. As a guideline:

Model Size           Approximate GPU Memory Required   Example vGPU Profiles
-------------------  --------------------------------  ------------------------------------------
7B–8B (FP16/BF16)    ~16 GB                            A100-20C, H100-20C or larger
7B–8B (FP8)          ~10 GB                            A100-10C, H100-10C or larger
13B (FP16/BF16)      ~28 GB                            A100-40C, H100-40C or larger
70B (FP8)            ~70 GB                            Full GPU profile (A100-40C x2 or A100-80C)
These are rough estimates. Actual memory usage depends on the model, quantization, context length, and KV cache configuration. Use list-model-profiles inside the VM to see which profiles are compatible with the available GPU memory.
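The weight-only portion of these estimates follows from simple arithmetic: parameter count times bytes per parameter. The sketch below illustrates that heuristic; the function name is illustrative, not part of NIM, and it deliberately ignores KV cache and runtime overhead, which is why the table's figures run higher than weights alone.

```python
def approx_weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough weight-only footprint in GB: parameter count times bytes per parameter.

    Illustrative heuristic only. Actual usage is higher: add KV cache,
    activations, and runtime overhead on top of this figure.
    """
    # 1e9 params at N bytes each is N GB, so the units cancel neatly.
    return params_billion * bytes_per_param

# An 8B model in FP16 (2 bytes/param) needs roughly 16 GB for weights alone,
# consistent with the ~16 GB guideline in the table above.
print(approx_weight_memory_gb(8, 2))  # 16.0
# FP8 (1 byte/param) halves the weights; overhead pushes total toward ~10 GB.
print(approx_weight_memory_gb(8, 1))  # 8.0
```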

GPU memory utilization — In vGPU environments, NIM may automatically lower the default GPU memory utilization (--gpu-memory-utilization) to account for the reduced available frame buffer. If the memory estimator detects a shared-memory environment (vGPU or UMA), it sets the value to the greater of 0.5 or the estimated minimum required fraction.
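The rule stated above can be sketched as a one-liner; the function name is hypothetical and only models the described behavior, not NIM's actual estimator code.

```python
def vgpu_default_utilization(estimated_min_fraction: float) -> float:
    """Model the stated rule: in a shared-memory environment (vGPU or UMA),
    use the greater of 0.5 or the estimated minimum required fraction."""
    return max(0.5, estimated_min_fraction)

print(vgpu_default_utilization(0.30))  # 0.5  (the 0.5 floor applies)
print(vgpu_default_utilization(0.72))  # 0.72 (the model needs more than half)
```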

You can override this using the passthrough environment variable:

docker run --gpus=all \
  -e NIM_PASSTHROUGH_ARGS="--gpu-memory-utilization 0.85" \
  ...

Note

Setting GPU memory utilization too high in a vGPU environment can cause out-of-memory errors because the vGPU software reserves frame buffer that is not visible to the application. Start with a conservative value and increase gradually.

Profile Selection Reports Insufficient Memory#

If list-model-profiles shows all profiles as “Incompatible with system,” the vGPU profile does not have enough frame buffer for any available model profile.

Try the following resolutions:


  • Increase the vGPU profile size (allocate more frame buffer to the VM) at the hypervisor level.

  • Use a smaller model or a more aggressively quantized profile (FP8, MXFP4).

  • Decrease the maximum model length to lower the KV cache memory requirement.
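The last resolution works because KV cache size scales linearly with the maximum model length. A minimal sketch of the standard KV cache sizing formula, using assumed architecture numbers (32 layers, 8 KV heads, head dimension 128, FP16 cache) that resemble an 8B grouped-query-attention model but are not tied to any specific NIM profile:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                max_seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache footprint in GB.

    The factor of 2 accounts for the separate key and value tensors per layer.
    """
    elems = 2 * num_layers * num_kv_heads * head_dim * max_seq_len * batch
    return elems * bytes_per_elem / 1e9

# Assumed example architecture; halving max_seq_len halves the cache,
# so reducing max length from 8192 to 2048 cuts this term by 4x.
print(round(kv_cache_gb(32, 8, 128, 8192, 1), 2))
print(round(kv_cache_gb(32, 8, 128, 2048, 1), 2))
```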

vGPU Not Detected#

If nvidia-smi does not show a GPU inside the VM:

  • Verify that the NVIDIA vGPU software is installed and licensed on the hypervisor host.

  • Verify that a vGPU profile is assigned to the VM in the hypervisor management console.

  • Verify that the NVIDIA vGPU guest driver is installed inside the VM.

  • Check the hypervisor logs for vGPU allocation errors.

Refer to the NVIDIA vGPU troubleshooting guide for detailed diagnostics.

Additional Resources#

Refer to the following resources for more information: