GPU Operator with KubeVirt#
Introduction#
KubeVirt is a virtual machine management add-on to Kubernetes that allows you to run and manage virtual machines in a Kubernetes cluster. It eliminates the need to manage separate clusters for virtual machine and container workloads, as both can now coexist in a single Kubernetes cluster.
Up until this point, the GPU Operator only provisioned worker nodes for running GPU-accelerated containers. Now, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines.
The prerequisites for running containers and virtual machines with GPUs differ, the primary difference being the drivers required. For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the NVIDIA vGPU Manager is needed for creating vGPU devices.
The GPU Operator can now be configured to deploy different software components on worker nodes depending on what GPU workload is configured to run on those nodes. Consider the following example.
Node A receives the following software components:

- NVIDIA Datacenter Driver - to install the driver
- NVIDIA Container Toolkit - to ensure containers can properly access GPUs
- NVIDIA Kubernetes Device Plugin - to discover and advertise GPU resources to kubelet
- NVIDIA DCGM and DCGM Exporter - to monitor the GPU(s)
Node B receives the following software components:

- VFIO Manager - to load vfio-pci and bind it to all GPUs on the node
- Sandbox Device Plugin - to discover and advertise the passthrough GPUs to kubelet
Node C receives the following software components:

- NVIDIA vGPU Manager - to install the driver
- NVIDIA vGPU Device Manager - to create vGPU devices on the node
- Sandbox Device Plugin - to discover and advertise the vGPU devices to kubelet
Assumptions, constraints, and dependencies#
- A GPU worker node can run GPU workloads of a particular type - containers, virtual machines with GPU passthrough, or virtual machines with vGPU - but not a combination of them.
- The cluster admin or developer knows their cluster ahead of time and can properly label nodes to indicate what types of GPU workloads they will run.
- Worker nodes running GPU-accelerated virtual machines (with pGPU or vGPU) are assumed to be bare metal.
- The GPU Operator will not automate the installation of NVIDIA drivers inside KubeVirt virtual machines with GPUs/vGPUs attached.
- Users must manually add all passthrough GPU and vGPU resources to the permittedHostDevices list in the KubeVirt CR before assigning them to KubeVirt virtual machines. See the KubeVirt documentation for more information.
- MIG-backed vGPUs are not supported.
Prerequisites#
- The virtualization and IOMMU extensions (Intel VT-d or AMD IOMMU) are enabled in the BIOS.
- The host is booted with intel_iommu=on or amd_iommu=on on the kernel command line.
- If planning to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the NVIDIA vGPU documentation to ensure you have met all of the prerequisites for using NVIDIA vGPU.
- KubeVirt is installed in the cluster.
- Starting with KubeVirt v0.58.2 and v0.59.1, set the DisableMDEVConfiguration feature gate:

$ kubectl patch kubevirt -n kubevirt kubevirt --type='json' \
    -p='[{"op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration"}]'
Example Output
kubevirt.kubevirt.io/kubevirt patched
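As an optional sanity check against the prerequisites above, you can verify the IOMMU setting on the host kernel command line and confirm that the feature gate was added. These checks use only standard tooling and assume KubeVirt is installed in the kubevirt namespace, as in the patch command above:

$ grep -E 'intel_iommu=on|amd_iommu=on' /proc/cmdline
$ kubectl get kubevirt -n kubevirt kubevirt \
    -o jsonpath='{.spec.configuration.developerConfiguration.featureGates}'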
Getting Started#
The high-level workflow for using the GPU Operator with KubeVirt is as follows:
1. Ensure the DisableMDEVConfiguration feature gate is set.
2. Label worker nodes based on the GPU workloads they will run.
3. Install the GPU Operator and set sandboxWorkloads.enabled=true.

There are additional steps required if using NVIDIA vGPU, which are covered in subsequent sections.
Labeling worker nodes#
Use the following command to add a label to a worker node:
$ kubectl label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu
You can assign the following values to the label: container, vm-passthrough, and vm-vgpu. The GPU Operator uses the value of this label when determining which operands to deploy on each worker node.
If the node label nvidia.com/gpu.workload.config does not exist on the node, the GPU Operator will assume the default GPU workload configuration, container, and will deploy the software components needed to support this workload type.
To override the default GPU workload configuration, set the following value in ClusterPolicy: sandboxWorkloads.defaultWorkload=<config>.
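For example, a minimal sketch of an install that makes vm-passthrough the default workload, using the same Helm chart options shown in the installation sections below:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set sandboxWorkloads.enabled=true \
    --set sandboxWorkloads.defaultWorkload=vm-passthrough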
Install the GPU Operator#
Follow one of the subsections below to install the GPU Operator, depending on whether you plan to use NVIDIA vGPU.
In general, the flag sandboxWorkloads.enabled in ClusterPolicy controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. This flag is disabled by default, meaning all nodes get provisioned with the same software, which enables container workloads, and the nvidia.com/gpu.workload.config node label is not used.
Note
The term sandboxing refers to running software in a separate, isolated environment, typically for added security, such as a virtual machine. We use the term sandbox workloads to signify workloads that run in a virtual machine, irrespective of the virtualization technology used.
Install the GPU Operator (without NVIDIA vGPU)#
Install the GPU Operator, enabling sandboxWorkloads:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v24.9.2 \
--set sandboxWorkloads.enabled=true
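After the install completes, one way to confirm that each node received the operands matching its nvidia.com/gpu.workload.config label is to list the operator pods along with their node placement:

$ kubectl get pods -n gpu-operator -o wide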
Install the GPU Operator (with NVIDIA vGPU)#
Build a private NVIDIA vGPU Manager container image and push it to a private registry. Follow the steps in the Building the NVIDIA vGPU Manager image section below.
Create a namespace for GPU Operator:
$ kubectl create namespace gpu-operator
Create an ImagePullSecret for accessing the NVIDIA vGPU Manager image:
$ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
--docker-server=${PRIVATE_REGISTRY} --docker-username=<username> \
--docker-password=<password> \
--docker-email=<email-id> -n gpu-operator
Install the GPU Operator with sandboxWorkloads and vgpuManager enabled, and specify the NVIDIA vGPU Manager image built previously:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v24.9.2 \
--set sandboxWorkloads.enabled=true \
--set vgpuManager.enabled=true \
--set vgpuManager.repository=<path to private repository> \
--set vgpuManager.image=vgpu-manager \
--set vgpuManager.version=<driver version> \
--set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}}
The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices which can be assigned to KubeVirt virtual machines. Without additional configuration, the GPU Operator creates a default set of devices on all GPUs. To learn more about the vGPU Device Manager and how to configure which types of vGPU devices get created in your cluster, refer to vGPU Device Configuration.
Add GPU resources to KubeVirt CR#
Update the KubeVirt custom resource so that all GPU and vGPU devices in your cluster are permitted and can be assigned to virtual machines.
The following example shows how to permit the A10 GPU device and A10-24Q vGPU device.
Determine the resource names for the GPU devices:
$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
Example Output
{ "nvidia.com/NVIDIA_A10-12Q": "4" }
Determine the PCI device IDs for the GPUs.
You can search by device name in the PCI IDs database.
If you have host access to the node, you can list the NVIDIA GPU devices with a command like the following example:
$ lspci -nnk -d 10de:
Example Output
65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
        Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
        Kernel modules: nvidiafb, nouveau
Modify the KubeVirt custom resource like the following partial example:

...
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - GPU
      - DisableMDEVConfiguration
    permittedHostDevices:
      pciHostDevices:
      - externalResourceProvider: true
        pciVendorSelector: 10DE:2236
        resourceName: nvidia.com/GA102GL_A10
      mediatedDevices:
      - externalResourceProvider: true
        mdevNameSelector: NVIDIA A10-24Q
        resourceName: nvidia.com/NVIDIA_A10-24Q
...
Replace the values in the YAML as follows:

- pciVendorSelector and resourceName under pciHostDevices to correspond to your GPU model.
- mdevNameSelector and resourceName under mediatedDevices to correspond to your vGPU type.
- Set externalResourceProvider=true to indicate that this resource is provided by an external device plugin, in this case the sandbox-device-plugin that is deployed by the GPU Operator.
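One way to apply these changes is to edit the resource in place; this assumes KubeVirt is installed in the kubevirt namespace, as in the feature-gate example earlier:

$ kubectl edit kubevirt -n kubevirt kubevirt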
Refer to the KubeVirt user guide for more information on the configuration options.
Create a virtual machine with GPU#
After the GPU Operator finishes deploying the sandbox device plugin and VFIO manager pods on worker nodes, and the GPU resources are added to the KubeVirt allowlist, you can assign a GPU to a virtual machine by setting the spec.domain.devices.gpus field in the VirtualMachineInstance manifest.
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
...
spec:
domain:
devices:
gpus:
- deviceName: nvidia.com/GA102GL_A10
name: gpu1
...
- deviceName is the resource name representing the device.
- name is a name to identify the device in the virtual machine.
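For a vGPU device, the gpus entry references the vGPU resource name instead of the passthrough resource name. The following partial example reuses the nvidia.com/NVIDIA_A10-24Q resource permitted earlier; vgpu1 is an arbitrary identifier:

spec:
  domain:
    devices:
      gpus:
      - deviceName: nvidia.com/NVIDIA_A10-24Q
        name: vgpu1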
vGPU Device Configuration#
The vGPU Device Manager assists in creating vGPU devices on GPU worker nodes.
The vGPU Device Manager allows administrators to declaratively define a set of possible vGPU device configurations they would like applied to GPUs on a node.
At runtime, they then point the vGPU Device Manager at one of these configurations, and vGPU Device Manager takes care of applying it.
The configuration file is created as a ConfigMap, and is shared across all worker nodes. At runtime, the nvidia.com/vgpu.config node label can be used to decide which of these configurations to actually apply to a node at any given time. If the node is not labeled, the default configuration will be used.
For more information on this component and how it is configured, refer to the project README.
By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all vGPU types supported by NVIDIA vGPU. Users can select a specific configuration for a worker node by applying the nvidia.com/vgpu.config node label. For example, labeling a node with nvidia.com/vgpu.config=A10-8Q, as shown below, creates three vGPU devices of type A10-8Q on all A10 GPUs on the node (three is the maximum number of A10-8Q devices that can be created per GPU).
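The label is applied with the same command pattern used earlier, where <node-name> is a placeholder for your node:

$ kubectl label node <node-name> --overwrite nvidia.com/vgpu.config=A10-8Q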
If the node is not labeled, the default configuration will be applied. The default configuration creates Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory. For example, the default configuration creates two A10-12Q devices on all A10 GPUs, two V100-8Q devices on all V100 GPUs, and two T4-8Q devices on all T4 GPUs.
If you need a custom vGPU device configuration beyond what the default ConfigMap provides, you can create your own ConfigMap:
$ kubectl create configmap custom-vgpu-config -n gpu-operator --from-file=config.yaml=/path/to/file
Then configure the GPU Operator to use it by setting vgpuDeviceManager.config.name=custom-vgpu-config.
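As a rough sketch of what such a file might look like, the following hypothetical config.yaml defines two named configurations selectable through the nvidia.com/vgpu.config node label. The exact schema is defined by the vGPU Device Manager, so refer to the project README for the authoritative format:

version: v1
vgpu-configs:
  a10-4q:
    - "A10-4Q"
  a10-8q:
    - "A10-8Q"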
Apply a New vGPU Device Configuration#
We can apply a specific vGPU device configuration on a per-node basis by setting the nvidia.com/vgpu.config node label. It is recommended to set this node label prior to installing the GPU Operator if you do not want the default configuration applied.
Switching the vGPU device configuration after one has been successfully applied assumes that no virtual machines with vGPU are currently running on the node. Any existing virtual machines must be shut down or migrated first.
To apply a new configuration after the GPU Operator is installed, simply update the nvidia.com/vgpu.config node label. Let’s run through an example on a system with two A10 GPUs.
$ nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-ebd34bdf-1083-eaac-2aff-4b71a022f9bd)
GPU 1: NVIDIA A10 (UUID: GPU-1795e88b-3395-b27b-dad8-0488474eec0c)
After installing the GPU Operator as detailed in the previous sections, and without labeling the node with nvidia.com/vgpu.config, the default vGPU config is applied and four A10-12Q devices are created (two per GPU):
$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
{
"nvidia.com/NVIDIA_A10-12Q": "4"
}
If instead we wish to create A10-4Q devices, we can label the node as follows:
$ kubectl label node <node-name> --overwrite nvidia.com/vgpu.config=A10-4Q
After the vGPU Device Manager finishes applying the new configuration, all GPU Operator pods should return to the Running state.
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
...
nvidia-sandbox-device-plugin-daemonset-brtb6 1/1 Running 0 10s
nvidia-sandbox-validator-ljnwg 1/1 Running 0 10s
nvidia-vgpu-device-manager-8mgg8 1/1 Running 0 30m
nvidia-vgpu-manager-daemonset-fpplc 1/1 Running 0 31m
We now see 12 A10-4Q devices on the node, as 6 A10-4Q devices can be created per A10 GPU.
$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
{
"nvidia.com/NVIDIA_A10-4Q": "12"
}
Building the NVIDIA vGPU Manager image#
Note
Building the NVIDIA vGPU Manager image is only required if you are planning to use NVIDIA vGPU. If only planning to use PCI passthrough, skip this section.
This section covers building the NVIDIA vGPU Manager container image and pushing it to a private registry.
Download the vGPU software from the NVIDIA Licensing Portal:

Log in to the NVIDIA Licensing Portal and navigate to the Software Downloads section, where the NVIDIA vGPU software is located.

The vGPU software bundle is packaged as a zip file. Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run.

Note
NVIDIA AI Enterprise customers must use the aie .run file for building the NVIDIA vGPU Manager image. Download the NVIDIA-Linux-x86_64-<version>-vgpu-kvm-aie.run file instead, and rename it to NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run before proceeding with the rest of the procedure.
Next, clone the driver container repository and build the driver image with the following steps.
Open a terminal and clone the driver container image repository.
$ git clone https://gitlab.com/nvidia/container-images/driver
$ cd driver
Change to the vgpu-manager directory for your OS. We use Ubuntu 20.04 as an example.
$ cd vgpu-manager/ubuntu20.04
Note
For Red Hat OpenShift, run cd vgpu-manager/rhel8 to use the rhel8 folder instead.
Copy the NVIDIA vGPU Manager .run file from your extracted zip file:
$ cp <local-driver-download-directory>/*-vgpu-kvm.run ./
Set the following environment variables:

- PRIVATE_REGISTRY - name of the private registry used to store the driver image.
- VERSION - NVIDIA vGPU Manager version downloaded from the NVIDIA Software Portal.
- OS_TAG - this must match the guest OS version. In the example below, ubuntu20.04 is used. For Red Hat OpenShift, set this to rhcos4.x, where x is the supported minor OCP version.
- CUDA_VERSION - CUDA base image version to build the driver image with.

$ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=ubuntu20.04 CUDA_VERSION=11.7.1
Build the NVIDIA vGPU Manager image.
$ docker build \
--build-arg DRIVER_VERSION=${VERSION} \
--build-arg CUDA_VERSION=${CUDA_VERSION} \
-t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .
Push NVIDIA vGPU Manager image to your private registry.
$ docker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}