GPU Operator with Kata Containers
About the Operator with Kata Containers
Note
Technology Preview features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.
Kata Containers are similar to, but subtly different from, traditional containers such as Docker containers.
A traditional container packages software for user-space isolation from the host, but the container runs on the host and shares the operating system kernel with the host. Sharing the operating system kernel is a potential vulnerability.
A Kata container runs in a virtual machine on the host. The virtual machine has a separate operating system and operating system kernel. Hardware virtualization and a separate kernel provide improved workload isolation in comparison with traditional containers.
The NVIDIA GPU Operator works with the Kata container runtime. Kata uses a hypervisor, like QEMU, to provide a lightweight virtual machine with a single purpose: to run a Kubernetes pod.
The following diagram shows the software components that Kubernetes uses to run a Kata container.
NVIDIA supports Kata Containers by using the Confidential Containers Operator to install the Kata runtime and QEMU. Even though the Operator isn’t used for confidential computing in this configuration, the Operator simplifies the installation of the Kata runtime.
About NVIDIA Kata Manager
When you configure the GPU Operator for Kata Containers, the Operator deploys NVIDIA Kata Manager as an operand.
The manager downloads an NVIDIA-optimized Linux kernel image and an initial RAM disk that provide the lightweight operating system for the virtual machines that run in QEMU. The manager downloads these artifacts from the NVIDIA container registry, nvcr.io, on each worker node.
The manager also configures each worker node with a runtime class, kata-qemu-nvidia-gpu, and configures containerd for the runtime class.
NVIDIA Kata Manager Configuration
The following part of the cluster policy shows the fields related to the manager:
kataManager:
  enabled: true
  config:
    artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
    runtimeClasses:
    - artifacts:
        pullSecret: ""
        url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-525
      name: kata-qemu-nvidia-gpu
      nodeSelector: {}
    - artifacts:
        pullSecret: ""
        url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535-snp
      name: kata-qemu-nvidia-gpu-snp
      nodeSelector: {}
  repository: nvcr.io/nvidia/cloud-native
  image: k8s-kata-manager
  version: v0.1.0
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
The kata-qemu-nvidia-gpu runtime class is used with Kata Containers.
The kata-qemu-nvidia-gpu-snp runtime class is used with Confidential Containers and is installed by default even though it is not used with this configuration.
Benefits of Using Kata Containers
The primary benefits of Kata Containers are as follows:
Running untrusted workloads in a container. The virtual machine provides a layer of defense against the untrusted code.
Limiting access to hardware devices such as NVIDIA GPUs. The virtual machine is provided access to specific devices. This approach ensures that the workload cannot access additional devices.
Transparent deployment of unmodified containers.
Limitations and Restrictions
GPUs are available to containers as a single GPU in passthrough mode only. Multi-GPU passthrough and vGPU are not supported.
Support is limited to initial installation and configuration only. Upgrade and configuration of existing clusters for Kata Containers is not supported.
Support for Kata Containers is limited to the implementation described on this page. The Operator does not support Red Hat OpenShift sandbox containers.
Uninstalling the GPU Operator or the NVIDIA Kata Manager does not remove the files that the manager downloads and installs in the /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu/ directory on the worker nodes.
NVIDIA supports the Operator and Kata Containers with the containerd runtime only.
Cluster Topology Considerations
You can configure all the worker nodes in your cluster for Kata Containers, or you can configure some nodes for Kata Containers and the others for traditional containers. Consider the following example.
Node A is configured to run traditional containers.
Node B is configured to run Kata Containers.
Node A receives the following software components:
NVIDIA Driver Manager for Kubernetes – to install the data-center driver.
NVIDIA Container Toolkit – to ensure that containers can access GPUs.
NVIDIA Device Plugin for Kubernetes – to discover and advertise GPU resources to kubelet.
NVIDIA DCGM and DCGM Exporter – to monitor GPUs.
NVIDIA MIG Manager for Kubernetes – to manage MIG-capable GPUs.
Node Feature Discovery – to detect CPU, kernel, and host features and label worker nodes.
NVIDIA GPU Feature Discovery – to detect NVIDIA GPUs and label worker nodes.
Node B receives the following software components:
NVIDIA Kata Manager for Kubernetes – to manage the NVIDIA artifacts such as the NVIDIA-optimized Linux kernel image and initial RAM disk.
NVIDIA Sandbox Device Plugin – to discover and advertise the passthrough GPUs to kubelet.
NVIDIA VFIO Manager – to load the vfio-pci device driver and bind it to all GPUs on the node.
Node Feature Discovery – to detect CPU security features and NVIDIA GPUs, and to label worker nodes.
Prerequisites
Your hosts are configured to enable hardware virtualization and Access Control Services (ACS). With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER). Enabling these features is typically performed by configuring the host BIOS.
Your hosts are configured to support IOMMU.
If the output from running ls /sys/kernel/iommu_groups includes 0, 1, and so on, then your host is configured for IOMMU.
If a host is not configured or you are unsure, add the intel_iommu=on Linux kernel command-line argument. For most Linux distributions, you add the argument to the /etc/default/grub file:
...
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau"
...
On Ubuntu systems, run sudo update-grub after making the change to configure the bootloader. On other systems, you might need to run sudo dracut after making the change. Refer to the documentation for your operating system. Reboot the host after configuring the bootloader.
You have a Kubernetes cluster and you have cluster administrator privileges.
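The IOMMU prerequisite above can be checked with a small script instead of reading the directory listing by eye. The following is a minimal sketch, not part of the NVIDIA tooling: the function name check_iommu is hypothetical, and the only assumption is that a configured host exposes at least one entry under /sys/kernel/iommu_groups, as described above.

```shell
#!/bin/sh
# check_iommu is a hypothetical helper name, not an NVIDIA or Kata tool.
# It reports whether the IOMMU groups directory exists and is non-empty.
# The directory argument is optional and defaults to the standard sysfs path.
check_iommu() {
  groups_dir=${1:-/sys/kernel/iommu_groups}
  if [ -d "$groups_dir" ] && [ -n "$(ls -A "$groups_dir" 2>/dev/null)" ]; then
    echo "IOMMU groups found in $groups_dir"
  else
    echo "no IOMMU groups found in $groups_dir" >&2
    return 1
  fi
}

# Example usage on a worker node:
#   check_iommu || echo "add intel_iommu=on to the kernel command line and reboot"
```

The function returns a nonzero status when no groups are present, so it can gate later steps in a provisioning script.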
Overview of Installation and Configuration
The high-level workflow for installing and configuring your cluster to support the NVIDIA GPU Operator with Kata Containers is as follows:
Label the worker nodes that you want to use with Kata Containers.
This step ensures that you can continue to run traditional container workloads with GPU or vGPU workloads on some nodes in your cluster.
Install the Confidential Containers Operator.
This step installs the Operator and also the Kata Containers runtime that NVIDIA uses for Kata Containers.
Install the NVIDIA GPU Operator.
You install the Operator and specify options to deploy the operands that are required for Kata Containers.
After installation, you can run a sample workload.
Install the Confidential Containers Operator
Perform the following steps to install and verify the Confidential Containers Operator:
Label the nodes to run virtual machines in containers. Label only the nodes that you want to run with Kata Containers.
$ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
Set the Operator version in an environment variable:
$ export VERSION=v0.7.0
Install the Operator:
$ kubectl apply -k "github.com/confidential-containers/operator/config/release?ref=${VERSION}"
Example Output
namespace/confidential-containers-system created
customresourcedefinition.apiextensions.k8s.io/ccruntimes.confidentialcontainers.org created
serviceaccount/cc-operator-controller-manager created
role.rbac.authorization.k8s.io/cc-operator-leader-election-role created
clusterrole.rbac.authorization.k8s.io/cc-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/cc-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/cc-operator-proxy-role created
rolebinding.rbac.authorization.k8s.io/cc-operator-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/cc-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/cc-operator-proxy-rolebinding created
configmap/cc-operator-manager-config created
service/cc-operator-controller-manager-metrics-service created
deployment.apps/cc-operator-controller-manager created
(Optional) View the pods and services in the confidential-containers-system namespace:
$ kubectl get pod,svc -n confidential-containers-system
Example Output
NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-c98c4ff74-ksb4q   2/2     Running   0          2m59s

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.98.221.141   <none>        8443/TCP   2m59s
Install the sample Confidential Containers runtime by creating the manifests and then editing the node selector so that the runtime is installed only on the labelled nodes.
Create a local copy of the manifests in a file that is named ccruntime.yaml:
$ kubectl apply --dry-run=client -o yaml \
    -k "github.com/confidential-containers/operator/config/samples/ccruntime/default?ref=${VERSION}" > ccruntime.yaml
Edit the ccruntime.yaml file and set the node selector as follows:
apiVersion: confidentialcontainers.org/v1beta1
kind: CcRuntime
metadata:
  annotations:
...
spec:
  ccNodeSelector:
    matchLabels:
      nvidia.com/gpu.workload.config: "vm-passthrough"
...
Apply the modified manifests:
$ kubectl apply -f ccruntime.yaml
Example Output
ccruntime.confidentialcontainers.org/ccruntime-sample created
Wait a few minutes for the Operator to create the base runtime classes.
(Optional) View the runtime classes:
$ kubectl get runtimeclass
Example Output
NAME            HANDLER         AGE
kata            kata            13m
kata-clh        kata-clh        13m
kata-clh-tdx    kata-clh-tdx    13m
kata-qemu       kata-qemu       13m
kata-qemu-sev   kata-qemu-sev   13m
kata-qemu-snp   kata-qemu-snp   13m
kata-qemu-tdx   kata-qemu-tdx   13m
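Rather than waiting a fixed few minutes for the base runtime classes, a script can poll until a given runtime class exists. The following is a minimal sketch under stated assumptions: the function name wait_for_runtimeclass and the default timeout and interval values are hypothetical, and the only kubectl call used is kubectl get runtimeclass, which appears in the procedure above.

```shell
#!/bin/sh
# wait_for_runtimeclass is a hypothetical helper name.
# Arguments: runtime class name, timeout in seconds (default 300),
# polling interval in seconds (default 5). The defaults are illustrative.
wait_for_runtimeclass() {
  name=$1
  timeout=${2:-300}
  interval=${3:-5}
  elapsed=0
  # kubectl exits with a nonzero status while the runtime class is missing.
  while ! kubectl get runtimeclass "$name" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for runtime class $name" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "runtime class $name is available"
}

# Example usage after applying ccruntime.yaml:
#   wait_for_runtimeclass kata-qemu 600 10
```

The same function can be reused after the GPU Operator installation to wait for the kata-qemu-nvidia-gpu runtime class.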
Install the NVIDIA GPU Operator
Procedure
Perform the following steps to install the Operator for use with Kata Containers:
Add and update the NVIDIA Helm repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
Specify at least the following options when you install the Operator:
$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set sandboxWorkloads.enabled=true \
    --set kataManager.enabled=true
Example Output
NAME: gpu-operator
LAST DEPLOYED: Tue Jul 25 19:19:07 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
Verification
Verify that the Kata Manager and VFIO Manager operands are running:
$ kubectl get pods -n gpu-operator
Example Output
NAME                                                         READY   STATUS    RESTARTS   AGE
gpu-operator-57bf5d5769-nb98z                                1/1     Running   0          6m21s
gpu-operator-node-feature-discovery-master-b44f595bf-5sjxg   1/1     Running   0          6m21s
gpu-operator-node-feature-discovery-worker-lwhdr             1/1     Running   0          6m21s
nvidia-kata-manager-bw5mb                                    1/1     Running   0          3m36s
nvidia-sandbox-device-plugin-daemonset-cr4s6                 1/1     Running   0          2m37s
nvidia-sandbox-validator-9wjm4                               1/1     Running   0          2m37s
nvidia-vfio-manager-vg4wp                                    1/1     Running   0          3m36s
Verify that the kata-qemu-nvidia-gpu and kata-qemu-nvidia-gpu-snp runtime classes are available:
$ kubectl get runtimeclass
Example Output
NAME                       HANDLER                    AGE
kata                       kata                       37m
kata-clh                   kata-clh                   37m
kata-clh-tdx               kata-clh-tdx               37m
kata-qemu                  kata-qemu                  37m
kata-qemu-nvidia-gpu       kata-qemu-nvidia-gpu       96s
kata-qemu-nvidia-gpu-snp   kata-qemu-nvidia-gpu-snp   96s
kata-qemu-sev              kata-qemu-sev              37m
kata-qemu-snp              kata-qemu-snp              37m
kata-qemu-tdx              kata-qemu-tdx              37m
nvidia                     nvidia                     97s
Optional: If you have host access to the worker node, you can perform the following steps:
Confirm that the host uses the vfio-pci device driver for GPUs:
$ lspci -nnk -d 10de:
Example Output
65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
        Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
Confirm that NVIDIA Kata Manager installed the kata-qemu-nvidia-gpu runtime class files:
$ ls -1 /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu/
Example Output
configuration-nvidia-gpu-qemu.toml
kata-ubuntu-jammy-nvidia-gpu.initrd
vmlinuz-5.xx.x-xxx-nvidia-gpu
...
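On hosts with several GPUs, the lspci check above can be summarized so that any GPU not bound to vfio-pci stands out. The following is a minimal sketch: the function names gpu_drivers and unbound_gpus are hypothetical, and the parsing assumes the lspci -nnk layout shown in the example output, where each device line starts in the first column and its detail lines are indented.

```shell
#!/bin/sh
# gpu_drivers and unbound_gpus are hypothetical helper names.
# gpu_drivers reads `lspci -nnk` output on stdin and prints one line per
# device in the form "<pci-address> <driver>". The driver is reported as
# "none" when no "Kernel driver in use" line follows the device line.
gpu_drivers() {
  awk '
    /^[^ \t]/ { if (dev != "") print dev, drv; dev = $1; drv = "none" }
    /Kernel driver in use:/ { drv = $NF }
    END { if (dev != "") print dev, drv }
  '
}

# unbound_gpus prints only the devices that are not bound to vfio-pci.
unbound_gpus() {
  gpu_drivers | awk '$2 != "vfio-pci"'
}

# Example usage on a worker node:
#   lspci -nnk -d 10de: | unbound_gpus
```

An empty result from unbound_gpus indicates that every listed NVIDIA device uses the vfio-pci driver.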
Run a Sample Workload
A pod specification for a Kata container requires the following:
Specify a Kata runtime class.
Specify a passthrough GPU resource.
Determine the passthrough GPU resource names:
$ kubectl get nodes -l nvidia.com/gpu.present -o json | \
    jq '.items[0].status.allocatable |
      with_entries(select(.key | startswith("nvidia.com/"))) |
      with_entries(select(.value != "0"))'
Example Output
{
  "nvidia.com/GA102GL_A10": "1"
}
Create a file, such as cuda-vectoradd-kata.yaml, like the following example:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata
  annotations:
    cdi.k8s.io/gpu: "nvidia.com/pgpu=0"
    io.katacontainers.config.hypervisor.default_memory: "16384"
spec:
  runtimeClassName: kata-qemu-nvidia-gpu
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        "nvidia.com/GA102GL_A10": 1
The io.katacontainers.config.hypervisor.default_memory annotation starts the VM with 16 GB of memory. Modify the value to accommodate your workload.
Create the pod:
$ kubectl apply -f cuda-vectoradd-kata.yaml
View the logs from the pod:
$ kubectl logs -n default cuda-vectoradd-kata
Example Output
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Delete the pod:
$ kubectl delete -f cuda-vectoradd-kata.yaml
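Because the passthrough resource name depends on the GPU model, the sample manifest can be generated from the name reported by the jq query instead of being edited by hand. The following is a minimal sketch: the function name make_kata_pod_manifest is hypothetical, and the memory argument is assumed to be in MiB because the sample value 16384 corresponds to 16 GB. Everything else is copied from the sample pod specification in this section.

```shell
#!/bin/sh
# make_kata_pod_manifest is a hypothetical helper name.
# Arguments: passthrough GPU resource name (for example nvidia.com/GA102GL_A10)
# and VM memory, assumed to be in MiB (default 16384, as in the sample pod).
make_kata_pod_manifest() {
  resource=$1
  memory_mib=${2:-16384}
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata
  annotations:
    cdi.k8s.io/gpu: "nvidia.com/pgpu=0"
    io.katacontainers.config.hypervisor.default_memory: "${memory_mib}"
spec:
  runtimeClassName: kata-qemu-nvidia-gpu
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        "${resource}": 1
EOF
}

# Example usage:
#   make_kata_pod_manifest nvidia.com/GA102GL_A10 > cuda-vectoradd-kata.yaml
#   kubectl apply -f cuda-vectoradd-kata.yaml
```

The sketch keeps the CDI annotation fixed at device index 0, which matches the single-GPU passthrough limitation described earlier on this page.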
Troubleshooting Workloads
If the sample workload does not run, confirm that you labelled nodes to run virtual machines in containers:
$ kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough
Example Output
NAME STATUS ROLES AGE VERSION
kata-worker-1 Ready <none> 10d v1.27.3
kata-worker-2 Ready <none> 10d v1.27.3
kata-worker-3 Ready <none> 10d v1.27.3
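The label check above can also be scripted so that an empty result is reported explicitly. The following is a minimal sketch: the function name count_kata_nodes is hypothetical, and it relies only on the label selector shown above plus the standard --no-headers option of kubectl get.

```shell
#!/bin/sh
# count_kata_nodes is a hypothetical helper name.
# It prints the number of nodes that carry the vm-passthrough label.
count_kata_nodes() {
  kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough \
    --no-headers 2>/dev/null | awk 'END { print NR }'
}

# Example usage:
#   if [ "$(count_kata_nodes)" -eq 0 ]; then
#     echo "no nodes are labelled for Kata Containers" >&2
#   fi
```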
About the Pod Annotation
The cdi.k8s.io/gpu: "nvidia.com/pgpu=0" annotation is used when the pod sandbox is created. The annotation ensures that the virtual machine created by the Kata runtime is created with the correct PCIe topology so that GPU passthrough succeeds.
The annotation refers to a Container Device Interface (CDI) device, nvidia.com/pgpu=0. The pgpu indicates passthrough GPU and the 0 indicates the device index. The index is defined by the order that the GPUs are enumerated on the PCI bus. The index does not correlate to a CUDA index.
The NVIDIA Kata Manager creates a CDI specification on the GPU nodes. The file includes a device entry for each passthrough device.
The following sample /var/run/cdi/nvidia.com-pgpu.yaml file shows one GPU that is bound to the VFIO PCI driver:
cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/vfio/10
  name: "0"
kind: nvidia.com/pgpu
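To see which device indexes are valid values for the cdi.k8s.io/gpu annotation, the device entries in the CDI specification can be listed. The following is a minimal sketch: the function name list_pgpu_devices is hypothetical, and the parsing is a naive line-oriented scan that assumes the layout of the sample file above, with each path line preceding the name line of the same device. It is not a general YAML parser.

```shell
#!/bin/sh
# list_pgpu_devices is a hypothetical helper name.
# Argument: path to the CDI specification (defaults to the path used above).
# Output: one line per device, "nvidia.com/pgpu=<index> <vfio device node>".
list_pgpu_devices() {
  spec=${1:-/var/run/cdi/nvidia.com-pgpu.yaml}
  awk '
    $1 == "-" && $2 == "path:" { path = $3 }
    $1 == "name:" { gsub(/"/, "", $2); print "nvidia.com/pgpu=" $2, path }
  ' "$spec"
}

# Example usage on a GPU node:
#   list_pgpu_devices
```

For the sample file above, the sketch would report a single device with index 0 backed by /dev/vfio/10.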