GPU Operator with Confidential Containers and Kata

About Support for Confidential Containers

Note

Technology Preview features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.

Confidential containers is the cloud-native approach to confidential computing. Confidential computing extends the practice of securing data in transit and data at rest by adding the practice of securing data in use.

Confidential computing is a technology that isolates sensitive data in NVIDIA GPUs and a protected CPU enclave during processing. Confidential computing relies on hardware features such as Intel SGX, Intel TDX, and AMD SEV to provide the trusted execution environment (TEE). The TEE provides embedded encryption keys and an embedded attestation mechanism to ensure that keys are only accessible by authorized application code.

The following high-level diagram shows some fundamental concepts for confidential containers with the NVIDIA GPU Operator:

containerd is configured to run a Kata runtime to start virtual machines.
Kata starts the virtual machines using an NVIDIA-optimized Linux kernel and an NVIDIA-provided initial RAM disk.
Before the containers run in the virtual machine, a guest pre-start hook runs the local verifier that is part of the NVIDIA Attestation SDK.

_images/gpu-op-confidential-containers.svg — High-Level Logical Diagram of Software Components and Communication Paths

Requirements

Refer to the Confidential Computing Deployment Guide at the https://docs.nvidia.com/confidential-computing website for information about supported NVIDIA GPUs, such as the NVIDIA Hopper H100.

The following topics in the deployment guide apply to a cloud-native environment:

Hardware selection and initial hardware configuration, such as BIOS settings.
Host operating system selection, initial configuration, and validation.

The remaining configuration topics in the deployment guide do not apply to a cloud-native environment. NVIDIA GPU Operator performs the actions that are described in these topics.

Key Software Components

NVIDIA GPU Operator brings together the following software components to simplify managing the software required for confidential computing and deploying confidential container workloads:

Confidential Containers Operator

The Operator manages installing and deploying a runtime that can run Kata Containers with QEMU.

NVIDIA Kata Manager for Kubernetes

GPU Operator deploys NVIDIA Kata Manager for Kubernetes, k8s-kata-manager. The manager performs the following functions:

Manages the kata-qemu-nvidia-gpu-snp runtime class.
Configures containerd to use the runtime class.
Manages the Kata artifacts such as Linux kernel images and initial RAM disks.

NVIDIA Confidential Computing Manager for Kubernetes

GPU Operator deploys the manager, k8s-cc-manager, to set the confidential computing mode on the NVIDIA GPUs.

Node Feature Discovery (NFD)

When you install NVIDIA GPU Operator for confidential computing, you must specify the nfd.nodefeaturerules=true option. This option directs the Operator to install node feature rules that detect CPU security features and the NVIDIA GPU hardware. You can confirm the rules are installed by running kubectl get nodefeaturerules nvidia-nfd-node-featurerules.

On nodes that have an NVIDIA Hopper family GPU and either Intel TDX or AMD SEV-SNP, NFD adds labels to the node such as "feature.node.kubernetes.io/cpu-security.sev.snp.enabled": "true" and "nvidia.com/cc.capable": "true". NVIDIA GPU Operator only deploys the operands for confidential containers on nodes that have the "nvidia.com/cc.capable": "true" label.
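
The labels described above can be used as a label selector to check which nodes the Operator treats as capable of confidential computing. The following is a minimal sketch that wraps standard kubectl options; the function names cc_capable_nodes and cc_label_columns are hypothetical and are not part of any NVIDIA tooling.

```shell
# List nodes that carry the nvidia.com/cc.capable=true label added by NFD.
# The function names are illustrative only.
cc_capable_nodes() {
  kubectl get nodes -l nvidia.com/cc.capable=true -o name
}

# Show the confidential computing labels as extra columns for every node.
cc_label_columns() {
  kubectl get nodes \
    -L nvidia.com/cc.capable \
    -L feature.node.kubernetes.io/cpu-security.sev.snp.enabled
}
```

Running cc_capable_nodes prints one line per matching node. An empty result suggests that the node feature rules are not installed or that no node meets the hardware requirements.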

About NVIDIA Confidential Computing Manager

You can set the default confidential computing mode of the NVIDIA GPUs by setting the ccManager.defaultMode=<on|off> option. The default value is off. You can set this option when you install NVIDIA GPU Operator or afterward by modifying the cluster-policy instance of the ClusterPolicy object.

When you change the mode, the manager performs the following actions:

Evicts the other GPU Operator operands from the node.

However, the manager does not drain user workloads. You must ensure that no user workloads are running on the node before you change the mode.
Unbinds the GPU from the VFIO PCI device driver.
Changes the mode and resets the GPU.
Reschedules the other GPU Operator operands.
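
Because the manager does not drain user workloads, it can help to list the pods on a node before you change the mode. The following is a minimal sketch that uses the standard kubectl field selector for pods; the function name pods_on_node is hypothetical.

```shell
# Print the pods that are scheduled on a node, across all namespaces.
# Usage: pods_on_node <node-name>
# The function name is illustrative only.
pods_on_node() {
  node="$1"
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=${node}"
}
```

The output also includes GPU Operator operands and system pods, so review the list for user workloads rather than expecting it to be empty.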

NVIDIA Confidential Computing Manager Configuration

The following part of the cluster policy shows the fields related to the manager:

ccManager:
  enabled: true
  defaultMode: "off"
  repository: nvcr.io/nvidia/cloud-native
  image: k8s-cc-manager
  version: v0.1.0
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:
    - name: CC_CAPABLE_DEVICE_IDS
      value: "0x2331,0x2322"
  resources: {}

Limitations and Restrictions

GPUs are available to containers as a single GPU in passthrough mode only. Multi-GPU passthrough and vGPU are not supported.
Support is limited to initial installation and configuration only. Upgrading or reconfiguring an existing cluster to enable confidential computing is not supported.
Support for confidential computing environments is limited to the implementation described on this page.
NVIDIA supports the Operator and confidential computing with the containerd runtime only.
The Operator supports performing local attestation only.

Cluster Topology Considerations

You can configure all the worker nodes in your cluster for confidential containers, or you can configure some nodes for confidential containers and the others for traditional containers. Consider the following example.

Node A is configured to run traditional containers.

Node B is configured to run confidential containers.

Node A receives the following software components:

NVIDIA Driver Manager for Kubernetes – to install the data-center driver.
NVIDIA Container Toolkit – to ensure that containers can access GPUs.
NVIDIA Device Plugin for Kubernetes – to discover and advertise GPU resources to kubelet.
NVIDIA DCGM and DCGM Exporter – to monitor GPUs.
NVIDIA MIG Manager for Kubernetes – to manage MIG-capable GPUs.
Node Feature Discovery – to detect CPU, kernel, and host features and label worker nodes.
NVIDIA GPU Feature Discovery – to detect NVIDIA GPUs and label worker nodes.

Node B receives the following software components:

NVIDIA Kata Manager for Kubernetes – to manage the NVIDIA artifacts such as the NVIDIA optimized Linux kernel image and initial RAM disk.
NVIDIA Confidential Computing Manager for Kubernetes – to manage the confidential computing mode of the NVIDIA GPU on the node.
NVIDIA Sandbox Device Plugin – to discover and advertise the passthrough GPUs to kubelet.
NVIDIA VFIO Manager – to load the vfio-pci device driver and bind it to all GPUs on the node.
Node Feature Discovery – to detect CPU security features and NVIDIA GPUs, and to label worker nodes.

Prerequisites

Refer to the Confidential Computing Deployment Guide for the following prerequisites:
- You selected and configured your hardware and BIOS to support confidential computing.
- You installed and configured an operating system to support confidential computing.
- You validated that the Linux kernel is SNP-aware.
Your hosts are configured to enable hardware virtualization. Enabling this feature is typically performed by configuring the host BIOS.
Your hosts are configured to support IOMMU.

If the output from running ls /sys/kernel/iommu_groups includes 0, 1, and so on, then your host is configured for IOMMU.

If the host is not configured or you are unsure, add the intel_iommu=on Linux kernel command-line argument. For most Linux distributions, you add the argument to the /etc/default/grub file:
```
...
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau"
...
```
On Ubuntu systems, run sudo update-grub after making the change to configure the bootloader. On other systems, you might need to run sudo dracut after making the change. Refer to the documentation for your operating system. Reboot the host after configuring the bootloader.
You have a Kubernetes cluster and you have cluster administrator privileges.
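
The IOMMU check in the prerequisites above can be scripted. The following is a minimal sketch, assuming that a non-empty /sys/kernel/iommu_groups directory indicates that IOMMU is configured, as described above. The function name iommu_enabled and its optional directory argument are hypothetical; the argument exists only so that the check can be run against a test directory.

```shell
# Return success when the directory contains at least one IOMMU group entry.
# The directory defaults to /sys/kernel/iommu_groups.
# The function name and the optional argument are illustrative only.
iommu_enabled() {
  dir="${1:-/sys/kernel/iommu_groups}"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example on the host:
#   if iommu_enabled; then echo "IOMMU groups found"; else echo "No IOMMU groups"; fi
```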

Overview of Installation and Configuration

To install and configure your cluster to support the NVIDIA GPU Operator with confidential containers, perform the following high-level steps:

Label the worker nodes that you want to use with confidential containers.

This step ensures that you can continue to run traditional container workloads with GPU or vGPU workloads on some nodes in your cluster.
Install the Confidential Containers Operator.

This step installs the Operator and also the Kata Containers runtime that NVIDIA uses for confidential containers.
Install the NVIDIA GPU Operator.

You install the Operator and specify options to deploy the operands that are required for confidential containers.

After installation, you can change the confidential computing mode and run a sample workload.

Label Nodes for Confidential Containers

Label the nodes to run Kata Containers and configure them for confidential containers:

$ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
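
When several nodes need the label, the command above can be wrapped in a loop and followed by a check. The following is a minimal sketch; the function names label_nodes_for_coco and list_coco_nodes are hypothetical.

```shell
# Apply the vm-passthrough workload label to every node given as an argument.
# The function names are illustrative only.
label_nodes_for_coco() {
  for node in "$@"; do
    kubectl label node "$node" nvidia.com/gpu.workload.config=vm-passthrough
  done
}

# List the nodes that carry the label and show its value as a column.
list_coco_nodes() {
  kubectl get nodes \
    -l nvidia.com/gpu.workload.config=vm-passthrough \
    -L nvidia.com/gpu.workload.config
}
```

For example, label_nodes_for_coco worker-1 worker-2 labels two nodes, where worker-1 and worker-2 are placeholder node names.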

Install the Confidential Containers Operator

Perform the following steps to install and verify the Confidential Containers Operator:

Set the Operator version in an environment variable:
```
$ export VERSION=v0.7.0
```

Install the Operator:

$ kubectl apply -k "github.com/confidential-containers/operator/config/release?ref=${VERSION}"

Example Output

namespace/confidential-containers-system created
customresourcedefinition.apiextensions.k8s.io/ccruntimes.confidentialcontainers.org created
serviceaccount/cc-operator-controller-manager created
role.rbac.authorization.k8s.io/cc-operator-leader-election-role created
clusterrole.rbac.authorization.k8s.io/cc-operator-manager-role created
clusterrole.rbac.authorization.k8s.io/cc-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/cc-operator-proxy-role created
rolebinding.rbac.authorization.k8s.io/cc-operator-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/cc-operator-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/cc-operator-proxy-rolebinding created
configmap/cc-operator-manager-config created
service/cc-operator-controller-manager-metrics-service created
deployment.apps/cc-operator-controller-manager created

(Optional) View the pods and services in the confidential-containers-system namespace:

$ kubectl get pod,svc -n confidential-containers-system

Example Output

NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-c98c4ff74-ksb4q   2/2     Running   0          2m59s

NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.98.221.141   <none>        8443/TCP   2m59s

Install the sample Confidential Containers runtime by creating the manifests and then editing the node selector so that the runtime is installed only on the labelled nodes.

Create a local copy of the manifests in a file that is named ccruntime.yaml:

$ kubectl apply --dry-run=client -o yaml \
    -k "github.com/confidential-containers/operator/config/samples/ccruntime/default?ref=${VERSION}" > ccruntime.yaml

Edit the ccruntime.yaml file and set the node selector as follows:

apiVersion: confidentialcontainers.org/v1beta1
kind: CcRuntime
metadata:
  annotations:
...
spec:
  ccNodeSelector:
    matchLabels:
      nvidia.com/gpu.workload.config: "vm-passthrough"
...

Apply the modified manifests:

$ kubectl apply -f ccruntime.yaml

Example Output

ccruntime.confidentialcontainers.org/ccruntime-sample created

Wait a few minutes for the Operator to create the base runtime classes.
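
Instead of waiting for a fixed time, you can poll until a runtime class exists. The following is a minimal sketch; the function name wait_for_runtimeclass and the default timeout and interval values are assumptions, not values that the Operator documents.

```shell
# Poll until a runtime class exists or the timeout (in seconds) expires.
# Usage: wait_for_runtimeclass <name> [timeout-seconds] [interval-seconds]
# The function name and the default values are illustrative only.
wait_for_runtimeclass() {
  name="$1"
  timeout="${2:-300}"
  interval="${3:-10}"
  elapsed=0
  # kubectl returns a nonzero exit code while the runtime class is missing.
  until kubectl get runtimeclass "$name" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out waiting for runtime class ${name}" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Runtime class ${name} is available"
}
```

For example, wait_for_runtimeclass kata-qemu-snp waits up to five minutes for one of the base runtime classes.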

(Optional) View the runtime classes:

$ kubectl get runtimeclass

Example Output

NAME            HANDLER         AGE
kata            kata            13m
kata-clh        kata-clh        13m
kata-clh-tdx    kata-clh-tdx    13m
kata-qemu       kata-qemu       13m
kata-qemu-sev   kata-qemu-sev   13m
kata-qemu-snp   kata-qemu-snp   13m
kata-qemu-tdx   kata-qemu-tdx   13m

Install the NVIDIA GPU Operator

Procedure

Perform the following steps to install the Operator for use with confidential containers:

Add and update the NVIDIA Helm repository:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update

Specify at least the following options when you install the Operator:

$ helm install --wait --generate-name \
   -n gpu-operator --create-namespace \
   nvidia/gpu-operator \
   --set sandboxWorkloads.enabled=true \
   --set kataManager.enabled=true \
   --set ccManager.enabled=true \
   --set nfd.nodefeaturerules=true

Example Output

NAME: gpu-operator
LAST DEPLOYED: Tue Jul 25 19:19:07 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Verification

Verify that the Kata Manager, Confidential Computing Manager, and VFIO Manager operands are running:

$ kubectl get pods -n gpu-operator

Example Output

NAME                                                         READY   STATUS      RESTARTS   AGE
gpu-operator-57bf5d5769-nb98z                                1/1     Running     0          6m21s
gpu-operator-node-feature-discovery-master-b44f595bf-5sjxg   1/1     Running     0          6m21s
gpu-operator-node-feature-discovery-worker-lwhdr             1/1     Running     0          6m21s
nvidia-cc-manager-yzbw7                                      1/1     Running     0          3m36s
nvidia-kata-manager-bw5mb                                    1/1     Running     0          3m36s
nvidia-sandbox-device-plugin-daemonset-cr4s6                 1/1     Running     0          2m37s
nvidia-sandbox-validator-9wjm4                               1/1     Running     0          2m37s
nvidia-vfio-manager-vg4wp                                    1/1     Running     0          3m36s

Verify that the kata-qemu-nvidia-gpu and kata-qemu-nvidia-gpu-snp runtime classes are available:

$ kubectl get runtimeclass

Example Output

NAME                       HANDLER                    AGE
kata                       kata                       37m
kata-clh                   kata-clh                   37m
kata-clh-tdx               kata-clh-tdx               37m
kata-qemu                  kata-qemu                  37m
kata-qemu-nvidia-gpu       kata-qemu-nvidia-gpu       96s
kata-qemu-nvidia-gpu-snp   kata-qemu-nvidia-gpu-snp   96s
kata-qemu-sev              kata-qemu-sev              37m
kata-qemu-snp              kata-qemu-snp              37m
kata-qemu-tdx              kata-qemu-tdx              37m
nvidia                     nvidia                     97s

(Optional) If you have host access to the worker node, you can perform the following steps:

Confirm that the host uses the vfio-pci device driver for GPUs:

$ lspci -nnk -d 10de:

Example Output

65:00.0 3D controller [0302]: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] (rev xx)
        Subsystem: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau

Confirm that NVIDIA Kata Manager installed the kata-qemu-nvidia-gpu-snp runtime class files:

$ ls -1 /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu-snp/

Example Output

5.19.2.tar.gz
config-5.19.2-109-nvidia-gpu-sev
configuration-kata-qemu-nvidia-gpu-snp.toml
dpkg.sbom.list
kata-ubuntu-jammy-nvidia-gpu.initrd
vmlinuz-5.19.2-109-nvidia-gpu-sev
...

Managing the Confidential Computing Mode

Three modes are supported:

on – Enable confidential computing.
off – Disable confidential computing.
devtools – Development mode for software development and debugging.

You can set a cluster-wide default mode and you can set the mode on individual nodes. The mode that you set on a node has higher precedence than the cluster-wide default mode.

Setting a Cluster-Wide Default Mode

To set a cluster-wide mode, specify the ccManager.defaultMode field like the following example:

$ kubectl patch clusterpolicy/cluster-policy \
       --type=merge \
       -p '{"spec": {"ccManager": {"defaultMode": "on"}}}'

Setting a Node-Level Mode

To set a node-level mode, apply the nvidia.com/cc.mode=<on|off|devtools> label like the following example:

$ kubectl label node <node-name> nvidia.com/cc.mode=on --overwrite

The mode that you set on a node has higher precedence than the cluster-wide default mode.

Verifying a Mode Change

To verify that a cluster-wide or node-level mode change was successful, view the nvidia.com/cc.mode.state node label:

$ kubectl get node <node-name> -o json | \
    jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc.mode.state")))'

The label is set to either success or failed.
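
If jq is not available, the same label can be read with a kubectl JSONPath expression. The following is a minimal sketch; the function names cc_mode_state and cc_mode_change_succeeded are hypothetical. The dots inside the label key are escaped with backslashes so that the JSONPath parser treats the key as a single field.

```shell
# Print the value of the nvidia.com/cc.mode.state label for a node.
# Usage: cc_mode_state <node-name>
# The function names are illustrative only.
cc_mode_state() {
  kubectl get node "$1" \
    -o jsonpath='{.metadata.labels.nvidia\.com/cc\.mode\.state}'
}

# Return success only when the label reports that the mode change succeeded.
cc_mode_change_succeeded() {
  [ "$(cc_mode_state "$1")" = "success" ]
}
```

The second function is suited to scripts, for example cc_mode_change_succeeded worker-1 || echo "Mode change failed or is still pending", where worker-1 is a placeholder node name.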

Run a Sample Workload

A pod specification for a confidential container requires the following:

Specify the kata-qemu-nvidia-gpu-snp runtime class.
Specify a passthrough GPU resource.

Determine the passthrough GPU resource names:

$ kubectl get nodes -l nvidia.com/gpu.present -o json | \
  jq '.items[0].status.allocatable |
    with_entries(select(.key | startswith("nvidia.com/"))) |
    with_entries(select(.value != "0"))'

Example Output

{
   "nvidia.com/GH100_H100_PCIE": "1"
}

Create a file, such as cuda-vectoradd-coco.yaml, like the following example:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-coco
  annotations:
    cdi.k8s.io/gpu: "nvidia.com/pgpu=0"
spec:
  runtimeClassName: kata-qemu-nvidia-gpu-snp
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        "nvidia.com/GH100_H100_PCIE": 1

Create the pod:

$ kubectl apply -f cuda-vectoradd-coco.yaml

View the logs from the pod:

$ kubectl logs -n default cuda-vectoradd-coco

Example Output

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Delete the pod:

$ kubectl delete -f cuda-vectoradd-coco.yaml

Refer to About the Pod Annotation for information about the pod annotation.

Attestation

About Attestation

With confidential computing, attestation is the assertion that the hardware and software are trustworthy.

The Kata runtime uses the kata-ubuntu-jammy-nvidia-gpu.initrd initial RAM disk file that NVIDIA Kata Manager for Kubernetes downloaded from NVIDIA Container Registry, nvcr.io. The initial RAM disk includes an NVIDIA verifier tool that runs as a container guest pre-start hook. When the attestation is successful, the GPU is set in the Ready state. On failure, containers still start, but CUDA applications fail with a system not initialized error.

Refer to NVIDIA Hopper Confidential Computing Attestation Verifier at https://docs.nvidia.com/confidential-computing for more information about attestation.

Accessing the VM of a Scheduled Confidential Container

You do not need to access the VM as a routine task. Accessing the VM is useful for troubleshooting or performing lower-level verification about the confidential computing mode.

This task requires host access to the Kubernetes node that is running the container.

Determine the Kubernetes node and pod sandbox ID:
```
$ kubectl describe pod <pod-name>
```
Access the Kubernetes node. Using secure shell is typical.
Access the Kata runtime:
```
$ kata-runtime exec <pod-sandbox-ID>
```
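
The lookups in the steps above can also be scripted. The following is a minimal sketch, assuming that crictl is installed on the node and configured for containerd, which this page does not state. The function names pod_node and pod_sandbox_id are hypothetical. The crictl name filter is a pattern match, so the output can contain more than one ID if several pod names match.

```shell
# Print the name of the node that runs a pod.
# Run this where kubectl is configured.
# Usage: pod_node <pod-name> [namespace]
pod_node() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.nodeName}'
}

# Print the pod sandbox IDs whose pod name matches the argument.
# Run this on the Kubernetes node itself. Assumes crictl is available there.
# Usage: pod_sandbox_id <pod-name> [namespace]
pod_sandbox_id() {
  sudo crictl pods --name "$1" --namespace "${2:-default}" -q
}
```

On the node, the result can then be passed to the Kata runtime, for example kata-runtime exec "$(pod_sandbox_id cuda-vectoradd-coco)".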

Viewing the GPU Ready State

After you access the VM, you can run nvidia-smi conf-compute -grs:

Confidential Compute GPUs Ready state: ready

Viewing the Confidential Computing Mode

After you access the VM, you can run nvidia-smi conf-compute -f to view the mode:

CC status: ON

Verifying That Attestation Is Successful

After you access the VM, you can run the following commands to verify that attestation is successful:

# source /gpu-attestation/nv-venv/bin/activate
# python3 /gpu-attestation/nv_attestation_sdk/tests/SmallGPUTest.py

Example Output

[SmallGPUTest] node name : thisNode1
[['LOCAL_GPU_CLAIMS', <Devices.GPU: 2>, <Environment.LOCAL: 2>, '', '', '']]
[SmallGPUTest] call attest() - expecting True
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
VERIFYING GPU : 0
      Driver version fetched : 535.86.05
      VBIOS version fetched : 96.00.5e.00.01
      Validating GPU certificate chains.
             GPU attestation report certificate chain validation successful.
                    The certificate chain revocation status verification successful.
      Authenticating attestation report
             The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
             Driver version fetched from the attestation report : 535.86.05
             VBIOS version fetched from the attestation report : 96.00.5e.00.01
             Attestation report signature verification successful.
             Attestation report verification successful.
      Authenticating the RIMs.
             Authenticating Driver RIM
                     Schema validation passed.
                     driver RIM certificate chain verification successful.
                     The certificate chain revocation status verification successful.
                     driver RIM signature verification successful.
                     Driver RIM verification successful
            Authenticating VBIOS RIM.
                     RIM Schema validation passed.
                     vbios RIM certificate chain verification successful.
                     The certificate chain revocation status verification successful.
                     vbios RIM signature verification successful.
                     VBIOS RIM verification successful
      Comparing measurements (runtime vs golden)
                     The runtime measurements are matching with the golden measurements.
             GPU is in the expected state.
      GPU 0 verified successfully.
      attestation result: True
      claims list:: {'x-nv-gpu-availability': True, 'x-nv-gpu-attestation-report-available': ...
      True
      [SmallGPUTest] token : [["JWT", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e..."],
         {"LOCAL_GPU_CLAIMS": "eyJhbGciOiJIUzI1NiIsInR5cCI..."}]
      [SmallGPUTest] call validate_token() - expecting True
      True

Troubleshooting

To troubleshoot attestation failures, access the VM and view the logs in the /var/log/ directory.

To troubleshoot virtual machine failures, access the Kubernetes node and view logs with the journalctl command.

$ sudo journalctl -u containerd -f

The Kata agent communicates with the virtcontainers library on the host by using the VSOCK port. The communication is recorded to the system journal on the host. When you view the logs, refer to logs with a kata or virtcontainers prefix.
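
To narrow the journal to the relevant entries, you can filter the containerd logs. The following is a minimal sketch; the function name filter_kata_logs is hypothetical, and the pattern matches the words kata or virtcontainers anywhere in a line, which is broader than a strict prefix match.

```shell
# Keep only log lines that mention kata or virtcontainers.
# Reads log lines on standard input and writes the matches to standard output.
# The function name is illustrative only.
filter_kata_logs() {
  grep -E 'kata|virtcontainers'
}

# Example on the node:
#   sudo journalctl -u containerd --no-pager --since "10 minutes ago" | filter_kata_logs
```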

Additional Resources

NVIDIA Confidential Computing documentation is available at https://docs.nvidia.com/confidential-computing.
NVIDIA Verifier Tool is part of the nvTrust project. Refer to https://github.com/NVIDIA/nvtrust/tree/main/guest_tools/gpu_verifiers/local_gpu_verifier for more information.