NVIDIA GPU Operator with Google GKE

About Using the Operator with Google GKE

There are two ways to use NVIDIA GPU Operator with Google Kubernetes Engine (GKE). You can use the Google driver installer to install and manage the NVIDIA GPU Driver on the nodes, or you can use the Operator and NVIDIA Driver Manager to manage the driver and the other NVIDIA software components.

The choice depends on the operating system and whether you prefer to have the Operator manage all the software components.

Google Driver Installer

  • Supported OS: Container-Optimized OS, Ubuntu with containerd

  • Summary: The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages the other software components.

NVIDIA Driver Manager

  • Supported OS: Ubuntu with containerd

  • Summary: NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and the other NVIDIA software components.

The preceding information applies to GKE Standard node pools. The GPU Operator is not supported with Autopilot Pods; refer to Deploy GPU workloads in Autopilot in the GKE documentation instead.

Prerequisites

  • You installed and initialized the Google Cloud CLI. Refer to gcloud CLI overview in the Google Cloud documentation.

  • You have a Google Cloud project to use for your GKE cluster. Refer to Creating and managing projects in the Google Cloud documentation.

  • You have the project ID for your Google Cloud project. Refer to Identifying projects in the Google Cloud documentation.

  • You know the machine type for the node pool and that the machine type is supported in your region and zone. Refer to GPU platforms in the Google Cloud documentation.
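
    As a quick check, you can list the accelerator types that are available in a zone with a command like the following; the zone is an example value:

    $ gcloud compute accelerator-types list --filter="zone:us-west1-a"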

Using the Google Driver Installer

Perform the following steps to create a GKE cluster with the gcloud CLI and use the Google driver installer to manage the GPU driver. You can create a node pool that uses a Container-Optimized OS node image or an Ubuntu node image.

  1. Create the node pool. Refer to Running GPUs in GKE Standard clusters in the GKE documentation.

    When you create the node pool, specify the following additional gcloud command-line options to disable GKE features that are not supported with the Operator:

    • --node-labels="gke-no-default-nvidia-gpu-device-plugin=true"

      The node label disables the GKE GPU device plugin daemon set on GPU nodes.

    • --accelerator type=...,gpu-driver-version=disabled

      This argument disables automatic installation of the GPU driver on GPU nodes.
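
    For example, a cluster-creation command like the following includes both options. The cluster name, project ID, location, machine type, and accelerator type are example values only; adjust them for your environment:

    $ gcloud container clusters create demo-cluster \
        --project <project-id> \
        --location us-west1 \
        --machine-type "n1-standard-4" \
        --accelerator "type=nvidia-tesla-t4,count=1,gpu-driver-version=disabled" \
        --image-type "COS_CONTAINERD" \
        --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \
        --num-nodes "1"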

  2. Get the authentication credentials for the cluster:

    $ gcloud container clusters get-credentials demo-cluster --location us-west1
    
  3. Optional: Verify that you can connect to the cluster:

    $ kubectl get nodes -o wide
    
  4. Create the namespace for the NVIDIA GPU Operator:

    $ kubectl create ns gpu-operator
    
  5. Create a file, such as gpu-operator-quota.yaml, with contents like the following example. GKE permits Pods with the system-node-critical and system-cluster-critical priority classes only in namespaces that have a matching resource quota, and the GPU Operator Pods use these priority classes, so the quota is required for the Pods to be scheduled:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-operator-quota
    spec:
      hard:
        pods: 100
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
            - system-node-critical
            - system-cluster-critical
    
  6. Apply the resource quota:

    $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
    
  7. Optional: View the resource quota:

    $ kubectl get -n gpu-operator resourcequota
    

    Example Output

    NAME                  AGE     REQUEST
    gpu-operator-quota    38s     pods: 0/100
    
  8. Install the Google driver installer daemon set.

    For Container-Optimized OS:

    $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    

    For Ubuntu, the manifest to apply depends on GPU model and node version. Refer to the Ubuntu tab at Manually install NVIDIA GPU drivers in the GKE documentation.
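
    After you apply the daemon set, you can confirm that the driver installer Pods start before you install the Operator. The daemon set name below assumes the Container-Optimized OS manifest; the name can differ for other manifests:

    $ kubectl get daemonset -n kube-system nvidia-driver-installer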

  9. Install the Operator using Helm:

    $ helm install --wait --generate-name \
        -n gpu-operator \
        nvidia/gpu-operator \
        --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
        --set toolkit.installDir=/home/kubernetes/bin/nvidia \
        --set cdi.enabled=true \
        --set cdi.default=true \
        --set driver.enabled=false
    

    These arguments set the NVIDIA Container Toolkit installation directory and the driver installation directory to /home/kubernetes/bin/nvidia. On GKE node images, this directory is writable and is a stateful location for storing the NVIDIA runtime binaries.

    To configure MIG with NVIDIA MIG Manager, specify the following additional Helm command arguments:

    --set migManager.env[0].name=WITH_REBOOT \
    --set-string migManager.env[0].value=true
    

Using NVIDIA Driver Manager

Perform the following steps to create a GKE cluster with the gcloud CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver. The steps create the cluster with a node pool that uses an Ubuntu with containerd node image.

  1. Create the cluster by running a command that is similar to the following example:

    $ gcloud beta container clusters create demo-cluster \
        --project <project-id> \
        --location us-west1 \
        --release-channel "regular" \
        --machine-type "n1-standard-4" \
        --accelerator "type=nvidia-tesla-t4,count=1" \
        --image-type "UBUNTU_CONTAINERD" \
        --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \
        --disk-type "pd-standard" \
        --disk-size "1000" \
        --no-enable-intra-node-visibility \
        --metadata disable-legacy-endpoints=true \
        --max-pods-per-node "110" \
        --num-nodes "1" \
        --logging=SYSTEM,WORKLOAD \
        --monitoring=SYSTEM \
        --enable-ip-alias \
        --default-max-pods-per-node "110" \
        --no-enable-master-authorized-networks \
        --tags=nvidia-ingress-all
    

    Creating the cluster takes several minutes.
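
    You can check the provisioning status with a command like the following; the cluster reports RUNNING when it is ready:

    $ gcloud container clusters describe demo-cluster --location us-west1 \
        --format="value(status)"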

  2. Get the authentication credentials for the cluster:

    $ USE_GKE_GCLOUD_AUTH_PLUGIN=True \
        gcloud container clusters get-credentials demo-cluster --location us-west1
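
    If kubectl cannot authenticate to the cluster, the GKE authentication plugin might be missing. Assuming a gcloud-managed installation, you can add it with:

    $ gcloud components install gke-gcloud-auth-plugin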
    
  3. Optional: Verify that you can connect to the cluster:

    $ kubectl get nodes -o wide
    
  4. Create the namespace for the NVIDIA GPU Operator:

    $ kubectl create ns gpu-operator
    
  5. Create a file, such as gpu-operator-quota.yaml, with contents like the following example:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-operator-quota
    spec:
      hard:
        pods: 100
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
            - system-node-critical
            - system-cluster-critical
    
  6. Apply the resource quota:

    $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
    
  7. Optional: View the resource quota:

    $ kubectl get -n gpu-operator resourcequota
    

    Example Output

    NAME                  AGE     REQUEST
    gke-resource-quotas   6m56s   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 2/1500, services: 1/500
    gpu-operator-quota    38s     pods: 0/100
    
  8. Install the Operator. Refer to Installing the NVIDIA GPU Operator.
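
    As a minimal sketch, adding the NVIDIA Helm repository and installing the chart with default options looks like the following. The Helm values that your node image requires can differ, so treat the installation guide as the authoritative reference:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update

    $ helm install --wait --generate-name \
        -n gpu-operator \
        nvidia/gpu-operator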