NVIDIA GPU Operator with Google GKE

About Using the Operator with Google GKE

You can use the NVIDIA GPU Operator with Google Kubernetes Engine (GKE), but you must use an operating system that is supported by the Operator.

By default, Google GKE configures nodes with the Container-Optimized OS with Containerd from Google. This operating system is not supported by the Operator.

To use a supported operating system, such as Ubuntu 22.04 or 20.04, configure your GKE cluster entirely with Ubuntu containerd node images or with a node pool that uses Ubuntu containerd node images.

By selecting a supported operating system rather than Container-Optimized OS with Containerd, you can customize which NVIDIA software components are installed by the GPU Operator at deployment time. For example, the Operator can deploy GPU driver containers, and you can use the Operator to manage the lifecycle of the NVIDIA software components.
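
For the node pool option, a minimal sketch of adding an Ubuntu containerd GPU node pool to an existing cluster might look like the following. The function name create_ubuntu_gpu_pool and the default pool name gpu-pool are hypothetical. The machine type, accelerator, and image type are assumptions that mirror the values used in the procedure on this page, so adjust them for your project.

```shell
# Sketch only: wraps "gcloud container node-pools create" in a helper function.
# The function name and the default pool name "gpu-pool" are hypothetical.
create_ubuntu_gpu_pool() {
  pool_cluster="$1"            # existing cluster, for example demo-cluster
  pool_zone="$2"               # zone of the cluster, for example us-west1-a
  pool_name="${3:-gpu-pool}"   # optional pool name

  # UBUNTU_CONTAINERD selects the Ubuntu node image instead of
  # Container-Optimized OS, which the Operator does not support.
  gcloud container node-pools create "$pool_name" \
    --cluster "$pool_cluster" \
    --zone "$pool_zone" \
    --machine-type "n1-standard-4" \
    --accelerator "type=nvidia-tesla-t4,count=1" \
    --image-type "UBUNTU_CONTAINERD" \
    --num-nodes "1"
}
```

After you define the function, a call such as create_ubuntu_gpu_pool demo-cluster us-west1-a requests one GPU node in the new pool.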

Prerequisites

  • You installed and initialized the Google Cloud CLI. Refer to gcloud CLI overview in the Google Cloud documentation.

  • You have a Google Cloud project to use for your GKE cluster. Refer to Creating and managing projects in the Google Cloud documentation.

  • You have the project ID for your Google Cloud project. Refer to Identifying projects in the Google Cloud documentation.

  • You know the machine type for the node pool and that the machine type is supported in your region and zone. Refer to GPU platforms in the Google Cloud documentation.
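
The prerequisites above can be checked from a terminal before you create the cluster. The following is a sketch rather than an official procedure: the function name check_gke_gpu_prereqs and its variables are hypothetical, and the filter expressions assume that the zone and name fields match the way they do in typical gcloud list output.

```shell
# Sketch of a prerequisite check; function and variable names are hypothetical.
check_gke_gpu_prereqs() {
  prereq_zone="$1"      # for example, us-west1-a
  prereq_machine="$2"   # for example, n1-standard-4
  prereq_gpu="$3"       # for example, nvidia-tesla-t4

  # The gcloud CLI must be installed and on PATH.
  command -v gcloud >/dev/null 2>&1 || { echo "gcloud CLI not found" >&2; return 1; }

  # The active project ID comes from the local gcloud configuration.
  prereq_project=$(gcloud config get-value project 2>/dev/null)
  [ -n "$prereq_project" ] || { echo "no active project is set" >&2; return 1; }
  echo "project: $prereq_project"

  # An empty result means the zone does not offer the machine type.
  prereq_found=$(gcloud compute machine-types list \
    --filter="zone:$prereq_zone AND name=$prereq_machine" \
    --format="value(name)")
  [ -n "$prereq_found" ] || { echo "$prereq_machine is not offered in $prereq_zone" >&2; return 1; }

  # An empty result means the zone does not offer the GPU type.
  prereq_found=$(gcloud compute accelerator-types list \
    --filter="zone:$prereq_zone AND name=$prereq_gpu" \
    --format="value(name)")
  [ -n "$prereq_found" ] || { echo "$prereq_gpu is not offered in $prereq_zone" >&2; return 1; }

  echo "$prereq_machine and $prereq_gpu are offered in $prereq_zone"
}
```

For example, check_gke_gpu_prereqs us-west1-a n1-standard-4 nvidia-tesla-t4 returns a nonzero status and prints a message to standard error if any check fails.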

Procedure

Perform the following steps to create a GKE cluster with the gcloud CLI. The steps create the cluster with a node pool that uses an Ubuntu with containerd node image.

  1. Create the cluster by running a command that is similar to the following example:

      $ gcloud beta container clusters create demo-cluster \
          --project <project-id> \
          --zone us-west1-a \
          --release-channel "regular" \
          --machine-type "n1-standard-4" \
          --accelerator "type=nvidia-tesla-t4,count=1" \
          --image-type "UBUNTU_CONTAINERD" \
          --disk-type "pd-standard" \
          --disk-size "1000" \
          --no-enable-intra-node-visibility \
          --metadata disable-legacy-endpoints=true \
          --max-pods-per-node "110" \
          --num-nodes "1" \
          --logging=SYSTEM,WORKLOAD \
          --monitoring=SYSTEM \
          --enable-ip-alias \
          --default-max-pods-per-node "110" \
          --no-enable-master-authorized-networks \
          --tags=nvidia-ingress-all
    
    Creating the cluster requires several minutes.
    
  2. Get the authentication credentials for the cluster:

    $ USE_GKE_GCLOUD_AUTH_PLUGIN=True \
        gcloud container clusters get-credentials demo-cluster --zone us-west1-a
    
  3. Optional: Verify that you can connect to the cluster:

    $ kubectl get nodes -o wide
    
  4. Create the namespace for the NVIDIA GPU Operator:

    $ kubectl create ns gpu-operator
    
  5. Create a file, such as gpu-operator-quota.yaml, with contents like the following example:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-operator-quota
    spec:
      hard:
        pods: 100
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
            - system-node-critical
            - system-cluster-critical
    
  6. Apply the resource quota:

    $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
    
  7. Optional: View the resource quota:

    $ kubectl get -n gpu-operator resourcequota
    

    Example Output

    NAME                  AGE     REQUEST
    gke-resource-quotas   6m56s   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 2/1500, services: 1/500
    gpu-operator-quota    38s     pods: 0/100
    

Next Steps