NVIDIA GPU Operator with Google GKE
About Using the Operator with Google GKE
You can use the NVIDIA GPU Operator with Google Kubernetes Engine (GKE), but you must use an operating system that is supported by the Operator.
By default, Google GKE configures nodes with the Container-Optimized OS with Containerd from Google. This operating system is not supported by the Operator.
To use a supported operating system, such as Ubuntu 22.04 or 20.04, configure your GKE cluster entirely with Ubuntu containerd node images, or add a node pool that uses Ubuntu containerd node images.
By selecting a supported operating system rather than Container-Optimized OS with Containerd, you can customize which NVIDIA software components the GPU Operator installs at deployment time. For example, you can have the Operator deploy the GPU driver containers and manage the lifecycle of the NVIDIA software components. A sketch of adding such a node pool follows this paragraph.
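For example, if you already have a cluster, you can add a node pool that uses Ubuntu containerd node images rather than recreating the cluster. The following is a minimal sketch; the node pool name, cluster name, zone, machine type, and accelerator type are illustrative values, not requirements:

$ gcloud container node-pools create gpu-pool \
    --cluster demo-cluster \
    --zone us-west1-a \
    --machine-type "n1-standard-4" \
    --accelerator "type=nvidia-tesla-t4,count=1" \
    --image-type "UBUNTU_CONTAINERD" \
    --num-nodes "1"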
Prerequisites
You installed and initialized the Google Cloud CLI. Refer to gcloud CLI overview in the Google Cloud documentation.
You have a Google Cloud project to use for your GKE cluster. Refer to Creating and managing projects in the Google Cloud documentation.
You have the project ID for your Google Cloud project. Refer to Identifying projects in the Google Cloud documentation.
You know the machine type for the node pool and have verified that the machine type is supported in your region and zone, as shown below. Refer to GPU platforms in the Google Cloud documentation.
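For example, one way to check which accelerator types are available in a zone is to list them with the gcloud CLI; the zone in this sketch is illustrative:

$ gcloud compute accelerator-types list --filter="zone:us-west1-a"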
Procedure
Perform the following steps to create a GKE cluster with the gcloud CLI.
The steps create the cluster with a node pool that uses Ubuntu containerd node images.
Create the cluster by running a command that is similar to the following example:
$ gcloud beta container clusters create demo-cluster \
    --project <project-id> \
    --zone us-west1-a \
    --release-channel "regular" \
    --machine-type "n1-standard-4" \
    --accelerator "type=nvidia-tesla-t4,count=1" \
    --image-type "UBUNTU_CONTAINERD" \
    --disk-type "pd-standard" \
    --disk-size "1000" \
    --no-enable-intra-node-visibility \
    --metadata disable-legacy-endpoints=true \
    --max-pods-per-node "110" \
    --num-nodes "1" \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM \
    --enable-ip-alias \
    --default-max-pods-per-node "110" \
    --no-enable-master-authorized-networks \
    --tags=nvidia-ingress-all

Creating the cluster requires several minutes.
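While the cluster is provisioning, you can check its status from another terminal. The following sketch assumes the cluster name and zone from the preceding command:

$ gcloud container clusters describe demo-cluster \
    --zone us-west1-a --format="value(status)"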
Get the authentication credentials for the cluster:
$ USE_GKE_GCLOUD_AUTH_PLUGIN=True \
    gcloud container clusters get-credentials demo-cluster --zone us-west1-a
Optional: Verify that you can connect to the cluster:
$ kubectl get nodes -o wide
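To confirm that the GPU node pool registered with the cluster, one option is to filter on the cloud.google.com/gke-accelerator node label that GKE applies to GPU nodes. The label value in this sketch assumes the nvidia-tesla-t4 accelerator from the cluster creation command:

$ kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4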
Create the namespace for the NVIDIA GPU Operator:
$ kubectl create ns gpu-operator
By default, GKE restricts pods that use the system-node-critical and system-cluster-critical priority classes to namespaces that have a resource quota permitting them. Because the Operator pods use these priority classes, create a file, such as gpu-operator-quota.yaml, with contents like the following example:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
Apply the resource quota:
$ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml
Optional: View the resource quota:
$ kubectl get -n gpu-operator resourcequota
Example Output
NAME                  AGE     REQUEST
gke-resource-quotas   6m56s   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 2/1500, services: 1/500
gpu-operator-quota    38s     pods: 0/100
Next Steps
You are ready to install the NVIDIA GPU Operator with Helm.
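A typical installation, assuming the publicly documented NVIDIA Helm repository and chart name, looks like the following sketch; check the NVIDIA GPU Operator documentation for the recommended chart version and options:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
$ helm install --wait gpu-operator \
    -n gpu-operator \
    nvidia/gpu-operator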