GCP Google Kubernetes Engine (GKE)#

Google Kubernetes Engine (GKE) is Google Cloud’s managed Kubernetes service for deploying, managing, and scaling containerized applications. This guide will walk you through setting up the RAPIDS Accelerator for Apache Spark on a GKE cluster with GPU nodes.

By the end of this guide, you will be able to run a sample Apache Spark application on NVIDIA GPUs in a GKE cluster.

Quick Start Summary#

For experienced users, here are the key steps:

  1. Create a GKE cluster with GPU node pool

  2. Verify GPU drivers are installed (GKE auto-installs them)

  3. Build and push a Docker image with Spark + RAPIDS Accelerator to Artifact Registry

  4. Install Spark Operator using Helm

  5. Submit Spark applications using SparkApplication CRD

Estimated time: 30-45 minutes (including cluster creation)

Prerequisites#

  • A Google Cloud account with billing enabled

  • gcloud CLI installed and initialized (run gcloud init if you haven't already)

  • kubectl installed

  • gke-gcloud-auth-plugin installed (required for kubectl authentication)

  • Helm installed (required for Spark Operator)

  • Docker installed on your local machine

Install gke-gcloud-auth-plugin if you haven’t:

gcloud components install gke-gcloud-auth-plugin
../_images/GKE-auths.png

Enable Required APIs#

Before creating a GKE cluster, enable the required Google Cloud APIs:

gcloud services enable container.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable artifactregistry.googleapis.com
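
To confirm the APIs are active for your current project, you can list the enabled services:

gcloud services list --enabled | grep -E 'container|compute|artifactregistry'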

Create a GKE Cluster with GPU Nodes#

Set Up Environment Variables#

export PROJECT_ID=[Your GCP Project ID]
export ZONE=us-central1-a
export CLUSTER_NAME=spark-rapids-gke

# Default node pool settings
export MACHINE_TYPE=n1-standard-4
export NUM_NODES=1

# GPU node pool settings
export GPU_MACHINE_TYPE=g2-standard-16
export GPU_TYPE=nvidia-l4
export GPU_COUNT=1
export NUM_GPU_NODES=2
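
If you plan to use a different GPU type or zone, you can check which accelerator types are available in your zone before creating the node pool:

gcloud compute accelerator-types list --filter="zone:${ZONE}"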

Create the GKE Cluster#

Create a GKE cluster with a default node pool:

gcloud container clusters create $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --zone=$ZONE \
    --machine-type=$MACHINE_TYPE \
    --num-nodes=$NUM_NODES

Add a GPU Node Pool#

Add a GPU node pool with NVIDIA L4 GPUs:

gcloud container node-pools create gpu-pool \
    --cluster=$CLUSTER_NAME \
    --project=$PROJECT_ID \
    --zone=$ZONE \
    --machine-type=$GPU_MACHINE_TYPE \
    --accelerator=type=$GPU_TYPE,count=$GPU_COUNT \
    --num-nodes=$NUM_GPU_NODES

Note

To use different GPU types, update the environment variables accordingly:

  • NVIDIA L4: GPU_MACHINE_TYPE=g2-standard-16, GPU_TYPE=nvidia-l4

  • NVIDIA T4: GPU_MACHINE_TYPE=n1-standard-8, GPU_TYPE=nvidia-tesla-t4

Get Cluster Credentials#

After the cluster is created, get the credentials to access it with kubectl:

gcloud container clusters get-credentials $CLUSTER_NAME --zone=$ZONE --project=$PROJECT_ID
../_images/GKE-credentials.png
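
As a quick check that kubectl is now pointed at the new cluster, list its nodes:

kubectl get nodes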

Verify GPU Availability#

On current GKE versions, NVIDIA GPU drivers are installed automatically on GPU nodes, and GPUs are exposed to Kubernetes through the nvidia-gpu-device-plugin DaemonSet. Verify that GPUs are available:

# Check GPU device plugin pods
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources on nodes
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

You should see output like:

../_images/GKE-nodes.png
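
Optionally, you can run an end-to-end smoke test by launching a short-lived pod that requests a GPU and prints nvidia-smi output. The pod name and CUDA image tag below are illustrative assumptions; adjust them as needed:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.2-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, check the output and clean up
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test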

Docker Image Preparation#

Build the Spark RAPIDS Docker Image#

  1. Create a working directory and download Apache Spark:

    mkdir spark-rapids-gke && cd spark-rapids-gke

    # Download Spark 3.5.0 (check https://nvidia.github.io/spark-rapids/docs/download.html for supported versions)
    wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
    tar xzf spark-3.5.0-bin-hadoop3.tgz
    mv spark-3.5.0-bin-hadoop3 spark
    
  2. Download the RAPIDS Accelerator jar and GPU discovery script:

    # Download RAPIDS Accelerator jar (check https://nvidia.github.io/spark-rapids/docs/download.html for latest version)
    wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/25.12.0/rapids-4-spark_2.12-25.12.0-cuda12.jar

    # Download GPU discovery script
    wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
    chmod +x getGpusResources.sh
    
  3. Download the Dockerfile:

    Download Dockerfile.cuda to your working directory. (If you prefer to create the file by hand, an illustrative sketch is shown after these steps.)

    The current directory structure should look like:

    $ ls
    Dockerfile.cuda  getGpusResources.sh  rapids-4-spark_2.12-25.12.0-cuda12.jar  spark
    
  4. Build and push the Docker image to Google Artifact Registry:

    export REGION=us-central1

    # Create an Artifact Registry repository (if not exists)
    gcloud artifacts repositories create spark-rapids \
        --repository-format=docker \
        --location=$REGION \
        --project=$PROJECT_ID

    # Configure Docker to use gcloud credentials
    gcloud auth configure-docker ${REGION}-docker.pkg.dev --quiet

    # Build the image
    export IMAGE_NAME=${REGION}-docker.pkg.dev/${PROJECT_ID}/spark-rapids/spark-rapids:25.12.0-cuda12
    docker build -t $IMAGE_NAME -f Dockerfile.cuda .

    # Push the image
    docker push $IMAGE_NAME
    

    Note

    Google Container Registry (GCR) is deprecated. Use Artifact Registry instead with the format: REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG
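
If you want to write Dockerfile.cuda by hand instead of downloading it, the sketch below shows the general shape: a CUDA runtime base image, a Java runtime, the Spark distribution, and the RAPIDS Accelerator jar plus GPU discovery script under /opt/sparkRapidsPlugin. This is an illustrative sketch, not the official Dockerfile; the base image tag and Java package are assumptions, so prefer the downloaded Dockerfile.cuda where the two differ.

cat > Dockerfile.cuda <<'EOF'
# Illustrative sketch -- prefer the official Dockerfile.cuda from the docs
FROM nvidia/cuda:12.2.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-17-jre-headless tini procps && \
    rm -rf /var/lib/apt/lists/*

ENV SPARK_HOME=/opt/spark
COPY spark ${SPARK_HOME}

# Spark's Kubernetes entrypoint script ships inside the Spark distribution
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/entrypoint.sh
RUN chmod +x /opt/entrypoint.sh

# RAPIDS Accelerator jar and GPU discovery script
COPY rapids-4-spark_2.12-25.12.0-cuda12.jar /opt/sparkRapidsPlugin/
COPY getGpusResources.sh /opt/sparkRapidsPlugin/

ENV PATH=${SPARK_HOME}/bin:${PATH}
WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
EOF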

Install Spark Operator#

The Spark Operator simplifies managing Spark applications on Kubernetes. Install it using Helm:

# Add the Spark Operator Helm repository (Kubeflow maintained)
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update

# Install the Spark Operator
helm install spark-operator spark-operator/spark-operator \
    --namespace spark-operator \
    --create-namespace \
    --set webhook.enable=true \
    --set sparkJobNamespace=default

Verify the Spark Operator is running:

kubectl get pods -n spark-operator

You should see pods like spark-operator-controller-xxx and spark-operator-webhook-xxx in Running state.

Create a service account for Spark:

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
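
Optionally, confirm that the service account can create pods in the default namespace:

kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default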

Running Spark Applications#

Submitting a Spark Application using Spark Operator#

Create and submit a SparkApplication to run the built-in SparkPi example with GPU acceleration:

cat <<EOF | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-rapids-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "${IMAGE_NAME}"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar"
  arguments:
    - "1000"
  sparkVersion: "3.5.0"
  sparkConf:
    "spark.plugins": "com.nvidia.spark.SQLPlugin"
    "spark.executor.resource.gpu.amount": "1"
    "spark.executor.resource.gpu.vendor": "nvidia.com"
    "spark.executor.resource.gpu.discoveryScript": "/opt/sparkRapidsPlugin/getGpusResources.sh"
    "spark.task.resource.gpu.amount": "1"
    "spark.rapids.sql.concurrentGpuTasks": "1"
    "spark.executor.memory": "4g"
    "spark.rapids.memory.pinnedPool.size": "2g"
    "spark.executor.memoryOverhead": "3g"
    "spark.executor.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-25.12.0-cuda12.jar"
    "spark.driver.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-25.12.0-cuda12.jar"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "4g"
    gpu:
      name: "nvidia.com/gpu"
      quantity: 1
EOF

Check the application status:

# Watch the application status
kubectl get sparkapplication spark-rapids-pi -w

# View detailed status
kubectl describe sparkapplication spark-rapids-pi

View the driver logs:

kubectl logs spark-rapids-pi-driver

You should see output similar to Pi is roughly 3.14... indicating the job ran successfully.

To verify GPU acceleration is enabled, check for RAPIDS-related messages in the logs:

kubectl logs spark-rapids-pi-driver | grep -i "rapids\|gpu"

You should see output similar to taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0]).

../_images/GKE-logs.png

Delete the application when done:

kubectl delete sparkapplication spark-rapids-pi

Access Spark UI#

To access the Spark driver UI while a job is running:

kubectl port-forward spark-rapids-pi-driver 4040:4040

Then open a web browser to http://localhost:4040.
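
The Spark Operator names the driver pod <application-name>-driver by default. If you are unsure of the name, you can list driver pods by the spark-role label that Spark applies:

kubectl get pods -l spark-role=driver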

Configuring Spark Event Logs#

To enable Spark event logging for post-job analysis, add the following to your SparkApplication:

sparkConf:
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "gs://YOUR_BUCKET/spark-events"

You can then use the Spark History Server to view completed jobs.
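
If the bucket does not exist yet, you can create one first (the bucket name below is a placeholder). Note that writing event logs to a gs:// path requires the GCS connector on Spark's classpath; the stock Apache Spark distribution does not bundle it, so you may need to add the connector jar to your Docker image.

gcloud storage buckets create gs://YOUR_BUCKET --location=$REGION --project=$PROJECT_ID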

Troubleshooting#

gke-gcloud-auth-plugin Not Found#

If you see the error gke-gcloud-auth-plugin was not found or is not executable:

gcloud components install gke-gcloud-auth-plugin

Pod Timeout Issue#

When running GPU Spark jobs on GKE, you may encounter the following error:

The executor with ID XXX was not found in the cluster at the polling time
which is after the accepted detect delta time (30000 ms) configured by
`spark.kubernetes.executor.missingPodDetectDelta`.
The executor may have been deleted but the driver missed the deletion event.

This issue occurs because GPU initialization and resource allocation can take longer than the default timeout. To resolve this, increase the spark.kubernetes.executor.missingPodDetectDelta configuration:

sparkConf:
  "spark.kubernetes.executor.missingPodDetectDelta": "120000"

Docker Push Authentication Error#

If you see Unauthenticated request when pushing to Artifact Registry:

# Re-authenticate Docker with Artifact Registry
gcloud auth configure-docker ${REGION}-docker.pkg.dev --quiet

# Ensure you're not using sudo (which runs as root with different credentials)
# If you need to use Docker without sudo, add your user to the docker group:
sudo usermod -aG docker $USER
# Then log out and log back in

GPU Not Detected#

If GPUs aren’t being detected:

  1. Verify that the NVIDIA GPU device plugin is running:

    kubectl get pods -n kube-system | grep nvidia
    
  2. Check if the GPU resource is available on the node:

    kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
    
  3. Ensure the GPU discovery script has execute permissions in your Docker image.

ClassNotFoundException for SQLPlugin#

If you see ClassNotFoundException: com.nvidia.spark.SQLPlugin:

  • Ensure the RAPIDS Accelerator jar is correctly placed in the Docker image at /opt/sparkRapidsPlugin/

  • Verify that spark.executor.extraClassPath and spark.driver.extraClassPath point to the correct jar location and filename
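
A quick way to check, assuming the image was built as described above, is to list the plugin directory inside the image:

docker run --rm --entrypoint ls $IMAGE_NAME /opt/sparkRapidsPlugin/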

Cleaning Up#

To delete the GKE cluster and associated resources:

# Delete the Spark Operator
helm uninstall spark-operator -n spark-operator
kubectl delete namespace spark-operator

# Delete the GKE cluster
gcloud container clusters delete $CLUSTER_NAME --zone=$ZONE --quiet

# Optionally, delete the Artifact Registry repository
gcloud artifacts repositories delete spark-rapids --location=$REGION --quiet
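
If you created a bucket for Spark event logs, you can delete it and its contents as well (the bucket name is a placeholder):

gcloud storage rm --recursive gs://YOUR_BUCKET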

Additional Resources#