GCP Google Kubernetes Engine (GKE)#

Google Kubernetes Engine (GKE) is Google Cloud’s managed Kubernetes service for deploying, managing, and scaling containerized applications. This guide will walk you through setting up the RAPIDS Accelerator for Apache Spark on a GKE cluster with GPU nodes.

By the end of this guide, you will be able to run a sample Apache Spark application on NVIDIA GPUs in a GKE cluster.

Quick Start Summary#

For experienced users, here are the key steps:

  1. Create a GKE cluster with GPU node pool

  2. Verify GPU drivers are installed (GKE auto-installs them)

  3. Build and push a Docker image with Spark + RAPIDS Accelerator to Artifact Registry

  4. Install Spark Operator using Helm

  5. Submit Spark applications using SparkApplication CRD

Estimated time: 30-45 minutes (including cluster creation)

Prerequisites#

  • A Google Cloud account with billing enabled

  • gcloud CLI installed and initialized (run gcloud init if you haven't already)

  • kubectl installed

  • gke-gcloud-auth-plugin installed (required for kubectl authentication)

  • Helm installed (required for Spark Operator)

  • Docker installed on your local machine

Install gke-gcloud-auth-plugin if you haven’t:

gcloud components install gke-gcloud-auth-plugin
../_images/GKE-auths.png

Enable Required APIs#

Before creating a GKE cluster, enable the required Google Cloud APIs:

gcloud services enable container.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable artifactregistry.googleapis.com
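
To confirm the APIs are active for your current project, you can list the enabled services:

gcloud services list --enabled | grep -E 'container|compute|artifactregistry'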

Create a GKE Cluster with GPU Nodes#

Set Up Environment Variables#

export PROJECT_ID=[Your GCP Project ID]
export ZONE=us-central1-a
export CLUSTER_NAME=spark-rapids-gke

# Default node pool settings
export MACHINE_TYPE=n1-standard-4
export NUM_NODES=1

# GPU node pool settings
export GPU_MACHINE_TYPE=g2-standard-16
export GPU_TYPE=nvidia-l4
export GPU_COUNT=1
export NUM_GPU_NODES=2
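
If you plan to use a different GPU type or zone, you can check which accelerator types are available in your zone before creating the node pool:

gcloud compute accelerator-types list --filter="zone:${ZONE}"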

Create the GKE Cluster#

Create a GKE cluster with a default node pool:

gcloud container clusters create $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --zone=$ZONE \
    --machine-type=$MACHINE_TYPE \
    --num-nodes=$NUM_NODES

Add a GPU Node Pool#

Add a GPU node pool with NVIDIA L4 GPUs:

gcloud container node-pools create gpu-pool \
    --cluster=$CLUSTER_NAME \
    --project=$PROJECT_ID \
    --zone=$ZONE \
    --machine-type=$GPU_MACHINE_TYPE \
    --accelerator=type=$GPU_TYPE,count=$GPU_COUNT \
    --num-nodes=$NUM_GPU_NODES

Note

To use different GPU types, update the environment variables accordingly:

  • NVIDIA L4: GPU_MACHINE_TYPE=g2-standard-16, GPU_TYPE=nvidia-l4

  • NVIDIA T4: GPU_MACHINE_TYPE=n1-standard-8, GPU_TYPE=nvidia-tesla-t4

Get Cluster Credentials#

After the cluster is created, get the credentials to access it with kubectl:

gcloud container clusters get-credentials $CLUSTER_NAME --zone=$ZONE --project=$PROJECT_ID
../_images/GKE-credentials.png
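
As a quick check that kubectl is now pointed at the new cluster, list its nodes:

kubectl get nodes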

Verify GPU Availability#

On current GKE versions, NVIDIA GPU drivers are installed automatically on GPU nodes, and GPUs are exposed to Kubernetes through the nvidia-gpu-device-plugin DaemonSet. Verify that GPUs are available:

# Check GPU device plugin pods
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources on nodes
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

You should see output like:

../_images/GKE-nodes.png
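
Optionally, you can run an end-to-end smoke test by launching a short-lived pod that requests a GPU and prints nvidia-smi output. The pod name and CUDA image tag below are illustrative assumptions; adjust them as needed:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.2-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, check the output and clean up
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test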

Docker Image Preparation#

Build the Spark RAPIDS Docker Image#

  1. Create a working directory and download Apache Spark:

    mkdir spark-rapids-gke && cd spark-rapids-gke

    # Download Spark 3.5.0 (check https://nvidia.github.io/spark-rapids/docs/download.html for supported versions)
    wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
    tar xzf spark-3.5.0-bin-hadoop3.tgz
    mv spark-3.5.0-bin-hadoop3 spark
    
  2. Download the RAPIDS Accelerator jar and GPU discovery script:

    # Download RAPIDS Accelerator jar (check https://nvidia.github.io/spark-rapids/docs/download.html for latest version)
    wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/25.12.0/rapids-4-spark_2.12-25.12.0-cuda12.jar

    # Download GPU discovery script
    wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
    chmod +x getGpusResources.sh
    
  3. Download the Dockerfile:

    Download Dockerfile.cuda to your working directory. (If you prefer to create the file by hand, an illustrative sketch is shown after these steps.)

    The current directory structure should look like:

    $ ls
    Dockerfile.cuda  getGpusResources.sh  rapids-4-spark_2.12-25.12.0-cuda12.jar  spark
    
  4. Build and push the Docker image to Google Artifact Registry:

    export REGION=us-central1

    # Create an Artifact Registry repository (if not exists)
    gcloud artifacts repositories create spark-rapids \
        --repository-format=docker \
        --location=$REGION \
        --project=$PROJECT_ID

    # Configure Docker to use gcloud credentials
    gcloud auth configure-docker ${REGION}-docker.pkg.dev --quiet

    # Build the image
    export IMAGE_NAME=${REGION}-docker.pkg.dev/${PROJECT_ID}/spark-rapids/spark-rapids:25.12.0-cuda12
    docker build -t $IMAGE_NAME -f Dockerfile.cuda .

    # Push the image
    docker push $IMAGE_NAME
    

    Note

    Google Container Registry (GCR) is deprecated. Use Artifact Registry instead with the format: REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG
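
If you want to write Dockerfile.cuda by hand instead of downloading it, the sketch below shows the general shape: a CUDA runtime base image, a Java runtime, the Spark distribution, and the RAPIDS Accelerator jar plus GPU discovery script under /opt/sparkRapidsPlugin. This is an illustrative sketch, not the official Dockerfile; the base image tag and Java package are assumptions, so prefer the downloaded Dockerfile.cuda where the two differ.

cat > Dockerfile.cuda <<'EOF'
# Illustrative sketch -- prefer the official Dockerfile.cuda from the docs
FROM nvidia/cuda:12.2.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-17-jre-headless tini procps && \
    rm -rf /var/lib/apt/lists/*

ENV SPARK_HOME=/opt/spark
COPY spark ${SPARK_HOME}

# Spark's Kubernetes entrypoint script ships inside the Spark distribution
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/entrypoint.sh
RUN chmod +x /opt/entrypoint.sh

# RAPIDS Accelerator jar and GPU discovery script
COPY rapids-4-spark_2.12-25.12.0-cuda12.jar /opt/sparkRapidsPlugin/
COPY getGpusResources.sh /opt/sparkRapidsPlugin/

ENV PATH=${SPARK_HOME}/bin:${PATH}
WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
EOF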

Install Spark Operator#

The Spark Operator simplifies managing Spark applications on Kubernetes. Install it using Helm:

# Add the Spark Operator Helm repository (Kubeflow maintained)
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update

# Install the Spark Operator
helm install spark-operator spark-operator/spark-operator \
    --namespace spark-operator \
    --create-namespace \
    --set webhook.enable=true \
    --set sparkJobNamespace=default

Verify the Spark Operator is running:

kubectl get pods -n spark-operator

You should see pods like spark-operator-controller-xxx and spark-operator-webhook-xxx in Running state.

Create a service account for Spark:

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
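
Optionally, confirm that the service account can create pods in the default namespace:

kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default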

Running Spark Applications#

Submitting a Spark Application using Spark Operator#

Create and submit a SparkApplication to run the built-in SparkPi example with GPU acceleration:

cat <<EOF | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-rapids-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "${IMAGE_NAME}"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar"
  arguments:
    - "1000"
  sparkVersion: "3.5.0"
  sparkConf:
    "spark.plugins": "com.nvidia.spark.SQLPlugin"
    "spark.executor.resource.gpu.amount": "1"
    "spark.executor.resource.gpu.vendor": "nvidia.com"
    "spark.executor.resource.gpu.discoveryScript": "/opt/sparkRapidsPlugin/getGpusResources.sh"
    "spark.task.resource.gpu.amount": "1"
    "spark.rapids.sql.concurrentGpuTasks": "1"
    "spark.executor.memory": "4g"
    "spark.rapids.memory.pinnedPool.size": "2g"
    "spark.executor.memoryOverhead": "3g"
    "spark.executor.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-25.12.0-cuda12.jar"
    "spark.driver.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-25.12.0-cuda12.jar"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "4g"
    gpu:
      name: "nvidia.com/gpu"
      quantity: 1
EOF

Check the application status:

# Watch the application status
kubectl get sparkapplication spark-rapids-pi -w

# View detailed status
kubectl describe sparkapplication spark-rapids-pi

View the driver logs:

kubectl logs spark-rapids-pi-driver

You should see output similar to Pi is roughly 3.14... indicating the job ran successfully.

To verify GPU acceleration is enabled, check for RAPIDS-related messages in the logs:

kubectl logs spark-rapids-pi-driver | grep -i "rapids\|gpu"

You should see output similar to taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0]).

../_images/GKE-logs.png

Delete the application when done:

kubectl delete sparkapplication spark-rapids-pi

Access Spark UI#

To access the Spark driver UI while a job is running:

kubectl port-forward spark-rapids-pi-driver 4040:4040

Then open a web browser to http://localhost:4040.
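
The Spark Operator names the driver pod <application-name>-driver by default. If you are unsure of the name, you can list driver pods by the spark-role label that Spark applies:

kubectl get pods -l spark-role=driver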

Configuring Spark Event Logs#

To enable Spark event logging for post-job analysis, add the following to your SparkApplication:

sparkConf:
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "gs://YOUR_BUCKET/spark-events"

You can then use the Spark History Server to view completed jobs.
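
If the bucket does not exist yet, you can create one first (the bucket name below is a placeholder). Note that writing event logs to a gs:// path requires the GCS connector on Spark's classpath; the stock Apache Spark distribution does not bundle it, so you may need to add the connector jar to your Docker image.

gcloud storage buckets create gs://YOUR_BUCKET --location=$REGION --project=$PROJECT_ID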

Troubleshooting#

gke-gcloud-auth-plugin Not Found#

If you see the error gke-gcloud-auth-plugin was not found or is not executable:

gcloud components install gke-gcloud-auth-plugin

Pod Timeout Issue#

When running GPU Spark jobs on GKE, you may encounter the following error:

The executor with ID XXX was not found in the cluster at the polling time
which is after the accepted detect delta time (30000 ms) configured by
`spark.kubernetes.executor.missingPodDetectDelta`.
The executor may have been deleted but the driver missed the deletion event.

This issue occurs because GPU initialization and resource allocation can take longer than the default timeout. To resolve this, increase the spark.kubernetes.executor.missingPodDetectDelta configuration:

sparkConf:
  "spark.kubernetes.executor.missingPodDetectDelta": "120000"

Docker Push Authentication Error#

If you see Unauthenticated request when pushing to Artifact Registry:

# Re-authenticate Docker with Artifact Registry
gcloud auth configure-docker ${REGION}-docker.pkg.dev --quiet

# Ensure you're not using sudo (which runs as root with different credentials)
# If you need to use Docker without sudo, add your user to the docker group:
sudo usermod -aG docker $USER
# Then log out and log back in

GPU Not Detected#

If GPUs aren’t being detected:

  1. Verify that the NVIDIA GPU device plugin is running:

    kubectl get pods -n kube-system | grep nvidia
    
  2. Check if the GPU resource is available on the node:

    kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
    
  3. Ensure the GPU discovery script has execute permissions in your Docker image.

ClassNotFoundException for SQLPlugin#

If you see ClassNotFoundException: com.nvidia.spark.SQLPlugin:

  • Ensure the RAPIDS Accelerator jar is correctly placed in the Docker image at /opt/sparkRapidsPlugin/

  • Verify that spark.executor.extraClassPath and spark.driver.extraClassPath point to the correct jar location and filename
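
A quick way to check, assuming the image was built as described above, is to list the plugin directory inside the image:

docker run --rm --entrypoint ls $IMAGE_NAME /opt/sparkRapidsPlugin/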

Cleaning Up#

To delete the GKE cluster and associated resources:

# Delete the Spark Operator
helm uninstall spark-operator -n spark-operator
kubectl delete namespace spark-operator

# Delete the GKE cluster
gcloud container clusters delete $CLUSTER_NAME --zone=$ZONE --quiet

# Optionally, delete the Artifact Registry repository
gcloud artifacts repositories delete spark-rapids --location=$REGION --quiet
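
If you created a bucket for Spark event logs, you can delete it and its contents as well (the bucket name is a placeholder):

gcloud storage rm --recursive gs://YOUR_BUCKET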

Additional Resources#