Getting Started with RAPIDS and Kubernetes
NVIDIA AI Enterprise 4.0 or later
This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs in a Kubernetes cluster.
This is a quick start guide that uses default settings, which may differ from those in your cluster.
Kubernetes requires a Docker image to run Spark. Generally, everything needed is in the Docker image: Spark, the RAPIDS Accelerator for Spark jar, and the discovery script.
You can find other supported base CUDA images on the CUDA Docker Hub. The corresponding source Dockerfile is in the CUDA GitLab repository and can be used to build the Docker images from an OS base image from scratch.
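For example, the base CUDA image referenced by the sample Dockerfile later in this guide can be pulled directly from Docker Hub:
docker pull nvidia/cuda:12.2.2-devel-ubuntu22.04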
Ubuntu 22.04
Spark 3.4.0
Upstream Kubernetes Version 1.25
Docker is installed on a client machine
A Docker repository which is accessible by the Kubernetes cluster
RAPIDS Accelerator
Refer to the Appendix for access
NGC API Key
Refer to Generating Your NGC API Key
This guide leverages the Cloud Native Stack (CNS) GitHub install-guides to build the Kubernetes cluster. To install without CNS, follow the Install Kubernetes instructions to create a Kubernetes cluster with NVIDIA GPU support.
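Once the cluster is up, you can verify that GPU resources are advertised to Kubernetes (this assumes the NVIDIA device plugin is installed, which CNS handles by default):
kubectl describe nodes | grep -i nvidia.com/gpu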
From your client machine with Docker installed, download the following packages and scripts as shown below. Apache Spark version 3.4.0 is used in this guide. Please note that only Scala version 2.12 is currently supported by the accelerator.
Below are bash commands to install a local copy of Apache Spark and configure the docker image:
mkdir -p ~/spark-rapids/spark
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -zxvf spark-3.4.0-bin-hadoop3.tgz -C ~/spark-rapids/spark --strip-components 1
cd ~/spark-rapids
Copy the .jar into the current working directory. Refer to the Access the NVIDIA AI Enterprise RAPIDS Accelerator section to pull the .jar file.
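For example, assuming the jar was downloaded to ~/Downloads (a hypothetical location; adjust the path and version to match your download):
cp ~/Downloads/rapids-4-spark_2.12-23.08.1.jar ~/spark-rapids/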
A sample Dockerfile is provided below:
# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
FROM nvidia/cuda:12.2.2-devel-ubuntu22.04
ARG spark_uid=185
# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
# Install tini and java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends tini openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin
# Before building the docker image, first either download Apache Spark 3.1+ from
# http://spark.apache.org/downloads.html or build and make a Spark distribution following the
# instructions in http://spark.apache.org/docs/3.1.2/building-spark.html (see
# https://nvidia.github.io/spark-rapids/docs/download.html for other supported versions). If this
# docker file is being used in the context of building your images from a Spark distribution, the
# docker build command should be invoked from the top level directory of the Spark
# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile .
RUN set -ex && \
ln -s /lib /lib64 && \
mkdir -p /opt/spark/work-dir && \
touch /opt/spark/RELEASE && \
rm /bin/sh && \
ln -sv /bin/bash /bin/sh && \
echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
chgrp root /etc/passwd && chmod ug+rw /etc/passwd
COPY spark /opt/spark
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/kubernetes/tests /opt/spark/tests
COPY rapids-4-spark_2.12-*.jar /opt/spark/jars
RUN apt-get update && \
apt-get install -y python-is-python3 python3-pip && \
pip install --upgrade pip setuptools && \
# Additional Python packages can be installed with pip3 if needed
# Remove the pip and apt caches to save space
rm -r /root/.cache && rm -rf /var/cache/apt/*
ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
# Specify the User that the actual main process will run as
USER ${spark_uid}
It is assumed that the following directory structure exists:
$ ls ~/spark-rapids
Dockerfile rapids-4-spark_2.12-23.08.1.jar spark
Build the Docker image and push it to NGC (optional):
export IMAGE_NAME=nvcr.io/<your-registry-name>/<container-name>:<tag>
docker build . -f Dockerfile -t $IMAGE_NAME
docker push $IMAGE_NAME
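As an optional sanity check, you can confirm that the RAPIDS Accelerator jar was copied into the image (assuming the image was built from the sample Dockerfile above):
docker run --rm --entrypoint ls $IMAGE_NAME /opt/spark/jars | grep rapids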
Pushing to NGC is optional. Refer to the NGC Private Registry User Guide for instructions on setting up a private registry.
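If you do push the image to NGC, the Docker client must first be logged in to the nvcr.io registry, using $oauthtoken as the username and your NGC API key as the password:
docker login nvcr.io
Username: $oauthtoken
Password: <your-ngc-api-key>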
Update Kubernetes credentials for NGC by filling in <ngc-secret-token> and <email> in the following command:
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nv-nvaie-tme" --docker-username='$oauthtoken' --docker-password='<ngc-secret-token>' --docker-email='<email>'
Creating a secret will fail if a secret with that name already exists in the Kubernetes configuration. Refer to Generating Your NGC API Key.
This is required for authentication. Note that the name of the secret is ngc-secret, which will be used in spark.kubernetes.container.image.pullSecrets.
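To check whether a secret with that name already exists, or to remove it before recreating it:
kubectl get secret ngc-secret
kubectl delete secret ngc-secret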
Role-Based Access Control (RBAC) is enabled by default when CNS creates the cluster. Follow the steps in https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac to create a service account for the Spark driver pod. Below are the relevant commands:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Correspondingly, an additional argument is added to spark-submit:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
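To verify that the service account was created and can manage pods in the default namespace, you can run:
kubectl get serviceaccount spark
kubectl auth can-i create pods --as=system:serviceaccount:default:spark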
Submitting a Simple Test Job
This simple job will test if the RAPIDS plugin can be found.
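The Kubernetes API server address and port used for K8SMASTER below can be found with kubectl cluster-info (prefix the reported address with k8s:// when exporting the variable):
kubectl cluster-info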
Here is an example spark-submit job:
export SPARK_HOME=~/spark-rapids/spark
export K8SMASTER=k8s://<ip>:<port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver
export NGC_SECRET_NAME=ngc-secret
$SPARK_HOME/bin/spark-submit \
--master $K8SMASTER \
--deploy-mode cluster \
--name examplejob \
--class org.apache.spark.examples.SparkPi \
--driver-memory 2G \
--conf spark.executor.instances=1 \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
--conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=$IMAGE_NAME \
--conf spark.kubernetes.container.image.pullSecrets=$NGC_SECRET_NAME \
--conf spark.kubernetes.container.image.pullPolicy=Always \
local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 10000
The 10000 at the end of the spark-submit command is an input that specifies the number of iterations for the SparkPi example. Note that local:// means the jar file is located inside the Docker image. Since this is cluster mode, the Spark driver runs inside a pod in Kubernetes. The driver and executor pods can be seen while the job is running:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-pi-d11075782f399fd7-exec-1 1/1 Running 0 9s
exampledriver 1/1 Running 0 15s
To view the Spark driver log, use the command below:
kubectl logs $SPARK_DRIVER_NAME
If the job ran successfully, the log output will contain the computed value for pi:
Pi is roughly 3.1406957034785172
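To confirm that the RAPIDS Accelerator plugin was loaded, you can also search the driver log for plugin messages (the exact wording of the log lines varies by version):
kubectl logs $SPARK_DRIVER_NAME | grep -i rapids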
ClassNotFoundException is a common error when the Spark driver cannot find the RAPIDS Accelerator jar, resulting in an exception like this:
Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin
To view the Spark driver UI when the job is running, first expose the driver UI port:
kubectl port-forward $SPARK_DRIVER_NAME 4040:4040
You may need to use SSH port forwarding to access the UI from a remote machine, e.g. run ssh -L 4040:localhost:4040 nvidia@<cluster-ip>
Then open a web browser to the Spark driver UI page on the exposed port:
http://localhost:4040
To kill the Spark job:
$SPARK_HOME/bin/spark-submit --kill spark:$SPARK_DRIVER_NAME
To delete the driver pod:
kubectl delete pod $SPARK_DRIVER_NAME
Deleting the driver pod is required to reuse the same driver pod name.