Getting Started with RAPIDS and Kubernetes#
Added in version 4.0.
This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs in a Kubernetes cluster.
This is a quick start guide that uses default settings, which may differ from those of your cluster.
Kubernetes requires a Docker image to run Spark. Generally everything needed is in the Docker image - Spark, the RAPIDS Accelerator for Spark jar, and the discovery script.
You can find other supported base CUDA images on CUDA Docker Hub. Its source Dockerfile is in the GitLab repository and can be used to build the Docker images from an OS base image from scratch.
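For example, you can pull the CUDA base image used by the sample Dockerfile later in this guide ahead of time to confirm it is available (adjust the tag to match your environment if needed):
docker pull nvidia/cuda:12.2.2-devel-ubuntu22.04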
Prerequisites#
Ubuntu 22.04
Spark 3.4.0
Upstream Kubernetes Version 1.25
Docker is installed on a client machine
A Docker repository which is accessible by the Kubernetes cluster
RAPIDS Accelerator
Refer to the Appendix for access
NGC API Key
Refer to Generating Your NGC API Key
This guide leverages the Cloud Native Stack (CNS) GitHub install guides to build the Kubernetes cluster. To install without CNS, follow the Install Kubernetes instructions to create a Kubernetes cluster with NVIDIA GPU support.
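Once the cluster is up, you can optionally confirm that it advertises GPU resources before building any images. A quick check, assuming the NVIDIA GPU Operator or device plugin installed by CNS is running:
# Each GPU node should report an allocatable nvidia.com/gpu resource
kubectl describe nodes | grep -i "nvidia.com/gpu"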
Docker Image Preparation#
From your client machine with Docker installed, download the following packages and scripts as shown below. Apache Spark version 3.4.0 will be used. Please note that only Scala version 2.12 is currently supported by the accelerator.
Below are bash commands to install a local copy of Apache Spark and prepare the Docker image build context:
mkdir -p ~/spark-rapids/spark
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -zxvf spark-3.4.0-bin-hadoop3.tgz -C ~/spark-rapids/spark --strip-components 1
cd ~/spark-rapids
Copy the .jar into the current working directory. Refer to the Access the NVIDIA AI Enterprise RAPIDS Accelerator section to pull the .jar file.
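For example, assuming the jar was downloaded to ~/Downloads (a hypothetical location; adjust the path and version to match your download):
cp ~/Downloads/rapids-4-spark_2.12-23.08.1.jar ~/spark-rapids/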
A sample Dockerfile is provided below:
# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM nvidia/cuda:12.2.2-devel-ubuntu22.04
ARG spark_uid=185

# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub

# Install tini and java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends tini openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first either download Apache Spark 3.1+ from
# http://spark.apache.org/downloads.html or build and make a Spark distribution following the
# instructions in http://spark.apache.org/docs/3.1.2/building-spark.html (see
# https://nvidia.github.io/spark-rapids/docs/download.html for other supported versions). If this
# docker file is being used in the context of building your images from a Spark distribution, the
# docker build command should be invoked from the top level directory of the Spark
# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd

COPY spark /opt/spark
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/kubernetes/tests /opt/spark/tests

COPY rapids-4-spark_2.12-*.jar /opt/spark/jars

RUN apt-get update && \
    apt-get install -y python-is-python3 python3-pip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}
It is assumed that the following directory structure exists:
$ ls ~/spark-rapids
Dockerfile rapids-4-spark_2.12-23.08.1.jar spark
Build the Docker image and push it to NGC (optional):
export IMAGE_NAME=nvcr.io/<your-registry-name>/<container-name>:<tag>
docker build . -f Dockerfile -t $IMAGE_NAME
docker push $IMAGE_NAME
Note
Pushing to NGC is optional. Refer to the NGC Private Registry User Guide for setting up a private registry.
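If you do push to NGC, Docker must first be logged in to nvcr.io with your NGC API key. A minimal example, assuming the key is exported as NGC_API_KEY (an assumed placeholder):
# '$oauthtoken' is the literal username NGC expects; the password is your API key
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin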
Pulling Docker Image from NGC into Kubernetes#
Update the Kubernetes credentials for NGC by filling in <ngc-secret-token> and <email> in the following command:
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nv-nvaie-tme" --docker-username='$oauthtoken' --docker-password='<ngc-secret-token>' --docker-email='<email>'
Note
Creating the secret will fail if a secret with that name already exists in the Kubernetes configuration. Refer to Generating Your NGC API Key.
Important
This is required for authentication. Note that the name of the secret is ngc-secret, which will be used in spark.kubernetes.container.image.pullSecrets.
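To confirm the secret exists in the target namespace, you can run, for example:
kubectl get secret ngc-secret --namespace default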
Running Spark Applications in the Kubernetes Cluster#
Role Based Access Control (RBAC) is enabled by default when CNS creates the cluster. Follow the steps in https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac to create a service account that can pull the protected image. Below are the relevant commands:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Correspondingly, an additional argument is added to the spark-submit command:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
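You can verify that the service account and role binding exist, for example:
kubectl get serviceaccount spark --namespace default
kubectl describe clusterrolebinding spark-role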
Submitting a Simple Test Job#
This simple job will test if the RAPIDS plugin can be found.
Here is an example spark-submit job:
export SPARK_HOME=~/spark-rapids/spark
export K8SMASTER=k8s://<ip>:<port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver
export NGC_SECRET_NAME=ngc-secret

$SPARK_HOME/bin/spark-submit \
    --master $K8SMASTER \
    --deploy-mode cluster \
    --name examplejob \
    --class org.apache.spark.examples.SparkPi \
    --driver-memory 2G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.memory=4G \
    --conf spark.executor.cores=1 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
    --conf spark.executor.resource.gpu.vendor=nvidia.com \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
    --conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=$IMAGE_NAME \
    --conf spark.kubernetes.container.image.pullSecrets=$NGC_SECRET_NAME \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 10000
The 10000 at the end of the spark-submit command is an input that specifies the number of iterations. Note that local:// means the jar file location is inside the Docker image. Since this is cluster mode, the Spark driver runs inside a pod in Kubernetes. The driver and executor pods can be seen while the job is running:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-pi-d11075782f399fd7-exec-1 1/1 Running 0 9s
exampledriver 1/1 Running 0 15s
To view the Spark driver log, use the command below:
kubectl logs $SPARK_DRIVER_NAME
If the job ran successfully, the log output will contain the computed value for pi:
Pi is roughly 3.1406957034785172
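If the log is long, you can filter for the result directly, for example:
kubectl logs $SPARK_DRIVER_NAME | grep "Pi is roughly"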
Note
ClassNotFoundException is a common error if the Spark driver cannot find the RAPIDS Accelerator jar, resulting in an exception like this:
Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin
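If you hit this error, one quick check (assuming Docker can run the image locally) is to confirm that the RAPIDS Accelerator jar was actually copied into the image:
docker run --rm --entrypoint ls $IMAGE_NAME /opt/spark/jars | grep rapids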
To view the Spark driver UI when the job is running, first expose the driver UI port:
kubectl port-forward $SPARK_DRIVER_NAME 4040:4040
Note
You may need to use SSH port forwarding to access the UI from a remote machine, e.g., run ssh -L 4040:localhost:4040 nvidia@<cluster-ip>
Then open a web browser to the Spark driver UI page on the exposed port:
http://localhost:4040
To kill the Spark job:
$SPARK_HOME/bin/spark-submit --kill spark:$SPARK_DRIVER_NAME
To delete the driver pod:
kubectl delete pod $SPARK_DRIVER_NAME
Deleting the driver pod is required to reuse the same driver pod name.
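For example, to confirm the pod is gone before resubmitting with the same driver name:
kubectl get pod $SPARK_DRIVER_NAME
A "NotFound" error from this command means the name is free to reuse.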