Kubernetes#
This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs in a Kubernetes cluster.
This is a quick start guide that uses default settings, which may differ from your cluster's configuration.
Kubernetes requires a Docker image to run Spark. Generally, everything needed is in the Docker image - Spark, the RAPIDS Accelerator for Spark jars, and the discovery script. Refer to this Dockerfile.cuda example.
You can find other supported base CUDA images on the CUDA Docker Hub page. Their source Dockerfiles are in a GitLab repository and can be used to build the Docker images from an OS base image from scratch.
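As a quick, optional check before building on top of a base CUDA image, you can pull a candidate image first. The tag below is only an example; pick one listed on the CUDA Docker Hub page that matches your cluster's NVIDIA driver version.
# Example only: the exact tag should be a supported CUDA image on Docker Hub
# and should be compatible with your cluster's NVIDIA driver version.
docker pull nvidia/cuda:11.8.0-runtime-ubuntu22.04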
Prerequisites#
Kubernetes cluster is up and running with NVIDIA GPU support
Docker is installed on a client machine
A Docker repository that’s accessible by the Kubernetes cluster
These instructions don't cover how to set up a Kubernetes cluster.
Please refer to Install Kubernetes on how to install a Kubernetes cluster with NVIDIA GPU support.
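Before continuing, you can quickly verify the first two prerequisites from the client machine. This is only a sanity check and assumes the NVIDIA device plugin is already installed in the cluster:
# Confirm kubectl can reach the cluster and that nodes advertise NVIDIA GPUs
# (requires the NVIDIA device plugin to be installed in the cluster).
kubectl get nodes
kubectl describe nodes | grep -i "nvidia.com/gpu"
# Confirm Docker works on the client machine.
docker version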
Docker Image Preparation#
On a client machine that has access to the Kubernetes cluster:
Download Apache Spark. Supported versions of Spark are listed on the RAPIDS Accelerator download page. Only Scala version 2.12 is currently supported by the accelerator. You can download it into a local directory and untar the Spark .tar.gz as a directory named spark.
Download the RAPIDS Accelerator for Spark jars and the GPU discovery script. Put rapids-4-spark_<version>.jar and getGpusResources.sh in the same directory as spark.
Note
If you decide to put the above jar in the spark/jars directory, which will be copied into the /opt/spark/jars directory in the Docker image, then you do not need to specify spark.driver.extraClassPath or spark.executor.extraClassPath when using cluster mode. This example just shows one way to include customized or third-party jars.
Download the sample Dockerfile.cuda into the same directory as spark.
The sample Dockerfile.cuda will copy several sub-directories of the spark directory into /opt/spark/, and the RAPIDS Accelerator jars and getGpusResources.sh into /opt/sparkRapidsPlugin, inside the Docker image. You can modify the Dockerfile to copy your application (for example, test.py) into the Docker image. Examine the Dockerfile.cuda file to ensure the file names are correct and modify it if needed.
At this point the directory on the local machine should look like this:
$ ls
Dockerfile.cuda  getGpusResources.sh  rapids-4-spark_<version>.jar  spark
Build the Docker image with a proper repository name and tag, and push it to the repository:
export IMAGE_NAME=xxx/yyy:tag
docker build . -f Dockerfile.cuda -t $IMAGE_NAME
docker push $IMAGE_NAME
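Before or after pushing, you can optionally verify that the image layout matches what the spark-submit examples below expect. This is a minimal sketch that assumes the layout produced by the sample Dockerfile.cuda:
# Optional sanity check: the RAPIDS jar and discovery script should be in
# /opt/sparkRapidsPlugin, and the Spark jars in /opt/spark/jars.
docker run --rm --entrypoint ls $IMAGE_NAME /opt/sparkRapidsPlugin
docker run --rm --entrypoint ls $IMAGE_NAME /opt/spark/jars | head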
Running Spark Applications in the Kubernetes Cluster#
Submitting a Simple Test Job#
This simple job will test whether the RAPIDS Accelerator can be found. ClassNotFoundException is a common error if the Spark driver cannot find the RAPIDS Accelerator jar, resulting in an exception like this:
Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin
Here is an example job:
export SPARK_HOME=~/spark
export IMAGE_NAME=xxx/yyy:tag
export K8SMASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver

$SPARK_HOME/bin/spark-submit \
  --master $K8SMASTER \
  --deploy-mode cluster \
  --name examplejob \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.memory=4G \
  --conf spark.executor.cores=1 \
  --conf spark.task.cpus=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.memory.pinnedPool.size=2G \
  --conf spark.executor.memoryOverhead=3G \
  --conf spark.sql.files.maxPartitionBytes=512m \
  --conf spark.sql.shuffle.partitions=10 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
  --conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.kubernetes.container.image=$IMAGE_NAME \
  --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar \
  --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar \
  --driver-memory 2G \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.2.jar
Note
local:// means the jar file location is inside the Docker image. Since this is cluster mode, the Spark driver is running inside a pod in Kubernetes. The driver and executor pods can be seen when the job is running:
$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-d11075782f399fd7-exec-1   1/1     Running   0          9s
exampledriver                      1/1     Running   0          15s
To view the Spark driver log, use the command below:
kubectl logs $SPARK_DRIVER_NAME
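To quickly confirm from the driver log that the RAPIDS Accelerator was loaded, grepping for the plugin name is usually enough; the exact messages vary by Spark and plugin version:
# Look for plugin-related lines in the driver log; exact wording varies by version.
kubectl logs $SPARK_DRIVER_NAME | grep -i -E "rapids|sqlplugin"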
To view the Spark driver UI while the job is running, first expose the driver UI port:
kubectl port-forward $SPARK_DRIVER_NAME 4040:4040
Then open a web browser to the Spark driver UI page on the exposed port:
http://localhost:4040
To kill the Spark job:
$SPARK_HOME/bin/spark-submit --master $K8SMASTER --kill $SPARK_NAMESPACE:$SPARK_DRIVER_NAME
To delete the driver pod:
kubectl delete pod $SPARK_DRIVER_NAME
Running an Interactive Spark Shell#
If you need an interactive Spark shell with executor pods running inside the Kubernetes cluster:
$SPARK_HOME/bin/spark-shell \
  --master $K8SMASTER \
  --name mysparkshell \
  --deploy-mode client \
  --conf spark.executor.instances=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.memory=4G \
  --conf spark.executor.cores=1 \
  --conf spark.task.cpus=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.memory.pinnedPool.size=2G \
  --conf spark.executor.memoryOverhead=3G \
  --conf spark.sql.files.maxPartitionBytes=512m \
  --conf spark.sql.shuffle.partitions=10 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.kubernetes.container.image=$IMAGE_NAME \
  --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar \
  --driver-class-path=./rapids-4-spark_<version>.jar \
  --driver-memory 2G
Only the client deploy mode should be used. If you specify the cluster deploy mode, you will see the following error:
Cluster deploy mode is not applicable to Spark shells.
Also notice that --conf spark.driver.extraClassPath was removed but --driver-class-path was added. This is because the driver is now running on the client machine, so the jar paths should be local filesystem paths.
When running the shell, you can see that only the executor pods are running inside Kubernetes:
$ kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
mysparkshell-bfe52e782f44841c-exec-1   1/1     Running   0          11s
The following Scala code can be run in the Spark shell to test if the RAPIDS Accelerator is enabled.
val df = spark.sparkContext.parallelize(Seq(1)).toDF()
df.createOrReplaceTempView("df")
spark.sql("SELECT value FROM df WHERE value <>1").show
spark.sql("SELECT value FROM df WHERE value <>1").explain
:quit
The expected explain plan should contain the GPU-related operators:
scala> spark.sql("SELECT value FROM df WHERE value <>1").explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuFilter NOT (value#2 = 1)
   +- GpuRowToColumnar TargetSize(2147483647)
      +- *(1) SerializeFromObject [input[0, int, false] AS value#2]
         +- Scan[obj#1]
Running PySpark in Client Mode#
Of course, you can COPY the Python code into the Docker image when building it and submit it using the cluster deploy mode, as shown in the previous example pi job.
However, if you don't want to re-build the Docker image each time and just want to submit the Python code from the client machine, you can use the client deploy mode.
$SPARK_HOME/bin/spark-submit \
  --master $K8SMASTER \
  --deploy-mode client \
  --name mypythonjob \
  --conf spark.executor.instances=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.memory=4G \
  --conf spark.executor.cores=1 \
  --conf spark.task.cpus=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.memory.pinnedPool.size=2G \
  --conf spark.executor.memoryOverhead=3G \
  --conf spark.sql.files.maxPartitionBytes=512m \
  --conf spark.sql.shuffle.partitions=10 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.kubernetes.container.image=$IMAGE_NAME \
  --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar \
  --driver-memory 2G \
  --driver-class-path=./rapids-4-spark_<version>.jar \
  test.py
A sample test.py is shown below:
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([1, 2, 3], "int").toDF("value")
df.createOrReplaceTempView("df")
sqlContext.sql("SELECT * FROM df WHERE value<>1").explain()
sqlContext.sql("SELECT * FROM df WHERE value<>1").show()
sc.stop()
Running Spark Applications using Spark Operator#
Using Spark Operator is another way to submit Spark Applications into a Kubernetes Cluster.
Locate the Spark Application jars/files in the Docker image when preparing the Docker image. For example, assume /opt/sparkRapidsPlugin/test.py is inside the Docker image. This is because currently only the cluster deployment mode is supported by Spark Operator.
Create the Spark Operator using helm; a minimal installation sketch follows.
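The Spark Operator project publishes a Helm chart. The repository URL, chart name, and namespace below are assumptions based on the Spark Operator documentation, so check the project's README for the current values:
# Install the Spark Operator via its Helm chart. The repo URL and chart name
# are assumptions; verify them against the Spark Operator project's README.
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace
Also note that the example YAML below references a driver serviceAccount named spark; make sure a service account with permission to manage pods exists in the target namespace before submitting.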
Create a Spark Application YAML file. For example, create a file named testpython-rapids.yaml with the following contents:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: testpython-rapids
  namespace: default
spec:
  sparkConf:
    "spark.ui.port": "4045"
    "spark.rapids.sql.concurrentGpuTasks": "1"
    "spark.executor.resource.gpu.amount": "1"
    "spark.task.resource.gpu.amount": "1"
    "spark.executor.memory": "1g"
    "spark.rapids.memory.pinnedPool.size": "2g"
    "spark.executor.memoryOverhead": "3g"
    "spark.sql.files.maxPartitionBytes": "512m"
    "spark.sql.shuffle.partitions": "10"
    "spark.plugins": "com.nvidia.spark.SQLPlugin"
    "spark.executor.resource.gpu.discoveryScript": "/opt/sparkRapidsPlugin/getGpusResources.sh"
    "spark.executor.resource.gpu.vendor": "nvidia.com"
    "spark.executor.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar"
    "spark.driver.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar"
  type: Python
  pythonVersion: 3
  mode: cluster
  image: "<IMAGE_NAME>"
  imagePullPolicy: Always
  mainApplicationFile: "local:///opt/sparkRapidsPlugin/test.py"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "1024m"
    labels:
      version: 3.1.1
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "5000m"
    gpu:
      name: "nvidia.com/gpu"
      quantity: 1
    labels:
      version: 3.1.1
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
Submit the Spark Application
sparkctl create testpython-rapids.yaml
Note
sparkctl can be built from the Spark Operator repo after installing golang:
cd sparkctl
go build -o sparkctl
Check the driver log
sparkctl log testpython-rapids
Check the status of this Spark Application
sparkctl status testpython-rapids
Port forwarding while the Spark driver is running
sparkctl forward testpython-rapids --local-port 1234 --remote-port 4045
Then open a browser at http://localhost:1234/ to check the Spark UI.
Delete the Spark Application
sparkctl delete testpython-rapids
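If you prefer not to build sparkctl, the same lifecycle can be managed with plain kubectl once the operator and its CRDs are installed. A rough equivalent of the steps above, assuming the operator's default driver pod naming of <app-name>-driver:
# kubectl equivalents of the sparkctl commands above. The driver pod name
# assumes the Spark Operator's default <app-name>-driver naming.
kubectl apply -f testpython-rapids.yaml
kubectl get sparkapplication testpython-rapids -o yaml
kubectl logs testpython-rapids-driver
kubectl port-forward testpython-rapids-driver 1234:4045
kubectl delete -f testpython-rapids.yaml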
Please refer to Running Spark on Kubernetes for more information.