Getting Started with RAPIDS and Kubernetes
NVIDIA AI Enterprise 4.0 or later
This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs in a Kubernetes cluster.
This is a quick start guide that uses default settings, which may differ from those in your cluster.
Kubernetes requires a Docker image to run Spark. Generally, everything needed is in the Docker image: Spark, the RAPIDS Accelerator for Spark jar, and the discovery script.
You can find other supported base CUDA images on the CUDA Docker Hub. The corresponding source Dockerfile is in the CUDA GitLab repository and can be used to build the Docker images from an OS base image from scratch.
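For example, the base CUDA image referenced by the sample Dockerfile later in this guide can be pulled directly from Docker Hub:
docker pull nvidia/cuda:12.2.2-devel-ubuntu22.04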
Ubuntu 22.04
Spark 3.4.0
Upstream Kubernetes Version 1.25
Docker is installed on a client machine
A Docker repository which is accessible by the Kubernetes cluster
RAPIDS Accelerator
Refer to the Appendix for access
NGC API Key
Refer to Generating Your NGC API Key
This guide leverages the Cloud Native Stack (CNS) GitHub install-guides to build the Kubernetes cluster. To install without CNS, follow the Install Kubernetes instructions to create a Kubernetes cluster with NVIDIA GPU support.
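Once the cluster is up, you can verify that GPU resources are advertised to Kubernetes (this assumes the NVIDIA device plugin is installed, which CNS handles by default):
kubectl describe nodes | grep -i nvidia.com/gpu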
From your client machine with Docker installed, download the following packages and scripts as shown below. Apache Spark version 3.4.0 is used in this guide. Please note that only Scala version 2.12 is currently supported by the accelerator.
Below are bash commands to install a local copy of Apache Spark and configure the docker image:
mkdir -p ~/spark-rapids/spark
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -zxvf spark-3.4.0-bin-hadoop3.tgz -C ~/spark-rapids/spark --strip-components 1
cd ~/spark-rapids
Copy the .jar into the current working directory. Refer to the Access the NVIDIA AI Enterprise RAPIDS Accelerator section to pull the .jar file.
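For example, assuming the jar was downloaded to ~/Downloads (a hypothetical location; adjust the path and version to match your download):
cp ~/Downloads/rapids-4-spark_2.12-23.08.1.jar ~/spark-rapids/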
A sample Dockerfile is provided below:
# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
FROM nvidia/cuda:12.2.2-devel-ubuntu22.04
ARG spark_uid=185
# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
# Install tini and java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends tini openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin
# Before building the docker image, first either download Apache Spark 3.1+ from
# http://spark.apache.org/downloads.html or build and make a Spark distribution following the
# instructions in http://spark.apache.org/docs/3.1.2/building-spark.html (see
# https://nvidia.github.io/spark-rapids/docs/download.html for other supported versions). If this
# docker file is being used in the context of building your images from a Spark distribution, the
# docker build command should be invoked from the top level directory of the Spark
# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile .
RUN set -ex && \
ln -s /lib /lib64 && \
mkdir -p /opt/spark/work-dir && \
touch /opt/spark/RELEASE && \
rm /bin/sh && \
ln -sv /bin/bash /bin/sh && \
echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
chgrp root /etc/passwd && chmod ug+rw /etc/passwd
COPY spark /opt/spark
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/kubernetes/tests /opt/spark/tests
COPY rapids-4-spark_2.12-*.jar /opt/spark/jars
RUN apt-get update && \
apt-get install -y python-is-python3 python3-pip && \
pip install --upgrade pip setuptools && \
# Additional Python packages can be installed with pip3 if needed
# Remove the pip and apt caches to save space
rm -r /root/.cache && rm -rf /var/cache/apt/*
ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
# Specify the User that the actual main process will run as
USER ${spark_uid}
It is assumed that the following directory structure exists:
$ ls ~/spark-rapids
Dockerfile rapids-4-spark_2.12-23.08.1.jar spark
Build the Docker image and push it to NGC (optional):
export IMAGE_NAME=nvcr.io/<your-registry-name>/<container-name>:<tag>
docker build . -f Dockerfile -t $IMAGE_NAME
docker push $IMAGE_NAME
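As an optional sanity check, you can confirm that the RAPIDS Accelerator jar was copied into the image (assuming the image was built from the sample Dockerfile above):
docker run --rm --entrypoint ls $IMAGE_NAME /opt/spark/jars | grep rapids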
Pushing to NGC is optional. Refer to the NGC Private Registry User Guide for instructions on setting up a private registry.
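If you do push the image to NGC, the Docker client must first be logged in to the nvcr.io registry, using $oauthtoken as the username and your NGC API key as the password:
docker login nvcr.io
Username: $oauthtoken
Password: <your-ngc-api-key>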
Update Kubernetes credentials for NGC by filling in <ngc-secret-token> and <email> in the following command:
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nv-nvaie-tme" --docker-username='$oauthtoken' --docker-password='<ngc-secret-token>' --docker-email='<email>'
Creating a secret will fail if a secret with that name already exists in the Kubernetes configuration. Refer to Generating Your NGC API Key.
This is required for authentication. Note that the name of the secret is ngc-secret, which will be used in spark.kubernetes.container.image.pullSecrets.
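To check whether a secret with that name already exists, or to remove it before recreating it:
kubectl get secret ngc-secret
kubectl delete secret ngc-secret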
Role-Based Access Control (RBAC) is enabled by default when CNS creates the cluster. Follow the steps in https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac to create a service account for the Spark driver pod. Below are the relevant commands:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Correspondingly, an additional argument is added to spark-submit:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
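To verify that the service account was created and can manage pods in the default namespace, you can run:
kubectl get serviceaccount spark
kubectl auth can-i create pods --as=system:serviceaccount:default:spark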
Submitting a Simple Test Job
This simple job will test if the RAPIDS plugin can be found.
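The Kubernetes API server address and port used for K8SMASTER below can be found with kubectl cluster-info (prefix the reported address with k8s:// when exporting the variable):
kubectl cluster-info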
Here is an example spark-submit job:
export SPARK_HOME=~/spark-rapids/spark
export K8SMASTER=k8s://<ip>:<port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver
export NGC_SECRET_NAME=ngc-secret
$SPARK_HOME/bin/spark-submit \
--master $K8SMASTER \
--deploy-mode cluster \
--name examplejob \
--class org.apache.spark.examples.SparkPi \
--driver-memory 2G \
--conf spark.executor.instances=1 \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
--conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=$IMAGE_NAME \
--conf spark.kubernetes.container.image.pullSecrets=$NGC_SECRET_NAME \
--conf spark.kubernetes.container.image.pullPolicy=Always \
local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 10000
The 10000 at the end of the spark-submit command is an input that specifies the number of iterations for the SparkPi example. Note that local:// means the jar file is located inside the Docker image. Since this is cluster mode, the Spark driver runs inside a pod in Kubernetes. The driver and executor pods can be seen while the job is running:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-pi-d11075782f399fd7-exec-1 1/1 Running 0 9s
exampledriver 1/1 Running 0 15s
To view the Spark driver log, use the command below:
kubectl logs $SPARK_DRIVER_NAME
If the job ran successfully, the log output will contain the computed value for pi:
Pi is roughly 3.1406957034785172
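To confirm that the RAPIDS Accelerator plugin was loaded, you can also search the driver log for plugin messages (the exact wording of the log lines varies by version):
kubectl logs $SPARK_DRIVER_NAME | grep -i rapids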
ClassNotFoundException is a common error when the Spark driver cannot find the RAPIDS Accelerator jar, resulting in an exception like this:
Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin
To view the Spark driver UI when the job is running, first expose the driver UI port:
kubectl port-forward $SPARK_DRIVER_NAME 4040:4040
You may need to use SSH port forwarding to access the UI from a remote machine, e.g. run ssh -L 4040:localhost:4040 nvidia@<cluster-ip>
Then open a web browser to the Spark driver UI page on the exposed port:
http://localhost:4040
To kill the Spark job:
$SPARK_HOME/bin/spark-submit --kill spark:$SPARK_DRIVER_NAME
To delete the driver pod:
kubectl delete pod $SPARK_DRIVER_NAME
Deleting the driver pod is required to reuse the same driver pod name.