Getting Started with RAPIDS and Kubernetes

NVIDIA AI Enterprise 4.0 or later

This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs in a Kubernetes cluster.

This is a quick start guide that uses default settings, which may differ from your cluster's configuration.

Kubernetes requires a Docker image to run Spark. Generally everything needed is in the Docker image - Spark, the RAPIDS Accelerator for Spark jar, and the discovery script.

You can find other supported base CUDA images on the CUDA Docker Hub page. The source Dockerfiles are hosted in the CUDA GitLab repository and can be used to build the Docker images from an OS base image from scratch.

This guide uses the following software versions and assumes the following prerequisites:

  • Ubuntu 22.04

  • Spark 3.4.0

  • Upstream Kubernetes Version 1.25

  • Docker is installed on a client machine

  • A Docker repository which is accessible by the Kubernetes cluster

  • RAPIDS Accelerator

  • NGC API Key

This guide leverages the Cloud Native Stack (CNS) GitHub install guides to build the Kubernetes cluster. To install without CNS, follow the Install Kubernetes instructions to create a Kubernetes cluster with NVIDIA GPU support.
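
Before continuing, it can help to confirm that the cluster actually advertises GPU resources. A minimal check, assuming the NVIDIA GPU Operator installed by CNS is running:

# List the nodes and confirm that nvidia.com/gpu appears in their allocatable resources
kubectl get nodes
kubectl describe nodes | grep -i "nvidia.com/gpu"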

From your client machine with Docker installed, download the following packages and scripts as shown below. Apache Spark version 3.4.0 will be used. Please note that only Scala version 2.12 is currently supported by the accelerator.

Below are bash commands to install a local copy of Apache Spark and configure the docker image:

mkdir -p ~/spark-rapids/spark
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -zxvf spark-3.4.0-bin-hadoop3.tgz -C ./spark-rapids/spark --strip-components 1
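
Optionally, you can verify the integrity of the downloaded archive. A quick sketch, assuming the matching .sha512 file is published alongside the archive on the Apache mirror:

# Download the published checksum and compare it with the locally computed one
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz.sha512
sha512sum spark-3.4.0-bin-hadoop3.tgz
cat spark-3.4.0-bin-hadoop3.tgz.sha512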

cd ~/spark-rapids

Copy the RAPIDS Accelerator .jar file into the current working directory. Refer to the Access the NVIDIA AI Enterprise RAPIDS Accelerator section to pull the .jar file.
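
For example, if the jar was downloaded to ~/Downloads (a hypothetical location; adjust the path and version to match your environment):

# Copy the RAPIDS Accelerator jar next to the Dockerfile
cp ~/Downloads/rapids-4-spark_2.12-23.08.1.jar ~/spark-rapids/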

A sample Dockerfile is provided below:

# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM nvidia/cuda:12.2.2-devel-ubuntu22.04

ARG spark_uid=185

# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub

# Install tini and java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends tini openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first either download Apache Spark 3.1+ from
# http://spark.apache.org/downloads.html or build and make a Spark distribution following the
# instructions in http://spark.apache.org/docs/3.1.2/building-spark.html (see
# https://nvidia.github.io/spark-rapids/docs/download.html for other supported versions). If this
# docker file is being used in the context of building your images from a Spark distribution, the
# docker build command should be invoked from the top level directory of the Spark
# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd

COPY spark /opt/spark
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/kubernetes/tests /opt/spark/tests
COPY rapids-4-spark_2.12-*.jar /opt/spark/jars

RUN apt-get update && \
    apt-get install -y python-is-python3 python3-pip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}

It is assumed that the following directory structure exists:

$ ls ~/spark-rapids

Dockerfile rapids-4-spark_2.12-23.08.1.jar spark

Build the Dockerfile and push to NGC (Optional):

export IMAGE_NAME=nvcr.io/<your-registry-name>/<container-name>:<tag>
docker build . -f Dockerfile -t $IMAGE_NAME
docker push $IMAGE_NAME


Note

Pushing to NGC is optional. Refer to the NGC Private Registry User Guide for setup instructions.
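
Before pushing, you can optionally confirm that the RAPIDS Accelerator jar made it into the image. A minimal check, overriding the image entrypoint defined in the sample Dockerfile above:

# List the Spark jars baked into the freshly built image and look for the RAPIDS jar
docker run --rm --entrypoint /bin/ls $IMAGE_NAME /opt/spark/jars | grep rapids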

Update Kubernetes credentials for NGC by filling in <ngc-secret-token> and <email> in the following command:

kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nv-nvaie-tme" --docker-username='$oauthtoken' --docker-password='<ngc-secret-token>' --docker-email='<email>'

Note

Creating a secret will not work if there is already a secret with that name in the Kubernetes configuration. Refer to Generating Your NGC API Key to obtain the <ngc-secret-token> value.
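
If a secret named ngc-secret already exists and needs to be replaced, it can be removed before being recreated, for example:

# Check for an existing secret and remove it so it can be recreated
kubectl get secret ngc-secret
kubectl delete secret ngc-secret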

Important

This is required for authentication. Note that the name of the secret is ngc-secret, which will be used in spark.kubernetes.container.image.pullSecrets.

Role Based Access Control (RBAC) is enabled by default when the cluster is created with CNS. Follow the steps in https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac to create a service account that can pull the protected image. Below are the relevant commands:

kubectl create serviceaccount spark

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Correspondingly, an additional argument is added to the spark-submit command:

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
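
To sanity-check the RBAC setup, you can confirm that the service account exists and is permitted to manage pods in the default namespace, for example:

# Verify the service account and its effective permissions
kubectl get serviceaccount spark -n default
kubectl auth can-i create pods --as=system:serviceaccount:default:spark -n default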

Submitting a Simple Test Job

This simple job will test if the RAPIDS plugin can be found.

Here is an example spark-submit job:

export SPARK_HOME=~/spark-rapids/spark
export K8SMASTER=k8s://<ip>:<port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver
export NGC_SECRET_NAME=ngc-secret

$SPARK_HOME/bin/spark-submit \
  --master $K8SMASTER \
  --deploy-mode cluster \
  --name examplejob \
  --class org.apache.spark.examples.SparkPi \
  --driver-memory 2G \
  --conf spark.executor.instances=1 \
  --conf spark.executor.memory=4G \
  --conf spark.executor.cores=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
  --conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=$IMAGE_NAME \
  --conf spark.kubernetes.container.image.pullSecrets=$NGC_SECRET_NAME \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 10000

The 10000 at the end of the spark-submit command specifies the number of iterations. Note that local:// means the jar file is located inside the Docker image. Since this is cluster mode, the Spark driver runs inside a pod in Kubernetes. The driver and executor pods can be seen while the job is running:

$ kubectl get pods

NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-d11075782f399fd7-exec-1   1/1     Running   0          9s
exampledriver                      1/1     Running   0          15s

To view the Spark driver log, use the command below:

kubectl logs $SPARK_DRIVER_NAME

If the job ran successfully, the log output will contain the computed value for pi:

Pi is roughly 3.1406957034785172
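
For example, the result can be pulled directly out of the driver log once the job has completed:

# Filter the driver log for the computed value of pi
kubectl logs $SPARK_DRIVER_NAME | grep "Pi is roughly"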

Note

ClassNotFoundException is a common error if the Spark driver cannot find the RAPIDS Accelerator jar, resulting in an exception like this:

Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin

To view the Spark driver UI when the job is running, first expose the driver UI port:

kubectl port-forward $SPARK_DRIVER_NAME 4040:4040

Note

You may need to use SSH port forwarding to access the UI from a remote machine, e.g. run ssh -L 4040:localhost:4040 nvidia@<cluster-ip>.

Then open a web browser to the Spark driver UI page on the exposed port:

http://localhost:4040

To kill the Spark job:

$SPARK_HOME/bin/spark-submit --kill spark:$SPARK_DRIVER_NAME

To delete the driver pod:

kubectl delete pod $SPARK_DRIVER_NAME

Deleting the driver pod is required to reuse the same driver pod name.
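
As a quick confirmation that the name is free for reuse, querying the deleted pod should return a NotFound error:

# Expect "Error from server (NotFound)" once the driver pod has been removed
kubectl get pod $SPARK_DRIVER_NAME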
