Google Cloud Dataproc Deployment Guide#
Dataproc is Google Cloud’s fully managed Apache Spark and Hadoop service. NVIDIA RAPIDS Accelerator for Apache Spark is available on Dataproc, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by Google Cloud and enables users to run their Spark workloads with optimized performance and efficiency.
Overview of Steps#
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on a Dataproc cluster. We start with an Ubuntu-based OS to run the Google CLI for connecting to and interacting with a Dataproc cluster and Google Cloud Storage buckets. We then use the spark-rapids Java Archive (jar) file to integrate the dependencies and classes that enable GPU acceleration of Apache Spark workloads. The jar file is uploaded to a Google Cloud Storage bucket and loaded at cluster creation time. Once the cluster is created, we can submit a GPU-accelerated Spark job/application.
Prerequisites#
Prior to getting started with the RAPIDS Accelerator for Apache Spark on Dataproc, ensure you have the following prerequisites:
Ubuntu OS with internet access to Google Cloud.
NGC Account with NGC Catalog Access
Google Cloud Tools:
Note
Other operating systems may work, but the steps below assume Ubuntu.
Connectivity#
From an Ubuntu OS with the above prerequisites, use the gcloud init command to validate and connect to a Google Cloud project and region with Dataproc enabled. Then verify connectivity by listing clusters:
gcloud dataproc clusters list
Choose a region that offers the desired GPU instances. A complete list is available here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
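If you prefer to check GPU availability from the CLI rather than the web page, the accelerator-types listing can be filtered by zone. A sketch, assuming you are interested in the us-central1 region used in the examples below:

```shell
# List the accelerator (GPU) types available in each us-central1 zone.
gcloud compute accelerator-types list --filter="zone:us-central1"
```

The output shows one row per accelerator type per zone, so you can confirm, for example, that nvidia-tesla-t4 is offered in your chosen zone.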
Configure gcloud to use your preferred region.
Below is an example setting the region to us-central1.
export REGION=us-central1
gcloud config set compute/region ${REGION}
Note
You may need to request additional GPU resource usage quota for your specified region. For more information, refer to the Google Cloud Resource usage quotas and permission management documentation.
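As a quick check before requesting additional quota, the region's current GPU quotas can be inspected with gcloud. A sketch; the grep pattern simply narrows the JSON output to GPU-related metrics and their limits:

```shell
# Show GPU quota metrics (e.g. NVIDIA_T4_GPUS) with their limit and usage.
gcloud compute regions describe ${REGION} --format="json(quotas)" \
    | grep -B1 -A1 "GPUS"
```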
Upload Jar to the Cloud (GCP Bucket)#
To speed up cluster creation, upload the jar file to a GCP bucket within the region of your cluster. Set the terminal variables and use the gsutil command to create a GCS bucket.
Tip
Pull the jar file from the NGC Catalog to your local machine. Refer back to the Access the NVIDIA AI Enterprise RAPIDS Accelerator Section.
Change exampleuser1 and rapids-4-spark_2.12-23.02.0.jar as needed.
export USER_NAME=exampleuser1
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar
Place the jar in your current working directory and set the location variable.
export LOCAL_JAR_LOCATION=./${JAR_NAME}
Export variables for placing the jar in a GCP bucket.
export GCS_BUCKET=${USER_NAME}-bucket
export GCS_JAR_LOCATION=gs://${GCS_BUCKET}/jars/${JAR_NAME}
Create the bucket and copy the jar to it.
gsutil mb -l ${REGION} gs://${GCS_BUCKET}
gsutil cp ${LOCAL_JAR_LOCATION} ${GCS_JAR_LOCATION}
Note
The Google Cloud Console can also be used to transfer files, if preferred.
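Whichever method you use, you can verify the upload before moving on:

```shell
# List the uploaded jar and its size to confirm the copy succeeded.
gsutil ls -l ${GCS_JAR_LOCATION}
```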
Create the Cluster and Update Dataproc Jar#
In this section, you will create the Dataproc cluster with gcloud commands and then replace the existing jars with the RAPIDS Accelerator jar.
Use the export command to set the appropriate cluster config variables. Below are example variables; replace them to match your environment.
Note
Some regions do not have GPU instances.
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=${USER_NAME}-gpu
export MASTER_MACHINE_TYPE=n1-standard-16
export WORKER_MACHINE_TYPE=n1-highmem-32
export NUM_WORKERS=4
export NUM_WORKER_SSDS=2
export WORKER_ACCEL_TYPE=nvidia-tesla-t4
export NUM_GPUS_PER_WORKER=2
Now that the above variables have been set, use the following command to create a cluster with the above config.
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --image-version=2.1.2-ubuntu20 \
    --region ${REGION} \
    --zone ${ZONE} \
    --master-machine-type ${MASTER_MACHINE_TYPE} \
    --num-workers ${NUM_WORKERS} \
    --worker-accelerator type=${WORKER_ACCEL_TYPE},count=${NUM_GPUS_PER_WORKER} \
    --worker-machine-type ${WORKER_MACHINE_TYPE} \
    --num-worker-local-ssds ${NUM_WORKER_SSDS} \
    --worker-local-ssd-interface=NVME \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \
    --optional-components=JUPYTER,ZEPPELIN \
    --properties 'spark:spark.eventLog.enabled=true,spark:spark.eventLog.compress=true' \
    --bucket ${GCS_BUCKET} \
    --enable-component-gateway \
    --subnet=default \
    --no-shielded-secure-boot
Note
GPU cluster creation can take 15 minutes or more.
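Creation runs asynchronously on the Google side; one way to check progress is to poll the cluster until it reports RUNNING:

```shell
# Print the cluster's current state, e.g. CREATING or RUNNING.
gcloud dataproc clusters describe ${CLUSTER_NAME} \
    --region=${REGION} \
    --format="value(status.state)"
```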
Remove any existing RAPIDS Accelerator and cuDF jars from the Dataproc nodes, then replace them with the RAPIDS Accelerator jar you uploaded.
WORKER_LIST=$(for n in $(seq 0 $((NUM_WORKERS - 1))); do echo -n "w-${n} "; done)
for node in m ${WORKER_LIST}; do
    gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo rm /usr/lib/spark/jars/cudf-*.jar"
    gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo rm /usr/lib/spark/jars/rapids-4-spark_*.jar"
    gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo gsutil cp ${GCS_JAR_LOCATION} /usr/lib/spark/jars/"
done
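As an optional sanity check, list the RAPIDS jars on a node afterwards; you should see only the version you uploaded:

```shell
# List RAPIDS jars on the master node; expect exactly the uploaded jar.
gcloud compute ssh ${CLUSTER_NAME}-m \
    --command="ls -l /usr/lib/spark/jars/ | grep rapids"
```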
Validation#
Now you are ready to submit jobs to the cluster as you would to any other cluster. The sample SparkPi job below shows that the jar is loaded successfully.
gcloud dataproc jobs submit spark \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    --class=org.apache.spark.examples.SparkPi \
    --properties=spark.rapids.sql.explain=ALL \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
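Because spark.rapids.sql.explain=ALL is set, the driver output includes the RAPIDS plan explanation. If you need to re-read it after the job finishes, the output can be streamed again by job ID (<job-id> below is a placeholder for the ID printed at submission time):

```shell
# Re-print the driver output of a submitted or finished job.
gcloud dataproc jobs wait <job-id> --region=${REGION}
```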
For a complete list of configuration options, Apache Spark compatibility topics, and NVIDIA operator support, see the Configuration Appendix and Supported Operators Appendix.
Cluster Cleanup#
The cluster can be deleted via:
gcloud dataproc clusters delete -q ${CLUSTER_NAME} --region=${REGION}
Tip
Delete your GCP Bucket if no longer needed.
gsutil -m rm -rf gs://<Your-Bucket-Name>