Google Cloud Dataproc Deployment Guide
Dataproc is Google Cloud’s fully managed Apache Spark and Hadoop service. NVIDIA RAPIDS Accelerator for Apache Spark is available on Dataproc, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by Google Cloud and enables users to run their Spark workloads with optimized performance and efficiency.
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on a Dataproc cluster. We start with an Ubuntu-based OS to run the Google Cloud CLI for connecting to and interacting with a Dataproc cluster and Google Cloud Storage buckets. We then use the spark-rapids Java Archive (jar) file to integrate the dependencies and classes that enable GPU acceleration of Apache Spark workloads. The jar file is uploaded to a Google Cloud Storage bucket and subsequently loaded at cluster creation time. Once the cluster is created, we can submit a GPU-accelerated Spark job/application.
Prior to getting started with the RAPIDS Accelerator for Apache Spark on Dataproc, ensure you have the following prerequisites:
Ubuntu OS with internet access to Google Cloud.
NGC Account with NGC Catalog Access
Google Cloud Tools:
Other operating systems will work, but the steps below are written for Ubuntu.
From an Ubuntu OS with the above prerequisites, use the gcloud init command to validate and connect to a Google Cloud project and region with Dataproc enabled.
gcloud dataproc clusters list
Choose a region that offers the GPU instances you need. A complete list is available here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
Configure gcloud to use your preferred region.
Below is an example setting the region to us-central1.
export REGION=us-central1
gcloud config set compute/region ${REGION}
You may need to request additional GPU resource usage quota for your specified region. For more information, refer to the Google Cloud Resource usage quotas and permission management documentation.
To speed up cluster creation, upload the jar file to a GCS bucket in the same region as your cluster. Set the terminal variables below and use the gsutil command to create the bucket.
Pull the jar file from the NGC Catalog to your local machine. Refer back to the Access the NVIDIA AI Enterprise RAPIDS Accelerator section.
Change exampleuser1 and rapids-4-spark_2.12-23.02.0.jar as needed.
export USER_NAME=exampleuser1
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar
Place the jar in your current working directory and set the location variable.
export LOCAL_JAR_LOCATION=./${JAR_NAME}
Export variables for placing the jar in a GCS bucket.
export GCS_BUCKET=${USER_NAME}-bucket
export GCS_JAR_LOCATION=gs://$GCS_BUCKET/jars/${JAR_NAME}
Create the GCS bucket in your region and copy the jar to it.
gsutil mb -l ${REGION} gs://${GCS_BUCKET}
gsutil cp ${LOCAL_JAR_LOCATION} ${GCS_JAR_LOCATION}
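Before running the bucket creation and copy, it can help to confirm that the bucket name is valid and that the jar is actually present in the working directory. The following is a minimal sketch and not part of the original procedure. The helper name is_valid_bucket_name is hypothetical, and it checks only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit).

```shell
# Hypothetical helper: basic Cloud Storage bucket-name check.
# Returns 0 when the name passes, non-zero otherwise.
is_valid_bucket_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Fall back to the example values from this guide when the variables are unset.
USER_NAME="${USER_NAME:-exampleuser1}"
JAR_NAME="${JAR_NAME:-rapids-4-spark_2.12-23.02.0.jar}"
GCS_BUCKET="${USER_NAME}-bucket"

if is_valid_bucket_name "${GCS_BUCKET}"; then
  echo "Bucket name looks valid: ${GCS_BUCKET}"
else
  echo "Bucket name is not valid: ${GCS_BUCKET}" >&2
fi

# -s is true only when the file exists and is not empty.
if [ -s "./${JAR_NAME}" ]; then
  echo "Jar found: ./${JAR_NAME}"
else
  echo "Jar not found in the current directory: ./${JAR_NAME}" >&2
fi
```

Bucket names are global across Google Cloud, so a name that passes this check can still be rejected if it is already taken.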
If you prefer, files can also be transferred through the Google Cloud Console.
In this section, you will create the Dataproc cluster with gcloud commands and then replace the existing jars with the RAPIDS Accelerator jar.
Use the export command to set the appropriate cluster config variables. Below are example variables. Replace their values to match your environment.
Some regions do not have GPU instances.
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=${USER_NAME}-gpu
export MASTER_MACHINE_TYPE=n1-standard-16
export WORKER_MACHINE_TYPE=n1-highmem-32
export NUM_WORKERS=4
export NUM_WORKER_SSDS=2
export WORKER_ACCEL_TYPE=nvidia-tesla-t4
export NUM_GPUS_PER_WORKER=2
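Two quick checks on these variables can catch problems before cluster creation: the total number of GPUs the cluster will request (useful when comparing against your regional GPU quota) and whether the zone belongs to the chosen region. The sketch below repeats the example values so it runs on its own; the names TOTAL_GPUS and ZONE_MATCHES_REGION are illustrative and are not used elsewhere in this guide.

```shell
# Values mirror the example variables in this guide.
REGION=us-central1
ZONE=us-central1-a
NUM_WORKERS=4
NUM_GPUS_PER_WORKER=2

# Total GPUs requested across all workers; compare against your GPU quota.
TOTAL_GPUS=$((NUM_WORKERS * NUM_GPUS_PER_WORKER))
echo "Total worker GPUs requested: ${TOTAL_GPUS}"

# A zone name is its region plus a short suffix (for example us-central1-a),
# so the zone should start with "${REGION}-".
case "${ZONE}" in
  "${REGION}"-*) ZONE_MATCHES_REGION=yes ;;
  *)             ZONE_MATCHES_REGION=no ;;
esac
echo "Zone belongs to region: ${ZONE_MATCHES_REGION}"
```

With the example values, the cluster requests 8 GPUs in total (4 workers with 2 GPUs each).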
Now that the above variables have been set, use the following command to create a cluster with the above config.
gcloud dataproc clusters create ${CLUSTER_NAME} \
--image-version=2.1.2-ubuntu20 \
--region ${REGION} \
--zone ${ZONE} \
--master-machine-type ${MASTER_MACHINE_TYPE} \
--num-workers ${NUM_WORKERS} \
--worker-accelerator type=${WORKER_ACCEL_TYPE},count=${NUM_GPUS_PER_WORKER} \
--worker-machine-type ${WORKER_MACHINE_TYPE} \
--num-worker-local-ssds ${NUM_WORKER_SSDS} \
--worker-local-ssd-interface=NVME \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \
--optional-components=JUPYTER,ZEPPELIN \
--properties 'spark:spark.eventLog.enabled=true,spark:spark.eventLog.compress=true' \
--bucket ${GCS_BUCKET} \
--enable-component-gateway \
--subnet=default \
--no-shielded-secure-boot
GPU cluster creation can take 15 minutes or more.
Remove any existing RAPIDS Accelerator and cudf jars from the Dataproc nodes, then replace them with the RAPIDS Accelerator jar uploaded earlier.
WORKER_LIST=$(echo $(for n in $(seq 0 $((NUM_WORKERS - 1))); do echo -n w-${n} ' '; done))
for node in m $WORKER_LIST; do gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo rm /usr/lib/spark/jars/cudf-*.jar" ; gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo rm /usr/lib/spark/jars/rapids-4-spark_*.jar" ; gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo gsutil cp ${GCS_JAR_LOCATION} /usr/lib/spark/jars/" ; done
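The loop above relies on the node naming it uses for the cluster: the master is ${CLUSTER_NAME}-m and the workers are ${CLUSTER_NAME}-w-0 through ${CLUSTER_NAME}-w-(NUM_WORKERS - 1). As a sketch of how those names are derived, the hypothetical helper list_cluster_nodes below prints one node name per line, which makes it easy to review the targets before running any ssh commands.

```shell
# Hypothetical helper: print the master and worker node names for a cluster.
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
list_cluster_nodes() (
  cluster="$1"
  num_workers="$2"
  echo "${cluster}-m"
  n=0
  while [ "${n}" -lt "${num_workers}" ]; do
    echo "${cluster}-w-${n}"
    n=$((n + 1))
  done
)

# Example with the values used in this guide.
list_cluster_nodes exampleuser1-gpu 4
```

With the example values this prints exampleuser1-gpu-m followed by exampleuser1-gpu-w-0 through exampleuser1-gpu-w-3.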
Now you are ready to submit jobs to the cluster as you would to any other cluster.
The following sample job shows that the jar is loaded successfully.
gcloud dataproc jobs submit spark \
--cluster=${CLUSTER_NAME} \
--region=${REGION} \
--class=org.apache.spark.examples.SparkPi \
--properties=spark.rapids.sql.explain=ALL \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
-- 1000
For a complete list of configuration options, Apache Spark compatibility topics, and NVIDIA operator support, see the Configuration Appendix and the Supported Operators Appendix.
The cluster can be deleted via:
gcloud dataproc clusters delete -q ${CLUSTER_NAME} --region=${REGION}
Delete your GCS bucket if it is no longer needed.
gsutil -m rm -rf gs://<Your-Bucket-Name>
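Because the delete is recursive and forced, it is worth guarding against an empty bucket variable before running it. The sketch below uses a hypothetical helper, bucket_url, that refuses to build a target from an empty name; it only prints the target, and the actual gsutil command is left for you to run once the output looks right.

```shell
# Hypothetical helper: build the gs:// URL, failing if the bucket name is empty.
bucket_url() {
  if [ -z "$1" ]; then
    echo "Bucket name is empty; refusing to build a delete target." >&2
    return 1
  fi
  printf 'gs://%s\n' "$1"
}

# GCS_BUCKET is the variable exported earlier in this guide.
if TARGET=$(bucket_url "${GCS_BUCKET:-}"); then
  echo "Delete target: ${TARGET}"
fi
```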