Google Cloud Dataproc Deployment Guide#
Dataproc is Google Cloud’s fully managed Apache Spark and Hadoop service. NVIDIA RAPIDS Accelerator for Apache Spark is available on Dataproc, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by Google Cloud and enables users to run their Spark workloads with optimized performance and efficiency.
Overview of Steps#
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on a Dataproc cluster. We start with an Ubuntu-based OS to run the Google CLI for connecting to and interacting with a Dataproc cluster and Google Cloud Storage buckets. We then use the spark-rapids Java Archive (jar) file to integrate the dependencies and classes that enable GPU acceleration of Apache Spark workloads. The jar file is uploaded to a Google Cloud Storage bucket and loaded at cluster creation time. Once the cluster is created, we can submit a GPU-accelerated Spark job/application.
Prerequisites#
Prior to getting started with the RAPIDS Accelerator for Apache Spark on Dataproc, ensure you have the following prerequisites:
Ubuntu OS with internet access to Google Cloud.
NGC Account with NGC Catalog Access
Google Cloud Tools:
Note
Other operating systems may work, but the steps below assume Ubuntu.
Connectivity#
From an Ubuntu OS with the above prerequisites, use the gcloud init command to validate and connect to a Google Cloud project and region with Dataproc enabled. Then verify connectivity by listing clusters:
gcloud dataproc clusters list
Choose a region that offers the desired GPU instances. A complete list is available here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
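If you prefer to check GPU availability from the CLI rather than the web page, the accelerator-types listing can be filtered by zone. A sketch, assuming you are interested in the us-central1 region used in the examples below:

```shell
# List the accelerator (GPU) types available in each us-central1 zone.
gcloud compute accelerator-types list --filter="zone:us-central1"
```

The output shows one row per accelerator type per zone, so you can confirm, for example, that nvidia-tesla-t4 is offered in your chosen zone.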
Configure gcloud to use your preferred region.
Below is an example setting the region to us-central1.
export REGION=us-central1
gcloud config set compute/region ${REGION}
Note
You may need to request additional GPU resource usage quota for your specified region. For more information, refer to the Google Cloud Resource usage quotas and permission management documentation.
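As a quick check before requesting additional quota, the region's current GPU quotas can be inspected with gcloud. A sketch; the grep pattern simply narrows the JSON output to GPU-related metrics and their limits:

```shell
# Show GPU quota metrics (e.g. NVIDIA_T4_GPUS) with their limit and usage.
gcloud compute regions describe ${REGION} --format="json(quotas)" \
    | grep -B1 -A1 "GPUS"
```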
Upload Jar to the Cloud (GCP Bucket)#
To speed up cluster creation, upload the jar file to a GCP bucket within the region of your cluster. Set the terminal variables and use the gsutil command to create a GCS bucket.
Tip
Pull the jar file from the NGC Catalog to your local machine. Refer back to the Access the NVIDIA AI Enterprise RAPIDS Accelerator Section.
Change exampleuser1 and rapids-4-spark_2.12-23.02.0.jar as needed.
export USER_NAME=exampleuser1
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar
Place the jar in your current working directory and set the location variable.
export LOCAL_JAR_LOCATION=./${JAR_NAME}
Export variables for placing the jar in a GCP bucket.
export GCS_BUCKET=${USER_NAME}-bucket
export GCS_JAR_LOCATION=gs://${GCS_BUCKET}/jars/${JAR_NAME}
Create the bucket and copy the jar to it.
gsutil mb -l ${REGION} gs://${GCS_BUCKET}
gsutil cp ${LOCAL_JAR_LOCATION} ${GCS_JAR_LOCATION}
Note
The Google Cloud Console can also be used to transfer files, if preferred.
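Whichever method you use, you can verify the upload before moving on:

```shell
# List the uploaded jar and its size to confirm the copy succeeded.
gsutil ls -l ${GCS_JAR_LOCATION}
```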
Create the Cluster and Update Dataproc Jar#
In this section, you will create the Dataproc cluster with gcloud commands and then replace the existing jars with the RAPIDS Accelerator jar.
Use the export command to set the appropriate cluster config variables. Below are example variables; replace them to match your environment.
Note
Some regions do not have GPU instances.
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=${USER_NAME}-gpu
export MASTER_MACHINE_TYPE=n1-standard-16
export WORKER_MACHINE_TYPE=n1-highmem-32
export NUM_WORKERS=4
export NUM_WORKER_SSDS=2
export WORKER_ACCEL_TYPE=nvidia-tesla-t4
export NUM_GPUS_PER_WORKER=2
Now that the above variables have been set, use the following command to create a cluster with the above config.
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --image-version=2.1.2-ubuntu20 \
    --region ${REGION} \
    --zone ${ZONE} \
    --master-machine-type ${MASTER_MACHINE_TYPE} \
    --num-workers ${NUM_WORKERS} \
    --worker-accelerator type=${WORKER_ACCEL_TYPE},count=${NUM_GPUS_PER_WORKER} \
    --worker-machine-type ${WORKER_MACHINE_TYPE} \
    --num-worker-local-ssds ${NUM_WORKER_SSDS} \
    --worker-local-ssd-interface=NVME \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \
    --optional-components=JUPYTER,ZEPPELIN \
    --properties 'spark:spark.eventLog.enabled=true,spark:spark.eventLog.compress=true' \
    --bucket ${GCS_BUCKET} \
    --enable-component-gateway \
    --subnet=default \
    --no-shielded-secure-boot
Note
GPU cluster creation can take 15 minutes or more.
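Creation runs asynchronously on the Google side; one way to check progress is to poll the cluster until it reports RUNNING:

```shell
# Print the cluster's current state, e.g. CREATING or RUNNING.
gcloud dataproc clusters describe ${CLUSTER_NAME} \
    --region=${REGION} \
    --format="value(status.state)"
```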
Remove any existing RAPIDS Accelerator and cuDF jars from the Dataproc nodes, then replace them with the RAPIDS Accelerator jar you uploaded.
WORKER_LIST=$(for n in $(seq 0 $((NUM_WORKERS - 1))); do echo -n "w-${n} "; done)
for node in m ${WORKER_LIST}; do
    gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo rm /usr/lib/spark/jars/cudf-*.jar"
    gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo rm /usr/lib/spark/jars/rapids-4-spark_*.jar"
    gcloud compute ssh ${CLUSTER_NAME}-${node} --command="sudo gsutil cp ${GCS_JAR_LOCATION} /usr/lib/spark/jars/"
done
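As an optional sanity check, list the RAPIDS jars on a node afterwards; you should see only the version you uploaded:

```shell
# List RAPIDS jars on the master node; expect exactly the uploaded jar.
gcloud compute ssh ${CLUSTER_NAME}-m \
    --command="ls -l /usr/lib/spark/jars/ | grep rapids"
```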
Validation#
Now you are ready to submit jobs to the cluster as you would to any other cluster. The sample SparkPi job below shows that the jar is loaded successfully.
gcloud dataproc jobs submit spark \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    --class=org.apache.spark.examples.SparkPi \
    --properties=spark.rapids.sql.explain=ALL \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
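Because spark.rapids.sql.explain=ALL is set, the driver output includes the RAPIDS plan explanation. If you need to re-read it after the job finishes, the output can be streamed again by job ID (<job-id> below is a placeholder for the ID printed at submission time):

```shell
# Re-print the driver output of a submitted or finished job.
gcloud dataproc jobs wait <job-id> --region=${REGION}
```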
For a complete list of configuration options, Apache Spark compatibility topics, and NVIDIA operator support, see the Configuration Appendix and Supported Operators Appendix.
Cluster Cleanup#
The cluster can be deleted via:
gcloud dataproc clusters delete -q ${CLUSTER_NAME} --region=${REGION}
Tip
Delete your GCP Bucket if no longer needed.
gsutil -m rm -rf gs://<Your-Bucket-Name>