Google Cloud Dataproc Deployment Guide
Dataproc is Google Cloud’s fully managed Apache Spark and Hadoop service. NVIDIA RAPIDS Accelerator for Apache Spark is available on Dataproc, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by Google Cloud and enables users to run their Spark workloads with optimized performance and efficiency.
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on a Dataproc cluster. We start with an Ubuntu-based OS running the Google Cloud CLI to connect to and interact with a Dataproc cluster and Google Cloud Storage buckets. We then use the spark-rapids Java Archive (jar) file to integrate the dependencies and classes that enable GPU acceleration of Apache Spark workloads. The jar file is uploaded to a Google Cloud Storage bucket and subsequently loaded at cluster creation time. Once the cluster is created, we can submit a GPU-accelerated Spark job/application.
Prior to getting started with the RAPIDS Accelerator for Apache Spark on Dataproc, ensure you have the following prerequisites:
Ubuntu OS with internet access to Google Cloud.
NGC Account with Enterprise Catalog Access
Google Cloud Tools (the gcloud CLI and gsutil; see the installation sketch below)
Other operating systems may work, but the steps below assume Ubuntu.
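If the gcloud CLI is not already installed, the following is a minimal sketch of installing it on Ubuntu via apt; refer to Google's installation documentation for the authoritative steps.
sudo apt-get update && sudo apt-get install -y apt-transport-https ca-certificates gnupg curl
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get update && sudo apt-get install -y google-cloud-cli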
From an Ubuntu OS with the above prerequisites, use the gcloud init command to validate and connect to a Google Cloud project and region with Dataproc enabled. You can confirm connectivity by listing the project's Dataproc clusters:
gcloud dataproc clusters list
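A minimal sketch of this initial setup is shown below; example-project-id is a placeholder for your own project ID.
gcloud init  # authenticate and choose a default project interactively
gcloud config set project example-project-id  # or set the project explicitly
gcloud services list --enabled | grep dataproc  # confirm the Dataproc API is enabled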
Leverage a region with the desired GPU instances. A complete list is available here: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
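One way to check which GPU types a zone offers is to filter the accelerator-types listing; the zone below is only an example.
gcloud compute accelerator-types list --filter="zone:us-central1-a"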
Configure gcloud to use your preferred region.
Below is an example setting the region to us-central1.
export REGION=us-central1
gcloud config set compute/region ${REGION}
You may need to request additional GPU resource usage quota for your specified region. For more information, refer to the Google Cloud Resource usage quotas and permission management documentation.
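As a rough way to inspect current GPU quota from the CLI, you can describe the region and search for the GPU quota metrics (names vary by GPU model, e.g. NVIDIA_T4_GPUS).
gcloud compute regions describe ${REGION} | grep -i -B1 -A1 gpus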
To speed up cluster creation, upload the jar file to a GCS bucket within the region of your cluster. Set the terminal variables below and use the gsutil command to create a GCS bucket.
Pull the jar file from the enterprise catalog to your local machine. Refer back to the Access the NVIDIA AI Enterprise RAPIDS Accelerator Section.
Change exampleuser1 and rapids-4-spark_2.12-23.02.0.jar as needed.
export USER_NAME=exampleuser1
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar
Place the jar in your current working directory and set the location variable.
export LOCAL_JAR_LOCATION=./${JAR_NAME}
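As an optional sanity check that the variable points at the jar on disk:
ls -lh ${LOCAL_JAR_LOCATION}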
Export variables for placing the jar in a GCP bucket.
export GCS_BUCKET=${USER_NAME}-bucket
export GCS_JAR_LOCATION=gs://$GCS_BUCKET/jars/${JAR_NAME}
Create the GCS bucket and copy the jar to it.
gsutil mb -l ${REGION} gs://${GCS_BUCKET}
gsutil cp ${LOCAL_JAR_LOCATION} ${GCS_JAR_LOCATION}
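Optionally, verify the upload by listing the object:
gsutil ls -l ${GCS_JAR_LOCATION}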
Files can also be transferred through the Google Cloud Console if preferred.
In this section, you will create the Dataproc cluster with gcloud commands and then replace the existing jars with the RAPIDS Accelerator jar.
Use the export command to set the appropriate cluster config variables. Below are example variables; replace them to match your environment.
Some regions do not have GPU instances.
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=${USER_NAME}-gpu
export MASTER_MACHINE_TYPE=n1-standard-16
export WORKER_MACHINE_TYPE=n1-highmem-32
export NUM_WORKERS=4
export NUM_WORKER_SSDS=2
export WORKER_ACCEL_TYPE=nvidia-tesla-t4
export NUM_GPUS_PER_WORKER=2
Now that the above variables have been set, use the following command to create a cluster with the above config.
gcloud dataproc clusters create ${CLUSTER_NAME} \
--image-version=2.1.2-ubuntu20 \
--region ${REGION} \
--zone ${ZONE} \
--master-machine-type ${MASTER_MACHINE_TYPE} \
--num-workers ${NUM_WORKERS} \
--worker-accelerator type=${WORKER_ACCEL_TYPE},count=${NUM_GPUS_PER_WORKER} \
--worker-machine-type ${WORKER_MACHINE_TYPE} \
--num-worker-local-ssds ${NUM_WORKER_SSDS} \
--worker-local-ssd-interface=NVME \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \
--optional-components=JUPYTER,ZEPPELIN \
--properties 'spark:spark.eventLog.enabled=true,spark:spark.eventLog.compress=true' \
--bucket ${GCS_BUCKET} \
--enable-component-gateway \
--subnet=default \
--no-shielded-secure-boot
GPU cluster creation can take ~15+ minutes.
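Once creation completes, you can optionally confirm the cluster state from the CLI:
gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${REGION} --format="value(status.state)"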
Remove any existing RAPIDS Accelerator and cuDF jars from the cluster nodes and replace them with the RAPIDS Accelerator jar uploaded earlier.
WORKER_LIST=$(for n in $(seq 0 $((NUM_WORKERS - 1))); do echo -n "w-${n} "; done)
for node in m ${WORKER_LIST}; do
  gcloud compute ssh ${CLUSTER_NAME}-${node} --zone=${ZONE} --command="sudo rm -f /usr/lib/spark/jars/cudf-*.jar"
  gcloud compute ssh ${CLUSTER_NAME}-${node} --zone=${ZONE} --command="sudo rm -f /usr/lib/spark/jars/rapids-4-spark_*.jar"
  gcloud compute ssh ${CLUSTER_NAME}-${node} --zone=${ZONE} --command="sudo gsutil cp ${GCS_JAR_LOCATION} /usr/lib/spark/jars/"
done
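To confirm the replacement, you can list the RAPIDS jar on a node, for example the master:
gcloud compute ssh ${CLUSTER_NAME}-m --zone=${ZONE} --command="ls -l /usr/lib/spark/jars/rapids-4-spark_*.jar"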
Now you are ready to submit jobs to the cluster as you would with any other cluster. An example of submitting a sample job is shown below.
This job will show that the jar is loaded successfully.
gcloud dataproc jobs submit spark \
--cluster=${CLUSTER_NAME} \
--region=${REGION} \
--class=org.apache.spark.examples.SparkPi \
--properties=spark.rapids.sql.explain=ALL \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
-- 1000
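After submission, job status can be checked from the CLI; with spark.rapids.sql.explain=ALL, the driver output should include RAPIDS plan information when the jar is loaded.
gcloud dataproc jobs list --cluster=${CLUSTER_NAME} --region=${REGION}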
For a complete list of all configuration options, Apache Spark compatibility topics, and NVIDIA operator support, see the Configuration Appendix and Supported Operators Appendix.
The cluster can be deleted via:
gcloud dataproc clusters delete -q ${CLUSTER_NAME} --region=${REGION}
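You can confirm the deletion by listing the clusters in the region; the cluster should no longer appear.
gcloud dataproc clusters list --region=${REGION}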
Delete your GCS bucket if it is no longer needed.
gsutil -m rm -rf gs://<Your-Bucket-Name>