RAPIDS Accelerator on Dataproc Serverless

User Guide (23.12)

Dataproc Serverless allows users to run Spark workloads without the need to provision and manage their own clusters.

For instructions on how to leverage L4 and A100 GPUs to accelerate Spark workloads, please refer to the following guide.

Dataproc Serverless Runtime Version   RAPIDS Accelerator Jar
1.1 LTS                               rapids-4-spark_2.12-23.08.1.jar
2.0                                   Coming soon
2.1                                   Coming soon

To get started with Dataproc Serverless, use the following example:

join-dataframes.py

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameJoin").getOrCreate()

# Create two DataFrames
df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")

# Perform the join and count
result = df1.join(df2, df1["a"] == df2["b"]).count()

# Display the result
print(result)

# Stop the Spark session
spark.stop()

Spark Batch Submission Command:

NUM_EXECUTORS=2
EXEC_CORES=8
DRIVER_CORES=4
DISK_SIZE=375G
BATCH_ID=join-df
BUCKET=your-bucket

gcloud dataproc batches submit \
pyspark \
--batch $BATCH_ID ./join-dataframes.py \
--deps-bucket $BUCKET \
--version 1.1 \
--subnet default \
--properties \
spark.executor.instances=$NUM_EXECUTORS,\
spark.driver.cores=$DRIVER_CORES,\
spark.executor.cores=$EXEC_CORES,\
spark.dataproc.driver.compute.tier=premium,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.driver.disk.tier=premium,\
spark.dataproc.executor.disk.tier=premium,\
spark.dataproc.driver.disk.size=$DISK_SIZE

Refer to the Supported Spark properties documentation for details on the resource allocation properties in Dataproc Serverless.
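
The --properties flag takes a single comma-separated list of key=value pairs, which is easy to mistype by hand. The following is a minimal sketch, not part of the official tooling, that builds that string from a Python dict. The helper name format_properties is hypothetical; the property names and values are the ones used in the submission command above.

```python
def format_properties(props):
    """Join Spark properties into the comma-separated key=value form
    used as the value of the --properties flag."""
    pairs = []
    for key, value in props.items():
        pair = f"{key}={value}"
        # A comma or whitespace inside a pair would be misparsed, so reject it.
        if any(ch in pair for ch in ", \t\n"):
            raise ValueError(f"unsupported character in {pair!r}")
        pairs.append(pair)
    return ",".join(pairs)


# Values mirror the batch submission command above.
gpu_batch_properties = {
    "spark.executor.instances": 2,
    "spark.driver.cores": 4,
    "spark.executor.cores": 8,
    "spark.dataproc.driver.compute.tier": "premium",
    "spark.dataproc.executor.compute.tier": "premium",
    "spark.dataproc.executor.resource.accelerator.type": "l4",
    "spark.dataproc.driver.disk.tier": "premium",
    "spark.dataproc.executor.disk.tier": "premium",
    "spark.dataproc.driver.disk.size": "375G",
}

properties_arg = format_properties(gpu_batch_properties)
print(properties_arg)
```

The printed string can be pasted after --properties in place of the hand-written continuation lines.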

Event Logs: There are two options for configuring event logs on Serverless:

  1. Using Persistent History Server (PHS): Use the --history-server-cluster argument to attach the History Server, which automatically configures spark.eventLog.dir.

  2. Explicitly using spark.eventLog properties: Configure event logs by setting spark.eventLog.enabled=true and spark.eventLog.dir=gs://<bucket-name>/eventLogs.
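
The two options above are alternatives, so a submission script should normally pick exactly one. The sketch below illustrates that choice under stated assumptions: the helper name event_log_settings is hypothetical, the bucket name is a placeholder, and the value passed for the history server is whatever cluster reference your project uses. It returns any extra command-line flags plus any extra Spark properties to merge into the --properties list.

```python
def event_log_settings(history_server_cluster=None, event_log_dir=None):
    """Return (extra_flags, extra_properties) for exactly one event-log option."""
    if (history_server_cluster is None) == (event_log_dir is None):
        raise ValueError("choose exactly one event-log option")
    if history_server_cluster is not None:
        # Option 1: the attached PHS configures spark.eventLog.dir automatically.
        return [f"--history-server-cluster={history_server_cluster}"], {}
    # Option 2: set the spark.eventLog properties explicitly.
    return [], {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }


# Placeholder bucket; replace with your own.
flags, props = event_log_settings(event_log_dir="gs://your-bucket/eventLogs")
print(flags)
print(props)
```

Returning the properties as a dict rather than a second --properties flag keeps all Spark properties in one list on the final command line.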

Driver Logs: You can access Driver Logs in the Output section of the UI.

Executor Logs: Serverless uses Logs Explorer to give users access to Executor Logs. To view them, click VIEW LOGS and enter the following query in the search bar, replacing join-df with your batch_id:

resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
severity=ERROR

You can set severity to any of the following levels:
  1. INFO
  2. DEFAULT
  3. ERROR
  4. WARNING
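
When checking several batches or severity levels, it can help to generate the query text rather than edit it by hand. The following is a minimal sketch of that idea; the helper name executor_log_filter is hypothetical, and the query lines and severity levels are the ones listed above.

```python
ALLOWED_SEVERITIES = ("INFO", "DEFAULT", "ERROR", "WARNING")


def executor_log_filter(batch_id, severity="ERROR"):
    """Build the executor-log query text for one batch and severity level."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {ALLOWED_SEVERITIES}")
    # One condition per line, matching the layout of the query above.
    return "\n".join([
        'resource.type="cloud_dataproc_batch"',
        f'resource.labels.batch_id="{batch_id}"',
        'SEARCH("`executor.log`")',
        f"severity={severity}",
    ])


query = executor_log_filter("join-df", "WARNING")
print(query)
```

The printed text can be pasted into the search bar as-is.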

Refer to Dataproc Serverless Pricing for details on calculating the cost of a GPU batch workload.

© Copyright 2023, NVIDIA. Last updated on Dec 20, 2023.