RAPIDS Accelerator on Dataproc Serverless

Dataproc Serverless allows users to run Spark workloads without the need to provision and manage their own clusters.

Accelerating Batch Workloads with GPUs on Dataproc Serverless:

For instructions on using L4 and A100 GPUs to accelerate Spark workloads, refer to the Dataproc Serverless documentation on GPU-accelerated batch workloads.

Dataproc Serverless Runtime RAPIDS Version:

Version    RAPIDS Accelerator Jar
1.1 LTS    rapids-4-spark_2.12-23.08.1.jar
2.0        Coming soon
2.1        Coming soon

Creating a Serverless Batch Workload with RAPIDS Accelerator:

To get started with Dataproc Serverless, use the following example:

join-dataframes.py

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameJoin").getOrCreate()

# Create two DataFrames
df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")

# Perform the join and count
result = df1.join(df2, df1["a"] == df2["b"]).count()

# Display the result
print(result)

# Stop the Spark session
spark.stop()
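
As an optional sanity check (a sketch, not part of the original script), you can print the physical plan before stopping the session: with the RAPIDS Accelerator active, operators appear with a Gpu prefix (for example GpuShuffledHashJoin), while a CPU-only run shows the standard operator names.

# Optional: inspect the physical plan before calling spark.stop().
# With the RAPIDS Accelerator enabled, operator names carry a "Gpu" prefix.
joined = df1.join(df2, df1["a"] == df2["b"])
joined.explain()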

Spark Batch Submission Command:

NUM_EXECUTORS=2
EXEC_CORES=8
DRIVER_CORES=4
DISK_SIZE=375G
BATCH_ID=join-df
BUCKET=your-bucket

gcloud dataproc batches submit \
pyspark \
--batch $BATCH_ID ./join-dataframes.py \
--deps-bucket $BUCKET \
--version 1.1 \
--subnet default \
--properties \
spark.executor.instances=$NUM_EXECUTORS,\
spark.driver.cores=$DRIVER_CORES,\
spark.executor.cores=$EXEC_CORES,\
spark.dataproc.driver.compute.tier=premium,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.driver.disk.tier=premium,\
spark.dataproc.executor.disk.tier=premium,\
spark.dataproc.driver.disk.size=$DISK_SIZE

Refer to the Supported Spark properties for Dataproc Serverless to understand the resource allocation properties used above.
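
After submission, you can also monitor the batch from the CLI. A minimal sketch (add --region=<region> if your gcloud configuration does not set a default):

# Check the state of the submitted batch (PENDING, RUNNING, SUCCEEDED, FAILED).
gcloud dataproc batches describe join-df

# List recent batches in the project.
gcloud dataproc batches list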

Logging Configuration for Serverless Workloads:

Event Logs: There are two options for configuring event logs on Serverless:

  1. Using a Persistent History Server (PHS): Use the --history-server-cluster argument to attach a History Server, which automatically configures spark.eventLog.dir.

  2. Explicitly setting the spark.eventLog properties: Enable event logging by setting spark.eventLog.enabled=true and spark.eventLog.dir=gs://<bucket-name>/eventLogs, as in the sketch below.
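
For the second option, a minimal sketch of the submission command with event logging enabled (the batch ID, bucket, and gs:// path are placeholders):

gcloud dataproc batches submit \
pyspark \
--batch join-df-eventlog ./join-dataframes.py \
--deps-bucket your-bucket \
--version 1.1 \
--subnet default \
--properties \
spark.eventLog.enabled=true,\
spark.eventLog.dir=gs://<bucket-name>/eventLogs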

Driver Logs: Driver logs are available in the Output section of the batch details UI.

Executor Logs: Serverless uses the Logs Explorer to provide access to executor logs. To view them, click VIEW LOGS and run the following query in the search bar, substituting the appropriate batch_id:

resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
severity=ERROR

You can also filter by one of the following severity levels:

  1. DEFAULT

  2. INFO

  3. WARNING

  4. ERROR
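
The same logs can also be pulled from the command line with gcloud logging read. A sketch: the SEARCH clause above is a Logs Explorer feature, so this filter matches on the resource labels and severity instead.

gcloud logging read \
  'resource.type="cloud_dataproc_batch" AND resource.labels.batch_id="join-df" AND severity>=ERROR' \
  --limit=50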

Cost:

Refer to Dataproc Serverless Pricing for details on calculating the cost of a GPU batch workload.
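
As a rough illustration only: the sketch below assumes the bill is the sum of compute (DCU-hours), accelerator (GPU-hours), and shuffle-storage charges, and every rate and usage figure in it is a hypothetical placeholder; consult the pricing page for the authoritative model and current rates.

# Back-of-the-envelope cost sketch for a GPU batch workload.
# All rates and usage figures are hypothetical placeholders; substitute
# values from the Dataproc Serverless pricing page and your billing report.
DCU_HOUR_RATE = 0.06            # placeholder $/DCU-hour
GPU_HOUR_RATE = 0.60            # placeholder $/GPU-hour
SHUFFLE_GB_MONTH_RATE = 0.04    # placeholder $/GB-month

dcu_hours = 10.0                # placeholder: from the batch's billing report
gpu_hours = 1.0                 # placeholder: GPUs x runtime hours
shuffle_gb_months = 0.5         # placeholder: from the billing report

estimate = (dcu_hours * DCU_HOUR_RATE
            + gpu_hours * GPU_HOUR_RATE
            + shuffle_gb_months * SHUFFLE_GB_MONTH_RATE)
print(f"Estimated cost: ${estimate:.2f}")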