RAPIDS Accelerator on Dataproc Serverless#
Dataproc Serverless allows users to run Spark workloads without the need to provision and manage their own clusters.
Accelerating Batch Workloads with GPUs on Dataproc Serverless:#
For instructions on how to leverage L4 and A100 GPUs to accelerate Spark workloads, please refer to the following guide.
Dataproc Serverless Runtime RAPIDS Version:#
Version | RAPIDS Accelerator Jar
---|---
1.1 LTS | rapids-4-spark_2.12-23.08.1.jar
2.0 | Coming soon
2.1 | Coming soon
Creating a Serverless Batch Workload with RAPIDS Accelerator:#
To get started with Dataproc Serverless, use the following example:
join-dataframes.py#
```python
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameJoin").getOrCreate()

# Create two DataFrames
df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")

# Perform the join and count
result = df1.join(df2, df1["a"] == df2["b"]).count()

# Display the result
print(result)

# Stop the Spark session
spark.stop()
```
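To check whether the join will actually run on the GPU, you can print the physical plan before executing it. A minimal sketch: with the RAPIDS Accelerator active, CPU operators are replaced by GPU-backed ones whose names carry a Gpu prefix (the exact operator names vary by plugin version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PlanCheck").getOrCreate()

df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")

# With the RAPIDS Accelerator enabled, Gpu-prefixed operators (e.g. a GPU
# hash join) should appear in the plan in place of their CPU counterparts.
df1.join(df2, df1["a"] == df2["b"]).explain()

spark.stop()
```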
Spark Batch Submission Command:#
```bash
NUM_EXECUTORS=2
EXEC_CORES=8
DRIVER_CORES=4
DISK_SIZE=375G
BATCH_ID=join-df
BUCKET=your-bucket

gcloud dataproc batches submit \
  pyspark \
  --batch $BATCH_ID ./join-dataframes.py \
  --deps-bucket $BUCKET \
  --version 1.1 \
  --subnet default \
  --properties \
spark.executor.instances=$NUM_EXECUTORS,\
spark.driver.cores=$DRIVER_CORES,\
spark.executor.cores=$EXEC_CORES,\
spark.dataproc.driver.compute.tier=premium,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.driver.disk.tier=premium,\
spark.dataproc.executor.disk.tier=premium,\
spark.dataproc.driver.disk.size=$DISK_SIZE
```
Refer to the Supported Spark properties documentation to understand how these properties control resource allocation in Dataproc Serverless.
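After submission, the batch can be monitored from the same CLI. A brief sketch; the us-central1 region is an assumption, so substitute the region you submitted to:

```bash
# List recent batches in the region to confirm the submission was accepted.
gcloud dataproc batches list --region=us-central1

# Inspect the state and runtime details of this batch.
gcloud dataproc batches describe join-df --region=us-central1
```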
Logging Configuration for Serverless Workloads:#
Event Logs: There are two options for configuring event logs on Serverless (see the example after this list):

1. Using a Persistent History Server (PHS): Use the `--history-server-cluster` argument to attach the History Server, which will automatically configure `spark.eventLog.dir`.
2. Explicitly using `spark.eventLog` properties: Configure event logs by setting `spark.eventLog.enabled=true` and `spark.eventLog.dir=gs://<bucket-name>/eventLogs`.
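For the second option, the event log properties can simply be appended to the `--properties` list of the submit command. A minimal sketch, where your-bucket is a placeholder for a bucket writable by the batch's service account:

```bash
# Resubmit the sample job with event logging enabled (bucket name is a placeholder).
gcloud dataproc batches submit \
  pyspark \
  --batch join-df-eventlog ./join-dataframes.py \
  --deps-bucket your-bucket \
  --version 1.1 \
  --subnet default \
  --properties \
spark.eventLog.enabled=true,\
spark.eventLog.dir=gs://your-bucket/eventLogs
```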
Driver Logs: You can access Driver Logs in the Output section of the UI.
Executor Logs:
Serverless exposes executor logs through Log Explorer. To view them, click VIEW LOGS and enter the following query in the search bar, substituting the appropriate batch_id:
```
resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
severity=ERROR
```
- You can also pick the severity level from: INFO, DEFAULT, ERROR, or WARNING.
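The same filter also works from the command line through gcloud logging read; a sketch, assuming the gcloud CLI is authenticated against the project that ran the batch:

```bash
# Fetch recent ERROR-level executor log entries for the batch from Cloud Logging.
gcloud logging read '
resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
severity=ERROR
' --limit=50
```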
Cost:#
Refer to Dataproc Serverless Pricing for details on calculating the cost of a GPU batch workload.