RAPIDS Accelerator on Dataproc Serverless

Dataproc Serverless allows users to run Spark workloads without the need to provision and manage their own clusters.

Accelerating Batch Workloads with GPUs on Dataproc Serverless:

For instructions on using L4 and A100 GPUs to accelerate Spark workloads, refer to the Dataproc Serverless documentation on GPU-accelerated batch workloads.

Dataproc Serverless Runtime RAPIDS Version:

Version    RAPIDS Accelerator Jar
1.1 LTS    rapids-4-spark_2.12-23.08.1.jar
2.0        Coming soon
2.1        Coming soon

Creating a Serverless Batch Workload with RAPIDS Accelerator:

To get started with Dataproc Serverless, use the following example:

join-dataframes.py

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameJoin").getOrCreate()

# Create two DataFrames
df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")

# Perform the join and count
result = df1.join(df2, df1["a"] == df2["b"]).count()

# Display the result
print(result)

# Stop the Spark session
spark.stop()
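
As an optional sanity check (a sketch, not part of the original script), you can print the physical plan before stopping the session: with the RAPIDS Accelerator active, operators appear with a Gpu prefix (for example GpuShuffledHashJoin), while a CPU-only run shows the standard operator names.

# Optional: inspect the physical plan before calling spark.stop().
# With the RAPIDS Accelerator enabled, operator names carry a "Gpu" prefix.
joined = df1.join(df2, df1["a"] == df2["b"])
joined.explain()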

Spark Batch Submission Command:

NUM_EXECUTORS=2
EXEC_CORES=8
DRIVER_CORES=4
DISK_SIZE=375G
BATCH_ID=join-df
BUCKET=your-bucket

gcloud dataproc batches submit \
pyspark \
--batch $BATCH_ID ./join-dataframes.py \
--deps-bucket $BUCKET \
--version 1.1 \
--subnet default \
--properties \
spark.executor.instances=$NUM_EXECUTORS,\
spark.driver.cores=$DRIVER_CORES,\
spark.executor.cores=$EXEC_CORES,\
spark.dataproc.driver.compute.tier=premium,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.driver.disk.tier=premium,\
spark.dataproc.executor.disk.tier=premium,\
spark.dataproc.driver.disk.size=$DISK_SIZE

Refer to the Supported Spark properties for Dataproc Serverless to understand the resource allocation properties used above.
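
After submission, you can also monitor the batch from the CLI. A minimal sketch (add --region=<region> if your gcloud configuration does not set a default):

# Check the state of the submitted batch (PENDING, RUNNING, SUCCEEDED, FAILED).
gcloud dataproc batches describe join-df

# List recent batches in the project.
gcloud dataproc batches list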

Logging Configuration for Serverless Workloads:

Event Logs: There are two options for configuring event logs on Serverless:

  1. Using a Persistent History Server (PHS): Use the --history-server-cluster argument to attach a History Server, which automatically configures spark.eventLog.dir.

  2. Explicitly setting the spark.eventLog properties: Enable event logging by setting spark.eventLog.enabled=true and spark.eventLog.dir=gs://<bucket-name>/eventLogs, as in the sketch below.
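
For the second option, a minimal sketch of the submission command with event logging enabled (the batch ID, bucket, and gs:// path are placeholders):

gcloud dataproc batches submit \
pyspark \
--batch join-df-eventlog ./join-dataframes.py \
--deps-bucket your-bucket \
--version 1.1 \
--subnet default \
--properties \
spark.eventLog.enabled=true,\
spark.eventLog.dir=gs://<bucket-name>/eventLogs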

Driver Logs: Driver logs are available in the Output section of the batch details UI.

Executor Logs: Serverless uses the Logs Explorer to provide access to executor logs. To view them, click VIEW LOGS and run the following query in the search bar, substituting the appropriate batch_id:

resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
severity=ERROR

You can also filter by one of the following severity levels:

  1. DEFAULT

  2. INFO

  3. WARNING

  4. ERROR
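
The same logs can also be pulled from the command line with gcloud logging read. A sketch: the SEARCH clause above is a Logs Explorer feature, so this filter matches on the resource labels and severity instead.

gcloud logging read \
  'resource.type="cloud_dataproc_batch" AND resource.labels.batch_id="join-df" AND severity>=ERROR' \
  --limit=50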

Cost:

Refer to Dataproc Serverless Pricing for details on calculating the cost of a GPU batch workload.
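
As a rough illustration only: the sketch below assumes the bill is the sum of compute (DCU-hours), accelerator (GPU-hours), and shuffle-storage charges, and every rate and usage figure in it is a hypothetical placeholder; consult the pricing page for the authoritative model and current rates.

# Back-of-the-envelope cost sketch for a GPU batch workload.
# All rates and usage figures are hypothetical placeholders; substitute
# values from the Dataproc Serverless pricing page and your billing report.
DCU_HOUR_RATE = 0.06            # placeholder $/DCU-hour
GPU_HOUR_RATE = 0.60            # placeholder $/GPU-hour
SHUFFLE_GB_MONTH_RATE = 0.04    # placeholder $/GB-month

dcu_hours = 10.0                # placeholder: from the batch's billing report
gpu_hours = 1.0                 # placeholder: GPUs x runtime hours
shuffle_gb_months = 0.5         # placeholder: from the billing report

estimate = (dcu_hours * DCU_HOUR_RATE
            + gpu_hours * GPU_HOUR_RATE
            + shuffle_gb_months * SHUFFLE_GB_MONTH_RATE)
print(f"Estimated cost: ${estimate:.2f}")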