RAPIDS Accelerator on Dataproc Serverless
Dataproc Serverless allows users to run Spark workloads without the need to provision and manage their own clusters.
For instructions on how to leverage L4 and A100 GPUs to accelerate Spark workloads, please refer to the following guide.
Version | RAPIDS Accelerator Jar |
---|---|
1.1 LTS | rapids-4-spark_2.12-23.08.1.jar |
2.0 | Coming soon |
2.1 | Coming soon |
To get started with Dataproc Serverless, use the following example:
join-dataframes.py
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameJoin").getOrCreate()
# Create two DataFrames
df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")
# Perform the join and count
result = df1.join(df2, df1["a"] == df2["b"]).count()
# Display the result
print(result)
# Stop the Spark session
spark.stop()
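One way to check whether a query such as the join above was planned onto the GPU is to look at its physical plan: the RAPIDS Accelerator names its replacement operators with a Gpu prefix (for example GpuProject). The sketch below is a hypothetical helper, not part of the Spark or RAPIDS API. It only scans plan text, which you could obtain by keeping the joined DataFrame in a variable and capturing the output of its explain() call. The sample plan string is hand-written for illustration and is not output from a real run.

```python
import re

def gpu_operators(plan_text):
    """Return distinct operator names starting with 'Gpu', in order of first appearance."""
    seen = []
    for name in re.findall(r"\bGpu[A-Z]\w*", plan_text):
        if name not in seen:
            seen.append(name)
    return seen

# Hand-written, illustrative plan fragment (an assumption, not captured output).
sample_plan = """
GpuColumnarToRow
+- GpuShuffledHashJoin [a], [b], Inner
   :- GpuRange (1, 10000001)
   +- GpuRange (1, 10000001)
"""

# An empty result would suggest the plan stayed entirely on the CPU.
print(gpu_operators(sample_plan))
```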
Spark Batch Submission Command:
NUM_EXECUTORS=2
EXEC_CORES=8
DRIVER_CORES=4
DISK_SIZE=375G
BATCH_ID=join-df
BUCKET=your-bucket
gcloud dataproc batches submit \
pyspark \
--batch $BATCH_ID ./join-dataframes.py \
--deps-bucket $BUCKET \
--version 1.1 \
--subnet default \
--properties \
spark.executor.instances=$NUM_EXECUTORS,\
spark.driver.cores=$DRIVER_CORES,\
spark.executor.cores=$EXEC_CORES,\
spark.dataproc.driver.compute.tier=premium,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.driver.disk.tier=premium,\
spark.dataproc.executor.disk.tier=premium,\
spark.dataproc.driver.disk.size=$DISK_SIZE
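The comma-separated --properties value is easy to break with stray whitespace when it is split across shell continuation lines. As a minimal sketch, the same command can be assembled in Python from a dictionary. The helper name build_submit_command is hypothetical; the flags and property values are the ones used in the command above.

```python
import shlex

def build_submit_command(batch_id, script, bucket, properties,
                         version="1.1", subnet="default"):
    """Assemble the gcloud argument list; properties are joined without spaces."""
    props = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        "--batch", batch_id,
        "--deps-bucket", bucket,
        "--version", version,
        "--subnet", subnet,
        "--properties", props,
    ]

# Same values as the shell variables in the submission command.
properties = {
    "spark.executor.instances": 2,
    "spark.driver.cores": 4,
    "spark.executor.cores": 8,
    "spark.dataproc.driver.compute.tier": "premium",
    "spark.dataproc.executor.compute.tier": "premium",
    "spark.dataproc.executor.resource.accelerator.type": "l4",
    "spark.dataproc.driver.disk.tier": "premium",
    "spark.dataproc.executor.disk.tier": "premium",
    "spark.dataproc.driver.disk.size": "375G",
}

cmd = build_submit_command("join-df", "./join-dataframes.py", "your-bucket", properties)
print(shlex.join(cmd))  # review the command before running it
```

Passing cmd to subprocess.run(cmd, check=True) would submit the batch; printing it first makes the final property string easy to review.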
Refer to the Supported Spark properties to understand the resource allocation properties in Dataproc Serverless.
Event Logs: There are two options for configuring event logs on Serverless:
- Using Persistent History Server (PHS): Use the --history-server-cluster argument to attach the History Server, which automatically configures spark.eventLog.dir.
- Explicitly using spark.eventLog properties: Configure event logs by setting spark.eventLog.enabled=true and spark.eventLog.dir=gs://<bucket-name>/eventLogs.
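For the explicit option, the two settings can be generated from a bucket name and appended to the --properties value of the batch submission. This is a small sketch; the function name event_log_properties and the bucket name your-bucket are placeholders chosen for illustration.

```python
def event_log_properties(bucket_name):
    """Return the two Spark properties that enable event logging to a GCS bucket."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": f"gs://{bucket_name}/eventLogs",
    }

# Render as a fragment that can be appended to the --properties value.
fragment = ",".join(f"{k}={v}" for k, v in event_log_properties("your-bucket").items())
print(fragment)
```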
Driver Logs: You can access Driver Logs in the Output section of the UI.
Executor Logs:
Dataproc Serverless exposes Executor Logs through Logs Explorer. To view them, click VIEW LOGS
and run the following query in the search bar, substituting the appropriate batch_id:
resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
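When checking several batches, the query can be generated per batch ID instead of edited by hand. The helper below is a hypothetical convenience that only builds the three-line filter text shown above; it does not call any Google Cloud API.

```python
def executor_log_filter(batch_id):
    """Build the Logs Explorer query for one batch's executor logs."""
    return "\n".join([
        'resource.type="cloud_dataproc_batch"',
        f'resource.labels.batch_id="{batch_id}"',
        'SEARCH("`executor.log`")',
    ])

# Paste the printed text into the Logs Explorer search bar.
print(executor_log_filter("join-df"))
```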
Refer to Dataproc Serverless Pricing for details on calculating the cost of a GPU batch workload.