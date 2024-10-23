Dataproc Serverless#

Dataproc Serverless allows users to run Spark workloads without the need to provision and manage their own clusters.

Accelerating Batch Workloads with GPUs on Dataproc Serverless:

For instructions on how to leverage L4 and A100 GPUs to accelerate Spark workloads, refer to the following guide.

Dataproc Serverless Runtime RAPIDS Version:

Version    RAPIDS Accelerator Jar
1.1 LTS    rapids-4-spark_2.12-24.02.0.jar
1.2        rapids-4-spark_2.12-24.04.0.jar
2.0        rapids-4-spark_2.13-24.02.0.jar
2.1        rapids-4-spark_2.13-24.02.0.jar
2.2        rapids-4-spark_2.13-24.04.0.jar
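As a small illustration of reading the table above, the following sketch encodes it as a Python dictionary and splits a jar name into its Scala version and RAPIDS release. The names `RAPIDS_JARS` and `describe_jar` are hypothetical and exist only for this example, and the parsing assumes the `rapids-4-spark_<scala>-<release>.jar` naming pattern visible in the table.

```python
# Runtime-to-jar mapping copied from the table above.
RAPIDS_JARS = {
    "1.1 LTS": "rapids-4-spark_2.12-24.02.0.jar",
    "1.2": "rapids-4-spark_2.12-24.04.0.jar",
    "2.0": "rapids-4-spark_2.13-24.02.0.jar",
    "2.1": "rapids-4-spark_2.13-24.02.0.jar",
    "2.2": "rapids-4-spark_2.13-24.04.0.jar",
}


def describe_jar(runtime_version):
    """Return the jar for a runtime version plus its Scala and RAPIDS parts.

    Hypothetical helper: it assumes the jar name follows the pattern
    rapids-4-spark_<scala>-<release>.jar seen in the table.
    """
    jar = RAPIDS_JARS[runtime_version]
    stem = jar[: -len(".jar")]               # e.g. rapids-4-spark_2.12-24.02.0
    _, _, rest = stem.partition("_")         # e.g. 2.12-24.02.0
    scala, _, release = rest.partition("-")  # e.g. 2.12 and 24.02.0
    return {"jar": jar, "scala": scala, "rapids_release": release}


if __name__ == "__main__":
    for version in RAPIDS_JARS:
        print(version, describe_jar(version))
```

A lookup like this can help when checking that the runtime passed to `--version` matches the Scala build of the plugin you expect.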

Creating a Serverless Batch Workload with RAPIDS Accelerator:

NOTE: Dataproc Serverless currently derives the JVM default charset from the container configuration, which is US-ASCII rather than UTF-8. RAPIDS users must explicitly set the JVM option -Dfile.encoding=UTF-8 to use the full capabilities of the plugin. A future release of Dataproc Serverless 2.2 will change the default back to UTF-8.

To get started with Dataproc Serverless, use the following example:

join-dataframes.py

```python
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameJoin").getOrCreate()

# Create two DataFrames
df1 = spark.range(1, 10000001).toDF("a")
df2 = spark.range(1, 10000001).toDF("b")

# Perform the join and count
result = df1.join(df2, df1["a"] == df2["b"]).count()

# Display the result
print(result)

# Stop the Spark session
spark.stop()
```

Spark Batch Submission Command:

```bash
NUM_EXECUTORS=2
EXEC_CORES=8
DRIVER_CORES=4
DISK_SIZE=375G
BATCH_ID=join-df
BUCKET=your-bucket

# Keep the property lines at column 0: each trailing backslash joins the
# next line directly, producing one comma-separated --properties value.
gcloud dataproc batches submit \
    pyspark \
    --batch $BATCH_ID ./join-dataframes.py \
    --deps-bucket $BUCKET \
    --version 1.1 \
    --subnet default \
    --properties \
spark.executor.instances=$NUM_EXECUTORS,\
spark.driver.cores=$DRIVER_CORES,\
spark.executor.cores=$EXEC_CORES,\
spark.dataproc.driver.compute.tier=premium,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.driver.disk.tier=premium,\
spark.dataproc.executor.disk.tier=premium,\
spark.dataproc.driver.disk.size=$DISK_SIZE
```

Refer to the Supported Spark properties to understand the resource allocation properties in Dataproc Serverless.
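The NOTE above asks for -Dfile.encoding=UTF-8, but the submission command does not show where to put it. The sketch below is one possible way to assemble the `--properties` value in Python with that option added. It assumes the flag is passed through the standard Spark properties `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, which this page does not confirm for Dataproc Serverless, and `build_properties` is a hypothetical helper name.

```python
import shlex

# Resource properties taken from the submission command above.
BASE_PROPERTIES = {
    "spark.executor.instances": 2,
    "spark.driver.cores": 4,
    "spark.executor.cores": 8,
    "spark.dataproc.driver.compute.tier": "premium",
    "spark.dataproc.executor.compute.tier": "premium",
    "spark.dataproc.executor.resource.accelerator.type": "l4",
    "spark.dataproc.driver.disk.tier": "premium",
    "spark.dataproc.executor.disk.tier": "premium",
    "spark.dataproc.driver.disk.size": "375G",
}

# Assumption: the JVM option from the NOTE is supplied via the standard
# Spark extraJavaOptions properties for both driver and executors.
ENCODING_PROPERTIES = {
    "spark.driver.extraJavaOptions": "-Dfile.encoding=UTF-8",
    "spark.executor.extraJavaOptions": "-Dfile.encoding=UTF-8",
}


def build_properties(props):
    """Join key/value pairs into one comma-separated --properties value."""
    for key, value in props.items():
        if "," in str(value):
            # A comma inside a value would be read as a property separator.
            raise ValueError(f"value for {key} contains a comma: {value!r}")
    return ",".join(f"{key}={value}" for key, value in props.items())


properties = build_properties({**BASE_PROPERTIES, **ENCODING_PROPERTIES})

# Same flags as the shell command above, expressed as an argument list.
command = [
    "gcloud", "dataproc", "batches", "submit", "pyspark",
    "./join-dataframes.py",
    "--batch", "join-df",
    "--deps-bucket", "your-bucket",
    "--version", "1.1",
    "--subnet", "default",
    "--properties", properties,
]

if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. Pass `command` to `subprocess.run` only after checking the values against your own project.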

Logging Configuration for Serverless Workloads:

Event Logs: There are two options for configuring event logs on Serverless:

Using Persistent History Server (PHS):
- Use the --history-server-cluster argument to attach the History Server, which will automatically configure spark.eventLog.dir.

Explicitly Using spark.eventLog Properties:
- Configure event logs by setting spark.eventLog.enabled=true and spark.eventLog.dir=gs://<bucket-name>/eventLogs.

Driver Logs: You can access Driver Logs in the Output section of the UI.

Executor Logs: Serverless uses Logs Explorer to give users access to Executor Logs. To view Executor Logs, click VIEW LOGS and enter the following query in the search bar with the appropriate batch_id:

```
resource.type="cloud_dataproc_batch"
resource.labels.batch_id="join-df"
SEARCH("`executor.log`")
severity=ERROR
```

You can also choose from the following severity levels:
- INFO
- DEFAULT
- ERROR
- WARNING
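To make the explicit event log settings and the executor log query easier to reuse across batches, here is a minimal sketch that generates both from a bucket name and a batch ID. The function names `event_log_properties` and `executor_log_query` are hypothetical. The property names, query fields, and severity levels are the ones listed above.

```python
# Severity levels listed in this section.
SEVERITIES = ("INFO", "DEFAULT", "ERROR", "WARNING")


def event_log_properties(bucket):
    """Return the explicit spark.eventLog settings for a given bucket name."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": f"gs://{bucket}/eventLogs",
    }


def executor_log_query(batch_id, severity="ERROR"):
    """Build the executor log query shown above for one batch."""
    if severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return "\n".join([
        'resource.type="cloud_dataproc_batch"',
        f'resource.labels.batch_id="{batch_id}"',
        'SEARCH("`executor.log`")',
        f"severity={severity}",
    ])


if __name__ == "__main__":
    print(event_log_properties("your-bucket"))
    print(executor_log_query("join-df", severity="WARNING"))
```

The generated query text can be pasted into the search bar as is. The event log dictionary is meant to be merged into whatever properties you already pass at submission time.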

Cost:

Refer to Dataproc Serverless Pricing for details on calculating the cost of a GPU batch workload.