Databricks Deployment Guide#
Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. The NVIDIA RAPIDS Accelerator for Apache Spark is available on Databricks, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by Databricks and enables users to run their Spark workloads with optimized performance and efficiency.
Overview of Steps#
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on a Databricks cluster.
The RAPIDS Accelerator Java Archive (JAR) file provides the dependencies and classes that enable GPU acceleration of Apache Spark workloads. Once the cluster is created, a GPU-accelerated Spark job/application is submitted.
Prerequisites#
Prior to getting started with the RAPIDS Accelerator for Apache Spark on Databricks, ensure you have the following prerequisites:
NGC Account with NGC Catalog Access
Connectivity#
A simple way to check connectivity to Databricks is to list the contents of the workspace:
databricks workspace ls
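If the command fails with an authentication error, the Databricks CLI may not be configured yet. A minimal setup, assuming you authenticate with a personal access token (the workspace URL and token are entered at the prompts), looks like:

# Configure the Databricks CLI with a workspace URL and personal access token:
databricks configure --token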
Upload Jar and Initialization Script#
To speed up cluster creation, upload the jar file to a Databricks File System (DBFS) directory and create an initialization script for the cluster.
From the command line, navigate to the directory containing the jar file downloaded from NGC. Replace JAR_NAME with the name of your jar, then run the following commands to upload the jar and create an initialization script that copies the jar from DBFS onto each node in the cluster:
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar

# Create directory for jars.
dbfs mkdirs dbfs:/FileStore/jars

# Copy local jar into DBFS for reuse:
dbfs cp ./$JAR_NAME dbfs:/FileStore/jars

# Create init script:
echo "sudo cp /dbfs/FileStore/jars/${JAR_NAME} /databricks/jars/" > init.sh

# Upload init.sh:
dbfs cp ./init.sh dbfs:/databricks/init_scripts/init.sh --overwrite
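As an optional sanity check before creating the cluster, the uploads can be read back from DBFS:

# Verify that the jar and the init script are present in DBFS:
dbfs ls dbfs:/FileStore/jars
dbfs cat dbfs:/databricks/init_scripts/init.sh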
Cluster Creation with Databricks Clusters API#
By going through the command-line interface, users can bypass some of the limitations placed on the GUI, such as using CPU-only driver nodes on a GPU-accelerated cluster for additional cost savings. The CLI also exposes additional features that may not yet be available in the GUI.
The following is a cluster creation script for a basic GPU-accelerated cluster with a CPU-only driver node on AWS Databricks:
databricks clusters create --json '{
"cluster_name": "my-cluster",
"spark_version": "10.4.x-gpu-ml-scala2.12",
"driver_node_type_id": "m5d.4xlarge",
"node_type_id": "g4dn.4xlarge",
"num_workers": "2",
"spark_conf": {
  "spark.plugins": "com.nvidia.spark.SQLPlugin",
  "spark.task.resource.gpu.amount": "0.1",
  "spark.rapids.memory.pinnedPool.size": "2G",
  "spark.rapids.sql.concurrentGpuTasks": "2",
  "spark.databricks.optimizer.dynamicFilePruning": "false"
  },
"init_scripts": [{
  "dbfs": {"destination": "dbfs:/databricks/init_scripts/init.sh"}
  }]
}'
In Azure Databricks, instance types will need to be replaced for the cluster to be valid. An example substitution is provided:
1"driver_node_type_id": "Standard_E16ads_v5",
2"node_type_id": "Standard_NC16as_T4_v3",
The cluster_id is provided in the response to the cluster creation command above. Save this cluster_id for use in subsequent steps. The cluster_id can also be viewed by selecting the cluster in the GUI and inspecting the URL:
https://<databricks-instance>/#/setting/clusters/<cluster-id>
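If the cluster specification is kept in a local file, the same call can be made with --json-file, and the cluster_id can be captured directly from the response. A minimal sketch, assuming jq is installed and the JSON above is saved as a hypothetical cluster.json:

# Create the cluster from a JSON file and capture the returned cluster_id:
export CLUSTER_ID=$(databricks clusters create --json-file cluster.json | jq -r '.cluster_id')
echo $CLUSTER_ID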
Additional information on the Databricks Clusters API can be found in the Databricks documentation. To confirm that the cluster was created, list the clusters in the workspace:
databricks clusters list
Validation#
Create a job using the CLI. Enter your cluster_id as the existing_cluster_id in the JSON:
# Download pi example jar:
wget https://docs.databricks.com/_static/examples/SparkPi-assembly-0.1.jar

# Upload jar to dbfs:/docs/sparkpi.jar
dbfs cp ./SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar

# Create job:
databricks jobs create --json '{
"name": "pi_computation",
"max_concurrent_runs": 1,
"tasks": [
  {
  "task_key": "pi_test",
  "description": "Compute pi",
  "depends_on": [ ],
  "existing_cluster_id": "<cluster_id>",
  "spark_jar_task": {
    "main_class_name": "org.apache.spark.examples.SparkPi",
    "parameters": ["1000"]
    },
  "libraries": [{ "jar": "dbfs:/docs/sparkpi.jar"}]
  }]
}'
Ensure that the cluster is online and ready to accept new tasks; it takes about 10 minutes to launch a GPU-accelerated cluster. The cluster status can be viewed in the GUI under the “Compute” tab, or in the CLI with: databricks clusters list.
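If you prefer to wait from the command line, the cluster state can be polled until it reports RUNNING. A minimal sketch, assuming jq is installed and <cluster_id> is replaced with your saved value:

# Poll the cluster state until it is RUNNING:
until databricks clusters get --cluster-id <cluster_id> | jq -e '.state == "RUNNING"' > /dev/null; do
  echo "Waiting for cluster to start..."
  sleep 30
done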
Run the job by going to the Workflows tab, finding the job called “pi_computation”, and pressing the “Run now” button in the upper right. Alternatively, the job_id given in the response can be used with:
databricks jobs run-now --job-id <job_id>
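The run-now response includes a run_id, which can also be used to follow the run from the CLI. A minimal sketch; replace <run_id> with the value from the response:

# Check the life cycle and result state of the run:
databricks runs get --run-id <run_id>

# Once the run finishes, retrieve its output and log snippet:
databricks runs get-output --run-id <run_id>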
Inspect the logs by clicking on the Logs link under the Spark column for the recently completed run.
Then look at the “Standard output” to see the estimated value for π.
Cluster Cleanup#
Use the GUI to terminate the cluster by selecting Compute >> [cluster name] >> Terminate. Terminated clusters can be restarted at a later time with the same configuration and cluster_id intact. You can also delete the cluster entirely by clicking the three-dot menu next to the cluster name and selecting “Delete”.
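The same cleanup can be done from the CLI if preferred; these commands assume <cluster_id> is the value saved earlier:

# Terminate the cluster (it can be restarted later with the same cluster_id):
databricks clusters delete --cluster-id <cluster_id>

# Permanently remove the cluster configuration:
databricks clusters permanent-delete --cluster-id <cluster_id>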