Databricks Deployment Guide
Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. The NVIDIA RAPIDS Accelerator for Apache Spark is available on Databricks, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by Databricks and enables users to run their Spark workloads with optimized performance and efficiency.
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on a Databricks cluster.
The RAPIDS Accelerator Java Archive (JAR) file provides the dependencies and classes that enable GPU acceleration of Apache Spark workloads. Once the cluster is created, a GPU-accelerated Spark job or application can be submitted to it.
Prior to getting started with the RAPIDS Accelerator for Apache Spark on Databricks, ensure you have the following prerequisites:
NGC Account with Enterprise Catalog Access
A simple way to check connectivity to Databricks is to list the contents of the workspace.
databricks workspace ls
To speed up cluster creation, upload the jar file to a Databricks File System (DBFS) directory and create an initialization script for the cluster.
From the command line, navigate to the directory that contains the jar file downloaded from NGC. Replace JAR_NAME with the name of your jar. Run the following commands to upload the jar and create an initialization script that copies the jar from DBFS onto each node in the cluster:
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar
# Create directory for jars.
dbfs mkdirs dbfs:/FileStore/jars
# Copy local jar into DBFS for reuse:
dbfs cp ./$JAR_NAME dbfs:/FileStore/jars
# Create init script:
echo "sudo cp /dbfs/FileStore/jars/${JAR_NAME} /databricks/jars/" > init.sh
# Upload init.sh:
dbfs cp ./init.sh dbfs:/databricks/init_scripts/init.sh --overwrite
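Because a missing space in the copy command only shows up when the cluster starts, it can help to inspect the init script locally before uploading it. The following is a minimal sketch rather than part of the official procedure. It assumes the same JAR_NAME value as above and writes the script with a heredoc instead of echo; the check at the end is an addition of this sketch.

```shell
export JAR_NAME=rapids-4-spark_2.12-23.02.0.jar

# Write init.sh; the unquoted EOF delimiter expands ${JAR_NAME} at write time.
cat > init.sh <<EOF
# Copy the RAPIDS Accelerator jar from DBFS into the Spark jars directory.
sudo cp /dbfs/FileStore/jars/${JAR_NAME} /databricks/jars/
EOF

# Warn if the jar is not in the current directory (the upload would fail).
[ -f "./${JAR_NAME}" ] || echo "warning: ./${JAR_NAME} not found in $(pwd)" >&2

# Confirm that source and destination are two separate arguments to cp.
if grep -q "^sudo cp /dbfs/FileStore/jars/${JAR_NAME} /databricks/jars/\$" init.sh; then
  echo "init.sh looks correct"
fi
```

Once the check passes, init.sh can be uploaded with the dbfs cp command shown above.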
The command line interface allows configurations that the GUI does not, such as using a CPU-only driver node on a GPU-accelerated cluster for additional cost savings. The CLI may also expose features that are not yet available in the GUI.
The following is a cluster creation script for a basic GPU-accelerated cluster with a CPU-only driver node on AWS Databricks:
databricks clusters create --json '{
  "cluster_name": "my-cluster",
  "spark_version": "10.4.x-gpu-ml-scala2.12",
  "driver_node_type_id": "m5d.4xlarge",
  "node_type_id": "g4dn.4xlarge",
  "num_workers": 2,
  "spark_conf": {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.task.resource.gpu.amount": "0.1",
    "spark.rapids.memory.pinnedPool.size": "2G",
    "spark.rapids.sql.concurrentGpuTasks": "2",
    "spark.databricks.optimizer.dynamicFilePruning": "false"
  },
  "init_scripts": [{
    "dbfs": {"destination": "dbfs:/databricks/init_scripts/init.sh"}
  }]
}'
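If the same definition is needed on more than one cloud, the instance types can be parameterized instead of edited by hand. The sketch below is one possible way to do that. The helper name cluster_json is hypothetical and not part of the Databricks CLI; the AWS instance types are the ones from the script above, and the Azure types are the example substitution given in this guide.

```shell
# Sketch: print the cluster definition for a given cloud ("aws" or "azure").
# cluster_json is a hypothetical helper, not a Databricks CLI command.
cluster_json() {
  local driver_type worker_type
  case "$1" in
    aws)   driver_type="m5d.4xlarge";        worker_type="g4dn.4xlarge" ;;
    azure) driver_type="Standard_E16ads_v5"; worker_type="Standard_NC16as_T4_v3" ;;
    *)     echo "unsupported cloud: $1" >&2; return 1 ;;
  esac
  # Unquoted EOF so that the two instance type variables are expanded.
  cat <<EOF
{
  "cluster_name": "my-cluster",
  "spark_version": "10.4.x-gpu-ml-scala2.12",
  "driver_node_type_id": "${driver_type}",
  "node_type_id": "${worker_type}",
  "num_workers": 2,
  "spark_conf": {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.task.resource.gpu.amount": "0.1",
    "spark.rapids.memory.pinnedPool.size": "2G",
    "spark.rapids.sql.concurrentGpuTasks": "2",
    "spark.databricks.optimizer.dynamicFilePruning": "false"
  },
  "init_scripts": [{
    "dbfs": {"destination": "dbfs:/databricks/init_scripts/init.sh"}
  }]
}
EOF
}

# Usage (requires a configured Databricks CLI):
# databricks clusters create --json "$(cluster_json azure)"
```

Keeping the definition in one function means the Spark configuration stays identical across clouds and only the two node type fields differ.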
In Azure Databricks, instance types will need to be replaced for the cluster to be valid. An example substitution is provided:
"driver_node_type_id": "Standard_E16ads_v5",
"node_type_id": "Standard_NC16as_T4_v3",
The cluster_id is provided as a response to the above cluster creation command in the CLI. Save this cluster_id for use in subsequent steps. The cluster_id can also be viewed by selecting the cluster in the GUI and inspecting the URL.
https://<databricks-instance>/#/setting/clusters/<cluster-id>
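To avoid copying the ID by hand, it can be captured in a shell variable. The following is a rough sketch that assumes the creation command prints a JSON response containing a "cluster_id" field. The helper name extract_cluster_id is hypothetical, and the ID in the example is made up for illustration.

```shell
# Sketch: pull the "cluster_id" value out of a JSON response read from stdin.
# extract_cluster_id is a hypothetical helper, not a Databricks CLI command.
extract_cluster_id() {
  sed -n 's/.*"cluster_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example with a made-up response; the ID below is illustrative only.
echo '{"cluster_id": "0123-456789-abcde123"}' | extract_cluster_id

# Usage with the real command (requires a configured Databricks CLI):
# CLUSTER_ID=$(databricks clusters create --json '{ ... }' | extract_cluster_id)
```

A JSON-aware tool is more robust than sed if one is installed; sed is used here only to keep the sketch free of extra dependencies.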
Additional information can be found in the Databricks Clusters API documentation. Existing clusters can also be listed from the CLI:
databricks clusters list
Create a job using the CLI. Enter your cluster_id as the existing_cluster_id in the JSON:
# Download pi example jar:
wget https://docs.databricks.com/_static/examples/SparkPi-assembly-0.1.jar
# Upload jar to /docs/sparkpi.jar
dbfs cp ./SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar
# Create job:
databricks jobs create --json '{
  "name": "pi_computation",
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "pi_test",
      "description": "Compute pi",
      "depends_on": [ ],
      "existing_cluster_id": "<cluster_id>",
      "spark_jar_task": {
        "main_class_name": "org.apache.spark.examples.SparkPi",
        "parameters": ["1000"]
      },
      "libraries": [{ "jar": "dbfs:/docs/sparkpi.jar"}]
    }]
}'
Ensure that the cluster is online and ready to accept new tasks. It takes about 10 minutes to launch a GPU-accelerated cluster. The cluster status can be viewed in the GUI under the “Compute” tab or in the CLI with databricks clusters list.
Run the job by going to the Workflows tab, finding the job called “pi_computation”, and pressing the “Run now” button on the upper right. Alternatively, use the job_id given in the response to the job creation command:
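Rather than checking the status by hand, a script can poll until the cluster is ready. The following is a minimal sketch under stated assumptions: that databricks clusters get --cluster-id prints JSON with a top-level "state" field, and that RUNNING, TERMINATED, and ERROR are among the reported states. The helper names get_cluster_state and wait_for_cluster are hypothetical.

```shell
# Sketch: poll until the cluster reports the RUNNING state.
# Both helpers are hypothetical; they are not Databricks CLI commands.

# Print the cluster state (assumes JSON output with a "state" field).
get_cluster_state() {
  databricks clusters get --cluster-id "$1" \
    | sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# wait_for_cluster <cluster_id> [max_attempts] [interval_seconds]
wait_for_cluster() {
  local cluster_id="$1"
  local max_attempts="${2:-40}"   # 40 attempts x 30 s = 20 minutes by default
  local interval="${3:-30}"
  local attempt=1
  local state
  while [ "$attempt" -le "$max_attempts" ]; do
    state=$(get_cluster_state "$cluster_id")
    echo "attempt ${attempt}: state=${state}"
    case "$state" in
      RUNNING) return 0 ;;              # ready to accept jobs
      TERMINATED|ERROR) return 1 ;;     # will not become ready without action
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1                              # timed out
}

# Usage (requires a configured Databricks CLI):
# wait_for_cluster "<cluster_id>" && echo "cluster is ready"
```

The default timeout of 20 minutes is an arbitrary choice that leaves headroom over the typical launch time of about 10 minutes.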
databricks jobs run-now --job-id <job_id>
Inspect the logs by clicking on the Logs link under the Spark column for the recently completed run.
Then look at the “Standard output” to see the estimated value for π.
Use the GUI to terminate the cluster by selecting Compute >> [cluster name] >> Terminate. Terminated clusters can be restarted at a future time with the same configurations and cluster_id intact. You can also delete the cluster entirely by clicking the three-dot menu next to the cluster name and selecting “Delete”.
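Termination can also be scripted. The sketch below assumes the legacy Databricks CLI syntax used elsewhere in this guide, in which databricks clusters delete terminates a cluster (despite the name) and databricks clusters permanent-delete removes it entirely. Verify both subcommands against your CLI version before relying on them. The wrapper name terminate_cluster is hypothetical.

```shell
# Sketch: terminate a cluster from the CLI.
# terminate_cluster is a hypothetical wrapper, not a Databricks CLI command.
terminate_cluster() {
  if [ -z "$1" ]; then
    echo "usage: terminate_cluster <cluster_id>" >&2
    return 2
  fi
  # Assumption: "clusters delete" terminates the cluster but keeps its
  # configuration and cluster_id so that it can be restarted later.
  databricks clusters delete --cluster-id "$1"
}

# Usage (requires a configured Databricks CLI):
# terminate_cluster "<cluster_id>"
# To remove the cluster entirely (assumed subcommand, cannot be undone):
# databricks clusters permanent-delete --cluster-id "<cluster_id>"
```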