Overview
The qualification tool analyzes Spark event logs generated by CPU-based Spark applications to help quantify the expected acceleration from migrating a Spark application or query to GPU.
The tool first analyzes the CPU event log and determines which operators are likely to run on the GPU. It then uses estimates derived from historical queries and benchmarks to compute a per-operator speed-up, i.e., how much each operator would accelerate on GPU for the specific query or application. From these, it calculates an Estimated GPU App Duration by adding the accelerated operator durations to the durations that cannot run on GPU because the operators are unsupported or are not SQL/DataFrame operations.
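As a rough illustration of that arithmetic, the sketch below computes an estimated duration from a handful of operators. The operator names, durations, and speed-up factors are invented for this example; in the real tool, per-operator speed-ups come from the benchmark data described later in this section.

```scala
// Hypothetical sketch of the Estimated GPU App Duration arithmetic.
// All operator names, durations, and speed-up factors are invented.
case class OpEstimate(name: String, cpuDurationMs: Long, gpuSpeedup: Option[Double])

val ops = Seq(
  OpEstimate("FilterExec",  40000L, Some(2.5)), // supported: duration shrinks by its speed-up
  OpEstimate("ProjectExec", 20000L, Some(3.0)), // supported
  OpEstimate("MyCustomUDF", 15000L, None)       // unsupported: keeps its CPU duration
)

val estimatedGpuAppDurationMs = ops.map {
  case OpEstimate(_, d, Some(speedup)) => (d / speedup).toLong // accelerated operator
  case OpEstimate(_, d, None)          => d                    // cannot run on GPU
}.sum
// 40000/2.5 + 20000/3.0 + 15000 = 16000 + 6666 + 15000 = 37666 ms
```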
This tool is intended to give users a starting point; it does not guarantee that the queries or applications with the highest recommendation will be accelerated the most. Currently, it bases its recommendation on the amount of time spent in tasks of SQL DataFrame operations. Note that the qualification tool's estimates assume the application runs on a dedicated cluster where it can use all of the available Spark resources.
Estimates of GPU duration are available for several environments and are based on benchmarks run in each applicable environment. The following table lists the cluster configurations used to run the benchmarks.
| Environment | CPU Cluster | GPU Cluster |
|---|---|---|
| On-prem | 8x 128-core | 8x 128-core + 8x A100 40 GB |
| Dataproc (T4) | 4x n1-standard-32 | 4x n1-standard-32 + 8x T4 16 GB |
| Dataproc (L4) | 8x n1-standard-16 | 8x g2-standard-16 |
| EMR | 8x m5d.8xlarge | 4x g4dn.12xlarge |
| Databricks AWS | 8x m6gd.8xlarge | 8x g5.8xlarge |
| Databricks Azure | 8x E8ds_v4 | 8x NC8as_T4_v3 |
Estimates provided by the qualification tool are based on the currently supported “SparkPlan” or “Executor Nodes” used in the application; the tool does not yet handle all expressions or data types. Please refer to the Understanding Execs report section and the Supported Operators guide to check whether the types and expressions you are using are supported.
The qualification tool can be run in two different ways:

- RAPIDS Accelerator for Apache Spark CLI tool
- Java API

The CLI tool is the simplest way to run the qualification tool. Run standalone on Spark event logs, it is available as a user tool command via a pip package for CSP environments (Google Dataproc, AWS EMR, Databricks AWS, and Databricks Azure) as well as on-prem. In that setting, the qualification tool runs against logs from your CSP environment and outputs the applications recommended for acceleration along with estimated speed-up and cost-saving metrics. For more information on running the qualification tool from the pip package, visit the quickstart guide.

The Java API can be used for environments that are not supported by the CLI tool. It allows the tool to run in three different ways:

- as a standalone tool on the Spark event logs after the application(s) have run,
- integrated into a running Spark application using explicit API calls, and
- installed as a Spark listener that can output results on a per-SQL-query basis (see the sketch below).
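As an illustration of the last two modes, the following sketch shows how a running Spark application might create the qualification app, register its listener, and fetch per-query results. It is modeled on the RunningQualificationApp API from the tool's Java/Scala library; the package name and method set shown here reflect documented usage but may differ between releases, so treat this as a sketch rather than a drop-in integration.

```scala
import org.apache.spark.sql.SparkSession
// Assumed package per the tool's documentation; verify against your release.
import org.apache.spark.sql.rapids.tool.qualification.RunningQualificationApp

object QualificationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("qualification-demo").getOrCreate()

    // Create the qualification app and register its listener so it observes
    // SQL query events as the application runs.
    val qualApp = new RunningQualificationApp()
    spark.sparkContext.addSparkListener(qualApp.getEventListener)

    // ... run the normal SQL/DataFrame workload here ...
    spark.range(1000).selectExpr("sum(id)").collect()

    // Retrieve the qualification output (summary and per-query detail)
    // before shutting down.
    println(qualApp.getSummary())
    println(qualApp.getDetailed())

    spark.stop()
  }
}
```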