Overview

What the qualification tool is

The qualification tool analyzes Spark events generated from CPU-based Spark applications to help quantify the expected acceleration from migrating a Spark application or query to GPU.

The tool first analyzes the CPU event log and determines which operators are likely to run on the GPU. It then uses historical queries and benchmarks to estimate a per-operator speed-up, that is, how much each operator would accelerate on GPU for the specific query or application. It calculates an Estimated GPU App Duration by adding the accelerated operator durations to the durations of work that cannot run on GPU, either because the operators are unsupported or because the work is not SQL/DataFrame.
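The calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's actual implementation: the `OperatorTime` class, the `estimated_gpu_app_duration` function, the operator names, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OperatorTime:
    """Hypothetical record of one operator's time in the CPU event log."""
    name: str
    cpu_seconds: float
    supported_on_gpu: bool
    speedup: float = 1.0  # assumed per-operator speed-up factor


def estimated_gpu_app_duration(operators, non_sql_seconds):
    """Sum accelerated operator time plus time that stays on the CPU."""
    total = non_sql_seconds  # non-SQL/DataFrame time is carried over unchanged
    for op in operators:
        if op.supported_on_gpu:
            total += op.cpu_seconds / op.speedup  # accelerated duration
        else:
            total += op.cpu_seconds  # unsupported operators keep their CPU time
    return total


# Illustrative values only; they do not come from any real event log.
ops = [
    OperatorTime("HashAggregate", 120.0, True, 3.0),
    OperatorTime("SortMergeJoin", 200.0, True, 4.0),
    OperatorTime("UnsupportedExec", 30.0, False),
]
estimate = estimated_gpu_app_duration(ops, non_sql_seconds=20.0)
print(estimate)  # 140.0
```

In this sketch the two supported operators shrink from 320 seconds to 90 seconds, while the 50 seconds of unsupported and non-SQL work is unchanged, which is why applications dominated by unsupported work show little estimated benefit.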

This tool is intended to give users a starting point; it does not guarantee that the queries or applications with the highest recommendation will actually be accelerated the most. Currently, its report is based on the amount of time spent in tasks of SQL DataFrame operations. Note that the qualification tool's estimates assume that the application runs on a dedicated cluster where it can use all of the available Spark resources.

The estimates for GPU duration are available for different environments and are based on benchmarks run in the applicable environments. Here is the cluster information for the ETL benchmarks used for the estimates:

Environment	CPU Cluster	GPU Cluster
On-prem	8x 128-core	8x 128-core + 8x A100 40 GB
Dataproc (T4)	4x n1-standard-32	4x n1-standard-32 + 8x T4 16GB
Dataproc (L4)	8x n1-standard-16	8x g2-standard-16
EMR	8x m5d.8xlarge	4x g4dn.12xlarge
Databricks AWS	8x m6gd.8xlarge	8x g5.8xlarge
Databricks Azure	8x E8ds_v4	8x NC8as_T4_v3

Note that all benchmarks were run using the NDS benchmark at SF3K (3 TB).

Important

Estimates provided by the qualification tool are based on the currently supported “SparkPlan” or “Executor Nodes” used in the application. The tool currently does not handle all the expressions or data types used. Please refer to the Understanding Execs report section and the Supported Operators guide to check that the types and expressions you are using are supported.

How to run the qualification tool

The simplest way to run the qualification tool is with the RAPIDS Accelerator for Apache Spark CLI tool. This lets you run the tool against logs from a number of CSP platforms in addition to on-prem.

When run standalone on Spark event logs, the tool can be run as a user tool command via a pip package for CSP environments (Google Dataproc, AWS EMR, Databricks AWS) or as a Java application for other environments.

CLI prerequisites and setup

Running the qualification tool with the CLI

The qualification tool runs against logs from your CSP environment and then outputs the applications recommended for acceleration, along with estimated speed-up and cost saving metrics.

Usage: spark_rapids qualification --platform <CSP> --cpu_cluster <CLUSTER> --eventlogs <EVENTLOGS-PATH>

The supported CSPs are dataproc, emr, databricks-aws and databricks-azure. The EVENTLOGS-PATH should be the storage location for your event logs. For Dataproc, it should be set to the GCS path. For EMR and Databricks-AWS, it should be set to the S3 path. The CLUSTER can be a live cluster or a configuration file representing the cluster instances and size. More details are in the above documentation links per CSP environment.
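If you launch the CLI from a script, the usage line above can be assembled programmatically. The following is a minimal sketch that uses only the three flags and four platform names listed above; the helper name `build_qualification_command`, the cluster name `my-cpu-cluster`, and the bucket path `gs://my-bucket/eventlogs` are hypothetical placeholders.

```python
import shlex

# Platform names as listed in this section.
SUPPORTED_PLATFORMS = {"dataproc", "emr", "databricks-aws", "databricks-azure"}


def build_qualification_command(platform, cpu_cluster, eventlogs):
    """Return the argument list for the usage line shown above."""
    if platform not in SUPPORTED_PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    return [
        "spark_rapids", "qualification",
        "--platform", platform,
        "--cpu_cluster", cpu_cluster,
        "--eventlogs", eventlogs,
    ]


# Hypothetical cluster name and GCS path, for illustration only.
cmd = build_qualification_command(
    "dataproc", "my-cpu-cluster", "gs://my-bucket/eventlogs"
)
print(shlex.join(cmd))
```

Assuming the CLI is installed and on your PATH, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.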

Help (to see all options available): spark_rapids qualification --help

Example output:

+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+
|    | App Name   | App ID                         | Recommendation       |   Estimated GPU |   Estimated GPU |           App |   Estimated GPU |
|    |            |                                |                      |         Speedup |     Duration(s) |   Duration(s) |      Savings(%) |
|----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------|
|  0 | query24    | application_1664888311321_0011 | Strongly Recommended |            3.49 |          257.18 |        897.68 |           59.70 |
|  3 | query64    | application_1664888311321_0008 | Strongly Recommended |            2.91 |          150.81 |        440.30 |           51.82 |
|  4 | query50    | application_1664888311321_0003 | Recommended          |            2.47 |          101.54 |        250.95 |           43.08 |
|  7 | query87    | application_1664888311321_0006 | Recommended          |            2.25 |           75.67 |        170.69 |           37.64 |
|  8 | query51    | application_1664888311321_0002 | Recommended          |            1.53 |           53.94 |         82.63 |            8.18 |
+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+
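As a reading aid for the table above, the Estimated GPU Speedup column appears to be the ratio of App Duration to Estimated GPU Duration. The sketch below recomputes that ratio from the rows shown; the relationship is an assumption inferred from the sample values rather than something this section states, and the function name `max_speedup_deviation` is hypothetical.

```python
# (App Name, reported speedup, Estimated GPU Duration(s), App Duration(s))
# copied from the example output above.
rows = [
    ("query24", 3.49, 257.18, 897.68),
    ("query64", 2.91, 150.81, 440.30),
    ("query50", 2.47, 101.54, 250.95),
    ("query87", 2.25, 75.67, 170.69),
    ("query51", 1.53, 53.94, 82.63),
]


def max_speedup_deviation(table_rows):
    """Largest gap between the reported speedup and App / GPU duration."""
    return max(
        abs(app_s / gpu_s - reported)
        for _, reported, gpu_s, app_s in table_rows
    )


for name, reported, gpu_s, app_s in rows:
    print(f"{name}: reported {reported:.2f}, recomputed {app_s / gpu_s:.4f}")

print(f"largest deviation: {max_speedup_deviation(rows):.4f}")
```

For these five rows the recomputed ratio stays within 0.01 of the reported value. The small last-digit differences are presumably because the tool computes on unrounded durations.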

For more information on the detailed output of the qualification tool, go here: Output details.

Running the qualification tool without the CLI

The qualification tool can be run in three other ways if you are not using the CLI tool. The first is to run it as a standalone tool on the Spark event logs after the application(s) have run, the second is to integrate it into a running Spark application using explicit API calls, and the third is to install a Spark listener that can output results on a per-SQL-query basis.

For more information on running the qualification tool using a jar instead of the CLI tools package, go here: Jar usage.