Overview#

What The Qualification Tool Is#

The qualification tool analyzes Spark event logs generated from CPU-based Spark applications to determine which applications are good candidates for migration to GPU.

The tool analyzes a CPU event log and extracts various metrics to help determine how the workload would run on GPU. It then uses data from historical queries and benchmarks to estimate a speed-up at the individual operator level and project how the workload would accelerate on GPU. GPU duration estimates are available for different environments and are based on benchmarks run in those environments. The Benchmark Environments page lists the cluster information used to run the benchmarks.
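
As a simplified sketch of the idea (the tool's actual model layers additional heuristics on top, and the symbols below are illustrative rather than names the tool uses), the per-operator estimates can be combined into an overall GPU duration roughly as

\[ T_{\text{GPU}} \approx \sum_{i} \frac{t_i}{s_i} \]

where \(t_i\) is the CPU task time attributed to operator \(i\) in the event log and \(s_i\) is the benchmarked speed-up factor for that operator.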

The tool combines the estimation with other relevant heuristics to qualify workloads for migration to GPU. In addition to generating the qualified workload list, the tool provides two outputs to assist in the migration to GPU:

  • Optimized Spark configs for GPU: the tool calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (for example, memory, cores, Spark default configurations) as well as information processed in the application event logs. The tool will also recommend settings for the application, assuming that the job will be able to use all the cluster resources (CPU and GPU) when it’s running. An illustrative example follows this list.

  • Recommended GPU cluster shape (for CSPs only): the tool will generate a recommended instance type and count, along with GPU information, to be used for the migration.
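
For illustration only, the recommended configurations take the form of standard Spark and RAPIDS Accelerator settings. A hypothetical snippet is shown below; the actual keys and values the tool emits depend on the cluster shape and the analyzed event logs, and the values here are placeholders:

```
--conf spark.executor.cores=16
--conf spark.executor.memory=32g
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.task.resource.gpu.amount=0.0625
--conf spark.rapids.memory.pinnedPool.size=4g
--conf spark.sql.files.maxPartitionBytes=512m
```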

Assumptions and Limitations#

This tool is intended to give users a starting point and does not guarantee that the queries or applications with the highest recommendation will be accelerated the most. Currently, its reporting is based on the amount of task time spent in SQL DataFrame operations. The qualification tool's estimates assume that the application runs on a dedicated cluster where it can use all of the available Spark resources.

The tool performs static analysis by parsing CPU-based Spark event logs to extract execution metrics required for GPU duration estimation. This approach introduces several technical constraints due to the inherent limitations of event log data structures:

  • Expression truncation: Event logs may truncate complex expression metadata for certain operators, preventing complete schema column identification and accurate operator analysis.

  • Schema variability: Event logs lack standardized structure across different Spark distributions and versions, creating parsing challenges when handling proprietary or custom Spark implementations.

  • Incomplete operator metadata: When operators contain insufficient runtime information in the event log, the tool applies optimistic assumptions regarding GPU compatibility, which may not reflect actual hardware acceleration potential.

Supported Data Sources and Execution Engines#

The qualification tool supports analysis of Spark applications across various data sources and execution engines. Understanding which components are supported helps you determine whether your workloads can be effectively analyzed and migrated to GPU acceleration.

Data Source Compatibility

The tool can analyze event logs from applications that read from and write to different data storage formats and catalogs:

| Data Source/Engine | Support Status | Notes |
| --- | --- | --- |
| Hive tables | ☑️ | Full support for all Hive operations |
| Delta Lake | ☑️ | See full details in Delta Lake Support section |
| Iceberg with Hadoop catalog | WIP | Basic read operations supported |
| Photon (Databricks) | Experimental | Limited operator coverage |
| Velox (Meta) | Future work | Planned for future releases |

What This Means for Your Analysis

  • Fully Supported: Applications using Hive tables and Delta Lake will receive the most accurate GPU acceleration estimates and recommendations

  • Work in Progress: Iceberg support covers standard read/write operations but may have limitations with advanced features

  • Experimental: Photon analysis provides basic insights, but recommendations should be validated through testing

  • Future Work: Velox support is planned but not yet available

For detailed information about supported cloud platforms, Spark versions, and deployment environments, refer to the Qualification Support section.

How To Run The Qualification Tool#

The Qualification tool can be run as a command-line interface, distributed as a pip package, for CSP environments (Google Dataproc, AWS EMR, Databricks-AWS, and Databricks-Azure) as well as on-prem environments.
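
As a minimal sketch, assuming the spark-rapids-user-tools pip package and an on-prem event log path (the path below is a placeholder; consult the quick start guide for the authoritative syntax and options):

```
pip install spark-rapids-user-tools

spark_rapids qualification \
  --platform onprem \
  --eventlogs /path/to/eventlogs
```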

For more information on running the Qualification tool from the pip package, visit the quick start guide.