Quickstart
The simplest way to run the tool is using the spark-rapids-user-tools CLI. This enables you to analyze event logs from a number of CSP platforms in addition to on-prem.
When running standalone on Spark event logs, the tool can be run as a user tool command via the RAPIDS user tools pip package, or as a Java application for CSP environments (Google Dataproc, AWS EMR, and Databricks Azure/AWS) and on-prem. More details on how to use the Java application are described in java API.
For the most accurate results, it is recommended to run the latest version of the CLI tool.
Prerequisites
Set up a Python environment with a version between 3.8 and 3.10
Java 8+
The developer machine used to host the CLI tools needs internet access to download JAR dependencies from Maven: spark-*.jar, hadoop-aws-*.jar, and aws-java-sdk-bundle*.jar. If the host machine is behind a proxy, then it is recommended to install the CLI package from source using the fat mode as described in the Install the CLI Package section.
Set up the development environment for your CSP or on-prem
The tools CLI depends on the Python implementation of PyArrow, which relies on some environment variables to bind with HDFS:
HADOOP_HOME: the root of your installed Hadoop distribution. Often has “lib/native/libhdfs.so”.
JAVA_HOME: the location of your Java SDK installation.
ARROW_LIBHDFS_DIR (optional): explicit location of “libhdfs.so” if it is installed somewhere other than $HADOOP_HOME/lib/native.
Add the Hadoop jars to your CLASSPATH.
No further steps are required to run the tools on an on-premises environment, including standalone/local machines.
On Linux/macOS: export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
On Windows: %HADOOP_HOME%/bin/hadoop classpath --glob > %CLASSPATH%
For more information on HDFS requirements, refer to the PyArrow HDFS documentation
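As a minimal sketch of the HDFS binding described above (all paths are placeholders for your own Hadoop and Java installations), the variables can be exported before invoking the CLI:
# Placeholder paths; adjust to your own installation.
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
# Only needed if libhdfs.so is not under $HADOOP_HOME/lib/native
export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`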
Install gcloud CLI. Follow the instructions on gcloud-sdk-install
Set the configuration settings and credentials of the gcloud CLI:
Initialize the gcloud CLI by following these instructions
Grant authorization to the gcloud CLI with a user account
Set up “application default credentials” to the gcloud CLI by logging in
Manage gcloud CLI configurations. For more details, visit gcloud-sdk-configurations
Verify that the following gcloud CLI properties are properly defined:
dataproc/region
compute/zone
compute/region
core/project
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as CLOUDSDK_DATAPROC_REGION and CLOUDSDK_COMPUTE_REGION.
The tools CLI follows the process described in this doc to resolve the credentials. If not running on GCP, the environment variable GOOGLE_APPLICATION_CREDENTIALS is required to point to a JSON file containing credentials.
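As a minimal sketch of the gcloud setup above (project, region, zone, and key-file paths are placeholders):
gcloud init                                     # initialize and authorize the gcloud CLI
gcloud auth application-default login           # set up application default credentials
gcloud config set core/project my-project       # placeholder project
gcloud config set dataproc/region us-central1   # placeholder region
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a
gcloud config list                              # verify the properties listed above
# If not running on GCP, point to a credentials JSON file (placeholder path):
export GOOGLE_APPLICATION_CREDENTIALS=$HOME/keys/my-service-account.json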
Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started
Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.
If the AWS CLI configuration is not set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as: AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. See the full list of variables in aws-cli-configure-envvars.
Note that it is important to configure with the correct region for the bucket being used on S3. If the region is not set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials could not be pulled from the credential files.
Note: In order to run tools that require SSH on the EMR nodes (i.e., bootstrap):
make sure that you have SSH access to the cluster nodes; and
create a key pair using Amazon EC2 through the AWS CLI command
aws ec2 create-key-pair
as instructed in aws-cli-create-key-pairs.
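As a minimal sketch of the AWS CLI setup above (profile, region, and key-pair names are placeholders):
aws configure                       # interactively creates the credentials and config files
# Override defaults explicitly if needed (placeholder values):
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-west-2
# Optional: key pair for tools that require SSH on the EMR nodes
aws ec2 create-key-pair --key-name my-key-pair --query 'KeyMaterial' --output text > my-key-pair.pem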
The tool currently only supports event logs stored on S3 (no DBFS paths). The remote output storage is also expected to be S3.
Install Databricks CLI
Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.
Set the configuration settings and credentials of the Databricks CLI:
Set up authentication by following these instructions
Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: DATABRICKS_CONFIG_FILE, DATABRICKS_HOST and DATABRICKS_TOKEN. See the description of the variables in environment variables docs.
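As a minimal sketch of the Databricks CLI setup above (the host and token values are placeholders):
databricks configure                 # prompts for the workspace host and a personal access token
# Or export the values explicitly (placeholders):
export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_TOKEN=<personal-access-token>
export DATABRICKS_CONFIG_FILE=$HOME/.databrickscfg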
Set up the environment to access S3
Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started
Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.
If the AWS CLI configuration is not set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as: AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. See the full list of variables in aws-cli-configure-envvars.
Note that it is important to configure with the correct region for the bucket being used on S3. If the region is not set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials could not be pulled from the credential files.
Note: In order to run tools that require SSH on the EMR nodes (i.e., bootstrap):
make sure that you have SSH access to the cluster nodes; and
create a key pair using Amazon EC2 through the AWS CLI command
aws ec2 create-key-pair
as instructed in aws-cli-create-key-pairs.
The tool currently only supports event logs stored on ABFS. The remote output storage is also expected to be ABFS (no DBFS paths).
Install Databricks CLI
Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.
Set the configuration settings and credentials of the Databricks CLI:
Set up authentication by following these instructions
Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: DATABRICKS_CONFIG_FILE, DATABRICKS_HOST and DATABRICKS_TOKEN. See the description of the variables in environment variables docs.
Install Azure CLI
Install the Azure CLI. Follow the instructions on How to install the Azure CLI.
Set the configuration settings and credentials of the Azure CLI:
Set up the authentication by following these instructions.
Configure the Azure CLI by following these instructions.
location is used for retrieving the instance type description (default is westus).
output should use the default of json in the core section.
Verify that the configurations are stored in the file $AZURE_CONFIG_DIR/config where the default value of AZURE_CONFIG_DIR is $HOME/.azure on Linux or macOS.
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: AZURE_CONFIG_DIR and AZURE_DEFAULTS_LOCATION.
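As a minimal sketch of the Azure CLI setup above (values are placeholders):
az login                                   # authenticate the Azure CLI
az config set core.output=json             # keep the default json output format
az config set defaults.location=westus     # location used to retrieve instance type descriptions
# Override the defaults explicitly if needed (placeholder path):
export AZURE_CONFIG_DIR=$HOME/.azure
export AZURE_DEFAULTS_LOCATION=westus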
Install the CLI Package
Install spark-rapids-user-tools with one of the options below
pip install spark-rapids-user-tools
pip install <wheel-file>
Check out the code repository
git clone git@github.com:NVIDIA/spark-rapids-tools.git
cd spark-rapids-tools/user_tools
Optional: Run the project in a virtual environment
python -m venv .venv
source .venv/bin/activate
Build the wheel file using one of the following modes:
- Fat mode: Similar to a fat jar in Java, this mode solves the problem when web access is not available to download resources with URL paths (http/https). The command builds the tools jar file, downloads the necessary dependencies, and packages them with the source code into a single wheel file. You may consider this mode if the development environment has no access to download dependencies (i.e., Spark jars) during runtime.
- Default mode: This mode builds a wheel package without any jar dependencies.
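As a sketch, assuming the repository provides a build script that accepts the build mode as an argument (check the repository README for the exact command):
# Assumption: "fat" selects the fat mode described above; omit the argument for the default mode.
./build.sh fat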
Finally, install the package using the wheel file
pip install <wheel-file>
A typical workflow to successfully run the qualification command in local mode is described as follows:
Follow the instructions to set up the CLI.
Gather Spark event logs from prior runs of the applications on Spark 2.x or later. Get the location of the Apache Spark event logs generated from CPU-based Spark applications. In addition to local storage, the event logs may be stored in a valid remote storage:
For Dataproc, it should be set to the GCS path.
For EMR and Databricks-AWS, it should be set to the S3 path.
For Databricks-Azure, it should be set to the ABFS path.
Finally, run the qualification command on the set of selected event logs. The cmd helps quantify the expected acceleration and cost savings of migrating a Spark application or query to GPU. The cmd will process each app individually, but will group apps with the same name into the same output row after averaging duration metrics accordingly.
spark_rapids qualification <flag>
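For example, a hedged invocation using flags from the options table below (the platform value, bucket path, and cluster name are placeholders):
# Qualify CPU event logs stored on GCS against a Dataproc cluster (placeholder values)
spark_rapids qualification --platform dataproc --eventlogs gs://my-bucket/eventlogs/ --cluster my-cpu-cluster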
Environment Variables
In addition to the environment variables used to configure the CSP environment, the CLI has its own set of environment variables.
Before running any command, you can set environment variables to specify configurations. RAPIDS variables follow the naming pattern RAPIDS_USER_TOOLS_*:
RAPIDS_USER_TOOLS_CACHE_FOLDER: specifies the location of a local directory that the CLI uses to store and cache the downloaded resources. The default is /var/tmp/spark_rapids_user_tools_cache. Note that caching the resources locally has an impact on the total execution time of the command.
RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY: specifies the location of a local directory that the CLI uses to generate the output. The wrapper CLI arguments (i.e., --output_folder) override that environment variable.
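For example (both directories are placeholders):
export RAPIDS_USER_TOOLS_CACHE_FOLDER=/var/tmp/spark_rapids_user_tools_cache
export RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY=$HOME/qual_runs
spark_rapids qualification --eventlogs <path-to-eventlogs>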
Command Options
You can list all the options using the help argument:
spark_rapids qualification -- --help
Available options are listed in the following table.
| Option | Description | Default | Required |
|---|---|---|---|
| --eventlogs | Event log filenames or CSP storage directories containing event logs (comma separated). Skipping this argument requires that the cluster argument points to a valid cluster name on the CSP. | N/A | N |
| --cluster | Name or ID (for Databricks platforms) of the cluster, or path to the cluster properties. | N/A | N |
| --platform, -p | Defines one of the following: “on-prem”, “emr”, “dataproc”, “dataproc-gke”, “databricks-aws”, and “databricks-azure”. | N/A | N |
| --target_platform, -t | Cost savings and speedup recommendation for a comparable cluster in target_platform based on the on-prem cluster configuration. Currently only dataproc is supported. If not provided, the final report will be limited to GPU speedups only, without cost savings. | N/A | N |
| --output_folder, -o | Path to store the output. | N/A | N |
| --filter_apps, -f | Requires the cluster argument. Filtering criteria of the applications listed in the final STDOUT table, without affecting the CSV report. | TOP_CANDIDATES | N |
| --estimation_model | Model used to calculate the estimated GPU duration and cost savings. | speedups | N |
| --tools_jar | Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path or a remote cloud storage URL. If missing, the wrapper downloads the latest rapids-4-spark-tools_*.jar from the Maven repository. | N/A | N |
| --jvm_heap_size | The maximum heap size of the JVM in gigabytes. Default is calculated as a function of the total memory of the host. | N/A | N |
| --jvm_threads | Number of threads to use for parallel processing of the event log batch. Default is calculated as a function of the total number of cores and the heap size on the host. | N/A | N |
| --cpu_cluster_price | The CPU cluster hourly price (float) provided by the user. | N/A | N |
| --estimated_gpu_cluster_price | The GPU cluster hourly price provided by the user. | N/A | N |
| --cpu_discount | A percent discount for the CPU cluster cost in the form of an integer value (e.g., 30 for a 30% discount). | N/A | N |
| --gpu_discount | A percent discount for the GPU cluster cost in the form of an integer value (e.g., 30 for a 30% discount). | N/A | N |
| --global_discount | A percent discount for both the CPU and GPU cluster costs in the form of an integer value (e.g., 30 for a 30% discount). | N/A | N |
| --gpu_cluster_recommendation | Requires the cluster argument. The type of GPU cluster recommendation to generate. | MATCH | N |
| --verbose, -v | True or False to enable verbosity of the script. | N/A | N |
Cost-Savings
By default, the tool generates estimated speedups of the CPU application. In order to generate the estimated cost savings, you need to provide the CPU cluster information as input.
The tool allows passing the cluster properties (including for an on-prem cluster) using one of the following scenarios:
Cluster by name
This option is not available for on-prem cluster.
The gcloud command is used to view the details of a cluster (see gcloud SDK docs)
gcloud dataproc clusters describe cluster_name
The list-clusters command provides the status of the cluster visible to the AWS account. (see AWS CLI docs)
aws emr list-clusters --query 'Clusters[?Name==`{cluster_name}`]'
The above command outputs a list of clusters from which we can extract the cluster-id as an input for the describe-cluster cmd.
aws emr describe-cluster --cluster-id {cluster_id}
Databricks-get cmd can be used to print information about an individual cluster in a workspace.
databricks clusters get CLUSTER_ID [flags]
Cluster property file
The cluster may be deleted or offline. In this case, point to the cluster using its properties file (JSON/YAML formats).
The user defines the cluster configuration of the on-prem platform. The following sample is in yml format.
config:
  masterConfig:
    numCores: 2
    memory: 7680MiB
  workerConfig:
    numCores: 8
    memory: 7680MiB
    numWorkers: 2
Refer to the gcloud SDK docs.
Refer to the sample output of the describe-cluster cmd.
Refer to Databricks CLI docs.
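As a hedged example, a saved properties file can then be passed through the --cluster flag instead of a live cluster name (both paths are placeholders):
spark_rapids qualification --cluster ./my-cluster-props.yaml --eventlogs gs://my-bucket/eventlogs/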
Sample Commands
To see a full list of commands in detail, please visit Qualification-cmd CLI examples.
Qualification Output
The Qualification tool will run against logs from your CSP environment and then will output the applications recommended for acceleration along with Estimated GPU Speedup and cost saving metrics.
The command creates a directory with a UUID that contains the following:
A directory generated by the RAPIDS qualification tool, rapids_4_spark_qualification_output;
A CSV file that contains the summary of all the applications along with estimated absolute costs (qualification_summary.csv).
Sample output directory structure.
qual_20230314145334_d2CaFA34
├── qualification_summary.csv
└── rapids_4_spark_qualification_output/
    ├── ui/
    │   └── html/
    ...
See this listing for full details of the subdirectory rapids_4_spark_qualification_output.
In qualification_summary.csv, the command output lists the following fields for each application:
- App ID
- App Name
- App Duration
- Estimated GPU Duration
- Estimated GPU Speedup
- Estimated GPU Savings(%)
- Savings Based Recommendation
Strongly Recommended: An app with savings \(\geq\) 40%
Recommended: An app with savings between (1, 40) %
Not Recommended: An app with no savings
Not Applicable: An app that has job or stage failures.
- Speedup Based Recommendation
App ID: An application is referenced by its application ID, app-id. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not for applications in client mode. Applications in YARN cluster mode can be identified by their attempt-id.
App Name: Name of the application.
App Duration: Wall-clock time measured from when the application starts until it is completed. If an app is not completed, an estimated completion time is computed.
Estimated GPU Duration: Predicted runtime of the app if it was run on GPU. It is the sum of the accelerated operator durations and the ML function durations (if applicable), along with the durations that could not run on GPU because they are unsupported operators or are not SQL/Dataframe operations.
Estimated GPU Speedup: Estimates how much faster the application would run on GPU. It is calculated as the ratio between App Duration and Estimated GPU Duration.
Estimated GPU Savings(%): Percentage of cost savings of the app if it migrates to an accelerated cluster. It is calculated as:
\(\texttt{estimated}\_\texttt{saving} = 100 - (\frac{100 \times \texttt{gpu}\_\texttt{cost}}{\texttt{cpu}\_\texttt{cost}})\)
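For illustration only, with hypothetical hourly costs of 100 for the CPU cluster and 55.67 for the equivalent GPU cluster (numbers chosen to match the first row of the sample output below):
\(\texttt{estimated}\_\texttt{saving} = 100 - (\frac{100 \times 55.67}{100}) = 44.33\%\)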
Savings Based Recommendation: Recommendation based on Estimated GPU Savings.
Speedup Based Recommendation: Recommendation based on Estimated GPU Speedup. Note that an application that has job or stage failures will be labeled Not Applicable.
Sample of the Qualification cmd output on STDOUT:
+----+----------+---------------------+----------------------+----------------------+---------------+-----------------+-----------------+-----------------+
| | App ID | App Name | Speedup Based | Savings Based | App | Estimated GPU | Estimated GPU | Estimated GPU |
| | | | Recommendation | Recommendation | Duration(s) | Duration(s) | Speedup | Savings(%) |
|----+----------+---------------------+----------------------+----------------------+---------------+-----------------+-----------------+-----------------|
| 0 | app-0002 | spark_data_utils.py | Strongly Recommended | Strongly Recommended | 1201.72 | 220.85 | 5.44 | 44.33 |
| 3 | app-0001 | Spark shell | Strongly Recommended | Recommended | 1783.65 | 533.05 | 3.35 | 9.48 |
+----+----------+---------------------+----------------------+----------------------+---------------+-----------------+-----------------+-----------------+
For more information on the detailed output of the Qualification tool, go here: Output Details.
TCO Calculator
In addition to the above fields, Estimated Job Frequency (monthly) and Annual Cost Savings are to be used as part of a TCO calculator to see the long-term benefit of using Spark RAPIDS with your applications.
Copy the GSheet template and then follow the instructions listed in the Instructions tab.