Quickstart#

spark-rapids-user-tools CLI enables user to run the tool for logs from a number of CSP platforms in addition to on-prem.

Tip

For most accurate results, it’s recommended to run the latest version of the CLI tool.
Databricks users can run the tool using a demo notebook.

Running the Tool#

A typical workflow to successfully run the qualification command:

Follow the instructions to set up the prerequisites and install the CLI
Get Apache Spark eventlogs from prior runs of CPU based applications on Spark 2.x or later. In addition to local storage, the eventlogs should be stored in a valid remote storage:
- For Dataproc, it should be set to GCS path.
- For EMR and Databricks-AWS, it should be set to S3 path.
- For Databricks-Azure, it should be set to ABFS path.
Run the qualification command on the set of selected eventlogs. Event logs can be passed as single files, a directory, a comma-separated list of files or directories. The format of event logs can be raw, zip, or gzip.

spark_rapids qualification <flags>

The tool helps quantify the expected acceleration of migrating a Spark application or query to GPU. The tool will process each app individually, but will group apps with the same name and cluster details into a single output row after averaging duration metrics accordingly.

Example Commands#

This section shows examples of Qualification CLI commands assuming the following inputs:

EVENTLOG: Path to Spark eventlogs without the scheme part. The scheme can be a local file system (file://), HDFS (hdfs://), S3 (s3://), ABFS (abfss://), or GCS (gs://).

The following table shows CLI command examples along with platform and expected functionalities based on which analysis is performed.

Examples of `qualification` CLI commands#
CMD	Platform
spark_rapids qualification \ --platform dataproc \ --eventlogs gs://$EVENTLOG	Dataproc
spark_rapids qualification \ --platform emr \ --eventlogs s3://$EVENTLOG	EMR
spark_rapids qualification \ --platform databricks-azure \ --eventlogs file://$EVENTLOG	Databricks-Azure
spark_rapids qualification \ --platform databricks-aws \ --eventlogs file://$EVENTLOG	Databricks-AWS
spark_rapids qualification \ --platform onprem \ --eventlogs file://$EVENTLOG	On-prem

Command Options#

You can list all the options using the help argument

spark_rapids qualification -- --help

Available options are listed in the following table.

List of options for `qualification` CLI command#
Option	Description	Default	Required
`--eventlogs`	Event log filenames or CSP storage directories containing event logs (comma separated). Skipping this argument requires that the `cluster` argument points to a valid cluster name on the CSP.	N/A	N
`--cluster`	The CPU cluster on which the Spark application(s) were executed. Name or ID of cluster or path to cluster property file. Further details described in Cluster Metadata.	N/A	N
`--platform`, `-p`	Defines one of the following “on-prem”, “emr”, “dataproc”, “dataproc-gke”, “databricks-aws”, and “databricks-azure”.	N/A	N
`--target_platform`, `-t`	Speedup recommendation for comparable cluster in `target_platform` based on on-prem cluster configuration. Currently only dataproc is supported.	N/A	N
`--output_folder`, `-o`	Path to store the output.	N/A	N
`--filter_apps`, `-f`	Requires `cluster` argument. Filtering criteria of the applications listed in the final STDOUT table without affecting the CSV report: `ALL` means no filter applied. `TOP_CANDIDATES` lists all apps that have unsupported operators stage duration less than 25% of app duration and speedups greater than 1.3x.	`TOP_CANDIDATES`	N
`--custom_model_file`	Custom model file (JSON format) used to calculate the estimated GPU duration	N/A	N
`--tools_jar`	Path to a bundled jar including Rapids tool. The path is a local filesystem, or remote cloud storage url. If missing, the wrapper downloads the latest rapids-4-spark-tools_*.jar from maven repository.	N/A	N
`--jvm_heap_size`	The maximum heap size of the JVM in gigabytes. Default is calculated based on a function of the total memory of the host.	N/A	N
`--jvm_threads`	Number of thread to use for parallel processing on the eventlogs batch. Default is calculated as a function of the total number of cores and the heap size on the host.	N/A	N
`--gpu_cluster_recommendation`	Requires `cluster` argument. The type of GPU cluster recommendation to generate: `MATCH`: keep GPU cluster same number of nodes as CPU cluster `CLUSTER`: recommend optimal GPU cluster for entire cluster to match CPU duration of longest job `JOB`: recommend optimal GPU cluster per job to match CPU duration per job	`MATCH`	N
`--verbose`, `-v`	True or False to enable verbosity of the script.	N/A	N

Cluster-Metadata#

By default, the qualification tool generates estimated speedups of the CPU applications. It will also generate a cluster recommendation for running on GPU. To aid that recommendation, you can provide the CPU cluster information.

The specific type of parameter passed for the cluster is based on platform, see the following scenarios to determine which method to use for your platform:

Cluster by name

User can pass name of the cluster CLUSTER_NAME to the command:
spark_rapids qualification --cluster $CLUSTER_NAME [flags]
This is supported on Dataproc and EMR platform. The tool uses the CSP CLI to collect the cluster information.

Cluster by ID

User can pass ID of the cluster CLUSTER_ID to the command:
spark_rapids qualification --cluster $CLUSTER_ID [flags]
This is supported on Databricks-AWS and Databricks-Azure platform. The tool uses the CSP CLI to collect the cluster information.

Cluster by Property File

User can pass CLUSTER_PROPS - the path to cluster property file (in json/yaml formats) to the command. This is useful if the cluster isn’t accessible or permanently deleted.
spark_rapids qualification --cluster $CLUSTER_PROPS [flags]
On-prem
User defines the cluster configuration of on-prem platform. The following is a sample cluster property file CLUSTER_PROPS in yaml format.
config:
  masterConfig:
    numCores: 2
    memory: 7680MiB
  workerConfig:
    numCores: 8
    memory: 7680MiB
    numWorkers: 2
target_platform is required for on-prem clusters. Currently only Dataproc is supported.
Dataproc
Given Dataproc CLUSTER_NAME, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to gcloud CLI docs)
gcloud dataproc clusters describe $CLUSTER_NAME > $CLUSTER_PROPS
EMR
Given EMR CLUSTER_ID, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to AWS CLI docs)
aws emr describe-cluster --cluster-id $CLUSTER_ID > $CLUSTER_PROPS
Databricks-AWS
Given Databricks CLUSTER_ID, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to Databricks CLI docs)
databricks clusters get $CLUSTER_ID > $CLUSTER_PROPS
Databricks-Azure
Given Databricks CLUSTER_ID, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to Databricks CLI docs)
databricks clusters get $CLUSTER_ID > $CLUSTER_PROPS

Qualification Output#

The Qualification tool will run against logs from your CSP environment and then will output the applications recommended for acceleration along with Estimated GPU Speedup.

The command creates a directory with UUID that contains the following output:

  qual_20230314145334_d2CaFA34
  ├── app_metadata.json
  ├── qualification_summary.csv
  ├── qualification_statistics.csv
  ├── intermediate_output/
  ├── rapids_4_spark_qualification_output/
  └── xgboost_predictions/
      ...

Note

See this listing for full details of the directory qual_20230314145334_d2CaFA34.

In qualification_summary.csv, the command output lists these key fields for each application:

App ID#: An application is referenced by its application ID, app-id. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not applications in client mode. Applications in YARN cluster mode can be identified by their attempt-id.
App Name#: Name of the application
App Duration#: Wall-Clock time measured since the application starts until it’s completed. If an app isn’t completed an estimated completion time would be computed.
Estimated GPU Duration#: Predicted runtime of the app if it was run on GPU.
It’s the sum of the accelerated operator durations and ML functions duration(if applicable) along with durations that couldn’t run on GPU because they’re unsupported operators or not SQL/Dataframe.
Estimated GPU Speedup#: That will estimate how much faster the application would run on GPU. It’s calculated as the ratio between App Duration and Estimated GPU Duration.
Estimated GPU Speedup Category#: This is the qualification result per job to determine if the job is a good candidate for running on GPU. A value of Large, Medium, or Small indicates the job should be migrated to GPU with different levels of confidence for expected acceleration. A value of Not Recommended or Not Applicable indicates the job should not be migrated to GPU.

The qualification command output in STDOUT will also show a summary of qualified candidates.

Sample of Qualification command output in STDOUT#

  +----+-----------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+
  |    | App Name        | App ID                         | Estimated GPU   |   Qualified Node | Full Cluster                        | GPU Config                         |
  |    |                 |                                | Speedup         |   Recommendation | Config                              | Recommendation                     |
  |    |                 |                                | Category**      |                  | Recommendations*                    | Breakdown*                         |
  |----+-----------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------|
  |  0 | query A0        | application_1696859475058_0007 | Large           | 45 x g5.8xlarge  | application_1696859475058_0007.conf | application_1696859475058_0007.log |
  |  1 | query A1        | application_1696859475058_0008 | Large           | 9 x g5.8xlarge   | application_1696859475058_0008.conf | application_1696859475058_0008.log |
  |  2 | query A2        | application_1696859475058_0012 | Large           | 2 x g5.8xlarge   | application_1696859475058_0012.conf | application_1696859475058_0012.log |
  |  5 | query B0        | application_1696859475058_0014 | Medium          | 150 x g5.8xlarge | application_1696859475058_0014.conf | application_1696859475058_0014.log |
  |  6 | query B1        | application_1696859475058_0016 | Medium          | 2 x g5.8xlarge   | application_1696859475058_0016.conf | application_1696859475058_0016.log |
  |  7 | query B2        | application_1696859475058_0011 | Small           | 2 x g5.8xlarge   | application_1696859475058_0011.conf | application_1696859475058_0011.log |
  +----+-----------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+

  Notes:
  --------------------
   - *Cluster config recommendations: ./qual_20240731164401_6D6B7fb8/rapids_4_spark_qualification_output/tuning
   - **Estimated GPU Speedup Category assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.

For more information on the detailed output of the Qualification tool, go here: Output Details.

Quickstart#

Install#

Prerequisites#

Install the CLI Package#