Running on Google Dataproc

This is a guide for the RAPIDS tools for Apache Spark on [Google Cloud Dataproc](https://cloud.google.com/dataproc). At the end of this guide, the user will be able to run the RAPIDS tools to analyze the clusters and the applications running on Google Cloud Dataproc.

Prerequisites

1. gcloud CLI

Install the gcloud CLI by following the instructions on [gcloud-sdk-install](https://cloud.google.com/sdk/docs/install).
Set the configuration settings and credentials of the gcloud CLI:
- Initialize the gcloud CLI by following [these instructions](https://cloud.google.com/sdk/docs/initializing#initialize_the)
- Grant authorization to the gcloud CLI [with a user account](https://cloud.google.com/sdk/docs/authorizing#authorize_with_a_user_account)
- Set up application default credentials to the gcloud CLI [by logging in](https://cloud.google.com/sdk/docs/authorizing#set_up_application_default_credentials)
- Manage gcloud CLI configurations. For more details, visit [gcloud-sdk-configurations](https://cloud.google.com/sdk/docs/configurations)
- Verify that the following [gcloud CLI properties](https://cloud.google.com/sdk/docs/properties) are properly defined:
  - dataproc/region
  - compute/zone
  - compute/region
  - core/project
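
The property check above can be scripted. The following is a minimal sketch, not part of the RAPIDS tools: the function name `check_gcloud_properties` is hypothetical, and it assumes that `gcloud config get-value` writes nothing to standard output when a property is unset.

```shell
# Report which of the four gcloud properties listed above are unset.
# check_gcloud_properties is a hypothetical helper name.
check_gcloud_properties() {
  local missing=0 prop value
  for prop in dataproc/region compute/zone compute/region core/project; do
    # Assumption: an unset property yields empty stdout (diagnostics go to stderr).
    value="$(gcloud config get-value "$prop" 2>/dev/null)"
    if [ -z "$value" ]; then
      echo "MISSING $prop"
      missing=1
    else
      echo "OK $prop=$value"
    fi
  done
  # Non-zero exit status when at least one property is missing.
  return "$missing"
}
```

Calling `check_gcloud_properties` before running the tools prints one line per property and returns a non-zero status if any of them still needs to be set with `gcloud config set`.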

2. RAPIDS tools

Spark event logs:
- The RAPIDS tools can process Apache Spark CPU event logs from Spark 2.0 or higher (raw, .lz4, .lzf, .snappy, .zstd)
- For the qualification command, the event logs need to be archived to an accessible gs folder.

3. Install the package

Install spark-rapids-user-tools with python [3.8, 3.10] using:
- pip: pip install spark-rapids-user-tools
- wheel-file: pip install <wheel-file>
- from source: pip install -e .

Verify the command is installed correctly by running spark_rapids -- --help

4. Environment variables

Before running any command, you can set environment variables to specify configurations. RAPIDS variables have a naming pattern RAPIDS_USER_TOOLS_*:

- RAPIDS_USER_TOOLS_CACHE_FOLDER: specifies the location of a local directory that the RAPIDS-cli uses to store and cache the downloaded resources. The default is /var/tmp/spark_rapids_user_tools_cache. Note that caching the resources locally has an impact on the total execution time of the command.
- RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY: specifies the location of a local directory that the RAPIDS-cli uses to generate the output. The wrapper CLI arguments override this environment variable (--output_folder for Bootstrap and --local_folder for Qualification).
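
As an illustration, both variables can be pointed at subdirectories of a single working directory. This is a sketch only: the helper name `setup_rapids_tools_env` and the `cache`/`output` subdirectory names are assumptions, and creating the directories up front is a precaution rather than a documented requirement.

```shell
# Point both RAPIDS_USER_TOOLS_* variables at subdirectories of one base directory.
# setup_rapids_tools_env and the subdirectory names are hypothetical.
setup_rapids_tools_env() {
  base_dir="$1"
  export RAPIDS_USER_TOOLS_CACHE_FOLDER="$base_dir/cache"
  export RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY="$base_dir/output"
  # Create the directories so that later commands do not depend on them existing.
  mkdir -p "$RAPIDS_USER_TOOLS_CACHE_FOLDER" "$RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY"
}
```

For example, `setup_rapids_tools_env "$HOME/rapids_tools"` exports both variables for the current shell session.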

Running the qualification tool

Local deployment

spark_rapids qualification [options]

spark_rapids qualification -- --help

The local deployment runs on the local development machine. It requires:

- Installing and configuring the gcloud CLI (gsutil and gcloud commands)
- Java 1.8+ development environment
- Internet access to download JAR dependencies from mvn: spark-*.jar, and gcs-connector-hadoop-*.jar
  - Dependencies are cached on the local disk to reduce the overhead of the download.

Command options

| Option | Description | Default | Required |
|---|---|---|---|
| cpu_cluster | The Dataproc cluster on which the Apache Spark applications were executed. Accepted values are a Dataproc cluster name, or a valid path to the cluster properties file (json format) generated by the gcloud CLI command gcloud dataproc clusters describe | N/A | N |
| eventlogs | A comma-separated list of gs urls pointing to event logs or a gs directory | Reads the Spark property spark.eventLog.dir defined in cpu_cluster. This property should be included in the output of dataproc clusters describe. Note that the wrapper will raise an exception if the property is not set. | N |
| remote_folder | The gs folder to which the wrapper’s output is copied. If missing, the output will be available only on local disk | N/A | N |
| gpu_cluster | The Dataproc cluster to which the Spark applications are planned to be migrated. The argument can be a Dataproc cluster name or a valid path to the cluster’s properties file (json format) generated by the gcloud CLI command gcloud dataproc clusters describe | The wrapper maps the machine instances of the original cluster into GPU supported instances | N |
| local_folder | Local work-directory path to store the output and to be used as root directory for temporary folders/files. The final output will go into a subdirectory named qual-${EXEC_ID} where exec_id is an auto-generated unique identifier of the execution. | If the argument is NONE, the default value is the env variable RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY if set; otherwise the current working directory. | N |
| jvm_heap_size | The maximum heap size of the JVM in gigabytes | 24 | N |
| tools_jar | Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path or a remote gs url | Downloads the latest rapids-4-spark-tools_.jar from mvn repo | N |
| credentials_file | The local path of the JSON file that contains the application credentials | If missing, loads the env variable GOOGLE_APPLICATION_CREDENTIALS if any. Otherwise, it uses the default path “$HOME/.config/gcloud/application_default_credentials.json” | N |
| filter_apps | Filtering criterion for the applications listed in the final STDOUT table. One of (ALL, SPEEDUPS, SAVINGS). “ALL” means no filter applied. “SPEEDUPS” lists all the apps that are either ‘Recommended’ or ‘Strongly Recommended’ based on speedups. “SAVINGS” lists all the apps that have positive estimated GPU savings except for the apps that are ‘Not Applicable’. | SAVINGS | N |
| gpu_cluster_recommendation | The type of GPU cluster recommendation to generate. One of (CLUSTER, JOB, MATCH). MATCH: keep the GPU cluster at the same number of nodes as the CPU cluster. CLUSTER: recommend the optimal GPU cluster by cost for the entire cluster. JOB: recommend the optimal GPU cluster by cost per job | MATCH | N |
| cpu_discount | A percent discount for the cpu cluster cost in the form of an integer value (e.g. 30 for 30% discount) | N/A | N |
| gpu_discount | A percent discount for the gpu cluster cost in the form of an integer value (e.g. 30 for 30% discount) | N/A | N |
| global_discount | A percent discount for both the cpu and gpu cluster costs in the form of an integer value (e.g. 30 for 30% discount) | N/A | N |
| verbose | True or False to enable verbosity of the wrapper script | False if RAPIDS_USER_TOOLS_LOG_DEBUG is not set | N |
| rapids_options | A list of valid [Qualification tool options](https://docs.nvidia.com/spark-rapids/user-guide/latest/spark-qualification-tool.html#qualification-tool-options). Note that the (output-directory, platform) flags are ignored, and that multiple “spark-property” arguments are not supported. | N/A | N |
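
The fallback order described in the Default column for credentials_file and local_folder can be pictured with the sketch below. It is purely illustrative: the wrapper performs this resolution internally, the function names are hypothetical, and an empty argument stands in for an option that was not passed.

```shell
# Illustration of the documented defaults; not the wrapper's actual code.
# An empty first argument models "option not given on the command line".
resolve_credentials_file() {
  cli_value="$1"
  if [ -n "$cli_value" ]; then
    echo "$cli_value"
  elif [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    echo "$GOOGLE_APPLICATION_CREDENTIALS"
  else
    echo "$HOME/.config/gcloud/application_default_credentials.json"
  fi
}

resolve_local_folder() {
  cli_value="$1"
  if [ -n "$cli_value" ]; then
    echo "$cli_value"
  elif [ -n "${RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY:-}" ]; then
    echo "$RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY"
  else
    # Last resort: the current working directory.
    pwd
  fi
}
```

Running `resolve_local_folder ""` in a shell where RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY is unset prints the current working directory, which matches the last fallback in the table.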

Use case scenario

A typical workflow to successfully run the qualification command in local mode is described as follows:

1. Store the Apache Spark event logs in a gs folder.
2. The user sets up their development machine:
   1. configures Java
   2. installs the gcloud CLI and configures the profile and the credentials to make sure the gcloud CLI commands can access the gs resources LOGS_BUCKET.
   3. installs spark-rapids-user-tools
3. If the results of the wrapper need to be stored on gs, then another gs uri is required: REMOTE_FOLDER=gs://OUT_BUCKET/
4. The user defines the Dataproc cluster on which the Spark applications were running. Note that the cluster does not have to be active, but it has to be visible to the gcloud CLI (i.e., gcloud dataproc clusters describe cluster_name can run).
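
When a cluster properties file is preferred over a cluster name, it can be produced with gcloud dataproc clusters describe. The sketch below is one way to do that under stated assumptions: the helper name `export_cluster_properties` is hypothetical, and the region has to be supplied by the caller. The `--format json` flag is used because the properties file is expected in json format.

```shell
# Save the cluster description as JSON so it can be passed as a properties file.
# export_cluster_properties is a hypothetical helper name.
export_cluster_properties() {
  cluster_name="$1"
  region="$2"
  out_file="$3"
  # Write to a temporary file first so a failed call leaves no partial output.
  if gcloud dataproc clusters describe "$cluster_name" \
       --region "$region" --format json > "$out_file.tmp"; then
    mv "$out_file.tmp" "$out_file"
  else
    rm -f "$out_file.tmp"
    return 1
  fi
}
```

A call such as `export_cluster_properties my-dataproc-cpu-cluster REGION cpu_cluster.json` (with REGION replaced by the cluster's region) produces a file whose path can then be given as the cpu_cluster argument.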

The following script runs qualification and passes a remote gs directory in which to store the output:

# define the wrapper cache directory if necessary
export RAPIDS_USER_TOOLS_CACHE_FOLDER=my_cache_folder
export EVENTLOGS=gs://LOGS_BUCKET/eventlogs/
export CLUSTER_NAME=my-dataproc-cpu-cluster
export REMOTE_FOLDER=gs://OUT_BUCKET/wrapper_output

spark_rapids_user_tools dataproc qualification \
   --eventlogs $EVENTLOGS \
   --cpu_cluster $CLUSTER_NAME \
   --remote_folder $REMOTE_FOLDER

The wrapper generates a unique ID for each execution in the format qual_<YYYYmmddHHmmss>_<0x%08X>. The above command generates a directory containing qualification_summary.csv in addition to the actual output folder of the RAPIDS Qualification tool. The directory is mirrored to the gs path (REMOTE_FOLDER).

./qual_<YYYYmmddHHmmss>_<0x%08X>/qualification_summary.csv
./qual_<YYYYmmddHHmmss>_<0x%08X>/rapids_4_spark_qualification_output/
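
Because the directory name starts with a timestamp, the most recent run can be located by sorting the names. The sketch below relies only on the qual_ prefix and timestamp format shown above; the helper name `latest_qual_dir` is hypothetical.

```shell
# Print the most recent qual_* output directory under a parent directory.
# latest_qual_dir is a hypothetical helper; it prints nothing if no run exists.
latest_qual_dir() {
  parent="${1:-.}"
  # The YYYYmmddHHmmss timestamp sorts lexicographically in chronological order,
  # so the last entry after sorting is the latest execution.
  for d in "$parent"/qual_*/; do
    [ -d "$d" ] && printf '%s\n' "${d%/}"
  done | sort | tail -n 1
}
```

For example, `cat "$(latest_qual_dir .)/qualification_summary.csv"` shows the summary of the latest run in the current directory.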

Qualification output

For each app, the command output lists the following fields:

- App ID: An application is referenced by its application ID, app-id. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not applications in client mode. Applications in YARN cluster mode can be identified by their attempt-id.
- App Name: Name of the application.
- Speedup Based Recommendation: Recommendation based on Estimated Speed-up Factor. Note that an application that has job or stage failures will be labeled Not Applicable.
- Savings Based Recommendation: Recommendation based on Estimated GPU Savings.
  - Strongly Recommended: An app with savings of at least 40%
  - Recommended: An app with savings between 1% and 40%
  - Not Recommended: An app with no savings
  - Not Applicable: An app that has job or stage failures.
- Estimated GPU Speedup: Speed-up factor estimated for the app. Calculated as the ratio between App Duration and Estimated GPU Duration.
- Estimated GPU Duration: Predicted runtime of the app if it were run on GPU.
- App Duration: Wall-clock time measured from the start of the application until it completes. If an app has not completed, an estimated completion time is computed.
- Estimated GPU Savings(%): Percentage of cost savings of the app if it migrates to an accelerated cluster. It is calculated as: estimated_saving = 100 - ((100 * gpu_cost) / cpu_cost)
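
The savings formula and the speedup ratio above can be reproduced by hand, for instance to sanity-check a row of the output. The following is a sketch with hypothetical function names. The thresholds follow the list above, with one assumption: savings of 1% or less are treated as Not Recommended. The Not Applicable label depends on job or stage failures and cannot be derived from the savings value, so it is not covered here.

```shell
# estimated_saving = 100 - ((100 * gpu_cost) / cpu_cost)
estimated_gpu_savings() {
  awk -v cpu="$1" -v gpu="$2" 'BEGIN {
    if (cpu <= 0) { exit 1 }          # avoid division by zero
    printf "%.2f\n", 100 - ((100 * gpu) / cpu)
  }'
}

# Speedup = App Duration / Estimated GPU Duration
estimated_gpu_speedup() {
  awk -v app="$1" -v est="$2" 'BEGIN {
    if (est <= 0) { exit 1 }
    printf "%.2f\n", app / est
  }'
}

# Map a savings percentage to the labels listed above.
# Assumption: values of 1% or less count as "Not Recommended".
savings_recommendation() {
  awk -v s="$1" 'BEGIN {
    if (s >= 40) print "Strongly Recommended"
    else if (s > 1) print "Recommended"
    else print "Not Recommended"
  }'
}
```

With a CPU cost of 100 and a GPU cost of 60, `estimated_gpu_savings 100 60` prints 40.00, which `savings_recommendation` maps to Strongly Recommended.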

The command creates a directory named with a unique ID that contains the following:

- The directory generated by the RAPIDS qualification tool, rapids_4_spark_qualification_output
- A CSV file that contains the summary of all the applications along with estimated absolute costs

Sample directory structure:

qual_20230314145334_d2CaFA34
├── qualification_summary.csv
└── rapids_4_spark_qualification_output
    ├── ui
    │   └── html
    │       ├── sql-recommendation.html
    │       ├── index.html
    │       ├── application.html
    │       └── raw.html
    ├── rapids_4_spark_qualification_output_stages.csv
    ├── rapids_4_spark_qualification_output.csv
    ├── rapids_4_spark_qualification_output_execs.csv
    └── rapids_4_spark_qualification_output.log
3 directories, 9 files
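
To inspect qualification_summary.csv from the command line, a small sketch along the following lines may help. The function names are hypothetical, and the parsing is deliberately naive: it assumes a header row and plain comma-separated fields without quoted commas, which may not hold for every application name.

```shell
# List the column names found in the header row of a summary CSV.
summary_columns() {
  head -n 1 "$1" | tr ',' '\n'
}

# Print every value of the column whose header matches the given name.
# Assumption: fields contain no quoted commas.
summary_column_values() {
  awk -F',' -v name="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col { print $col }
  ' "$1"
}
```

Running `summary_columns qualification_summary.csv` first shows the exact header names, which can then be passed to `summary_column_values` as its second argument.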

TCO calculator

The qualification_summary.csv output file contains two additional columns: Estimated Job Frequency (monthly) and Annual Cost Savings. These columns are intended as input to a TCO calculator that shows the long-term benefit of using Spark RAPIDS with your applications. A GSheet template with instructions can be found here: link. Make a copy of the GSheet template and then follow the instructions listed in the Instructions tab.