Quickstart
The simplest way to run the tool is using the spark-rapids-user-tools CLI. This enables you to analyze event logs from a number of CSP platforms in addition to on-prem.
When running standalone on Spark event logs, the tool can be run as a user tool command via the RAPIDS user tools pip package, or as a Java application for CSP environments (Google Dataproc, AWS EMR, and Databricks Azure/AWS) and on-prem. More details on how to use the Java application are described in java API.
For the most accurate results, it is recommended to run the latest version of the CLI tool.
Prerequisites
Set up a Python environment with a version between 3.8 and 3.10
Java 8+
The developer machine used to host the CLI tools needs internet access to download JAR dependencies from Maven: spark-*.jar, hadoop-aws-*.jar, and aws-java-sdk-bundle*.jar. If the host machine is behind a proxy, then it is recommended to install the CLI package from source using the fat mode as described in the Install the CLI Package section.
Set up the development environment for your CSP or on-prem
The tools CLI depends on the Python implementation of PyArrow, which relies on some environment variables to bind with HDFS:
HADOOP_HOME: the root of your installed Hadoop distribution. Often has “lib/native/libhdfs.so”.
JAVA_HOME: the location of your Java SDK installation.
ARROW_LIBHDFS_DIR (optional): explicit location of “libhdfs.so” if it is installed somewhere other than $HADOOP_HOME/lib/native.
Add the Hadoop jars to your CLASSPATH.
No further steps are required to run the tools on an on-premises environment, including standalone/local machines.
On Linux/macOS: export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
On Windows: %HADOOP_HOME%/bin/hadoop classpath --glob > %CLASSPATH%
For more information on HDFS requirements, refer to the PyArrow HDFS documentation
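As a minimal sketch of the HDFS binding described above (all paths are placeholders for your own Hadoop and Java installations), the variables can be exported before invoking the CLI:
# Placeholder paths; adjust to your own installation.
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
# Only needed if libhdfs.so is not under $HADOOP_HOME/lib/native
export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`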
Install gcloud CLI. Follow the instructions on gcloud-sdk-install
Set the configuration settings and credentials of the gcloud CLI:
Initialize the gcloud CLI by following these instructions
Grant authorization to the gcloud CLI with a user account
Set up “application default credentials” to the gcloud CLI by logging in
Manage gcloud CLI configurations. For more details, visit gcloud-sdk-configurations
Verify that the following gcloud CLI properties are properly defined:
dataproc/region
compute/zone
compute/region
core/project
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as CLOUDSDK_DATAPROC_REGION and CLOUDSDK_COMPUTE_REGION.
The tools CLI follows the process described in this doc to resolve the credentials. If not running on GCP, the environment variable GOOGLE_APPLICATION_CREDENTIALS is required to point to a JSON file containing credentials.
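As a minimal sketch of the gcloud setup above (project, region, zone, and key-file paths are placeholders):
gcloud init                                     # initialize and authorize the gcloud CLI
gcloud auth application-default login           # set up application default credentials
gcloud config set core/project my-project       # placeholder project
gcloud config set dataproc/region us-central1   # placeholder region
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a
gcloud config list                              # verify the properties listed above
# If not running on GCP, point to a credentials JSON file (placeholder path):
export GOOGLE_APPLICATION_CREDENTIALS=$HOME/keys/my-service-account.json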
Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started
Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.
If the AWS CLI configuration is not set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as: AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. See the full list of variables in aws-cli-configure-envvars.
Note that it is important to configure with the correct region for the bucket being used on S3. If the region is not set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials could not be pulled from the credential files.
Note: In order to run tools that require SSH on the EMR nodes (i.e., bootstrap):
make sure that you have SSH access to the cluster nodes; and
create a key pair using Amazon EC2 through the AWS CLI command
aws ec2 create-key-pair
as instructed in aws-cli-create-key-pairs.
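As a minimal sketch of the AWS CLI setup above (profile, region, and key-pair names are placeholders):
aws configure                       # interactively creates the credentials and config files
# Override defaults explicitly if needed (placeholder values):
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-west-2
# Optional: key pair for tools that require SSH on the EMR nodes
aws ec2 create-key-pair --key-name my-key-pair --query 'KeyMaterial' --output text > my-key-pair.pem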
The tool currently only supports event logs stored on S3 (no DBFS paths). The remote output storage is also expected to be S3.
Install Databricks CLI
Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.
Set the configuration settings and credentials of the Databricks CLI:
Set up authentication by following these instructions
Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: DATABRICKS_CONFIG_FILE, DATABRICKS_HOST and DATABRICKS_TOKEN. See the description of the variables in environment variables docs.
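As a minimal sketch of the Databricks CLI setup above (the host and token values are placeholders):
databricks configure                 # prompts for the workspace host and a personal access token
# Or export the values explicitly (placeholders):
export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_TOKEN=<personal-access-token>
export DATABRICKS_CONFIG_FILE=$HOME/.databrickscfg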
Set up the environment to access S3
Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started
Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.
If the AWS CLI configuration is not set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as: AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. See the full list of variables in aws-cli-configure-envvars.
Note that it is important to configure with the correct region for the bucket being used on S3. If the region is not set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials could not be pulled from the credential files.
Note: In order to run tools that require SSH on the EMR nodes (i.e., bootstrap):
make sure that you have SSH access to the cluster nodes; and
create a key pair using Amazon EC2 through the AWS CLI command
aws ec2 create-key-pair
as instructed in aws-cli-create-key-pairs.
The tool currently only supports event logs stored on ABFS. The remote output storage is also expected to be ABFS (no DBFS paths).
Install Databricks CLI
Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.
Set the configuration settings and credentials of the Databricks CLI:
Set up authentication by following these instructions
Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: DATABRICKS_CONFIG_FILE, DATABRICKS_HOST and DATABRICKS_TOKEN. See the description of the variables in environment variables docs.
Install Azure CLI
Install the Azure CLI. Follow the instructions on How to install the Azure CLI.
Set the configuration settings and credentials of the Azure CLI:
Set up the authentication by following these instructions.
Configure the Azure CLI by following these instructions.
location is used for retrieving the instance type description (default is westus).
output should use the default of json in the core section.
Verify that the configurations are stored in the file $AZURE_CONFIG_DIR/config where the default value of AZURE_CONFIG_DIR is $HOME/.azure on Linux or macOS.
If the configuration is not set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: AZURE_CONFIG_DIR and AZURE_DEFAULTS_LOCATION.
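As a minimal sketch of the Azure CLI setup above (values are placeholders):
az login                                   # authenticate the Azure CLI
az config set core.output=json             # keep the default json output format
az config set defaults.location=westus     # location used to retrieve instance type descriptions
# Override the defaults explicitly if needed (placeholder path):
export AZURE_CONFIG_DIR=$HOME/.azure
export AZURE_DEFAULTS_LOCATION=westus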
Install the CLI Package
Install spark-rapids-user-tools with one of the options below
pip install spark-rapids-user-tools
pip install <wheel-file>
Check out the code repository
git clone git@github.com:NVIDIA/spark-rapids-tools.git
cd spark-rapids-tools/user_tools
Optional: Run the project in a virtual environment
python -m venv .venv
source .venv/bin/activate
Build the wheel file using one of the following modes:
- Fat mode: Similar to a fat jar in Java, this mode solves the problem when web access is not available to download resources with URL paths (http/https). The command builds the tools jar file, downloads the necessary dependencies, and packages them with the source code into a single wheel file. You may consider this mode if the development environment has no access to download dependencies (i.e., Spark jars) during runtime.
- Default mode: This mode builds a wheel package without any jar dependencies.
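As a sketch, assuming the repository provides a build script that accepts the build mode as an argument (check the repository README for the exact command):
# Assumption: "fat" selects the fat mode described above; omit the argument for the default mode.
./build.sh fat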
Finally, install the package using the wheel file
pip install <wheel-file>
A typical workflow to successfully run the qualification command in local mode is described as follows:
Follow the instructions to set up the CLI.
Gather Spark event logs from prior runs of the applications on Spark 2.x or later. Get the location of the Apache Spark event logs generated from CPU-based Spark applications. In addition to local storage, the event logs may be stored in a valid remote storage:
For Dataproc, it should be set to the GCS path.
For EMR and Databricks-AWS, it should be set to the S3 path.
For Databricks-Azure, it should be set to the ABFS path.
Finally, run the qualification command on the set of selected event logs. The cmd helps quantify the expected acceleration and cost savings of migrating a Spark application or query to GPU. The cmd will process each app individually, but will group apps with the same name into the same output row after averaging duration metrics accordingly.
spark_rapids qualification <flag>
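For example, a hedged invocation using flags from the options table below (the platform value, bucket path, and cluster name are placeholders):
# Qualify CPU event logs stored on GCS against a Dataproc cluster (placeholder values)
spark_rapids qualification --platform dataproc --eventlogs gs://my-bucket/eventlogs/ --cluster my-cpu-cluster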
Environment Variables
In addition to the environment variables used to configure the CSP environment, the CLI has its own set of environment variables.
Before running any command, you can set environment variables to specify configurations. RAPIDS variables follow the naming pattern RAPIDS_USER_TOOLS_*:
RAPIDS_USER_TOOLS_CACHE_FOLDER: specifies the location of a local directory that the CLI uses to store and cache the downloaded resources. The default is /var/tmp/spark_rapids_user_tools_cache. Note that caching the resources locally has an impact on the total execution time of the command.
RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY: specifies the location of a local directory that the CLI uses to generate the output. The wrapper CLI arguments (i.e., --output_folder) override that environment variable.
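For example (both directories are placeholders):
export RAPIDS_USER_TOOLS_CACHE_FOLDER=/var/tmp/spark_rapids_user_tools_cache
export RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY=$HOME/qual_runs
spark_rapids qualification --eventlogs <path-to-eventlogs>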
Command Options
You can list all the options using the help argument:
spark_rapids qualification -- --help
Available options are listed in the following table.
| Option | Description | Default | Required |
|---|---|---|---|
| --eventlogs | Event log filenames or CSP storage directories containing event logs (comma separated). Skipping this argument requires that the cluster argument points to a valid cluster name on the CSP. | N/A | N |
| --cluster | Name or ID (for Databricks platforms) of the cluster, or path to the cluster properties. | N/A | N |
| --platform, -p | Defines one of the following: “on-prem”, “emr”, “dataproc”, “dataproc-gke”, “databricks-aws”, and “databricks-azure”. | N/A | N |
| --target_platform, -t | Cost savings and speedup recommendation for a comparable cluster in target_platform based on the on-prem cluster configuration. Currently only dataproc is supported. If not provided, the final report will be limited to GPU speedups only, without cost savings. | N/A | N |
| --output_folder, -o | Path to store the output. | N/A | N |
| --filter_apps, -f | Requires the cluster argument. Filtering criteria of the applications listed in the final STDOUT table, without affecting the CSV report. | TOP_CANDIDATES | N |
| --estimation_model | Model used to calculate the estimated GPU duration and cost savings. | speedups | N |
| --tools_jar | Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path or a remote cloud storage URL. If missing, the wrapper downloads the latest rapids-4-spark-tools_*.jar from the Maven repository. | N/A | N |
| --jvm_heap_size | The maximum heap size of the JVM in gigabytes. Default is calculated as a function of the total memory of the host. | N/A | N |
| --jvm_threads | Number of threads to use for parallel processing of the event log batch. Default is calculated as a function of the total number of cores and the heap size on the host. | N/A | N |
| --cpu_cluster_price | The CPU cluster hourly price (float) provided by the user. | N/A | N |
| --estimated_gpu_cluster_price | The GPU cluster hourly price provided by the user. | N/A | N |
| --cpu_discount | A percent discount for the CPU cluster cost in the form of an integer value (e.g., 30 for a 30% discount). | N/A | N |
| --gpu_discount | A percent discount for the GPU cluster cost in the form of an integer value (e.g., 30 for a 30% discount). | N/A | N |
| --global_discount | A percent discount for both the CPU and GPU cluster costs in the form of an integer value (e.g., 30 for a 30% discount). | N/A | N |
| --gpu_cluster_recommendation | Requires the cluster argument. The type of GPU cluster recommendation to generate. | MATCH | N |
| --verbose, -v | True or False to enable verbosity of the script. | N/A | N |
Cost-Savings
By default, the tool generates estimated speedups of the CPU application. In order to generate the estimated cost savings, you need to provide the CPU cluster information as input.
The tool allows passing the cluster properties (including for an on-prem cluster) using one of the following scenarios:
Cluster by name
This option is not available for on-prem cluster.
The gcloud command is used to view the details of a cluster (see gcloud SDK docs)
gcloud dataproc clusters describe cluster_name
The list-clusters command provides the status of the cluster visible to the AWS account. (see AWS CLI docs)
aws emr list-clusters --query 'Clusters[?Name==`{cluster_name}`]'
The above command outputs a list of clusters from which we can extract the cluster-id as an input for the describe-cluster cmd.
aws emr describe-cluster --cluster-id {cluster_id}
Databricks-get cmd can be used to print information about an individual cluster in a workspace.
databricks clusters get CLUSTER_ID [flags]
Cluster property file
The cluster may be deleted or offline. In this case, point to the cluster using its properties file (JSON/YAML formats).
The user defines the cluster configuration of the on-prem platform. The following sample is in yml format.
config:
  masterConfig:
    numCores: 2
    memory: 7680MiB
  workerConfig:
    numCores: 8
    memory: 7680MiB
    numWorkers: 2
Refer to the gcloud SDK docs.
Refer to the sample output of the describe-cluster cmd.
Refer to Databricks CLI docs.
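As a hedged example, a saved properties file can then be passed through the --cluster flag instead of a live cluster name (both paths are placeholders):
spark_rapids qualification --cluster ./my-cluster-props.yaml --eventlogs gs://my-bucket/eventlogs/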
Sample Commands
To see a full list of commands in detail, please visit Qualification-cmd CLI examples.
Qualification Output
The Qualification tool will run against logs from your CSP environment and then will output the applications recommended for acceleration along with Estimated GPU Speedup and cost saving metrics.
The command creates a directory with a UUID that contains the following:
A directory generated by the RAPIDS qualification tool, rapids_4_spark_qualification_output;
A CSV file that contains the summary of all the applications along with estimated absolute costs (qualification_summary.csv).
Sample output directory structure.
qual_20230314145334_d2CaFA34
├── qualification_summary.csv
└── rapids_4_spark_qualification_output/
    ├── ui/
    │   └── html/
    ...
See this listing for full details of the subdirectory rapids_4_spark_qualification_output.
In qualification_summary.csv, the command output lists the following fields for each application:
- App ID
- App Name
- App Duration
- Estimated GPU Duration
- Estimated GPU Speedup
- Estimated GPU Savings(%)
- Savings Based Recommendation
Strongly Recommended: An app with savings \(\geq\) 40%
Recommended: An app with savings between (1, 40) %
Not Recommended: An app with no savings
Not Applicable: An app that has job or stage failures.
- Speedup Based Recommendation
App ID: An application is referenced by its application ID, app-id. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not for applications in client mode. Applications in YARN cluster mode can be identified by their attempt-id.
App Name: Name of the application.
App Duration: Wall-clock time measured from when the application starts until it is completed. If an app is not completed, an estimated completion time is computed.
Estimated GPU Duration: Predicted runtime of the app if it was run on GPU. It is the sum of the accelerated operator durations and the ML function durations (if applicable), along with the durations that could not run on GPU because they are unsupported operators or are not SQL/Dataframe operations.
Estimated GPU Speedup: Estimates how much faster the application would run on GPU. It is calculated as the ratio between App Duration and Estimated GPU Duration.
Estimated GPU Savings(%): Percentage of cost savings of the app if it migrates to an accelerated cluster. It is calculated as:
\(\texttt{estimated}\_\texttt{saving} = 100 - (\frac{100 \times \texttt{gpu}\_\texttt{cost}}{\texttt{cpu}\_\texttt{cost}})\)
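For illustration only, with hypothetical hourly costs of 100 for the CPU cluster and 55.67 for the equivalent GPU cluster (numbers chosen to match the first row of the sample output below):
\(\texttt{estimated}\_\texttt{saving} = 100 - (\frac{100 \times 55.67}{100}) = 44.33\%\)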
Savings Based Recommendation: Recommendation based on Estimated GPU Savings.
Speedup Based Recommendation: Recommendation based on Estimated GPU Speedup. Note that an application that has job or stage failures will be labeled Not Applicable.
Sample of the Qualification cmd output on STDOUT:
+----+----------+---------------------+----------------------+----------------------+---------------+-----------------+-----------------+-----------------+
| | App ID | App Name | Speedup Based | Savings Based | App | Estimated GPU | Estimated GPU | Estimated GPU |
| | | | Recommendation | Recommendation | Duration(s) | Duration(s) | Speedup | Savings(%) |
|----+----------+---------------------+----------------------+----------------------+---------------+-----------------+-----------------+-----------------|
| 0 | app-0002 | spark_data_utils.py | Strongly Recommended | Strongly Recommended | 1201.72 | 220.85 | 5.44 | 44.33 |
| 3 | app-0001 | Spark shell | Strongly Recommended | Recommended | 1783.65 | 533.05 | 3.35 | 9.48 |
+----+----------+---------------------+----------------------+----------------------+---------------+-----------------+-----------------+-----------------+
For more information on the detailed output of the Qualification tool, go here: Output Details.
TCO Calculator
In addition to the above fields, Estimated Job Frequency (monthly) and Annual Cost Savings are to be used as part of a TCO calculator to see the long-term benefit of using Spark RAPIDS with your applications.
Copy the GSheet template and then follow the instructions listed in the Instructions tab.