User Guide (24.08.01)
RAPIDS Accelerator for Apache Spark - User Guide (24.08.01)

Quickstart

spark-rapids-user-tools CLI enables user to run the tool for logs from a number of CSP platforms in addition to on-prem.

Tip
  1. For most accurate results, it’s recommended to run the latest version of the CLI tool.

  2. Databricks users can run the tool using a demo notebook.

Prerequisites

  • Set up a Python environment with a version between 3.8 and 3.11

  • Java 8+

  • The developer machine used to host the CLI needs internet access. If the machine is behind a proxy, it’s recommended to install the CLI package from source using fat mode as described in Install the CLI Package.

  • Set up the development environment for your CSP or on-prem

    No more steps required to run the tools on on-premises environment including standalone/local machines.

    The tools CLI depends on Python implementation of PyArrow which relies on some environment variables to bind with HDFS:

    • HADOOP_HOME: the root of your installed Hadoop distribution. Often has “lib/native/libhdfs.so”.

    • JAVA_HOME: the location of your Java SDK installation.

    • ARROW_LIBHDFS_DIR (optional): explicit location of “libhdfs.so” if it’s installed somewhere other than $HADOOP_HOME/lib/native.

    • Add the Hadoop jars to your CLASSPATH.

      Copy
      Copied!
                  

      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

      Copy
      Copied!
                  

      %HADOOP_HOME%/bin/hadoop classpath --glob > %CLASSPATH%

    For more information on HDFS requirements, refer to the PyArrow HDFS documentation


    • Install gcloud CLI. Follow the instructions on gcloud-sdk-install

    • Set the configuration settings and credentials of the gcloud CLI:

      • Initialize the gcloud CLI by following these instructions

      • Grant authorization to the gcloud CLI with a user account

      • Set up “application default credentials” to the gcloud CLI by logging in

      • Manage gcloud CLI configurations. For more details, visit gcloud-sdk-configurations

      • Verify that the following gcloud CLI properties are properly defined:

        • dataproc/region

        • compute/zone

        • compute/region

        • core/project

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: CLOUDSDK_DATAPROC_REGION, and CLOUDSDK_COMPUTE_REGION.

      • The tools CLI follows the process described in this doc to resolve the credentials. If not running on (GCP), the environment variable GOOGLE_APPLICATION_CREDENTIALS is required to point to a JSON file containing credentials.

    • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started

    • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

    • If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables tp be picked up by the tools cmd such as: AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars

    • It’s important to configure with the correct region for the bucket being used on S3. If region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI by inspects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY emvironment variables if the credentials couldn’t be pulled from the credential files.

    Note

    In order to be able to run tools that require SSH on the EMR nodes (that is, bootstrap), then:

    • make sure that you have SSH access to the cluster nodes; and

    • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.

    The tool currently only supports event logs stored on S3 (no DBFS paths). The remote output storage is also expected to be S3. In order to get complete eventlogs for a given run-id : ` databricks clusters list | grep <run-id> databricks fs cp -r <databricks log location/<cluster id got from the above command> <destination_location> ` are a couple of commands that can be used to download all the logs associated with a given run. Please refer to the latest Databricks documentation on up-to-date information. Due to some platform limitations, it is likely that the logs may be incomplete. Thq qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

    • Install Databricks CLI

      • Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

      • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following these instructions

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: DATABRICKS_CONFIG_FILE, DATABRICKS_HOST and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs.

    • Setup the environment to access S3

      • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started

      • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

      • If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables tp be picked up by the tools cmd such as: AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars

      • It’s important to configure with the correct region for the bucket being used on S3. If region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI by inspects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY emvironment variables if the credentials couldn’t be pulled from the credential files.

      Note

      In order to be able to run tools that require SSH on the EMR nodes (that is, bootstrap), then:

      • make sure that you have SSH access to the cluster nodes; and

      • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.

    The tool currently only supports event logs stored on ABFS. The remote output storage is also expected to be ABFS (no DBFS paths). In order to get complete eventlogs for a given run-id : ` databricks clusters list | grep <run-id> databricks fs cp -r <databricks log location/<cluster id got from the above command> <destination_location> ` are a couple of commands that can be used to download all the logs associated with a given run. Please refer to the latest Databricks documentation on up-to-date information. Due to some platform limitations, it is likely that the logs may be incomplete. Thq qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

    • Install Databricks CLI

      • Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

      • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following these instructions

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: DATABRICKS_CONFIG_FILE, DATABRICKS_HOST and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs.

    • Install Azure CLI

      • Install the Azure CLI. Follow the instructions on How to install the Azure CLI.

      • Set the configuration settings and credentials of the Azure CLI:

        • Set up the authentication by following these instructions.

        • Configure the Azure CLI by following these instructions.

          • location is used for retreving instance type description (default is westus).

          • output should use default of json in core section.

          • Verify that the configurations are stored in the file $AZURE_CONFIG_DIR/config where the default value of AZURE_CONFIG_DIR is $HOME/.azure on Linux or macOS.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd such as: AZURE_CONFIG_DIR and AZURE_DEFAULTS_LOCATION.

Install the CLI Package

Copy
Copied!
            

pip install spark-rapids-user-tools

If you need more details, find in RAPIDS user tools pip package.

Copy
Copied!
            

pip install <wheel-file>

  • Checkout the code repository

    Copy
    Copied!
                

    git clone git@github.com:NVIDIA/spark-rapids-tools.git cd spark-rapids-tools/user_tools

  • Optional: Run the project in a virtual environment

    Copy
    Copied!
                

    python -m venv .venv source .venv/bin/activate

  • Build wheel file using one of the following modes:

    Fat mode

    Similar to fat jar in Java, this mode solves the problem when web access isn’t available to download resources having Url-paths (http/https). The command builds the tools jar file and downloads the necessary dependencies and packages them with the source code into a single wheel file. You may consider this mode if the development environment has no access to download dependencies (that is, Spark jars) during runtime.

    Copy
    Copied!
                

    ./build.sh fat

    Default mode

    This mode builds a wheel package without any jar dependencies

    Copy
    Copied!
                

    ./build.sh

  • Finally, install the package using the wheel file

    Copy
    Copied!
                

    pip install <wheel-file>


A typical workflow to successfully run the qualification command:

  1. Follow the instructions to set up the prerequisites and install the CLI

  2. Get Apache Spark eventlogs from prior runs of CPU based applications on Spark 2.x or later. In addition to local storage, the eventlogs should be stored in a valid remote storage:

    • For Dataproc, it should be set to GCS path.

    • For EMR and Databricks-AWS, it should be set to S3 path.

    • For Databricks-Azure, it should be set to ABFS path.

  3. Run the qualification command on the set of selected eventlogs. Event logs can be passed as single files, a directory, a comma-separated list of files or directories. The format of event logs can be raw, zip, or gzip.

Copy
Copied!
            

spark_rapids qualification <flags>


The tool helps quantify the expected acceleration of migrating a Spark application or query to GPU. The tool will process each app individually, but will group apps with the same name and cluster details into a single output row after averaging duration metrics accordingly.

Example Commands

This section shows examples of Qualification CLI commands assuming the following inputs:

  • EVENTLOG: Path to Spark eventlogs without the scheme part. The scheme can be a local file system (file://), HDFS (hdfs://), S3 (s3://), ABFS (abfss://), or GCS (gs://).

The following table shows CLI command examples along with platform and expected functionalities based on which analysis is performed.

Examples of qualification CLI commands

CMD

Platform

Copy
Copied!
            

spark_rapids qualification \ --platform dataproc \ --eventlogs gs://$EVENTLOG

Dataproc
Copy
Copied!
            

spark_rapids qualification \ --platform emr \ --eventlogs s3://$EVENTLOG

EMR
Copy
Copied!
            

spark_rapids qualification \ --platform databricks-azure \ --eventlogs file://$EVENTLOG

Databricks-Azure
Copy
Copied!
            

spark_rapids qualification \ --platform databricks-aws \ --eventlogs file://$EVENTLOG

Databricks-AWS
Copy
Copied!
            

spark_rapids qualification \ --platform onprem \ --eventlogs file://$EVENTLOG

On-prem

Command Options

You can list all the options using the help argument

Copy
Copied!
            

spark_rapids qualification -- --help


Available options are listed in the following table.

List of options for qualification CLI command

Option

Description

Default

Required

--eventlogs Event log filenames or CSP storage directories containing event logs (comma separated). Skipping this argument requires that the cluster argument points to a valid cluster name on the CSP. N/A N
--cluster The CPU cluster on which the Spark application(s) were executed. Name or ID of cluster or path to cluster property file. Further details described in Cluster Metadata. N/A N
--platform, -p Defines one of the following “on-prem”, “emr”, “dataproc”, “dataproc-gke”, “databricks-aws”, and “databricks-azure”. N/A N
--target_platform, -t Speedup recommendation for comparable cluster in target_platform based on on-prem cluster configuration. Currently only dataproc is supported. N/A N
--output_folder, -o Path to store the output. N/A N
--filter_apps, -f Requires cluster argument.
Filtering criteria of the applications listed in the final STDOUT table without affecting the CSV report:
  • ALL means no filter applied.
  • TOP_CANDIDATES lists all apps that have unsupported operators stage duration less than 25% of app duration and speedups greater than 1.3x.
TOP_CANDIDATES N
--custom_model_file Custom model file (JSON format) used to calculate the estimated GPU duration N/A N
--tools_jar Path to a bundled jar including Rapids tool. The path is a local filesystem, or remote cloud storage url. If missing, the wrapper downloads the latest rapids-4-spark-tools_*.jar from maven repository. N/A N
--jvm_heap_size The maximum heap size of the JVM in gigabytes. Default is calculated based on a function of the total memory of the host. N/A N
--jvm_threads Number of thread to use for parallel processing on the eventlogs batch. Default is calculated as a function of the total number of cores and the heap size on the host. N/A N
--gpu_cluster_recommendation Requires cluster argument.
The type of GPU cluster recommendation to generate:
  • MATCH: keep GPU cluster same number of nodes as CPU cluster
  • CLUSTER: recommend optimal GPU cluster for entire cluster to match CPU duration of longest job
  • JOB: recommend optimal GPU cluster per job to match CPU duration per job
MATCH N
--verbose, -v True or False to enable verbosity of the script. N/A N

Cluster-Metadata

By default, the qualification tool generates estimated speedups of the CPU applications. It will also generate a cluster recommendation for running on GPU. To aid that recommendation, you can provide the CPU cluster information.

The specific type of parameter passed for the cluster is based on platform, see the following scenarios to determine which method to use for your platform:

User can pass name of the cluster CLUSTER_NAME to the command:

Copy
Copied!
            

spark_rapids qualification --cluster $CLUSTER_NAME [flags]


This is supported on Dataproc and EMR platform. The tool uses the CSP CLI to collect the cluster information.

User can pass ID of the cluster CLUSTER_ID to the command:

Copy
Copied!
            

spark_rapids qualification --cluster $CLUSTER_ID [flags]


This is supported on Databricks-AWS and Databricks-Azure platform. The tool uses the CSP CLI to collect the cluster information.

User can pass CLUSTER_PROPS - the path to cluster property file (in json/yaml formats) to the command. This is useful if the cluster isn’t accessible or permanently deleted.

Copy
Copied!
            

spark_rapids qualification --cluster $CLUSTER_PROPS [flags]


User defines the cluster configuration of on-prem platform. The following is a sample cluster property file CLUSTER_PROPS in yaml format.

Copy
Copied!
            

config: masterConfig: numCores: 2 memory: 7680MiB workerConfig: numCores: 8 memory: 7680MiB numWorkers: 2

target_platform is required for on-prem clusters. Currently only Dataproc is supported.

Given Dataproc CLUSTER_NAME, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to gcloud CLI docs)

Copy
Copied!
            

gcloud dataproc clusters describe $CLUSTER_NAME > $CLUSTER_PROPS

Given EMR CLUSTER_ID, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to AWS CLI docs)

Copy
Copied!
            

aws emr describe-cluster --cluster-id $CLUSTER_ID > $CLUSTER_PROPS

Given Databricks CLUSTER_ID, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to Databricks CLI docs)

Copy
Copied!
            

databricks clusters get $CLUSTER_ID > $CLUSTER_PROPS

Given Databricks CLUSTER_ID, user can generate its cluster property file CLUSTER_PROPS using the following command. (Refer to Databricks CLI docs)

Copy
Copied!
            

databricks clusters get $CLUSTER_ID > $CLUSTER_PROPS


Qualification Output

The Qualification tool will run against logs from your CSP environment and then will output the applications recommended for acceleration along with Estimated GPU Speedup.

The command creates a directory with UUID that contains the following output:

Copy
Copied!
            

qual_20230314145334_d2CaFA34 ├── app_metadata.json ├── qualification_summary.csv ├── qualification_statistics.csv ├── intermediate_output/ ├── rapids_4_spark_qualification_output/ └── xgboost_predictions/ ...


Note

See this listing for full details of the directory qual_20230314145334_d2CaFA34.

In qualification_summary.csv, the command output lists these key fields for each application:

App ID

An application is referenced by its application ID, app-id. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not applications in client mode. Applications in YARN cluster mode can be identified by their attempt-id.

App Name

Name of the application

App Duration

Wall-Clock time measured since the application starts until it’s completed. If an app isn’t completed an estimated completion time would be computed.

Estimated GPU Duration

Predicted runtime of the app if it was run on GPU.
It’s the sum of the accelerated operator durations and ML functions duration(if applicable) along with durations that couldn’t run on GPU because they’re unsupported operators or not SQL/Dataframe.

Estimated GPU Speedup

That will estimate how much faster the application would run on GPU. It’s calculated as the ratio between App Duration and Estimated GPU Duration.

Estimated GPU Speedup Category

This is the qualification result per job to determine if the job is a good candidate for running on GPU. A value of Large, Medium, or Small indicates the job should be migrated to GPU with different levels of confidence for expected acceleration. A value of Not Recommended or Not Applicable indicates the job should not be migrated to GPU.

The qualification command output in STDOUT will also show a summary of qualified candidates.

Sample of Qualification command output in STDOUT

Copy
Copied!
            

+----+-----------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+ | | App Name | App ID | Estimated GPU | Qualified Node | Full Cluster | GPU Config | | | | | Speedup | Recommendation | Config | Recommendation | | | | | Category** | | Recommendations* | Breakdown* | |----+-----------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------| | 0 | query A0 | application_1696859475058_0007 | Large | 45 x g5.8xlarge | application_1696859475058_0007.conf | application_1696859475058_0007.log | | 1 | query A1 | application_1696859475058_0008 | Large | 9 x g5.8xlarge | application_1696859475058_0008.conf | application_1696859475058_0008.log | | 2 | query A2 | application_1696859475058_0012 | Large | 2 x g5.8xlarge | application_1696859475058_0012.conf | application_1696859475058_0012.log | | 5 | query B0 | application_1696859475058_0014 | Medium | 150 x g5.8xlarge | application_1696859475058_0014.conf | application_1696859475058_0014.log | | 6 | query B1 | application_1696859475058_0016 | Medium | 2 x g5.8xlarge | application_1696859475058_0016.conf | application_1696859475058_0016.log | | 7 | query B2 | application_1696859475058_0011 | Small | 2 x g5.8xlarge | application_1696859475058_0011.conf | application_1696859475058_0011.log | +----+-----------------+--------------------------------+-----------------+------------------+-------------------------------------+------------------------------------+ Notes: -------------------- - *Cluster config recommendations: ./qual_20240731164401_6D6B7fb8/rapids_4_spark_qualification_output/tuning - **Estimated GPU Speedup Category assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.


For more information on the detailed output of the Qualification tool, go here: Output Details.

Previous Overview
Next Output Details
© Copyright 2024, NVIDIA. Last updated on Aug 29, 2024.