RAPIDS Accelerator for Apache Spark - User Guide (24.08.01)

Quickstart

The simplest way to run the tool is with the spark-rapids-user-tools CLI. This enables you to run against event logs from a number of CSP platforms in addition to on-prem.

When running the tool standalone on Spark event logs, it can be run as a user tool command via the RAPIDS user tools pip package for CSP environments (Google Dataproc, AWS EMR, and Databricks-Azure/AWS), or as a Java application for other environments. More details on how to use the Java application are described in java API.

Tip
  1. For the most accurate results, it’s recommended to run the latest version of the CLI tool.

  2. Databricks users can run the tool using a demo notebook.

Prerequisites

  • Set up a Python environment with a version between 3.8 and 3.11

  • Java 8+

  • The developer machine used to host the CLI tools needs internet access to download JAR dependencies from mvn: spark-*.jar, hadoop-aws-*.jar, and aws-java-sdk-bundle*.jar. If the host machine is behind a proxy, then it’s recommended to install the CLI package from source using the fat mode as described in the Install the CLI Package section.

  • Set the development environment for your CSP

    On-Prem

    No additional steps are required to run the tools in an on-premises environment, including standalone/local machines.

    The tools CLI depends on the Python implementation of PyArrow, which relies on some environment variables to bind with HDFS (a combined example follows this list):

    • HADOOP_HOME: the root of your installed Hadoop distribution. Often has “lib/native/libhdfs.so”.

    • JAVA_HOME: the location of your Java SDK installation.

    • ARROW_LIBHDFS_DIR (optional): explicit location of “libhdfs.so” if it’s installed somewhere other than $HADOOP_HOME/lib/native.

    • Add the Hadoop jars to your CLASSPATH.

      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

      On Windows:

      %HADOOP_HOME%/bin/hadoop classpath --glob > %CLASSPATH%

    For more information on HDFS requirements, refer to the PyArrow HDFS documentation.
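    A minimal sketch of the HDFS binding described above, assuming Hadoop is installed under /opt/hadoop and Java under /usr/lib/jvm/java-8-openjdk-amd64 (both paths are placeholders for illustration):

      # Point PyArrow at the local Hadoop and Java installations
      export HADOOP_HOME=/opt/hadoop
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      # Only needed if libhdfs.so is not under $HADOOP_HOME/lib/native
      export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
      # Add the Hadoop jars to the CLASSPATH as shown above
      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`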


    Dataproc

    • Install the gcloud CLI. Follow the instructions on gcloud-sdk-install

    • Set the configuration settings and credentials of the gcloud CLI:

      • Initialize the gcloud CLI by following these instructions

      • Grant authorization to the gcloud CLI with a user account

      • Set up “application default credentials” to the gcloud CLI by logging in

      • Manage gcloud CLI configurations. For more details, visit gcloud-sdk-configurations

      • Verify that the following gcloud CLI properties are properly defined:

        • dataproc/region

        • compute/zone

        • compute/region

        • core/project

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as CLOUDSDK_DATAPROC_REGION and CLOUDSDK_COMPUTE_REGION.

      • The tools CLI follows the process described in this doc to resolve the credentials. If not running on GCP, the environment variable GOOGLE_APPLICATION_CREDENTIALS is required to point to a JSON file containing credentials (see the sketch below).
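    A minimal sketch of the gcloud setup described above; the project, region, and zone values are placeholders:

      # Verify or set the gcloud properties the tools CLI expects
      gcloud config set core/project my-project
      gcloud config set dataproc/region us-central1
      gcloud config set compute/region us-central1
      gcloud config set compute/zone us-central1-a

      # If not using the default gcloud configuration, export the equivalent variables
      export CLOUDSDK_DATAPROC_REGION=us-central1
      export CLOUDSDK_COMPUTE_REGION=us-central1

      # When running outside GCP, point to a service-account key file
      export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json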

    EMR

    • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started

    • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

    • If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars.

    • It’s important to configure the correct region for the bucket being used on S3. If the region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials couldn’t be pulled from the credential files (see the example after the following note).

    Note

    In order to run tools that require SSH on the EMR nodes (that is, bootstrap):

    • make sure that you have SSH access to the cluster nodes; and

    • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.
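    A minimal sketch of the AWS CLI setup described above; the profile name, region, and key-pair name are placeholders:

      # Create the credentials and config files interactively
      aws configure --profile my-emr-profile

      # Make sure the tools CLI picks up the right profile and region
      export AWS_PROFILE=my-emr-profile
      export AWS_DEFAULT_REGION=us-west-2   # must match the region of the S3 bucket

      # Key pair for tools that need SSH access to the EMR nodes (for example, bootstrap)
      aws ec2 create-key-pair --key-name my-emr-key --query 'KeyMaterial' --output text > my-emr-key.pem
      chmod 400 my-emr-key.pem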

    Databricks-AWS

    The tool currently only supports event logs stored on S3 (no DBFS paths). The remote output storage is also expected to be S3. To download all the event logs associated with a given run-id, a couple of commands that can be used are:

      databricks clusters list | grep <run-id>
      databricks fs cp -r <databricks log location>/<cluster id from the above command> <destination_location>

    Refer to the latest Databricks documentation for up-to-date information. Due to some platform limitations, it is likely that the logs may be incomplete. The qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

    • Install Databricks CLI

      • Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

      • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following these instructions

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs; a combined example follows.
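      A minimal sketch of the Databricks authentication setup described above; the workspace URL and token values are placeholders:

        # Either keep the credentials in ~/.databrickscfg ...
        # [DEFAULT]
        # host  = https://<workspace-url>
        # token = <personal-access-token>

        # ... or export them so the tools CLI can pick them up
        export DATABRICKS_HOST=https://<workspace-url>
        export DATABRICKS_TOKEN=<personal-access-token>
        # Only needed if the config file lives somewhere other than ~/.databrickscfg
        export DATABRICKS_CONFIG_FILE=/path/to/.databrickscfg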

    • Set up the environment to access S3

      • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started

      • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

      • If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars.

      • It’s important to configure the correct region for the bucket being used on S3. If the region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials couldn’t be pulled from the credential files.

      Note

      In order to run tools that require SSH on the EMR nodes (that is, bootstrap):

      • make sure that you have SSH access to the cluster nodes; and

      • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.

    Databricks-Azure

    The tool currently only supports event logs stored on ABFS. The remote output storage is also expected to be ABFS (no DBFS paths). To download all the event logs associated with a given run-id, a couple of commands that can be used are:

      databricks clusters list | grep <run-id>
      databricks fs cp -r <databricks log location>/<cluster id from the above command> <destination_location>

    Refer to the latest Databricks documentation for up-to-date information. Due to some platform limitations, it is likely that the logs may be incomplete. The qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

    • Install Databricks CLI

      • Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

      • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following these instructions

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs.

    • Install Azure CLI

      • Install the Azure CLI. Follow the instructions on How to install the Azure CLI.

      • Set the configuration settings and credentials of the Azure CLI:

        • Set up the authentication by following these instructions.

        • Configure the Azure CLI by following these instructions.

          • location is used for retrieving the instance type description (default is westus).

          • output should use the default of json in the core section.

          • Verify that the configurations are stored in the file $AZURE_CONFIG_DIR/config where the default value of AZURE_CONFIG_DIR is $HOME/.azure on Linux or macOS.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AZURE_CONFIG_DIR and AZURE_DEFAULTS_LOCATION; a combined example follows.
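      A minimal sketch of the Azure CLI setup described above; the location and configuration directory are placeholders:

        # Sign in and set the defaults the tools CLI relies on
        az login
        az config set defaults.location=westus
        az config set core.output=json

        # Only needed if a non-default configuration directory is used
        export AZURE_CONFIG_DIR=$HOME/.azure-tools
        export AZURE_DEFAULTS_LOCATION=westus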

Install the CLI Package

  • Install spark-rapids-user-tools with one of the options below

    Option 1: install from PyPI

    pip install spark-rapids-user-tools

    Option 2: build from source and install the resulting wheel file

    pip install <wheel-file>

    • Check out the code repository

      git clone git@github.com:NVIDIA/spark-rapids-tools.git
      cd spark-rapids-tools/user_tools

    • Optional: Run the project in a virtual environment

      python -m venv .venv
      source .venv/bin/activate

    • Build wheel file using one of the following modes:

      Fat mode

      Similar to a fat jar in Java, this mode addresses the case where web access isn’t available to download resources from URL paths (http/https). The command builds the tools jar file, downloads the necessary dependencies, and packages them with the source code into a single wheel file. You may consider this mode if the environment has no access to download dependencies (that is, Spark jars) at runtime.


      ./build.sh fat

      Default mode

      This mode builds a wheel package without any jar dependencies.


      ./build.sh

    • Finally, install the package using the wheel file


      pip install <wheel-file>

A typical workflow to successfully run the profiling command in local mode is described as follows:

  1. Follow the instructions to set up the CLI

  2. Gather Spark event logs from prior runs of the applications on Spark 2.x or later. Get the location of the Apache Spark event logs generated from CPU-based Spark applications. In addition to local storage, the event logs can be stored in a valid remote storage:

    • For Dataproc, it should be set to the GCS path.

    • For EMR and Databricks-AWS, it should be set to the S3 path.

    • For Databricks-Azure, it should be set to the ABFS path.

Finally, run the profiling command on the set of selected event logs.


spark_rapids profiling <flag>
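For example (the event-log paths and output folder below are placeholders):

  # Profile local event logs from an on-prem cluster
  spark_rapids profiling --platform onprem --eventlogs /path/to/eventlogs --output_folder ./prof_output

  # Profile event logs stored on S3 for an EMR cluster
  spark_rapids profiling --platform emr --eventlogs s3://my-bucket/eventlogs/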


Environment Variables

In addition to the environment variables used to configure the CSP environment, the CLI has its own set of environment variables.

Before running any command, you can set environment variables to specify configurations. RAPIDS variables have a naming pattern RAPIDS_USER_TOOLS_*:

  1. RAPIDS_USER_TOOLS_CACHE_FOLDER: specifies the location of a local directory that the CLI uses to store and cache the downloaded resources. The default is /var/tmp/spark_rapids_user_tools_cache. Caching the resources locally reduces the total execution time of subsequent command runs.

  2. RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY: specifies the location of a local directory that the CLI uses to generate the output. The --output_folder CLI argument overrides this environment variable. A short example of both variables follows.
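A minimal sketch of overriding both locations; the directories and event-log path are placeholders:

  # Store cached resources and command output under custom directories
  export RAPIDS_USER_TOOLS_CACHE_FOLDER=/data/rapids_tools_cache
  export RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY=/data/rapids_tools_output
  spark_rapids profiling --eventlogs /path/to/eventlogs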

Command Options

You can list all the options using the help argument:


spark_rapids profiling -- --help


Available options for the profiling cmd are listed below. All options default to N/A and none is required.

--eventlogs, -e
  Event log filenames or CSP storage directories containing event logs (comma separated). Skipping this argument requires that the cluster argument points to a valid cluster name on the CSP.

--cluster, -c
  The cluster on which the Spark applications were executed. The argument can be a cluster name or ID (for Databricks platforms) or a valid path to the cluster’s properties file (JSON format) generated by the CSP SDK.

--platform, -p
  Defines one of the following: “onprem”, “emr”, “dataproc”, “databricks-aws”, and “databricks-azure”.

--driverlog, -d
  Valid path to the GPU driver log file.

--output_folder, -o
  Path to store the output.

--tools_jar, -t
  Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path or a remote cloud storage URL. If missing, the wrapper downloads the latest rapids-4-spark-tools_*.jar from the Maven repository.

--jvm_heap_size
  The maximum heap size of the JVM in gigabytes. Default is calculated as a function of the total memory of the host.

--jvm_threads
  Number of threads to use for parallel processing of the event logs batch. Default is calculated as a function of the total number of cores and the heap size on the host.

--verbose, -v
  True or False to enable verbosity of the script.

Sample Commands

To see a full list of commands in detail, visit the Profiling-cmd CLI examples. One representative example follows.
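A hedged example of a Dataproc profiling run; the cluster name and GCS path are placeholders:

  # Profile event logs stored on GCS, using the cluster properties for recommendations
  spark_rapids profiling --platform dataproc --cluster my-dataproc-cluster --eventlogs gs://my-bucket/eventlogs/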

Profiling Output

The tool runs against event logs from your CSP environment, collects information on each application individually, and outputs a file per application. It generates a profile summary named profile.log for each application under “rapids_4_spark_profile/{APP_ID}”, along with a top-level profiling_summary.log.

Sample output directory structure.

Profiling tool output: ~/prof_20240105163618_9e2B995F/rapids_4_spark_profile

prof_20240105163618_9e2B995F
├── rapids_4_spark_profile
│   ├── appID-01
│   │   └── profile.log
│   ├── appID-02
│   │   └── profile.log
│   ├── ..
│   │   └── profile.log
│   └── appID-74
│       └── profile.log
├── profiling_summary.log
└── runtime_properties.log

74 directories, 74 files

Refer to Understanding The Profiling Output for full details of the content of profile.log.

The tool prints a list of recommended settings per application. The STDOUT includes both a list view and a tabular format of the recommendations.

Recommendations summary for a sample application (shown in tabular form in the STDOUT).

App ID: App-01
App Name: sanity_test

Recommendations:
  --conf spark.executor.cores=16
  --conf spark.executor.instances=8
  --conf spark.executor.memory=32768m
  --conf spark.kubernetes.memoryOverheadFactor=8396m
  --conf spark.rapids.memory.pinnedPool.size=4096m
  --conf spark.rapids.shuffle.multiThreaded.reader.threads=16
  --conf spark.rapids.shuffle.multiThreaded.writer.threads=16
  --conf spark.rapids.sql.concurrentGpuTasks=2
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark322.RapidsShuffleManager
  --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=128m
  --conf spark.sql.adaptive.coalescePartitions.minPartitionSize=4m
  --conf spark.sql.files.maxPartitionBytes=529m
  --conf spark.sql.shuffle.partitions=200
  --conf spark.task.resource.gpu.amount=0.0625

Comments:
  - 'spark.rapids.shuffle.multiThreaded.reader.threads' wasn't set.
  - 'spark.rapids.shuffle.multiThreaded.writer.threads' wasn't set.
  - 'spark.rapids.sql.concurrentGpuTasks' wasn't set.
  - 'spark.shuffle.manager' wasn't set.
  - 'spark.sql.adaptive.advisoryPartitionSizeInBytes' wasn't set.
  - 'spark.sql.adaptive.coalescePartitions.minPartitionSize' wasn't set.
  - 'spark.sql.adaptive.enabled' should be enabled for better performance.
  - 'spark.sql.files.maxPartitionBytes' wasn't set.
  - 'spark.sql.shuffle.partitions' wasn't set.
  - A newer RAPIDS Accelerator for Apache Spark plugin is available:
    https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar
    Version used in application is 22.12.0.
  - The RAPIDS Shuffle Manager requires spark.driver.extraClassPath and spark.executor.extraClassPath
    settings to include the path to the Spark RAPIDS plugin jar. If the Spark RAPIDS jar is being
    bundled with your Spark distribution, this step isn't needed.
