RAPIDS Accelerator for Apache Spark - User Guide (24.08.01)

Quickstart

The simplest way to run the tool is with the spark-rapids-user-tools CLI. This enables you to run against event logs from a number of CSP platforms in addition to on-prem.

When running the tool standalone on Spark event logs, it can be run as a user tool command via the RAPIDS user tools pip package for CSP environments (Google Dataproc, AWS EMR, and Databricks-Azure/AWS), or as a Java application for other environments. More details on how to use the Java application are described in java API.

Tip
  1. For the most accurate results, it’s recommended to run the latest version of the CLI tool.

  2. Databricks users can run the tool using a demo notebook.

Prerequisites

  • Set up a Python environment with a version between 3.8 and 3.11

  • Java 8+

  • The developer machine used to host the CLI tools needs internet access to download JAR dependencies from mvn: spark-*.jar, hadoop-aws-*.jar, and aws-java-sdk-bundle*.jar. If the host machine is behind a proxy, then it’s recommended to install the CLI package from source using the fat mode as described in the Install the CLI Package section.

  • Set the development environment for your CSP

    On-Prem

    No additional steps are required to run the tools in an on-premises environment, including standalone/local machines.

    The tools CLI depends on the Python implementation of PyArrow, which relies on some environment variables to bind with HDFS (a combined example follows this list):

    • HADOOP_HOME: the root of your installed Hadoop distribution. Often has “lib/native/libhdfs.so”.

    • JAVA_HOME: the location of your Java SDK installation.

    • ARROW_LIBHDFS_DIR (optional): explicit location of “libhdfs.so” if it’s installed somewhere other than $HADOOP_HOME/lib/native.

    • Add the Hadoop jars to your CLASSPATH.

      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

      On Windows:

      %HADOOP_HOME%/bin/hadoop classpath --glob > %CLASSPATH%

    For more information on HDFS requirements, refer to the PyArrow HDFS documentation.
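    A minimal sketch of the HDFS binding described above, assuming Hadoop is installed under /opt/hadoop and Java under /usr/lib/jvm/java-8-openjdk-amd64 (both paths are placeholders for illustration):

      # Point PyArrow at the local Hadoop and Java installations
      export HADOOP_HOME=/opt/hadoop
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      # Only needed if libhdfs.so is not under $HADOOP_HOME/lib/native
      export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
      # Add the Hadoop jars to the CLASSPATH as shown above
      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`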


    Dataproc

    • Install the gcloud CLI. Follow the instructions on gcloud-sdk-install

    • Set the configuration settings and credentials of the gcloud CLI:

      • Initialize the gcloud CLI by following these instructions

      • Grant authorization to the gcloud CLI with a user account

      • Set up “application default credentials” to the gcloud CLI by logging in

      • Manage gcloud CLI configurations. For more details, visit gcloud-sdk-configurations

      • Verify that the following gcloud CLI properties are properly defined:

        • dataproc/region

        • compute/zone

        • compute/region

        • core/project

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as CLOUDSDK_DATAPROC_REGION and CLOUDSDK_COMPUTE_REGION.

      • The tools CLI follows the process described in this doc to resolve the credentials. If not running on GCP, the environment variable GOOGLE_APPLICATION_CREDENTIALS is required to point to a JSON file containing credentials (see the sketch below).
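    A minimal sketch of the gcloud setup described above; the project, region, and zone values are placeholders:

      # Verify or set the gcloud properties the tools CLI expects
      gcloud config set core/project my-project
      gcloud config set dataproc/region us-central1
      gcloud config set compute/region us-central1
      gcloud config set compute/zone us-central1-a

      # If not using the default gcloud configuration, export the equivalent variables
      export CLOUDSDK_DATAPROC_REGION=us-central1
      export CLOUDSDK_COMPUTE_REGION=us-central1

      # When running outside GCP, point to a service-account key file
      export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json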

    EMR

    • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started

    • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

    • If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars.

    • It’s important to configure the correct region for the bucket being used on S3. If the region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials couldn’t be pulled from the credential files (see the example after the following note).

    Note

    In order to run tools that require SSH on the EMR nodes (that is, bootstrap):

    • make sure that you have SSH access to the cluster nodes; and

    • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.
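    A minimal sketch of the AWS CLI setup described above; the profile name, region, and key-pair name are placeholders:

      # Create the credentials and config files interactively
      aws configure --profile my-emr-profile

      # Make sure the tools CLI picks up the right profile and region
      export AWS_PROFILE=my-emr-profile
      export AWS_DEFAULT_REGION=us-west-2   # must match the region of the S3 bucket

      # Key pair for tools that need SSH access to the EMR nodes (for example, bootstrap)
      aws ec2 create-key-pair --key-name my-emr-key --query 'KeyMaterial' --output text > my-emr-key.pem
      chmod 400 my-emr-key.pem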

    Databricks-AWS

    The tool currently only supports event logs stored on S3 (no DBFS paths). The remote output storage is also expected to be S3. To download all the event logs associated with a given run-id, a couple of commands that can be used are:

      databricks clusters list | grep <run-id>
      databricks fs cp -r <databricks log location>/<cluster id from the above command> <destination_location>

    Refer to the latest Databricks documentation for up-to-date information. Due to some platform limitations, it is likely that the logs may be incomplete. The qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

    • Install Databricks CLI

      • Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

      • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following these instructions

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs; a combined example follows.
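      A minimal sketch of the Databricks authentication setup described above; the workspace URL and token values are placeholders:

        # Either keep the credentials in ~/.databrickscfg ...
        # [DEFAULT]
        # host  = https://<workspace-url>
        # token = <personal-access-token>

        # ... or export them so the tools CLI can pick them up
        export DATABRICKS_HOST=https://<workspace-url>
        export DATABRICKS_TOKEN=<personal-access-token>
        # Only needed if the config file lives somewhere other than ~/.databrickscfg
        export DATABRICKS_CONFIG_FILE=/path/to/.databrickscfg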

    • Set up the environment to access S3

      • Install the AWS CLI version 2. Follow the instructions on aws-cli-getting-started

      • Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

      • If the AWS CLI configuration isn’t set to the default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars.

      • It’s important to configure the correct region for the bucket being used on S3. If the region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials couldn’t be pulled from the credential files.

      Note

      In order to run tools that require SSH on the EMR nodes (that is, bootstrap):

      • make sure that you have SSH access to the cluster nodes; and

      • create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.

    Databricks-Azure

    The tool currently only supports event logs stored on ABFS. The remote output storage is also expected to be ABFS (no DBFS paths). To download all the event logs associated with a given run-id, a couple of commands that can be used are:

      databricks clusters list | grep <run-id>
      databricks fs cp -r <databricks log location>/<cluster id from the above command> <destination_location>

    Refer to the latest Databricks documentation for up-to-date information. Due to some platform limitations, it is likely that the logs may be incomplete. The qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

    • Install Databricks CLI

      • Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

      • Set the configuration settings and credentials of the Databricks CLI:

      • Set up authentication by following these instructions

      • Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by environment variable DATABRICKS_CONFIG_FILE.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. Refer to the description of the variables in environment variables docs.

    • Install Azure CLI

      • Install the Azure CLI. Follow the instructions on How to install the Azure CLI.

      • Set the configuration settings and credentials of the Azure CLI:

        • Set up the authentication by following these instructions.

        • Configure the Azure CLI by following these instructions.

          • location is used for retrieving the instance type description (default is westus).

          • output should use the default of json in the core section.

          • Verify that the configurations are stored in the file $AZURE_CONFIG_DIR/config where the default value of AZURE_CONFIG_DIR is $HOME/.azure on Linux or macOS.

      • If the configuration isn’t set to default values, then make sure to explicitly set some environment variables to be picked up by the tools cmd, such as AZURE_CONFIG_DIR and AZURE_DEFAULTS_LOCATION; a combined example follows.
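      A minimal sketch of the Azure CLI setup described above; the location and configuration directory are placeholders:

        # Sign in and set the defaults the tools CLI relies on
        az login
        az config set defaults.location=westus
        az config set core.output=json

        # Only needed if a non-default configuration directory is used
        export AZURE_CONFIG_DIR=$HOME/.azure-tools
        export AZURE_DEFAULTS_LOCATION=westus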

Install the CLI Package

  • Install spark-rapids-user-tools with one of the options below

    Option 1: install from PyPI

    pip install spark-rapids-user-tools

    Option 2: build from source and install the resulting wheel file

    pip install <wheel-file>

    • Check out the code repository

      git clone git@github.com:NVIDIA/spark-rapids-tools.git
      cd spark-rapids-tools/user_tools

    • Optional: Run the project in a virtual environment

      python -m venv .venv
      source .venv/bin/activate

    • Build wheel file using one of the following modes:

      Fat mode

      Similar to a fat jar in Java, this mode addresses the case where web access isn’t available to download resources from URL paths (http/https). The command builds the tools jar file, downloads the necessary dependencies, and packages them with the source code into a single wheel file. You may consider this mode if the environment has no access to download dependencies (that is, Spark jars) at runtime.


      ./build.sh fat

      Default mode

      This mode builds a wheel package without any jar dependencies.


      ./build.sh

    • Finally, install the package using the wheel file


      pip install <wheel-file>

A typical workflow to successfully run the profiling command in local mode is described as follows:

  1. Follow the instructions to set up the CLI

  2. Gather Spark event logs from prior runs of the applications on Spark 2.x or later. Get the location of the Apache Spark event logs generated from CPU-based Spark applications. In addition to local storage, the event logs can be stored in a valid remote storage:

    • For Dataproc, it should be set to the GCS path.

    • For EMR and Databricks-AWS, it should be set to the S3 path.

    • For Databricks-Azure, it should be set to the ABFS path.

Finally, run the profiling command on the set of selected event logs.


spark_rapids profiling <flag>
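For example (the event-log paths and output folder below are placeholders):

  # Profile local event logs from an on-prem cluster
  spark_rapids profiling --platform onprem --eventlogs /path/to/eventlogs --output_folder ./prof_output

  # Profile event logs stored on S3 for an EMR cluster
  spark_rapids profiling --platform emr --eventlogs s3://my-bucket/eventlogs/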


Environment Variables

In addition to the environment variables used to configure the CSP environment, the CLI has its own set of environment variables.

Before running any command, you can set environment variables to specify configurations. RAPIDS variables have a naming pattern RAPIDS_USER_TOOLS_*:

  1. RAPIDS_USER_TOOLS_CACHE_FOLDER: specifies the location of a local directory that the CLI uses to store and cache the downloaded resources. The default is /var/tmp/spark_rapids_user_tools_cache. Caching the resources locally reduces the total execution time of subsequent command runs.

  2. RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY: specifies the location of a local directory that the CLI uses to generate the output. The --output_folder CLI argument overrides this environment variable. A short example of both variables follows.
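A minimal sketch of overriding both locations; the directories and event-log path are placeholders:

  # Store cached resources and command output under custom directories
  export RAPIDS_USER_TOOLS_CACHE_FOLDER=/data/rapids_tools_cache
  export RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY=/data/rapids_tools_output
  spark_rapids profiling --eventlogs /path/to/eventlogs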

Command Options

You can list all the options using the help argument:


spark_rapids profiling -- --help


Available options for the profiling cmd are listed below. All options default to N/A and none is required.

--eventlogs, -e
  Event log filenames or CSP storage directories containing event logs (comma separated). Skipping this argument requires that the cluster argument points to a valid cluster name on the CSP.

--cluster, -c
  The cluster on which the Spark applications were executed. The argument can be a cluster name or ID (for Databricks platforms) or a valid path to the cluster’s properties file (JSON format) generated by the CSP SDK.

--platform, -p
  Defines one of the following: “onprem”, “emr”, “dataproc”, “databricks-aws”, and “databricks-azure”.

--driverlog, -d
  Valid path to the GPU driver log file.

--output_folder, -o
  Path to store the output.

--tools_jar, -t
  Path to a bundled jar including the RAPIDS tool. The path is a local filesystem path or a remote cloud storage URL. If missing, the wrapper downloads the latest rapids-4-spark-tools_*.jar from the Maven repository.

--jvm_heap_size
  The maximum heap size of the JVM in gigabytes. Default is calculated as a function of the total memory of the host.

--jvm_threads
  Number of threads to use for parallel processing of the event logs batch. Default is calculated as a function of the total number of cores and the heap size on the host.

--verbose, -v
  True or False to enable verbosity of the script.

Sample Commands

To see a full list of commands in detail, visit the Profiling-cmd CLI examples. One representative example follows.
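A hedged example of a Dataproc profiling run; the cluster name and GCS path are placeholders:

  # Profile event logs stored on GCS, using the cluster properties for recommendations
  spark_rapids profiling --platform dataproc --cluster my-dataproc-cluster --eventlogs gs://my-bucket/eventlogs/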

Profiling Output

The tool runs against event logs from your CSP environment, collects information on each application individually, and outputs a file per application. It generates a profile summary named profile.log for each application under “rapids_4_spark_profile/{APP_ID}”, along with a top-level profiling_summary.log.

Sample output directory structure.

Profiling tool output: ~/prof_20240105163618_9e2B995F/rapids_4_spark_profile

prof_20240105163618_9e2B995F
├── rapids_4_spark_profile
│   ├── appID-01
│   │   └── profile.log
│   ├── appID-02
│   │   └── profile.log
│   ├── ..
│   │   └── profile.log
│   └── appID-74
│       └── profile.log
├── profiling_summary.log
└── runtime_properties.log

74 directories, 74 files

Refer to Understanding The Profiling Output for full details of the content of profile.log.

The tool prints a list of recommended settings per application. The STDOUT includes both a list view and a tabular format of the recommendations.

Recommendations summary for a sample application (shown in tabular form in the STDOUT).

App ID: App-01
App Name: sanity_test

Recommendations:
  --conf spark.executor.cores=16
  --conf spark.executor.instances=8
  --conf spark.executor.memory=32768m
  --conf spark.kubernetes.memoryOverheadFactor=8396m
  --conf spark.rapids.memory.pinnedPool.size=4096m
  --conf spark.rapids.shuffle.multiThreaded.reader.threads=16
  --conf spark.rapids.shuffle.multiThreaded.writer.threads=16
  --conf spark.rapids.sql.concurrentGpuTasks=2
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark322.RapidsShuffleManager
  --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=128m
  --conf spark.sql.adaptive.coalescePartitions.minPartitionSize=4m
  --conf spark.sql.files.maxPartitionBytes=529m
  --conf spark.sql.shuffle.partitions=200
  --conf spark.task.resource.gpu.amount=0.0625

Comments:
  - 'spark.rapids.shuffle.multiThreaded.reader.threads' wasn't set.
  - 'spark.rapids.shuffle.multiThreaded.writer.threads' wasn't set.
  - 'spark.rapids.sql.concurrentGpuTasks' wasn't set.
  - 'spark.shuffle.manager' wasn't set.
  - 'spark.sql.adaptive.advisoryPartitionSizeInBytes' wasn't set.
  - 'spark.sql.adaptive.coalescePartitions.minPartitionSize' wasn't set.
  - 'spark.sql.adaptive.enabled' should be enabled for better performance.
  - 'spark.sql.files.maxPartitionBytes' wasn't set.
  - 'spark.sql.shuffle.partitions' wasn't set.
  - A newer RAPIDS Accelerator for Apache Spark plugin is available:
    https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar
    Version used in application is 22.12.0.
  - The RAPIDS Shuffle Manager requires spark.driver.extraClassPath and spark.executor.extraClassPath
    settings to include the path to the Spark RAPIDS plugin jar. If the Spark RAPIDS jar is being
    bundled with your Spark distribution, this step isn't needed.
