Date: 2026-03-05

The spark-rapids-user-tools CLI enables users to run the tool on logs from a number of CSP platforms in addition to on-prem.

For the most accurate results, run the latest version of the CLI tool.

The fat wheel is recommended for development environments that do not have Spark configured and cannot access external dependencies (such as Spark jars) during runtime. A fat wheel package is available with each release starting from 24.10.1, on the RAPIDS user tools releases page.

Build the wheel file using one of the following modes:

Recommended: Run the project in a virtual environment to isolate the dependencies.

This installation method is recommended for development environments that do not have Spark configured and cannot access external dependencies (such as Spark jars) during runtime. Download the wheel file from the “Assets” section of any release in RAPIDS user tools releases.

For more details, see the RAPIDS user tools pip package.

Install spark-rapids-user-tools with one of the options below

It is recommended to set up a Python virtual environment to avoid conflicts between package dependencies.

Users who have multiple Databricks profiles can switch between them by setting the environment variable RAPIDS_USER_TOOLS_DATABRICKS_PROFILE.

Install the Azure CLI. Follow the instructions on How to install the Azure CLI.

Set the configuration settings and credentials of the Azure CLI:

Set up the authentication by following these instructions.

Configure the Azure CLI by following these instructions.

location is used for retrieving the instance type description (default is westus).

output should use the default of json in the core section.

Verify that the configurations are stored in the file $AZURE_CONFIG_DIR/config, where the default value of AZURE_CONFIG_DIR is $HOME/.azure on Linux or macOS.

If the configuration isn’t set to default values, explicitly set the environment variables to be picked up by the tools command, such as AZURE_CONFIG_DIR and AZURE_DEFAULTS_LOCATION.
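As a minimal sketch of the Azure CLI settings described above, the snippet below writes a sample config file and points AZURE_CONFIG_DIR at it. The scratch directory is an assumption made so the example does not touch a real $HOME/.azure. Placing location under a [defaults] section follows the usual Azure CLI config layout; it is not stated in this guide.

```shell
# Use a throwaway directory so a real $HOME/.azure is left untouched (illustrative only).
AZURE_CONFIG_DIR="$(mktemp -d)"
export AZURE_CONFIG_DIR

# Write the two settings mentioned above: JSON output and a default location.
cat > "$AZURE_CONFIG_DIR/config" <<'EOF'
[core]
output = json

[defaults]
location = westus
EOF

# The default location can also be supplied through the environment.
export AZURE_DEFAULTS_LOCATION=westus

# Show the [core] section to confirm the file is where the tools look for it.
grep -A1 '^\[core\]' "$AZURE_CONFIG_DIR/config"
```

For a real setup, leave AZURE_CONFIG_DIR unset so that the default location under $HOME/.azure applies, or export it to point at the directory that holds your existing config file.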

Install the Databricks CLI version 0.200+. Follow the instructions on Install the CLI.

Set the configuration settings and credentials of the Databricks CLI:

Set up authentication by following these instructions.

Verify that the access credentials are stored in the file ~/.databrickscfg on Unix, Linux, or macOS, or in another file defined by the environment variable DATABRICKS_CONFIG_FILE.

If the configuration isn’t set to default values, explicitly set the environment variables to be picked up by the tools command, such as DATABRICKS_CONFIG_FILE, DATABRICKS_HOST, and DATABRICKS_TOKEN. Refer to the description of the variables in the environment variables docs.
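The Databricks settings above can be sketched as follows. The snippet writes a sample credentials file with two profiles and selects one through RAPIDS_USER_TOOLS_DATABRICKS_PROFILE. The host names, the token placeholders, and the profile name staging are hypothetical, and a temporary file stands in for ~/.databrickscfg so that no real credentials are overwritten.

```shell
# Keep the sample out of ~/.databrickscfg; the temporary path is illustrative.
DATABRICKS_CONFIG_FILE="$(mktemp)"
export DATABRICKS_CONFIG_FILE

# Placeholder host and token values; replace them with real workspace details.
cat > "$DATABRICKS_CONFIG_FILE" <<'EOF'
[DEFAULT]
host = https://example.cloud.databricks.com
token = <personal-access-token>

[staging]
host = https://staging.example.cloud.databricks.com
token = <personal-access-token>
EOF

# Select a non-default profile for the tools CLI.
export RAPIDS_USER_TOOLS_DATABRICKS_PROFILE=staging

# List the profiles defined in the file.
grep '^\[' "$DATABRICKS_CONFIG_FILE"
```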

The tool currently only supports event logs stored on ABFS. The remote output storage is also expected to be ABFS (no DBFS paths). To get the complete eventlogs for a given run-id, the following commands can be used to download all the logs associated with that run: first `databricks clusters list | grep <run-id>`, then `databricks fs cp -r <databricks log location>/<cluster id got from the above command> <destination_location>`. Please refer to the latest Databricks documentation for up-to-date information. Due to some platform limitations, the logs may be incomplete. The qualification tool attempts to process them as best as possible. If the results come back empty, the rapids_4_spark_qualification_output_status.csv file can call out the failed run due to incomplete logs.

To run tools that require SSH on the EMR nodes (that is, bootstrap):

make sure that you have SSH access to the cluster nodes; and

create a key pair using Amazon EC2 through the AWS CLI command aws ec2 create-key-pair as instructed in aws-cli-create-key-pairs.

Set the configuration settings and credentials of the AWS CLI by creating credentials and config files as described in aws-cli-configure-files.

If the AWS CLI configuration isn’t set to the default values, explicitly set the environment variables to be picked up by the tools command, such as AWS_PROFILE, AWS_DEFAULT_REGION, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE. Refer to the full list of variables in aws-cli-configure-envvars.

It’s important to configure the correct region for the S3 bucket being used. If the region isn’t set, the AWS SDK will choose a default value that may not be valid. In addition, the tools CLI inspects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables if the credentials couldn’t be pulled from the credential files.

The Spark event logs are stored in HDFS in the EMR cluster at the path /var/log/spark/apps/. Make sure the logs are copied to S3 (or a local directory) and specify that path before running the Qualification tool.
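The AWS environment variables mentioned above could be set as in the following sketch. The profile name, region, and file locations are hypothetical examples. The check at the end is a simplified illustration of the fallback described above (credential files first, then the access key variables); it is not the tool's actual resolution logic.

```shell
# Hypothetical profile, region, and file locations; substitute your own.
export AWS_PROFILE=analytics
export AWS_DEFAULT_REGION=us-west-2
export AWS_CONFIG_FILE="$HOME/.aws/config"
export AWS_SHARED_CREDENTIALS_FILE="$HOME/.aws/credentials"

# Stop early if the region is empty, since a default chosen by the SDK may not be valid.
: "${AWS_DEFAULT_REGION:?AWS_DEFAULT_REGION must match the region of the S3 bucket}"

# Report where credentials would come from: the credentials file or the key variables.
if [ -f "$AWS_SHARED_CREDENTIALS_FILE" ]; then
  cred_source="credentials file"
elif [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
  cred_source="environment variables"
else
  cred_source="none found"
fi
echo "AWS credentials source: $cred_source"
```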

Set the configuration settings and credentials of the gcloud CLI:

Initialize the gcloud CLI by following these instructions.

Set up “application default credentials” to the gcloud CLI by logging in.

Verify that the required gcloud CLI properties are properly defined.

If the configuration isn’t set to default values, explicitly set the environment variables to be picked up by the tools command, such as CLOUDSDK_DATAPROC_REGION and CLOUDSDK_COMPUTE_REGION.

The tools CLI follows the process described in this doc to resolve the credentials. If not running on GCP, the environment variable GOOGLE_APPLICATION_CREDENTIALS is required to point to a JSON file containing credentials.
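A minimal sketch of the gcloud-related environment variables described above follows. The region us-central1 is a hypothetical value. The credentials path is an assumption: it is where application default credentials are commonly written on Linux and macOS after logging in, so adjust it if your JSON file lives elsewhere.

```shell
# Hypothetical region values; use the region of your Dataproc cluster.
export CLOUDSDK_DATAPROC_REGION=us-central1
export CLOUDSDK_COMPUTE_REGION=us-central1

# Assumed location of the application default credentials JSON file.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.config/gcloud/application_default_credentials.json"

# Warn rather than fail when the file is missing, since it is only required off GCP.
if [ -f "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
  echo "using credentials from $GOOGLE_APPLICATION_CREDENTIALS"
else
  echo "warning: $GOOGLE_APPLICATION_CREDENTIALS does not exist" >&2
fi
```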

The tools CLI depends on the Python implementation of PyArrow, which relies on some environment variables to bind with HDFS:

HADOOP_HOME: the root of your installed Hadoop distribution. Often has “lib/native/libhdfs.so”.

HADOOP_CONF_DIR: the path to your Hadoop configuration if HADOOP_HOME is not specified.

ARROW_LIBHDFS_DIR (optional): explicit location of “libhdfs.so” if it’s installed somewhere other than $HADOOP_HOME/lib/native.

For more information on HDFS requirements, refer to the PyArrow HDFS documentation.
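The HDFS-related variables above might be set as in this sketch. The install root /opt/hadoop is a hypothetical default, used only when HADOOP_HOME is not already set. The CLASSPATH step is taken from the PyArrow HDFS documentation, not from this guide, and runs only if the hadoop launcher is present.

```shell
# Hypothetical install location; keep any value that is already exported.
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"

# Only needed when libhdfs.so lives outside $HADOOP_HOME/lib/native.
export ARROW_LIBHDFS_DIR="${ARROW_LIBHDFS_DIR:-$HADOOP_HOME/lib/native}"

# PyArrow also expects the Hadoop jars on CLASSPATH (per the PyArrow HDFS documentation).
if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  CLASSPATH="$("$HADOOP_HOME/bin/hadoop" classpath --glob)"
  export CLASSPATH
fi

# Report whether the native library is where the variables say it is.
if [ -f "$ARROW_LIBHDFS_DIR/libhdfs.so" ]; then
  echo "found libhdfs.so in $ARROW_LIBHDFS_DIR"
else
  echo "libhdfs.so not found in $ARROW_LIBHDFS_DIR" >&2
fi
```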

No further steps are required to run the tools in an on-premises environment, including standalone/local machines.

Set up the development environment for your CSP or on-prem

The developer machine used to host the CLI needs internet access. If the machine is behind a proxy, it’s recommended to install the CLI package from source using Tools-Fat-Wheel as described in Install the CLI Package.

Set up a Python environment with a version between 3.10 and 3.12
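One way to check the Python requirement above is sketched below. The helper python_version_supported is a hypothetical name, and the sketch assumes the stated range is inclusive at both ends (3.10, 3.11, and 3.12).

```shell
# Succeeds when MAJOR.MINOR falls within 3.10 through 3.12 (assumed inclusive).
python_version_supported() {
  major=$1
  minor=$2
  [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -le 12 ]
}

if command -v python3 >/dev/null 2>&1; then
  # Ask the interpreter for its own major and minor version numbers.
  set -- $(python3 -c 'import sys; print(sys.version_info[0], sys.version_info[1])')
  if python_version_supported "$1" "$2"; then
    echo "python3 $1.$2 is within the supported range"
  else
    echo "python3 $1.$2 is outside the 3.10 to 3.12 range" >&2
  fi
else
  echo "python3 not found on PATH" >&2
fi
```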

Running the Tool

For Databricks AWS or Azure platform, use the notebook here. Set the platform and CPU eventlog location(s) to an S3 or DBFS location before running the imported notebook. To see a per-app recommended set of Spark configs, you can download the entire qualification tool output via the Download Output button further down in the notebook.

A typical workflow to successfully run the qualification command:

1. Follow the instructions to set up the prerequisites and install the CLI.

2. Get Apache Spark eventlogs from prior runs of CPU based applications on Spark 2.x or later. In addition to local storage, the eventlogs should be stored in a valid remote storage:

   - For Dataproc, it should be set to a GCS path.
   - For EMR and Databricks-AWS, it should be set to an S3 path.
   - For Databricks-Azure, it should be set to an ABFS path.

3. Run the qualification command on the set of selected eventlogs. Event logs can be passed as single files, a directory, or a comma-separated list of files or directories. The format of event logs can be raw, zip, or gzip.

spark_rapids qualification <flags>
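As an illustration of passing several event logs at once, the sketch below joins a set of paths into the comma-separated form that --eventlogs accepts. The bucket and file names are hypothetical, and the command is only printed so the example can be tried without the CLI installed.

```shell
# Hypothetical event log locations: one zipped file and one directory.
set -- "s3://my-bucket/eventlogs/app-0001.zip" "s3://my-bucket/eventlogs/batch-2024/"

# Join the paths with commas; the command substitution keeps the IFS change local.
EVENTLOGS=$(IFS=,; printf '%s' "$*")

# Print the resulting command; drop the leading "echo" to run it for real.
echo spark_rapids qualification --platform emr --eventlogs "$EVENTLOGS"
```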

The tool helps quantify the expected acceleration of migrating a Spark application or query to GPU. The tool will process each app individually, but will group apps with the same name and cluster details into a single output row after averaging duration metrics accordingly.

The Console Output cell of the notebook or the CLI run will generate the top candidates for migration to GPU.

In the output for the top candidates, it will give a sizing (Large, Medium, or Small) to indicate the projected speedup on GPU. The top candidate list represents the set of Spark workloads that are candidates to migrate to GPU. This list is what should be used for the migration phase. If you want to start with only one job or a few of the qualified jobs, you should start at the top of the list as they are ranked in order. If you have questions, you can contact the Spark RAPIDS team via email at spark-rapids-support@nvidia.com.

Example Commands

This section shows examples of Qualification CLI commands assuming the following inputs:

EVENTLOG: Path to Spark eventlogs without the scheme part. The scheme can be a local file system (file://), HDFS (hdfs://), S3 (s3://), ABFS (abfss://), or GCS (gs://).

The following table shows CLI command examples along with platform and expected functionalities based on which analysis is performed.

Examples of qualification CLI commands

| CMD | Platform |
| --- | --- |
| `spark_rapids qualification --platform dataproc --eventlogs gs://$EVENTLOG` | Dataproc |
| `spark_rapids qualification --platform emr --eventlogs s3://$EVENTLOG` | EMR |
| `spark_rapids qualification --platform databricks-azure --eventlogs file://$EVENTLOG` | Databricks-Azure |
| `spark_rapids qualification --platform databricks-aws --eventlogs file://$EVENTLOG` | Databricks-AWS |
| `spark_rapids qualification --platform onprem --eventlogs file://$EVENTLOG` | On-prem |

Command Options

You can list all the options using the help argument:

spark_rapids qualification -- --help

Available options are listed in the following table.

List of options for qualification CLI command

| Option | Description | Default | Required |
| --- | --- | --- | --- |
| --eventlogs | Event log filenames or CSP storage directories containing event logs (comma separated), or path to a TXT file containing a list of event log paths. Skipping this argument requires that the cluster argument points to a valid cluster name on the CSP. | N/A | N |
| --cluster | The CPU cluster on which the Spark application(s) were executed. Name or ID of cluster or path to cluster property file. | N/A | N |
| --platform, -p | Defines one of the following “on-prem”, “emr”, “dataproc”, “dataproc-gke”, “databricks-aws”, and “databricks-azure”. | N/A | Y |
| --output_folder, -o | Path to store the output. Cannot be remote folder. | N/A | N |
| --filter_apps, -f | Requires cluster argument. Filtering criteria of the applications listed in the final STDOUT table without affecting the CSV report: ALL means no filter applied. TOP_CANDIDATES lists all apps that have unsupported operators stage duration less than 25% of app duration and speedups greater than 1.3x. | TOP_CANDIDATES | N |
| --custom_model_file | Custom model file (JSON format) used to calculate the estimated GPU duration. | N/A | N |
| --jvm_heap_size | The maximum heap size of the JVM in gigabytes. Default is calculated based on a function of the total memory of the host. | N/A | N |
| --jvm_threads | Number of threads to use for parallel processing on the eventlogs batch. Default is calculated as a function of the total number of cores and the heap size on the host. | N/A | N |
| --tools_config_file | Path to a configuration file that contains the tools’ options. Please visit tools-config sample. | N/A | N |
| --target_cluster_info | Path to a YAML file that contains the target cluster information. Please visit target cluster-info sample. | N/A | N |
| --tuning_configs | Path to a YAML file that contains the tuning configurations. Please visit bootstrap sample. | N/A | N |
| --qualx_config | Path to a qualx-conf.yaml file to use for configuration. If not provided, the wrapper will use the default qualx-conf.yaml. | N/A | N |
| --verbose, -v | True or False to enable verbosity of the script. | N/A | N |

Note: The Qualification tool includes an AutoTuner module that can generate optimized Spark configuration recommendations. You can customize the AutoTuner behavior by providing target cluster information and custom tuning parameters. For details, see AutoTuner Configuration.
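The TOP_CANDIDATES criteria described for --filter_apps can be read as two thresholds applied together. The sketch below expresses that reading with a hypothetical helper, is_top_candidate, and made-up numbers. It only illustrates the stated thresholds and is not the tool's implementation.

```shell
# Succeeds when an app meets both stated thresholds.
# Arguments: unsupported-operator stage duration, app duration (same unit), estimated speedup.
is_top_candidate() {
  awk -v unsupported="$1" -v total="$2" -v speedup="$3" 'BEGIN {
    ok = (total > 0 && unsupported / total < 0.25 && speedup > 1.3)
    exit(ok ? 0 : 1)
  }'
}

# Hypothetical app: 10 of 100 seconds in unsupported operators, 2.1x estimated speedup.
if is_top_candidate 10 100 2.1; then
  echo "top candidate"
else
  echo "filtered out"
fi
```

With --filter_apps ALL, no such filter is applied to the STDOUT table; in both cases the CSV report is unaffected.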