RAPIDS Accelerator for Apache Spark - User Guide (24.08.01)

Qualification Tool - Jar Usage

The Qualification tool can be run as a Java command in three different ways if you aren’t using the CLI tool:

  1. A standalone tool on the Spark event logs after the application(s) have run,

  2. Inside a running Spark application using explicit API calls, and

  3. Using a Spark listener, which can output results per SQL query.

Prerequisites

  • Java 8+

  • Spark event log(s) generated by Spark 2.0 or later. Supports both rolled and compressed event logs with .lz4, .lzf, .snappy and .zstd suffixes, as well as Databricks-specific rolled and compressed (.gz) event logs.

  • The tool requires the Spark 3.x+ jars on the classpath to run, but it doesn’t need an Apache Spark runtime. If you don’t already have Spark 3.x+ installed, you can download the Apache Spark distribution to any machine and include its jars in the classpath.

  • This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files or directories containing Spark event logs, located on the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. If you want to point to the local filesystem, be sure to include the file: prefix in the path. If any input is a remote file or directory path, the corresponding connector dependencies need to be on the classpath.

    Include $HADOOP_CONF_DIR in classpath

    Sample showing Java’s classpath


    -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/


    For GCS: download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.
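    A minimal sketch of putting the GCS connector on the tool’s classpath; the jar names, versions, and the gs:// path below are illustrative and should be adjusted to your environment:

    # Illustrative command: the connector jar and bucket path are placeholders.
    java -cp rapids-4-spark-tools_2.12-<version>.jar:gcs-connector-hadoop3-<version>-shaded.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain gs://<BUCKET>/eventlogDir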

    For S3: download the matched jars based on the Hadoop version:

    • hadoop-aws-<version>.jar

    • aws-java-sdk-<version>.jar

    In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>xxx</value>
      </property>
    </configuration>

    You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit, as sketched below.

    Refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.
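    For example, a quick sanity check of the S3 setup from spark-shell might look like the following sketch (jar versions and the bucket path are placeholders):

    # Launch spark-shell with the AWS connector jars on the classpath.
    $SPARK_HOME/bin/spark-shell \
      --jars hadoop-aws-<version>.jar,aws-java-sdk-<version>.jar
    # Inside the shell, try reading a file from the bucket, for example:
    #   spark.read.textFile("s3a://<BUCKET>/eventlog1").count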

    • For ABFS: download the matched jar based on the Hadoop version, hadoop-azure-<version>.jar.

    • The simplest authentication mechanism is to use an account name and account key. Refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS. A classpath sketch follows.
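    A minimal sketch of running the tool against an ABFS path, assuming hadoop-azure-<version>.jar was downloaded and the storage account key is already configured for Hadoop (the container, account, and path below are placeholders):

    # Illustrative command: jar versions and the abfss:// path are placeholders.
    java -cp rapids-4-spark-tools_2.12-<version>.jar:hadoop-azure-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      abfss://<CONTAINER>@<ACCOUNT>.dfs.core.windows.net/eventlogDir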


Getting the Tools Jar

  • Check out the code repository

    git clone git@github.com:NVIDIA/spark-rapids-tools.git
    cd spark-rapids-tools/core

  • Build using Maven. After a successful build, rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar will be in the target/ directory. Refer to the build doc for more information on build options (for example, the Spark version).


    mvn clean package

Running the Qualification Tool Standalone on Spark Event Logs

  1. The tool reads the log files and processes them in memory, so the heap should be increased when processing a large volume of events. It’s recommended to pass the JVM option -Xmx10g and adjust it according to the number of applications and the size of the logs being processed.


    export JVM_HEAP=-Xmx10g

  2. Examples of running the tool in the following environments:

    • Extract the Spark distribution into a local directory if necessary.

    • Either set SPARK_HOME to point to that directory or put the path directly into the classpath (java -cp toolsJar:$SPARK_HOME/jars/*:...) when you run the Qualification tool. A minimal sketch is shown below.
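    A minimal sketch of extracting a Spark distribution and setting SPARK_HOME; the version and install path are illustrative:

    # Illustrative version and paths -- use whichever Spark 3.x+ distribution you downloaded.
    tar -xzf spark-3.4.1-bin-hadoop3.tgz -C /opt
    export SPARK_HOME=/opt/spark-3.4.1-bin-hadoop3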

    General usage:

    java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain [options] \
      <eventlogs | eventlog directories ...>

    Example running on a single event log file on the local filesystem:

    java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      /usr/logs/app-name1

    Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster the default filesystem is likely HDFS for both input and output, so if you want to point to the local filesystem, be sure to include file: in the path.

    java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir

Qualification tool options

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                         Apply multiple event log filtering criteria and process
                                    only logs for which all conditions are satisfied.
                                    Example: <Filter1> <Filter2> <Filter3> --all -> result is
                                    <Filter1> AND <Filter2> AND <Filter3>. Default is all=true.
      --any                         Apply multiple event log filtering criteria and process
                                    only logs for which any condition is satisfied.
                                    Example: <Filter1> <Filter2> <Filter3> --any -> result is
                                    <Filter1> OR <Filter2> OR <Filter3>.
  -a, --application-name <arg>      Filter event logs by application name. The string
                                    specified can be a regular expression, substring, or
                                    exact match. To filter on the complement of the
                                    application name, use ~APPLICATION_NAME, i.e. select all
                                    event logs except the ones whose application name matches
                                    the input string.
      --auto-tuner                  Toggle the AutoTuner module.
  -f, --filter-criteria <arg>       Filter the newest or oldest N event logs based on
                                    application start timestamp, unique application name, or
                                    filesystem timestamp. Filesystem-based filtering happens
                                    before any application-based filtering. For
                                    application-based filtering, the order in which filters
                                    are applied is: application-name, start-app-time,
                                    filter-criteria.
                                    Application-based filter-criteria are:
                                    100-newest (process the newest 100 event logs based on
                                    the application start time inside the event log)
                                    100-oldest (process the oldest 100 event logs based on
                                    the application start time inside the event log)
                                    100-newest-per-app-name (select at most 100 newest log
                                    files for each unique application name)
                                    100-oldest-per-app-name (select at most 100 oldest log
                                    files for each unique application name)
                                    Filesystem-based filter criteria are:
                                    100-newest-filesystem (process the newest 100 event logs
                                    based on filesystem timestamp)
                                    100-oldest-filesystem (process the oldest 100 event logs
                                    based on filesystem timestamp)
  -h, --html-report                 Default is to generate an HTML report.
      --no-html-report              Disables generating the HTML report.
  -m, --match-event-logs <arg>      Filter event logs whose filenames contain the input
                                    string. Filesystem-based filtering happens before any
                                    application-based filtering.
      --max-sql-desc-length <arg>   Maximum length of the SQL description string output with
                                    the per-SQL output. Default is 100.
      --ml-functions                Report if there are any SparkML or Spark XGBoost
                                    functions in the event log.
  -n, --num-output-rows <arg>       Number of output rows in the summary report. Default is
                                    1000.
      --num-threads <arg>           Number of threads to use for parallel processing. The
                                    default is the number of cores on the host divided by 4.
      --order <arg>                 Specify the sort order of the report: desc or asc; desc
                                    is the default. desc (descending) reports the
                                    applications most likely to be accelerated at the top,
                                    asc (ascending) shows the least likely at the top.
  -o, --output-directory <arg>      Base output directory. Default is the current directory
                                    for the default filesystem. The final output will go into
                                    a subdirectory called rapids_4_spark_qualification_output.
                                    It will overwrite any existing directory with the same
                                    name.
  -p, --per-sql                     Report at the individual SQL query level.
      --platform <arg>              Cluster platform where the Spark CPU workloads were
                                    executed. Options include onprem, dataproc-t4,
                                    dataproc-l4, emr, databricks-aws, and databricks-azure.
                                    Default is onprem.
  -r, --report-read-schema          Whether to output the read formats and datatypes to the
                                    CSV file. This can be very long. Default is false.
      --spark-property <arg>...     Filter applications based on certain Spark properties
                                    that were set during launch of the application. It can
                                    filter based on a key:value pair or just on keys.
                                    Multiple configs can be provided; an event log is kept if
                                    any of the configs is present in it.
                                    Filter on a specific configuration:
                                    --spark-property=spark.eventLog.enabled:true
                                    Filter all event logs which have a config set:
                                    --spark-property=spark.driver.port
                                    Multiple configs:
                                    --spark-property=spark.eventLog.enabled:true
                                    --spark-property=spark.driver.port
  -s, --start-app-time <arg>        Filter event logs whose application start occurred within
                                    the past specified time period. Valid time periods are
                                    min (minutes), h (hours), d (days), w (weeks), m (months).
                                    If a period isn't specified it defaults to days.
  -t, --timeout <arg>               Maximum time in seconds to wait for the event logs to be
                                    processed. Default is 24 hours (86400 seconds) and must be
                                    greater than 3 seconds. If it times out, it will report
                                    what it was able to process up until the timeout.
  -u, --user-name <arg>             Applications which a particular user has submitted.
  -w, --worker-info <arg>           File path containing the system information of a worker
                                    node. It's assumed that all workers are homogeneous. It
                                    requires the AutoTuner to be enabled. Default is
                                    ./worker_info.yaml.
      --help                        Show help message.

 trailing arguments:
  eventlog (required)               Event log filenames (space separated) or directories
                                    containing event logs, for example
                                    s3a://<BUCKET>/eventlog1 /path/to/eventlog2

Note
  • --help should be before the trailing event logs.

  • The regular expression used by the -a option is based on java.util.regex.Pattern; an example follows.
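For example, a sketch of filtering by application name with a regular expression; the pattern "etl_.*" and the event log directory are illustrative:

java ${JVM_HEAP} \
  -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain -a "etl_.*" /eventlogDir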

Please refer to Java CMD Samples for more examples and sample commands.

Tuning Spark Properties For GPU Clusters

Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPUs. Those calculations can leverage cluster information (for example, memory, cores, Spark default configurations) as well as information processed from the application event logs. The tool will also recommend settings for the application assuming that the job will be able to use all the cluster resources (CPU and GPU) when it’s running. Values loaded from the app logs have higher precedence than the default configs.

Note

Auto-Tuner limitations:

  • It’s assumed that all the worker nodes on the cluster are homogeneous.

To run the Auto-Tuner, pass the --auto-tuner flag and optionally a valid --worker-info <FILE_PATH>. The Auto-Tuner needs to learn the system properties of the worker nodes that run application code in the cluster. The FILE_PATH argument can be either a local or a remote file (for example, on HDFS).

If the --worker-info argument isn’t supplied, then the Auto-Tuner will only recommend tuned settings based on the job event log and not on any cluster or worker information since that isn’t available.
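A minimal sketch of running the standalone tool with the Auto-Tuner enabled, assuming a worker_info.yaml (like the template below) in the current directory; paths are illustrative:

# --auto-tuner toggles the AutoTuner module; --worker-info points to the worker description file.
java ${JVM_HEAP} \
  -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  --auto-tuner --worker-info ./worker_info.yaml /eventlogDir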

Template of the worker information file in YAML format:

system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.driver.maxResultSize: 7680m
  spark.driver.memory: 15360m
  spark.executor.cores: '8'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
  spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m


Property            Optional   If Missing
system.numCores     No         Auto-Tuner doesn’t calculate recommendations
system.memory       No         Auto-Tuner doesn’t calculate recommendations
system.numWorkers   Yes        Default: 1
gpu.name            Yes        Default: T4 (Nvidia Tesla T4)
gpu.memory          Yes        Default: 16G
softwareProperties  Yes        This section is optional. The Auto-Tuner reads the configs
                               within the logs of the Apache Spark apps with higher precedence

Running Using a Spark Listener

We provide a Spark listener that can be installed at application start; it produces output for each SQL query in the running application and indicates whether that query is a good fit to try with the RAPIDS Accelerator for Apache Spark.

Configuration

  • Add the following class to the spark listeners configuration:


    spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor


  • The user should specify the output directory (spark.rapids.qualification.outputDir) if they want the output to go to separate files; otherwise, it will go to the Spark driver log. If the output directory is specified, the listener outputs two kinds of files: one CSV and one pretty-printed log file. The output directory can be a local directory or point to a distributed filesystem or blobstore like S3.

  • By default, this will output results for 10 SQL queries per file and keep 100 files. This is because many blob stores don’t show files until they’re fully written, so you wouldn’t be able to see the results for a running application until it finishes the configured number of SQL queries per file. This behavior can be configured with the following configs:

    • spark.rapids.qualification.output.numSQLQueriesPerFile: default 10

    • spark.rapids.qualification.output.maxNumFiles: default 100

Run the Spark Application

Run the application and include the tools jar, spark.extraListeners config, and optionally the other configs to control the tool’s behavior.

For example:

$SPARK_HOME/bin/spark-shell \
  --jars rapids-4-spark-tools_2.12-<version>.jar \
  --conf spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor \
  --conf spark.rapids.qualification.outputDir=/tmp/qualPerSqlOutput \
  --conf spark.rapids.qualification.output.numSQLQueriesPerFile=5 \
  --conf spark.rapids.qualification.output.maxNumFiles=10

After running some SQL queries, you can look in the output directory and see files like:

rapids_4_spark_qualification_output_persql_0.csv
rapids_4_spark_qualification_output_persql_0.log
rapids_4_spark_qualification_output_persql_1.csv
rapids_4_spark_qualification_output_persql_1.log
rapids_4_spark_qualification_output_persql_2.csv
rapids_4_spark_qualification_output_persql_2.log

Refer to the Understanding the Qualification tool output section for details on the file contents.

Running the Qualification Tool Inside a Running Spark Application Using the API

Modify Your Application Code To Call the APIs

Currently, only Scala APIs are supported, and per-SQL-level reporting isn’t currently supported through this API. It can be approximated manually by wrapping individual queries and reporting around them instead of around the entire application.

  1. Create the RunningQualificationApp:


    val qualApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()

  2. Get the event listener from it and install it as a Spark listener:

    val listener = qualApp.getEventListener
    spark.sparkContext.addSparkListener(listener)

  3. Run your queries and get the summary or detailed output to see the results.

    • The summary output API:

      /**
       * Get the summary report for qualification.
       * @param delimiter The delimiter separating fields of the summary report.
       * @param prettyPrint Whether to include the separator at start and end and
       *                    add spacing so the data rows align with column headings.
       * @return String containing the summary report.
       */
      getSummary(delimiter: String = "|", prettyPrint: Boolean = true): String


    • The detailed output API:

      /**
       * Get the detailed report for qualification.
       * @param delimiter The delimiter separating fields of the detailed report.
       * @param prettyPrint Whether to include the separator at start and end and
       *                    add spacing so the data rows align with column headings.
       * @return String containing the detailed report.
       */
      getDetailed(delimiter: String = "|", prettyPrint: Boolean = true, reportReadSchema: Boolean = false): String


Example:

// run your sql queries ...

// To get the summary output:
val summaryOutput = qualApp.getSummary()

// To get the detailed output:
val detailedOutput = qualApp.getDetailed()

// Print the output somewhere for the user to see.
println(summaryOutput)
println(detailedOutput)

If you need to specify the tools jar as a Maven dependency to compile the Spark application:

<dependency>
  <groupId>com.nvidia</groupId>
  <artifactId>rapids-4-spark-tools_2.12</artifactId>
  <version>${version}</version>
</dependency>

Run the Spark application

Run your Spark application and include the tools jar you downloaded with the Spark --jars option, and view the output wherever you had it printed.

For example, if running the spark-shell:


$SPARK_HOME/bin/spark-shell --jars rapids-4-spark-tools_2.12-<version>.jar

Additional standalone command samples:

  • Process the 10 newest logs and only output the top 3 applications in the report:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir

  • Process last 100 days’ logs:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir

  • Process only the newest log with the same application name:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir

  • Parse ML functions from the eventlog:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain --ml-functions /eventlogDir

© Copyright 2024, NVIDIA. Last updated on Aug 29, 2024.