Jar Usage#

If you are not using the CLI tool, the Qualification tool can be run as a Java command in three different ways:

  1. As a standalone tool on the Spark event logs after the application(s) have run,

  2. Inside a running Spark application using explicit API calls, and

  3. Using a Spark listener, which can output results per SQL query.

Setting Up Environment#

Prerequisites#

  • Java 8 or above

  • Spark event log(s) from Spark 2.0 or later. The tool supports rolled and compressed event logs with .lz4, .lzf, .snappy, and .zstd suffixes, as well as Databricks-specific rolled and compressed (.gz) event logs.

  • The Qualification tool requires the Spark 3.x jars to run but does not need an Apache Spark runtime. If you do not already have Spark 3.x installed, you can download the Apache Spark 3 Distribution to any machine and include the jars in the classpath.

  • This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual or multiple event log files, or directories containing Spark event logs, located in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. If any input is a remote file or directory path, the corresponding connector dependencies must be on the classpath:

    HDFS: include $HADOOP_CONF_DIR in the classpath.

    GCS: download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.

    S3: download the jars matching your Hadoop version:

    • hadoop-aws-<version>.jar

    • aws-java-sdk-<version>.jar

    In $SPARK_HOME/conf, create an hdfs-site.xml with the following AWS S3 keys:

    <?xml version="1.0"?>
    <configuration>
       <property>
          <name>fs.s3a.access.key</name>
          <value>xxx</value>
       </property>
       <property>
          <name>fs.s3a.secret.key</name>
          <value>xxx</value>
       </property>
    </configuration>
    

    You can test your configuration by including the above jars with the --jars option to spark-shell or spark-submit (see the example below).

    Please refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.
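
    For example, a minimal sanity check of the S3 setup might look like the following; the jar versions are placeholders, and the bucket path you read from is up to you:

    $SPARK_HOME/bin/spark-shell \
      --jars hadoop-aws-<version>.jar,aws-java-sdk-<version>.jar

    Inside the shell, reading a small object with an s3a:// path (for example via spark.read.textFile) confirms that the connector jars and the keys in hdfs-site.xml are being picked up.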

    • ABFS: download the hadoop-azure-<version>.jar matching your Hadoop version.

    • The simplest authentication mechanism uses an account name and account key (see the sketch below). Please refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS.
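
    A minimal sketch of the shared-key setup, assuming the standard Hadoop-ABFS account-key property (YOUR_ACCOUNT and the key value are placeholders; consult the Hadoop-ABFS support doc for the authoritative property names), placed for example in a core-site.xml under $SPARK_HOME/conf:

    <?xml version="1.0"?>
    <configuration>
       <property>
          <!-- shared-key authentication; YOUR_ACCOUNT is a placeholder storage account name -->
          <name>fs.azure.account.key.YOUR_ACCOUNT.dfs.core.windows.net</name>
          <value>xxx</value>
       </property>
    </configuration>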

Getting the Tools Jar#

  • Check out the code repository:

    git clone git@github.com:NVIDIA/spark-rapids-tools.git
    cd spark-rapids-tools/core
    
  • Build using Maven. After a successful build, rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar will be in the target/ directory. Refer to the build doc for more information on build options (for example, the Spark version):

    mvn clean package
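
    If you only need the jar and want a faster build, the standard Maven option for skipping unit tests can be added (this is generic Maven behavior, not specific to this project):

    mvn clean package -DskipTests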
    

Deploying Tools Jar#

Running the Qualification Tool Standalone on Spark Event Logs#

  1. The Qualification tool reads the log files and processes them in memory, so the heap size should be increased when processing a large volume of events. It is recommended to pass the VM option -Xmx10g and adjust it according to the number of applications and the size of the logs being processed:

    export QUALIFICATION_HEAP=-Xmx10g
    
  2. Examples of running the tool in the following environments:

    • Extract the Spark distribution into a local directory if necessary.

    • Either set SPARK_HOME to point to that directory or put the path directly into the classpath, java -cp toolsJar:pathToSparkJars/*:..., when you run the Qualification tool.

    Usage: java ${QUALIFICATION_HEAP} \
             -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
             com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
             <eventlogs | eventlog directories ...>
    
    Sample: java ${QUALIFICATION_HEAP} \
             -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
             com.nvidia.spark.rapids.tool.qualification.QualificationMain /usr/logs/app-name1
    

    Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster the default filesystem is likely HDFS for both the input and the output, so if you want to point to the local filesystem, be sure to include file: in the path (see the example below).

    Usage: java ${QUALIFICATION_HEAP} \
             -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
             com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir
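
    For example, to point both the input and the report output at the local filesystem while $HADOOP_CONF_DIR defaults to HDFS, spell out the file: scheme explicitly (the paths below are illustrative):

    java ${QUALIFICATION_HEAP} \
             -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
             com.nvidia.spark.rapids.tool.qualification.QualificationMain \
             --output-directory file:/tmp/qual-output \
             file:/path/to/local/eventlog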
    

Qualification tool options

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                          Apply multiple event log filtering criteria
                                     and process only logs for which all
                                     conditions are satisfied. Example: <Filter1>
                                     <Filter2> <Filter3> --all -> result is
                                     <Filter1> AND <Filter2> AND <Filter3>.
                                     Default is all=true
      --any                          Apply multiple event log filtering criteria
                                     and process only logs for which any condition
                                     is satisfied. Example: <Filter1> <Filter2>
                                     <Filter3> --any -> result is <Filter1> OR
                                     <Filter2> OR <Filter3>
  -a, --application-name  <arg>      Filter event logs by application name. The
                                     string specified can be a regular expression,
                                     substring, or exact match. For filtering
                                     based on complement of application name, use
                                     ~APPLICATION_NAME, i.e. select all event logs
                                     except the ones which have application name
                                     as the input string.
  -f, --filter-criteria  <arg>       Filter newest or oldest N eventlogs based on
                                     application start timestamp, unique
                                     application name or filesystem timestamp.
                                     Filesystem based filtering happens before any
                                     application based filtering. For application
                                     based filtering, the order in which filters
                                     are applied is: application-name,
                                     start-app-time, filter-criteria. Application
                                     based filter-criteria are: 100-newest (for
                                     processing newest 100 event logs based on
                                     timestamp inside the eventlog, i.e.
                                     application start time), 100-oldest (for
                                     processing oldest 100 event logs based on
                                     timestamp inside the eventlog, i.e.
                                     application start time),
                                     100-newest-per-app-name (select at most 100
                                     newest log files for each unique application
                                     name), 100-oldest-per-app-name (select at
                                     most 100 oldest log files for each unique
                                     application name). Filesystem based filter
                                     criteria are: 100-newest-filesystem (for
                                     processing newest 100 event logs based on
                                     filesystem timestamp), 100-oldest-filesystem
                                     (for processing oldest 100 event logs based
                                     on filesystem timestamp).
  -h, --html-report                  Default is to generate an HTML report.
      --no-html-report               Disables generating the HTML report.
  -m, --match-event-logs  <arg>      Filter event logs whose filenames contain the
                                     input string. Filesystem based filtering
                                     happens before any application based
                                     filtering.
      --max-sql-desc-length  <arg>   Maximum length of the SQL description
                                     string output with the per sql output.
                                     Default is 100.
      --ml-functions                 Report if there are any SparkML or Spark XGBoost
                                     functions in the eventlog.
  -n, --num-output-rows  <arg>       Number of output rows in the summary report.
                                     Default is 1000.
      --num-threads  <arg>           Number of threads to use for parallel
                                     processing. The default is the number of
                                     cores on host divided by 4.
      --order  <arg>                 Specify the sort order of the report. desc or
                                     asc, desc is the default. desc (descending)
                                     would report applications most likely to be
                                     accelerated at the top and asc (ascending)
                                     would show the least likely to be accelerated
                                     at the top.
  -o, --output-directory  <arg>      Base output directory. Default is current
                                     directory for the default filesystem. The
                                     final output will go into a subdirectory
                                     called rapids_4_spark_qualification_output.
                                     It will overwrite any existing directory with
                                     the same name.
  -p, --per-sql                      Report at the individual SQL query level.
      --platform  <arg>              Cluster platform where Spark CPU workloads were
                                     executed. Options include onprem, dataproc-t4,
                                     dataproc-l4, emr, databricks-aws, and
                                     databricks-azure.
                                     Default is onprem.
  -r, --report-read-schema           Whether to output the read formats and
                                     datatypes to the CSV file. This can be very
                                     long. Default is false.
      --spark-property  <arg>...     Filter applications based on certain Spark
                                     properties that were set during launch of the
                                     application. It can filter based on key:value
                                     pair or just based on keys. Multiple configs
                                     can be provided where the filtering is done
                                     if any of the configs is present in the
                                     eventlog. Filter on a specific configuration:
                                     --spark-property=spark.eventLog.enabled:true.
                                     Filter all eventlogs which have the config:
                                     --spark-property=spark.driver.port. Multiple
                                     configs:
                                     --spark-property=spark.eventLog.enabled:true
                                     --spark-property=spark.driver.port
  -s, --start-app-time  <arg>        Filter event logs whose application start
                                     occurred within the past specified time
                                     period. Valid time periods are
                                     min(minute),h(hours),d(days),w(weeks),m(months).
                                     If a period is not specified it defaults to
                                     days.
  -t, --timeout  <arg>               Maximum time in seconds to wait for the event
                                     logs to be processed. Default is 24 hours
                                     (86400 seconds) and must be greater than 3
                                     seconds. If it times out, it will report what
                                     it was able to process up until the timeout.
  -u, --user-name  <arg>             Applications which a particular user has
                                     submitted.
      --help                         Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs. eg: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2

Note

  • --help should be before the trailing event logs.

  • The “regular expression” used by the -a option is based on java.util.regex.Pattern (see the example below).
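
For illustration, the command below combines several of the filtering options to process only the 10 newest event logs whose application name matches a regular expression and to report per-SQL results (the name pattern and paths are made up):

java ${QUALIFICATION_HEAP} \
  -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  --application-name "etl_.*" \
  --filter-criteria 10-newest \
  --per-sql \
  /path/to/eventlog-dir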

Please refer to Java CMD Samples for more examples and sample commands.

Running Using a Spark Listener#

We provide a Spark listener that can be installed at application start; it produces output for each SQL query in the running application and indicates whether that query is a good fit to try with the RAPIDS Accelerator for Spark.

Configuration#

  • Add the following class to the Spark listeners configuration:

    spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor
    
  • The user should specify the output directory (spark.rapids.qualification.outputDir) if they want the output to go to separate files; otherwise it goes to the Spark driver log. If the output directory is specified, the tool outputs two files: one CSV file and one pretty-printed log file. The output directory can be a local directory or point to a distributed filesystem or blobstore such as S3.

  • By default, this will output results for 10 SQL queries per file and keep 100 files. This is because many blob stores don’t show files until they are fully written, so you wouldn’t be able to see the results for a running application until it finishes the number of SQL queries per file. This behavior can be configured with the following configs:

    • spark.rapids.qualification.output.numSQLQueriesPerFile: default 10

    • spark.rapids.qualification.output.maxNumFiles: default 100

Run the Spark Application#

Run the application and include the tools jar, spark.extraListeners config, and optionally the other configs to control the tool’s behavior.

For example:

$SPARK_HOME/bin/spark-shell \
--jars rapids-4-spark-tools_2.12-<version>.jar \
--conf spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor \
--conf spark.rapids.qualification.outputDir=/tmp/qualPerSqlOutput \
--conf spark.rapids.qualification.output.numSQLQueriesPerFile=5 \
--conf spark.rapids.qualification.output.maxNumFiles=10

After running some SQL queries you can look in the output directory and see files like:

rapids_4_spark_qualification_output_persql_0.csv
rapids_4_spark_qualification_output_persql_0.log
rapids_4_spark_qualification_output_persql_1.csv
rapids_4_spark_qualification_output_persql_1.log
rapids_4_spark_qualification_output_persql_2.csv
rapids_4_spark_qualification_output_persql_2.log

See the Understanding the Qualification tool output section for details on the file contents.

Running the Qualification Tool Inside a Running Spark Application Using the API#

Modify Your Application Code To Call the APIs#

Currently, only Scala APIs are supported. Note that the API does not currently support reporting at the per-SQL level; this can be done manually by wrapping and reporting around individual queries instead of the entire application (see the sketch after the example below).

  1. Create the RunningQualificationApp:

    val qualApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()
    
  2. Get the event listener from it and install it as a Spark listener:

    val listener = qualApp.getEventListener
    spark.sparkContext.addSparkListener(listener)
    
  3. Run your queries and get the summary or detailed output to see the results.

    • The summary output API:

      /**
      * Get the summary report for qualification.
      * @param delimiter The delimiter separating fields of the summary report.
      * @param prettyPrint Whether to include the separator at start and end and
      *                    add spacing so the data rows align with column headings.
      * @return String containing the summary report.
      */
      getSummary(delimiter: String = "|", prettyPrint: Boolean = true): String
      
    • The detailed output API:

      /**
      * Get the detailed report for qualification.
      * @param delimiter The delimiter separating fields of the detailed report.
      * @param prettyPrint Whether to include the separator at start and end and
      *                    add spacing so the data rows align with column headings.
      * @return String containing the detailed report.
      */
      getDetailed(delimiter: String = "|", prettyPrint: Boolean = true, reportReadSchema: Boolean = false): String
      

Example:

// run your sql queries ...

// To get the summary output:
val summaryOutput = qualApp.getSummary()

// To get the detailed output:
val detailedOutput = qualApp.getDetailed()

// print the output somewhere for the user to see
println(summaryOutput)
println(detailedOutput)
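
As noted above, per-SQL reporting is not part of this API; the snippet below is only an illustrative sketch of one way to approximate it by giving each query its own RunningQualificationApp and listener. The helper name qualifyBlock is made up, and a spark session is assumed to be available as in spark-shell:

import com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp

// Illustrative helper: qualify a single block of work with its own app and listener,
// so getSummary() reflects only that query rather than the whole application.
def qualifyBlock[T](label: String)(runQuery: => T): String = {
  val qualApp = new RunningQualificationApp()
  val listener = qualApp.getEventListener
  spark.sparkContext.addSparkListener(listener)
  try {
    runQuery                                   // run just this query
    s"=== $label ===\n" + qualApp.getSummary()
  } finally {
    // stop feeding events to this app once the query is done
    spark.sparkContext.removeSparkListener(listener)
  }
}

// usage in spark-shell:
println(qualifyBlock("query-1") {
  spark.range(1000).selectExpr("id % 10 AS k").groupBy("k").count().collect()
})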

If you need to specify the tools jar as a Maven dependency to compile the Spark application:

<dependency>
   <groupId>com.nvidia</groupId>
   <artifactId>rapids-4-spark-tools_2.12</artifactId>
   <version>${version}</version>
</dependency>

Run the Spark Application#

Run your Spark application, include the tools jar you downloaded with the Spark --jars option, and view the output wherever you had it printed.

For example, if running the spark-shell:

$SPARK_HOME/bin/spark-shell --jars rapids-4-spark-tools_2.12-<version>.jar