Jar usage#
If you are not using the CLI tool, the Qualification tool can be run in three other ways: as a standalone tool on the Spark event logs after the application(s) have run, integrated into a running Spark application using explicit API calls, or as a Spark listener that can output results on a per-SQL-query basis.
Running the Qualification tool standalone on Spark event logs#
Prerequisites#
Java 8 or above, Spark 3.0.1+ jars.
Spark event log(s) from Spark 2.0 or above. Supports both rolled and compressed event logs with .lz4, .lzf, .snappy and .zstd suffixes, as well as Databricks-specific rolled and compressed (.gz) event logs. The tool does not support nested directories. Event log files or event log directories should be at the top level when specifying a directory.
Note
Spark event logs can be downloaded from the Spark UI using the "Download" button on the right side, or can be found in the location specified by spark.eventLog.dir. See the Apache Spark Monitoring documentation for more information.
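For reference, event logging is controlled by standard Spark configs when the application is submitted. A minimal sketch (the log directory below is only an example path; use whatever location your cluster is configured with):

$SPARK_HOME/bin/spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-event-logs \
  ... your application ...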
Step 01: Download the tools jar and Apache Spark 3 Distribution#
The Qualification tool requires the Spark 3.x jars to run but does not need an Apache Spark runtime. If you do not already have Spark 3.x installed, you can download the Spark distribution to any machine and include the jars in the classpath.
Download the latest jar from the Maven repository
Spark 3.1.1 for Apache Hadoop is recommended
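As an illustration, one way to fetch both the Spark distribution and the tools jar (the exact URLs and the Hadoop 3.2 build are assumptions; pick the mirror, build, and tools version that match your environment):

wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/<version>/rapids-4-spark-tools_2.12-<version>.jar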
Step 02: Run the Qualification tool#
The Qualification tool reads the log files and processes them in memory, so the heap size should be increased when processing a large volume of events. It is recommended to pass the VM option -Xmx10g and adjust it according to the number of apps and the size of the logs being processed.

export QUALIFICATION_HEAP=-Xmx10g
Event logs stored on a local machine:
Extract the Spark distribution into a local directory if necessary.
Either set SPARK_HOME to point to that directory or just put the path inside of the classpath (java -cp toolsJar:pathToSparkJars/*:...) when you run the Qualification tool (see the sketch below).
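A minimal sketch of pointing SPARK_HOME at a locally extracted Spark 3.x distribution, assuming it was extracted under /opt (adjust the path for your machine):

export SPARK_HOME=/opt/spark-3.1.1-bin-hadoop3.2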
This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are either individual or multiple event log files, or directories containing Spark event logs, stored in the local filesystem, HDFS, S3, or a mix of these.
Usage: java ${QUALIFICATION_HEAP} \
   -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
   <eventlogs | eventlog directories ...>
Sample: java ${QUALIFICATION_HEAP} \
   -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain /usr/logs/app-name1
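You can also pass several event logs and directories in a single invocation; the paths below are illustrative:

java ${QUALIFICATION_HEAP} \
   -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain /usr/logs/app-name1 /usr/logs/app-name2 /usr/logs/eventlogDir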
Event logs stored on an on-premises HDFS cluster:
Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath):

Usage: java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir
Note, on an HDFS cluster, the default filesystem is likely HDFS for both the input and output, so if you want to point to the local filesystem be sure to include file: in the path.
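For instance, a hedged sketch that reads a local event log directory and writes the report to a local directory while running on an HDFS cluster (the paths are illustrative):

java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain \
   -o file:/tmp/qual_output file:/home/user/eventlogDir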
Qualification tool options#
Note
--help should be before the trailing event logs.
java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                         Apply multiple event log filtering criteria
                                    and process only logs for which all
                                    conditions are satisfied. Example: <Filter1>
                                    <Filter2> <Filter3> --all -> result is
                                    <Filter1> AND <Filter2> AND <Filter3>.
                                    Default is all=true.
      --any                         Apply multiple event log filtering criteria
                                    and process only logs for which any condition
                                    is satisfied. Example: <Filter1> <Filter2>
                                    <Filter3> --any -> result is <Filter1> OR
                                    <Filter2> OR <Filter3>.
  -a, --application-name <arg>      Filter event logs by application name. The
                                    string specified can be a regular expression,
                                    substring, or exact match. For filtering based
                                    on the complement of an application name, use
                                    ~APPLICATION_NAME, i.e. select all event logs
                                    except the ones which have the input string as
                                    the application name.
  -f, --filter-criteria <arg>       Filter the newest or oldest N event logs based
                                    on application start timestamp, unique
                                    application name, or filesystem timestamp.
                                    Filesystem-based filtering happens before any
                                    application-based filtering. For
                                    application-based filtering, the order in
                                    which filters are applied is:
                                    application-name, start-app-time,
                                    filter-criteria.
                                    Application-based filter criteria are:
                                    100-newest (for processing the newest 100
                                    event logs based on the timestamp inside the
                                    eventlog, i.e. application start time),
                                    100-oldest (for processing the oldest 100
                                    event logs based on the timestamp inside the
                                    eventlog, i.e. application start time),
                                    100-newest-per-app-name (select at most 100
                                    newest log files for each unique application
                                    name), 100-oldest-per-app-name (select at
                                    most 100 oldest log files for each unique
                                    application name).
                                    Filesystem-based filter criteria are:
                                    100-newest-filesystem (for processing the
                                    newest 100 event logs based on filesystem
                                    timestamp), 100-oldest-filesystem (for
                                    processing the oldest 100 event logs based on
                                    filesystem timestamp).
  -h, --html-report                 Default is to generate an HTML report.
      --no-html-report              Disables generating the HTML report.
  -m, --match-event-logs <arg>      Filter event logs whose filenames contain the
                                    input string. Filesystem-based filtering
                                    happens before any application-based
                                    filtering.
      --max-sql-desc-length <arg>   Maximum length of the SQL description
                                    string output with the per-SQL output.
                                    Default is 100.
      --ml-functions                Report if there are any SparkML or Spark
                                    XGBoost functions in the eventlog.
  -n, --num-output-rows <arg>       Number of output rows in the summary report.
                                    Default is 1000.
      --num-threads <arg>           Number of threads to use for parallel
                                    processing. The default is the number of
                                    cores on the host divided by 4.
      --order <arg>                 Specify the sort order of the report: desc or
                                    asc, desc is the default. desc (descending)
                                    would report applications most likely to be
                                    accelerated at the top and asc (ascending)
                                    would show the least likely to be accelerated
                                    at the top.
  -o, --output-directory <arg>      Base output directory. Default is the current
                                    directory for the default filesystem. The
                                    final output will go into a subdirectory
                                    called rapids_4_spark_qualification_output.
                                    It will overwrite any existing directory with
                                    the same name.
  -p, --per-sql                     Report at the individual SQL query level.
      --platform <arg>              Cluster platform where the Spark CPU workloads
                                    were executed. Options include onprem,
                                    dataproc-t4, dataproc-l4, emr, databricks-aws,
                                    and databricks-azure.
                                    Default is onprem.
  -r, --report-read-schema          Whether to output the read formats and
                                    datatypes to the CSV file. This can be very
                                    long. Default is false.
      --spark-property <arg>...     Filter applications based on certain Spark
                                    properties that were set during launch of the
                                    application. It can filter based on a
                                    key:value pair or just based on keys. Multiple
                                    configs can be provided, and the filtering is
                                    done if any of the configs is present in the
                                    eventlog.
                                    Filter on a specific configuration:
                                    --spark-property=spark.eventLog.enabled:true
                                    Filter all eventlogs which have the config:
                                    --spark-property=spark.driver.port
                                    Multiple configs:
                                    --spark-property=spark.eventLog.enabled:true
                                    --spark-property=spark.driver.port
  -s, --start-app-time <arg>        Filter event logs whose application start
                                    occurred within the past specified time
                                    period. Valid time periods are
                                    min(minute),h(hours),d(days),w(weeks),m(months).
                                    If a period is not specified it defaults to
                                    days.
  -t, --timeout <arg>               Maximum time in seconds to wait for the event
                                    logs to be processed. Default is 24 hours
                                    (86400 seconds) and must be greater than 3
                                    seconds. If it times out, it will report what
                                    it was able to process up until the timeout.
  -u, --user-name <arg>             Applications which a particular user has
                                    submitted.
      --help                        Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs. e.g.: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2
Example commands:
Process the 10 newest logs, and only output the top 3 in the output:
java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir
Process last 100 days’ logs:
java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir
Process only the newest log with the same application name:
java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir
Parse ML functions from the eventlog:
java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain --ml-functions /eventlogDir
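Filters can also be combined. A hedged example that keeps an event log if it matches an application-name filter or has a particular Spark property set (the name and property are illustrative):

java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain \
   -a nightly_etl --spark-property=spark.eventLog.enabled:true --any /eventlogDir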
Note
The "regular expression" used by the -a option is based on java.util.regex.Pattern.
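For example, because the -a value follows java.util.regex.Pattern syntax, you can match a family of applications by name; the pattern below is purely illustrative:

java ${QUALIFICATION_HEAP} \
   -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain -a "etl_pipeline_.*" /eventlogDir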
Running using a Spark Listener#
We provide a Spark Listener that can be installed at application start and that will produce output for each SQL query in the running application, indicating whether that query is a good fit to try with the RAPIDS Accelerator for Apache Spark.
Prerequisites#
Java 8 or above, Spark 3.0.1+
Download the tools jar#
Download the latest jar from Maven repository
Configuration#
Add the RunningQualificationEventProcessor to the Spark listeners configuration:

spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor
The user should specify the output directory if they want the output to go to separate files; otherwise it will go to the Spark driver log. If the output directory is specified, it outputs two different files, one CSV and one pretty-printed log file. The output directory can be a local directory or point to a distributed file system or blobstore like S3. The directory is set with:

spark.rapids.qualification.outputDir
By default, this will output results for 10 SQL queries per file and will keep 100 files. This is because many blob stores don't show files until they are fully written, so you wouldn't be able to see the results for a running application until it finishes the configured number of SQL queries per file. This behavior can be configured with the following configs:

spark.rapids.qualification.output.numSQLQueriesPerFile (default 10)
spark.rapids.qualification.output.maxNumFiles (default 100)
Run the Spark application#
Run the application and include the tools jar, the spark.extraListeners config, and optionally the other configs to control the tool's behavior.
For example:
$SPARK_HOME/bin/spark-shell \
--jars rapids-4-spark-tools_2.12-<version>.jar \
--conf spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor \
--conf spark.rapids.qualification.outputDir=/tmp/qualPerSqlOutput \
--conf spark.rapids.qualification.output.numSQLQueriesPerFile=5 \
--conf spark.rapids.qualification.output.maxNumFiles=10
After running some SQL queries you can look in the output directory and see files like:
rapids_4_spark_qualification_output_persql_0.csv
rapids_4_spark_qualification_output_persql_0.log
rapids_4_spark_qualification_output_persql_1.csv
rapids_4_spark_qualification_output_persql_1.log
rapids_4_spark_qualification_output_persql_2.csv
rapids_4_spark_qualification_output_persql_2.log
See the Understanding the Qualification tool output section for details on the file contents.
Running the Qualification tool inside a running Spark application using the API#
Prerequisites#
Java 8 or above, Spark 3.0.1+
Download the tools jar#
Download the latest jar from Maven repository
Modify your application code to call the APIs#
Currently only Scala APIs are supported. Note that this does not currently support reporting at the per-SQL level; that can be done manually by wrapping and reporting around individual queries instead of the entire application (see the sketch after the example below).
Create the RunningQualificationApp:
val qualApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()
Get the event listener from it and install it as a Spark listener:
val listener = qualApp.getEventListener
spark.sparkContext.addSparkListener(listener)
Run your queries and then get the summary or detailed output to see the results.
The summary output API:

/**
 * Get the summary report for qualification.
 * @param delimiter The delimiter separating fields of the summary report.
 * @param prettyPrint Whether to include the separator at start and end and
 *                    add spacing so the data rows align with column headings.
 * @return String containing the summary report.
 */
getSummary(delimiter: String = "|", prettyPrint: Boolean = true): String
The detailed output API:

/**
 * Get the detailed report for qualification.
 * @param delimiter The delimiter separating fields of the detailed report.
 * @param prettyPrint Whether to include the separator at start and end and
 *                    add spacing so the data rows align with column headings.
 * @return String containing the detailed report.
 */
getDetailed(delimiter: String = "|", prettyPrint: Boolean = true, reportReadSchema: Boolean = false): String
Example:
// run your sql queries ...

// To get the summary output:
val summaryOutput = qualApp.getSummary()

// To get the detailed output:
val detailedOutput = qualApp.getDetailed()

// print the output somewhere for user to see
println(summaryOutput)
println(detailedOutput)
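If you want something closer to per-SQL reporting from the API, one hedged approach (an illustration, not a dedicated per-SQL API) is to give each query its own RunningQualificationApp and remove its listener when the query is done. The helper name and query code below are placeholders:

// Sketch: qualify a single query by wrapping it with its own app/listener.
def qualifySingleQuery(spark: org.apache.spark.sql.SparkSession)(runMyQuery: => Unit): String = {
  val perQueryApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()
  val listener = perQueryApp.getEventListener
  spark.sparkContext.addSparkListener(listener)
  try {
    runMyQuery               // run just the query (or queries) you want scored
    perQueryApp.getSummary() // summary covering only what this listener observed
  } finally {
    spark.sparkContext.removeSparkListener(listener) // stop collecting events for this wrapper
  }
}

// usage (illustrative):
// val report = qualifySingleQuery(spark) { spark.sql("SELECT ...").collect() }
// println(report)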
If you need to specify the tools jar as a maven dependency to compile the Spark application:
<dependency>
   <groupId>com.nvidia</groupId>
   <artifactId>rapids-4-spark-tools_2.12</artifactId>
   <version>${version}</version>
</dependency>
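If you build with sbt instead of Maven, the equivalent coordinate would be along these lines (shown as an assumption; pick the version and dependency scope that fit your build):

libraryDependencies += "com.nvidia" %% "rapids-4-spark-tools" % "<version>"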
Run the Spark application#
Run your Spark application and include the tools jar you downloaded with the Spark --jars option, then view the output wherever you had it printed.
For example, if running the spark-shell:
$SPARK_HOME/bin/spark-shell --jars rapids-4-spark-tools_2.12-<version>.jar