Jar Usage#
If you are not using the CLI tool, the Qualification tool can be run as a Java command in three different ways:
A standalone tool on the Spark event logs after the application(s) have run,
Inside a running Spark application using explicit API calls, and
Using a Spark listener, which can output results per SQL query.
Setting Up Environment#
Prerequisites#
Java 8+
Spark event log(s) from Spark 2.0 or later. Both rolled and compressed event logs with the .lz4, .lzf, .snappy, and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.
The tool requires the Spark 3.x+ jars to run, but it does not need an Apache Spark runtime. If you do not already have Spark 3.x+ installed, you can download the Apache Spark distribution to any machine and include the jars in the classpath.
This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files, multiple event log files, or directories containing Spark event logs, located in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. If you want to point to the local filesystem, be sure to include the prefix file: in the path. If any input is a remote file path or directory path, then the connector dependencies need to be on the classpath.
For HDFS, include $HADOOP_CONF_DIR in the classpath:
-cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/
For GCS, download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.
For S3, download the matched jars based on the Hadoop version:
hadoop-aws-<version>.jar
aws-java-sdk-<version>.jar
In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below inside:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>
You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit.
Please refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.
For ABFS, download the matched jar based on the Hadoop version: hadoop-azure-<version>.jar.
The simplest authentication mechanism is to use account-name and account-key. Please refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS.
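The classpath pieces described above can be assembled with a small shell helper. This is a minimal sketch, not something shipped with the tool: the function name build_qual_classpath is hypothetical, and it assumes SPARK_HOME is set and that HADOOP_CONF_DIR is set only when remote Hadoop-compatible inputs are used.

```shell
# Hypothetical helper (not part of the tool): assemble the -cp value from
# the tools jar, the Spark jars, and the optional Hadoop config directory.
build_qual_classpath() {
  local tools_jar="$1"
  local cp="${tools_jar}:${SPARK_HOME}/jars/*"
  # Append the Hadoop configuration directory only when it is defined.
  if [ -n "${HADOOP_CONF_DIR:-}" ]; then
    cp="${cp}:${HADOOP_CONF_DIR}/"
  fi
  # Print the value quoted so the shell does not expand the "*" wildcard.
  printf '%s\n' "$cp"
}
```

When passing the result to java, keep it quoted (for example -cp "$(build_qual_classpath /path/to/tools.jar)") so that the wildcard reaches the JVM unexpanded and the JVM resolves the jars itself.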
Getting the Tools Jar#
Either download a released jar or build one from source.
Download the latest release from the Maven repository. Refer to the spark-rapids-user-tools GitHub releases page for release notes.
To build from source, check out the code repository:
git clone git@github.com:NVIDIA/spark-rapids-tools.git
cd spark-rapids-tools/core
Build using Maven. After a successful build, the jar rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar will be in the target/ directory. Refer to the build doc for more information on build options (e.g., the Spark version).
mvn clean package
Deploying Tools Jar#
Running the Qualification Tool Standalone on Spark Event Logs#
The tool reads the log files and processes them in memory, so the heap memory should be increased when processing a large volume of events. It is recommended to pass the VM option -Xmx10g and adjust it according to the number of applications and the size of the logs being processed.
export JVM_HEAP=-Xmx10g
Examples of running the tool in the following environments:
Extract the Spark distribution into a local directory if necessary.
Either set SPARK_HOME to point to that directory or put the path directly in the classpath (java -cp toolsJar:$SPARK_HOME/jars/*:...) when you run the Qualification tool.
Usage:
java ${JVM_HEAP} \
  -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain [options] \
  <eventlogs | eventlog directories ...>
Sample:
java ${JVM_HEAP} \
  -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  /usr/logs/app-name1
Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster, the default filesystem is likely HDFS for both the input and output, so if you want to point to the local filesystem, be sure to include file: in the path.
java ${JVM_HEAP} \
  -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir
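Because an unqualified path is resolved against the default filesystem, it can help to make local paths explicit before passing them to the tool. The helper below is a minimal sketch under stated assumptions: the name to_local_uri is hypothetical, and it treats any argument containing a colon as already carrying a scheme (so a local file name that itself contains a colon would be passed through unchanged).

```shell
# Hypothetical helper: turn a local path into an explicit file: path so it
# is not resolved against the cluster's default filesystem (likely HDFS).
to_local_uri() {
  case "$1" in
    *:*) printf '%s\n' "$1" ;;                 # already has a scheme (s3a://, file:, ...)
    /*)  printf 'file:%s\n' "$1" ;;            # absolute local path
    *)   printf 'file:%s/%s\n' "$PWD" "$1" ;;  # relative path, anchored at $PWD
  esac
}
```

For example, to_local_uri /usr/logs/app-name1 prints file:/usr/logs/app-name1, which can then be used as an event log argument or as the value of --output-directory.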
Qualification tool options
java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                        Apply multiple event log filtering criteria
                                   and process only logs for which all
                                   conditions are satisfied. Example: <Filter1>
                                   <Filter2> <Filter3> --all -> result is
                                   <Filter1> AND <Filter2> AND <Filter3>.
                                   Default is all=true
      --any                        Apply multiple event log filtering criteria
                                   and process only logs for which any condition
                                   is satisfied. Example: <Filter1> <Filter2>
                                   <Filter3> --any -> result is <Filter1> OR
                                   <Filter2> OR <Filter3>
  -a, --application-name <arg>     Filter event logs by application name. The
                                   string specified can be a regular expression,
                                   substring, or exact match. For filtering
                                   based on complement of application name, use
                                   ~APPLICATION_NAME. i.e Select all event logs
                                   except the ones which have application name
                                   as the input string.
      --auto-tuner                 Toggle AutoTuner module.
  -f, --filter-criteria <arg>      Filter newest or oldest N eventlogs based on
                                   application start timestamp, unique
                                   application name or filesystem timestamp.
                                   Filesystem based filtering happens before any
                                   application based filtering. For application
                                   based filtering, the order in which filters
                                   are applied is: application-name,
                                   start-app-time, filter-criteria. Application
                                   based filter-criteria are: 100-newest (for
                                   processing newest 100 event logs based on
                                   timestamp inside the eventlog) i.e application
                                   start time) 100-oldest (for processing
                                   oldest 100 event logs based on timestamp
                                   inside the eventlog) i.e application start
                                   time) 100-newest-per-app-name (select at
                                   most 100 newest log files for each unique
                                   application name) 100-oldest-per-app-name
                                   (select at most 100 oldest log files for each
                                   unique application name) Filesystem based
                                   filter criteria are: 100-newest-filesystem
                                   (for processing newest 100 event logs based
                                   on filesystem timestamp).
                                   100-oldest-filesystem (for processing oldest
                                   100 event logs based on filesystem timestamp).
  -h, --html-report                Default is to generate an HTML report.
      --no-html-report             Disables generating the HTML report.
  -m, --match-event-logs <arg>     Filter event logs whose filenames contain the
                                   input string. Filesystem based filtering
                                   happens before any application based
                                   filtering.
      --max-sql-desc-length <arg>  Maximum length of the SQL description
                                   string output with the per sql output.
                                   Default is 100.
      --ml-functions               Report if there are any SparkML or Spark XGBoost
                                   functions in the eventlog.
  -n, --num-output-rows <arg>      Number of output rows in the summary report.
                                   Default is 1000.
      --num-threads <arg>          Number of thread to use for parallel
                                   processing. The default is the number of
                                   cores on host divided by 4.
      --order <arg>                Specify the sort order of the report. desc or
                                   asc, desc is the default. desc (descending)
                                   would report applications most likely to be
                                   accelerated at the top and asc (ascending)
                                   would show the least likely to be accelerated
                                   at the top.
  -o, --output-directory <arg>     Base output directory. Default is current
                                   directory for the default filesystem. The
                                   final output will go into a subdirectory
                                   called rapids_4_spark_qualification_output.
                                   It will overwrite any existing directory with
                                   the same name.
  -p, --per-sql                    Report at the individual SQL query level.
      --platform <arg>             Cluster platform where Spark CPU workloads were
                                   executed. Options include onprem, dataproc-t4,
                                   dataproc-l4, emr, databricks-aws, and
                                   databricks-azure.
                                   Default is onprem.
  -r, --report-read-schema         Whether to output the read formats and
                                   datatypes to the CSV file. This can be very
                                   long. Default is false.
      --spark-property <arg>...    Filter applications based on certain Spark
                                   properties that were set during launch of the
                                   application. It can filter based on key:value
                                   pair or just based on keys. Multiple configs
                                   can be provided where the filtering is done
                                   if any of the config is present in the
                                   eventlog. filter on specific configuration:
                                   --spark-property=spark.eventLog.enabled:true
                                   filter all eventlogs which has config:
                                   --spark-property=spark.driver.port
                                   Multiple configs:
                                   --spark-property=spark.eventLog.enabled:true
                                   --spark-property=spark.driver.port
  -s, --start-app-time <arg>       Filter event logs whose application start
                                   occurred within the past specified time
                                   period. Valid time periods are
                                   min(minute),h(hours),d(days),w(weeks),m(months).
                                   If a period is not specified it defaults to
                                   days.
  -t, --timeout <arg>              Maximum time in seconds to wait for the event
                                   logs to be processed. Default is 24 hours
                                   (86400 seconds) and must be greater than 3
                                   seconds. If it times out, it will report what
                                   it was able to process up until the timeout.
  -u, --user-name <arg>            Applications which a particular user has
                                   submitted.
  -w, --worker-info <arg>          File path containing the system information
                                   of a worker node. It is assumed that all
                                   workers are homogenous. It requires the
                                   AutoTuner to be enabled. Default is
                                   ./worker_info.yaml
      --help                       Show help message

 trailing arguments:
  eventlog (required)   Event log filenames(space separated) or directories
                        containing event logs. eg: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2
Note
--help should be placed before the trailing event logs.
The “regular expression” used by the -a option is based on java.util.regex.Pattern.
Please refer to Java CMD Samples for more examples and sample commands.
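As an illustration of how the filtering options combine, the sketch below wraps the invocation in a shell function and prints the resulting command instead of running it. The wrapper name qual_cmd, the /opt/spark fallback, the application-name pattern etl.*, and the /eventlogDir input are hypothetical placeholders. The value 3d assumes that --start-app-time takes a number immediately followed by one of the period units listed in the help text.

```shell
# Placeholders: point these at your own tools jar and Spark installation.
TOOLS_JAR="${TOOLS_JAR:-rapids-4-spark-tools_2.12-<version>.jar}"
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Hypothetical wrapper: prints the java command for the given options
# (dry run). Remove "echo" to actually launch the tool.
qual_cmd() {
  echo java "${JVM_HEAP:--Xmx10g}" \
    -cp "${TOOLS_JAR}:${SPARK_HOME}/jars/*" \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain "$@"
}

# The 10 newest logs by application start time, with per-SQL reporting.
qual_cmd --filter-criteria 10-newest --per-sql /eventlogDir

# Logs whose application name matches the regex OR whose application
# started within the past 3 days (--any turns the filters into alternatives).
qual_cmd --any --application-name 'etl.*' --start-app-time 3d /eventlogDir
```

Printing the command first makes it easy to check the classpath and the option order before committing to a long-running analysis over many event logs.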
Tuning Spark Properties For GPU Clusters#
Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (e.g., memory, cores, Spark default configurations) as well as information processed in the application event logs. Note that the tool will also recommend settings for the application assuming that the job will be able to use all the cluster resources (CPU and GPU) when it is running. The values loaded from the app logs take precedence over the default configs.
Note
Auto-Tuner limitations:
It is assumed that all the worker nodes on the cluster are homogeneous.
To run the Auto-Tuner, enable the --auto-tuner flag and optionally pass a valid --worker-info <FILE_PATH>. The Auto-Tuner needs to learn the system properties of the worker nodes that run application code in the cluster. The argument FILE_PATH can be either a local or a remote file (e.g., on HDFS).
If the --worker-info argument is not supplied, the Auto-Tuner will only recommend tuned settings based on the job event log, since no cluster or worker information is available.
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.driver.maxResultSize: 7680m
  spark.driver.memory: 15360m
  spark.executor.cores: '8'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
  spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m
Property | Optional | If Missing
---|---|---
system.numCores | No | Auto-Tuner does not calculate any recommendations
system.memory | No | Auto-Tuner does not calculate any recommendations
system.numWorkers | Yes | Default: 1
gpu.name | Yes | Default: T4 (NVIDIA Tesla T4)
gpu.memory | Yes | Default: 16G
softwareProperties | Yes | This section is optional. The Auto-Tuner reads the configs within the logs of the Apache Spark apps with higher precedence
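Since only system.numCores and system.memory are mandatory, a worker info file can be quite small. The following is a minimal sketch, assuming the defaults in the table above apply when the remaining properties are left out. The WORKER_INFO variable and the /eventlogDir input are hypothetical placeholders, the values are copied from the sample file above, and the java command is only printed, not executed.

```shell
# Hypothetical location for the worker info file; adjust as needed.
WORKER_INFO="${WORKER_INFO:-./worker_info.yaml}"

# Write only the two required properties. Omitted properties are assumed
# to fall back to the defaults listed in the table.
cat > "$WORKER_INFO" <<'EOF'
system:
  numCores: 32
  memory: 212992MiB
EOF

# Dry run: print the command instead of executing it. Remove "echo" to run.
echo java "${JVM_HEAP:--Xmx10g}" \
  -cp "${TOOLS_JAR:-rapids-4-spark-tools_2.12-<version>.jar}:${SPARK_HOME:-}/jars/*" \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  --auto-tuner --worker-info "$WORKER_INFO" /eventlogDir
```

If the cluster uses a GPU other than the default, add a gpu section with name and memory, as in the full sample.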
Running Using a Spark Listener#
We provide a Spark listener that can be installed at application start. It produces output for each SQL query in the running application and indicates whether that query is a good fit to try with the RAPIDS Accelerator for Apache Spark.
Configuration#
Add the following class to the Spark listeners configuration:
spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor
The user should specify the output directory (spark.rapids.qualification.outputDir) if they want the output to go to separate files. Otherwise, it will go to the Spark driver log. If the output directory is specified, it outputs two files: one CSV file and one pretty-printed log file. The output directory can be a local directory or point to a distributed file system or blob store like S3.
By default, this will output results for 10 SQL queries per file and keep 100 files. This behavior is because many blob stores don’t show files until they are fully written, so you wouldn’t be able to see the results for a running application until it finishes the number of SQL queries per file. This behavior can be configured with the following configs:
spark.rapids.qualification.output.numSQLQueriesPerFile: default 10
spark.rapids.qualification.output.maxNumFiles: default 100
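These settings can also be kept together in a properties file and passed with the --properties-file option of spark-shell or spark-submit. The snippet below is a sketch under stated assumptions: the file name qual-listener.conf and the QUAL_CONF variable are hypothetical, the values are illustrative, and the launch command is only printed. Keep in mind that an explicit properties file is typically loaded instead of conf/spark-defaults.conf, so any defaults you rely on may need to be copied into it.

```shell
# Hypothetical file name for the listener settings.
QUAL_CONF="${QUAL_CONF:-./qual-listener.conf}"

# One property per line, key and value separated by whitespace
# (the same format as spark-defaults.conf).
cat > "$QUAL_CONF" <<'EOF'
spark.extraListeners                                    org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor
spark.rapids.qualification.outputDir                    /tmp/qualPerSqlOutput
spark.rapids.qualification.output.numSQLQueriesPerFile  5
spark.rapids.qualification.output.maxNumFiles           10
EOF

# Dry run: print the launch command rather than starting a shell.
echo "${SPARK_HOME:-}/bin/spark-shell" \
  --jars "${TOOLS_JAR:-rapids-4-spark-tools_2.12-<version>.jar}" \
  --properties-file "$QUAL_CONF"
```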
Run the Spark Application#
Run the application and include the tools jar, the spark.extraListeners config, and optionally the other configs to control the tool’s behavior.
For example:
$SPARK_HOME/bin/spark-shell \
  --jars rapids-4-spark-tools_2.12-<version>.jar \
  --conf spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor \
  --conf spark.rapids.qualification.outputDir=/tmp/qualPerSqlOutput \
  --conf spark.rapids.qualification.output.numSQLQueriesPerFile=5 \
  --conf spark.rapids.qualification.output.maxNumFiles=10
After running some SQL queries you can look in the output directory and see files like:
rapids_4_spark_qualification_output_persql_0.csv
rapids_4_spark_qualification_output_persql_0.log
rapids_4_spark_qualification_output_persql_1.csv
rapids_4_spark_qualification_output_persql_1.log
rapids_4_spark_qualification_output_persql_2.csv
rapids_4_spark_qualification_output_persql_2.log
See the Understanding the Qualification tool output section for details on the file contents.
Running the Qualification Tool Inside a Running Spark Application Using the API#
Modify Your Application Code To Call the APIs#
Currently, only Scala APIs are supported. Note that reporting at the per-SQL level is not currently supported through the API. It can be done manually by wrapping and reporting around individual queries instead of the entire application.
Create the RunningQualificationApp:
val qualApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()
Get the event listener from it and install it as a Spark listener:
val listener = qualApp.getEventListener
spark.sparkContext.addSparkListener(listener)
Run your queries and get the summary or detailed output to see the results.
The summary output API:
/**
 * Get the summary report for qualification.
 * @param delimiter The delimiter separating fields of the summary report.
 * @param prettyPrint Whether to include the separator at the start and end and
 *                    add spacing so the data rows align with column headings.
 * @return String containing the summary report.
 */
getSummary(delimiter: String = "|", prettyPrint: Boolean = true): String
The detailed output API:
/**
 * Get the detailed report for qualification.
 * @param delimiter The delimiter separating fields of the detailed report.
 * @param prettyPrint Whether to include the separator at the start and end and
 *                    add spacing so the data rows align with column headings.
 * @param reportReadSchema Whether to include the read schema in the report.
 * @return String containing the detailed report.
 */
getDetailed(delimiter: String = "|", prettyPrint: Boolean = true, reportReadSchema: Boolean = false): String
Example:
// run your sql queries ...

// To get the summary output:
val summaryOutput = qualApp.getSummary()

// To get the detailed output:
val detailedOutput = qualApp.getDetailed()

// print the output somewhere for user to see
println(summaryOutput)
println(detailedOutput)
If you need to specify the tools jar as a Maven dependency to compile the Spark application:
<dependency>
  <groupId>com.nvidia</groupId>
  <artifactId>rapids-4-spark-tools_2.12</artifactId>
  <version>${version}</version>
</dependency>
Run the Spark Application#
Run your Spark application, include the tools jar you downloaded using the Spark --jars option, and view the output wherever you had it printed.
For example, if running the spark-shell:
$SPARK_HOME/bin/spark-shell --jars rapids-4-spark-tools_2.12-<version>.jar