Qualification tool options
java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                        Apply multiple event log filtering criteria
                                   and process only logs for which all
                                   conditions are satisfied. Example: <Filter1>
                                   <Filter2> <Filter3> --all -> result is
                                   <Filter1> AND <Filter2> AND <Filter3>.
                                   Default is all=true.
      --any                        Apply multiple event log filtering criteria
                                   and process only logs for which any
                                   condition is satisfied. Example: <Filter1>
                                   <Filter2> <Filter3> --any -> result is
                                   <Filter1> OR <Filter2> OR <Filter3>.
  -a, --application-name <arg>     Filter event logs by application name. The
                                   string specified can be a regular
                                   expression, substring, or exact match. To
                                   filter on the complement of an application
                                   name, use ~APPLICATION_NAME, i.e. select all
                                   event logs except the ones whose application
                                   name matches the input string.
  -f, --filter-criteria <arg>      Filter the newest or oldest N event logs
                                   based on application start timestamp, unique
                                   application name, or filesystem timestamp.
                                   Filesystem-based filtering happens before
                                   any application-based filtering. For
                                   application-based filtering, the order in
                                   which filters are applied is:
                                   application-name, start-app-time,
                                   filter-criteria. Application-based
                                   filter-criteria are: 100-newest (process the
                                   newest 100 event logs based on the timestamp
                                   inside the event log, i.e. application start
                                   time), 100-oldest (process the oldest 100
                                   event logs based on the timestamp inside the
                                   event log, i.e. application start time),
                                   100-newest-per-app-name (select at most the
                                   100 newest log files for each unique
                                   application name), 100-oldest-per-app-name
                                   (select at most the 100 oldest log files for
                                   each unique application name).
                                   Filesystem-based filter-criteria are:
                                   100-newest-filesystem (process the newest
                                   100 event logs based on filesystem
                                   timestamp), 100-oldest-filesystem (process
                                   the oldest 100 event logs based on
                                   filesystem timestamp).
  -h, --html-report                Generate the HTML report (this is the
                                   default).
      --no-html-report             Disable generating the HTML report.
  -m, --match-event-logs <arg>     Filter event logs whose filenames contain
                                   the input string. Filesystem-based filtering
                                   happens before any application-based
                                   filtering.
      --max-sql-desc-length <arg>  Maximum length of the SQL description
                                   string output with the per-SQL output.
                                   Default is 100.
      --ml-functions               Report if there are any SparkML or Spark
                                   XGBoost functions in the event log.
  -n, --num-output-rows <arg>      Number of output rows in the summary report.
                                   Default is 1000.
      --num-threads <arg>          Number of threads to use for parallel
                                   processing. The default is the number of
                                   cores on the host divided by 4.
      --order <arg>                Specify the sort order of the report: desc
                                   or asc; desc is the default. desc
                                   (descending) reports the applications most
                                   likely to be accelerated at the top, and asc
                                   (ascending) shows the least likely to be
                                   accelerated at the top.
  -o, --output-directory <arg>     Base output directory. Default is the
                                   current directory for the default
                                   filesystem. The final output will go into a
                                   subdirectory called
                                   rapids_4_spark_qualification_output. It will
                                   overwrite any existing directory with the
                                   same name.
  -p, --per-sql                    Report at the individual SQL query level.
      --platform <arg>             Cluster platform where the Spark CPU
                                   workloads were executed. Options include
                                   onprem, dataproc-t4, dataproc-l4, emr,
                                   databricks-aws, and databricks-azure.
                                   Default is onprem.
  -r, --report-read-schema         Whether to output the read formats and
                                   datatypes to the CSV file. This can be very
                                   long. Default is false.
      --spark-property <arg>...    Filter applications based on certain Spark
                                   properties that were set during launch of
                                   the application. It can filter based on a
                                   key:value pair or just on keys. Multiple
                                   configs can be provided; the filtering is
                                   done if any of the configs is present in
                                   the event log. Filter on a specific
                                   configuration:
                                   --spark-property=spark.eventLog.enabled:true
                                   Filter all event logs which have the config:
                                   --spark-property=spark.driver.port
                                   Multiple configs:
                                   --spark-property=spark.eventLog.enabled:true
                                   --spark-property=spark.driver.port
  -s, --start-app-time <arg>       Filter event logs whose application start
                                   occurred within the past specified time
                                   period. Valid time periods are min (minutes),
                                   h (hours), d (days), w (weeks), and m
                                   (months). If a period is not specified it
                                   defaults to days.
  -t, --timeout <arg>              Maximum time in seconds to wait for the
                                   event logs to be processed. Default is 24
                                   hours (86400 seconds) and must be greater
                                   than 3 seconds. If it times out, it will
                                   report what it was able to process up until
                                   the timeout.
  -u, --user-name <arg>            Filter event logs by applications which a
                                   particular user has submitted.
      --help                       Show help message

 trailing arguments:
  eventlog (required)      Event log filenames (space separated) or
                           directories containing event logs, e.g.
                           s3a://<BUCKET>/eventlog1 /path/to/eventlog2
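The options above can be combined in a single run. The following is a minimal sketch, not a definitive invocation: the event log location (/path/to/eventlogs) and the output directory (./qual-output) are placeholders, and the jar and classpath follow the --help example at the top of this page.

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain \
 --platform onprem \
 --filter-criteria 100-newest-per-app-name \
 --start-app-time 7d \
 --per-sql \
 --output-directory ./qual-output \
 /path/to/eventlogs

Because --all is the default, an event log is processed only when it satisfies every filter given, in this case an application start within the past 7 days and membership in the 100 newest logs per unique application name.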
Note

--help should be placed before the trailing event logs. The “regular expression” accepted by the -a option is based on java.util.regex.Pattern.

Please refer to Java CMD Samples for more examples and sample commands.
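As an illustration of the -a filtering mentioned in the note, the sketches below use placeholder application names and event log paths. The first selects only applications whose names match the java.util.regex.Pattern expression etl_.*; the second uses the ~ prefix to exclude a specific application name.

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain \
 --application-name "etl_.*" \
 /path/to/eventlogs

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain \
 --application-name "~nightly_ingest" \
 /path/to/eventlogs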