Qualification tool options
java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

  --all                             Apply multiple event log filtering criteria
                                    and process only logs for which all
                                    conditions are satisfied. Example: <Filter1>
                                    <Filter2> <Filter3> --all -> result is
                                    <Filter1> AND <Filter2> AND <Filter3>.
                                    Default is all=true.
  --any                             Apply multiple event log filtering criteria
                                    and process only logs for which any
                                    condition is satisfied. Example: <Filter1>
                                    <Filter2> <Filter3> --any -> result is
                                    <Filter1> OR <Filter2> OR <Filter3>.
  -a, --application-name <arg>      Filter event logs by application name. The
                                    string specified can be a regular
                                    expression, substring, or exact match. For
                                    filtering based on the complement of the
                                    application name, use ~APPLICATION_NAME,
                                    i.e. select all event logs except the ones
                                    which have the application name as the
                                    input string.
  --auto-tuner                      Toggle the AutoTuner module.
  -f, --filter-criteria <arg>       Filter the newest or oldest N eventlogs
                                    based on application start timestamp,
                                    unique application name, or filesystem
                                    timestamp. Filesystem-based filtering
                                    happens before any application-based
                                    filtering. For application-based filtering,
                                    the order in which filters are applied is:
                                    application-name, start-app-time,
                                    filter-criteria. Application-based
                                    filter-criteria are: 100-newest (for
                                    processing the newest 100 event logs based
                                    on the timestamp inside the eventlog, i.e.
                                    application start time), 100-oldest (for
                                    processing the oldest 100 event logs based
                                    on the timestamp inside the eventlog, i.e.
                                    application start time),
                                    100-newest-per-app-name (select at most the
                                    100 newest log files for each unique
                                    application name), 100-oldest-per-app-name
                                    (select at most the 100 oldest log files
                                    for each unique application name).
                                    Filesystem-based filter criteria are:
                                    100-newest-filesystem (for processing the
                                    newest 100 event logs based on filesystem
                                    timestamp), 100-oldest-filesystem (for
                                    processing the oldest 100 event logs based
                                    on filesystem timestamp).
  -h, --html-report                 Default is to generate an HTML report.
  --no-html-report                  Disables generating the HTML report.
  -m, --match-event-logs <arg>      Filter event logs whose filenames contain
                                    the input string. Filesystem-based
                                    filtering happens before any
                                    application-based filtering.
  --max-sql-desc-length <arg>       Maximum length of the SQL description
                                    string output with the per-SQL output.
                                    Default is 100.
  --ml-functions                    Report if there are any SparkML or Spark
                                    XGBoost functions in the eventlog.
  -n, --num-output-rows <arg>       Number of output rows in the summary
                                    report. Default is 1000.
  --num-threads <arg>               Number of threads to use for parallel
                                    processing. The default is the number of
                                    cores on the host divided by 4.
  --order <arg>                     Specify the sort order of the report: desc
                                    or asc; desc is the default. desc
                                    (descending) reports the applications most
                                    likely to be accelerated at the top and asc
                                    (ascending) shows the least likely to be
                                    accelerated at the top.
  -o, --output-directory <arg>      Base output directory. Default is the
                                    current directory for the default
                                    filesystem. The final output will go into a
                                    subdirectory called
                                    rapids_4_spark_qualification_output. It
                                    will overwrite any existing directory with
                                    the same name.
  -p, --per-sql                     Report at the individual SQL query level.
  --platform <arg>                  Cluster platform where the Spark CPU
                                    workloads were executed. Options include
                                    onprem, dataproc-t4, dataproc-l4, emr,
                                    databricks-aws, and databricks-azure.
                                    Default is onprem.
  -r, --report-read-schema          Whether to output the read formats and
                                    datatypes to the CSV file. This can be very
                                    long. Default is false.
  --spark-property <arg>...         Filter applications based on certain Spark
                                    properties that were set during launch of
                                    the application. It can filter based on a
                                    key:value pair or just based on keys.
                                    Multiple configs can be provided, and the
                                    filtering is done if any of the configs is
                                    present in the eventlog. Filter on a
                                    specific configuration:
                                    --spark-property=spark.eventLog.enabled:true.
                                    Filter all eventlogs which have the config:
                                    --spark-property=spark.driver.port.
                                    Multiple configs:
                                    --spark-property=spark.eventLog.enabled:true
                                    --spark-property=spark.driver.port.
  -s, --start-app-time <arg>        Filter event logs whose application start
                                    occurred within the past specified time
                                    period. Valid time periods are min
                                    (minutes), h (hours), d (days), w (weeks),
                                    and m (months). If a period is not
                                    specified, it defaults to days.
  -t, --timeout <arg>               Maximum time in seconds to wait for the
                                    event logs to be processed. Default is 24
                                    hours (86400 seconds) and must be greater
                                    than 3 seconds. If it times out, it will
                                    report what it was able to process up until
                                    the timeout.
  -u, --user-name <arg>             Applications which a particular user has
                                    submitted.
  -w, --worker-info <arg>           File path containing the system information
                                    of a worker node. It is assumed that all
                                    workers are homogeneous. It requires the
                                    AutoTuner to be enabled. Default is
                                    ./worker_info.yaml.
  --help                            Show help message.

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs, e.g. s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2
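For reference, here is a minimal sketch of an invocation that combines several of the options above. The jar path, platform value, output directory, and event log location are illustrative placeholders, not required settings:

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  --platform dataproc-t4 \
  --per-sql \
  --num-output-rows 50 \
  --output-directory /tmp/qual_output \
  /path/to/eventlog-directory

With these options the summary and per-SQL reports land under /tmp/qual_output/rapids_4_spark_qualification_output.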
Note

--help should be placed before the trailing event logs. The “regular expression” accepted by the -a option is based on java.util.regex.Pattern. Please refer to Java CMD Samples for more examples and sample commands.
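As an illustration of the filtering semantics described above, the following sketch (the application name pattern, time window, and paths are made-up examples) selects only event logs whose application name matches a java.util.regex.Pattern expression, whose application started within the last 7 days, and which are among the 10 newest per unique application name, with all three conditions required:

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  --application-name "etl_job_.*" \
  --start-app-time 7d \
  --filter-criteria 10-newest-per-app-name \
  --all \
  /path/to/eventlog-directory

Since all=true is the default, listing --all is optional; replacing it with --any would instead keep logs that satisfy any one of the three filters.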