Qualification tool options
java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                          Apply multiple event log filtering criteria
                                     and process only logs for which all
                                     conditions are satisfied. Example: <Filter1>
                                     <Filter2> <Filter3> --all -> result is
                                     <Filter1> AND <Filter2> AND <Filter3>.
                                     Default is all=true.
      --any                          Apply multiple event log filtering criteria
                                     and process only logs for which any condition
                                     is satisfied. Example: <Filter1> <Filter2>
                                     <Filter3> --any -> result is <Filter1> OR
                                     <Filter2> OR <Filter3>.
  -a, --application-name <arg>       Filter event logs by application name. The
                                     string specified can be a regular expression,
                                     substring, or exact match. To filter on the
                                     complement of an application name, use
                                     ~APPLICATION_NAME, i.e. select all event logs
                                     except the ones whose application name matches
                                     the input string.
      --auto-tuner                   Toggle the AutoTuner module.
  -f, --filter-criteria <arg>        Filter the newest or oldest N event logs based
                                     on application start timestamp, unique
                                     application name, or filesystem timestamp.
                                     Filesystem-based filtering happens before any
                                     application-based filtering. For
                                     application-based filtering, the order in
                                     which filters are applied is:
                                     application-name, start-app-time,
                                     filter-criteria.
                                     Application-based filter criteria are:
                                     100-newest (process the newest 100 event logs
                                     based on the timestamp inside the event log,
                                     i.e. the application start time), 100-oldest
                                     (process the oldest 100 event logs based on
                                     the timestamp inside the event log, i.e. the
                                     application start time),
                                     100-newest-per-app-name (select at most 100
                                     newest log files for each unique application
                                     name), 100-oldest-per-app-name (select at most
                                     100 oldest log files for each unique
                                     application name). Filesystem-based filter
                                     criteria are: 100-newest-filesystem (process
                                     the newest 100 event logs based on filesystem
                                     timestamp), 100-oldest-filesystem (process the
                                     oldest 100 event logs based on filesystem
                                     timestamp).
  -m, --match-event-logs <arg>       Filter event logs whose filenames contain the
                                     input string. Filesystem-based filtering
                                     happens before any application-based
                                     filtering.
      --max-sql-desc-length <arg>    Maximum length of the SQL description string
                                     output with the per-SQL output. Default is
                                     100.
      --ml-functions                 Report if there are any SparkML or Spark
                                     XGBoost functions in the event log.
  -n, --num-output-rows <arg>        Number of output rows in the summary report.
                                     Default is 1000.
      --num-threads <arg>            Number of threads to use for parallel
                                     processing. The default is the number of
                                     cores on the host divided by 4.
      --order <arg>                  Specify the sort order of the report: desc or
                                     asc; desc is the default. desc (descending)
                                     reports the applications most likely to be
                                     accelerated at the top, and asc (ascending)
                                     shows the least likely to be accelerated at
                                     the top.
  -o, --output-directory <arg>       Base output directory. Default is the current
                                     directory for the default filesystem. The
                                     final output will go into a subdirectory
                                     called rapids_4_spark_qualification_output.
                                     It will overwrite any existing directory with
                                     the same name.
  -p, --per-sql                      Report at the individual SQL query level.
      --platform <arg>               Cluster platform where the Spark CPU workloads
                                     were executed. Options include onprem,
                                     dataproc-t4, dataproc-l4, emr, databricks-aws,
                                     and databricks-azure. Default is onprem.
  -r, --report-read-schema           Whether to output the read formats and data
                                     types to the CSV file. This can be very long.
                                     Default is false.
      --spark-property <arg>...      Filter applications based on certain Spark
                                     properties that were set during launch of the
                                     application. It can filter based on a
                                     key:value pair or just on keys. Multiple
                                     configs can be provided, and an application is
                                     selected if any of the configs is present in
                                     the event log. Filter on a specific
                                     configuration:
                                     --spark-property=spark.eventLog.enabled:true.
                                     Filter all event logs which have a config:
                                     --spark-property=spark.driver.port. Multiple
                                     configs:
                                     --spark-property=spark.eventLog.enabled:true
                                     --spark-property=spark.driver.port
  -s, --start-app-time <arg>         Filter event logs whose application start
                                     occurred within the past specified time
                                     period. Valid time periods are min (minutes),
                                     h (hours), d (days), w (weeks), m (months).
                                     If a period isn't specified, it defaults to
                                     days.
  -t, --timeout <arg>                Maximum time in seconds to wait for the event
                                     logs to be processed. Default is 24 hours
                                     (86400 seconds) and must be greater than 3
                                     seconds. If it times out, it will report what
                                     it was able to process up until the timeout.
  -u, --user-name <arg>              Filter event logs for applications submitted
                                     by a particular user.
  -w, --worker-info <arg>            File path containing the system information of
                                     a worker node. It is assumed that all workers
                                     are homogeneous. It requires the AutoTuner to
                                     be enabled. Default is ./worker_info.yaml.
      --help                         Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs, for example: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2
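
For example, several of the options above can be combined in a single invocation; the sketch below assumes a hypothetical event log directory and output directory:

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain \
 --platform onprem \
 --filter-criteria 10-newest \
 --per-sql \
 --output-directory /tmp/qualification_output \
 /path/to/eventlog-directory

This processes the 10 newest event logs (by application start time) found under the given directory, adds a per-SQL report, and writes the results to /tmp/qualification_output/rapids_4_spark_qualification_output.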
Note
--help should be placed before the trailing event log arguments. The "regular expression" accepted by the -a option is based on java.util.regex.Pattern.
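As an illustration of name-based filtering, the commands below use hypothetical application names; the first keeps only event logs whose application name matches the pattern, and the second (using the ~ prefix) keeps everything except the given name:

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain \
 --application-name "nightly_etl_.*" \
 /path/to/eventlog-directory

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain \
 --application-name "~adhoc_test_job" \
 /path/to/eventlog-directory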
Please refer to Java CMD Samples for more examples and sample commands.