Qualification tool options
java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                         Apply multiple event log filtering criteria
                                    and process only logs for which all
                                    conditions are satisfied. Example: <Filter1>
                                    <Filter2> <Filter3> --all -> result is
                                    <Filter1> AND <Filter2> AND <Filter3>.
                                    Default is all=true
      --any                         Apply multiple event log filtering criteria
                                    and process only logs for which any condition
                                    is satisfied. Example: <Filter1> <Filter2>
                                    <Filter3> --any -> result is <Filter1> OR
                                    <Filter2> OR <Filter3>
  -a, --application-name <arg>      Filter event logs by application name. The
                                    string specified can be a regular expression,
                                    substring, or exact match. For filtering
                                    based on the complement of the application
                                    name, use ~APPLICATION_NAME, i.e. select all
                                    event logs except the ones whose application
                                    name matches the input string.
      --auto-tuner                  Toggle AutoTuner module.
      --target-cluster-info <arg>   File path to YAML containing target cluster
                                    information including worker instance type
                                    and system properties. Provides platform-aware
                                    cluster configuration. Requires AutoTuner to
                                    be enabled.
      --tuning-configs <arg>        File path to YAML containing custom tuning
                                    configuration parameters. Allows overriding
                                    default AutoTuner constants. Requires
                                    AutoTuner to be enabled.
  -f, --filter-criteria <arg>       Filter newest or oldest N eventlogs based on
                                    application start timestamp, unique
                                    application name or filesystem timestamp.
                                    Filesystem based filtering happens before any
                                    application based filtering. For application
                                    based filtering, the order in which filters
                                    are applied is: application-name,
                                    start-app-time, filter-criteria.
                                    Application based filter-criteria are:
                                    100-newest (for processing newest 100 event
                                    logs based on timestamp inside the eventlog,
                                    i.e. application start time),
                                    100-oldest (for processing oldest 100 event
                                    logs based on timestamp inside the eventlog,
                                    i.e. application start time),
                                    100-newest-per-app-name (select at most 100
                                    newest log files for each unique application
                                    name),
                                    100-oldest-per-app-name (select at most 100
                                    oldest log files for each unique application
                                    name).
                                    Filesystem based filter criteria are:
                                    100-newest-filesystem (for processing newest
                                    100 event logs based on filesystem timestamp),
                                    100-oldest-filesystem (for processing oldest
                                    100 event logs based on filesystem timestamp).
  -m, --match-event-logs <arg>      Filter event logs whose filenames contain the
                                    input string. Filesystem based filtering
                                    happens before any application based
                                    filtering.
      --max-sql-desc-length <arg>   Maximum length of the SQL description
                                    string output with the per sql output.
                                    Default is 100.
      --ml-functions                Report if there are any SparkML or Spark XGBoost
                                    functions in the eventlog.
  -n, --num-output-rows <arg>       Number of output rows in the summary report.
                                    Default is 1000.
      --num-threads <arg>           Number of threads to use for parallel
                                    processing. The default is the number of
                                    cores on host divided by 4.
      --order <arg>                 Specify the sort order of the report. desc or
                                    asc, desc is the default. desc (descending)
                                    would report applications most likely to be
                                    accelerated at the top and asc (ascending)
                                    would show the least likely to be accelerated
                                    at the top.
  -o, --output-directory <arg>      Base output directory. Default is current
                                    directory for the default filesystem. The
                                    final output will go into a subdirectory
                                    called rapids_4_spark_qualification_output.
                                    It will overwrite any existing directory with
                                    the same name.
  -p, --per-sql                     Report at the individual SQL query level.
      --platform <arg>              Cluster platform where Spark CPU workloads were
                                    executed. Options include onprem, dataproc-t4,
                                    dataproc-l4, emr, databricks-aws, and
                                    databricks-azure.
                                    Default is onprem.
  -r, --report-read-schema          Whether to output the read formats and
                                    datatypes to the CSV file. This can be very
                                    long. Default is false.
      --spark-property <arg>...     Filter applications based on certain Spark
                                    properties that were set during launch of the
                                    application. It can filter based on a
                                    key:value pair or just based on keys. Multiple
                                    configs can be provided, in which case an
                                    eventlog matches if any of the configs is
                                    present in it.
                                    Filter on a specific configuration:
                                    --spark-property=spark.eventLog.enabled:true
                                    Filter all eventlogs which have a config:
                                    --spark-property=spark.driver.port
                                    Multiple configs:
                                    --spark-property=spark.eventLog.enabled:true
                                    --spark-property=spark.driver.port
  -s, --start-app-time <arg>        Filter event logs whose application start
                                    occurred within the past specified time
                                    period. Valid time periods are min (minute),
                                    h (hours), d (days), w (weeks), m (months).
                                    If a period isn't specified it defaults to
                                    days.
  -t, --timeout <arg>               Maximum time in seconds to wait for the event
                                    logs to be processed. Default is 24 hours
                                    (86400 seconds) and must be greater than 3
                                    seconds. If it times out, it will report what
                                    it was able to process up until the timeout.
  -u, --user-name <arg>             Applications which a particular user has
                                    submitted.
      --help                        Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs, for example
                        s3a://<BUCKET>/eventlog1 /path/to/eventlog2
Note
--help should be before the trailing event logs. The “regular expression” used by
the -a option is based on java.util.regex.Pattern.
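Because the -a value reaches the tool as a regular expression, it is usually worth quoting it so that the shell does not interpret characters such as `*` or a leading `~` first. The following is a minimal sketch rather than a command taken from the tool's documentation: the `TOOLS_JAR` variable, the `qual_cmd` helper, the application-name patterns, and the `/path/to/eventlogs` directory are all hypothetical. The helper only prints the command line so it can be inspected before it is run.

```shell
#!/bin/sh
# Hypothetical location of the tools jar; <version> is left as a placeholder.
TOOLS_JAR="$HOME/rapids-4-spark-tools_2.12-<version>.jar"

# Hypothetical helper: prints the Qualification tool command line instead of
# running it. Remove the leading "echo" to actually execute the command.
qual_cmd() {
  echo java -cp "$TOOLS_JAR:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/" \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain "$@"
}

# Regular expression: single quotes keep the shell from treating "*" as a glob.
qual_cmd -a '^etl_.*' /path/to/eventlogs

# Complement: quoting also prevents tilde expansion of the leading "~".
qual_cmd -a '~nightly_report' /path/to/eventlogs
```

The printed command shows the arguments exactly as the tool would receive them, which makes it easy to confirm that the pattern survived shell processing unchanged.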
Please refer to Java CMD Samples for more examples and sample commands.
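As a rough illustration of how the filtering options described in the help text can be combined, the sketch below builds two command lines from documented flags. Several things are assumptions: the `TOOLS_JAR` variable, the `qual_cmd` helper, the filter values, and the paths are hypothetical, and the `2w` value assumes that --start-app-time takes a number followed by one of the listed period units. The helper prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical location of the tools jar; <version> is left as a placeholder.
TOOLS_JAR="$HOME/rapids-4-spark-tools_2.12-<version>.jar"

# Hypothetical helper: prints the Qualification tool command line instead of
# running it. Remove the leading "echo" to actually execute the command.
qual_cmd() {
  echo java -cp "$TOOLS_JAR:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/" \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain "$@"
}

# All conditions must hold (--all). Per the help text, the filename filter (-m)
# is applied first, then application-name, start-app-time, and filter-criteria.
# "2w" assumes the <number><unit> form for --start-app-time.
qual_cmd -m 'application_' -a '^etl_.*' -s 2w -f 10-newest-per-app-name --all \
  -o /tmp/qual_out /path/to/eventlogs

# Spark property filters: one key:value pair and one key-only filter. Per the
# help text, an eventlog is selected if any of the listed configs is present.
qual_cmd --spark-property=spark.eventLog.enabled:true \
  --spark-property=spark.driver.port /path/to/eventlogs
```

Keeping the event log paths last matches the usage line, where the event logs are trailing arguments after all options.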