Qualification tool options

  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

  RAPIDS Accelerator Qualification tool for Apache Spark

  Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
         com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
         <eventlogs | eventlog directories ...>
        --all                          Apply multiple event log filtering
                                       criteria and process only logs for
                                       which all conditions are satisfied.
                                       Example: <Filter1> <Filter2> <Filter3>
                                       --all -> result is <Filter1> AND
                                       <Filter2> AND <Filter3>. Default is
                                       all=true.
        --any                          Apply multiple event log filtering
                                       criteria and process only logs for
                                       which any condition is satisfied.
                                       Example: <Filter1> <Filter2> <Filter3>
                                       --any -> result is <Filter1> OR
                                       <Filter2> OR <Filter3>.
    -a, --application-name  <arg>      Filter event logs by application name.
                                       The string specified can be a regular
                                       expression, substring, or exact match.
                                       To filter on the complement of an
                                       application name, use
                                       ~APPLICATION_NAME, i.e., select all
                                       event logs except the ones whose
                                       application name matches the input
                                       string.
        --auto-tuner                   Toggle the AutoTuner module.
    -f, --filter-criteria  <arg>       Filter the newest or oldest N event
                                       logs based on application start
                                       timestamp, unique application name, or
                                       filesystem timestamp. Filesystem-based
                                       filtering happens before any
                                       application-based filtering. For
                                       application-based filtering, the order
                                       in which filters are applied is:
                                       application-name, start-app-time,
                                       filter-criteria. Application-based
                                       filter criteria are: 100-newest
                                       (process the newest 100 event logs
                                       based on the timestamp inside the
                                       eventlog, i.e., application start
                                       time), 100-oldest (process the oldest
                                       100 event logs based on the timestamp
                                       inside the eventlog, i.e., application
                                       start time), 100-newest-per-app-name
                                       (select at most the 100 newest log
                                       files for each unique application
                                       name), and 100-oldest-per-app-name
                                       (select at most the 100 oldest log
                                       files for each unique application
                                       name). Filesystem-based filter
                                       criteria are: 100-newest-filesystem
                                       (process the newest 100 event logs
                                       based on filesystem timestamp) and
                                       100-oldest-filesystem (process the
                                       oldest 100 event logs based on
                                       filesystem timestamp).
    -m, --match-event-logs  <arg>      Filter event logs whose filenames
                                       contain the input string.
                                       Filesystem-based filtering happens
                                       before any application-based
                                       filtering.
        --max-sql-desc-length  <arg>   Maximum length of the SQL description
                                       string in the per-SQL output. Default
                                       is 100.
        --ml-functions                 Report whether there are any SparkML
                                       or Spark XGBoost functions in the
                                       eventlog.
    -n, --num-output-rows  <arg>       Number of output rows in the summary
                                       report. Default is 1000.
        --num-threads  <arg>           Number of threads to use for parallel
                                       processing. The default is the number
                                       of cores on the host divided by 4.
        --order  <arg>                 Specify the sort order of the report:
                                       desc or asc; desc is the default. desc
                                       (descending) reports the applications
                                       most likely to be accelerated at the
                                       top, and asc (ascending) shows the
                                       least likely at the top.
    -o, --output-directory  <arg>      Base output directory. Default is
                                       current directory for the default
                                       filesystem. The final output will go
                                       into a subdirectory called
                                       rapids_4_spark_qualification_output.
                                       It will overwrite any existing
                                       directory with the same name.
    -p, --per-sql                      Report at the individual SQL query
                                       level.
        --platform  <arg>              Cluster platform where Spark CPU
                                       workloads were executed. Options
                                       include onprem, dataproc-t4,
                                       dataproc-l4, emr, databricks-aws, and
                                       databricks-azure. Default is onprem.
    -r, --report-read-schema           Whether to output the read formats and
                                       datatypes to the CSV file. This can be
                                       very long. Default is false.
        --spark-property  <arg>...     Filter applications based on certain
                                       Spark properties that were set during
                                       launch of the application. It can
                                       filter based on a key:value pair or
                                       just on keys. Multiple configs can be
                                       provided, and the filtering is done if
                                       any of the configs is present in the
                                       eventlog. Filter on a specific
                                       configuration:
                                       --spark-property=spark.eventLog.enabled:true
                                       Filter all eventlogs which have a
                                       config:
                                       --spark-property=spark.driver.port
                                       Multiple configs:
                                       --spark-property=spark.eventLog.enabled:true
                                       --spark-property=spark.driver.port
    -s, --start-app-time  <arg>        Filter event logs whose application
                                       start occurred within the past
                                       specified time period. Valid time
                                       periods are min (minutes), h (hours),
                                       d (days), w (weeks), m (months). If a
                                       period isn't specified it defaults to
                                       days.
    -t, --timeout  <arg>               Maximum time in seconds to wait for
                                       the event logs to be processed.
                                       Default is 24 hours (86400 seconds)
                                       and must be greater than 3 seconds. If
                                       it times out, it will report what it
                                       was able to process up until the
                                       timeout.
    -u, --user-name  <arg>             Filter applications submitted by a
                                       particular user.
    -w, --worker-info  <arg>           File path containing the system
                                       information of a worker node. It's
                                       assumed that all workers are
                                       homogeneous. It requires the AutoTuner
                                       to be enabled. Default is
                                       ./worker_info.yaml.
        --help                         Show help message

   trailing arguments:
    eventlog (required)   Event log filenames (space separated) or
                          directories containing event logs. For example:
                          s3a://<BUCKET>/eventlog1 /path/to/eventlog2

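For example, several of the filtering options can be combined in one invocation. The sketch below is illustrative only: the event log directory and the application-name pattern are hypothetical. It keeps applications whose name matches "etl_.*" or that started within the last 7 days (--any makes the filters a logical OR), then processes only the 10 newest logs per unique application name:

  # Hypothetical event log path and app-name pattern; adjust to your environment.
  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain \
   --application-name "etl_.*" \
   --start-app-time 7d \
   --any \
   --filter-criteria 10-newest-per-app-name \
   /path/to/eventlogs
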
Note

  • The --help option must be specified before the trailing event log arguments.

  • The regular expression accepted by the -a option is based on java.util.regex.Pattern.

Please refer to Java CMD Samples for more examples and sample commands.
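As a quick illustration before turning to those samples, the sketch below (the output and event log paths are hypothetical) produces a per-SQL report for a Dataproc T4 cluster, keeping only applications that ran with event logging enabled:

  # Hypothetical output and event log paths; adjust to your environment.
  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
   com.nvidia.spark.rapids.tool.qualification.QualificationMain \
   --per-sql \
   --platform dataproc-t4 \
   --spark-property=spark.eventLog.enabled:true \
   --output-directory /tmp/qual_output \
   /path/to/eventlogs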