
User Guide (23.12.1)

Qualification tool options

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                          Apply multiple event log filtering criteria
                                     and process only logs for which all
                                     conditions are satisfied. Example:
                                     <Filter1> <Filter2> <Filter3> --all ->
                                     result is <Filter1> AND <Filter2> AND
                                     <Filter3>. Default is all=true.
      --any                          Apply multiple event log filtering criteria
                                     and process only logs for which any
                                     condition is satisfied. Example: <Filter1>
                                     <Filter2> <Filter3> --any -> result is
                                     <Filter1> OR <Filter2> OR <Filter3>.
  -a, --application-name <arg>       Filter event logs by application name. The
                                     string specified can be a regular
                                     expression, substring, or exact match. For
                                     filtering based on the complement of the
                                     application name, use ~APPLICATION_NAME,
                                     i.e. select all event logs except the ones
                                     whose application name matches the input
                                     string.
  -f, --filter-criteria <arg>        Filter the newest or oldest N event logs
                                     based on application start timestamp,
                                     unique application name, or filesystem
                                     timestamp. Filesystem-based filtering
                                     happens before any application-based
                                     filtering. For application-based filtering,
                                     the order in which filters are applied is:
                                     application-name, start-app-time,
                                     filter-criteria. Application-based
                                     filter-criteria are: 100-newest (process
                                     the newest 100 event logs based on the
                                     timestamp inside the event log, i.e.
                                     application start time), 100-oldest
                                     (process the oldest 100 event logs based on
                                     the timestamp inside the event log, i.e.
                                     application start time),
                                     100-newest-per-app-name (select at most 100
                                     newest log files for each unique
                                     application name), 100-oldest-per-app-name
                                     (select at most 100 oldest log files for
                                     each unique application name).
                                     Filesystem-based filter criteria are:
                                     100-newest-filesystem (process the newest
                                     100 event logs based on filesystem
                                     timestamp), 100-oldest-filesystem (process
                                     the oldest 100 event logs based on
                                     filesystem timestamp).
  -h, --html-report                  Default is to generate an HTML report.
      --no-html-report               Disables generating the HTML report.
  -m, --match-event-logs <arg>       Filter event logs whose filenames contain
                                     the input string. Filesystem-based
                                     filtering happens before any
                                     application-based filtering.
      --max-sql-desc-length <arg>    Maximum length of the SQL description
                                     string output with the per-SQL output.
                                     Default is 100.
      --ml-functions                 Report if there are any SparkML or Spark
                                     XGBoost functions in the event log.
  -n, --num-output-rows <arg>        Number of output rows in the summary
                                     report. Default is 1000.
      --num-threads <arg>            Number of threads to use for parallel
                                     processing. The default is the number of
                                     cores on the host divided by 4.
      --order <arg>                  Specify the sort order of the report: desc
                                     or asc; desc is the default. desc
                                     (descending) reports the applications most
                                     likely to be accelerated at the top, and
                                     asc (ascending) shows the least likely at
                                     the top.
  -o, --output-directory <arg>       Base output directory. Default is the
                                     current directory for the default
                                     filesystem. The final output will go into a
                                     subdirectory called
                                     rapids_4_spark_qualification_output. It
                                     will overwrite any existing directory with
                                     the same name.
  -p, --per-sql                      Report at the individual SQL query level.
      --platform <arg>               Cluster platform where the Spark CPU
                                     workloads were executed. Options include
                                     onprem, dataproc-t4, dataproc-l4, emr,
                                     databricks-aws, and databricks-azure.
                                     Default is onprem.
  -r, --report-read-schema           Whether to output the read formats and
                                     datatypes to the CSV file. This can be very
                                     long. Default is false.
      --spark-property <arg>...      Filter applications based on certain Spark
                                     properties that were set during launch of
                                     the application. It can filter based on a
                                     key:value pair or just on keys. Multiple
                                     configs can be provided; an event log is
                                     kept if any of the configs is present in
                                     it. Filter on a specific configuration:
                                     --spark-property=spark.eventLog.enabled:true
                                     Filter all event logs which have a config:
                                     --spark-property=spark.driver.port
                                     Multiple configs:
                                     --spark-property=spark.eventLog.enabled:true
                                     --spark-property=spark.driver.port
  -s, --start-app-time <arg>         Filter event logs whose application start
                                     occurred within the past specified time
                                     period. Valid time periods are
                                     min (minutes), h (hours), d (days),
                                     w (weeks), m (months). If a period is not
                                     specified it defaults to days.
  -t, --timeout <arg>                Maximum time in seconds to wait for the
                                     event logs to be processed. Default is 24
                                     hours (86400 seconds) and must be greater
                                     than 3 seconds. If it times out, it will
                                     report what it was able to process up until
                                     the timeout.
  -u, --user-name <arg>              Applications which a particular user has
                                     submitted.
      --help                         Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs. E.g.: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2
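
For reference, a typical invocation combines a few of the options above with one or more event log locations. This is a minimal sketch; the jar version, output directory, and event log path are placeholders for illustration only:

  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain \
    --platform dataproc-t4 --per-sql \
    --output-directory /tmp/qualification_results \
    /data/spark-eventlogs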

Note

  • The --help option should be specified before the trailing event log arguments.

  • The regular expression accepted by the -a option is based on java.util.regex.Pattern (see the example after this note).
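
For instance, a minimal sketch of application-name filtering, where the pattern and the event log directory are placeholders: the pattern follows java.util.regex.Pattern syntax, and all options appear before the trailing event log arguments:

  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain \
    --application-name 'etl_job_[0-9]+' \
    /path/to/eventlog-dir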

Please refer to Java CMD Samples for more examples and sample commands.

© Copyright 2023-2024, NVIDIA. Last updated on Feb 5, 2024.