
User Guide (23.12.1)

Qualification tool options

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                          Apply multiple event log filtering criteria
                                     and process only logs for which all
                                     conditions are satisfied. Example:
                                     <Filter1> <Filter2> <Filter3> --all ->
                                     result is <Filter1> AND <Filter2> AND
                                     <Filter3>. Default is all=true.
      --any                          Apply multiple event log filtering criteria
                                     and process only logs for which any
                                     condition is satisfied. Example: <Filter1>
                                     <Filter2> <Filter3> --any -> result is
                                     <Filter1> OR <Filter2> OR <Filter3>.
  -a, --application-name <arg>       Filter event logs by application name. The
                                     string specified can be a regular
                                     expression, substring, or exact match. For
                                     filtering based on the complement of the
                                     application name, use ~APPLICATION_NAME,
                                     i.e. select all event logs except the ones
                                     whose application name matches the input
                                     string.
  -f, --filter-criteria <arg>        Filter the newest or oldest N event logs
                                     based on application start timestamp,
                                     unique application name, or filesystem
                                     timestamp. Filesystem-based filtering
                                     happens before any application-based
                                     filtering. For application-based filtering,
                                     the order in which filters are applied is:
                                     application-name, start-app-time,
                                     filter-criteria. Application-based
                                     filter-criteria are: 100-newest (process
                                     the newest 100 event logs based on the
                                     timestamp inside the event log, i.e.
                                     application start time), 100-oldest
                                     (process the oldest 100 event logs based on
                                     the timestamp inside the event log, i.e.
                                     application start time),
                                     100-newest-per-app-name (select at most 100
                                     newest log files for each unique
                                     application name), 100-oldest-per-app-name
                                     (select at most 100 oldest log files for
                                     each unique application name).
                                     Filesystem-based filter criteria are:
                                     100-newest-filesystem (process the newest
                                     100 event logs based on filesystem
                                     timestamp), 100-oldest-filesystem (process
                                     the oldest 100 event logs based on
                                     filesystem timestamp).
  -h, --html-report                  Default is to generate an HTML report.
      --no-html-report               Disables generating the HTML report.
  -m, --match-event-logs <arg>       Filter event logs whose filenames contain
                                     the input string. Filesystem-based
                                     filtering happens before any
                                     application-based filtering.
      --max-sql-desc-length <arg>    Maximum length of the SQL description
                                     string output with the per-SQL output.
                                     Default is 100.
      --ml-functions                 Report if there are any SparkML or Spark
                                     XGBoost functions in the event log.
  -n, --num-output-rows <arg>        Number of output rows in the summary
                                     report. Default is 1000.
      --num-threads <arg>            Number of threads to use for parallel
                                     processing. The default is the number of
                                     cores on the host divided by 4.
      --order <arg>                  Specify the sort order of the report: desc
                                     or asc; desc is the default. desc
                                     (descending) reports the applications most
                                     likely to be accelerated at the top, and
                                     asc (ascending) shows the least likely at
                                     the top.
  -o, --output-directory <arg>       Base output directory. Default is the
                                     current directory for the default
                                     filesystem. The final output will go into a
                                     subdirectory called
                                     rapids_4_spark_qualification_output. It
                                     will overwrite any existing directory with
                                     the same name.
  -p, --per-sql                      Report at the individual SQL query level.
      --platform <arg>               Cluster platform where the Spark CPU
                                     workloads were executed. Options include
                                     onprem, dataproc-t4, dataproc-l4, emr,
                                     databricks-aws, and databricks-azure.
                                     Default is onprem.
  -r, --report-read-schema           Whether to output the read formats and
                                     datatypes to the CSV file. This can be very
                                     long. Default is false.
      --spark-property <arg>...      Filter applications based on certain Spark
                                     properties that were set during launch of
                                     the application. It can filter based on a
                                     key:value pair or just on keys. Multiple
                                     configs can be provided; an event log is
                                     kept if any of the configs is present in
                                     it. Filter on a specific configuration:
                                     --spark-property=spark.eventLog.enabled:true
                                     Filter all event logs which have a config:
                                     --spark-property=spark.driver.port
                                     Multiple configs:
                                     --spark-property=spark.eventLog.enabled:true
                                     --spark-property=spark.driver.port
  -s, --start-app-time <arg>         Filter event logs whose application start
                                     occurred within the past specified time
                                     period. Valid time periods are
                                     min (minutes), h (hours), d (days),
                                     w (weeks), m (months). If a period is not
                                     specified it defaults to days.
  -t, --timeout <arg>                Maximum time in seconds to wait for the
                                     event logs to be processed. Default is 24
                                     hours (86400 seconds) and must be greater
                                     than 3 seconds. If it times out, it will
                                     report what it was able to process up until
                                     the timeout.
  -u, --user-name <arg>              Applications which a particular user has
                                     submitted.
      --help                         Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs. E.g.: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2
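
For reference, a typical invocation combines a few of the options above with one or more event log locations. This is a minimal sketch; the jar version, output directory, and event log path are placeholders for illustration only:

  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain \
    --platform dataproc-t4 --per-sql \
    --output-directory /tmp/qualification_results \
    /data/spark-eventlogs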

Note

  • The --help option should be specified before the trailing event log arguments.

  • The regular expression accepted by the -a option is based on java.util.regex.Pattern (see the example after this note).
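
For instance, a minimal sketch of application-name filtering, where the pattern and the event log directory are placeholders: the pattern follows java.util.regex.Pattern syntax, and all options appear before the trailing event log arguments:

  java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain \
    --application-name 'etl_job_[0-9]+' \
    /path/to/eventlog-dir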

Please refer to Java CMD Samples for more examples and sample commands.

© Copyright 2023-2024, NVIDIA. Last updated on Feb 5, 2024.