Qualification Tool - Jar Usage#

The Qualification tool can be run as a standalone Java command on Spark event logs after the application(s) have run, for users who aren’t using the CLI tool.

Setting Up Environment#

Prerequisites#

  • Java 8+

  • Spark event log(s) from Spark 2.0 or later. Both rolled and compressed event logs with .lz4, .lzf, .snappy, and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.

  • The tool requires the Spark [3.x, 4.0] jars on the classpath to run, but it doesn’t need an Apache Spark runtime. If you don’t already have Spark [3.x, 4.0] installed, you can download the Apache Spark Distribution to any machine and include its jars in the classpath.
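
    For example, a minimal sketch of fetching a Spark distribution (the version below is illustrative; any Spark 3.x/4.0 release works):

     wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
     tar -xzf spark-3.5.1-bin-hadoop3.tgz
     export SPARK_HOME="$(pwd)/spark-3.5.1-bin-hadoop3"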

  • This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files or directories containing Spark event logs, in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. If you want to point to the local filesystem, be sure to include the file: prefix in the path. If any input is a remote file or directory path, the corresponding connector dependencies need to be on the classpath, as described below.

    HDFS: include $HADOOP_CONF_DIR in the classpath.

    Sample showing Java’s classpath#
    -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/
    

    GCS: download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.
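
    For example (a sketch; the bucket and connector version are placeholders), once the connector is configured per those instructions you can put the jar on the tool’s classpath and pass gs:// paths directly:

     java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:gcs-connector-hadoop3-<version>-shaded.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
          com.nvidia.spark.rapids.tool.qualification.QualificationMain gs://<BUCKET>/eventlogDir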

    S3: download the jars matching your Hadoop version:

    • hadoop-aws-<version>.jar

    • aws-java-sdk-<version>.jar

    In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below:

     <?xml version="1.0"?>
     <configuration>
        <property>
           <name>fs.s3a.access.key</name>
           <value>xxx</value>
        </property>
        <property>
           <name>fs.s3a.secret.key</name>
           <value>xxx</value>
        </property>
     </configuration>
    

    You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit.
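
    A minimal smoke test might look like the following sketch (the jar paths and bucket are placeholders for wherever you downloaded the jars and where your logs live):

     spark-shell --jars /path/to/hadoop-aws-<version>.jar,/path/to/aws-java-sdk-<version>.jar
     scala> spark.read.text("s3a://<BUCKET>/eventlog1").count()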

    Refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.

    • ABFS: download the jar matching your Hadoop version, hadoop-azure-<version>.jar.

    • The simplest authentication mechanism is to use an account name and account key. Refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS.
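
    For example, a minimal sketch of account-key authentication, following the same pattern as the S3 example above (the account name and key are placeholders; the fs.azure.account.key property is documented in the Hadoop-ABFS support doc, and the file must be in a directory that ends up on the tool’s classpath, such as $SPARK_HOME/conf or $HADOOP_CONF_DIR):

     <?xml version="1.0"?>
     <configuration>
        <property>
           <name>fs.azure.account.key.<account-name>.dfs.core.windows.net</name>
           <value>xxx</value>
        </property>
     </configuration>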

Getting the Tools Jar#

  • Check out the code repository

    git clone git@github.com:NVIDIA/spark-rapids-tools.git
    cd spark-rapids-tools/core
    
  • Build using Maven. After a successful build, rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar will be in the target/ directory. Refer to the build doc for more information on build options (for example, the Spark version).

    mvn clean package
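    # Optionally, skip running unit tests to speed up the build (standard Maven flag):
    # mvn clean package -DskipTests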
    

Deploying Tools Jar#

Running the Qualification Tool Standalone on Spark Event Logs#

  1. The tool reads the log files and processes them in memory, so the heap memory should be increased when processing a large volume of events. It’s recommended to pass the JVM option -Xmx10g and adjust it according to the number of apps and size of logs being processed.

    export JVM_HEAP=-Xmx10g
    
  2. Examples of running the tool in the following environments:

    • Extract the Spark distribution into a local directory if necessary.

    • Either set SPARK_HOME to point to that directory, or put the jars directly on the classpath (java -cp toolsJar:$SPARK_HOME/jars/*:...) when you run the Qualification tool.

    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
         com.nvidia.spark.rapids.tool.qualification.QualificationMain [options] \
         <eventlogs | eventlog directories ...>

    For example, to process a single event log:

    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
         com.nvidia.spark.rapids.tool.qualification.QualificationMain \
         /usr/logs/app-name1

    Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath). Note, on an HDFS cluster, the default filesystem is likely HDFS for both the input and output, so if you want to point to the local filesystem, be sure to include file: in the path.

    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
         com.nvidia.spark.rapids.tool.qualification.QualificationMain  /eventlogDir
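
    Example running on files in S3 (a sketch: it assumes the hadoop-aws and aws-java-sdk jars from the prerequisites were downloaded to the current directory, and that the hdfs-site.xml containing the S3 keys is on the classpath, for example via $SPARK_HOME/conf):

    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:hadoop-aws-<version>.jar:aws-java-sdk-<version>.jar:$SPARK_HOME/conf:$SPARK_HOME/jars/* \
         com.nvidia.spark.rapids.tool.qualification.QualificationMain  s3a://<BUCKET>/eventlogDir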
    

Qualification tool options#

java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.qualification.QualificationMain --help

RAPIDS Accelerator Qualification tool for Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
       <eventlogs | eventlog directories ...>

      --all                          Apply multiple event log filtering criteria
                                     and process only logs for which all
                                     conditions are satisfied. Example: <Filter1>
                                     <Filter2> <Filter3> --all -> result is
                                     <Filter1> AND <Filter2> AND <Filter3>.
                                     Default is all=true
      --any                          Apply multiple event log filtering criteria
                                     and process only logs for which any condition
                                     is satisfied. Example: <Filter1> <Filter2>
                                     <Filter3> --any -> result is <Filter1> OR
                                     <Filter2> OR <Filter3>
  -a, --application-name  <arg>      Filter event logs by application name. The
                                     string specified can be a regular expression,
                                     substring, or exact match. For filtering
                                     based on complement of application name, use
                                     ~APPLICATION_NAME, i.e., select all event logs
                                     except the ones which have application name
                                     as the input string.
      --auto-tuner                   Toggle AutoTuner module.
      --target-cluster-info  <arg>   File path to YAML containing target cluster
                                     information including worker instance type
                                     and system properties. Provides platform-aware
                                     cluster configuration. Requires AutoTuner to
                                     be enabled.
      --tuning-configs  <arg>        File path to YAML containing custom tuning
                                     configuration parameters. Allows overriding
                                     default AutoTuner constants. Requires
                                     AutoTuner to be enabled.
  -f, --filter-criteria  <arg>       Filter newest or oldest N eventlogs based on
                                     application start timestamp, unique
                                     application name or filesystem timestamp.
                                     Filesystem based filtering happens before any
                                     application based filtering. For application
                                     based filtering, the order in which filters
                                     are applied is: application-name,
                                     start-app-time, filter-criteria. Application
                                     based filter-criteria are: 100-newest (for
                                     processing newest 100 event logs based on
                                     timestamp inside the eventlog, i.e.,
                                     application start time), 100-oldest (for
                                     processing oldest 100 event logs based on
                                     timestamp inside the eventlog, i.e.,
                                     application start time),
                                     100-newest-per-app-name (select at most 100
                                     newest log files for each unique application
                                     name), 100-oldest-per-app-name (select at
                                     most 100 oldest log files for each unique
                                     application name). Filesystem based filter
                                     criteria are: 100-newest-filesystem (for
                                     processing newest 100 event logs based on
                                     filesystem timestamp), 100-oldest-filesystem
                                     (for processing oldest 100 event logs based
                                     on filesystem timestamp).
  -m, --match-event-logs  <arg>      Filter event logs whose filenames contain the
                                     input string. Filesystem based filtering
                                     happens before any application based
                                     filtering.
      --max-sql-desc-length  <arg>   Maximum length of the SQL description
                                     string output with the per sql output.
                                     Default is 100.
      --ml-functions                 Report if there are any SparkML or Spark XGBoost
                                     functions in the eventlog.
  -n, --num-output-rows  <arg>       Number of output rows in the summary report.
                                     Default is 1000.
      --num-threads  <arg>           Number of threads to use for parallel
                                     processing. The default is the number of
                                     cores on host divided by 4.
      --order  <arg>                 Specify the sort order of the report. desc or
                                     asc, desc is the default. desc (descending)
                                     would report applications most likely to be
                                     accelerated at the top and asc (ascending)
                                     would show the least likely to be accelerated
                                     at the top.
  -o, --output-directory  <arg>      Base output directory. Default is current
                                     directory for the default filesystem. The
                                     final output will go into a subdirectory
                                     called rapids_4_spark_qualification_output.
                                     It will overwrite any existing directory with
                                     the same name.
  -p, --per-sql                      Report at the individual SQL query level.
      --platform  <arg>              Cluster platform where Spark CPU workloads were
                                     executed. Options include onprem, dataproc-t4,
                                     dataproc-l4, emr, databricks-aws, and
                                     databricks-azure.
                                     Default is onprem.
  -r, --report-read-schema           Whether to output the read formats and
                                     datatypes to the CSV file. This can be very
                                     long. Default is false.
      --spark-property  <arg>...     Filter applications based on certain Spark
                                     properties that were set during launch of the
                                     application. It can filter based on key:value
                                     pair or just based on keys. Multiple configs
                                     can be provided where the filtering is done
                                     if any of the configs is present in the
                                     eventlog. Filter on a specific configuration:
                                     --spark-property=spark.eventLog.enabled:true.
                                     Filter all eventlogs which have the config key:
                                     --spark-property=spark.driver.port. Multiple
                                     configs:
                                     --spark-property=spark.eventLog.enabled:true
                                     --spark-property=spark.driver.port
  -s, --start-app-time  <arg>        Filter event logs whose application start
                                     occurred within the past specified time
                                     period. Valid time periods are
                                     min(minute),h(hours),d(days),w(weeks),m(months).
                                     If a period isn’t specified it defaults to
                                     days.
  -t, --timeout  <arg>               Maximum time in seconds to wait for the event
                                     logs to be processed. Default is 24 hours
                                     (86400 seconds) and must be greater than 3
                                     seconds. If it times out, it will report what
                                     it was able to process up until the timeout.
  -u, --user-name  <arg>             Applications which a particular user has
                                     submitted.
      --help                         Show help message

 trailing arguments:
  eventlog (required)   Event log filenames (space separated) or directories
                        containing event logs. For example, s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2

Note

  • --help should be before the trailing event logs.

  • The “regular expression” used by -a option is based on java.util.regex.Pattern.

Please refer to Java CMD Samples for more examples and sample commands.

Tuning Spark Properties For GPU Clusters#

Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (for example, memory, cores, Spark default configurations) as well as information processed in the application event logs. The tool will recommend settings for the application assuming that the job will be able to use all the cluster resources (CPU and GPU) when it’s running. The values loaded from the app logs have higher precedence than the default configs.

The recommendations span several categories, ordered from most to least impactful:

  • RAPIDS plugin & GPU resources (required for GPU execution): spark.plugins (must include com.nvidia.spark.SQLPlugin), spark.rapids.sql.enabled, spark.executor.resource.gpu.amount, spark.task.resource.gpu.amount, and spark.shuffle.manager (RAPIDS Shuffle Manager).

  • Executor sizing: spark.executor.cores, spark.executor.instances, spark.executor.memory, spark.executor.memoryOverhead.

  • GPU runtime: spark.rapids.sql.concurrentGpuTasks, spark.rapids.memory.pinnedPool.size, spark.rapids.sql.batchSizeBytes.

  • Shuffle and AQE: spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes, spark.sql.adaptive.advisoryPartitionSizeInBytes, spark.sql.adaptive.coalescePartitions.parallelismFirst.

  • Dynamic allocation: spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors, and spark.dynamicAllocation.maxExecutors, sized against the CPU-to-GPU core ratio. When recommended, the Auto-Tuner enforces minExecutors <= initialExecutors <= maxExecutors.

  • Platform-specific plugins: additional recommendations may be emitted for EMR (JVM options that disable Transparent Huge Pages) and Delta Lake (GPU-accelerated Delta write via spark.rapids.sql.format.delta.write.enabled, plus version-compatibility and support comments).

The Auto-Tuner also tunes secondary properties such as Kryo serialization settings, multi-threaded reader/writer threads, RAPIDS file cache, data locality wait, and platform compatibility flags where applicable.

Note

Auto-Tuner limitations:

  • It’s assumed that all the worker nodes on the cluster are homogeneous.

To run the Auto-Tuner, pass the --auto-tuner flag. Optionally, provide target cluster information using --target-cluster-info <FILE_PATH> to specify the GPU worker node configuration for generating optimized recommendations. The file path can be local or remote (for example, HDFS).

If the --target-cluster-info argument isn’t supplied, the Auto-Tuner will use platform-specific default worker instance types for tuning recommendations. See AutoTuner Configuration for details on default instance types, supported platforms, and how to customize AutoTuner behavior.
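
A minimal sketch of enabling the AutoTuner from the standalone jar (the YAML path is a placeholder; both flags are listed in the options output above):

  java ${JVM_HEAP} \
    -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
    com.nvidia.spark.rapids.tool.qualification.QualificationMain --auto-tuner \
    --target-cluster-info /path/to/target_cluster_info.yaml /eventlogDir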

Java CMD Samples#

  • Process the 10 newest logs, and only report the top 3 in the output:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir
    
  • Process last 100 days’ logs:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir
    
  • Process only the newest log with the same application name:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir
    
  • Parse ML functions from the eventlog:

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain --ml-functions /eventlogDir
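
  • Process only applications whose name matches a Java regular expression (an illustrative sketch; the pattern etl_.* is a placeholder, see the -a option above):

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -a "etl_.*" /eventlogDir

  • Report per-SQL-query details and write the output under a specific directory (the output path is a placeholder; see the --per-sql and -o options above):

    java ${JVM_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain --per-sql -o /path/to/output /eventlogDir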