Profiling Tool - Jar Usage#

The Profiling tool can be run as a java cmd in three different ways if you are not using the CLI tool:

There are 3 modes of operation for the Profiling tool:

For sample execution commands, please refer to the examples section.

Setting Up Environment#

Prerequisites#

  • Java 8+

  • Spark event log(s) from Spark 2.0 or later. Both rolled and compressed event logs with .lz4, .lzf, .snappy and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.

  • The tool requires the Spark 3.x+ jars to be able to run but it does not need an Apache Spark runtime. If you do not already have Spark 3.x+ installed, you can download the Apache Spark Distribution to any machine and include the jars in the classpath.

  • This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual or multiple event log files, or directories containing Spark event logs, in the local filesystem, HDFS, S3, ABFS, GCS or a mix of these. If you want to point to the local filesystem, be sure to include the prefix file: in the path. If any input is a remote file path or directory path, then the connector dependencies need to be on the classpath.

    Include $HADOOP_CONF_DIR in classpath

    Sample showing Java’s classpath#
    -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/
    

    Download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.

    Download the matched jars based on the Hadoop version

    • hadoop-aws-<version>.jar

    • aws-java-sdk-<version>.jar

    In $SPARK_HOME/conf, create hdfs-site.xml with the following AWS S3 keys inside:

    <?xml version="1.0"?>
    <configuration>
       <property>
          <name>fs.s3a.access.key</name>
          <value>xxx</value>
       </property>
       <property>
          <name>fs.s3a.secret.key</name>
          <value>xxx</value>
       </property>
    </configuration>
    

    You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit.

    Please refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.

    • Download the matched jar based on the Hadoop version hadoop-azure-<version>.jar.

    • The simplest authentication mechanism is to use account-name and account-key. Please refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS.
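
The classpath pieces listed in these prerequisites can be assembled with a small helper. This is a minimal sketch, not part of the tool: the function name build_classpath and the TOOLS_JAR variable are hypothetical, and the connector jar paths passed as arguments are whatever you downloaded for your Hadoop version.

```shell
# Assemble the Java classpath for the Profiling tool.
# TOOLS_JAR (hypothetical variable), SPARK_HOME and HADOOP_CONF_DIR are read
# from the environment; connector jars may be passed as arguments.
build_classpath() {
    classpath="${TOOLS_JAR}:${SPARK_HOME}/jars/*"
    # Only add the Hadoop configuration directory when it is set.
    if [ -n "${HADOOP_CONF_DIR:-}" ]; then
        classpath="${classpath}:${HADOOP_CONF_DIR}/"
    fi
    # Append each connector jar given on the command line.
    for jar in "$@"; do
        classpath="${classpath}:${jar}"
    done
    printf '%s\n' "$classpath"
}
```

When passing the result to java, keep it quoted, for example -cp "$(build_classpath)", so that the shell does not expand the * wildcard; the Java launcher expands it itself.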

Getting the Tools Jar#

  • Check out the code repository

    git clone git@github.com:NVIDIA/spark-rapids-tools.git
    cd spark-rapids-tools/core
    
  • Build using Maven. After a successful build, the rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar jar will be in the target/ directory. Refer to the build doc for more information on build options (e.g., Spark version).

    mvn clean package
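
    After the build, the jar can be located without typing the version by hand. The helper below is a sketch under the assumption that it is run from spark-rapids-tools/core and that the jar follows the naming pattern given above; find_tools_jar is a hypothetical name.

```shell
# Print the path of the tools jar produced by the build.
# The directory argument defaults to target/.
find_tools_jar() {
    dir="${1:-target}"
    for jar in "$dir"/rapids-4-spark-tools_2.12-*-SNAPSHOT.jar; do
        # An unmatched glob is left as a literal pattern, so test for existence.
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    echo "no tools jar found in $dir" >&2
    return 1
}
```

    The first match is returned; if the target/ directory holds more than one matching jar, pick the one you need explicitly.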
    

Running Tools Jar#

Profiling Tool Options#

Profiling tool for the RAPIDS Accelerator and Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.profiling.ProfileMain [options]
       <eventlogs | eventlog directories ...>

  -a, --auto-tuner                Toggle AutoTuner module.
      --combined                  Collect mode but combine all applications into
                                  the same tables.
  -c, --compare                   Compare Applications (Note this may require
                                  more memory if comparing a large number of
                                  applications). Default is false.
      --csv                       Output each table to a CSV file as well
                                  creating the summary text file.
  -d, --driverlog <arg>           Specifies the name of a driver log file that
                                  the profiling tool is to process. The tool
                                  identifies any invalid operations in the log
                                  and writes them to a .csv file. When
                                  --driverlog is specified, the eventlog
                                  parameter is optional.
  -f, --filter-criteria  <arg>    Filter newest or oldest N eventlogs based on
                                  application start timestamp for processing.
                                  Filesystem based filtering happens before
                                  application based filtering (see start-app-time).
                                  eg: 100-newest-filesystem (for processing newest
                                  100 event logs). eg: 100-oldest-filesystem (for
                                  processing oldest 100 event logs).
  -g, --generate-dot              Generate query visualizations in DOT format.
                                  Default is false
      --generate-timeline         Write an SVG graph out for the full
                                  application timeline.
  -m, --match-event-logs  <arg>   Filter event logs whose filenames contain the
                                  input string
  -n, --num-output-rows  <arg>    Number of output rows for each Application.
                                  Default is 1000
      --num-threads  <arg>        Number of thread to use for parallel
                                  processing. The default is the number of cores
                                  on host divided by 4.
  -o, --output-directory  <arg>   Base output directory. Default is current
                                  directory for the default filesystem. The
                                  final output will go into a subdirectory
                                  called rapids_4_spark_profile. It will
                                  overwrite any existing files with the same
                                  name.
  -p, --print-plans               Print the SQL plans to a file named
                                  'planDescriptions.log'.
                                  Default is false.
  -s, --start-app-time  <arg>     Filter event logs whose application start
                                  occurred within the past specified time
                                  period. Valid time periods are
                                  min(minute),h(hours),d(days),w(weeks),m(months).
                                  If a period is not specified it defaults to
                                  days.
  -t, --timeout  <arg>            Maximum time in seconds to wait for the event
                                  logs to be processed. Default is 24 hours
                                  (86400 seconds) and must be greater than 3
                                  seconds. If it times out, it will report what
                                  it was able to process up until the timeout.
  -w, --worker-info  <arg>        File path containing the system information of
                                  a worker node. It is assumed that all workers
                                  are homogenous. It requires the AutoTuner to
                                  be enabled. Default is ./worker_info.yaml
  -h, --help                      Show help message

 trailing arguments:
  eventlog (optional)   Event log filenames (space separated) or directories
                        containing event logs. eg: s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2. At least one eventlog or a driver
                        log must be specified; thus an eventlog parameter is
                        required if the --driverlog option is not specified.
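
To see how several of these options fit together, the sketch below prints (rather than runs) one possible command line. Only flags from the help text are used; the function name profiling_cmd, the TOOLS_JAR variable, and the specific values (500 rows, 3600 seconds) are illustrative assumptions, not recommended settings.

```shell
# Print a Profiling tool command that combines several documented options.
# First argument: output directory; remaining arguments: event logs or directories.
# TOOLS_JAR (hypothetical variable) and SPARK_HOME are assumed to be set.
profiling_cmd() {
    out_dir="$1"
    shift
    cmd="java -cp ${TOOLS_JAR}:${SPARK_HOME}/jars/*"
    cmd="$cmd com.nvidia.spark.rapids.tool.profiling.ProfileMain"
    cmd="$cmd --csv"                                    # also write each table as CSV
    cmd="$cmd --filter-criteria 100-newest-filesystem"  # newest 100 event logs
    cmd="$cmd --num-output-rows 500"                    # rows per application
    cmd="$cmd --timeout 3600"                           # stop waiting after one hour
    cmd="$cmd --output-directory $out_dir"
    printf '%s\n' "$cmd $*"
}
```

Printing the command first makes it easy to review before running it. Because the command is built as a plain string, this sketch does not handle paths that contain spaces.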

Tuning Spark Properties For GPU Clusters#

Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (e.g. memory, cores, Spark default configurations) as well as information processed in the application event logs. Note that the tool's recommendations assume that the job will be able to use all the cluster resources (CPU and GPU) while it is running. Values loaded from the app logs take precedence over the default configs.

Note

Auto-Tuner limitations:

  • It is assumed that all the worker nodes on the cluster are homogeneous.

To run the Auto-Tuner, enable the --auto-tuner flag and optionally pass a valid --worker-info <FILE_PATH>. The Auto-Tuner needs to learn the system properties of the worker nodes that run application code in the cluster. The FILE_PATH argument can be either a local or a remote file (e.g., on HDFS).

If the --worker-info argument is not supplied, then the Auto-Tuner will only recommend tuned settings based on the job event log and not on any cluster or worker information since that is not available.

Template of the worker information file in “yaml” format#
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.driver.maxResultSize: 7680m
  spark.driver.memory: 15360m
  spark.executor.cores: '8'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
  spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m

Property            Optional  If Missing
system.numCores     No        Auto-Tuner does not calculate recommendations
system.memory       No        Auto-Tuner does not calculate any recommendations
system.numWorkers   Yes       Default: 1
gpu.name            Yes       Default: T4 (Nvidia Tesla T4)
gpu.memory          Yes       Default: 16G
softwareProperties  Yes       This section is optional. The Auto-Tuner reads the configs within the logs of the Apache Spark apps with higher precedence
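
Since only system.numCores and system.memory are mandatory, a worker information file can be quite small. The sketch below writes such a file using the values from the template; write_worker_info is a hypothetical helper, and the assumption that the tool accepts a file with no gpu or softwareProperties section is based only on the optional/default columns above.

```shell
# Write a minimal worker information file with only the mandatory properties.
# The values are copied from the template, not measured on a real worker.
write_worker_info() {
    cat > "$1" <<'EOF'
system:
  numCores: 32
  memory: 212992MiB
EOF
}
```

For example, write_worker_info ./worker_info.yaml creates the file at the default --worker-info location, after which the tool can be run with --auto-tuner.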

Processing Spark Event Logs#

  1. The tool reads the log files and processes them in memory, so the heap size should be increased when processing a large volume of events. It is recommended to pass the VM option -Xmx10g and adjust it according to the number of apps and the size of the logs being processed.

    export JVM_HEAP=-Xmx10g
    
  2. Examples running the tool on the following environments

    • Extract the Spark distribution into a local directory if necessary.

    • Either set SPARK_HOME to point to that directory or just put the path in the classpath, as in java -cp toolsJar:$SPARK_HOME/jars/*:..., when you run the Profiling tool.

    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
         com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
         <eventlogs | eventlog directories ...>
    
    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
         com.nvidia.spark.rapids.tool.profiling.ProfileMain \
         /usr/logs/app-name1
    

    Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster, the default filesystem is likely HDFS for both the input and the output, so if you want to point to the local filesystem, be sure to include file: in the path.

    java ${JVM_HEAP} \
         -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
         com.nvidia.spark.rapids.tool.profiling.ProfileMain  /eventlogDir
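
    When the default filesystem is HDFS, paths that are known to be local need the file: prefix. The helper below is a small sketch of that rule; to_local_uri is a hypothetical name, and it treats any argument containing a colon as already carrying a scheme.

```shell
# Prefix local paths with "file:" so they are not resolved against the
# default filesystem. Only call this for paths known to be local.
to_local_uri() {
    case "$1" in
        *:*) printf '%s\n' "$1" ;;                 # already has a scheme (hdfs:, s3a:, file:)
        /*)  printf 'file:%s\n' "$1" ;;            # absolute local path
        *)   printf 'file:%s/%s\n' "$PWD" "$1" ;;  # relative path, resolved against $PWD
    esac
}
```

    For instance, passing "$(to_local_uri /usr/logs/app-name1)" as the event log argument yields file:/usr/logs/app-name1, while an s3a:// or hdfs:// path is passed through unchanged.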
    

Processing Driver Logs#

The Profiling tool can process a GPU driver log as well as CPU and GPU event logs. When the Profiling tool processes a driver log, it generates a .csv file that lists unsupported operators.

You pass a GPU driver log to the Profiling tool with the command line option --driverlog. The option has one required argument, specifying the pathname of a driver log file. You may specify only one driver log file per run.

A single run of the Profiling tool may process CPU/GPU event logs, a GPU driver log, or both.

Please refer to the Processing Spark Event Logs section for instructions on accessing driver logs on remote and local filesystems.

Example running the tool on a driver log#
java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.profiling.ProfileMain  \
      --driverlog /path_to_driverlog \
      /eventlog
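
Because the eventlog parameter is optional when --driverlog is specified, a run can also process a driver log on its own. The sketch below prints the command for either case; driverlog_cmd and TOOLS_JAR are hypothetical names, and the paths are placeholders.

```shell
# Print a Profiling tool command for a driver log, with optional event logs.
# First argument: driver log path; remaining arguments (if any): event logs.
# TOOLS_JAR (hypothetical variable) and SPARK_HOME are assumed to be set.
driverlog_cmd() {
    driver_log="$1"
    shift
    cmd="java -cp ${TOOLS_JAR}:${SPARK_HOME}/jars/*"
    cmd="$cmd com.nvidia.spark.rapids.tool.profiling.ProfileMain"
    cmd="$cmd --driverlog $driver_log"
    # Event logs are optional when --driverlog is given; append them only if present.
    if [ "$#" -gt 0 ]; then
        cmd="$cmd $*"
    fi
    printf '%s\n' "$cmd"
}
```

Calling driverlog_cmd /path_to_driverlog prints a driver-log-only command, and adding /eventlog as a second argument reproduces the combined form shown above.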

Java CMD Samples#

Collection Modes#

Example running the Profiling tool with different collection modes:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
     <eventlogs | eventlog directories ...>

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain --combined \
     <eventlogs | eventlog directories ...>

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain --compare \
     <eventlogs | eventlog directories ...>
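
In a wrapper script, the three commands differ only in one flag, so the mode can be selected by name. This is a sketch only: mode_flag and the mode names collection, combined and compare are hypothetical labels chosen here, mapped to the --combined and --compare flags from the help text.

```shell
# Map a mode name to the corresponding Profiling tool flag.
# The default collection mode needs no flag, so nothing is printed for it.
mode_flag() {
    case "$1" in
        collection) : ;;
        combined)   printf '%s\n' "--combined" ;;
        compare)    printf '%s\n' "--compare" ;;
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

The result is meant to be used unquoted on the command line, for example ProfileMain $(mode_flag compare) followed by the event logs, so that an empty result adds no argument.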