Profiling Tool - Jar Usage
If you aren't using the CLI tool, the Profiling tool can be run as a Java command in three different ways.
The Profiling tool has three modes of operation: collection (the default), combined (--combined), and compare (--compare).
For sample execution commands, refer to the examples section.
Prerequisites
Java 8+
Spark event log(s) from Spark 2.0 or above. Both rolled and compressed event logs with .lz4, .lzf, .snappy and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.
The tool requires the Spark 3.x+ jars to be able to run but it doesn't need an Apache Spark runtime. If you don't already have Spark 3.x+ installed, you can download the Apache Spark Distribution to any machine and include the jars in the classpath.
This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are one or more event log files, or directories containing Spark event logs, in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. If you want to point to the local filesystem, be sure to include the prefix file: in the path. If any input is a remote file path or directory path, the connector dependencies need to be on the classpath.
For HDFS, include $HADOOP_CONF_DIR in the classpath. Sample showing Java's classpath:
-cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/
For GCS, download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.
For S3, download the jars that match your Hadoop version:
hadoop-aws-<version>.jar
aws-java-sdk-<version>.jar
In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below inside:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>
You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit.
Refer to the Hadoop-AWS doc for more options for integrating the Hadoop-AWS module with S3.
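One way to run the check described above is a small wrapper around `spark-shell`. This is only a sketch under assumptions: the function name `s3_spark_shell` and the `JARS_DIR` variable are hypothetical, and the jar names keep the `<version>` placeholder, which needs to be replaced with the versions that match your Hadoop build.

```shell
# Hypothetical directory holding the downloaded connector jars.
JARS_DIR="${JARS_DIR:-$HOME/jars}"

# Start spark-shell with the S3 connector jars attached.
# --jars takes a comma-separated list of jar paths; any extra
# arguments are passed through to spark-shell unchanged.
s3_spark_shell() {
  spark-shell \
    --jars "${JARS_DIR}/hadoop-aws-<version>.jar,${JARS_DIR}/aws-java-sdk-<version>.jar" \
    "$@"
}

# Example call:
# s3_spark_shell
```

If the shell starts and can read a path under s3a://, the jars and keys are most likely being picked up.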
For ABFS, download the jar that matches your Hadoop version: hadoop-azure-<version>.jar.
The simplest authentication mechanism is to use account-name and account-key.
Refer to the Hadoop-ABFS support doc for more options for integrating the Hadoop-ABFS module with ABFS.
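As an illustration of account-name and account-key authentication, the sketch below writes a minimal core-site.xml that sets the Hadoop ABFS property `fs.azure.account.key.<account>.dfs.core.windows.net`. The function name `write_abfs_core_site` and its arguments are hypothetical, and it is an assumption here that the target directory is one you later add to the Java classpath so that Hadoop loads the file.

```shell
# Write a minimal core-site.xml holding an ABFS account key.
#   $1  storage account name
#   $2  storage account key
#   $3  directory to write core-site.xml into (created if missing)
# Note: this overwrites any core-site.xml already in that directory.
write_abfs_core_site() {
  account="$1"
  key="$2"
  conf_dir="$3"
  mkdir -p "$conf_dir" || return 1
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.azure.account.key.${account}.dfs.core.windows.net</name>
    <value>${key}</value>
  </property>
</configuration>
EOF
}

# Example call (placeholder values):
# write_abfs_core_site <account-name> <account-key> /path/to/conf
```

The key is stored in plain text, as with the S3 keys above, so restrict the file's permissions accordingly.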
Getting the Tools Jar
Download the latest release from the Maven repository.
Refer to the spark-rapids-user-tools GitHub releases page for details on release notes.
Check out the code repository
git clone git@github.com:NVIDIA/spark-rapids-tools.git
cd spark-rapids-tools/core
Build using Maven. After a successful build, the jar rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar will be in the target/ directory. Refer to the build doc for more information on build options (for example, the Spark version).
mvn clean package
Profiling Tool Options
Profiling tool for the RAPIDS Accelerator and Apache Spark
Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.profiling.ProfileMain [options]
       <eventlogs | eventlog directories ...>

  -a, --auto-tuner                Toggle AutoTuner module.
      --combined                  Collect mode but combine all applications into
                                  the same tables.
  -c, --compare                   Compare Applications (Note this may require
                                  more memory if comparing a large number of
                                  applications). Default is false.
      --csv                       Output each table to a CSV file as well
                                  creating the summary text file.
  -d, --driverlog <arg>           Specifies the name of a driver log file that
                                  the profiling tool is to process. The tool
                                  identifies any invalid operations in the log
                                  and writes them to a .csv file. When
                                  --driverlog is specified, the eventlog
                                  parameter is optional.
  -f, --filter-criteria <arg>     Filter newest or oldest N eventlogs based on
                                  application start timestamp for processing.
                                  Filesystem based filtering happens before
                                  application based filtering (see start-app-time).
                                  For example, 100-newest-filesystem (for
                                  processing newest 100 event logs). For example,
                                  100-oldest-filesystem (for processing oldest
                                  100 event logs).
  -g, --generate-dot              Generate query visualizations in DOT format.
                                  Default is false
      --generate-timeline         Write an SVG graph out for the full
                                  application timeline.
  -m, --match-event-logs <arg>    Filter event logs whose filenames contain the
                                  input string
  -n, --num-output-rows <arg>     Number of output rows for each Application.
                                  Default is 1000
      --num-threads <arg>         Number of thread to use for parallel
                                  processing. The default is the number of cores
                                  on host divided by 4.
  -o, --output-directory <arg>    Base output directory. Default is current
                                  directory for the default filesystem. The
                                  final output will go into a subdirectory
                                  called rapids_4_spark_profile. It will
                                  overwrite any existing files with the same
                                  name.
  -p, --print-plans               Print the SQL plans to a file named
                                  'planDescriptions.log'.
                                  Default is false.
  -s, --start-app-time <arg>      Filter event logs whose application start
                                  occurred within the past specified time
                                  period. Valid time periods are
                                  min(minute),h(hours),d(days),w(weeks),m(months).
                                  If a period isn't specified it defaults to
                                  days.
  -t, --timeout <arg>             Maximum time in seconds to wait for the event
                                  logs to be processed. Default is 24 hours
                                  (86400 seconds) and must be greater than 3
                                  seconds. If it times out, it will report what
                                  it was able to process up until the timeout.
  -w, --worker-info <arg>         File path containing the system information of
                                  a worker node. It's assumed that all workers
                                  are homogenous. It requires the AutoTuner to
                                  be enabled. Default is ./worker_info.yaml
  -h, --help                      Show help message

 trailing arguments:
  eventlog (optional)   Event log filenames (space separated) or directories
                        containing event logs. For example, s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2. At least one eventlog or a driver
                        log must be specified; thus an eventlog parameter is
                        required if the --driverlog option isn't specified.
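To show how the filtering options in the usage text combine, here is a minimal sketch of a wrapper function. The function name `profile_newest`, the `TOOLS_JAR` variable, and the example paths are hypothetical; the options and the class name come from the usage text. The bare value `7` for `--start-app-time` relies on the stated default of days when no period is given.

```shell
# Hypothetical location of the tools jar; substitute the real <version>.
TOOLS_JAR="${TOOLS_JAR:-rapids-4-spark-tools_2.12-<version>.jar}"

# Profile only the newest N event logs and also write CSV tables.
#   $1     number of newest event logs to keep (filesystem-based filter)
#   $2     base output directory
#   $3...  event log files or directories
profile_newest() {
  if [ "$#" -lt 3 ]; then
    echo "usage: profile_newest <count> <output-dir> <eventlog>..." >&2
    return 2
  fi
  count="$1"
  out_dir="$2"
  shift 2
  # JVM_HEAP falls back to -Xmx10g when it is unset.
  # The filesystem filter is applied first, then the start-time filter
  # (7 with no period suffix, which the help text says means days).
  java ${JVM_HEAP:--Xmx10g} \
    -cp "${TOOLS_JAR}:${SPARK_HOME}/jars/*" \
    com.nvidia.spark.rapids.tool.profiling.ProfileMain \
    --filter-criteria "${count}-newest-filesystem" \
    --start-app-time 7 \
    --csv \
    --output-directory "$out_dir" \
    "$@"
}

# Example call (placeholder paths):
# profile_newest 100 file:/tmp/profile-out /path/to/eventlog-dir
```

The classpath is quoted so that the shell leaves the `*` for Java to expand.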
Tuning Spark Properties For GPU Clusters
Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (for example, memory, cores, Spark default configurations) as well as information processed in the application event logs. The tool will also recommend settings for the application, assuming that the job will be able to use all the cluster resources (CPU and GPU) when it's running. The values loaded from the app logs have higher precedence than the default configs.
Auto-Tuner limitations:
It’s assumed that all the worker nodes on the cluster are homogenous.
To run the Auto-Tuner, enable the --auto-tuner flag and optionally pass a valid --worker-info <FILE_PATH>. The Auto-Tuner needs to learn the system properties of the worker nodes that run application code in the cluster. The argument FILE_PATH can be either a local or a remote file (for example, on HDFS).
If the --worker-info argument isn't supplied, then the Auto-Tuner will only recommend tuned settings based on the job event log and not on any cluster or worker information, since that isn't available.
Template of the worker information file in “yaml” format
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.driver.maxResultSize: 7680m
  spark.driver.memory: 15360m
  spark.executor.cores: '8'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
  spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m
Property | Optional | If Missing |
---|---|---|
system.numCores | No | Auto-Tuner doesn’t calculate recommendations |
system.memory | No | Auto-Tuner doesn’t calculate any recommendations |
system.numWorkers | Yes | Default: 1 |
gpu.name | Yes | Default: T4 (Nvidia Tesla T4) |
gpu.memory | Yes | Default: 16G |
softwareProperties | Yes | This section is optional. The Auto-Tuner reads the configs within the logs of the Apache Spark apps with higher precedence |
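Putting the Auto-Tuner flags together, a run might look like the sketch below. The function name `profile_with_autotuner`, the `TOOLS_JAR` variable, and the example paths are hypothetical; `--auto-tuner`, `--worker-info`, and the class name are taken from the usage text.

```shell
# Hypothetical location of the tools jar; substitute the real <version>.
TOOLS_JAR="${TOOLS_JAR:-rapids-4-spark-tools_2.12-<version>.jar}"

# Run the Profiling tool with the Auto-Tuner enabled.
#   $1     path to the worker information YAML file (local or remote)
#   $2...  event log files or directories
profile_with_autotuner() {
  if [ "$#" -lt 2 ]; then
    echo "usage: profile_with_autotuner <worker-info> <eventlog>..." >&2
    return 2
  fi
  worker_info="$1"
  shift
  # JVM_HEAP falls back to -Xmx10g when it is unset.
  java ${JVM_HEAP:--Xmx10g} \
    -cp "${TOOLS_JAR}:${SPARK_HOME}/jars/*" \
    com.nvidia.spark.rapids.tool.profiling.ProfileMain \
    --auto-tuner \
    --worker-info "$worker_info" \
    "$@"
}

# Example call (placeholder paths):
# profile_with_autotuner ./worker_info.yaml /path/to/eventlog
```

Leaving out `--worker-info` is also valid according to the text above; the recommendations are then based on the event log alone.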
Processing Spark Event Logs
The tool reads the log files and processes them in memory, so the heap memory should be increased when processing a large volume of events. It's recommended to pass the VM option -Xmx10g and adjust it according to the number of apps and the size of the logs being processed.
export JVM_HEAP=-Xmx10g
Examples running the tool on the following environments
Extract the Spark distribution into a local directory if necessary.
Either set SPARK_HOME to point to that directory or just put the path inside of the classpath (java -cp toolsJar:$SPARK_HOME/jars/*:...) when you run the Profiling tool.
java ${JVM_HEAP} \
-cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
<eventlogs | eventlog directories ...>

java ${JVM_HEAP} \
-cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
com.nvidia.spark.rapids.tool.profiling.ProfileMain \
/usr/logs/app-name1
Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster the default filesystem is likely HDFS for both the input and output, so if you want to point to the local filesystem be sure to include file: in the path.
java ${JVM_HEAP} \
-cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.profiling.ProfileMain /eventlogDir
Processing Driver Logs
The Profiling tool can process a GPU driver log as well as CPU and GPU event logs. When the Profiling tool processes a driver log, it generates a .csv file that lists unsupported operators.
You inform the Profiling tool of a GPU driver log with the command line option --driverlog. The option has one required argument, specifying the pathname of a driver log file. You may specify only one driver log file per run.
A single run of the Profiling tool may process CPU/GPU event logs, a GPU driver log, or both.
Refer to the Processing Spark Event Logs section for instructions on accessing driver logs on remote and local filesystems.
Example running the tool on a driver log
java ${JVM_HEAP} \
-cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
com.nvidia.spark.rapids.tool.profiling.ProfileMain \
--driverlog /path_to_driverlog \
/eventlog
Collection Modes
Examples of running the Profiling tool with different collection modes:
Collection mode (default):
java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
<eventlogs | eventlog directories ...>
Combined mode:
java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
com.nvidia.spark.rapids.tool.profiling.ProfileMain --combined \
<eventlogs | eventlog directories ...>
Compare mode:
java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
com.nvidia.spark.rapids.tool.profiling.ProfileMain --compare \
<eventlogs | eventlog directories ...>