Profiling Tool - Jar Usage#

The Profiling tool can be run as a standalone Java command on Spark event logs for users who aren’t using the CLI tool. For sample execution commands, refer to the examples section.

Setting Up Environment#

Prerequisites#

Java 8+
Spark event log(s) from Spark 2.0 or later. Both rolled and compressed event logs with .lz4, .lzf, .snappy, and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.
The tool requires the Spark [3.x, 4.0] jars to be able to run but it doesn’t need an Apache Spark runtime. If you don’t already have Spark [3.x, 4.0] installed, you can download the Apache Spark Distribution to any machine and include the jars in the classpath.
This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files, multiple event log files, or directories containing Spark event logs in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. To point to the local filesystem, be sure to include the file: prefix in the path. If any input is a remote file path or directory path, the connector dependencies need to be on the classpath.
On-prem HDFS
Include $HADOOP_CONF_DIR in classpath

Sample showing Java’s classpath#

-cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/
GCS

Download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.

S3
Download the matched jars based on the Hadoop version

hadoop-aws-<version>.jar

aws-java-sdk-<version>.jar

In $SPARK_HOME/conf, create hdfs-site.xml with the following AWS S3 keys:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>

You can test your configuration by passing the above jars to the --jars option of spark-shell or spark-submit.

Refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.
ABFS
Download the matched jar based on the Hadoop version hadoop-azure-<version>.jar.

The simplest authentication mechanism is to use account-name and account-key. Refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS.
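
The classpath entries described in this section can be assembled with a small shell helper. The sketch below is only an illustration: `build_classpath` is a hypothetical function name (not part of the tool), `/opt/spark` stands in for your `$SPARK_HOME`, and the jar names keep the `<version>` placeholders used above. The `jars/*` wildcard stays inside quotes so that the JVM, rather than the shell, expands it.

```shell
# Hypothetical helper: join the tools jar, the Spark jars wildcard, and any
# extra entries (connector jars or $HADOOP_CONF_DIR) with ':'.
build_classpath() {
  tools_jar=$1
  spark_home=$2
  shift 2
  classpath="${tools_jar}:${spark_home}/jars/*"
  for extra in "$@"; do
    classpath="${classpath}:${extra}"
  done
  printf '%s\n' "$classpath"
}

# Example: S3 inputs need the hadoop-aws and aws-java-sdk jars on the classpath.
build_classpath 'rapids-4-spark-tools_2.12-<version>.jar' /opt/spark \
  'hadoop-aws-<version>.jar' 'aws-java-sdk-<version>.jar'
```

The printed value can then be passed to `java -cp`.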

Getting the Tools Jar#

Direct link

Download the latest release from the Maven repository
Refer to the spark-rapids-user-tools GitHub releases page for details on release notes.

Build from source

Check out the code repository

git clone git@github.com:NVIDIA/cudf-spark-tools.git
cd cudf-spark-tools/core

Build using Maven. After a successful build, the rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar file will be in the target/ directory. Refer to the build doc for more information on build options (such as the Spark version).
```
mvn clean package
```

Running Tools Jar#

Profiling Tool Options#

Profiling tool for the cuDF for Apache Spark and Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.profiling.ProfileMain [options]
       <eventlogs | eventlog directories ...>

 -a, --auto-tuner                 Toggle AutoTuner module.
     --target-cluster-info  <arg> File path to YAML containing target cluster
                                  information including worker instance type
                                  and system properties. Provides platform-aware
                                  cluster configuration. Requires AutoTuner to
                                  be enabled.
     --tuning-configs  <arg>      File path to YAML containing custom tuning
                                  configuration parameters. Allows overriding
                                  default AutoTuner constants. Requires
                                  AutoTuner to be enabled.
      --csv                       Output each table to a CSV file as well as
                                  creating the summary text file.
  -d, --driverlog <arg>           Specifies the name of a driver log file that
                                  the profiling tool is to process. The tool
                                  identifies any invalid operations in the log
                                  and writes them to a .csv file. When
                                  --driverlog is specified, the eventlog
                                  parameter is optional.
  -f, --filter-criteria  <arg>    Filter newest or oldest N eventlogs based on
                                  application start timestamp for processing.
                                  Filesystem based filtering happens before
                                  application based filtering (see start-app-time).
                                  For example, 100-newest-filesystem (for processing
                                  the newest 100 event logs) or 100-oldest-filesystem
                                  (for processing the oldest 100 event logs).
  -g, --generate-dot              Generate query visualizations in DOT format.
                                  Default is false
      --generate-timeline         Write an SVG graph out for the full
                                  application timeline.
  -m, --match-event-logs  <arg>   Filter event logs whose filenames contain the
                                  input string
  -n, --num-output-rows  <arg>    Number of output rows for each Application.
                                  Default is 1000
      --num-threads  <arg>        Number of threads to use for parallel
                                  processing. The default is the number of cores
                                  on host divided by 4.
  -o, --output-directory  <arg>   Base output directory. Default is current
                                  directory for the default filesystem. The
                                  final output will go into a subdirectory
                                  called rapids_4_spark_profile. It will
                                  overwrite any existing files with the same
                                  name.
  -p, --print-plans               Print the SQL plans to a file named
                                  'planDescriptions.log'.
                                  Default is false.
  -s, --start-app-time  <arg>     Filter event logs whose application start
                                  occurred within the past specified time
                                  period. Valid time periods are
                                  min(minute),h(hours),d(days),w(weeks),m(months).
                                  If a period isn't specified it defaults to
                                  days.
 -t, --timeout  <arg>             Maximum time in seconds to wait for the event
                                  logs to be processed. Default is 24 hours
                                  (86400 seconds) and must be greater than 3
                                  seconds. If it times out, it will report what
                                  it was able to process up until the timeout.
 -h, --help                       Show help message

 trailing arguments:
  eventlog (optional)   Event log filenames (space separated) or directories
                        containing event logs. For example, s3a://<BUCKET>/eventlog1
                        /path/to/eventlog2. At least one eventlog or a driver
                        log must be specified; thus an eventlog parameter is
                        required if the --driverlog option isn't specified.
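
As a rough illustration of how several of these options combine, the sketch below prints (but does not run) a ProfileMain command. `profile_cmd` is a hypothetical helper, and `/opt/spark`, the output directory, and the event log directory are placeholder paths. The value `7d` assumes that `--start-app-time` takes a number followed by one of the unit suffixes listed above.

```shell
# Sketch: print a ProfileMain command line without executing it.
# TOOLS_JAR keeps the <version> placeholder; /opt/spark is a stand-in path.
TOOLS_JAR='rapids-4-spark-tools_2.12-<version>.jar'
SPARK_HOME=${SPARK_HOME:-/opt/spark}

profile_cmd() {
  printf 'java -cp %s com.nvidia.spark.rapids.tool.profiling.ProfileMain' \
    "${TOOLS_JAR}:${SPARK_HOME}/jars/*"
  # Append every option and event log path, separated by spaces.
  printf ' %s' "$@"
  printf '\n'
}

# Keep the 10 newest event logs on the filesystem, then keep only the
# applications started within the past 7 days; also write CSV tables.
profile_cmd --csv \
  --filter-criteria 10-newest-filesystem \
  --start-app-time 7d \
  --output-directory file:/tmp/profile-out \
  file:/usr/logs/eventlogs
```

Printing the command first makes it easy to review the options before launching a long-running profiling job.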

Tuning Spark Properties For GPU Clusters#

Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (for example, memory, cores, Spark default configurations) as well as information processed in the application event logs. The tool will recommend settings for the application assuming that the job will be able to use all the cluster resources (CPU and GPU) when it’s running. The values loaded from the app logs have higher precedence than the default configs.

The recommendations span several categories, ordered from most to least impactful:

RAPIDS plugin & GPU resources (required for GPU execution): spark.plugins (must include com.nvidia.spark.SQLPlugin), spark.rapids.sql.enabled, spark.executor.resource.gpu.amount, spark.task.resource.gpu.amount, and spark.shuffle.manager (RAPIDS Shuffle Manager).
Executor sizing: spark.executor.cores, spark.executor.instances, spark.executor.memory, spark.executor.memoryOverhead.
GPU runtime: spark.rapids.sql.concurrentGpuTasks, spark.rapids.memory.pinnedPool.size, spark.rapids.sql.batchSizeBytes.
Shuffle and AQE: spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes, spark.sql.adaptive.advisoryPartitionSizeInBytes, spark.sql.adaptive.coalescePartitions.parallelismFirst.
Dynamic allocation: spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors, and spark.dynamicAllocation.maxExecutors, sized against the CPU-to-GPU core ratio. When recommended, the Auto-Tuner enforces minExecutors <= initialExecutors <= maxExecutors.
Platform-specific plugins: additional recommendations may be emitted for EMR (JVM options that disable Transparent Huge Pages) and Delta Lake (GPU-accelerated Delta write via spark.rapids.sql.format.delta.write.enabled, plus version-compatibility and support comments).

The Auto-Tuner also tunes secondary properties such as Kryo serialization settings, multi-threaded reader/writer threads, RAPIDS file cache, data locality wait, and platform compatibility flags where applicable.
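
The ordering constraint on the dynamic allocation properties can be checked with a few lines of shell before applying recommended or hand-edited values. This is a sanity check under stated assumptions, not the Auto-Tuner's own logic: `check_dynamic_allocation` is a hypothetical helper, and the numbers are made up for illustration.

```shell
# Succeeds when minExecutors <= initialExecutors <= maxExecutors.
check_dynamic_allocation() {
  min_executors=$1
  initial_executors=$2
  max_executors=$3
  [ "$min_executors" -le "$initial_executors" ] &&
    [ "$initial_executors" -le "$max_executors" ]
}

# Hypothetical values for illustration.
if check_dynamic_allocation 2 4 8; then
  echo "dynamic allocation bounds are consistent"
else
  echo "dynamic allocation bounds are inconsistent" >&2
fi
```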

Note

Auto-Tuner limitations:

It’s assumed that all the worker nodes on the cluster are homogeneous.

To run the Auto-Tuner, pass the --auto-tuner flag. Optionally, provide target cluster information using --target-cluster-info <FILE_PATH> to specify the GPU worker node configuration for generating optimized recommendations. The file path can be local or remote (for example, HDFS).

If the --target-cluster-info argument isn’t supplied, the Auto-Tuner will use platform-specific default worker instance types for tuning recommendations. See AutoTuner Configuration for details on default instance types, supported platforms, and how to customize AutoTuner behavior.
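
The two ways of enabling the Auto-Tuner described above can be sketched as a small option builder. `autotuner_options` is a hypothetical helper, and the YAML path is a placeholder. The contents of that file are not shown here because they depend on your target cluster.

```shell
# Hypothetical helper: print the Auto-Tuner options for a ProfileMain run.
# $1 (optional) is the path to a target cluster YAML file, local or remote.
autotuner_options() {
  if [ -n "${1:-}" ]; then
    # --target-cluster-info requires the Auto-Tuner to be enabled as well.
    printf '%s\n' "--auto-tuner --target-cluster-info $1"
  else
    # Without a file, the tool uses platform-specific default instance types.
    printf '%s\n' "--auto-tuner"
  fi
}

autotuner_options
autotuner_options hdfs:///configs/target-cluster.yaml
```

These options go before the event log arguments of the `java ... ProfileMain` command.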

Processing Spark Event Logs#

The tool reads the log files and processes them in memory, so the heap size should be increased when processing a large volume of events. It’s recommended to pass the VM option -Xmx10g and adjust it according to the number of apps and the size of the logs being processed.
```
export JVM_HEAP=-Xmx10g
```

Examples running the tool on the following environments

Local Filesystem

Extract the Spark distribution into a local directory if necessary.
Either set SPARK_HOME to point to that directory or put the path directly in the classpath (java -cp toolsJar:$SPARK_HOME/jars/*:...) when you run the Profiling tool.

java ${JVM_HEAP} \
     -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
     <eventlogs | eventlog directories ...>

java ${JVM_HEAP} \
     -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain \
     /usr/logs/app-name1

On-prem HDFS

Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster, the default filesystem is likely HDFS for both the input and the output, so if you want to point to the local filesystem be sure to include file: in the path.

java ${JVM_HEAP} \
     -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain  /eventlogDir
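
Because the default filesystem may not be the local one, a wrapper script can add the file: prefix to local paths explicitly. The sketch below assumes a hypothetical helper named `to_eventlog_uri`. It leaves any path that already has a scheme (for example, `s3a://`) untouched and resolves relative paths against the current directory.

```shell
# Hypothetical helper: make local paths explicit with the "file:" prefix.
to_eventlog_uri() {
  case "$1" in
    *://*|file:*) printf '%s\n' "$1" ;;              # already has a scheme
    /*)           printf 'file:%s\n' "$1" ;;         # absolute local path
    *)            printf 'file:%s/%s\n' "$PWD" "$1" ;;  # relative local path
  esac
}

to_eventlog_uri /usr/logs/app-name1
to_eventlog_uri 's3a://<BUCKET>/eventlog1'
```

The same helper can be applied to the value passed to --output-directory when the report should be written to the local disk.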

Processing Driver Logs#

The Profiling tool can process a GPU driver log as well as CPU and GPU event logs. When the Profiling tool processes a driver log, it generates a .csv file that lists unsupported operators.

Specify a GPU driver log with the command-line option --driverlog. The option has one required argument: the pathname of a driver log file. You may specify only one driver log file per run.

A single run of the Profiling tool may process CPU/GPU event logs, a GPU driver log, or both.

Refer to the Processing Spark Event Logs section for instructions on accessing a driver log on remote and local filesystems.

Example running the tool on a driver log#

java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.profiling.ProfileMain  \
      --driverlog /path_to_driverlog \
      /eventlog
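
A wrapper script can check the input rule (at least one event log or a driver log) before starting the JVM. The sketch below is an illustration only. `has_required_input` is a hypothetical helper and does not reflect how the tool itself validates its arguments.

```shell
# Succeeds when a driver log, at least one event log, or both are supplied.
# $1 is the driver log path (may be empty); remaining arguments are event logs.
has_required_input() {
  driverlog=$1
  shift
  [ -n "$driverlog" ] || [ "$#" -gt 0 ]
}

# Driver log only: valid, because the eventlog argument becomes optional.
has_required_input /path_to_driverlog && echo "ok: driver log only"
# Neither input: invalid.
has_required_input "" || echo "error: pass an event log or --driverlog" >&2
```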

Java CMD Samples#

Example running the Profiling tool:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
     com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
     <eventlogs | eventlog directories ...>