RAPIDS Accelerator for Apache Spark - User Guide (24.08.01)

Profiling Tool - Jar Usage

If you aren’t using the CLI tool, the Profiling tool can be run as a Java command in three different ways.

The Profiling tool has 3 modes of operation: collection (the default), combined (--combined), and compare (--compare).

For sample execution commands, refer to the examples section.

Prerequisites

  • Java 8+

  • Spark event log(s) from Spark 2.0 or later. Both rolled and compressed event logs with .lz4, .lzf, .snappy and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.

  • The tool requires the Spark 3.x+ jars to be able to run but it doesn’t need an Apache Spark runtime. If you don’t already have Spark 3.x+ installed, you can download the Apache Spark Distribution to any machine and include the jars in the classpath.

  • This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files, multiple event log files, or directories containing Spark event logs, located in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. To point to the local filesystem, include the prefix file: in the path. If any input is a remote file path or directory path, the connector dependencies need to be on the classpath:

    For HDFS, include $HADOOP_CONF_DIR in the classpath.

    Sample showing Java’s classpath:

    -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/


    For GCS, download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.

    For S3, download the matched jars based on the Hadoop version:

    • hadoop-aws-<version>.jar

    • aws-java-sdk-<version>.jar

    In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>xxx</value>
      </property>
    </configuration>

    You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit.

    Refer to the Hadoop-AWS doc for more options on integrating the Hadoop-AWS module with S3.

    • For ABFS, download the matched jar based on the Hadoop version: hadoop-azure-<version>.jar.

    • The simplest authentication mechanism is to use account-name and account-key. Refer to the Hadoop-ABFS support doc for more options on integrating the Hadoop-ABFS module with ABFS.
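The classpath requirement for remote inputs can be sketched in shell as follows. This is a minimal illustration, not the tool's official setup: the default paths and the CONNECTOR_DIR variable (a directory holding the connector jars listed above) are assumptions made for the example. The script only assembles and prints the -cp argument.

```shell
# Assumed default locations; override them through the environment.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
TOOLS_JAR="${TOOLS_JAR:-$HOME/rapids-4-spark-tools_2.12-<version>.jar}"
# Hypothetical directory containing e.g. hadoop-aws-<version>.jar and aws-java-sdk-<version>.jar
CONNECTOR_DIR="${CONNECTOR_DIR:-$HOME/connectors}"

# Keep the value quoted: the JVM expands the dir/* wildcard itself,
# so the shell must not glob it.
TOOLS_CP="${TOOLS_JAR}:${SPARK_HOME}/jars/*:${CONNECTOR_DIR}/*:${HADOOP_CONF_DIR}/"

printf '%s\n' "-cp ${TOOLS_CP}"
```

Keeping the connector jars in a separate directory makes it easy to swap them when the Hadoop version changes, without touching the Spark jars.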


Getting the Tools Jar

  • Check out the code repository

    git clone git@github.com:NVIDIA/spark-rapids-tools.git
    cd spark-rapids-tools/core

  • Build using Maven. After a successful build, the rapids-4-spark-tools_2.12-<version>-SNAPSHOT.jar file will be in the target/ directory. Refer to the build doc for more information on build options (for example, the Spark version).

    mvn clean package

Profiling Tool Options

Profiling tool for the RAPIDS Accelerator and Apache Spark

Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
       com.nvidia.spark.rapids.tool.profiling.ProfileMain [options]
       <eventlogs | eventlog directories ...>

  -a, --auto-tuner              Toggle AutoTuner module.
      --combined                Collect mode but combine all applications into
                                the same tables.
  -c, --compare                 Compare Applications (Note this may require
                                more memory if comparing a large number of
                                applications). Default is false.
      --csv                     Output each table to a CSV file as well
                                creating the summary text file.
  -d, --driverlog <arg>         Specifies the name of a driver log file that the
                                profiling tool is to process. The tool
                                identifies any invalid operations in the log and
                                writes them to a .csv file. When --driverlog is
                                specified, the eventlog parameter is optional.
  -f, --filter-criteria <arg>   Filter newest or oldest N eventlogs based on
                                application start timestamp for processing.
                                Filesystem based filtering happens before
                                application based filtering (see
                                start-app-time).
                                For example, 100-newest-filesystem (for
                                processing newest 100 event logs).
                                For example, 100-oldest-filesystem (for
                                processing oldest 100 event logs).
  -g, --generate-dot            Generate query visualizations in DOT format.
                                Default is false
      --generate-timeline       Write an SVG graph out for the full application
                                timeline.
  -m, --match-event-logs <arg>  Filter event logs whose filenames contain the
                                input string
  -n, --num-output-rows <arg>   Number of output rows for each Application.
                                Default is 1000
      --num-threads <arg>       Number of thread to use for parallel processing.
                                The default is the number of cores on host
                                divided by 4.
  -o, --output-directory <arg>  Base output directory. Default is current
                                directory for the default filesystem. The final
                                output will go into a subdirectory called
                                rapids_4_spark_profile. It will overwrite any
                                existing files with the same name.
  -p, --print-plans             Print the SQL plans to a file named
                                'planDescriptions.log'. Default is false.
  -s, --start-app-time <arg>    Filter event logs whose application start
                                occurred within the past specified time period.
                                Valid time periods are
                                min(minute),h(hours),d(days),w(weeks),m(months).
                                If a period isn't specified it defaults to days.
  -t, --timeout <arg>           Maximum time in seconds to wait for the event
                                logs to be processed. Default is 24 hours
                                (86400 seconds) and must be greater than 3
                                seconds. If it times out, it will report what it
                                was able to process up until the timeout.
  -w, --worker-info <arg>       File path containing the system information of
                                a worker node. It's assumed that all workers are
                                homogenous. It requires the AutoTuner to be
                                enabled. Default is ./worker_info.yaml
  -h, --help                    Show help message

 trailing arguments:
  eventlog (optional)   Event log filenames (space separated) or directories
                        containing event logs. For example,
                        s3a://<BUCKET>/eventlog1 /path/to/eventlog2. At least
                        one eventlog or a driver log must be specified; thus an
                        eventlog parameter is required if the --driverlog
                        option isn't specified.
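As an illustration of how the filtering options combine, the sketch below assembles a command that keeps the 10 newest event logs by filesystem, then keeps only applications started within the past 7 days, and writes CSV output. The input directory, the output directory and the default SPARK_HOME are hypothetical values chosen for the example. The script prints the command rather than running it.

```shell
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed default install location
TOOLS_JAR="rapids-4-spark-tools_2.12-<version>.jar"

# Build the argument list in the positional parameters so every
# argument stays intact (no word splitting, no globbing of jars/*).
set -- java -cp "${TOOLS_JAR}:${SPARK_HOME}/jars/*" \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --csv \
  --filter-criteria 10-newest-filesystem \
  --start-app-time 7d \
  --output-directory /tmp/profiling_output \
  file:/path/to/eventlog_dir

PROFILE_CMD="$*"
printf '%s\n' "$PROFILE_CMD"
# To execute instead of printing, run: "$@"
```

Because filesystem-based filtering is applied first, the 7-day window is evaluated only against the 10 logs that survive the first filter.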

Tuning Spark Properties For GPU Clusters

Currently, the Auto-Tuner calculates a set of configurations that impact the performance of Apache Spark apps executing on GPU. Those calculations can leverage cluster information (for example, memory, cores, Spark default configurations) as well as information processed in the application event logs. The tool will also recommend settings for the application, assuming that the job will be able to use all the cluster resources (CPU and GPU) when it’s running. The values loaded from the app logs take precedence over the default configs.

Note

Auto-Tuner limitations:

  • It’s assumed that all the worker nodes on the cluster are homogenous.

To run the Auto-Tuner, enable the --auto-tuner flag and optionally pass a valid --worker-info <FILE_PATH>. The Auto-Tuner needs to learn the system properties of the worker nodes that run application code in the cluster. The FILE_PATH argument can be either a local or a remote file (for example, on HDFS).

If the --worker-info argument isn’t supplied, then the Auto-Tuner will only recommend tuned settings based on the job event log and not on any cluster or worker information since that isn’t available.

Template of the worker information file in YAML format:

system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
softwareProperties:
  spark.driver.maxResultSize: 7680m
  spark.driver.memory: 15360m
  spark.executor.cores: '8'
  spark.executor.instances: '2'
  spark.executor.memory: 47222m
  spark.executorEnv.OPENBLAS_NUM_THREADS: '1'
  spark.scheduler.mode: FAIR
  spark.sql.cbo.enabled: 'true'
  spark.ui.port: '0'
  spark.yarn.am.memory: 640m


Property             Optional   If Missing
system.numCores      No         Auto-Tuner doesn’t calculate any recommendations
system.memory        No         Auto-Tuner doesn’t calculate any recommendations
system.numWorkers    Yes        Default: 1
gpu.name             Yes        Default: T4 (NVIDIA Tesla T4)
gpu.memory           Yes        Default: 16G
softwareProperties   Yes        This section is optional. The Auto-Tuner reads the configs within the logs of the Apache Spark apps with higher precedence
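Putting the pieces together, a minimal sketch of an Auto-Tuner run might look like the following. It writes a worker information file containing only the system and gpu sections (softwareProperties is optional per the table above), then assembles a command with --auto-tuner and --worker-info. The scratch directory, the event log path and the default SPARK_HOME are assumptions for illustration. The command is printed, not executed.

```shell
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed default install location
WORK_DIR="$(mktemp -d)"                  # scratch directory for the example
WORKER_INFO="${WORK_DIR}/worker_info.yaml"

# Values copied from the template above; replace them with your workers' specs.
cat > "$WORKER_INFO" <<'EOF'
system:
  numCores: 32
  memory: 212992MiB
  numWorkers: 5
gpu:
  memory: 15109MiB
  count: 4
  name: T4
EOF

set -- java -cp "rapids-4-spark-tools_2.12-<version>.jar:${SPARK_HOME}/jars/*" \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --auto-tuner \
  --worker-info "$WORKER_INFO" \
  /path/to/eventlogs

AUTOTUNER_CMD="$*"
printf '%s\n' "$AUTOTUNER_CMD"
# To execute instead of printing, run: "$@"
```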

Processing Spark Event Logs

  1. The tool reads the log files and processes them in memory, so increase the heap size when processing a large volume of events. It’s recommended to pass the VM option -Xmx10g and adjust it according to the number of apps and the size of the logs being processed.

    export JVM_HEAP=-Xmx10g

  2. Examples of running the tool in different environments:

    • Extract the Spark distribution into a local directory if necessary.

    • Either set SPARK_HOME to point to that directory, or put the path directly in the classpath (java -cp toolsJar:$SPARK_HOME/jars/*:...) when you run the Profiling tool.

    java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
      <eventlogs | eventlog directories ...>

    java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.profiling.ProfileMain \
      /usr/logs/app-name1

    Example running on files in HDFS (include $HADOOP_CONF_DIR in the classpath). Note that on an HDFS cluster the default filesystem is likely HDFS for both the input and the output, so to point to the local filesystem be sure to include file: in the path.

    java ${JVM_HEAP} \
      -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.profiling.ProfileMain \
      /eventlogDir
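Since inputs may be mixed, a single run on an HDFS cluster can combine a local event log with an HDFS directory. The sketch below reuses the paths from the examples above and assumes default values for SPARK_HOME and HADOOP_CONF_DIR. It prints the assembled command instead of executing it.

```shell
JVM_HEAP="${JVM_HEAP:--Xmx10g}"                          # heap size suggested above
SPARK_HOME="${SPARK_HOME:-/opt/spark}"                   # assumed default
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"   # assumed default

# file:/... forces the local filesystem; the bare /eventlogDir path is
# resolved against the default filesystem (likely HDFS on an HDFS cluster).
set -- java "$JVM_HEAP" \
  -cp "rapids-4-spark-tools_2.12-<version>.jar:${SPARK_HOME}/jars/*:${HADOOP_CONF_DIR}/" \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  file:/usr/logs/app-name1 \
  /eventlogDir

MIXED_CMD="$*"
printf '%s\n' "$MIXED_CMD"
# To execute instead of printing, run: "$@"
```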

Processing Driver Logs

The Profiling tool can process a GPU driver log as well as CPU and GPU event logs. When the Profiling tool processes a driver log, it generates a .csv file that lists unsupported operators.

You point the Profiling tool at a GPU driver log with the command line option --driverlog. The option has one required argument, the pathname of a driver log file. You may specify only one driver log file per run.

A single run of the Profiling tool may process CPU/GPU event logs, a GPU driver log, or both.

Refer to the Processing Spark Event Logs section for instructions on accessing a driver log located on remote or local filesystems.

Example running the tool on a driver log

java ${JVM_HEAP} \
  -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --driverlog /path_to_driverlog \
  /eventlog
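Because the eventlog parameter is optional when --driverlog is given, a driver-log-only run simply omits the trailing event log argument. A minimal sketch, assuming a default SPARK_HOME and printing the command rather than running it:

```shell
JVM_HEAP="${JVM_HEAP:--Xmx10g}"          # heap size suggested earlier in this guide
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed default install location

# No trailing eventlog argument: only the driver log is processed.
set -- java "$JVM_HEAP" \
  -cp "rapids-4-spark-tools_2.12-<version>.jar:${SPARK_HOME}/jars/*" \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --driverlog /path_to_driverlog

DRIVERLOG_CMD="$*"
printf '%s\n' "$DRIVERLOG_CMD"
# To execute instead of printing, run: "$@"
```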

Collection Modes

Examples of running the Profiling tool in the different collection modes:

Collection mode (the default):

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
  <eventlogs | eventlog directories ...>

Combined mode:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain --combined \
  <eventlogs | eventlog directories ...>

Compare mode:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain --compare \
  <eventlogs | eventlog directories ...>
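Output always goes into a rapids_4_spark_profile subdirectory and overwrites files with the same name, so running several modes against the same logs calls for a separate --output-directory per mode. The following dry-run sketch prints one command per mode. The base output directory, the event log path and the default SPARK_HOME are hypothetical.

```shell
SPARK_HOME="${SPARK_HOME:-/opt/spark}"   # assumed default install location
CP="rapids-4-spark-tools_2.12-<version>.jar:${SPARK_HOME}/jars/*"
MAIN="com.nvidia.spark.rapids.tool.profiling.ProfileMain"
EVENTLOGS="/path/to/eventlogs"           # hypothetical input directory
OUT_BASE="/tmp/profiling"                # hypothetical base output directory

MODE_CMDS=""
for mode in collection combined compare; do
  case "$mode" in
    collection) flag="" ;;               # default mode needs no flag
    combined)   flag="--combined" ;;
    compare)    flag="--compare" ;;
  esac
  # ${flag:+$flag } expands to nothing when flag is empty.
  line="java -cp ${CP} ${MAIN} ${flag:+$flag }--output-directory ${OUT_BASE}/${mode} ${EVENTLOGS}"
  MODE_CMDS="${MODE_CMDS}${line}
"
done

# Dry run: print the three commands, one per line.
printf '%s' "$MODE_CMDS"
```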


© Copyright 2024, NVIDIA. Last updated on Aug 29, 2024.