spark-rapids/user-guide/24.02/partials/tools-jar-usage-prereqs.html
Java 8+
Spark event log(s) from Spark 2.0 or above. Both rolled and compressed event logs with .lz4, .lzf, .snappy, and .zstd suffixes are supported, as well as Databricks-specific rolled and compressed (.gz) event logs.

The tool requires the Spark 3.x+ jars to run, but it does not need an Apache Spark runtime. If you do not already have Spark 3.x+ installed, you can download the Apache Spark distribution to any machine and include its jars in the classpath.
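As a minimal sketch, once the tools jar and the Spark jars are on the classpath, the tool can be launched directly with java. The jar version and event log path below are placeholders, and the main class assumes the qualification tool's entry point:

    # Run the qualification tool against a local event log
    # (versions and paths are examples; adjust to your environment).
    java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      file:/path/to/eventlog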
This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are individual event log files, multiple event log files, or directories containing Spark event logs, located in the local filesystem, HDFS, S3, ABFS, GCS, or a mix of these. If you want to point to the local filesystem, be sure to include the file: prefix in the path. If any input is a remote file or directory path, the corresponding connector dependencies need to be on the classpath.

For HDFS, include $HADOOP_CONF_DIR in the classpath:

-cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/
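For example, a run against an event log stored in HDFS might look like the following sketch; the namenode and path are placeholders, and the main class is the same assumed qualification entry point as above:

    # $HADOOP_CONF_DIR must be on the classpath so the Hadoop client
    # can pick up the cluster configuration for hdfs:// paths.
    java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      hdfs://<namenode>/path/to/eventlog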
For GCS, download the gcs-connector-hadoop3-<version>-shaded.jar and follow the instructions to configure Hadoop/Spark.
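As an illustration, the shaded GCS connector jar can be appended to the same classpath; the connector version and bucket are placeholders, and any additional Hadoop/Spark configuration follows the connector's own instructions:

    # Include the shaded GCS connector on the classpath to read gs:// paths
    # (connector version and bucket are examples).
    java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:gcs-connector-hadoop3-<version>-shaded.jar \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      gs://<bucket>/path/to/eventlog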
For S3, download the jars matching your Hadoop version:

hadoop-aws-<version>.jar
aws-java-sdk-<version>.jar
In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxx</value>
  </property>
</configuration>
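With the credentials in place, a run against an event log in S3 might look like this sketch; the jar versions and bucket path are placeholders:

    # Read an event log from S3; the hadoop-aws and aws-java-sdk jars
    # must match your Hadoop version.
    java -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:hadoop-aws-<version>.jar:aws-java-sdk-<version>.jar \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      s3a://<bucket>/path/to/eventlog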
You can test your configuration by including the above jars in the --jars option to spark-shell or spark-submit. Please refer to the Hadoop-AWS documentation for more options on integrating the Hadoop-AWS module with S3.
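For instance, a quick smoke test might look like the following; the jar names and bucket are placeholders:

    # Launch a shell with the S3A connector jars and try reading a file
    # (jar versions and bucket are examples).
    spark-shell --jars hadoop-aws-<version>.jar,aws-java-sdk-<version>.jar

    scala> spark.read.textFile("s3a://<bucket>/path/to/eventlog").count()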
For ABFS, download the jar matching your Hadoop version:

hadoop-azure-<version>.jar

The simplest authentication mechanism is to use an account name and account key. Please refer to the Hadoop-ABFS support documentation for more options on integrating the Hadoop-ABFS module with ABFS.
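As a sketch of account-key authentication, the key can be supplied through the Hadoop configuration, mirroring the S3 setup above; the account name is a placeholder, and whether you put the property in hdfs-site.xml or core-site.xml is an assumption to adapt to your environment:

    <?xml version="1.0"?>
    <configuration>
      <!-- Account-key (Shared Key) authentication for ABFS; replace
           <account-name> and the value with your own account and key. -->
      <property>
        <name>fs.azure.account.key.<account-name>.dfs.core.windows.net</name>
        <value>xxx</value>
      </property>
    </configuration>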