Compiling the tools jar


See instructions here: https://github.com/NVIDIA/spark-rapids-tools/tree/main/core#build
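
For reference, a typical build run looks like the sketch below. The exact Maven invocation and the location of the produced jar are assumptions here; follow the linked instructions for the authoritative steps.

    # Sketch: clone the repository and build the tools jar with Maven
    # (the artifact is typically produced under core/target/).
    git clone https://github.com/NVIDIA/spark-rapids-tools.git
    cd spark-rapids-tools/core
    mvn clean package -DskipTests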

If any input is an S3 file path or directory path, two extra steps are needed to access S3 from Spark:

  1. Download the jars matching your Hadoop version:

    • hadoop-aws-<version>.jar

    • aws-java-sdk-<version>.jar

     Taking Hadoop 2.7.4 as an example, download and include the following jars in the '--jars' option to spark-shell or spark-submit: hadoop-aws-2.7.4.jar and aws-java-sdk-1.7.4.jar (a sketch of the full command follows the configuration below).

  2. In $SPARK_HOME/conf, create hdfs-site.xml with the AWS S3 keys below:


    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>xxx</value>
      </property>
    </configuration>
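
Putting the two steps together, a launch might look like the following sketch. The tools jar name, the main class (assumed here to be the qualification tool's entry point), and the bucket and event log path are placeholders; substitute the values from your own build and environment.

    # Sketch: run the qualification tool against an S3 event log, with the
    # Hadoop 2.7.4 S3A jars from step 1 on the classpath. The jar file name,
    # main class, and s3a:// path are placeholders -- verify against your build.
    spark-submit \
      --jars hadoop-aws-2.7.4.jar,aws-java-sdk-1.7.4.jar \
      --class com.nvidia.spark.rapids.tool.qualification.QualificationMain \
      rapids-4-spark-tools_2.12-<version>.jar \
      s3a://your-bucket/spark-eventlogs/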

Please refer to the Hadoop documentation on the hadoop-aws module for more options for integrating with S3.
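
As an alternative to editing hdfs-site.xml, Spark forwards any configuration set with the spark.hadoop. prefix to the Hadoop configuration, so the same keys can be supplied on the command line. A minimal sketch, with placeholder credentials and application arguments:

    # Sketch: pass S3A credentials via --conf instead of hdfs-site.xml.
    # spark.hadoop.* options are forwarded to the Hadoop configuration.
    spark-submit \
      --jars hadoop-aws-2.7.4.jar,aws-java-sdk-1.7.4.jar \
      --conf spark.hadoop.fs.s3a.access.key=xxx \
      --conf spark.hadoop.fs.s3a.secret.key=xxx \
      <your application and arguments>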
