User Guide (23.12.1)

No additional steps are required to run the tools in an on-premises environment, including standalone/local machines.

For HDFS access, however, the tools CLI depends on PyArrow, which relies on the following environment variables to bind with HDFS:

  • HADOOP_HOME: the root of your installed Hadoop distribution, which usually contains lib/native/libhdfs.so.

  • JAVA_HOME: the location of your Java SDK installation.

  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.

  • Add the Hadoop jars to your CLASSPATH (a consolidated example follows this list).

    On Linux and macOS:

      export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

    On Windows (cmd):

      for /f "delims=" %i in ('%HADOOP_HOME%\bin\hadoop classpath --glob') do set CLASSPATH=%i
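
Putting the variables together, a minimal sketch of a working setup follows. The install paths are hypothetical placeholders for your own environment, and the final line is an optional smoke test that asks PyArrow to connect to the default HDFS namenode and stat the root directory.

    # Hypothetical install locations; substitute your own paths.
    export HADOOP_HOME=/opt/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    # Only needed if libhdfs.so is not under $HADOOP_HOME/lib/native:
    # export ARROW_LIBHDFS_DIR=/usr/local/lib
    export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

    # Optional smoke test: prints FileInfo for "/" if PyArrow can bind to HDFS.
    python -c "from pyarrow import fs; print(fs.HadoopFileSystem('default').get_file_info('/'))"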

For more information on HDFS requirements, refer to the PyArrow HDFS documentation.

© Copyright 2023-2024, NVIDIA. Last updated on Feb 5, 2024.