No additional steps are required to run the tools in an on-premises environment, including standalone/local machines. However, to access HDFS, the tools CLI depends on PyArrow (the Python bindings for Apache Arrow), which relies on the following environment variables to bind to HDFS:
HADOOP_HOME: the root of your installed Hadoop distribution. Often contains "lib/native/libhdfs.so".
JAVA_HOME: the location of your Java SDK installation.
ARROW_LIBHDFS_DIR (optional): explicit location of "libhdfs.so" if it is installed somewhere other than $HADOOP_HOME/lib/native.

In addition, add the Hadoop jars to your CLASSPATH:
On Linux (bash):
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

On Windows (cmd):
%HADOOP_HOME%/bin/hadoop classpath --glob > %CLASSPATH%
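
As an illustration, a complete environment setup on a typical Linux node might look like the following. This is a minimal sketch; the paths shown (for example /opt/hadoop and /usr/lib/jvm/java-8-openjdk-amd64) are placeholders and should be replaced with the locations of your own Hadoop and Java installations.

export HADOOP_HOME=/opt/hadoop                                 # hypothetical Hadoop install root
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64             # hypothetical JDK location
export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native               # only needed if libhdfs.so is not in the default location
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`    # Hadoop jars required by libhdfs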
For more information on HDFS requirements, refer to the PyArrow HDFS documentation.
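
As a quick sanity check, you can verify that PyArrow can bind to HDFS before running the tools. This is a minimal sketch, assuming PyArrow is installed in the active Python environment and the HDFS NameNode is reachable through your Hadoop configuration; it lists the contents of the HDFS root directory:

python -c "from pyarrow import fs; hdfs = fs.HadoopFileSystem('default'); print(hdfs.get_file_info(fs.FileSelector('/')))"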