RAPIDS PerfIO S3 Reader
The cuDF for Apache Spark includes PerfIO, an optional S3 read path optimized for the RAPIDS Plugin. PerfIO reads data directly from S3 using an AWS SDK v2 async HTTP client (Netty or CRT) instead of routing through the Hadoop S3A file system.
Enabling PerfIO
spark.rapids.perfio.s3.enabled controls whether PerfIO is active:
Unset (default): PerfIO is enabled opportunistically. If a compatible HTTP client (Netty or CRT) is found on the classpath at startup, PerfIO is used automatically. If not, Spark falls back to S3A with a warning log and no error is raised. On Amazon EMR, where the Netty client is already included in the AWS SDK bundle, PerfIO activates automatically without any additional configuration.
true: PerfIO is explicitly required. If no compatible HTTP client is found at startup, an IllegalStateException is thrown rather than silently falling back to S3A. Use this to catch misconfigured deployments that should be running with PerfIO.
false: PerfIO is unconditionally disabled; S3A is used regardless of what is on the classpath.
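As a minimal sketch, strict mode can be requested at submit time. The application name my_job.py is a placeholder; only the configuration key and its true value come from the description above.

```shell
# Require PerfIO: fail at startup instead of silently falling back to S3A.
# my_job.py is a hypothetical application; substitute your own.
spark-submit \
  --conf spark.rapids.perfio.s3.enabled=true \
  my_job.py
```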
PerfIO requires an HTTP backend for communicating with S3. Two backends are supported,
selectable via spark.rapids.perfio.s3.httpClient:
Netty (default) — A JVM-based HTTP client. Bundled on EMR. Must be added explicitly on Databricks and OSS Spark.
CRT — Backed by the AWS Common Runtime, a native C++ library. Because it runs outside the JVM, CRT typically achieves lower CPU overhead than Netty and often delivers better throughput. Must be added explicitly on all platforms. Netty is the default for compatibility (it is bundled on EMR), not performance.
The HTTP client jar for the chosen backend must be present on the classpath. What is already provided depends on the deployment:
Amazon EMR: The Netty client is included in the EMR AWS SDK bundle. No extra jar is needed for the Netty backend. To use the CRT backend, add:
--packages software.amazon.awssdk:aws-crt-client:<version>
Databricks: The --packages option is ignored by the Databricks runtime; jars must be placed on the classpath via an init script instead (see Databricks Init Script below). Both the AWS SDK v2 core and the HTTP client jar must be added on all DBR versions, as the SDK v2 classes required by PerfIO are not accessible by default.
OSS Spark: Add the AWS SDK v2 core and the HTTP client for your chosen backend:
# Netty backend
--packages software.amazon.awssdk:s3:<version>,software.amazon.awssdk:netty-nio-client:<version>
# CRT backend
--packages software.amazon.awssdk:s3:<version>,software.amazon.awssdk:aws-crt-client:<version>
(Alternatively, if you are on Spark with Hadoop 3.4, you can add --packages org.apache.hadoop:hadoop-aws:3.4.x instead. Note that OSS Spark does not include hadoop-aws; you must add it explicitly. When you do, it transitively pulls in software.amazon.awssdk:bundle, which includes the SDK v2 core and Netty, making software.amazon.awssdk:s3 and software.amazon.awssdk:netty-nio-client redundant for the Netty backend. The CRT client is never in the bundle and must always be added explicitly.)
Choosing a version: s3, netty-nio-client, and aws-crt-client are all modules
within the software.amazon.awssdk release family and share the same version number.
The recommended version is 2.24.4, which is the version tested with this release of
the cuDF for Apache Spark. A newer version is always safe; using a version older than the
AWS SDK v2 already on the classpath may fail with NoSuchMethodError.
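Because the three modules share one version number, the --packages value can be built from a single version variable. The helper below is an illustrative sketch: the function name perfio_packages and its netty/crt arguments are assumptions made for this example and are not part of the plugin.

```shell
# Hypothetical helper: print the --packages coordinates for a backend.
# Usage: perfio_packages <netty|crt> <sdk-version>
perfio_packages() {
    local backend="$1" version="$2" client
    case "$backend" in
        netty) client="netty-nio-client" ;;
        crt)   client="aws-crt-client" ;;
        *)     echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    # s3 and the HTTP client module are released together,
    # so one version string covers both coordinates.
    echo "software.amazon.awssdk:s3:${version},software.amazon.awssdk:${client}:${version}"
}

# Print the coordinates for the CRT backend at the recommended version.
perfio_packages crt 2.24.4
```

On OSS Spark the output could be passed straight through, for example as --packages "$(perfio_packages netty 2.24.4)".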
See the cuDF for Apache Spark Advanced Configuration
page for the full list of spark.rapids.perfio.* options.
Databricks Init Script
Because the Databricks runtime ignores --packages and distributes cluster library jars
lazily to executors (which conflicts with PerfIO’s initialization), the HTTP client jar must
be placed on the driver and all executors before Spark starts, using a cluster init script.
Create a script (for example perfio-init.sh) in your Databricks workspace and attach it
to the cluster in the same way as the main RAPIDS init script described in the
Databricks getting-started guide.
The script below uses Maven to resolve and copy the required jars to /databricks/jars/.
Set AWS_DEPS to include both the SDK v2 core (software.amazon.awssdk:s3) and the HTTP
client for your chosen backend. Replace aws-crt-client with netty-nio-client to use the
Netty backend instead.
perfio-init.sh
#!/bin/bash
set -x
SPARK_JARS_DIR=${SPARK_JARS_DIR:-"/databricks/jars"}
AWS_VER=2.24.4
# Always include the SDK v2 core alongside the HTTP client.
# DBR 17+ ships Hadoop 3.4.1 with a bundled SDK v2, but Databricks relocates it
# under a private package, so the unshaded classes are not visible to PerfIO.
AWS_DEPS="software.amazon.awssdk:s3 software.amazon.awssdk:aws-crt-client"
# Download and unpack Maven once; skip if the tarball is already present.
if [[ ! -f /tmp/apache-maven-3.6.3-bin.tar.gz ]]; then
    wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz -P /tmp
    tar xf /tmp/apache-maven-3.6.3-bin.tar.gz -C "$HOME"
    sudo ln -s "$HOME/apache-maven-3.6.3/bin/mvn" /usr/local/bin/mvn
fi
# Resolve each artifact (and its transitive dependencies) into a scratch repository.
AWS_DEPS_DIR=/tmp/aws_deps
mkdir -p "$AWS_DEPS_DIR"
for dep in $AWS_DEPS; do
    mvn dependency:get -Dartifact="${dep}:${AWS_VER}" -Dmaven.repo.local="$AWS_DEPS_DIR"
done
# Copy every resolved jar into the Spark jars directory.
find "$AWS_DEPS_DIR" -name "*.jar" | xargs -n 1 bash -c 'sudo cp "$2" "$1"' _ "${SPARK_JARS_DIR}/"
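After the init script has run, it can be useful to confirm that the expected jars actually landed in the jars directory. The function below is a hedged sketch: the name check_perfio_jars is hypothetical, and it assumes jar file names follow the usual Maven pattern of artifact ID, hyphen, version (for example aws-crt-client-2.24.4.jar).

```shell
# Hypothetical check: verify each artifact has a jar in the given directory.
# Usage: check_perfio_jars <jars-dir> <artifact-id>...
check_perfio_jars() {
    local dir="$1" missing=0 artifact
    shift
    for artifact in "$@"; do
        # Match <artifact-id>-<version>.jar, e.g. s3-2.24.4.jar.
        # Requiring a digit after the hyphen keeps "s3" from matching
        # other artifacts whose names merely start with "s3-".
        if ! ls "$dir"/"$artifact"-[0-9]*.jar >/dev/null 2>&1; then
            echo "missing: $artifact" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the CRT setup shown in the script, a call such as check_perfio_jars /databricks/jars s3 aws-crt-client returns non-zero and names any artifact that was not copied.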
Credentials
PerfIO resolves AWS credentials from the same fs.s3a.* Hadoop configuration keys that
S3A uses, so no additional credential configuration is required if S3A is already working.
The credential resolution precedence is:
Static keys: if fs.s3a.access.key and fs.s3a.secret.key are set (with optional fs.s3a.session.key for temporary credentials), they are used directly. This works on all platforms and Hadoop versions.
fs.s3a.aws.credentials.provider: if set, PerfIO builds a credential chain from the listed provider classes. On Hadoop 3.4 and later, this is handled by CredentialProviderListFactory, which natively supports SDK v1 provider class names (com.amazonaws.*), SDK v2 class names (software.amazon.awssdk.*), and Hadoop S3A wrapper classes. On older Hadoop (EMR before version 7.4), CredentialProviderListFactory is not available; SDK v1 class names are remapped to their SDK v2 equivalents for providers that kept the same class name across versions. Providers that were renamed in SDK v2 will silently fail. For example, com.amazonaws.auth.WebIdentityTokenCredentialsProvider became software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider in SDK v2, so the v1 name should be replaced with the v2 name in fs.s3a.aws.credentials.provider. Hadoop S3A wrapper classes (org.apache.hadoop.fs.s3a.*) are silently skipped.
Default chain: if no credential config is present, or as a fallback after skipped providers, PerfIO uses the AWS SDK v2 DefaultCredentialsProvider, which checks environment variables (AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY), JVM system properties, the ~/.aws/credentials profile file, and the EC2 instance metadata service (IAM role). Most EMR deployments using IAM roles are covered by this fallback.
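As an illustration, either of the first two options can be supplied at submit time through Spark's spark.hadoop. prefix, which forwards a setting into the Hadoop configuration. This is a sketch only: my_job.py is a placeholder, and the two commands are alternatives rather than steps to run together.

```shell
# Option 1: static keys, read from environment variables set beforehand.
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  my_job.py

# Option 2: name the provider by its SDK v2 class name, so it also resolves
# on older Hadoop where renamed SDK v1 providers are not remapped.
spark-submit \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider \
  my_job.py
```

When neither option is set, no credential flags are needed and the default chain applies.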
Configuration Reference
PerfIO reads standard Hadoop fs.s3a.* configuration keys and maps them to the
corresponding AWS SDK HTTP client settings. Operators familiar with S3A tuning can use
the same keys they already know; if those keys are already present in the Hadoop or Spark
configuration, PerfIO picks them up automatically. For a full description of each key, see
the Hadoop S3A connection configuration reference.
The PerfIO Default column shows the value PerfIO applies automatically when the key is not explicitly set. Where these differ from S3A’s defaults, they reflect empirical tuning on GPU-accelerated analytic workloads rather than S3A’s general-purpose access patterns.
Note
fs.s3a.connection.keepalive.interval and fs.s3a.connection.keepalive.timeout are
not standard Hadoop S3A keys. They are defined by PerfIO for CRT-specific TCP keep-alive
tuning. All other keys in this table are standard fs.s3a.* keys.
Keys marked Netty only or CRT only are silently ignored by the other backend.
Connection Pool
| Key | PerfIO Default | S3A Default | Notes |
|---|---|---|---|
| | 200 | 500 | Max concurrent HTTP connections. |
| | true | false | Enable TCP keep-alive on idle connections. |
| fs.s3a.connection.keepalive.interval | 5 min | (N/A) | CRT only. Interval between TCP keep-alive probes. Not a standard S3A key. |
| fs.s3a.connection.keepalive.timeout | 30 s | (N/A) | CRT only. Time to wait for a keep-alive probe response before closing the connection. Not a standard S3A key. |
| | 5 min | 5 min | Netty only. Maximum connection lifetime regardless of activity. CRT does not support connection TTL. |
| | (SDK default) | 60 s | Maximum idle time before a connection is closed. Only applied if the key is explicitly present in the configuration; otherwise the SDK default is used (Netty: 5 s, CRT: 60 s). |
| | 0 (OS default) | 0 | CRT only. Socket receive buffer size in bytes; 0 defers to the OS. |
Connection Timeouts
| Key | PerfIO Default | S3A Default | Notes |
|---|---|---|---|
| | 2,000 ms | 30,000 ms | Time allowed to establish a TCP connection. |
| | 30,000 ms | 200,000 ms | Netty only. Socket read timeout. |
| | 10,000 ms | 60,000 ms | Netty only. Maximum time to wait to acquire a connection from the pool. |
Thread Pool
The thread pool settings below apply to the internal async completion executor shared by both backends.
| Key | PerfIO Default | S3A Default | Notes |
|---|---|---|---|
| | 200 | 96 | Maximum threads in the async S3 completion executor. |
| | 1,000 | 16 | Completion queue depth. |
| | 10,000 | 32 | Netty only. Maximum pending connection acquire requests queued in the HTTP client. The S3A default of 32 causes |
The following keys are also recognized with the same defaults as S3A; refer to the
Hadoop S3A documentation
for details: fs.s3a.threads.keepalivetime, fs.s3a.endpoint,
fs.s3a.endpoint.region, fs.s3a.path.style.access, fs.s3a.checksum.validation.
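For example, these keys can be set the same way as any other S3A option, using Spark's spark.hadoop. prefix. The sketch below assumes a bucket in us-west-2 and a hypothetical my_job.py; both are placeholders.

```shell
# Pin the S3 region explicitly; us-west-2 and my_job.py are placeholders.
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint.region=us-west-2 \
  my_job.py
```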