RAPIDS PerfIO S3 Reader

cuDF for Apache Spark includes an optional S3 read path, called PerfIO, that is optimized for the RAPIDS plugin. PerfIO reads data directly from S3 using an AWS SDK v2 async HTTP client (Netty or CRT) instead of routing reads through the Hadoop S3A file system.

Enabling PerfIO

spark.rapids.perfio.s3.enabled controls whether PerfIO is active:

  • Unset (default) — PerfIO is enabled opportunistically. If a compatible HTTP client (Netty or CRT) is found on the classpath at startup, PerfIO is used automatically. If not, Spark falls back to S3A with a warning log and no error is raised. On Amazon EMR, where the Netty client is already included in the AWS SDK bundle, PerfIO activates automatically without any additional configuration.

  • true — PerfIO is explicitly required. If no compatible HTTP client is found at startup, an IllegalStateException is thrown rather than silently falling back to S3A. Use this to catch misconfigured deployments that should be running with PerfIO.

  • false — PerfIO is unconditionally disabled; S3A is used regardless of what is on the classpath.
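The three modes above can be sketched as a small shell helper. The function name `perfio_enabled_args` is hypothetical and not part of the plugin; it only prints the `--conf` flag that would be passed to spark-submit, and prints nothing when the key should stay unset.

```shell
#!/bin/bash
# Hypothetical helper (not part of the plugin): prints the spark-submit flag
# for a PerfIO mode. Pass "true", "false", or "" to leave the key unset.
perfio_enabled_args() {
  local mode="$1"
  case "$mode" in
    true|false) echo "--conf spark.rapids.perfio.s3.enabled=${mode}" ;;
    "")         : ;;  # unset: opportunistic default, so no flag is emitted
    *)          echo "unknown mode: ${mode}" >&2; return 1 ;;
  esac
}
```

For example, `spark-submit $(perfio_enabled_args true) ...` would make a missing HTTP client fail at startup instead of falling back to S3A.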

PerfIO requires an HTTP backend for communicating with S3. Two backends are supported, selectable via spark.rapids.perfio.s3.httpClient:

  • Netty (default) — A JVM-based HTTP client. Bundled on EMR. Must be added explicitly on Databricks and OSS Spark.

  • CRT — Backed by the AWS Common Runtime, a native C++ library. Because it runs outside the JVM, CRT typically achieves lower CPU overhead than Netty and often delivers better throughput. Must be added explicitly on all platforms. Netty is the default for compatibility (it is bundled on EMR), not performance.

The HTTP client jar for the chosen backend must be present on the classpath. What is already provided depends on the deployment:

  • Amazon EMR: The Netty client is included in the EMR AWS SDK bundle. No extra jar is needed for the Netty backend. To use the CRT backend, add:

    --packages software.amazon.awssdk:aws-crt-client:<version>
    
  • Databricks: The --packages option is ignored by the Databricks runtime; jars must be placed on the classpath via an init script instead (see Databricks init script below). Both the AWS SDK v2 core and the HTTP client jar must be added on all DBR versions, as the SDK v2 classes required by PerfIO are not accessible by default.

  • OSS Spark: Add the AWS SDK v2 core and the HTTP client for your chosen backend:

    # Netty backend
    --packages software.amazon.awssdk:s3:<version>,software.amazon.awssdk:netty-nio-client:<version>
    
    # CRT backend
    --packages software.amazon.awssdk:s3:<version>,software.amazon.awssdk:aws-crt-client:<version>
    

    (Alternatively, if you are on Spark with Hadoop 3.4, you can add --packages org.apache.hadoop:hadoop-aws:3.4.x instead. Note that OSS Spark does not include hadoop-aws — you must add it explicitly. When you do, it transitively pulls in software.amazon.awssdk:bundle, which includes the SDK v2 core and Netty, making software.amazon.awssdk:s3 and software.amazon.awssdk:netty-nio-client redundant for the Netty backend. The CRT client is never in the bundle and must always be added explicitly regardless.)
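As a rough sketch, the OSS Spark coordinates above can be assembled by a helper like the one below. The function name `perfio_packages` and the short backend labels `netty` and `crt` are assumptions made for illustration; only the Maven coordinates come from this page.

```shell
#!/bin/bash
# Hypothetical helper: prints the --packages value for a backend ("netty" or
# "crt") and an AWS SDK v2 version, using the coordinates listed for OSS Spark.
perfio_packages() {
  local backend="$1" version="$2" client
  case "$backend" in
    netty) client="netty-nio-client" ;;
    crt)   client="aws-crt-client" ;;
    *)     echo "unknown backend: ${backend}" >&2; return 1 ;;
  esac
  echo "software.amazon.awssdk:s3:${version},software.amazon.awssdk:${client}:${version}"
}
```

A call such as `spark-submit --packages "$(perfio_packages crt <version>)" ...` keeps both modules on the same version, which matters because they belong to the same release family.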

Choosing a version: s3, netty-nio-client, and aws-crt-client are all modules within the software.amazon.awssdk release family and share the same version number. The recommended version is 2.24.4, which is the version tested with this release of the cuDF for Apache Spark. A newer version is always safe; using a version older than the AWS SDK v2 already on the classpath may fail with NoSuchMethodError.
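One way to guard against picking an older version is a simple comparison before launching. The sketch below assumes GNU `sort -V` is available and that you already know which SDK v2 version is on the classpath; the function name `version_ge` is hypothetical.

```shell
#!/bin/bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on version sort (sort -V) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For instance, `version_ge "$CHOSEN_VER" "$CLASSPATH_VER" || echo "chosen SDK version is older than the one on the classpath" >&2` would flag the case that may lead to NoSuchMethodError. Both variable names are placeholders.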

See the cuDF for Apache Spark Advanced Configuration page for the full list of spark.rapids.perfio.* options.

Databricks Init Script

Because the Databricks runtime ignores --packages and distributes cluster library jars lazily to executors (which conflicts with PerfIO’s initialization), the HTTP client jar must be placed on the driver and all executors before Spark starts, using a cluster init script.

Create a script (for example perfio-init.sh) in your Databricks workspace and attach it to the cluster in the same way as the main RAPIDS init script described in the Databricks getting-started guide.

The script below uses Maven to resolve and copy the required jars to /databricks/jars/. Set AWS_DEPS to include both the SDK v2 core (software.amazon.awssdk:s3) and the HTTP client for your chosen backend. Replace aws-crt-client with netty-nio-client to use the Netty backend instead.

perfio-init.sh
#!/bin/bash
set -x

SPARK_JARS_DIR=${SPARK_JARS_DIR:-"/databricks/jars"}
AWS_VER=2.24.4

# Always include the SDK v2 core alongside the HTTP client.
# DBR 17+ ships Hadoop 3.4.1 with a bundled SDK v2, but Databricks relocates it
# under a private package, so the unshaded classes are not visible to PerfIO.
AWS_DEPS="software.amazon.awssdk:s3 software.amazon.awssdk:aws-crt-client"

# Download Maven once; skip if the archive is already present.
if [[ ! -f /tmp/apache-maven-3.6.3-bin.tar.gz ]]; then
  wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz -P /tmp
  tar xf /tmp/apache-maven-3.6.3-bin.tar.gz -C "$HOME"
  sudo ln -sf "$HOME/apache-maven-3.6.3/bin/mvn" /usr/local/bin/mvn
fi

# Resolve each dependency (and its transitive dependencies) into a scratch repository.
AWS_DEPS_DIR=/tmp/aws_deps
mkdir -p "$AWS_DEPS_DIR"
for dep in $AWS_DEPS; do
  mvn dependency:get -Dartifact="${dep}:${AWS_VER}" -Dmaven.repo.local="$AWS_DEPS_DIR"
done

# Copy every resolved jar into the Spark jars directory.
find "$AWS_DEPS_DIR" -name "*.jar" | xargs -n 1 bash -c 'sudo cp "$2" "$1"' _ "${SPARK_JARS_DIR}/"
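
After the init script has run, it can be useful to confirm that the expected jars actually landed in the jars directory. The check below is a minimal sketch, assuming the jars keep the standard Maven file name pattern `<artifact>-<version>.jar`; the function name `has_jar` is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: succeeds if at least one jar for the given artifact
# name exists in the given directory.
has_jar() {
  local dir="$1" artifact="$2"
  ls "${dir}/${artifact}"-*.jar > /dev/null 2>&1
}
```

For example, `has_jar /databricks/jars aws-crt-client || echo "CRT client jar missing" >&2` could be appended to the end of the init script or run from a notebook shell cell.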

Credentials

PerfIO resolves AWS credentials from the same fs.s3a.* Hadoop configuration keys that S3A uses, so no additional credential configuration is required if S3A is already working.

The credential resolution precedence is:

  1. Static keys — if fs.s3a.access.key and fs.s3a.secret.key are set (with optional fs.s3a.session.token for temporary credentials), they are used directly. This works on all platforms and Hadoop versions.

  2. fs.s3a.aws.credentials.provider — if set, PerfIO builds a credential chain from the listed provider classes. On Hadoop 3.4 and later, this is handled by CredentialProviderListFactory, which natively supports SDK v1 provider class names (com.amazonaws.*), SDK v2 class names (software.amazon.awssdk.*), and Hadoop S3A wrapper classes. On older Hadoop (EMR before version 7.4), CredentialProviderListFactory is not available; SDK v1 class names are remapped to their SDK v2 equivalents for providers that kept the same class name across versions. Providers that were renamed in SDK v2 will silently fail — for example, com.amazonaws.auth.WebIdentityTokenCredentialsProvider became software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider in SDK v2, so the v1 name should be replaced with the v2 name in fs.s3a.aws.credentials.provider. Hadoop S3A wrapper classes (org.apache.hadoop.fs.s3a.*) are silently skipped.

  3. Default chain — if no credential config is present, or as a fallback after skipped providers, PerfIO uses the AWS SDK v2 DefaultCredentialsProvider, which checks environment variables (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY), JVM system properties, the ~/.aws/credentials profile file, and the EC2 instance metadata service (IAM role). Most EMR deployments using IAM roles are covered by this fallback.
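The precedence above can be summarized in a few lines of shell. This is an illustration of the documented order only, not PerfIO's actual implementation; the function name and the labels it prints are hypothetical, and the fallback to the default chain after skipped providers is not modeled.

```shell
#!/bin/bash
# Illustration only: mirrors the documented precedence, not PerfIO's code.
# Arguments are the values of fs.s3a.access.key, fs.s3a.secret.key, and
# fs.s3a.aws.credentials.provider (pass an empty string when a key is unset).
perfio_credential_source() {
  local access_key="$1" secret_key="$2" provider_list="$3"
  if [ -n "$access_key" ] && [ -n "$secret_key" ]; then
    echo "static-keys"      # step 1: both static keys are present
  elif [ -n "$provider_list" ]; then
    echo "provider-list"    # step 2: explicit provider classes
  else
    echo "default-chain"    # step 3: AWS SDK v2 DefaultCredentialsProvider
  fi
}
```

Note that static keys win even when a provider list is also configured, which is why the access key check comes first.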

Configuration Reference

PerfIO reads standard Hadoop fs.s3a.* configuration keys and maps them to the corresponding AWS SDK HTTP client settings. Operators familiar with S3A tuning can use the same keys they already know; if those keys are already present in the Hadoop or Spark configuration, PerfIO picks them up automatically. For a full description of each key, see the Hadoop S3A connection configuration reference.

The PerfIO Default column shows the value PerfIO applies automatically when the key is not explicitly set. Where these differ from S3A’s defaults, they reflect empirical tuning on GPU-accelerated analytic workloads rather than S3A’s general-purpose access patterns.

Note

fs.s3a.connection.keepalive.interval and fs.s3a.connection.keepalive.timeout are not standard Hadoop S3A keys. They are defined by PerfIO for CRT-specific TCP keep-alive tuning. All other keys in the tables below are standard fs.s3a.* keys.

Keys marked Netty only or CRT only are silently ignored by the other backend.

Connection Pool

| Key | PerfIO Default | S3A Default | Notes |
|---|---|---|---|
| fs.s3a.connection.maximum | 200 | 500 | Max concurrent HTTP connections. |
| fs.s3a.connection.keepalive | true | false | Enable TCP keep-alive on idle connections. |
| fs.s3a.connection.keepalive.interval | 5 min | (N/A) | CRT only. Interval between TCP keep-alive probes. Not a standard S3A key. |
| fs.s3a.connection.keepalive.timeout | 30 s | (N/A) | CRT only. Time to wait for a keep-alive probe response before closing the connection. Not a standard S3A key. |
| fs.s3a.connection.ttl | 5 min | 5 min | Netty only. Maximum connection lifetime regardless of activity. CRT does not support connection TTL. |
| fs.s3a.connection.idle.time | (SDK default) | 60 s | Maximum idle time before a connection is closed. Only applied if the key is explicitly present in the configuration; otherwise the SDK default is used (Netty: 5 s, CRT: 60 s). |
| fs.s3a.socket.recv.buffer | 0 (OS default) | 0 | CRT only. Socket receive buffer size in bytes; 0 defers to the OS. |

Connection Timeouts

| Key | PerfIO Default | S3A Default | Notes |
|---|---|---|---|
| fs.s3a.connection.establish.timeout | 2,000 ms | 30,000 ms | Time allowed to establish a TCP connection. |
| fs.s3a.connection.timeout | 30,000 ms | 200,000 ms | Netty only. Socket read timeout. |
| fs.s3a.connection.acquisition.timeout | 10,000 ms | 60,000 ms | Netty only. Maximum time to wait to acquire a connection from the pool. |

Thread Pool

The thread pool settings below apply to the internal async completion executor shared by both backends.

| Key | PerfIO Default | S3A Default | Notes |
|---|---|---|---|
| fs.s3a.threads.max | 200 | 96 | Maximum threads in the async S3 completion executor. |
| fs.s3a.executor.capacity | 1,000 | 16 | Completion queue depth. |
| fs.s3a.max.total.tasks | 10,000 | 32 | Netty only. Maximum pending connection acquire requests queued in the HTTP client. The S3A default of 32 causes "Maximum pending connection acquisitions exceeded" failures at high concurrency. |

The following keys are also recognized with the same defaults as S3A; refer to the Hadoop S3A documentation for details: fs.s3a.threads.keepalivetime, fs.s3a.endpoint, fs.s3a.endpoint.region, fs.s3a.path.style.access, fs.s3a.checksum.validation.
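Because PerfIO reads these keys from the Hadoop or Spark configuration, one common way to override them per job is Spark's `spark.hadoop.*` prefix, which copies the suffixed key into the Hadoop configuration. The helper below is a minimal sketch of that pattern; the function name `s3a_conf_args` is hypothetical, and the values in the usage example are placeholders, not tuning recommendations.

```shell
#!/bin/bash
# Hypothetical helper: turns "key=value" pairs of fs.s3a.* settings into
# spark-submit flags using Spark's spark.hadoop.* prefix, one flag per line.
s3a_conf_args() {
  local kv
  for kv in "$@"; do
    printf '%s\n' "--conf spark.hadoop.${kv}"
  done
}
```

For example, `spark-submit $(s3a_conf_args fs.s3a.connection.maximum=400 fs.s3a.threads.max=256) ...` would override two of the PerfIO defaults listed above for a single job.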