Apache Iceberg Support#

The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables. This document details the Apache Iceberg features that are supported.

Supported Versions#

The RAPIDS Accelerator provides experimental support for the following versions:

Spark Version    Iceberg Version
-------------    ---------------
3.5.0 - 3.5.6    1.6.1

Currently, only the v1 and v2 table format specs are supported.

Supported Data Types#

The RAPIDS Accelerator supports the following primitive data types for Apache Iceberg tables:

  • binary

  • boolean

  • date

  • decimal

  • double

  • fixed

  • float

  • int

  • long

  • string

  • time

  • timestamp

  • timestamptz (timestamp with timezone)

  • uuid

Note

Other types such as struct, list, and map are not supported yet. Operations on tables with unsupported data types will fall back to CPU execution.
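To see why a given query falls back to the CPU, for example because a column uses an unsupported type, you can ask the plugin to log the reasons. The following is a minimal sketch: spark.rapids.sql.explain is a standard RAPIDS Accelerator setting, and local.db.example_table is a hypothetical table name.

// Log the operators that will not run on the GPU, with reasons
// (e.g., an Iceberg column with an unsupported type such as struct).
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
spark.sql("SELECT * FROM local.db.example_table").show()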

Supported Catalogs#

The RAPIDS Accelerator supports the following catalog types:

Catalog Type      Support Type
--------------    --------------------
S3Tables          Officially Supported
REST Catalog      Officially Supported
Hadoop Catalog    Officially Supported
Hive              Experimental
JDBC              Experimental
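As one example of wiring up an officially supported catalog, a REST catalog can be configured through Iceberg's standard SparkCatalog settings. This is a sketch only: the catalog name rest_cat and the endpoint URI are placeholders for your deployment.

import org.apache.spark.sql.SparkSession

// Session with the RAPIDS plugin and a REST catalog named "rest_cat".
val spark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.rest_cat.type", "rest")
  .config("spark.sql.catalog.rest_cat.uri", "http://localhost:8181")
  .getOrCreate()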

Supported File Formats#

The RAPIDS Accelerator supports the following file formats for Apache Iceberg tables:

File Format    Read Support     Write Support
-----------    -------------    -------------
Parquet        Supported        Supported
ORC            Not Supported    Not Supported
Avro           Not Supported    Not Supported

Supported Partition Specs#

The RAPIDS Accelerator supports the following partition specifications:

Partition Spec    Read Support    Write Support
--------------    ------------    -------------
Identity          Supported       Not Supported
Bucket            Supported       Not Supported
Void              Supported       Not Supported
Truncate          Supported       Supported
Year              Supported       Supported
Month             Supported       Supported
Day               Supported       Supported
Hour              Supported       Supported

Note

Unsupported partitioning schemes will automatically fall back to CPU execution.

Reading Tables#

GPU acceleration for reading Apache Iceberg tables is enabled by default. You can disable this feature by setting the spark.rapids.sql.format.iceberg.read.enabled configuration to false.
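For example, to turn the GPU reader off for the current session (a sketch; like most spark.rapids.sql settings, this config can also be passed with --conf at launch):

// Disable GPU-accelerated Iceberg reads; scans fall back to the CPU.
spark.conf.set("spark.rapids.sql.format.iceberg.read.enabled", "false")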

Metadata Queries#

Reads of Apache Iceberg metadata, that is, the history, snapshots, and other metadata tables associated with a table, are not GPU-accelerated. These metadata-level queries continue to run on the CPU.
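For instance, queries against Iceberg's built-in metadata tables, such as the snapshots table below, always run on the CPU. The table name local.db.example_table is hypothetical.

// Metadata tables such as .snapshots and .history are processed on the CPU.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.example_table.snapshots").show()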

Supported Content Files#

The RAPIDS Accelerator supports the following content file types:

Content File Type       Support Status
--------------------    --------------
Data File               Supported
Equality Delete File    Supported
Position Delete File    Supported
Deletion Vector         Not Supported

Reader Split Size#

The maximum number of bytes to pack into a single partition when reading files is normally controlled by the Spark config spark.sql.files.maxPartitionBytes, but that config does not apply to Iceberg tables. Iceberg has its own configs to control the split size, and there are multiple ways to set it:

Using reader options:

Refer to the read options in the Iceberg Runtime Configuration documentation for details.

spark.read.option("split-size", "24217728").table("someTable")

Using table properties:

You can also set the split size through table properties:

  • read.split.target-size: Target size for file splits

  • read.split.planning-lookback: Number of bins to consider when combining input splits
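For example, the target split size can be set with an ALTER TABLE statement. This is a sketch: the table name is hypothetical, and 134217728 bytes (128 MiB) is Iceberg's documented default for read.split.target-size.

// Set the target split size to 128 MiB as a table property.
spark.sql("""
  ALTER TABLE local.db.example_table
  SET TBLPROPERTIES ('read.split.target-size' = '134217728')
""")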

Writing Tables#

GPU acceleration for writing to Apache Iceberg tables is enabled by default. You can disable it by setting the spark.rapids.sql.format.iceberg.write.enabled configuration to false.
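For example, to turn the GPU writer off for the current session (a sketch mirroring the read-side config above):

// Disable GPU-accelerated Iceberg writes; writes fall back to the CPU.
spark.conf.set("spark.rapids.sql.format.iceberg.write.enabled", "false")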

The following write operations are supported:

Operation           Support Status
----------------    --------------
INSERT INTO         Supported
MERGE INTO          Supported
INSERT OVERWRITE    Supported
DELETE FROM         Supported
UPDATE              Supported
CREATE TABLE AS     Supported
REPLACE TABLE AS    Supported

Note

Writes with unsupported partition specs will automatically fall back to CPU execution.
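The supported operations use the same SQL as on the CPU. The sketch below uses the hypothetical table names local.db.target and local.db.updates.

// Upsert rows from a staging table into an Iceberg table.
spark.sql("""
  MERGE INTO local.db.target t
  USING local.db.updates u
  ON t.id = u.id
  WHEN MATCHED THEN UPDATE SET t.value = u.value
  WHEN NOT MATCHED THEN INSERT *
""")

// Delete rows matching a predicate.
spark.sql("DELETE FROM local.db.target WHERE value IS NULL")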

Examples#

To run the RAPIDS Accelerator with Iceberg, you need to add iceberg-spark-runtime as a dependency:

$SPARK_HOME/bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
--jars <spark-rapids-jar> \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=<warehouse path>
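
Once the shell starts, you can create and query an Iceberg table in the configured catalog. A minimal sketch, typed at the spark-sql prompt, with an arbitrary table name:

-- Create, populate, and read back a small Iceberg table.
CREATE TABLE local.db.demo (id BIGINT, data STRING) USING iceberg;
INSERT INTO local.db.demo VALUES (1, 'a'), (2, 'b');
SELECT * FROM local.db.demo;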