Apache Iceberg Support#
The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables. This document details the Apache Iceberg features that are supported.
Supported Versions#
The RAPIDS Accelerator provides experimental support for the following versions:
| Spark Version | Iceberg Version |
|---|---|
| 3.5.0 - 3.5.6 | 1.6.1 |
Currently only tables using the v1 and v2 format specs are supported.
Supported Data Types#
The RAPIDS Accelerator supports the following primitive data types for Apache Iceberg tables:
- binary
- boolean
- date
- decimal
- double
- fixed
- float
- int
- long
- string
- time
- timestamp
- timestamptz (timestamp with time zone)
- uuid
Note
Other types such as struct, list, and map are not supported yet. Operations on tables with unsupported data types will fall back to CPU execution.
Supported Catalogs#
The RAPIDS Accelerator supports the following catalog types:
| Catalog Type | Support Type |
|---|---|
| S3Tables | Officially Supported |
| REST Catalog | Officially Supported |
| Hadoop Catalog | Officially Supported |
| Hive | Experimental |
| JDBC | Experimental |
Supported File Formats#
The RAPIDS Accelerator supports the following file formats for Apache Iceberg tables:
| File Format | Read Support | Write Support |
|---|---|---|
| Parquet | Supported | Supported |
| ORC | Not Supported | Not Supported |
| Avro | Not Supported | Not Supported |
Supported Partition Specs#
The RAPIDS Accelerator supports the following partition specifications:
| Partition Spec | Read Support | Write Support |
|---|---|---|
| Identity | Supported | Not Supported |
| Bucket | Supported | Not Supported |
| Void | Supported | Not Supported |
| Truncate | Supported | Supported |
| Year | Supported | Supported |
| Month | Supported | Supported |
| Day | Supported | Supported |
| Hour | Supported | Supported |
Note
Unsupported partitioning schemes will automatically fall back to CPU execution.
Reading Tables#
GPU acceleration for reading Apache Iceberg tables is enabled by default. You can disable this feature by setting the spark.rapids.sql.format.iceberg.read.enabled configuration to false.
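For example, a minimal sketch of turning the read path off for a session from Scala, assuming this config can be set at runtime like other spark.rapids.sql settings; the table name is hypothetical:

```scala
// Disable GPU-accelerated Iceberg reads; scans fall back to the CPU.
spark.conf.set("spark.rapids.sql.format.iceberg.read.enabled", "false")

// This read now runs on the CPU (hypothetical table name).
spark.table("local.db.events").show()
```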
Metadata Queries#
Reads of Apache Iceberg metadata (that is, the history, snapshots, and other metadata tables associated with a table) are not GPU-accelerated; these metadata-level queries continue to run on the CPU.
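For instance, metadata-table queries like the following stay on the CPU even with the plugin enabled; the table name is hypothetical:

```scala
// Iceberg metadata tables (history, snapshots, files, ...) are read on the CPU.
spark.sql("SELECT * FROM local.db.events.history").show()
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```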
Supported Content File Types#
The RAPIDS Accelerator supports the following content file types:
| Content File Type | Support Status |
|---|---|
| Data File | Supported |
| Equality Delete File | Supported |
| Position Delete File | Supported |
| Deletion Vector | Not Supported |
Reader Split Size#
The maximum number of bytes to pack into a single partition when reading files is normally controlled by the Spark config spark.sql.files.maxPartitionBytes. However, this setting does not apply to Iceberg tables; Iceberg has its own configs to control the split size.
There are multiple ways to configure the reader split size:
Using reader options:
Refer to the read options in the Iceberg Runtime Configuration documentation for details.
```scala
spark.read.option("split-size", "24217728").table("someTable")
```
Using table properties:
You can also set the split size through table properties:
- read.split.target-size: Target size for file splits
- read.split.planning-lookback: Number of bins to consider when combining input splits
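As a sketch, these properties can be set with a standard ALTER TABLE statement; the table name and values below are illustrative:

```scala
// Target ~128 MiB splits and consider 10 bins when combining input splits.
spark.sql("""
  ALTER TABLE local.db.events SET TBLPROPERTIES (
    'read.split.target-size' = '134217728',
    'read.split.planning-lookback' = '10'
  )
""")
```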
Writing Tables#
GPU acceleration for writing to Apache Iceberg tables is enabled by default. You can disable it by setting the spark.rapids.sql.format.iceberg.write.enabled configuration to false.
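Mirroring the read side, a minimal sketch of disabling GPU writes for a session:

```scala
// Disable GPU-accelerated Iceberg writes; write operations fall back to the CPU.
spark.conf.set("spark.rapids.sql.format.iceberg.write.enabled", "false")
```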
The following write operations are supported:
| Operation | Support Status |
|---|---|
| INSERT INTO | Supported |
| MERGE INTO | Supported |
| INSERT OVERWRITE | Supported |
| DELETE FROM | Supported |
| UPDATE | Supported |
| CREATE TABLE AS | Supported |
| REPLACE TABLE AS | Supported |
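For illustration, the snippets below exercise two of these operations from Scala; all table and column names are hypothetical, and depending on your setup Iceberg's Spark session extensions may be required for row-level operations:

```scala
// INSERT INTO an existing Iceberg table.
spark.sql("INSERT INTO local.db.events VALUES (1, 'click')")

// MERGE INTO, upserting rows from a staging table into the target.
spark.sql("""
  MERGE INTO local.db.events t
  USING local.db.events_staging s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```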
Note
Writes that use unsupported partitioning schemes will automatically fall back to CPU execution.
Examples#
To run the RAPIDS Accelerator with Iceberg, you need to add iceberg-spark-runtime as a dependency:
```shell
$SPARK_HOME/bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
  --jars <spark-rapids-jar> \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=<warehouse path>
```
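Once the session is up, a quick smoke test might look like the following, assuming a spark-shell launched with the same packages and configuration; the database and table names are hypothetical:

```scala
// Create an Iceberg table in the `local` Hadoop catalog, write, and read back.
spark.sql("CREATE TABLE local.db.demo (id BIGINT, data STRING) USING iceberg")
spark.sql("INSERT INTO local.db.demo VALUES (1, 'a'), (2, 'b')")
spark.table("local.db.demo").show()
```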