Apache Iceberg Support#
The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables. This document details the Apache Iceberg features that are supported.
Supported Versions#
The RAPIDS Accelerator provides experimental support for the following versions:
| Spark Version | Iceberg Version |
|---|---|
| 3.5.0 - 3.5.6 | 1.6.1 |
Currently only tables using the v1 and v2 format specs are supported.
Supported Data Types#
The RAPIDS Accelerator supports the following primitive data types for Apache Iceberg tables:
- binary
- boolean
- date
- decimal
- double
- fixed
- float
- int
- long
- string
- time
- timestamp
- timestamptz (timestamp with time zone)
- uuid
Note
Other types such as struct, list, and map are not supported yet. Operations on tables with unsupported data types will fall back to CPU execution.
Supported Catalogs#
The RAPIDS Accelerator supports the following catalog types:
| Catalog Type | Support Type |
|---|---|
| S3Tables | Officially Supported |
| REST Catalog | Officially Supported |
| Hadoop Catalog | Officially Supported |
| Hive | Experimental |
| JDBC | Experimental |
Supported File Formats#
The RAPIDS Accelerator supports the following file formats for Apache Iceberg tables:
| File Format | Read Support | Write Support |
|---|---|---|
| Parquet | Supported | Supported |
| ORC | Not Supported | Not Supported |
| Avro | Not Supported | Not Supported |
Supported Partition Specs#
The RAPIDS Accelerator supports the following partition specifications:
| Partition Spec | Read Support | Write Support |
|---|---|---|
| Identity | Supported | Not Supported |
| Bucket | Supported | Not Supported |
| Void | Supported | Not Supported |
| Truncate | Supported | Supported |
| Year | Supported | Supported |
| Month | Supported | Supported |
| Day | Supported | Supported |
| Hour | Supported | Supported |
Note
Unsupported partitioning schemes will automatically fall back to CPU execution.
Reading Tables#
GPU acceleration for reading Apache Iceberg tables is enabled by default. You can disable this feature by setting the spark.rapids.sql.format.iceberg.read.enabled configuration to false.
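For example, a minimal sketch of turning the read path off for a session from Scala, assuming this config can be set at runtime like other spark.rapids.sql settings; the table name is hypothetical:

```scala
// Disable GPU-accelerated Iceberg reads; scans fall back to the CPU.
spark.conf.set("spark.rapids.sql.format.iceberg.read.enabled", "false")

// This read now runs on the CPU (hypothetical table name).
spark.table("local.db.events").show()
```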
Metadata Queries#
Reads of Apache Iceberg metadata (that is, the history, snapshots, and other metadata tables associated with a table) are not GPU-accelerated; these metadata-level queries continue to run on the CPU.
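For instance, metadata-table queries like the following stay on the CPU even with the plugin enabled; the table name is hypothetical:

```scala
// Iceberg metadata tables (history, snapshots, files, ...) are read on the CPU.
spark.sql("SELECT * FROM local.db.events.history").show()
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```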
Supported Content File Types#
The RAPIDS Accelerator supports the following content file types:
| Content File Type | Support Status |
|---|---|
| Data File | Supported |
| Equality Delete File | Supported |
| Position Delete File | Supported |
| Deletion Vector | Not Supported |
Reader Split Size#
The maximum number of bytes to pack into a single partition when reading files is normally controlled by the Spark config spark.sql.files.maxPartitionBytes. However, this setting does not apply to Iceberg tables; Iceberg has its own configs to control the split size.
There are multiple ways to configure the reader split size:
Using reader options:
Refer to the read options in the Iceberg Runtime Configuration documentation for details.
```scala
spark.read.option("split-size", "24217728").table("someTable")
```
Using table properties:
You can also set the split size through table properties:
- read.split.target-size: Target size for file splits
- read.split.planning-lookback: Number of bins to consider when combining input splits
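As a sketch, these properties can be set with a standard ALTER TABLE statement; the table name and values below are illustrative:

```scala
// Target ~128 MiB splits and consider 10 bins when combining input splits.
spark.sql("""
  ALTER TABLE local.db.events SET TBLPROPERTIES (
    'read.split.target-size' = '134217728',
    'read.split.planning-lookback' = '10'
  )
""")
```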
Writing Tables#
GPU acceleration for writing to Apache Iceberg tables is enabled by default. You can disable it by setting the spark.rapids.sql.format.iceberg.write.enabled configuration to false.
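Mirroring the read side, a minimal sketch of disabling GPU writes for a session:

```scala
// Disable GPU-accelerated Iceberg writes; write operations fall back to the CPU.
spark.conf.set("spark.rapids.sql.format.iceberg.write.enabled", "false")
```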
The following write operations are supported:
| Operation | Support Status |
|---|---|
| INSERT INTO | Supported |
| MERGE INTO | Supported |
| INSERT OVERWRITE | Supported |
| DELETE FROM | Supported |
| UPDATE | Supported |
| CREATE TABLE AS | Supported |
| REPLACE TABLE AS | Supported |
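For illustration, the snippets below exercise two of these operations from Scala; all table and column names are hypothetical, and depending on your setup Iceberg's Spark session extensions may be required for row-level operations:

```scala
// INSERT INTO an existing Iceberg table.
spark.sql("INSERT INTO local.db.events VALUES (1, 'click')")

// MERGE INTO, upserting rows from a staging table into the target.
spark.sql("""
  MERGE INTO local.db.events t
  USING local.db.events_staging s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```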
Note
Writes that use unsupported partitioning schemes will automatically fall back to CPU execution.
Examples#
To run the RAPIDS Accelerator with Iceberg, you need to add iceberg-spark-runtime as a dependency:
```shell
$SPARK_HOME/bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
  --jars <spark-rapids-jar> \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=<warehouse path>
```
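Once the session is up, a quick smoke test might look like the following, assuming a spark-shell launched with the same packages and configuration; the database and table names are hypothetical:

```scala
// Create an Iceberg table in the `local` Hadoop catalog, write, and read back.
spark.sql("CREATE TABLE local.db.demo (id BIGINT, data STRING) USING iceberg")
spark.sql("INSERT INTO local.db.demo VALUES (1, 'a'), (2, 'b')")
spark.table("local.db.demo").show()
```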