Apache Iceberg Support#

The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables. This document details the Apache Iceberg features that are supported.

Apache Iceberg Versions#

The RAPIDS Accelerator provides experimental support for Apache Iceberg 1.6.1 on Apache Spark 3.5.x. Currently only tables using the v1 and v2 table format specs are supported.

Note

Apache Iceberg in Databricks isn’t supported by the RAPIDS Accelerator.

Catalogs#

Currently we support only the Hadoop filesystem catalog.
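
For instance, a Hadoop filesystem catalog can be configured when building the Spark session. A minimal Scala sketch, using the same catalog configs as the example at the end of this document; the catalog name local and the warehouse path are illustrative assumptions:

import org.apache.spark.sql.SparkSession

// Configure an Iceberg catalog named "local" backed by the Hadoop filesystem.
// The catalog name and warehouse path below are illustrative.
val spark = SparkSession.builder()
  .appName("iceberg-on-gpu")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()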

Reading Tables#

Metadata Queries#

Reads of Apache Iceberg metadata (that is, the history, snapshots, and other metadata tables associated with a table) won’t be GPU-accelerated; the CPU will continue to process these metadata-level queries.
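
For example, a query against a table’s history metadata table, like the following sketch, stays on the CPU; the local.db.someTable name is illustrative:

// Metadata-table reads such as this run on the CPU even with the plugin enabled;
// only regular data scans are eligible for GPU acceleration.
spark.sql("SELECT * FROM local.db.someTable.history").show()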

Data Types Supported#

Currently only primitive types are supported; nested types such as struct, list, and map are not supported yet.

Content Files Supported#

Tables with data files, equality delete files, and position delete files are supported. Deletion vectors stored in Puffin files are not supported yet.

Data Formats#

Apache Iceberg can store data in various formats. The sections below detail the level of support for each of the underlying data formats.

Parquet#

Data stored in Parquet is supported, subject to the same limitations that apply when loading data from raw Parquet files; refer to the Input/Output documentation for details. The following compression codecs applied to the Parquet data are supported (an example of selecting a codec follows the list):

  • gzip (Apache Iceberg default)

  • snappy

  • uncompressed

  • zstd
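
The codec used for newly written data files is controlled per table by Iceberg’s write.parquet.compression-codec table property. A minimal sketch; local.db.someTable is an illustrative table name:

// Switch the codec used for data files written to the table from now on.
spark.sql(
  "ALTER TABLE local.db.someTable " +
  "SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd')")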

ORC#

The RAPIDS Accelerator doesn’t support Apache Iceberg tables using the ORC data format.

Avro#

The RAPIDS Accelerator doesn’t support Apache Iceberg tables using the Avro data format.

Reader Split Size#

The maximum number of bytes to pack into a single partition when reading files on Spark is normally controlled by the config spark.sql.files.maxPartitionBytes, but that config doesn’t apply to Iceberg. Iceberg has its own configs to control the split size; refer to the read options in the Iceberg Runtime Configuration documentation for details. For example, the split-size read option can be set like this:

spark.read.option("split-size", "24217728").table("someTable")
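
The split-size value is given in bytes, so the example above requests splits of roughly 23 MiB. A split size can also be persisted on the table itself; a minimal sketch, assuming Iceberg’s read.split.target-size table property and an illustrative table name:

// Set the target split size to 134217728 bytes (128 MiB) as a table property.
spark.sql(
  "ALTER TABLE local.db.someTable " +
  "SET TBLPROPERTIES ('read.split.target-size' = '134217728')")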

Writing Tables#

The RAPIDS Accelerator for Apache Spark doesn’t accelerate Apache Iceberg writes. Writes to Iceberg tables will be processed by the CPU.

Examples#

To run the RAPIDS Accelerator with Iceberg, you need to add iceberg-spark-runtime as a dependency:

$SPARK_HOME/bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
--jars <spark-rapids-jar> \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=<warehouse path>
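
Once a session with the same packages and configs is running (for example from spark-shell rather than spark-sql), a quick smoke test might look like the following Scala sketch; the db.demo names are illustrative:

// Create, populate, and scan a small Iceberg table in the "local" catalog.
spark.sql("CREATE TABLE local.db.demo (id BIGINT, data STRING) USING iceberg")
spark.sql("INSERT INTO local.db.demo VALUES (1, 'a'), (2, 'b')")
spark.sql("SELECT * FROM local.db.demo").show()

As noted in the Writing Tables section above, the CREATE and INSERT steps run on the CPU; only the final scan is eligible for GPU acceleration.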