RAPIDS Accelerator for Apache Spark - User Guide (24.08.01)

Apache Iceberg Support

The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables. This document details the Apache Iceberg features that are supported.

The RAPIDS Accelerator supports Apache Iceberg 0.13.x. Earlier versions of Apache Iceberg aren’t supported.

Note

Apache Iceberg in Databricks isn’t supported by the RAPIDS Accelerator.

Metadata Queries

Reads of Apache Iceberg metadata, such as the history, snapshots, and other metadata tables associated with a table, aren't GPU-accelerated. These metadata-level queries continue to be processed on the CPU.
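
For example, a query against one of the metadata tables, sketched below with a placeholder table name someTable, runs entirely on the CPU:

// Hypothetical example: reading the snapshots metadata table of an Iceberg
// table named "someTable". This query is not GPU-accelerated.
spark.sql("SELECT * FROM someTable.snapshots").show()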

Row-level Delete and Update Support

Apache Iceberg supports row-level deletes and updates. Tables configured with write.delete.mode=merge-on-read aren't supported.
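
As a sketch, assuming the Iceberg Spark SQL extensions are enabled and a placeholder table name someTable, the delete mode can be inspected and changed through the table properties:

// Hypothetical example: list the table properties, including
// write.delete.mode if it has been set.
spark.sql("SHOW TBLPROPERTIES someTable").show(truncate = false)

// Use the default copy-on-write delete mode so the table remains supported.
spark.sql("ALTER TABLE someTable SET TBLPROPERTIES ('write.delete.mode'='copy-on-write')")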

Schema Evolution

Columns that are added and removed at the top level of the table schema are supported. Columns that are added or removed within struct columns aren’t supported.
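
As an illustration, assuming the Iceberg Spark SQL extensions are enabled, a placeholder table named someTable, and a hypothetical struct column named address:

// Adding or dropping a top-level column is supported.
spark.sql("ALTER TABLE someTable ADD COLUMN comment STRING")
spark.sql("ALTER TABLE someTable DROP COLUMN comment")

// Adding a field inside a struct column isn't supported by the RAPIDS Accelerator.
spark.sql("ALTER TABLE someTable ADD COLUMN address.zip STRING")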

Data Formats

Apache Iceberg can store data in various formats. The sections below detail the level of support for each underlying data format.

Parquet

Data stored in Parquet is supported, subject to the same limitations as loading data from raw Parquet files. Refer to the Input/Output documentation for details. The following compression codecs applied to the Parquet data are supported:

  • gzip (Apache Iceberg default)

  • snappy

  • uncompressed

  • zstd
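
The codec used for newly written data files is controlled by the Iceberg table property write.parquet.compression-codec. A minimal sketch, assuming a placeholder table named someTable:

// Write new Parquet data files with zstd, one of the codecs the
// RAPIDS Accelerator can read.
spark.sql("ALTER TABLE someTable SET TBLPROPERTIES ('write.parquet.compression-codec'='zstd')")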

ORC

The RAPIDS Accelerator doesn’t support Apache Iceberg tables using the ORC data format.

Avro

The RAPIDS Accelerator doesn’t support Apache Iceberg tables using the Avro data format.

Reader Split Size

The maximum number of bytes to pack into a single partition when reading files is normally controlled by the Spark config spark.sql.files.maxPartitionBytes, but that setting doesn't apply to Iceberg tables. Iceberg has its own configuration options that control the split size. Refer to the read options in the Iceberg Runtime Configuration documentation for details. For example, the split-size read option can be set like this:

spark.read.option("split-size", "24217728").table("someTable")
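
The split size can also be set as a default for the whole table through the read.split.target-size table property rather than per read; a sketch assuming a placeholder table named someTable and a value in bytes:

// Set the default split size used when planning reads of this table.
spark.sql("ALTER TABLE someTable SET TBLPROPERTIES ('read.split.target-size'='134217728')")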

Writes

The RAPIDS Accelerator for Apache Spark doesn't accelerate Apache Iceberg writes. Writes to Iceberg tables are processed by the CPU.
