
# nemo_automodel.components.datasets.llm.delta_lake_dataset

Delta Lake dataset support for streaming instruction-tuning datasets.

This module provides support for reading Delta Lake tables from Databricks or
local storage as streaming datasets. It integrates with the existing
ColumnMappedTextInstructionDataset infrastructure.

**Supports tables with deletion vectors** (Databricks Runtime 15.4+) through Spark
when running on Databricks, and optionally through the Databricks SQL Connector for
Unity Catalog access outside of Spark.

## Module Contents

### Classes

| Name                                                                                                              | Description                                                    |
| ----------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- |
| [`DeltaLakeDataset`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-DeltaLakeDataset)                 | HuggingFace datasets-compatible wrapper for Delta Lake tables. |
| [`DeltaLakeIterator`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-DeltaLakeIterator)               | Iterator that yields rows from a Delta Lake table.             |
| [`_LimitedDeltaLakeDataset`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_LimitedDeltaLakeDataset) | Internal wrapper to limit a Delta Lake dataset to n samples.   |

### Functions

| Name                                                                                                                                                | Description                                                                                      |
| --------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| [`_build_uc_table_fqn`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_build_uc_table_fqn)                                             | Build a fully-qualified UC table name with safe quoting.                                         |
| [`_check_databricks_sql_available`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_check_databricks_sql_available)                     | Check if databricks-sql-connector is available for Unity Catalog access.                         |
| [`_check_delta_reader_available`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_check_delta_reader_available)                         | Check if any Delta Lake reader is available (deltalake, Spark, or databricks-sql).               |
| [`_check_deltalake_available`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_check_deltalake_available)                               | Check if the deltalake package is available.                                                     |
| [`_check_pyspark_available`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_check_pyspark_available)                                   | Check if PySpark is available (used as a fallback on Databricks for deletion vectors).           |
| [`_get_spark_session`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_get_spark_session)                                               | Get an active Spark session if available (Databricks notebooks/jobs).                            |
| [`_is_deletion_vectors_error`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_is_deletion_vectors_error)                               | Return True if *e* looks like the deltalake 'deletionVectors' unsupported-reader-features error. |
| [`_is_location_overlap_error`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_is_location_overlap_error)                               | Return True if *e* looks like Databricks UC managed-storage overlap.                             |
| [`_is_unity_catalog_path`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_is_unity_catalog_path)                                       | Check if path refers to a Unity Catalog table (catalog.schema.table format).                     |
| [`_normalize_delta_path`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_normalize_delta_path)                                         | Normalize a Delta Lake path by removing the delta:// prefix if present.                          |
| [`_parse_unity_storage_ids`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_parse_unity_storage_ids)                                   | Parse Unity Catalog managed storage IDs from a \_\_unitystorage path.                            |
| [`_quote_sql_ident`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_quote_sql_ident)                                                   | Quote an identifier for Spark SQL (handles embedded backticks).                                  |
| [`_resolve_uc_table_from_unity_storage_path`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_resolve_uc_table_from_unity_storage_path) | If *path* looks like UC managed storage, try to resolve to catalog.schema.table.                 |
| [`_try_resolve_uc_table_from_system_tables`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_try_resolve_uc_table_from_system_tables)   | Best-effort reverse lookup of a UC table name via Databricks system tables.                      |
| [`is_delta_lake_path`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-is_delta_lake_path)                                               | Check if a path refers to a Delta Lake table.                                                    |

### Data

[`_DATABRICKS_SQL_AVAILABLE`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_DATABRICKS_SQL_AVAILABLE)

[`_DELTALAKE_AVAILABLE`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_DELTALAKE_AVAILABLE)

[`_PYSPARK_AVAILABLE`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_PYSPARK_AVAILABLE)

[`_UNITY_STORAGE_TABLE_PATH_RE`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-_UNITY_STORAGE_TABLE_PATH_RE)

[`logger`](#nemo_automodel-components-datasets-llm-delta_lake_dataset-logger)

### API

```python
class nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset(
    table_path: str,
    columns: typing.Optional[list] = None,
    storage_options: typing.Optional[typing.Dict[str, str]] = None,
    version: typing.Optional[int] = None,
    sql_query: typing.Optional[str] = None
)
```

HuggingFace datasets-compatible wrapper for Delta Lake tables.

This class provides better integration with the HuggingFace datasets library,
supporting features like sharding, shuffling, and epoch setting for distributed
training scenarios.

**Parameters:**

`table_path`: Path to the Delta Lake table.

`columns`: Optional list of column names to read.

`storage_options`: Optional dict of storage options for cloud authentication.

`version`: Optional specific version of the Delta table to read.
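A minimal usage sketch, based only on the constructor and method signatures documented on this page. The helper name `build_rank_stream` and the column names are hypothetical, and the import is deferred so the function can be defined without NeMo Automodel or a Delta reader installed.

```python
from typing import Any, Dict, Iterator


def build_rank_stream(
    table_path: str, rank: int, world_size: int, epoch: int = 0
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: stream one rank's share of a Delta Lake table."""
    # Deferred import: needs nemo_automodel plus at least one Delta reader backend.
    from nemo_automodel.components.datasets.llm.delta_lake_dataset import (
        DeltaLakeDataset,
    )

    dataset = DeltaLakeDataset(
        table_path=table_path,
        columns=["prompt", "response"],  # hypothetical column names
    )
    # shard() and shuffle() are documented to return self, so they can be chained.
    dataset = dataset.shard(num_shards=world_size, index=rank)
    dataset = dataset.shuffle(buffer_size=1000, seed=42)
    dataset.set_epoch(epoch)
    return iter(dataset)
```

Calling `build_rank_stream("delta:///path/to/table", rank=0, world_size=8)` would then yield row dictionaries for rank 0, assuming the table exists and a reader backend is available.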

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.__getitem__(
    idx: int
) -> typing.Dict[str, typing.Any]
```

Get a specific row by index.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

Iterate over rows in the dataset.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.__len__() -> int
```

Return the number of rows in the table.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.set_epoch(
    epoch: int
) -> None
```

Set the current epoch for deterministic shuffling.

**Parameters:**

`epoch`: The epoch number.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.shard(
    num_shards: int,
    index: int
) -> nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset
```

Shard the dataset for distributed processing.

**Parameters:**

`num_shards`: Total number of shards.

`index`: Index of this shard (0-based).

**Returns:** `DeltaLakeDataset`

Self for method chaining.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.shuffle(
    buffer_size: int = 1000,
    seed: typing.Optional[int] = None
) -> nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset
```

Configure shuffling for the dataset.

Note: For streaming Delta Lake datasets, shuffling is performed on-the-fly
using a shuffle buffer. The actual shuffling happens during iteration.

**Parameters:**

`buffer_size`: Size of the shuffle buffer.

`seed`: Random seed for reproducibility.

**Returns:** `DeltaLakeDataset`

Self for method chaining.
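The on-the-fly shuffle described above can be illustrated with a generic shuffle buffer. This is a sketch of the general technique, not the library's implementation; the function name `shuffle_buffer` and the way `seed` and `epoch` are combined are assumptions.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_buffer(
    rows: Iterable[T],
    buffer_size: int = 1000,
    seed: Optional[int] = None,
    epoch: int = 0,
) -> Iterator[T]:
    """Generic streaming shuffle (hypothetical, for illustration only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    # Assumption: seed and epoch are added so each epoch gets a different
    # but reproducible order.
    rng = random.Random(None if seed is None else seed + epoch)
    buffer: List[T] = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)
            continue
        # Buffer is full: emit a random element and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = row
    # Drain the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger buffer gives an order closer to a full shuffle at the cost of memory; a buffer of size 1 leaves the stream order unchanged.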

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset.take(
    n: int
) -> nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset
```

Limit the dataset to the first n samples.

**Parameters:**

`n`: Number of samples to take.

**Returns:** `DeltaLakeDataset`

A new DeltaLakeDataset limited to n samples.
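Limiting a stream to its first n samples is commonly done with `itertools.islice`. The sketch below shows that idea with a hypothetical `LimitedIterable` wrapper; it is not the library's `_LimitedDeltaLakeDataset`.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


class LimitedIterable:
    """Hypothetical wrapper that yields at most `limit` items per pass."""

    def __init__(self, base: Iterable[Any], limit: int) -> None:
        if limit < 0:
            raise ValueError("limit must be non-negative")
        self.base = base
        self.limit = limit

    def __iter__(self) -> Iterator[Any]:
        # islice stops after `limit` items without reading the rest of the stream.
        return islice(iter(self.base), self.limit)
```

Because the limit is applied lazily, the underlying table is never read past the requested number of rows.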

```python
class nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator(
    table_path: str,
    columns: typing.Optional[list] = None,
    storage_options: typing.Optional[typing.Dict[str, str]] = None,
    batch_size: int = 1024,
    version: typing.Optional[int] = None,
    sql_query: typing.Optional[str] = None,
    shard_info: typing.Optional[tuple[int, int]] = None
)
```

Iterator that yields rows from a Delta Lake table.

This class provides a streaming interface for Delta Lake tables,
yielding rows as dictionaries one at a time to support memory-efficient
iteration over large tables.

Supports tables with deletion vectors (Databricks Runtime 15.4+) via the Spark backend
(recommended when running in Databricks notebooks/jobs).

**Parameters:**

`table_path`: Path to the Delta Lake table.

`columns`: Optional list of column names to read. If None, reads all columns.

`storage_options`: Optional storage options for cloud storage access.

`batch_size`: Number of rows to read at a time (default: 1024).

`version`: Optional version of the table to read.

`sql_query`: Optional SQL query to read the table and/or create alias columns.

`shard_info`: Optional sharding configuration `(num_shards, shard_index)`.
When provided, only rows where `row_idx % num_shards == shard_index` are yielded.
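The modulo rule for sharding can be sketched as a small generator. The function name `iter_shard` is hypothetical; only the filtering condition comes from the description above.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Row = Dict[str, Any]


def iter_shard(
    rows: Iterable[Row], shard_info: Optional[Tuple[int, int]] = None
) -> Iterator[Row]:
    """Yield only the rows that belong to one shard (illustrative sketch)."""
    if shard_info is None:
        # No sharding configured: pass every row through.
        yield from rows
        return
    num_shards, shard_index = shard_info
    for row_idx, row in enumerate(rows):
        # Keep rows whose position maps to this shard.
        if row_idx % num_shards == shard_index:
            yield row
```

With three shards, shard 0 receives rows 0, 3, 6, and so on. In this sketch every shard still scans the full stream and discards the rows it does not own.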

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator.__iter__() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

Iterate over rows in the Delta Lake table.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator._add_env_storage_options()
```

Add storage options from environment variables if not already set.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator._iter_all_rows() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

Iterate over all rows (no sharding).

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator._iter_with_databricks_sql() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

Iterate using Databricks SQL Connector (for Unity Catalog tables).

This is the recommended method for accessing Unity Catalog tables
as it handles authentication, deletion vectors, and column mapping natively.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator._iter_with_deltalake() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

Iterate using deltalake library.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator._iter_with_spark() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

Iterate using Spark (supports deletion vectors on Databricks).

This backend requires a working SparkSession (e.g., Databricks notebooks/jobs).
It is the recommended fallback for Delta tables that use deletion vectors.
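The fallback idea can be sketched as follows: try a primary reader, and switch to a second reader only when the failure looks like a deletion-vectors error. Both function names are hypothetical, the string match is an assumption about what such an error message contains, and the sketch assumes the failure happens when the reader is opened, before any rows are yielded.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]


def looks_like_deletion_vectors_error(e: BaseException) -> bool:
    # Assumption: the error text mentions the 'deletionVectors' reader feature.
    return "deletionvectors" in str(e).lower()


def iter_with_fallback(
    primary: Callable[[], Iterable[Row]],
    fallback: Callable[[], Iterable[Row]],
) -> Iterator[Row]:
    """Open the primary reader; use the fallback on a deletion-vectors error."""
    try:
        rows = primary()
    except Exception as e:
        if not looks_like_deletion_vectors_error(e):
            raise  # unrelated failures are not masked
        rows = fallback()
    yield from rows
```

Keeping the error check narrow means that unrelated problems such as missing credentials still surface instead of silently triggering the fallback.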

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator.shard(
    num_shards: int,
    index: int
) -> nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeIterator
```

Shard the iterator for distributed processing.

**Parameters:**

`num_shards`: Total number of shards.

`index`: Index of this shard (0-based).

```python
class nemo_automodel.components.datasets.llm.delta_lake_dataset._LimitedDeltaLakeDataset(
    base: nemo_automodel.components.datasets.llm.delta_lake_dataset.DeltaLakeDataset,
    limit: int
)
```

Internal wrapper to limit a Delta Lake dataset to n samples.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._LimitedDeltaLakeDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.Any]]
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._LimitedDeltaLakeDataset.set_epoch(
    epoch: int
) -> None
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._LimitedDeltaLakeDataset.shard(
    num_shards: int,
    index: int
)
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._LimitedDeltaLakeDataset.shuffle(
    buffer_size: int = 1000,
    seed: typing.Optional[int] = None
)
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._LimitedDeltaLakeDataset.take(
    n: int
)
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._build_uc_table_fqn(
    catalog: str,
    schema: str,
    table: str
) -> str
```

Build a fully-qualified UC table name with safe quoting.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._check_databricks_sql_available() -> bool
```

Check if databricks-sql-connector is available for Unity Catalog access.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._check_delta_reader_available() -> bool
```

Check if any Delta Lake reader is available (deltalake, Spark, or databricks-sql).

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._check_deltalake_available() -> bool
```

Check if the deltalake package is available.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._check_pyspark_available() -> bool
```

Check if PySpark is available (used as a fallback on Databricks for deletion vectors).

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._get_spark_session() -> typing.Optional[typing.Any]
```

Get an active Spark session if available (Databricks notebooks/jobs).

**Returns:** `Optional[Any]`

A SparkSession instance if available, else None.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._is_deletion_vectors_error(
    e: BaseException
) -> bool
```

Return True if *e* looks like the deltalake 'deletionVectors' unsupported-reader-features error.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._is_location_overlap_error(
    e: BaseException
) -> bool
```

Return True if *e* looks like Databricks UC managed-storage overlap.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._is_unity_catalog_path(
    path: str
) -> bool
```

Check if path refers to a Unity Catalog table (catalog.schema.table format).
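One way to recognize the `catalog.schema.table` shape is to require exactly three non-empty, dot-separated parts and no path separators. This is a hypothetical heuristic, not the library's actual check, and the function name is invented for illustration.

```python
def looks_like_unity_catalog_name(path: str) -> bool:
    """Heuristic: three non-empty dot-separated parts and no path separators."""
    # Anything with a slash or backslash is treated as a filesystem path or URI.
    if "/" in path or "\\" in path:
        return False
    parts = path.split(".")
    return len(parts) == 3 and all(part.strip() for part in parts)
```

A heuristic like this cannot tell a three-part table name from a bare file name such as `data.v1.parquet`, so a real implementation may need extra rules.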

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._normalize_delta_path(
    path: str
) -> str
```

Normalize a Delta Lake path by removing the delta:// prefix if present.

**Parameters:**

`path`: The Delta Lake path.

**Returns:** `str`

The normalized path suitable for the deltalake library.
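Prefix removal of this kind can be sketched in a few lines. The function below is an illustration of the described behavior, not the library's source, and its name is hypothetical.

```python
def normalize_delta_path(path: str) -> str:
    """Strip a leading 'delta://' prefix; other paths are returned unchanged."""
    prefix = "delta://"
    if path.startswith(prefix):
        return path[len(prefix):]
    return path
```

Under this sketch, `delta:///data/table` becomes `/data/table`, and a wrapped URI such as `delta://s3://bucket/table` becomes `s3://bucket/table`.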

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._parse_unity_storage_ids(
    path: str
) -> typing.Optional[typing.Dict[str, str]]
```

Parse Unity Catalog managed storage IDs from a \_\_unitystorage path.

Direct path access to these locations is blocked on Databricks ("LOCATION\_OVERLAP").

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._quote_sql_ident(
    ident: str
) -> str
```

Quote an identifier for Spark SQL (handles embedded backticks).
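Spark SQL delimits identifiers with backticks and escapes an embedded backtick by doubling it. The sketch below applies that rule and joins three quoted parts into a fully-qualified name; both function names are hypothetical stand-ins for the private helpers on this page.

```python
def quote_sql_ident(ident: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + ident.replace("`", "``") + "`"


def build_table_fqn(catalog: str, schema: str, table: str) -> str:
    """Join three individually quoted parts into catalog.schema.table."""
    return ".".join(quote_sql_ident(part) for part in (catalog, schema, table))
```

Quoting each part separately keeps names that contain dots, dashes, or backticks from being misread as extra name segments or as SQL syntax.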

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._resolve_uc_table_from_unity_storage_path(
    spark: typing.Any,
    path: str
) -> typing.Optional[str]
```

If *path* looks like UC managed storage, try to resolve to catalog.schema.table.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._try_resolve_uc_table_from_system_tables(
    spark: typing.Any,
    table_id: typing.Optional[str] = None,
    storage_location: typing.Optional[str] = None
) -> typing.Optional[str]
```

Best-effort reverse lookup of a UC table name via Databricks system tables.

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.is_delta_lake_path(
    path: str
) -> bool
```

Check if a path refers to a Delta Lake table.

A path is considered a Delta Lake path if any of the following holds:

1. It starts with the "delta://" protocol prefix.
2. It is a local directory containing a "\_delta\_log" subdirectory.
3. It starts with "dbfs:/" (Databricks File System).

**Parameters:**

`path`: The path to check.

**Returns:** `bool`

True if the path is a Delta Lake table, False otherwise.
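The three rules listed above can be sketched as a single predicate. This is an illustration of the documented rules only; the function name is hypothetical and the real function may apply additional checks.

```python
import os


def looks_like_delta_lake_path(path: str) -> bool:
    """Apply the three documented rules for recognizing a Delta Lake path."""
    # Rules 1 and 3: recognized prefixes.
    if path.startswith("delta://") or path.startswith("dbfs:/"):
        return True
    # Rule 2: a local directory that contains a _delta_log subdirectory.
    return os.path.isdir(os.path.join(path, "_delta_log"))
```

The prefix checks run first so that remote paths never trigger a local filesystem lookup.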

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._DATABRICKS_SQL_AVAILABLE: Optional[bool] = None
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._DELTALAKE_AVAILABLE: Optional[bool] = None
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._PYSPARK_AVAILABLE: Optional[bool] = None
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset._UNITY_STORAGE_TABLE_PATH_RE = re.compile('__unitystorage/catalogs/(?P<catalog_id>[0-9a-fA-F]{8}-[0-9a-fA-F]{4}...
```

```python
nemo_automodel.components.datasets.llm.delta_lake_dataset.logger = logging.getLogger(__name__)
```