Use the ColumnMappedTextInstructionIterableDataset (Streaming)
This guide explains how to use ColumnMappedTextInstructionIterableDataset to stream instruction datasets for LLM fine-tuning, including Delta Lake/Databricks sources.
Unlike ColumnMappedTextInstructionDataset (map-style, non-streaming), this class is a torch.utils.data.IterableDataset and always loads data in streaming mode. This is intentional: it helps ensure data is consumed as a stream and avoids accidentally materializing full datasets/tables to disk or memory (which is especially important for large or sensitive corpora).
When to Use This Dataset
Use ColumnMappedTextInstructionIterableDataset when you need:
- Streaming-only behavior (e.g., to reduce the risk of accidental data leakage from materializing a full dataset)
- Delta Lake/Databricks sources (Unity Catalog, cloud lakehouse storage, DBFS, etc.)
- Very large datasets where map-style loading/caching is undesirable
If you do not need streaming (and you want len(ds) / ds[i]), use ColumnMappedTextInstructionDataset.
Key Differences vs ColumnMappedTextInstructionDataset
- Iterable: you iterate (for sample in ds:); you cannot rely on len(ds) or ds[i].
- Always streaming: there is no streaming= flag; it is always enabled.
- Repeat behavior: by default, repeat_on_exhaustion=True (infinite stream). Set repeat_on_exhaustion=False to do a single pass.
- (Optional) sharding/shuffle helpers: use .shard(num_shards, index) / .shuffle(buffer_size, seed) when supported by the underlying backend.
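The repeat behavior matters most in training loops: an infinite stream never raises StopIteration, so the consumer has to bound it. The sketch below uses a stand-in class, RepeatingStream, which is hypothetical and is not the library class. It only mimics the repeat_on_exhaustion semantics described above, using the standard library.

```python
from itertools import islice
from typing import Iterable, Iterator


class RepeatingStream:
    """Hypothetical stand-in for a streaming dataset; NOT the library class."""

    def __init__(self, source: Iterable[dict], repeat_on_exhaustion: bool = True):
        self.source = source
        self.repeat_on_exhaustion = repeat_on_exhaustion

    def __iter__(self) -> Iterator[dict]:
        while True:
            yielded = False
            for sample in self.source:
                yielded = True
                yield sample
            # Stop after one pass, or when the source is empty (avoids spinning forever).
            if not self.repeat_on_exhaustion or not yielded:
                return


rows = [{"id": 0}, {"id": 1}, {"id": 2}]

# Single pass: safe to exhaust with a plain loop or list comprehension.
single_pass = [s["id"] for s in RepeatingStream(rows, repeat_on_exhaustion=False)]

# Infinite stream: bound it explicitly, for example with islice or a max step count.
first_seven = [s["id"] for s in islice(RepeatingStream(rows), 7)]
```

With the default infinite stream, an unbounded list(ds) or for loop would never finish, which is why the example caps consumption with islice.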
The column mapping and tokenization logic are shared with ColumnMappedTextInstructionDataset. See Tokenization Paths for details on output fields (input_ids, labels, attention_mask) and masking behavior.
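To make the three output fields concrete, here is a minimal sketch of one common way they relate. It assumes prompt tokens are masked in labels with -100 (the default ignore_index of torch.nn.CrossEntropyLoss); whether and how this dataset masks is covered in Tokenization Paths. The helper build_example and the token ids are hypothetical, and label shifting and padding are omitted.

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(
    prompt_ids: Sequence[int],
    answer_ids: Sequence[int],
    mask_prompt: bool = True,
) -> Dict[str, List[int]]:
    """Hypothetical helper: assemble input_ids, labels, and attention_mask."""
    input_ids = list(prompt_ids) + list(answer_ids)
    # Masked positions contribute nothing to the loss.
    prompt_labels = [IGNORE_INDEX] * len(prompt_ids) if mask_prompt else list(prompt_ids)
    labels = prompt_labels + list(answer_ids)
    attention_mask = [1] * len(input_ids)  # no padding in this sketch
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}


# Made-up token ids, for illustration only.
example = build_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22])
```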
Quickstart (Hugging Face Streaming)
Delta Lake/Databricks
ColumnMappedTextInstructionIterableDataset supports Delta Lake tables from:
- Local Delta tables (directories containing _delta_log)
- Cloud storage (S3, Azure Blob/ADLS via abfss://, GCS via gs://)
- Databricks (DBFS paths and Unity Catalog tables)
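The source kinds above can be told apart from the path alone. The sketch below is illustrative only: describe_delta_source is a hypothetical helper, not part of the library, and the s3:// and dbfs:/ schemes are assumed from common usage rather than stated in this guide.

```python
import os
import tempfile
from urllib.parse import urlparse

# Scheme -> label. abfss and gs come from the list above; s3 is assumed.
CLOUD_SCHEMES = {"s3": "S3", "abfss": "Azure ADLS", "gs": "GCS"}


def describe_delta_source(path: str) -> str:
    """Hypothetical helper: classify a path by URI scheme or on-disk layout."""
    scheme = urlparse(path).scheme
    if scheme in CLOUD_SCHEMES:
        return f"cloud storage ({CLOUD_SCHEMES[scheme]})"
    if scheme == "dbfs":
        return "Databricks DBFS"
    # A local Delta table is a directory that contains a _delta_log subdirectory.
    if os.path.isdir(os.path.join(path, "_delta_log")):
        return "local Delta table"
    return "unknown"


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "_delta_log"))
    local_kind = describe_delta_source(root)

# Placeholder bucket and table names.
s3_kind = describe_delta_source("s3://example-bucket/tables/instructions")
```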
Installation
Install the basic Delta Lake reader:
For Unity Catalog access outside of Spark (optional), install:
Local Delta Table
Databricks Unity Catalog
Use the delta:// prefix so the loader selects the Delta backend:
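As a rough illustration of prefix-based backend selection (not the loader's actual code), the sketch below splits an identifier into a backend name and the remainder. The function split_backend, the "default" label, and the three-part table name are all hypothetical. Whether the real loader strips the prefix this way is an assumption.

```python
from typing import Tuple

DELTA_PREFIX = "delta://"


def split_backend(path_or_dataset_id: str) -> Tuple[str, str]:
    """Hypothetical helper: route delta:// identifiers to a Delta backend."""
    if path_or_dataset_id.startswith(DELTA_PREFIX):
        return "delta", path_or_dataset_id[len(DELTA_PREFIX):]
    # Anything else is left to whatever non-Delta backend applies.
    return "default", path_or_dataset_id


# Unity Catalog tables use a catalog.schema.table name; this one is a placeholder.
backend, table = split_backend("delta://my_catalog.my_schema.instructions")
```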
Cloud Storage (S3/Azure/GCS)
YAML Configuration (Delta Lake/Databricks)
Streaming from a Delta SQL Query (Computed/Aliased Columns)
If you want to generate columns dynamically (joins, filters, computed prompt strings, etc.), pass a SQL query that returns the fields referenced by your column_mapping.
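The key constraint is that every field named in column_mapping must come back from the query, usually through an AS alias. The sketch below shows a quick pre-flight check on a sample row. The mapping, the query, and the table and column names are hypothetical placeholders, and the mapping direction (role to column name) is an assumption.

```python
from typing import Dict, List, Mapping

# Hypothetical mapping: logical role -> column name expected in each streamed row.
column_mapping: Dict[str, str] = {"question": "prompt", "answer": "response"}

# Hypothetical query; the table and source columns are placeholders.
delta_sql_query = """
SELECT
  CONCAT('Summarize: ', body) AS prompt,
  summary AS response
FROM my_catalog.my_schema.articles
WHERE summary IS NOT NULL
"""


def missing_fields(row: Mapping[str, object], mapping: Mapping[str, str]) -> List[str]:
    """Return mapped column names that are absent from a result row."""
    return sorted(col for col in mapping.values() if col not in row)


good_row = {"prompt": "Summarize: some text", "response": "a summary"}
bad_row = {"prompt": "Summarize: some text"}  # alias for response is missing

good_missing = missing_fields(good_row, column_mapping)
bad_missing = missing_fields(bad_row, column_mapping)
```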
SQL engine requirement: delta_sql_query is executed via Spark (Databricks runtime/pyspark) when available, otherwise via databricks-sql-connector. It is not supported in a deltalake-only environment.
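The fallback order described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the loader's implementation: is_importable and pick_sql_engine are hypothetical helpers. The probe relies only on the import names pyspark and databricks.sql (the module provided by databricks-sql-connector).

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. databricks) raises ModuleNotFoundError.
        return False


def pick_sql_engine(has_spark: bool, has_sql_connector: bool) -> str:
    """Hypothetical helper: prefer Spark, then databricks-sql-connector."""
    if has_spark:
        return "spark"
    if has_sql_connector:
        return "databricks-sql-connector"
    raise RuntimeError(
        "delta_sql_query needs Spark or databricks-sql-connector; "
        "a deltalake-only environment is not enough."
    )


# Probe the current environment; either flag may be False on a given machine.
has_spark = is_importable("pyspark")
has_sql_connector = is_importable("databricks.sql")
```

Calling pick_sql_engine(has_spark, has_sql_connector) before building the dataset surfaces a missing engine early instead of partway through a training run.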
Authentication: The Delta Lake loader automatically picks up credentials from environment variables (DATABRICKS_TOKEN, AWS_ACCESS_KEY_ID, AZURE_STORAGE_ACCOUNT_KEY, etc.) if not explicitly provided in delta_storage_options.
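The precedence rule (explicit options first, environment variables as fallback) can be sketched as follows. The environment variable names come from the paragraph above; the option key names and resolve_storage_options itself are hypothetical, so check the actual keys expected by delta_storage_options before relying on them.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical option keys mapped to the environment variables named above.
ENV_FALLBACKS = {
    "databricks_token": "DATABRICKS_TOKEN",
    "aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "azure_storage_account_key": "AZURE_STORAGE_ACCOUNT_KEY",
}


def resolve_storage_options(
    explicit: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Explicit values win; unset keys fall back to environment variables."""
    environ = os.environ if environ is None else environ
    resolved = dict(explicit or {})
    for option_key, env_var in ENV_FALLBACKS.items():
        if option_key not in resolved and environ.get(env_var):
            resolved[option_key] = environ[env_var]
    return resolved


# A fake environment keeps the example independent of the real process env.
fake_env = {"DATABRICKS_TOKEN": "env-token", "AWS_ACCESS_KEY_ID": "env-key"}
options = resolve_storage_options({"databricks_token": "explicit-token"}, environ=fake_env)
```

Passing credentials explicitly is useful when one process reads from several workspaces or accounts, since environment variables can only hold one value each.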