Interleaved IO
Read and write interleaved image-text data between WebDataset tar shards and Parquet. All four combinations work out of the box:
- Parquet → Parquet
- Parquet → WDS tars
- WDS tars → Parquet
- WDS tars → WDS tars
Understanding the Round-Trip
Storage Formats
Interleaved samples can live in two on-disk formats: MINT-1T-style WebDataset tar shards and Parquet. In memory, both materialize as `InterleavedBatch`.
Choose based on the consumer: training loops that stream large datasets usually prefer WDS tars; analytics and sample-level inspection prefer Parquet.
When to Convert
A common workflow is “curate once in Parquet, train from tars”:
- Curate in Parquet — fast reads, easy filtering, column projection.
- Convert to WDS — once curation is done, write the final dataset as MINT-1T-style tars for the training loop.
The IO stages on this page support that workflow without intermediate copies; see Complete IO Pipeline Examples below for end-to-end sketches.
Readers
InterleavedParquetReader
Reads Parquet files into `InterleavedBatch`. Each row maps to one item (text, image, or metadata) with a `sample_id` that groups items into samples. For example, a sample consisting of a caption, an image, and a metadata record occupies three consecutive rows that share one `sample_id`.
Key behaviors:
- `fields=` passthrough: keep additional columns alongside the reserved interleaved schema. Missing columns null-fill in a single pass.
- Push-down column projection: the reader uses `pq.read_schema()` per file and asks Parquet for only the columns it needs.
- `max_batch_bytes` splitting: large file groups are split into multiple batches by accumulated size; per-split lineage is preserved on the output `source_files` field.
- Schema control: pass `schema=` for strict alignment to a target Arrow schema, or `schema_overrides=` for partial type overrides on top of `INTERLEAVED_SCHEMA`. Specifying both warns and prefers `schema=`.
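A minimal usage sketch. The import path and the positional input argument are assumptions; the class name and keyword parameters come from this page:

```python
import pyarrow as pa

# Hypothetical import path; adjust to wherever the stage lives in your install.
from my_curator.stages.io import InterleavedParquetReader

reader = InterleavedParquetReader(
    "/data/curated/*.parquet",                     # illustrative input glob
    fields=["source_url", "license"],              # passthrough columns to keep
    max_batch_bytes=512 * 1024 * 1024,             # split file groups into ~512 MB batches
    schema_overrides={"text": pa.large_string()},  # partial override on top of INTERLEAVED_SCHEMA
)
# Each emitted InterleavedBatch records its per-split lineage in `source_files`.
```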
InterleavedWebdatasetReader
Reads MINT-1T-style WebDataset tar shards. Same `fields=` and `max_batch_bytes` semantics as the Parquet reader.
Configurable file extensions for image, JSON, and text members (`image_extensions`, `json_extensions`, `texts_field`, `images_field`, `image_member_field`) let the reader cooperate with non-standard tar layouts.
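As with the Parquet reader, a hedged sketch; the import path, shard pattern, and the exact value shapes of the extension arguments are assumptions:

```python
# Hypothetical import path; parameter names are from this page.
from my_curator.stages.io import InterleavedWebdatasetReader

reader = InterleavedWebdatasetReader(
    "/data/mint1t/shard-{000..079}.tar",  # illustrative shard pattern
    fields=["source_url"],                # same passthrough semantics as the Parquet reader
    max_batch_bytes=512 * 1024 * 1024,
    image_extensions=[".jpg", ".png"],    # assumed list-of-extensions shape, for
    json_extensions=[".json"],            # non-standard tar layouts
)
```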
Reader Parameters

The parameters described above, grouped by where they apply:

| Parameter | Reader | Purpose |
|---|---|---|
| `fields=` | both | Passthrough columns kept alongside the reserved schema |
| `max_batch_bytes=` | both | Split large file groups into size-bounded batches |
| `schema=` / `schema_overrides=` | Parquet | Strict target schema / partial overrides on `INTERLEAVED_SCHEMA` |
| `image_extensions=` / `json_extensions=` | WDS | Tar member extensions recognized as image and JSON items |
| `texts_field=` / `images_field=` / `image_member_field=` | WDS | Field mapping for non-standard tar layouts |
| `sample_id_field=` | WDS | Field that receives the sample id decoded from tar member keys |
Writers
InterleavedParquetWriter
Writes `InterleavedBatch` to Parquet. Inherits `schema=` / `schema_overrides=` and the materialization-error policy (see below).
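A sketch of writer construction; the import path and the positional output argument are assumptions:

```python
# Hypothetical import path; the keyword parameters are described on this page.
from my_curator.stages.io import InterleavedParquetWriter

writer = InterleavedParquetWriter(
    "/data/out/parquet/",          # illustrative output location
    on_materialize_error="warn",   # "error" (default), "warn", or "drop_sample"; see below
)
```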
InterleavedWebdatasetWriterStage
Writes `InterleavedBatch` to MINT-1T-style WebDataset tar shards.
Implementation notes:
- `urllib.parse.quote(sample_id, safe="")` is used as the tar member key — injective and roundtrip-safe with `sample_id_field="sample_id"` on the matching reader.
- Single-pass `groupby`: rows are grouped by `sample_id` once instead of filtered per sample (O(n) instead of O(n × m)).
- Sparse positions: gaps in the position field are preserved as `None` entries; the WDS reader skips them on the way back.
- Supported modalities: `metadata`, `text`, and `image`. Any other modality raises `ValueError` at write time.
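The member-key encoding is plain `urllib.parse.quote`, so the round-trip property is easy to check directly:

```python
from urllib.parse import quote, unquote

sample_id = "crawl/2024-01/doc#42"
key = quote(sample_id, safe="")   # 'crawl%2F2024-01%2Fdoc%2342': no '/' survives,
                                  # so every key is a valid flat tar member name
assert unquote(key) == sample_id  # injective: decoding recovers the exact sample_id
```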
Schema Utilities
`utils/schema.py` ships shared helpers used by every Arrow-based reader and writer.
reconcile_schema()
Apply canonical types to reserved columns; preserve passthrough column types as-is; avoid unsafe large↔small downcasts; unwrap Parquet dictionary encoding from passthrough columns.
align_table()
Pad, reorder, and cast an Arrow table to a target schema. Reserved columns use `safe=False` for predictable casts; passthrough columns use `safe=True` so overflow surfaces rather than silently corrupting data. Does not re-reconcile the user-provided target.
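The `safe=` distinction is PyArrow's own cast option; a quick demonstration of why passthrough columns get `safe=True`:

```python
import pyarrow as pa

vals = pa.array([1, 2, 3_000_000_000], type=pa.int64())

# Unsafe cast (reserved columns): always succeeds, silently wrapping the overflow.
vals.cast(pa.int32(), safe=False)

# Safe cast (passthrough columns): the overflow surfaces as an error instead.
try:
    vals.cast(pa.int32(), safe=True)
except pa.ArrowInvalid as exc:
    print(exc)  # e.g. "Integer value 3000000000 not in range: ..."
```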
resolve_schema()
Merge `schema=` and `schema_overrides=` arguments with the canonical `INTERLEAVED_SCHEMA`; warn when both are provided; return `None` when neither is set. Used internally by readers and writers.
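How the precedence plays out, treating `resolve_schema` as a plain function (the exact call shape and import path are assumptions):

```python
import pyarrow as pa

# Hypothetical import path for the helpers described above.
from my_curator.utils.schema import INTERLEAVED_SCHEMA, resolve_schema

# Neither argument set: nothing to align against.
assert resolve_schema(schema=None, schema_overrides=None) is None

# Partial overrides: canonical INTERLEAVED_SCHEMA with one column retyped.
merged = resolve_schema(schema=None, schema_overrides={"text": pa.large_string()})

# Both set: warns and prefers the full schema= argument.
full = resolve_schema(schema=INTERLEAVED_SCHEMA, schema_overrides={"text": pa.large_string()})
```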
Materialization Error Policy
Writers expose `on_materialize_error=`, controlling what happens when a fetch step (or upstream stage) sets an error on a row:
- `"error"` (the default): raise, failing the pipeline.
- `"warn"`: log a warning and continue.
- `"drop_sample"`: drop every row of the affected sample.
The policy is applied after fetch, so it covers errors raised by the materialization step itself and by upstream stages.
Mixed-Backend Path Handling
The internal helper `_build_global_range_index()` groups paths by filesystem object, so a single batch that mixes S3 and local paths no longer fails silently — each backend’s range queries run against the right filesystem. This works automatically; you don’t need to configure it.
Performance
Benchmarks on 80 NVMe shards of MINT-1T PDF data (6,818 samples; 9.9 GB as WDS, 6.0 GB as Parquet), with an aspect-ratio filter applied, show Parquet-sourced paths running roughly 5× faster than WDS-sourced paths: tar parsing dominates the WDS read cost.
Complete IO Pipeline Examples
Curate-Once-in-Parquet, Train-from-Tars
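A sketch of the full flow. The `Pipeline` wiring, import paths, and the filter stage name (`AspectRatioFilter`) are assumptions; the reader/writer classes and their parameters come from this page:

```python
# Hypothetical imports and pipeline API; stage names/parameters are from this page.
from my_curator import Pipeline
from my_curator.stages.filters import AspectRatioFilter  # any InterleavedBatch filter
from my_curator.stages.io import (
    InterleavedParquetReader,
    InterleavedWebdatasetWriterStage,
)

pipeline = Pipeline(
    stages=[
        # 1. Curate in Parquet: fast reads, column projection, easy filtering.
        InterleavedParquetReader(
            "/data/curated/*.parquet",
            fields=["source_url", "license", "crawl_date"],  # keep provenance columns
            max_batch_bytes=512 * 1024 * 1024,
        ),
        AspectRatioFilter(),  # sample-level filtering on InterleavedBatch
        # 2. Ship in WDS: MINT-1T-style tars for the training loop.
        InterleavedWebdatasetWriterStage(
            "/data/train/tars/",
            on_materialize_error="error",  # production setting: raise on bad rows
        ),
    ]
)
pipeline.run()
```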
Format Conversion (No Filtering)
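And the straight conversion, with the same caveats about wiring:

```python
# Hypothetical wiring: WDS tars in, Parquet out, no filtering in between.
from my_curator import Pipeline
from my_curator.stages.io import InterleavedParquetWriter, InterleavedWebdatasetReader

Pipeline(
    stages=[
        InterleavedWebdatasetReader(
            "/data/mint1t/shard-{000..079}.tar",
            sample_id_field="sample_id",  # round-trips the writer's quoted member keys
        ),
        InterleavedParquetWriter("/data/mint1t-parquet/"),
    ]
).run()
```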
Best Practices
- Curate in Parquet, ship in WDS: Parquet is ~5× faster for filtering operations; convert to WDS only for the final training-loop format.
- Pass through provenance fields: list source-tracking columns (`source_url`, `license`, `crawl_date`) in `fields=` so they survive into the output. Missing the list silently drops them.
- Use `max_batch_bytes` for large shards: without splitting, a single 5 GB Parquet file becomes one giant batch. Set `max_batch_bytes=512 * 1024 * 1024` for memory-friendly batches.
- Pick the right `on_materialize_error`: `"error"` for production, `"warn"` for development, `"drop_sample"` for strict cleanliness. The default raises — set it explicitly when you want anything else.
Related Topics
- Interleaved Filters — sample-level quality filters that operate on `InterleavedBatch`.
- Nemotron-Parse PDF Pipeline — produces interleaved Parquet output from PDF inputs.