Read and write interleaved image-text data between WebDataset tar shards and Parquet. All four combinations work out of the box:
Interleaved samples can live in two on-disk formats; in memory both materialize as InterleavedBatch:
Choose based on the consumer: training loops that stream large datasets often prefer WDS tars; analytics or sample-level inspection prefers Parquet.
A common workflow is “curate once in Parquet, train from tars”:
The IO stages on this page support that workflow without intermediate copies.
InterleavedParquetReaderReads Parquet files into InterleavedBatch. Each row maps to one item (text, image, or metadata) with a sample_id that groups items into samples.
Key behaviors:
fields= passthrough: keep additional columns alongside the reserved interleaved schema. Missing columns null-fill in a single pass.pq.read_schema() per file and asks Parquet for only the columns it needs.max_batch_bytes splitting: large file groups are split into multiple batches by accumulated size; per-split lineage is preserved on the output source_files field.schema= for strict alignment to a target Arrow schema, or schema_overrides= for partial type overrides on top of INTERLEAVED_SCHEMA. Specifying both warns and prefers schema=.InterleavedWebdatasetReaderReads MINT-1T-style WebDataset tar shards. Same fields= and max_batch_bytes semantics as the Parquet reader.
Configurable file extensions for image, JSON, and texts members (image_extensions, json_extensions, texts_field, images_field, image_member_field) let the reader cooperate with non-standard tar layouts.
InterleavedParquetWriterWrites InterleavedBatch to Parquet. Inherits schema= / schema_overrides= and the materialization-error policy (see below).
InterleavedWebdatasetWriterStageWrites InterleavedBatch to MINT-1T-style WebDataset tar shards.
Implementation notes:
urllib.parse.quote(sample_id, safe="") is used as the tar member key — injective and roundtrip-safe with sample_id_field="sample_id" on the matching reader.groupby: rows are grouped by sample_id once instead of filtered per sample (O(n) instead of O(n × m)).None entries; the WDS reader skips them on the way back.metadata, text, and image. Any other modality raises ValueError at write time.utils/schema.py ships shared helpers used by every arrow-based reader and writer.
reconcile_schema()Apply canonical types to reserved columns; preserve passthrough column types as-is; avoid unsafe large↔small downcasts; unwrap Parquet dictionary encoding from passthrough columns.
align_table()Pad, reorder, and cast an Arrow table to a target schema. Reserved columns use safe=False for predictable casts; passthrough columns use safe=True so overflow surfaces rather than silently corrupting data. Does not re-reconcile the user-provided target.
resolve_schema()Merge schema= and schema_overrides= arguments with the canonical INTERLEAVED_SCHEMA; warn when both are provided; return None when neither is set. Used internally by readers and writers.
Writers expose on_materialize_error=, controlling what happens when a fetch step (or upstream stage) sets an error on a row:
The policy is applied after fetch, so it covers errors raised by the materialization step itself and by upstream stages.
The internal helper _build_global_range_index() groups paths by filesystem object so a single batch that mixes S3 and local paths no longer fails silently — each backend’s range queries run against the right filesystem. This works automatically; you don’t need to configure it.
Benchmarks on 80 NVMe shards of MINT-1T PDF data (6,818 samples, 9.9 GB WDS / 6.0 GB Parquet) with an aspect-ratio filter applied:
Parquet-sourced paths are roughly 5× faster than WDS-sourced paths because tar parsing dominates the WDS read cost.
source_url, license, crawl_date) in fields= so they survive into the output. Missing the list silently drops them.max_batch_bytes for large shards: without splitting, a single 5 GB Parquet file becomes one giant batch. Set max_batch_bytes=512 * 1024 * 1024 for memory-friendly batches.on_materialize_error: "error" for production, "warn" for development, "drop_sample" for strict cleanliness. The default raises — set it explicitly when you want anything else.InterleavedBatch.