Curate interleaved image-text datasets in the format used by MINT-1T and similar large-scale multimodal corpora. Each sample is an ordered sequence of text, image, and metadata items keyed by a stable sample_id.
NeMo Curator’s interleaved support is organized around three responsibilities:
InterleavedBatch is the in-memory representation. On disk, samples can live as WebDataset tar shards (one tar per shard, one file per item) or as Parquet rows (one row per item, grouped by sample_id).
Round-trip conversion is supported: WDS ↔ InterleavedBatch ↔ Parquet. Schema utilities ensure that reserved columns get canonical types and that passthrough columns survive intact.
Use the IO stages on their own for format conversion (e.g., curate once, train many), or chain them with the filter stages for a full curation pipeline.
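To make the two on-disk layouts concrete, here is a minimal, library-independent sketch using only the Python standard library. It does not use NeMo Curator's readers or writers. The column names (`position`, `modality`, `payload`), the member naming scheme, and the helper functions are assumptions made for illustration, not the library's reserved schema. The sketch stores one row per item, writes one tar member per item, and reads the shard back into the same rows.

```python
import io
import tarfile
from collections import defaultdict

# Hypothetical extension mapping, chosen for this sketch only.
EXT = {"text": "txt", "image": "png", "metadata": "json"}
MODALITY = {ext: modality for modality, ext in EXT.items()}


def group_by_sample(rows):
    """Group item rows by sample_id and order each sample by position."""
    samples = defaultdict(list)
    for row in rows:
        samples[row["sample_id"]].append(row)
    return {
        sid: sorted(items, key=lambda r: r["position"])
        for sid, items in samples.items()
    }


def rows_to_tar(rows):
    """Write one tar member per item; the sample_id is the part before the first dot."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for sid, items in sorted(group_by_sample(rows).items()):
            for item in items:
                name = f"{sid}.{item['position']:03d}.{EXT[item['modality']]}"
                info = tarfile.TarInfo(name=name)
                info.size = len(item["payload"])
                tar.addfile(info, io.BytesIO(item["payload"]))
    return buf.getvalue()


def tar_to_rows(data):
    """Read the shard back into one row per item (sample_id must not contain dots)."""
    rows = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        for member in tar.getmembers():
            sid, position, ext = member.name.split(".", 2)
            rows.append({
                "sample_id": sid,
                "position": int(position),
                "modality": MODALITY[ext],
                "payload": tar.extractfile(member).read(),
            })
    return rows


# Placeholder payloads; real image items would carry encoded image bytes.
rows = [
    {"sample_id": "doc-0002", "position": 0, "modality": "text", "payload": b"Only text."},
    {"sample_id": "doc-0001", "position": 1, "modality": "image", "payload": b"<image bytes>"},
    {"sample_id": "doc-0001", "position": 0, "modality": "text", "payload": b"A cat on a sofa."},
    {"sample_id": "doc-0001", "position": 2, "modality": "metadata", "payload": b"{}"},
]

shard = rows_to_tar(rows)
restored = tar_to_rows(shard)
```

Sorting by `sample_id` and then `position` before writing keeps all items of a sample adjacent in the tar, which is what lets a streaming reader reassemble a sample without buffering the whole shard.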
Round-trip readers and writers between WebDataset tar shards and Parquet, plus shared schema utilities. Tags: parquet, webdataset, schema.
Sample-level filter stages for image quality, QR-code detection, CLIP image-text alignment, and image-to-text ratio. Tags: blur, clip, qr-detection.
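As a rough picture of what blur and alignment filtering decide, the following standard-library sketch scores image sharpness with the variance of a Laplacian response, a common sharpness proxy that is not necessarily what the library's blur filter computes. The alignment scores are assumed to be precomputed per sample, for example by a CLIP model that is not run here. The function names, the `pixels` field, and the thresholds are all hypothetical.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior of a 2D grid.

    Low values suggest few sharp edges, i.e. a blurry or flat image.
    Expects a list of equal-length rows, at least 3x3.
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def filter_rows(rows, alignment, min_sharpness, min_alignment):
    """Drop whole samples with low alignment, and blurry images from the rest."""
    kept = []
    for row in rows:
        if alignment[row["sample_id"]] < min_alignment:
            continue  # sample-level decision: every item of the sample goes
        if (row["modality"] == "image"
                and laplacian_variance(row["pixels"]) < min_sharpness):
            continue  # item-level decision: only this image goes
        kept.append(row)
    return kept


# Toy grayscale grids: a checkerboard (many edges) and a flat patch (no edges).
SHARP = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]
FLAT = [[10] * 4 for _ in range(4)]

rows = [
    {"sample_id": "s1", "modality": "text", "text": "A crisp photo."},
    {"sample_id": "s1", "modality": "image", "pixels": SHARP},
    {"sample_id": "s2", "modality": "text", "text": "A smooth wall."},
    {"sample_id": "s2", "modality": "image", "pixels": FLAT},
    {"sample_id": "s3", "modality": "text", "text": "Unrelated caption."},
    {"sample_id": "s3", "modality": "image", "pixels": SHARP},
]

# Hypothetical precomputed image-text similarity per sample.
alignment = {"s1": 0.31, "s2": 0.30, "s3": 0.05}

survivors = filter_rows(rows, alignment, min_sharpness=100.0, min_alignment=0.2)
```

In this toy run, sample `s1` survives intact, `s2` keeps its text but loses the flat image, and `s3` is dropped entirely because its alignment score falls below the threshold.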
Read interleaved Parquet, drop blurry images and low-CLIP-alignment samples, and write the survivors back to MINT-1T-style WebDataset shards: