Interleaved Datasets

Curate interleaved image-text datasets in the format used by MINT-1T and similar large-scale multimodal corpora. Each sample is an ordered sequence of text, image, and metadata items keyed by a stable sample_id.
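
To make the sample model concrete, here is a minimal sketch of one interleaved sample as an ordered list of items sharing a sample_id. The `Item` class, its fields (`position`, `modality`, `text`, `image_ref`), and the `ordered_sample` helper are hypothetical names chosen for illustration; they are not NeMo Curator classes and do not reflect its actual schema.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical illustration only: these names are NOT part of NeMo Curator.
@dataclass
class Item:
    sample_id: str                    # stable key shared by every item in a sample
    position: int                     # order of the item within its sample
    modality: str                     # "text", "image", or "metadata"
    text: Optional[str] = None        # payload for text items
    image_ref: Optional[str] = None   # e.g. a file name inside a shard


def ordered_sample(items, sample_id):
    """Return the items of one sample, sorted into document order."""
    selected = [it for it in items if it.sample_id == sample_id]
    return sorted(selected, key=lambda it: it.position)


# Items may arrive in any order; sample_id and position restore the sequence.
items = [
    Item("doc-0", 1, "image", image_ref="doc-0.1.jpg"),
    Item("doc-0", 0, "text", text="A photo of a lighthouse:"),
    Item("doc-1", 0, "text", text="Unrelated sample."),
    Item("doc-0", 2, "text", text="Taken at dusk."),
]
sample = ordered_sample(items, "doc-0")
modalities = [it.modality for it in sample]
```

In this sketch the interleaving is recovered purely from the key and the position, which is why a stable sample_id matters when items are stored one per file or one per row.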

How it Works

NeMo Curator’s interleaved support is organized around three responsibilities:

  1. Storage formats: InterleavedBatch is the in-memory representation. On disk, samples can live as WebDataset tar shards (one tar per shard, one file per item) or as Parquet rows (one row per item, grouped by sample_id).
  2. IO round-trip: readers and writers exist for both formats, so any combination of WDS ↔ InterleavedBatch ↔ Parquet is supported. Schema utilities ensure reserved columns get canonical types and passthrough columns survive intact.
  3. Sample-level filters: drop samples by image sharpness, QR-code area ratio, CLIP image-text alignment, or image-to-text count ratio.
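
As a rough illustration of how a sample-level filter interacts with the one-row-per-item layout, the sketch below groups item rows by sample_id and drops whole samples whose image-to-text count ratio is too high. The row dictionaries, the `modality` key, the `filter_by_image_text_ratio` function, and the `max_ratio` parameter are all assumptions made for this example; the actual NeMo Curator stage may use different column names, thresholds, and edge-case rules.

```python
from collections import defaultdict

# Hypothetical item rows, one per item, mirroring a "one row per item" layout.
rows = [
    {"sample_id": "a", "modality": "text"},
    {"sample_id": "a", "modality": "image"},
    {"sample_id": "b", "modality": "image"},
    {"sample_id": "b", "modality": "image"},
    {"sample_id": "b", "modality": "image"},
    {"sample_id": "b", "modality": "text"},
    {"sample_id": "c", "modality": "image"},  # sample with no text at all
]


def filter_by_image_text_ratio(rows, max_ratio):
    """Keep every row of a sample only if images / texts <= max_ratio.

    Assumption for this sketch: samples without any text item are dropped.
    """
    counts = defaultdict(lambda: {"image": 0, "text": 0})
    for row in rows:
        if row["modality"] in ("image", "text"):
            counts[row["sample_id"]][row["modality"]] += 1

    keep = {
        sid
        for sid, c in counts.items()
        if c["text"] > 0 and c["image"] / c["text"] <= max_ratio
    }
    # The decision is per sample, so all rows of a kept sample survive together.
    return [row for row in rows if row["sample_id"] in keep]


kept = filter_by_image_text_ratio(rows, max_ratio=2.0)
kept_ids = sorted({row["sample_id"] for row in kept})
```

The point of the sketch is that the filter decides per sample rather than per row, so a surviving sample keeps all of its items and its ordering.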

Use the IO stages on their own for format conversion (e.g., curate-once, train-many), or chain them with the filter stages for a full curation pipeline.

Quick Example

Read interleaved Parquet, drop blurry images and low-CLIP-alignment samples, and write the survivors back to MINT-1T-style WebDataset shards:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.io.reader import InterleavedParquetReader
from nemo_curator.stages.interleaved.io.writers.webdataset import (
    InterleavedWebdatasetWriterStage,
)
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline = Pipeline(name="interleaved_curation")

# 1. Read interleaved Parquet
pipeline.add_stage(InterleavedParquetReader(file_paths="s3://bucket/interleaved/*.parquet"))

# 2. Drop blurry images
pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))

# 3. Drop low image-text alignment
pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(model_dir="/models/clip", min_score=0.2)
)

# 4. Write surviving samples to MINT-1T-style tar shards
pipeline.add_stage(InterleavedWebdatasetWriterStage(output_dir="./curated"))

executor = XennaExecutor()
pipeline.run(executor)
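
This page does not say how the blur filter computes its sharpness score. One common measure, shown here purely as an assumption about what such a score might look like, is the variance of the Laplacian: sharp images have strong edges, so the Laplacian response varies widely, while blurry or flat images give a low variance. The `laplacian_variance` function below is a hypothetical pure-Python sketch on a small grayscale grid, not the stage's implementation.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a 2-D list of grayscale intensities. This is an illustrative
    sharpness measure, assumed here; it is not taken from NeMo Curator.
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


# A uniform image has no edges; a checkerboard has an edge at every pixel.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]

flat_score = laplacian_variance(flat)
checker_score = laplacian_variance(checker)
```

Under this assumed measure, a threshold would keep images scoring above it and drop those below; the right value depends on image resolution and content, so it is worth tuning on a sample of your own data.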