Overview | NeMo Curator

Curate interleaved image-text datasets in the format used by MINT-1T and similar large-scale multimodal corpora. Each sample is an ordered sequence of text, image, and metadata items keyed by a stable sample_id.

How it Works

NeMo Curator’s interleaved support is organized around three responsibilities:

Storage formats — InterleavedBatch is the in-memory representation. On disk, samples can live as WebDataset tar shards (one tar per shard, one file per item) or as Parquet rows (one row per item, grouped by sample_id).
IO round-trip — readers and writers exist for both formats, so any combination of WDS ↔ InterleavedBatch ↔ Parquet is supported. Schema utilities ensure reserved columns get canonical types and passthrough columns survive intact.
Sample-level filters — drop samples by image sharpness, QR-code area ratio, CLIP image-text alignment, or image-to-text count ratio.
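
As a rough illustration of the one-row-per-item layout, the sketch below models item rows as plain Python dictionaries and regroups them into ordered samples. Only sample_id is named on this page; the position, modality, and content keys are hypothetical names chosen for the example and are not NeMo Curator's actual schema.

```python
from collections import defaultdict

# Hypothetical item rows: one row per item, as in the Parquet layout.
# Only "sample_id" is named on this page; the other keys are illustrative.
rows = [
    {"sample_id": "doc-b", "position": 0, "modality": "text", "content": "Intro"},
    {"sample_id": "doc-a", "position": 1, "modality": "image", "content": "a1.jpg"},
    {"sample_id": "doc-a", "position": 0, "modality": "text", "content": "Caption"},
    {"sample_id": "doc-a", "position": 2, "modality": "metadata", "content": "{}"},
]


def group_rows_by_sample(item_rows):
    """Group item rows by sample_id and restore each sample's item order."""
    grouped = defaultdict(list)
    for row in item_rows:
        grouped[row["sample_id"]].append(row)
    # Sort the items inside each sample by the (assumed) position column.
    return {
        sample_id: sorted(items, key=lambda r: r["position"])
        for sample_id, items in grouped.items()
    }


samples = group_rows_by_sample(rows)
for sample_id, items in sorted(samples.items()):
    print(sample_id, [item["modality"] for item in items])
```

Rows may arrive in any order; grouping on the stable sample_id and sorting within each group is what recovers the ordered sequence.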

Use the IO stages on their own for format conversion (e.g., curate-once, train-many), or chain them with the filter stages for a full curation pipeline.
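
To make the idea of chaining sample-level filters concrete, here is a minimal sketch in plain Python. It does not use the NeMo Curator stages. The max_ratio parameter, the dictionary keys, and the helper names are assumptions made for illustration; the library's own filter parameters may differ.

```python
def image_text_ratio(sample):
    """Return images per text item; infinity if there are images but no text."""
    n_images = sum(1 for item in sample if item["modality"] == "image")
    n_text = sum(1 for item in sample if item["modality"] == "text")
    if n_text == 0:
        return float("inf") if n_images else 0.0
    return n_images / n_text


def ratio_ok(sample, max_ratio=2.0):
    """Hypothetical predicate: keep samples at or below max_ratio."""
    return image_text_ratio(sample) <= max_ratio


def has_text(sample):
    """Hypothetical predicate: keep samples with at least one text item."""
    return any(item["modality"] == "text" for item in sample)


def run_filters(samples, predicates):
    """Keep a whole sample only if every predicate accepts it."""
    return {
        sample_id: items
        for sample_id, items in samples.items()
        if all(predicate(items) for predicate in predicates)
    }


samples = {
    "balanced": [{"modality": "text"}, {"modality": "image"}],
    "image_heavy": [{"modality": "text"}] + [{"modality": "image"}] * 3,
    "images_only": [{"modality": "image"}],
}

kept = run_filters(samples, [has_text, ratio_ok])
print(sorted(kept))
```

In this sketch each predicate sees the full sample and the decision applies to the sample as a unit, which is the simplest reading of "sample-level".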

Pages

Interleaved IO

Round-trip readers and writers between WebDataset tar shards and Parquet, plus shared schema utilities.
Tags: parquet, webdataset, schema

Interleaved Filters

Sample-level filter stages for image quality, QR-code detection, CLIP image-text alignment, and image-to-text ratio.
Tags: blur, clip, qr-detection

Quick Example

Read interleaved Parquet, drop blurry images and low-CLIP-alignment samples, and write the survivors back to MINT-1T-style WebDataset shards:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.io.reader import InterleavedParquetReader
from nemo_curator.stages.interleaved.io.writers.webdataset import (
    InterleavedWebdatasetWriterStage,
)
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline = Pipeline(name="interleaved_curation")

# 1. Read interleaved Parquet
pipeline.add_stage(InterleavedParquetReader(file_paths="s3://bucket/interleaved/*.parquet"))

# 2. Drop blurry images
pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))

# 3. Drop low image-text alignment
pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(model_dir="/models/clip", min_score=0.2)
)

# 4. Write surviving samples to MINT-1T-style tar shards
pipeline.add_stage(InterleavedWebdatasetWriterStage(output_dir="./curated"))

executor = XennaExecutor()
pipeline.run(executor)
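
This page does not say how the blur filter computes its score. One common sharpness measure is the variance of the Laplacian, and the sketch below illustrates that measure in pure Python to show how a value such as score_threshold=100.0 could separate flat images from detailed ones. Treat it as an assumption about the general technique, not as the stage's actual implementation; the function names and the tiny test images are hypothetical.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of rows of grayscale intensities (at least 3x3).
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            laplacian = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(laplacian)
    return pvariance(responses)


def is_sharp(gray, score_threshold=100.0):
    """Assumed convention: higher score means sharper; keep at or above threshold."""
    return laplacian_variance(gray) >= score_threshold


# A uniform image has no edges; a checkerboard is all edges.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]

print(is_sharp(flat), is_sharp(checker))
```

A uniform image yields a Laplacian of zero everywhere, so its variance is zero and it falls below any positive threshold, while strong local contrast drives the variance up.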

Related Topics

Nemotron-Parse PDF Pipeline: converts PDFs into interleaved Parquet using the Nemotron-Parse VLM.
Common Crawl: fetches web data; pair it with interleaved processing for image-text crawls.
