Copyright © 2026, NVIDIA Corporation.

Interleaved Datasets

Curate interleaved image-text datasets in the format used by MINT-1T and similar large-scale multimodal corpora. Each sample is an ordered sequence of text, image, and metadata items keyed by a stable sample_id.

How it Works

NeMo Curator’s interleaved support is organized around three responsibilities:

  1. Storage formats — InterleavedBatch is the in-memory representation. On disk, samples can live as WebDataset tar shards (one tar per shard, one file per item) or as Parquet rows (one row per item, grouped by sample_id).
  2. IO round-trip — readers and writers exist for both formats, so any combination of WDS ↔ InterleavedBatch ↔ Parquet is supported. Schema utilities ensure reserved columns get canonical types and passthrough columns survive intact.
  3. Sample-level filters — drop samples by image sharpness, QR-code area ratio, CLIP image-text alignment, or image-to-text count ratio.
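
The one-row-per-item layout described above can be illustrated with a small sketch. It uses plain Python dictionaries rather than NeMo Curator's own classes, and the column names (`position`, `modality`, `content`) are hypothetical; only `sample_id` comes from the text above.

```python
from collections import defaultdict

# Hypothetical per-item rows; apart from sample_id, the column names are
# illustrative assumptions, not the library's actual schema.
rows = [
    {"sample_id": "doc-1", "position": 1, "modality": "image", "content": "img_0.jpg"},
    {"sample_id": "doc-2", "position": 0, "modality": "text", "content": "Second document."},
    {"sample_id": "doc-1", "position": 0, "modality": "text", "content": "A caption."},
    {"sample_id": "doc-1", "position": 2, "modality": "metadata", "content": '{"source": "example"}'},
]


def group_into_samples(rows):
    """Group per-item rows by sample_id and restore each sample's item order."""
    samples = defaultdict(list)
    for row in rows:
        samples[row["sample_id"]].append(row)
    return {
        sample_id: sorted(items, key=lambda r: r["position"])
        for sample_id, items in samples.items()
    }


samples = group_into_samples(rows)
doc1_modalities = [item["modality"] for item in samples["doc-1"]]
print(doc1_modalities)  # ['text', 'image', 'metadata']
```

Keeping a stable `sample_id` on every row is what lets the items of one sample be stored in any order and still be reassembled into the original sequence.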

Use the IO stages on their own for format conversion (e.g., curate-once, train-many), or chain them with the filter stages for a full curation pipeline.
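
As a rough illustration of what "sample-level" means for a filter, the sketch below keeps or drops a whole sample based on its image-to-text count ratio. The function names, the `max_ratio` parameter, and its default value are hypothetical and are not taken from the NeMo Curator API.

```python
def image_to_text_ratio(modalities):
    """Return the number of image items divided by the number of text items."""
    images = modalities.count("image")
    texts = modalities.count("text")
    if texts == 0:
        # No text at all: treat the ratio as unbounded if any image exists.
        return float("inf") if images else 0.0
    return images / texts


def keep_sample(modalities, max_ratio=2.0):
    """Keep the whole sample only if its ratio does not exceed max_ratio (assumed rule)."""
    return image_to_text_ratio(modalities) <= max_ratio


balanced = ["text", "image", "metadata"]
image_heavy = ["text", "image", "image", "image"]
print(keep_sample(balanced), keep_sample(image_heavy))
```

The decision applies to the sample as a unit: either every item of the sample survives or none does.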

Pages

Interleaved IO

Round-trip readers and writers between WebDataset tar shards and Parquet, plus shared schema utilities. (Tags: parquet, webdataset, schema)

Interleaved Filters

Sample-level filter stages for image quality, QR-code detection, CLIP image-text alignment, and image-to-text ratio. (Tags: blur, clip, qr-detection)

Quick Example

Read interleaved Parquet, drop blurry images and low-CLIP-alignment samples, and write the survivors back to MINT-1T-style WebDataset shards:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.io.reader import InterleavedParquetReader
from nemo_curator.stages.interleaved.io.writers.webdataset import (
    InterleavedWebdatasetWriterStage,
)
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline = Pipeline(name="interleaved_curation")

# 1. Read interleaved Parquet
pipeline.add_stage(InterleavedParquetReader(file_paths="s3://bucket/interleaved/*.parquet"))

# 2. Drop blurry images
pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))

# 3. Drop low image-text alignment
pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(model_dir="/models/clip", min_score=0.2)
)

# 4. Write surviving samples to MINT-1T-style tar shards
pipeline.add_stage(InterleavedWebdatasetWriterStage(output_dir="./curated"))

executor = XennaExecutor()
pipeline.run(executor)
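
To give a feel for the two thresholds in the example, here is a minimal standalone sketch of two commonly used scores: the variance of the Laplacian as a sharpness measure, and cosine similarity between embedding vectors as an alignment measure. Whether `InterleavedBlurFilterStage` and `InterleavedCLIPScoreFilterStage` compute exactly these quantities is an assumption; the helper names and toy inputs are hypothetical.

```python
import math


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior of a 2D grayscale grid."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy images: a uniform patch has no edges, a checkerboard has many.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]

# Assumed rule: images scoring below the threshold count as blurry.
SHARPNESS_THRESHOLD = 100.0
print(laplacian_variance(flat) >= SHARPNESS_THRESHOLD)     # False
print(laplacian_variance(checker) >= SHARPNESS_THRESHOLD)  # True

# Toy stand-ins for image and text embeddings (real ones come from a CLIP model).
MIN_ALIGNMENT = 0.2
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]) >= MIN_ALIGNMENT)  # True
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]) >= MIN_ALIGNMENT)  # False
```

Both scores are cheap to compare against a fixed cutoff, which is why threshold parameters such as `score_threshold` and `min_score` are a natural interface for this kind of filter.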

Related Topics

  • Nemotron-Parse PDF Pipeline — converts PDFs into interleaved Parquet using the Nemotron-Parse VLM.
  • Common Crawl — fetch web data; pair with interleaved processing for image-text crawls.