
Interleaved Filters

Drop low-quality samples from an InterleavedBatch before downstream training or further processing. Four filter stages are available, each targeting a different quality signal.

Understanding the Filters

What Each Filter Does

| Filter | Targets | Cost | Default Threshold |
|---|---|---|---|
| InterleavedBlurFilterStage | Out-of-focus / motion-blurred images | Cheap (CPU) | score_threshold=100.0 |
| InterleavedQRCodeFilterStage | Promotional / contact-info imagery | Cheap (CPU) | score_threshold=0.05 |
| InterleavedImageToTextRatioFilterStage | Image-dump or text-dump samples | Cheap (CPU) | No filtering by default |
| InterleavedCLIPScoreFilterStage | Misaligned image-text pairs | Expensive (GPU, ~20 GB) | min_score=0.15 |

Chain cheap filters first to reduce the number of samples expensive filters have to score. A typical order:

Blur → QR-code → Image-to-Text Ratio → CLIP Score

The CLIP filter dominates cost. Putting it last means it only runs against samples that survived all other checks.

InterleavedBlurFilterStage

Computes the Laplacian variance of each image (via OpenCV) and drops images below score_threshold. Lower variance means a flatter, blurrier image.
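The score itself is easy to reproduce. Below is a minimal NumPy sketch of the same computation; the stage itself uses OpenCV (cv2.Laplacian followed by .var()), so the hand-rolled convolution here is for illustration only:

```python
import numpy as np

# 3x3 Laplacian kernel, the same one cv2.Laplacian applies by default.
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the Laplacian response; low values indicate blur."""
    g = gray.astype(np.float64)
    h, w = g.shape
    # Valid 2-D convolution (the kernel is symmetric, so correlation == convolution).
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * g[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())

rng = np.random.default_rng(0)
flat = np.full((64, 64), 128, dtype=np.uint8)           # featureless: maximally "blurry"
noisy = rng.integers(0, 256, (64, 64), dtype=np.uint8)  # high-detail stand-in for a sharp image

print(laplacian_variance(flat))   # 0.0: no detail, dropped at any positive threshold
print(laplacian_variance(noisy))  # large variance, kept
```

A featureless image scores exactly zero, which is why even the permissive thresholds in the table below reliably drop solid-color or heavily blurred images.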

Threshold Guidelines

| score_threshold | Effect |
|---|---|
| 50 | Permissive — drops only severely blurred images |
| 100 (default) | Balanced — typical for general curation |
| 200 | Strict — keeps only sharp, high-detail images |
| 500+ | Very strict — useful for studio-quality requirements |
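Rather than guessing a threshold, you can read one off the score distribution: run the stage permissively, dump the Laplacian scores, and pick a percentile. A sketch with synthetic scores standing in for a real score manifest:

```python
import numpy as np

# Hypothetical blur scores dumped from a permissive first pass.
rng = np.random.default_rng(0)
scores = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)

# Decide to drop the blurriest 10% and read the threshold off the 10th percentile.
threshold = float(np.percentile(scores, 10))
kept = scores[scores >= threshold]
print(len(kept))  # 9000: the top 90% of scores survive
```

The same percentile trick applies to every filter in this document that exposes a scalar score.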

Usage

```python
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage

pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| score_threshold | float | 100.0 | Minimum Laplacian variance to keep an image. Higher = sharper-only output. |

InterleavedQRCodeFilterStage

Detects QR codes in each image (OpenCV) and drops samples whose largest QR-code bounding box exceeds score_threshold as a fraction of the total image area. Useful for stripping promotional or contact-info imagery from web crawls.
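The area check is simple geometry once the detector returns corner points. A sketch of the fraction computation, with hypothetical corners standing in for a real cv2.QRCodeDetector detection:

```python
import numpy as np

def qr_area_fraction(corners: np.ndarray, image_shape: tuple[int, int]) -> float:
    """Axis-aligned bounding-box area of a detected QR code,
    as a fraction of the full image area."""
    xs, ys = corners[:, 0], corners[:, 1]
    box_area = (xs.max() - xs.min()) * (ys.max() - ys.min())
    h, w = image_shape
    return float(box_area / (h * w))

# Hypothetical detection: a 200x200 px QR code inside a 1000x1000 px image.
corners = np.array([[100, 100], [300, 100], [300, 300], [100, 300]], dtype=np.float64)
frac = qr_area_fraction(corners, (1000, 1000))
print(frac)  # 0.04, below the default score_threshold=0.05, so the sample is kept
```

Because the score is an area fraction, a small QR code in a large image passes even strict thresholds, while a near-fullscreen code fails even permissive ones.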

Threshold Guidelines

| score_threshold | Effect |
|---|---|
| 0.01 | Very strict — drops anything with even a small QR code |
| 0.05 (default) | Balanced — drops images dominated by QR codes |
| 0.1+ | Permissive — only drops near-fullscreen QR codes |

Usage

```python
from nemo_curator.stages.interleaved.filter.qrcode_filter import InterleavedQRCodeFilterStage

pipeline.add_stage(InterleavedQRCodeFilterStage(score_threshold=0.05))
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| score_threshold | float | 0.05 | Maximum allowed QR-bounding-box area as a fraction of total image area. Lower = stricter. |

InterleavedImageToTextRatioFilterStage

Computes the per-sample ratio of image count to text word count and drops samples outside a configurable min_ratio/max_ratio window. Useful for excluding image-dump samples (no text) and text-heavy samples with few or no images.

Range Guidelines

For mixed image-text pretraining data:

| Use Case | min_ratio | max_ratio |
|---|---|---|
| Balanced multimodal | 0.001 | 0.1 |
| Image-rich (caption-style) | 0.01 | 0.5 |
| Text-rich (article-style) | 0.0001 | 0.01 |

A min_ratio=0.001 means “at least one image per 1000 text words.” A max_ratio=0.1 means “no more than one image per 10 text words.”
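The arithmetic is easy to sanity-check. A sketch, assuming words are counted by whitespace splitting (the stage's exact tokenization may differ):

```python
def image_to_text_ratio(num_images: int, text: str) -> float:
    """Images per text word; guard against empty text."""
    words = text.split()
    return num_images / max(len(words), 1)

# Hypothetical sample: 2 images interleaved with 500 words of text.
sample_text = " ".join(["word"] * 500)
ratio = image_to_text_ratio(2, sample_text)
print(ratio)  # 0.004: inside the balanced multimodal window [0.001, 0.1]
```

At the balanced defaults, this sample is kept; a sample with the same two images but only 10 words (ratio 0.2) would exceed max_ratio=0.1 and be dropped.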

Usage

```python
from nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter import (
    InterleavedImageToTextRatioFilterStage,
)

pipeline.add_stage(
    InterleavedImageToTextRatioFilterStage(min_ratio=0.001, max_ratio=0.1)
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| min_ratio | float | 0.0 | Minimum images-per-word ratio to keep. |
| max_ratio | float | inf | Maximum images-per-word ratio to keep. |

InterleavedCLIPScoreFilterStage

Uses a CLIP image-text encoder to score each (image, text) pair by cosine similarity, and drops samples whose alignment is below min_score. Ensures the textual content of each sample actually describes the image.
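The alignment score is the cosine similarity between the image and text embeddings. A NumPy sketch of that scoring step, with synthetic vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalised embeddings."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

rng = np.random.default_rng(0)
img = rng.normal(size=512)
aligned = img + 0.1 * rng.normal(size=512)  # text embedding close to the image's
unrelated = rng.normal(size=512)            # independent, unrelated text embedding

# A well-aligned pair scores near 1.0; an unrelated pair scores near 0.
print(clip_score(img, aligned))
print(clip_score(img, unrelated))
```

Real CLIP scores for genuine image-text pairs cluster far below 1.0 (hence the 0.15 default), because the image and text encoders embed into the shared space from different modalities.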

Threshold Guidelines

CLIP scores depend on the model variant; the table below assumes the default model:

| min_score | Effect |
|---|---|
| 0.10 | Permissive — keeps loosely aligned pairs (web crawl baseline) |
| 0.15 (default) | Balanced — drops mostly unrelated text-image pairs |
| 0.20 | Stricter — high-quality alignment for caption-style data |
| 0.30+ | Very strict — for caption datasets with manual review |

Usage

```python
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(
        model_dir="/models/clip",
        min_score=0.15,
    )
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_dir | str \| None | None | Local CLIP model directory. When None, the default model is downloaded on first use. |
| min_score | float | 0.15 | Minimum cosine similarity to keep a sample. |

The default resource allocation reserves gpu_memory_gb=20.0. Tune this on the stage’s Resources for smaller or larger CLIP variants.

Complete Filtering Pipeline

A pipeline that stacks all four filters in cost order, then writes the result as WebDataset shards:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.io.reader import InterleavedParquetReader
from nemo_curator.stages.interleaved.io.writers.webdataset import (
    InterleavedWebdatasetWriterStage,
)
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage
from nemo_curator.stages.interleaved.filter.qrcode_filter import InterleavedQRCodeFilterStage
from nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter import (
    InterleavedImageToTextRatioFilterStage,
)
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline = Pipeline(name="interleaved_quality_filters")

# 1. Read interleaved Parquet
pipeline.add_stage(
    InterleavedParquetReader(file_paths="s3://bucket/raw/*.parquet")
)

# 2. Cheap filters first (CPU only)
pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))
pipeline.add_stage(InterleavedQRCodeFilterStage(score_threshold=0.05))
pipeline.add_stage(
    InterleavedImageToTextRatioFilterStage(min_ratio=0.001, max_ratio=0.1)
)

# 3. Expensive CLIP filter last (GPU)
pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(
        model_dir="/models/clip",
        min_score=0.20,
    )
)

# 4. Write filtered output
pipeline.add_stage(InterleavedWebdatasetWriterStage(output_dir="./curated"))

executor = XennaExecutor()
pipeline.run(executor)
```

Best Practices

  • Filter cheap to expensive: blur → QR-code → image-to-text ratio → CLIP. Each early filter reduces what the next has to score.
  • Inspect score distributions before tightening thresholds: run each filter with permissive thresholds first, dump scores to a manifest, plot the distributions, and pick thresholds from percentiles. Defaults are starting points, not final answers.
  • Don’t use CLIP without a budget: CLIP scoring is the most expensive filter by orders of magnitude. If your dataset is millions of samples, it’s still feasible but plan compute accordingly.
  • Avoid stacking redundant filters: blur and CLIP-score both penalize bad images, but in different ways. Blur catches optical issues; CLIP catches semantic mismatch. Use both, but tune separately.
  • Mind the image-to-text ratio for data composition: this filter is the easiest one to misuse — set the wrong max_ratio and you’ll silently drop most caption-style data. Inspect a sample of dropped vs kept first.

Related Topics

  • Interleaved IO — readers and writers that produce and consume the InterleavedBatch format these filters operate on.
  • Nemotron-Parse PDF Pipeline — one source of interleaved data; pair with these filters for end-to-end PDF curation.