Interleaved Filters
Drop low-quality samples from an InterleavedBatch before downstream training or further processing. Four filter stages are available, each targeting a different quality signal.
Understanding the Filters
What Each Filter Does
Recommended Filter Order
Chain cheap filters first to reduce the number of samples expensive filters have to score. A typical order: blur → QR-code → image-to-text ratio → CLIP score.
The CLIP filter dominates cost. Putting it last means it only runs against samples that survived all other checks.
InterleavedBlurFilterStage
Computes the Laplacian variance of each image (via OpenCV) and drops images below score_threshold. Lower variance means a flatter, blurrier image.
Threshold Guidelines
Usage
InterleavedQRCodeFilterStage
Detects QR codes in each image (OpenCV) and drops samples whose largest QR-code bounding box exceeds score_threshold as a fraction of the total image area. Useful for stripping promotional or contact-info imagery from web crawls.
Threshold Guidelines
Usage
InterleavedImageToTextRatioFilterStage
Computes the per-sample ratio of image count to text word count and drops samples outside a configurable min_ratio/max_ratio window. Useful for excluding image-dump samples (no text) and text-heavy samples with few or no images.
Range Guidelines
For mixed image-text pretraining data:
A min_ratio=0.001 means “at least one image per 1000 text words.” A max_ratio=0.1 means “no more than one image per 10 text words.”
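The ratio check reduces to plain arithmetic. A sketch using the example window from the sentence above (function names are illustrative):

```python
def image_to_text_ratio(num_images: int, text: str) -> float:
    """Images per text word; inf when the sample has no text at all."""
    words = len(text.split())
    return float("inf") if words == 0 else num_images / words

def keep(num_images: int, text: str,
         min_ratio: float = 0.001, max_ratio: float = 0.1) -> bool:
    return min_ratio <= image_to_text_ratio(num_images, text) <= max_ratio

print(keep(1, "word " * 500))           # 1 image per 500 words -> kept
print(keep(0, "text only, no images"))  # ratio 0 -> dropped
```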
Usage
InterleavedCLIPScoreFilterStage
Uses a CLIP image-text encoder to score each (image, text) pair by cosine similarity and drops samples whose alignment is below min_score. This ensures the textual content of each sample actually describes the image.
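The scoring step reduces to cosine similarity between L2-normalized embeddings. A minimal sketch with placeholder embeddings, which in the real stage come from the CLIP image and text encoders (the min_score default below is illustrative only):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalized image and text embeddings."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(i @ t)

def keep(image_emb: np.ndarray, text_emb: np.ndarray,
         min_score: float = 0.2) -> bool:
    return clip_score(image_emb, text_emb) >= min_score
```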
Threshold Guidelines
CLIP scores depend on the model variant; the table below assumes the default model:
Usage
The default resource allocation reserves gpu_memory_gb=20.0. Adjust this in the stage’s Resources configuration when using smaller or larger CLIP variants.
Complete Filtering Pipeline
A pipeline that stacks all four filters in cost order, then writes to WDS:
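The cost-ordering logic can be illustrated with plain predicate functions: each filter only ever sees the survivors of the previous one, so the expensive CLIP check runs on the fewest samples. This is a self-contained toy, not the real stage API:

```python
from typing import Callable, Iterable, List

Sample = dict  # stand-in for one sample of an InterleavedBatch

def chain_filters(samples: Iterable[Sample],
                  filters: List[Callable[[Sample], bool]]) -> List[Sample]:
    """Apply filters in order; a sample dropped by an early (cheap) filter
    is never seen by a later (expensive) one."""
    kept = list(samples)
    for f in filters:
        kept = [s for s in kept if f(s)]
    return kept

# Toy predicates standing in for the four stages, cheapest first.
filters = [
    lambda s: s["blur"] > 100.0,            # blur
    lambda s: s["qr_frac"] < 0.1,           # QR code
    lambda s: 0.001 <= s["ratio"] <= 0.1,   # image-to-text ratio
    lambda s: s["clip"] >= 0.2,             # CLIP score (most expensive)
]
samples = [
    {"blur": 250.0, "qr_frac": 0.0, "ratio": 0.01, "clip": 0.3},
    {"blur": 40.0,  "qr_frac": 0.0, "ratio": 0.01, "clip": 0.9},  # too blurry
]
print(len(chain_filters(samples, filters)))  # -> 1
```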
Best Practices
- Filter cheap to expensive: blur → QR-code → image-to-text ratio → CLIP. Each early filter reduces what the next has to score.
- Inspect score distributions before tightening thresholds: run each filter with permissive thresholds first, dump scores to a manifest, plot the distributions, and pick thresholds from percentiles. Defaults are starting points, not final answers.
- Don’t use CLIP without a budget: CLIP scoring is the most expensive filter by orders of magnitude. If your dataset has millions of samples, it’s still feasible, but plan compute accordingly.
- Avoid stacking redundant filters: blur and CLIP-score both penalize bad images, but in different ways. Blur catches optical issues; CLIP catches semantic mismatch. Use both, but tune separately.
- Mind the image-to-text ratio for data composition: this filter is the easiest one to misuse — set the wrong max_ratio and you’ll silently drop most caption-style data. Inspect a sample of dropped vs. kept samples first.
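Picking thresholds from score percentiles, as recommended above, takes only a few lines of NumPy. A sketch with synthetic scores standing in for a real dry-run dump:

```python
import numpy as np

# Synthetic stand-in for blur scores dumped from a permissive dry run.
rng = np.random.default_rng(0)
scores = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)

# Keep the sharpest 90%: set score_threshold at the 10th percentile.
threshold = float(np.percentile(scores, 10))
kept = scores >= threshold
print(f"threshold={threshold:.1f}, kept fraction={kept.mean():.0%}")
```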
Related Topics
- Interleaved IO — readers and writers that produce and consume the InterleavedBatch format these filters operate on.
- Nemotron-Parse PDF Pipeline — one source of interleaved data; pair with these filters for end-to-end PDF curation.