Drop low-quality samples from an InterleavedBatch before downstream training or further processing. Four filter stages are available, each targeting a different quality signal.
Chain cheap filters first to reduce the number of samples expensive filters have to score. A typical order:
The CLIP filter dominates cost. Putting it last means it only runs against samples that survived all other checks.
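The effect of ordering can be illustrated with a small, library-independent sketch. Everything here is hypothetical: the `run_chain` helper, the sample dictionaries, the precomputed `alignment` field standing in for a CLIP score, and the 0.15 threshold are illustrative only and are not part of the stages described on this page.

```python
def run_chain(samples, filters):
    """Apply (name, keep_fn) filters in order.

    Returns the surviving samples and, per filter, how many samples it had to score.
    """
    scored = {}
    survivors = list(samples)
    for name, keep in filters:
        scored[name] = len(survivors)  # every remaining sample is scored by this filter
        survivors = [s for s in survivors if keep(s)]
    return survivors, scored


# Toy samples; "alignment" is a made-up stand-in for an expensive CLIP score.
samples = [
    {"id": 0, "words": 200, "images": 2, "alignment": 0.31},
    {"id": 1, "words": 0, "images": 9, "alignment": 0.05},
    {"id": 2, "words": 5000, "images": 1, "alignment": 0.28},
    {"id": 3, "words": 150, "images": 1, "alignment": 0.02},
]

# Cheap check: image-to-text ratio inside a window.
cheap = ("ratio", lambda s: s["words"] > 0 and 0.001 <= s["images"] / s["words"] <= 0.1)
# Expensive check: alignment score above a hypothetical threshold.
expensive = ("clip", lambda s: s["alignment"] >= 0.15)

kept, scored = run_chain(samples, [cheap, expensive])
print(scored)                    # number of samples each filter had to score
print([s["id"] for s in kept])   # ids of the samples that survived both filters
```

With the cheap filter first, the expensive one scores only the samples that survived it. Reversing the order yields the same survivors but makes the expensive filter score every sample.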
InterleavedBlurFilterStage

Computes the Laplacian variance of each image (via OpenCV) and drops images below score_threshold. Lower variance means a flatter, blurrier image.
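To make the blur signal concrete, here is a minimal pure-Python sketch of Laplacian variance on a grayscale grid. It is not the stage's implementation: the function name is hypothetical, and it evaluates only interior pixels, whereas OpenCV also handles image borders. With OpenCV the same idea is commonly written as `cv2.Laplacian(gray, cv2.CV_64F).var()`; whether the stage uses exactly that call is an assumption.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2D grid.

    `gray` is a list of equal-length rows of intensities, at least 3x3.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("need at least a 3x3 image")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] applied at (y, x)
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# A uniform image has no edges; a checkerboard is all edges.
flat = [[128] * 6 for _ in range(6)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(6)] for y in range(6)]

print(laplacian_variance(flat))
print(laplacian_variance(checker))
```

The uniform image scores zero and the checkerboard scores very high, which is why a threshold on this value separates flat or blurry images from sharp ones.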
InterleavedQRCodeFilterStage

Detects QR codes in each image (OpenCV) and drops samples whose largest QR-code bounding box exceeds score_threshold as a fraction of the total image area. Useful for stripping promotional or contact-info imagery from web crawls.
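The area-fraction test described above can be sketched without any detector, given the corner points of each detected code. The helper names and the 0.05 threshold below are hypothetical, and the corner points are hard-coded; in practice they would come from a QR detector such as OpenCV's `cv2.QRCodeDetector`.

```python
def bbox_area_fraction(points, image_width, image_height):
    """Axis-aligned bounding-box area of a polygon as a fraction of the image area."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return box_area / (image_width * image_height)


def exceeds_qr_threshold(qr_polygons, image_width, image_height, score_threshold):
    """True if the largest detected QR code covers more than score_threshold of the image."""
    if not qr_polygons:
        return False  # no QR code detected, nothing to drop
    largest = max(bbox_area_fraction(p, image_width, image_height) for p in qr_polygons)
    return largest > score_threshold


# A 400x300 image with one large and one small QR code (corner points are made up).
large_qr = [(50, 50), (250, 50), (250, 250), (50, 250)]    # 200x200 px
small_qr = [(10, 10), (30, 10), (30, 30), (10, 30)]        # 20x20 px

print(exceeds_qr_threshold([large_qr, small_qr], 400, 300, score_threshold=0.05))
print(exceeds_qr_threshold([small_qr], 400, 300, score_threshold=0.05))
```

Taking the maximum over all detected codes mirrors the "largest QR-code bounding box" wording: one dominant code is enough to trigger the drop, while small incidental codes are tolerated.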
InterleavedImageToTextRatioFilterStage

Computes the per-sample ratio of image count to text word count and drops samples outside a configurable min_ratio/max_ratio window. Useful for excluding image-dump samples (no text) and text-heavy samples with few or no images.
For mixed image-text pretraining data:
A min_ratio=0.001 means “at least one image per 1000 text words.” A max_ratio=0.1 means “no more than one image per 10 text words.”
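The window arithmetic can be checked with a short sketch. The function names are hypothetical, words are counted by whitespace splitting, and the treatment of samples with no text (ratio treated as infinite when images are present) and the inclusive bounds are assumptions, not documented behavior of the stage.

```python
def image_to_text_ratio(num_images, text_segments):
    """Images per text word for one sample; words are counted by whitespace splitting."""
    words = sum(len(text.split()) for text in text_segments)
    if words == 0:
        # Assumption: an image-only sample has an unbounded ratio.
        return float("inf") if num_images else 0.0
    return num_images / words


def within_ratio_window(num_images, text_segments, min_ratio=0.001, max_ratio=0.1):
    """True if the sample's ratio lies inside [min_ratio, max_ratio]."""
    ratio = image_to_text_ratio(num_images, text_segments)
    return min_ratio <= ratio <= max_ratio


article = ["word " * 400]                     # 400 words
caption = ["a dog on a beach"]                # 5 words
long_text = ["word " * 3000]                  # 3000 words

print(within_ratio_window(2, article))        # 2 / 400 = 0.005, inside the window
print(within_ratio_window(1, caption))        # 1 / 5 = 0.2, above max_ratio
print(within_ratio_window(5, []))             # image dump with no text
print(within_ratio_window(0, long_text))      # text with no images, ratio 0
```

Note that a single image with a five-word caption falls above max_ratio=0.1, so short caption-style samples are rejected by this window.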
InterleavedCLIPScoreFilterStage

Uses a CLIP image-text encoder to score each (image, text) pair by cosine similarity, and drops samples whose alignment is below min_score. Ensures the textual content of each sample actually describes the image.
CLIP scores depend on the model variant; the table below assumes the default model:
The default resource allocation reserves gpu_memory_gb=20.0. Tune this on the stage’s Resources for smaller or larger CLIP variants.
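The scoring step of the CLIP filter reduces to cosine similarity between an image embedding and a text embedding. The sketch below shows only that arithmetic on hand-written vectors; it does not load a CLIP model, and the function names and the min_score value of 0.5 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_alignment(image_embedding, text_embedding, min_score):
    """True if the (image, text) pair scores at least min_score."""
    return cosine_similarity(image_embedding, text_embedding) >= min_score


# Toy 2-D "embeddings"; real CLIP embeddings have hundreds of dimensions.
image_emb = [3.0, 4.0]
matching_text_emb = [4.0, 3.0]     # points in a similar direction
unrelated_text_emb = [-4.0, 3.0]   # orthogonal to the image embedding

print(cosine_similarity(image_emb, matching_text_emb))
print(passes_alignment(image_emb, matching_text_emb, min_score=0.5))
print(passes_alignment(image_emb, unrelated_text_emb, min_score=0.5))
```

Because cosine similarity ignores vector length, only the direction of the embeddings matters. That is also why typical score ranges shift between model variants and min_score has to be chosen per model.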
A pipeline that stacks all four filters in cost order, then writes to WebDataset (WDS):
max_ratio and you’ll silently drop most caption-style data. Inspect a sample of dropped vs kept first.

InterleavedBatch format these filters operate on.