nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter
Module Contents
Classes
Functions
Data
DEFAULT_IMAGE_TO_TEXT_MAX_RATIO
DEFAULT_IMAGE_TO_TEXT_MIN_RATIO
API
Dataclass
Bases: BaseInterleavedFilterStage
Filter interleaved samples by image-to-text ratio (images per word).
Groups rows by sample_id. For each sample:
- image_count = number of rows with modality == 'image'
- text_word_count = sum of len(text_content.split()) over text rows
- ratio = image_count / max(text_word_count, 1)
Samples with ratio outside [min_ratio, max_ratio] are dropped (all their rows).
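
The steps above can be sketched in plain Python. This is a minimal illustration of the described logic, not the stage's actual implementation: the function name filter_by_image_to_text_ratio, the list-of-dicts row layout, and the example bounds are assumptions made for this sketch, and the real stage operates on the library's own task and batch types.

```python
from collections import defaultdict


def filter_by_image_to_text_ratio(rows, min_ratio, max_ratio):
    """Hypothetical helper: keep rows of samples whose ratio is within bounds."""
    image_counts = defaultdict(int)
    word_counts = defaultdict(int)

    # Group by sample_id, counting image rows and words in text rows.
    for row in rows:
        sample_id = row["sample_id"]
        if row["modality"] == "image":
            image_counts[sample_id] += 1
        elif row["modality"] == "text":
            text = row.get("text_content") or ""
            word_counts[sample_id] += len(text.split())

    # ratio = image_count / max(text_word_count, 1), compared inclusively.
    kept_ids = set()
    for sample_id in {row["sample_id"] for row in rows}:
        ratio = image_counts[sample_id] / max(word_counts[sample_id], 1)
        if min_ratio <= ratio <= max_ratio:
            kept_ids.add(sample_id)

    # A dropped sample loses all of its rows, text and image alike.
    return [row for row in rows if row["sample_id"] in kept_ids]


rows = [
    {"sample_id": "a", "modality": "text", "text_content": "a cat sitting on a mat"},
    {"sample_id": "a", "modality": "image", "text_content": None},
    {"sample_id": "b", "modality": "text", "text_content": "hi"},
    {"sample_id": "b", "modality": "image", "text_content": None},
    {"sample_id": "b", "modality": "image", "text_content": None},
    {"sample_id": "b", "modality": "image", "text_content": None},
]

# Example bounds chosen for illustration only; they are not the module defaults.
kept = filter_by_image_to_text_ratio(rows, min_ratio=0.0, max_ratio=1.0)
```

Here sample "a" has one image and six words (ratio 1/6) and is kept, while sample "b" has three images and one word (ratio 3) and is dropped together with its text row. The max(text_word_count, 1) guard means a sample with no text gets a ratio equal to its image count instead of raising a division error.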
max_ratio
min_ratio
name
Count words in text by splitting on whitespace.
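
Whitespace splitting has a few edge cases worth knowing when choosing ratio bounds. The sketch below uses a hypothetical name, count_words, since the function's actual name is not shown here; it relies only on Python's built-in str.split() with no arguments.

```python
def count_words(text):
    """Hypothetical helper: number of whitespace-separated tokens in text."""
    # split() with no arguments collapses runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())


n_plain = count_words("two  words\n")
n_empty = count_words("   ")
n_hyphen = count_words("state-of-the-art model")
```

Blank or whitespace-only text counts as zero words, and punctuation or hyphens do not split tokens, so "state-of-the-art model" counts as two words.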