nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter

Module Contents

Classes

Name                                      Description
InterleavedImageToTextRatioFilterStage    Filter interleaved samples by image-to-text ratio (images per word).

Functions

Name                Description
_text_word_count    Count words in text by splitting on whitespace.

Data

DEFAULT_IMAGE_TO_TEXT_MAX_RATIO

DEFAULT_IMAGE_TO_TEXT_MIN_RATIO

API

class nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter.InterleavedImageToTextRatioFilterStage(
name: str = 'interleaved_image_to_text_...,
drop_invalid_rows: bool = True,
min_ratio: float = DEFAULT_IMAGE_TO_TEXT_MIN_R...,
max_ratio: float = DEFAULT_IMAGE_TO_TEXT_MAX_R...
)
Dataclass

Bases: BaseInterleavedFilterStage

Filter interleaved samples by image-to-text ratio (images per word).

Groups rows by sample_id. For each sample:

  • image_count = number of rows with modality == 'image'
  • text_word_count = sum of len(text_content.split()) over text rows
  • ratio = image_count / max(text_word_count, 1)

A sample whose ratio falls outside [min_ratio, max_ratio] is dropped, together with all of its rows.
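
The per-sample computation described above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: it works on a list of dicts rather than a pandas.DataFrame, the helper name keep_mask is hypothetical, and the 'text' modality label is an assumption (the description only names 'image' explicitly).

```python
from collections import defaultdict


def keep_mask(rows, min_ratio=0.0, max_ratio=float("inf")):
    """Return one boolean per row: True if the row's sample passes the ratio check.

    Hypothetical helper for illustration; `rows` is a list of dicts with the
    keys sample_id, modality, and text_content.
    """
    image_count = defaultdict(int)
    text_word_count = defaultdict(int)

    # First pass: accumulate per-sample image and word counts.
    for row in rows:
        sample_id = row["sample_id"]
        if row["modality"] == "image":
            image_count[sample_id] += 1
        elif row["modality"] == "text":  # assumed label for text rows
            text = row.get("text_content") or ""
            text_word_count[sample_id] += len(text.split())

    def ratio(sample_id):
        # max(..., 1) avoids division by zero for samples without any words.
        return image_count[sample_id] / max(text_word_count[sample_id], 1)

    # Second pass: every row inherits the decision made for its sample.
    return [min_ratio <= ratio(row["sample_id"]) <= max_ratio for row in rows]


rows = [
    {"sample_id": "a", "modality": "text", "text_content": "one two three four"},
    {"sample_id": "a", "modality": "image", "text_content": None},
    {"sample_id": "b", "modality": "image", "text_content": None},
    {"sample_id": "b", "modality": "image", "text_content": None},
    {"sample_id": "b", "modality": "text", "text_content": "hi"},
]

# Sample "a": 1 image / 4 words = 0.25; sample "b": 2 images / 1 word = 2.0.
mask = keep_mask(rows, min_ratio=0.0, max_ratio=0.5)
```

Because of the max(text_word_count, 1) guard, a sample with images but no words gets a ratio equal to its image count, so a finite max_ratio below 1 removes every such sample.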

max_ratio: float = DEFAULT_IMAGE_TO_TEXT_MAX_RATIO
min_ratio: float = DEFAULT_IMAGE_TO_TEXT_MIN_RATIO
name: str = 'interleaved_image_to_text_ratio_filter'

nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter.InterleavedImageToTextRatioFilterStage.content_keep_mask(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.Series

nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter._text_word_count(
text: str | None
) -> int

Count words in text by splitting on whitespace.
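
A minimal sketch of what whitespace-based word counting looks like, matching the signature above. The function name text_word_count (without the leading underscore) is a stand-in, and returning 0 for None is an assumption based on the `str | None` parameter type; the source does not state how None is handled.

```python
from typing import Optional


def text_word_count(text: Optional[str]) -> int:
    """Count whitespace-separated tokens; stand-in for the private helper."""
    if not text:  # assumption: None and the empty string both count as 0 words
        return 0
    # str.split() with no arguments splits on runs of any whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())


count = text_word_count("two  images\tper\nparagraph ")
```

Punctuation is not stripped under this scheme, so a token such as "word," counts as a single word.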

nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter.DEFAULT_IMAGE_TO_TEXT_MAX_RATIO: float = float('inf')
nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter.DEFAULT_IMAGE_TO_TEXT_MIN_RATIO: float = 0.0