nemo_curator.stages.interleaved.filter.clip_score_filter

Module Contents

Classes

| Name | Description |
| --- | --- |
| InterleavedCLIPScoreFilterStage | Filter interleaved image rows by CLIP image-text relevance score. |

Functions

| Name | Description |
| --- | --- |
| _indices_and_decoded_images_from_rows | Decode image bytes per row; clear keep_mask entries where decode fails. |
| _sample_texts_list_from_df | Return list of text_content from all text rows for the given sample_id (non-empty). |

Data

DEFAULT_CLIP_MIN_SCORE

API

class nemo_curator.stages.interleaved.filter.clip_score_filter.InterleavedCLIPScoreFilterStage(
name: str = 'interleaved_clip_score_fil...,
drop_invalid_rows: bool = True,
model_dir: str | None = None,
min_score: float = DEFAULT_CLIP_MIN_SCORE,
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(gpu_memo...
)
Dataclass

Bases: BaseInterleavedFilterStage

Filter interleaved image rows by CLIP image-text relevance score.

For each image row, all text rows with the same sample_id form (image, text) pairs. CLIP similarity is computed for each pair. An image is kept only if at least one pair has score >= min_score; otherwise it is dropped.
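The keep rule described above can be sketched in plain Python. This is only an illustration, not the stage's implementation: the function name `keep_images` and the nested-list input are hypothetical, and the similarity scores are assumed to be precomputed (the real stage computes them with a CLIP model). Treating an image with no text rows as dropped is also an assumption that follows from reading "at least one pair" literally.

```python
# Documented default threshold for this module.
DEFAULT_CLIP_MIN_SCORE = 0.15


def keep_images(scores_per_image, min_score=DEFAULT_CLIP_MIN_SCORE):
    """Return one bool per image.

    scores_per_image[i] holds the (hypothetical, precomputed) CLIP scores
    of image i against every text row that shares its sample_id.
    An image is kept if at least one pair scores >= min_score.
    """
    return [
        any(score >= min_score for score in scores)
        for scores in scores_per_image
    ]


# Image 0 has one pair above the threshold, image 1 has none,
# image 2 has no text rows at all (assumed to be dropped).
scores = [[0.05, 0.22], [0.10, 0.14], []]
print(keep_images(scores))  # [True, False, False]
```

Because the rule takes the best-scoring pair, a single relevant text row in the sample is enough to keep an image, even when the other text rows are unrelated to it.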

min_score: float = DEFAULT_CLIP_MIN_SCORE
model_dir: str | None = None
name: str = 'interleaved_clip_score_filter'
resources: Resources
nemo_curator.stages.interleaved.filter.clip_score_filter.InterleavedCLIPScoreFilterStage.content_keep_mask(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.Series
nemo_curator.stages.interleaved.filter.clip_score_filter.InterleavedCLIPScoreFilterStage.setup(
worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.interleaved.filter.clip_score_filter.InterleavedCLIPScoreFilterStage.setup_on_node(
node_info: nemo_curator.backends.base.NodeInfo,
worker_metadata: nemo_curator.backends.base.WorkerMetadata
) -> None

Download the weights for the CLIP model on the node.

nemo_curator.stages.interleaved.filter.clip_score_filter._indices_and_decoded_images_from_rows(
rows: list[tuple[int, bytes]],
keep_mask: pandas.Series
) -> tuple[list[int], list[numpy.ndarray]]

Decode image bytes per row; clear keep_mask entries where decode fails.
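A rough sketch of the decode-and-mask pattern this helper describes, under several assumptions: a plain `dict` stands in for the `pandas.Series` keep mask, the decoder is passed in as a hypothetical `decode` callable, and a decode failure is assumed to surface as `ValueError`. The real helper's decoder and error handling are not documented here.

```python
def indices_and_decoded(rows, keep_mask, decode):
    """Decode each (row_index, image_bytes) pair.

    On failure, clear keep_mask[row_index] and skip the row.
    Returns the indices that decoded successfully and their images,
    in matching order.
    """
    indices, images = [], []
    for idx, data in rows:
        try:
            image = decode(data)
        except ValueError:
            # Assumption: decode errors are reported as ValueError.
            keep_mask[idx] = False
            continue
        indices.append(idx)
        images.append(image)
    return indices, images


def toy_decode(data):
    """Stand-in decoder (hypothetical): rejects empty payloads."""
    if not data:
        raise ValueError("empty image payload")
    return list(data)


keep_mask = {0: True, 1: True, 2: True}  # stand-in for a boolean Series
rows = [(0, b"\x01\x02"), (2, b"")]
indices, images = indices_and_decoded(rows, keep_mask, toy_decode)
print(indices, images, keep_mask)
```

Returning the surviving indices alongside the decoded images lets the caller map per-image scores back to the original rows after failed rows have been skipped.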

nemo_curator.stages.interleaved.filter.clip_score_filter._sample_texts_list_from_df(
df: pandas.DataFrame,
sample_id: str
) -> list[str]

Return list of text_content from all text rows for the given sample_id (non-empty).
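The lookup might look roughly like the following, using a list of dicts as a stand-in for the DataFrame. The `modality` column and its `"text"` value are assumptions made for illustration; only `sample_id` and `text_content` appear in the description above. Reading "(non-empty)" as "skip missing or empty text" is likewise an interpretation.

```python
def sample_texts(rows, sample_id):
    """Collect non-empty text_content values for one sample.

    rows is a list of dicts standing in for DataFrame rows; the
    'modality' key is a hypothetical way to mark text rows.
    """
    return [
        row["text_content"]
        for row in rows
        if row["sample_id"] == sample_id
        and row["modality"] == "text"
        and row.get("text_content")  # skips None and empty strings
    ]


rows = [
    {"sample_id": "a", "modality": "text", "text_content": "a red bicycle"},
    {"sample_id": "a", "modality": "image", "text_content": None},
    {"sample_id": "a", "modality": "text", "text_content": ""},
    {"sample_id": "b", "modality": "text", "text_content": "a city skyline"},
]
print(sample_texts(rows, "a"))  # ['a red bicycle']
```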

nemo_curator.stages.interleaved.filter.clip_score_filter.DEFAULT_CLIP_MIN_SCORE: float = 0.15