nemo_curator.stages.interleaved.stages

Module Contents

Classes

| Name | Description |
| --- | --- |
| BaseInterleavedAnnotatorStage | Base stage for row-wise interleaved annotation/filter transforms. |
| BaseInterleavedFilterStage | Base stage for interleaved filtering based on a keep-mask. |
| InterleavedAspectRatioFilterStage | Filter interleaved image rows by aspect-ratio bounds (all image formats). |

API

class nemo_curator.stages.interleaved.stages.BaseInterleavedAnnotatorStage(
name: str = 'base_interleaved_annotator'
)
Dataclass, Abstract

Bases: ProcessingStage[InterleavedBatch, InterleavedBatch]

Base stage for row-wise interleaved annotation/filter transforms.

name
str = 'base_interleaved_annotator'
nemo_curator.stages.interleaved.stages.BaseInterleavedAnnotatorStage.annotate(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.DataFrame
abstract

Apply annotation/filter logic and return transformed dataframe.

nemo_curator.stages.interleaved.stages.BaseInterleavedAnnotatorStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.stages.BaseInterleavedAnnotatorStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.stages.BaseInterleavedAnnotatorStage.process(
task: nemo_curator.tasks.InterleavedBatch
) -> nemo_curator.tasks.InterleavedBatch
class nemo_curator.stages.interleaved.stages.BaseInterleavedFilterStage(
name: str = 'base_interleaved_filter',
drop_invalid_rows: bool = True
)
Dataclass, Abstract

Bases: BaseInterleavedAnnotatorStage

Base stage for interleaved filtering based on a keep-mask.

drop_invalid_rows
bool = True
name
str = 'base_interleaved_filter'
nemo_curator.stages.interleaved.stages.BaseInterleavedFilterStage._basic_row_validity_mask(
df: pandas.DataFrame
) -> pandas.Series
staticmethod
nemo_curator.stages.interleaved.stages.BaseInterleavedFilterStage.annotate(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.DataFrame
nemo_curator.stages.interleaved.stages.BaseInterleavedFilterStage.content_keep_mask(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.Series
abstract

Return content-specific boolean keep-mask aligned to dataframe index.
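
To illustrate what "aligned to dataframe index" means in practice, here is a minimal standalone sketch of a keep-mask function written with plain pandas. It does not subclass BaseInterleavedFilterStage, and the column names `modality` and `text_content`, the helper name `text_length_keep_mask`, and the `min_chars` threshold are all hypothetical, chosen only for illustration.

```python
import pandas as pd


def text_length_keep_mask(df: pd.DataFrame, min_chars: int = 5) -> pd.Series:
    """Hypothetical keep-mask: drop text rows shorter than ``min_chars``.

    The column names used here are assumptions for this sketch, not the
    documented InterleavedBatch schema.
    """
    is_text = df["modality"] == "text"
    too_short = df["text_content"].fillna("").str.len() < min_chars
    # Both operands share df.index, so the result is aligned to it as well.
    return ~(is_text & too_short)


# A non-default index makes the alignment visible.
df = pd.DataFrame(
    {
        "modality": ["text", "image", "text"],
        "text_content": ["hi", None, "long enough"],
    },
    index=[10, 20, 30],
)
mask = text_length_keep_mask(df)
filtered = df[mask]  # boolean indexing matches rows by index label
```

Because the mask carries the same index as the dataframe, `df[mask]` selects rows by label rather than by position, so the mask stays correct even when the index is not a simple 0..n-1 range.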

nemo_curator.stages.interleaved.stages.BaseInterleavedFilterStage.iter_materialized_bytes(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame,
row_mask: pandas.Series
) -> collections.abc.Iterator[tuple[int, bytes | None]]

Yield (row_index, bytes) for masked rows after materialization.

Only the masked subset is materialized, avoiding redundant I/O for the full task.
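
The idea of materializing only the masked subset can be sketched without pandas or NeMo Curator. In the simplified version below, a list of dicts stands in for the dataframe, a list of booleans stands in for `row_mask`, and `load` is a hypothetical loader callable. Yielding `None` when a row cannot be read is an assumption made here to match the `bytes | None` return type; the library may handle failures differently.

```python
from collections.abc import Callable, Iterator
from typing import Optional


def iter_masked_bytes(
    rows: list[dict],
    row_mask: list[bool],
    load: Callable[[dict], bytes],
) -> Iterator[tuple[int, Optional[bytes]]]:
    """Yield (row_index, bytes) for selected rows, loading each one lazily."""
    for index, (row, selected) in enumerate(zip(rows, row_mask)):
        if not selected:
            continue  # unselected rows are never passed to the loader
        try:
            data: Optional[bytes] = load(row)
        except OSError:
            data = None  # assumption: unreadable content is reported as None
        yield index, data


# Hypothetical in-memory "storage" and a loader that records what it reads.
store = {"a.png": b"AAA", "b.png": b"BBB"}
loaded_paths: list[str] = []


def load_from_store(row: dict) -> bytes:
    loaded_paths.append(row["path"])
    if row["path"] not in store:
        raise FileNotFoundError(row["path"])
    return store[row["path"]]


rows = [{"path": "a.png"}, {"path": "b.png"}, {"path": "missing.png"}]
result = list(iter_masked_bytes(rows, [True, False, True], load_from_store))
```

In this sketch `b.png` is never read because its mask entry is `False`, which is the I/O saving the description refers to.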

nemo_curator.stages.interleaved.stages.BaseInterleavedFilterStage.keep_mask(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.Series
class nemo_curator.stages.interleaved.stages.InterleavedAspectRatioFilterStage(
name: str = 'interleaved_aspect_ratio_f...,
drop_invalid_rows: bool = True,
min_aspect_ratio: float = 1.0,
max_aspect_ratio: float = 2.0
)
Dataclass

Bases: BaseInterleavedFilterStage

Filter interleaved image rows by aspect-ratio bounds (all image formats).

max_aspect_ratio
float = 2.0
min_aspect_ratio
float = 1.0
name
str = 'interleaved_aspect_ratio_filter'
nemo_curator.stages.interleaved.stages.InterleavedAspectRatioFilterStage._image_aspect_ratio(
image_bytes: bytes
) -> float | None
staticmethod
nemo_curator.stages.interleaved.stages.InterleavedAspectRatioFilterStage._image_keep_mask(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.Series
nemo_curator.stages.interleaved.stages.InterleavedAspectRatioFilterStage.content_keep_mask(
task: nemo_curator.tasks.InterleavedBatch,
df: pandas.DataFrame
) -> pandas.Series
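
As a rough illustration of filtering by aspect-ratio bounds, the sketch below reads the width and height from a PNG header using only the standard library and applies the default bounds shown above (1.0 and 2.0). It is not the library's implementation, which covers all image formats. Several points are assumptions: the ratio is taken as width / height, both bounds are treated as inclusive, and unreadable images are rejected. The names `png_aspect_ratio`, `keep_by_aspect_ratio`, and `fake_png_header` are hypothetical.

```python
import struct
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_aspect_ratio(image_bytes: bytes) -> Optional[float]:
    """Return width / height from a PNG IHDR chunk, or None if unreadable."""
    # Layout: signature (8) + chunk length (4) + chunk type (4)
    #         + width (4) + height (4) = 24 bytes.
    if len(image_bytes) < 24 or not image_bytes.startswith(PNG_SIGNATURE):
        return None
    if image_bytes[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", image_bytes[16:24])
    if width == 0 or height == 0:
        return None
    return width / height


def keep_by_aspect_ratio(
    image_bytes: bytes,
    min_aspect_ratio: float = 1.0,
    max_aspect_ratio: float = 2.0,
) -> bool:
    """Keep an image only if its ratio lies within the (inclusive) bounds."""
    ratio = png_aspect_ratio(image_bytes)
    # Assumption: images whose ratio cannot be determined are dropped.
    return ratio is not None and min_aspect_ratio <= ratio <= max_aspect_ratio


def fake_png_header(width: int, height: int) -> bytes:
    """Build just enough of a PNG (signature + start of IHDR) for the demo."""
    return PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", width, height)


landscape = fake_png_header(200, 100)  # ratio 2.0, inside the bounds
portrait = fake_png_header(100, 200)   # ratio 0.5, below the minimum
```

Under these assumptions, the default bounds keep square-to-2:1 landscape images and drop portrait images, since any ratio below 1.0 falls outside the range.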