stages.text.download.base.extract

Module Contents

Classes

DocumentExtractStage

Stage that extracts structured content from raw records.

DocumentExtractor

Abstract base class for document extractors.

API

class stages.text.download.base.extract.DocumentExtractStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Stage that extracts structured content from raw records.

Takes DocumentBatch with raw content and produces DocumentBatch with extracted content. This is for cases where iteration and extraction are separate steps.

add_filename_column: bool | str

Default: True

extractor: stages.text.download.base.extract.DocumentExtractor

Default: None

inputs() -> tuple[list[str], list[str]]

Define input requirements - expects DocumentBatch with dict records.

outputs() -> tuple[list[str], list[str]]

Define output - produces DocumentBatch with processed records.

process(
task: nemo_curator.tasks.DocumentBatch,
) -> nemo_curator.tasks.DocumentBatch

Extract structured content from raw records.

Args:
    task (DocumentBatch): Batch containing raw records.

Returns:
    DocumentBatch: Batch containing extracted records.
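The per-record behavior described here can be sketched in plain Python. This is an illustration only, not the library's implementation: it uses a list of dicts in place of a DocumentBatch, and the names `extract_records` and `title_only` are hypothetical. It assumes that records for which the extractor returns None are dropped from the output, as the DocumentExtractor description states.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]


def extract_records(
    records: list[Record],
    extract: Callable[[Record], Optional[Record]],
) -> list[Record]:
    """Apply `extract` to each record and drop records it chooses to skip.

    Hypothetical helper: a stand-in for what a stage's process() step
    might do with a list of dict records.
    """
    extracted = []
    for record in records:
        result = extract(record)
        if result is not None:  # None means "skip this record"
            extracted.append(result)
    return extracted


def title_only(record: Record) -> Optional[Record]:
    """Hypothetical extract function: keep a stripped title, skip if missing."""
    title = record.get("title")
    if not title:
        return None
    return {"title": title.strip()}


raw = [{"title": " A "}, {"title": ""}, {"body": "x"}]
cleaned = extract_records(raw, title_only)
```

Only the first record survives in this sketch, because the other two have no usable title.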

class stages.text.download.base.extract.DocumentExtractor

Bases: abc.ABC

Abstract base class for document extractors.

Takes a record dict and returns a processed record dict, or None to skip the record. An extractor can transform any field in the input dict.

abstractmethod extract(record: dict[str, str]) -> dict[str, Any] | None

Extract or transform a record dict into the final record dict.

abstractmethod input_columns() -> list[str]

Define the input columns: the record fields this extractor expects.

abstractmethod output_columns() -> list[str]

Define the output columns: the record fields this extractor produces.
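A minimal sketch of a concrete extractor may make the three abstract methods easier to picture. To stay self-contained, the block defines a local stand-in base class that mirrors the documented signatures rather than importing the real one. The `WhitespaceExtractor` class and the column names `url`, `content`, `text`, and `num_chars` are assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class DocumentExtractor(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def extract(self, record: dict[str, str]) -> dict[str, Any] | None: ...

    @abstractmethod
    def input_columns(self) -> list[str]: ...

    @abstractmethod
    def output_columns(self) -> list[str]: ...


class WhitespaceExtractor(DocumentExtractor):
    """Hypothetical extractor: collapses whitespace and skips empty documents."""

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = " ".join(record["content"].split())
        if not text:
            return None  # nothing left after cleaning: skip this record
        return {"url": record["url"], "text": text, "num_chars": len(text)}

    def input_columns(self) -> list[str]:
        # Fields read by extract().
        return ["url", "content"]

    def output_columns(self) -> list[str]:
        # Fields present in every dict returned by extract().
        return ["url", "text", "num_chars"]


extractor = WhitespaceExtractor()
kept = extractor.extract({"url": "https://example.com", "content": "  a \n b "})
skipped = extractor.extract({"url": "https://example.com", "content": "   "})
```

Keeping `input_columns()` and `output_columns()` in sync with what `extract()` reads and returns is the main design point here: the stage has no other way to learn the shape of the records.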