stages.text.download.base.extract#
Module Contents#
Classes#
DocumentExtractStage | Stage that extracts structured content from raw records.
DocumentExtractor | Abstract base class for document extractors.
API#
- class stages.text.download.base.extract.DocumentExtractStage#
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]
Stage that extracts structured content from raw records.
Takes a DocumentBatch with raw content and produces a DocumentBatch with extracted content. Use this stage when iteration and extraction are separate steps.
- add_filename_column: bool | str = True#
- extractor: stages.text.download.base.extract.DocumentExtractor = None#
- inputs() → tuple[list[str], list[str]]#
Define input requirements: expects a DocumentBatch with dict records.
- outputs() → tuple[list[str], list[str]]#
Define outputs: produces a DocumentBatch with processed records.
- process(task: nemo_curator.tasks.DocumentBatch) → nemo_curator.tasks.DocumentBatch#
Extract structured content from raw records.
Args: task (DocumentBatch): Batch containing raw records.
Returns: DocumentBatch: Batch containing extracted records.
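The process step can be pictured as a map-and-filter over the records in a batch. The following is a minimal sketch under stated assumptions, not the library's implementation: a plain list of dicts stands in for DocumentBatch, and `TitleCaseExtractor`, `apply_extractor`, and the `url`/`content`/`text` field names are hypothetical names chosen for illustration. It assumes that records for which the extractor returns None are dropped.

```python
from __future__ import annotations

from typing import Any


class TitleCaseExtractor:
    """Hypothetical extractor: title-cases raw content and skips empty records."""

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        content = record.get("content", "").strip()
        if not content:
            return None  # signal that this record should be skipped
        return {"url": record.get("url", ""), "text": content.title()}


def apply_extractor(records: list[dict[str, str]], extractor: Any) -> list[dict[str, Any]]:
    """Run extractor.extract on each record and keep only non-None results."""
    extracted = []
    for record in records:
        result = extractor.extract(record)
        if result is not None:
            extracted.append(result)
    return extracted


# A list of dicts stands in for the records held by a DocumentBatch (assumption).
raw_records = [
    {"url": "https://example.com/a", "content": "hello world"},
    {"url": "https://example.com/b", "content": "   "},
]
processed = apply_extractor(raw_records, TitleCaseExtractor())
```

Here the second record contains only whitespace, so the sketch drops it and `processed` holds a single extracted record.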
- class stages.text.download.base.extract.DocumentExtractor#
Bases: abc.ABC
Abstract base class for document extractors.
Takes a record dict and returns a processed record dict, or None to skip the record. An extractor can transform any field in the input dict.
- abstractmethod extract(record: dict[str, str]) → dict[str, Any] | None#
Extract or transform a record dict into the final record dict.
- abstractmethod input_columns() → list[str]#
Define the input columns: the fields the extractor expects in each incoming record.
- abstractmethod output_columns() → list[str]#
Define the output columns: the fields present in each record the extractor produces.
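To illustrate the three abstract methods, here is a minimal sketch of a concrete extractor. So that the example runs without the library installed, it declares a stand-in `DocumentExtractor` that mirrors the interface documented above. In real code you would subclass the library's class instead. `PlainTextExtractor` and the `url`/`content`/`text` column names are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class DocumentExtractor(ABC):
    """Stand-in mirroring the documented interface; not imported from the library."""

    @abstractmethod
    def extract(self, record: dict[str, str]) -> dict[str, Any] | None: ...

    @abstractmethod
    def input_columns(self) -> list[str]: ...

    @abstractmethod
    def output_columns(self) -> list[str]: ...


class PlainTextExtractor(DocumentExtractor):
    """Hypothetical extractor that collapses whitespace in a 'content' field."""

    def input_columns(self) -> list[str]:
        return ["url", "content"]

    def output_columns(self) -> list[str]:
        return ["url", "text"]

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = " ".join(record["content"].split())
        if not text:
            return None  # nothing left after cleanup, so skip this record
        return {"url": record["url"], "text": text}


extractor = PlainTextExtractor()
kept = extractor.extract({"url": "https://example.com", "content": "  some\n raw   text "})
skipped = extractor.extract({"url": "https://example.com/empty", "content": " \n "})
```

In this sketch `kept` is a dict whose keys match `output_columns()`, while `skipped` is None because the record has no usable content.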