stages.text.download.base.extract
Module Contents
Classes
- DocumentExtractStage: Stage that extracts structured content from raw records.
- DocumentExtractor: Abstract base class for document extractors.
API
- class stages.text.download.base.extract.DocumentExtractStage
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]
Stage that extracts structured content from raw records.
Takes a DocumentBatch with raw content and produces a DocumentBatch with extracted content. Use this stage when iteration and extraction are separate steps.
- add_filename_column: bool | str = True
- extractor: stages.text.download.base.extract.DocumentExtractor = None
- inputs() -> tuple[list[str], list[str]]
Define input requirements: expects a DocumentBatch with dict records.
- outputs() -> tuple[list[str], list[str]]
Define output: produces a DocumentBatch with processed records.
- process(task: nemo_curator.tasks.DocumentBatch) -> nemo_curator.tasks.DocumentBatch
Extract structured content from raw records.
Args: task (DocumentBatch): Batch containing records
Returns: DocumentBatch: Batch containing extracted records
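The per-record behavior described for this stage can be sketched in plain Python. This is an illustration only, not the library's implementation: it uses lists of dicts in place of DocumentBatch, and the helper `extract_batch`, the sample function `strip_text`, and the default column name `"file_name"` are all hypothetical. In particular, treating `add_filename_column=True` as "use a default column name" and a string as "use this column name" is an assumption about how the `bool | str` attribute is interpreted.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Record = Dict[str, Any]


def extract_batch(
    records: List[Record],
    extract: Callable[[Record], Optional[Record]],
    add_filename_column: Union[bool, str] = True,
) -> List[Record]:
    """Apply `extract` to each record and drop records for which it returns None."""
    # Assumption: True selects a default column name, a string names the column,
    # and False disables the column entirely.
    if add_filename_column is True:
        filename_col: Optional[str] = "file_name"  # hypothetical default name
    elif isinstance(add_filename_column, str):
        filename_col = add_filename_column
    else:
        filename_col = None

    extracted_records: List[Record] = []
    for record in records:
        extracted = extract(record)
        if extracted is None:
            continue  # the extractor asked to skip this record
        # Carry the source file name over when the input record has one.
        if filename_col and filename_col in record:
            extracted[filename_col] = record[filename_col]
        extracted_records.append(extracted)
    return extracted_records


def strip_text(record: Record) -> Optional[Record]:
    """Hypothetical extractor function: keep non-empty, whitespace-trimmed text."""
    text = record.get("content", "").strip()
    return {"text": text} if text else None


raw = [
    {"content": "  hello  ", "file_name": "a.jsonl"},
    {"content": "   ", "file_name": "a.jsonl"},  # skipped: empty after stripping
]
result = extract_batch(raw, strip_text)
```

Here `result` holds one record, because the second input is whitespace only and the extractor returns None for it.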
- class stages.text.download.base.extract.DocumentExtractor
Bases: abc.ABC
Abstract base class for document extractors.
Takes a record dict and returns a processed record dict, or None to skip the record. Can transform any field in the input dict.
- abstractmethod extract(record: dict[str, str]) -> dict[str, Any] | None
Extract or transform a record dict into the final record dict.
- abstractmethod input_columns() -> list[str]
Define the input columns expected in each record passed to extract.
- abstractmethod output_columns() -> list[str]
Define the output columns of the records returned by extract.
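A concrete extractor implements the three abstract methods. The sketch below is a minimal example under stated assumptions: to stay self-contained it defines a local stand-in `DocumentExtractor` that mirrors the interface documented here rather than importing the library class, and the subclass `TitleBodyExtractor` with its `url`, `content`, `title`, and `text` columns is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class DocumentExtractor(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def extract(self, record: Dict[str, str]) -> Optional[Dict[str, Any]]: ...

    @abstractmethod
    def input_columns(self) -> List[str]: ...

    @abstractmethod
    def output_columns(self) -> List[str]: ...


class TitleBodyExtractor(DocumentExtractor):
    """Hypothetical extractor: first line becomes the title, the rest the text."""

    def extract(self, record: Dict[str, str]) -> Optional[Dict[str, Any]]:
        raw = record.get("content", "").strip()
        if not raw:
            return None  # returning None skips the record
        title, _, body = raw.partition("\n")
        return {
            "url": record.get("url", ""),
            "title": title.strip(),
            "text": body.strip(),
        }

    def input_columns(self) -> List[str]:
        return ["url", "content"]

    def output_columns(self) -> List[str]:
        return ["url", "title", "text"]


extractor = TitleBodyExtractor()
doc = extractor.extract({"url": "https://example.com", "content": "Title\nBody text"})
skipped = extractor.extract({"url": "https://example.com/empty", "content": "  "})
```

In this sketch the keys of the dict returned by `extract` match `output_columns()`, which keeps the declared columns and the produced records consistent; whether the library enforces that match is not stated in this reference.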