nemo_curator.stages.text.download.base.extract

Module Contents

Classes

Name | Description
DocumentExtractor | Abstract base class for document extractors.

API

class nemo_curator.stages.text.download.base.extract.DocumentExtractor()
Abstract

Abstract base class for document extractors.

Takes a record dict and returns a processed record dict, or None to skip the record. Can transform any fields in the input dict.

nemo_curator.stages.text.download.base.extract.DocumentExtractor.extract(
record: dict[str, str]
) -> dict[str, typing.Any] | None
abstract

Extract/transform a record dict into the final record dict. Return None to skip the record.

nemo_curator.stages.text.download.base.extract.DocumentExtractor.input_columns() -> list[str]
abstract

Define the input columns - produces a DocumentBatch with records.

nemo_curator.stages.text.download.base.extract.DocumentExtractor.output_columns() -> list[str]
abstract

Define the output columns - produces a DocumentBatch with records.
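
Example

The following is a minimal sketch of how a subclass might implement the three abstract methods described above. To stay runnable without NeMo Curator installed, it defines a local stand-in base class that only mirrors the documented signatures; in real code the base class would presumably be imported from nemo_curator.stages.text.download.base.extract instead. The PlainTextExtractor class and the column names (url, content, text, char_count) are hypothetical and chosen purely for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class DocumentExtractor(ABC):
    """Local stand-in that mirrors the documented interface (assumption:
    this is NOT the real NeMo Curator class, only the same method shapes)."""

    @abstractmethod
    def extract(self, record: dict[str, str]) -> Optional[dict[str, Any]]:
        """Return the transformed record, or None to skip it."""

    @abstractmethod
    def input_columns(self) -> list[str]:
        """Names of the columns expected in each input record."""

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Names of the columns present in each output record."""


class PlainTextExtractor(DocumentExtractor):
    """Hypothetical extractor: trims whitespace and skips empty documents."""

    def input_columns(self) -> list[str]:
        return ["url", "content"]

    def output_columns(self) -> list[str]:
        return ["url", "text", "char_count"]

    def extract(self, record: dict[str, str]) -> Optional[dict[str, Any]]:
        text = record.get("content", "").strip()
        if not text:
            # Returning None signals that this record should be skipped.
            return None
        # Any field may be transformed; here "content" becomes "text"
        # and a derived "char_count" field is added.
        return {"url": record["url"], "text": text, "char_count": len(text)}


extractor = PlainTextExtractor()
kept = extractor.extract(
    {"url": "https://example.com/a", "content": "  hello world  "}
)
skipped = extractor.extract({"url": "https://example.com/b", "content": "   "})
```

In this sketch, kept holds a dict whose keys match output_columns(), and skipped is None because the record contains only whitespace. Keeping the keys returned by extract() consistent with output_columns() is a design choice made here for clarity; whether the library enforces it is not stated on this page.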