nemo_curator.stages.text.download.base.extract
Module Contents
Classes
API
Abstract
Abstract base class for document extractors.
Takes a record dict and returns the processed record dict, or None to skip the record. Can transform any field in the input dict.
abstract
Extract/transform a record dict into the final record dict.
abstract
Define input columns - produces DocumentBatch with records.
abstract
Define output columns - produces DocumentBatch with records.
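The contract described above (one extract step that returns a record dict or None, plus input and output column declarations) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's implementation: `ExtractorSketch` is a local stand-in for the abstract base class, the method names `extract`, `input_columns`, and `output_columns` are assumed from the descriptions above, and `StripWhitespaceExtractor`, `run_extractor`, and the `url`/`content`/`text` column names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class ExtractorSketch(ABC):
    """Local stand-in for the abstract extractor base class (an assumption, not the real class)."""

    @abstractmethod
    def extract(self, record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Return the final record dict, or None to skip the record."""

    @abstractmethod
    def input_columns(self) -> List[str]:
        """Names of the fields this extractor expects in each input record."""

    @abstractmethod
    def output_columns(self) -> List[str]:
        """Names of the fields present in each record this extractor returns."""


class StripWhitespaceExtractor(ExtractorSketch):
    """Hypothetical extractor: trims the raw content and drops empty documents."""

    def extract(self, record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        text = (record.get("content") or "").strip()
        if not text:
            # Returning None signals that this record should be skipped.
            return None
        return {"url": record.get("url"), "text": text}

    def input_columns(self) -> List[str]:
        return ["url", "content"]

    def output_columns(self) -> List[str]:
        return ["url", "text"]


def run_extractor(extractor: ExtractorSketch, records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical driver: apply extract() to each record and keep the non-None results."""
    results = []
    for record in records:
        out = extractor.extract(record)
        if out is not None:
            results.append(out)
    return results


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/a", "content": "  hello world  "},
        {"url": "https://example.com/b", "content": "   "},
    ]
    print(run_extractor(StripWhitespaceExtractor(), sample))
```

In the real pipeline the surrounding stage, not a helper like `run_extractor`, would presumably call the extractor and assemble the results into a DocumentBatch; the driver here only illustrates the skip-on-None behavior.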