stages.text.download.arxiv.extract

Module Contents

Classes

ArxivExtractor

Extracts text from Arxiv LaTeX files.

API

class stages.text.download.arxiv.extract.ArxivExtractor

Bases: nemo_curator.stages.text.download.DocumentExtractor

Extracts text from Arxiv LaTeX files.

Initialization

extract(record: dict[str, str]) -> dict[str, Any] | None

Extract or transform a record dict into the final record dict.

input_columns() -> list[str]

Define input columns - produces DocumentBatch with records.

output_columns() -> list[str]

Define output columns - produces DocumentBatch with records.
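
To illustrate the shape of the interface documented above, here is a minimal, self-contained sketch. It does not import nemo_curator and is not the ArxivExtractor implementation. ToyLatexExtractor is a hypothetical stand-in, the column names ("id", "content", "text") are assumptions, and the LaTeX handling (dropping comments, keeping the document body) is a simplified illustration only.

```python
from __future__ import annotations

import re
from typing import Any


class ToyLatexExtractor:
    """Hypothetical stand-in mirroring the three documented methods."""

    def input_columns(self) -> list[str]:
        # Assumed input field names; the real extractor may use others.
        return ["id", "content"]

    def output_columns(self) -> list[str]:
        # Assumed output field names.
        return ["id", "text"]

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        source = record.get("content", "")
        # Drop LaTeX comments: an unescaped % through the end of the line.
        source = re.sub(r"(?<!\\)%.*", "", source)
        # Keep only the document body when it is explicitly delimited.
        match = re.search(
            r"\\begin\{document\}(.*?)\\end\{document\}", source, re.DOTALL
        )
        body = (match.group(1) if match else source).strip()
        if not body:
            # Signal "no usable text" with None, matching the return type.
            return None
        return {"id": record.get("id", ""), "text": body}


extractor = ToyLatexExtractor()
record = {
    "id": "1",
    "content": "\\begin{document}\nHello % a comment\n\\end{document}",
}
result = extractor.extract(record)
print(result)
```

In this sketch, a record whose content is empty after cleaning yields None, which is one plausible reading of the `dict[str, Any] | None` return type. How the real extractor decides when to return None is not stated here.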