stages.text.download.arxiv.extract
Module Contents
Classes
ArxivExtractor | Extracts text from Arxiv LaTeX files.
API
- class stages.text.download.arxiv.extract.ArxivExtractor
Bases: nemo_curator.stages.text.download.DocumentExtractor
Extracts text from Arxiv LaTeX files.
- extract(record: dict[str, str]) -> dict[str, Any] | None
Extract or transform an input record dict into the final record dict.
- input_columns() -> list[str]
Define input columns - produces DocumentBatch with records.
- output_columns() -> list[str]
Define output columns - produces DocumentBatch with records.
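The three methods listed above form a small contract: declare the input columns, declare the output columns, and map one record dict to another (or to None). The sketch below illustrates that contract only. It does not import nemo_curator, and everything in it is an assumption made for illustration: `DocumentExtractorSketch` is a hypothetical stand-in for the real base class, `ToyLatexExtractor` is not the ArxivExtractor implementation, and the column names `id`, `content`, and `text` are invented. Treating a None return as "skip this record" is likewise an assumption here, not documented behavior.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class DocumentExtractorSketch(ABC):
    """Hypothetical stand-in mirroring the three documented method signatures."""

    @abstractmethod
    def extract(self, record: dict[str, str]) -> Optional[dict[str, Any]]: ...

    @abstractmethod
    def input_columns(self) -> list[str]: ...

    @abstractmethod
    def output_columns(self) -> list[str]: ...


class ToyLatexExtractor(DocumentExtractorSketch):
    """Toy extractor that drops LaTeX comment lines (illustrative only)."""

    def input_columns(self) -> list[str]:
        return ["id", "content"]  # hypothetical column names

    def output_columns(self) -> list[str]:
        return ["id", "text"]  # hypothetical column names

    def extract(self, record: dict[str, str]) -> Optional[dict[str, Any]]:
        # Keep only lines that do not start with a LaTeX comment marker.
        kept = [
            line
            for line in record["content"].splitlines()
            if not line.lstrip().startswith("%")
        ]
        text = "\n".join(kept).strip()
        if not text:
            return None  # assumption: None signals that nothing usable remains
        return {"id": record["id"], "text": text}


extractor = ToyLatexExtractor()
result = extractor.extract({"id": "0001", "content": "% comment\nHello \\LaTeX"})
skipped = extractor.extract({"id": "0002", "content": "% only a comment"})
```

In this sketch, `result` carries exactly the keys named by `output_columns()`, and `skipped` is None because the input held only a comment line. Keeping the returned keys in step with `output_columns()` is a design choice of the example, not a guarantee stated on this page.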