nemo_curator.stages.text.download.arxiv.extract
Module Contents
Classes
API
Bases: DocumentExtractor
Extracts text from Arxiv LaTeX files.
Takes the content of a tex file and returns a dictionary containing the definitions of all macros that take no arguments. The dictionary has the form {macro_name: macro_value}.
@param file_content: the content of the tex file as a string.
@return: dict
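To illustrate the kind of mapping described above, here is a minimal sketch of argument-free macro extraction. It is not the NeMo Curator implementation: the helper name `find_simple_macros` and both regular expressions are assumptions made for this example, and the patterns deliberately skip macro values that contain nested braces.

```python
from __future__ import annotations

import re

# Hypothetical patterns for illustration only.
# \newcommand{\name}{value} or \renewcommand*{\name}{value}, with no [n] argument count.
NEWCOMMAND = re.compile(r"\\(?:re)?newcommand\*?\{(\\[A-Za-z]+)\}\{([^{}]*)\}")
# \def\name{value}, with no parameter text between the name and the body.
DEF = re.compile(r"\\def\s*(\\[A-Za-z]+)\{([^{}]*)\}")


def find_simple_macros(file_content: str) -> dict[str, str]:
    """Return {macro_name: macro_value} for macros defined without arguments."""
    macros: dict[str, str] = {}
    for pattern in (NEWCOMMAND, DEF):
        for name, value in pattern.findall(file_content):
            macros[name] = value
    return macros


sample = (
    "\\newcommand{\\eg}{e.g.}\n"
    "\\def\\RR{real numbers}\n"
    "\\newcommand{\\norm}[1]{\\|#1\\|}\n"  # takes an argument, so it is skipped
)
print(find_simple_macros(sample))
```

In this sketch the macro name keeps its leading backslash, which makes later substitution in the document body straightforward. Whether the library stores names the same way is not stated in this page.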
Takes the content of a tex file as input and returns a cleaned version. The cleaned version is a concatenation of the tex files with the following modifications:
- remove all comments (i.e. all lines starting with %)
- remove everything before the first section-like header
- remove everything after the first occurrence of either \appendix or \bibliography
- inline-expand definitions and macros
@param file_content: the content of the tex file as a string.
@return: cleaned tex file as a string
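The four modifications listed above can be sketched as a short pipeline. This is a hedged illustration rather than the library's code: the function name `clean_tex_sketch`, its `macros` parameter, and the set of commands treated as "section-like" (`\chapter`, `\part`, `\section`, `\subsection`) are all assumptions made for this example.

```python
from __future__ import annotations

import re

# Assumed set of "section-like" headers; the real list may differ.
HEADER = re.compile(r"\\(?:chapter|part|section|subsection)\*?\{")
# \b keeps \bibliographystyle from matching as \bibliography.
TAIL = re.compile(r"\\(?:appendix|bibliography)\b")


def clean_tex_sketch(file_content: str, macros: dict[str, str]) -> str:
    """Apply the four cleaning steps to the content of one tex file."""
    # 1. Remove comments, i.e. all lines starting with %.
    lines = [ln for ln in file_content.splitlines() if not ln.lstrip().startswith("%")]
    text = "\n".join(lines)

    # 2. Remove everything before the first section-like header.
    header = HEADER.search(text)
    if header:
        text = text[header.start():]

    # 3. Remove everything from the first \appendix or \bibliography onward.
    tail = TAIL.search(text)
    if tail:
        text = text[: tail.start()]

    # 4. Inline-expand argument-free macros. The lookahead avoids replacing
    #    a macro name that is only a prefix of a longer command name.
    for name, value in macros.items():
        pattern = re.escape(name) + r"(?![A-Za-z])"
        text = re.sub(pattern, lambda _m, v=value: v, text)
    return text


sample = (
    "% preamble comment\n"
    "\\documentclass{article}\n"
    "\\section{Intro}\n"
    "We use \\eg macros.\n"
    "% hidden\n"
    "\\appendix\n"
    "Extra\n"
)
print(clean_tex_sketch(sample, {"\\eg": "e.g."}))
```

The replacement is passed as a function rather than a string so that backslashes in macro values are inserted literally instead of being interpreted as regex escapes.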