nemo_curator.stages.text.download.arxiv.extract

Module Contents

Classes

Name            Description
ArxivExtractor  Extracts text from Arxiv LaTeX files.

API

class nemo_curator.stages.text.download.arxiv.extract.ArxivExtractor()

Bases: DocumentExtractor

Extracts text from Arxiv LaTeX files.

nemo_curator.stages.text.download.arxiv.extract.ArxivExtractor._build_non_arg_macros_dict(
file_content: str
) -> dict[str, str]

Takes the content of a tex file and returns a dictionary containing the definitions of all macros that do not take arguments. The dictionary has the form {macro_name: macro_value}.

@param file_content: the content of the tex file as a string.

@return: a dictionary of the form {macro_name: macro_value}
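
As an illustration of the idea described above, the following is a minimal sketch of collecting argument-free macro definitions with regular expressions. It is not the library's implementation: the function name `build_non_arg_macros`, both regex patterns, and the choice to keep the leading backslash in the dictionary keys are assumptions made for this example. The patterns also ignore macro bodies that contain nested braces.

```python
import re

# Hypothetical patterns for this sketch; the library's own regexes may differ.
# Matches \newcommand{\name}{value} and \renewcommand*{\name}{value}
# only when no [n] argument count sits between the name and the body.
NEWCOMMAND_RE = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{(\\[A-Za-z]+)\}\s*\{([^{}]*)\}"
)
# Matches \def\name{value} only when no #1-style parameters precede the body.
DEF_RE = re.compile(r"\\def\s*(\\[A-Za-z]+)\s*\{([^{}]*)\}")


def build_non_arg_macros(file_content: str) -> dict[str, str]:
    """Return {macro_name: macro_value} for macros defined without arguments."""
    macros: dict[str, str] = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for name, value in pattern.findall(file_content):
            macros[name] = value
    return macros


sample = r"""
\newcommand{\eps}{\varepsilon}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\def\ie{i.e.}
"""

macros = build_non_arg_macros(sample)
# \norm takes one argument, so it is skipped.
print(macros)
```
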

nemo_curator.stages.text.download.arxiv.extract.ArxivExtractor._clean_tex_file(
file_content: str,
arg_macros: dict[str, str],
non_arg_macros: dict[str, str]
) -> str

Takes the content of a tex file and returns a cleaned version. The cleaned version is the tex content with the following modifications:

  • remove all comments (i.e., all lines starting with %)
  • remove everything before the first section-like header
  • remove everything after the first occurrence of either \appendix or \bibliography
  • inline-expand definitions and macros

@param file_content: the content of the tex file as a string.

@return: cleaned tex file as a string
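
The four modifications listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the method's actual code: the function name `clean_tex`, the set of commands treated as section-like headers, and the decision to return an empty string when no header is found are all hypothetical. Only argument-free macros are expanded here; macros with arguments (`arg_macros`) are left out of the sketch.

```python
import re

# Hypothetical patterns for this sketch; the library's own rules may differ.
COMMENT_LINE_RE = re.compile(r"(?m)^[ \t]*%.*\n?")
SECTION_RE = re.compile(
    r"\\(?:chapter|part|section|subsection|paragraph)\*?\s*\{"
)
END_RE = re.compile(r"\\appendix|\\bibliography")


def clean_tex(file_content: str, non_arg_macros: dict[str, str]) -> str:
    # 1. Remove lines that start with a comment character.
    text = COMMENT_LINE_RE.sub("", file_content)

    # 2. Remove everything before the first section-like header.
    start = SECTION_RE.search(text)
    if start is None:
        return ""  # assumption: no header means no usable body
    text = text[start.start():]

    # 3. Remove everything from the first \appendix or \bibliography onward.
    end = END_RE.search(text)
    if end is not None:
        text = text[: end.start()]

    # 4. Inline-expand argument-free macros. The lookahead keeps \eps from
    #    matching inside a longer name such as \epsilon.
    for name, value in non_arg_macros.items():
        pattern = re.compile(re.escape(name) + r"(?![A-Za-z])")
        text = pattern.sub(lambda _match, v=value: v, text)
    return text


sample = r"""\documentclass{article}
\newcommand{\eps}{\varepsilon}
\begin{document}
% a comment line
\section{Intro}
Let $\eps > 0$.
% another comment
\bibliography{refs}
\end{document}
"""

cleaned = clean_tex(sample, {r"\eps": r"\varepsilon"})
print(cleaned)
```

The function-valued replacement in step 4 avoids `re.sub` interpreting backslashes in the macro value as escape sequences.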

nemo_curator.stages.text.download.arxiv.extract.ArxivExtractor.extract(
record: dict[str, str]
) -> dict[str, typing.Any] | None

nemo_curator.stages.text.download.arxiv.extract.ArxivExtractor.input_columns() -> list[str]

nemo_curator.stages.text.download.arxiv.extract.ArxivExtractor.output_columns() -> list[str]