nemo_curator.stages.text.download.arxiv.iterator

View as Markdown

Module Contents

Classes

NameDescription
ArxivIteratorProcesses downloaded Arxiv files and extracts article content.

API

class nemo_curator.stages.text.download.arxiv.iterator.ArxivIterator(
log_frequency: int = 1000
)

Bases: DocumentIterator

Processes downloaded Arxiv files and extracts article content.

_counter
= 0
nemo_curator.stages.text.download.arxiv.iterator.ArxivIterator._format_arxiv_id(
arxiv_id: str
) -> str

this function brings the raw arxiv-id into a format compliant with the specification from arxiv. This is used to create the url to the arxiv abstract page.

  • Format prior to March 2007: <archive>/YYMMNNN where N is a 3-digit number
  • Format after March 2007: <archive>/YYMM.NNNNN where N is a 5 (or 6)-digit number

References: https://info.arxiv.org/help/arxiv_identifier.html

@param arxiv_id: raw arxiv id which can be in one of the following formats:

  • <archive><YY><MM><NNN>
  • <YY><MM><NNNNN|NNNNNN>

@return: formatted arxiv id

nemo_curator.stages.text.download.arxiv.iterator.ArxivIterator._tex_proj_loader(
file_or_dir_path: str
) -> list[str] | None

function to load the tex files from a tar file or a gzip file. The function will return a tuple containing a list of tex files and the timestamp of the project.

@param file_or_dir_path: path to the tar file or the gzip file

@return: tuple containing a list of tex files and the timestamp of the project

nemo_curator.stages.text.download.arxiv.iterator.ArxivIterator.iterate(
file_path: str
) -> collections.abc.Iterator[dict[str, typing.Any]]
nemo_curator.stages.text.download.arxiv.iterator.ArxivIterator.output_columns() -> list[str]