nemo_curator.stages.text.download.arxiv.stage

Module Contents

Classes

| Name | Description |
| --- | --- |
| ArxivDownloadExtractStage | Composite stage for downloading and processing Arxiv data. |

API

class nemo_curator.stages.text.download.arxiv.stage.ArxivDownloadExtractStage(
download_dir: str = './arxiv_downloads',
url_limit: int | None = None,
record_limit: int | None = None,
add_filename_column: bool | str = True,
log_frequency: int = 1000,
verbose: bool = False
)

Bases: DocumentDownloadExtractStage

Composite stage for downloading and processing Arxiv data.

This pipeline:

  1. Generates Arxiv dump URLs
  2. Downloads Arxiv .tar files
  3. Extracts articles from the tar files
  4. Cleans and extracts text from LaTeX files
downloader
extractor = ArxivExtractor()
iterator = ArxivIterator(log_frequency=log_frequency)
name = 'arxiv_pipeline'
url_generator = ArxivUrlGenerator()
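
As a rough illustration of the constructor documented above, the sketch below collects the keyword arguments in one place and builds the stage on demand. The parameter names and defaults come from the signature; the values chosen for `url_limit` and `record_limit`, the reading of what those limits cap, and the helper name `build_arxiv_stage` are assumptions made for this example. The import is deferred into the helper so the snippet can be read and loaded without NeMo Curator installed.

```python
# Keyword arguments mirroring the documented constructor signature.
# The limit values are illustrative; their exact semantics are assumed
# from the parameter names, not confirmed by this page.
STAGE_KWARGS = {
    "download_dir": "./arxiv_downloads",  # documented default
    "url_limit": 1,                       # assumed: cap on dump URLs processed
    "record_limit": 100,                  # assumed: cap on extracted records
    "add_filename_column": True,          # documented default
    "log_frequency": 1000,                # documented default
    "verbose": False,                     # documented default
}


def build_arxiv_stage(**overrides):
    """Hypothetical helper: construct the stage with optional overrides.

    The import happens here so that merely defining this function does
    not require nemo_curator to be installed.
    """
    from nemo_curator.stages.text.download.arxiv.stage import (
        ArxivDownloadExtractStage,
    )

    kwargs = {**STAGE_KWARGS, **overrides}
    return ArxivDownloadExtractStage(**kwargs)
```

With NeMo Curator available, `build_arxiv_stage(verbose=True)` would return a configured stage, on which `get_description()` and `decompose()` can then be called.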
nemo_curator.stages.text.download.arxiv.stage.ArxivDownloadExtractStage.decompose() -> list[nemo_curator.stages.base.ProcessingStage]

Decompose this composite stage into its constituent stages.
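
To make the idea of decomposition concrete, here is a minimal, self-contained sketch of the composite pattern. It does not use or reproduce NeMo Curator's implementation: `SketchStage`, `SketchCompositeStage`, and the sub-stage names are hypothetical stand-ins, chosen only to mirror the four pipeline steps listed for this class.

```python
from dataclasses import dataclass


@dataclass
class SketchStage:
    """Hypothetical stand-in for a single processing stage."""

    name: str


class SketchCompositeStage:
    """Hypothetical composite that expands into one stage per pipeline step."""

    name = "arxiv_pipeline"

    # One entry per step of the documented pipeline, in execution order.
    # These labels are invented for the sketch.
    _steps = ("url_generation", "download", "iterate", "extract")

    def decompose(self) -> list[SketchStage]:
        # A composite does no work itself; it returns the ordered list of
        # concrete stages that an executor would run one after another.
        return [SketchStage(name=f"{step}_{self.name}") for step in self._steps]

    def get_description(self) -> str:
        return f"Composite stage '{self.name}' with {len(self._steps)} sub-stages"


composite = SketchCompositeStage()
for sub_stage in composite.decompose():
    print(sub_stage.name)
print(composite.get_description())
```

The point of the pattern is that callers configure a single object, while the executor only ever sees the flat list of simple stages returned by `decompose()`.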

nemo_curator.stages.text.download.arxiv.stage.ArxivDownloadExtractStage.get_description() -> str

Get a description of this composite stage.