nemo_curator.stages.text.download.arxiv.stage
Module Contents
Classes
API
Bases: DocumentDownloadExtractStage
Composite stage for downloading and processing Arxiv data.
This pipeline:
- Generates Arxiv dump URLs
- Downloads Arxiv .tar files
- Extracts articles from the tar files
- Cleans and extracts text from LaTeX files
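The last two steps of this pipeline (iterating over articles in a tar file, then cleaning LaTeX into text) can be illustrated with a minimal standard-library sketch. This is not NeMo Curator's implementation: the function names `make_sample_tar`, `iterate_tex_files`, and `clean_latex` are hypothetical, the regex-based cleanup is far cruder than a real LaTeX extractor, and a small in-memory archive stands in for the URL generation and download steps so that the example runs without network access.

```python
import io
import re
import tarfile


def make_sample_tar() -> bytes:
    """Build a tiny in-memory .tar holding one LaTeX file.

    Hypothetical stand-in for a downloaded Arxiv dump.
    """
    tex = b"\\section{Intro}\n% a comment\nHello \\textbf{world}.\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="0001.tex")
        info.size = len(tex)
        tar.addfile(info, io.BytesIO(tex))
    return buf.getvalue()


def iterate_tex_files(tar_bytes: bytes):
    """Yield (member name, LaTeX source) for every .tex file in the archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                raw = tar.extractfile(member).read()
                yield member.name, raw.decode("utf-8", errors="replace")


def clean_latex(source: str) -> str:
    """Rough cleanup (an assumption, not the library's algorithm)."""
    # Drop comments: an unescaped % through the end of the line.
    no_comments = re.sub(r"(?<!\\)%.*", "", source)
    # Unwrap simple one-argument commands: \cmd{text} becomes text.
    unwrapped = re.sub(r"\\[a-zA-Z]+\*?\{([^{}]*)\}", r"\1", no_comments)
    # Remove any remaining bare commands.
    no_commands = re.sub(r"\\[a-zA-Z]+\*?", "", unwrapped)
    # Collapse runs of whitespace into single spaces.
    return " ".join(no_commands.split())


# One record per article: the iterate step feeds the extract step.
records = [
    {"id": name, "text": clean_latex(src)}
    for name, src in iterate_tex_files(make_sample_tar())
]
print(records)
```

Keeping iteration and cleaning as separate functions mirrors the split between the `iterator` and `extractor` attributes listed for the stage.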
downloader
extractor
iterator
name
url_generator
Decompose this composite stage into its constituent stages.
Get a description of this composite stage.
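The two methods described above follow a common composite pattern: one object bundles several sub-stages and can hand them back as a flat list for execution. The sketch below is a generic illustration only. The class names `Stage` and `CompositeStage`, the method names `decompose` and `get_description`, and the sub-stage names are all assumptions made for this example; none of them is taken from the NeMo Curator source.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """A single processing step (hypothetical)."""
    name: str


@dataclass
class CompositeStage:
    """Bundles several stages behind one name (hypothetical)."""
    name: str
    stages: list = field(default_factory=list)

    def decompose(self) -> list:
        # Return a copy so callers cannot mutate the internal list.
        return list(self.stages)

    def get_description(self) -> str:
        # Summarize the composite as an ordered chain of its parts.
        return f"{self.name}: " + " -> ".join(s.name for s in self.stages)


# Sub-stage names here echo the attribute names listed above.
composite = CompositeStage(
    name="arxiv_pipeline",
    stages=[
        Stage("url_generator"),
        Stage("downloader"),
        Stage("iterator"),
        Stage("extractor"),
    ],
)

stage_names = [s.name for s in composite.decompose()]
description = composite.get_description()
print(stage_names)
print(description)
```

Returning the constituent stages as a plain list lets an executor schedule each step independently while users configure only the single composite object.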