stages.text.download.arxiv.stage
Module Contents
Classes
ArxivDownloadExtractStage | Composite stage for downloading and processing Arxiv data.
API
- class stages.text.download.arxiv.stage.ArxivDownloadExtractStage(
- download_dir: str = './arxiv_downloads',
- url_limit: int | None = None,
- record_limit: int | None = None,
- add_filename_column: bool | str = True,
- log_frequency: int = 1000,
- verbose: bool = False,
- )
Bases:
nemo_curator.stages.text.download.DocumentDownloadExtractStage
Composite stage for downloading and processing Arxiv data.
This pipeline:
1. Generates Arxiv dump URLs
2. Downloads Arxiv .tar files
3. Extracts articles from the tar files
4. Cleans and extracts text from LaTeX files
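The four steps above can be sketched with the standard library alone. This is an illustration of the flow, not the stage's implementation: `generate_urls`, `fake_download`, `iterate_tex`, `clean_latex`, and `run_pipeline` are hypothetical names, the URLs are placeholders, the "download" builds a small tar archive in memory instead of touching the network, the cleaning step only strips LaTeX comments, and the output keys (`file_name`, `id`, `text`) are assumed for the example.

```python
import io
import re
import tarfile


def generate_urls(url_limit=None):
    """Step 1 (hypothetical): produce a list of tar URLs, optionally truncated."""
    urls = [f"https://example.invalid/arxiv/src_{i:03d}.tar" for i in range(3)]
    return urls if url_limit is None else urls[:url_limit]


def fake_download(url):
    """Step 2 (stand-in): build a tiny tar archive in memory instead of downloading."""
    files = {
        "paper_a/main.tex": "Hello \\LaTeX{} % a comment\nSecond line\n",
        "paper_a/figure.png": "not text",
        "paper_b/main.tex": "100\\% sure % trailing note\n",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, body in files.items():
            data = body.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def iterate_tex(tar_bytes, record_limit=None):
    """Step 3: yield (name, source) for .tex members, up to record_limit per archive."""
    count = 0
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".tex"):
                continue
            if record_limit is not None and count >= record_limit:
                break
            count += 1
            yield member.name, tar.extractfile(member).read().decode("utf-8")


# A percent sign starts a comment unless it is escaped with a backslash.
COMMENT = re.compile(r"(?<!\\)%.*")


def clean_latex(source):
    """Step 4 (simplified): drop comments and empty lines."""
    lines = [COMMENT.sub("", line).rstrip() for line in source.splitlines()]
    return "\n".join(line for line in lines if line)


def run_pipeline(url_limit=None, record_limit=None):
    docs = []
    for url in generate_urls(url_limit):
        tar_bytes = fake_download(url)
        for name, source in iterate_tex(tar_bytes, record_limit):
            docs.append(
                {
                    "file_name": url.rsplit("/", 1)[-1],  # assumed column name
                    "id": name,
                    "text": clean_latex(source),
                }
            )
    return docs


docs = run_pipeline(url_limit=1)
```

In this sketch `url_limit` caps the number of archives and `record_limit` caps the records taken from each archive, matching the parameter descriptions in the Initialization section.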
Initialization
Download Arxiv tar files and extract the contained LaTeX projects.
This function obtains a list of Arxiv tar file URLs (via get_arxiv_urls), downloads the tar files, and then extracts the contained LaTeX source files. The resulting documents (after extraction) are assembled into a DocumentDataset.
Args:
    download_dir (str, optional): The directory where the raw downloaded tar files will be kept. Defaults to "./arxiv_downloads".
    url_limit (Optional[int], optional): Limits the maximum number of Arxiv tar file URLs to download and process. If None, all available URLs (from get_arxiv_urls) are processed.
    record_limit (Optional[int], optional): Limits the maximum number of records to extract from each tar file. If None, all available records are extracted.
    add_filename_column (bool | str, optional): If True, adds a column to the output DataFrame with the filename of the tar file. If a string, adds a column with the specified name. Defaults to True.
    log_frequency (int, optional): How often to log progress. Defaults to 1000.
    verbose (bool, optional): If True, prints verbose output. Defaults to False.
Returns:
    DocumentBatch: A batch object containing the extracted documents.
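A minimal construction example, using only the keyword arguments documented above. The import path is an assumption inferred from the module name and the base class path on this page; the import is guarded so the snippet still runs where the package is not installed. The column name `source_tar` is an arbitrary choice for illustration.

```python
# Keyword arguments mirror the documented constructor signature.
stage_kwargs = {
    "download_dir": "./arxiv_downloads",
    "url_limit": 2,                       # process at most two tar files
    "record_limit": 100,                  # at most 100 records per tar file
    "add_filename_column": "source_tar",  # custom column name (illustrative)
    "log_frequency": 1000,
    "verbose": False,
}

try:
    # Assumed import path; adjust to match your installed version.
    from nemo_curator.stages.text.download.arxiv.stage import (
        ArxivDownloadExtractStage,
    )
except ImportError:
    ArxivDownloadExtractStage = None

if ArxivDownloadExtractStage is not None:
    stage = ArxivDownloadExtractStage(**stage_kwargs)
    print(stage.get_description())
else:
    stage = None
    print("nemo_curator is not installed; skipping stage construction")
```

Setting small values for `url_limit` and `record_limit` is a convenient way to smoke-test a pipeline before running it on the full set of dumps.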
- decompose() -> list[nemo_curator.stages.base.ProcessingStage]
Decompose this composite stage into its constituent stages.
- get_description() -> str
Get a description of this composite stage.
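To illustrate what decomposing a composite stage means, here is a toy version of the pattern. It does not use NeMo Curator: `ToyStage`, `ToyCompositeStage`, and `run` are hypothetical stand-ins, and the real `ProcessingStage` interface may differ. The idea shown is that a runner can expand a composite into its constituent stages and execute them in order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyStage:
    """Hypothetical leaf stage: applies one function to a batch."""
    name: str
    fn: Callable[[list], list]

    def process(self, batch: list) -> list:
        return self.fn(batch)


@dataclass
class ToyCompositeStage:
    """Hypothetical composite: holds child stages and does no work itself."""
    name: str
    children: list

    def decompose(self) -> list:
        # Return the constituent stages in execution order.
        return list(self.children)

    def get_description(self) -> str:
        return f"{self.name}: " + " -> ".join(c.name for c in self.children)


def run(stages: list, batch: list) -> list:
    """Run stages in order, expanding any composite via decompose()."""
    for stage in stages:
        if hasattr(stage, "decompose"):
            batch = run(stage.decompose(), batch)
        else:
            batch = stage.process(batch)
    return batch


composite = ToyCompositeStage(
    name="toy_download_extract",
    children=[
        ToyStage("generate", lambda _: ["a.tar", "b.tar"]),
        ToyStage("download", lambda urls: [f"local/{u}" for u in urls]),
        ToyStage("extract", lambda paths: [p.upper() for p in paths]),
    ],
)

result = run([composite], [])
description = composite.get_description()
```

Keeping the composite free of processing logic means each constituent stage can be scheduled, scaled, or tested separately, while callers still configure the whole chain through one object.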