download.arxiv#

Module Contents#

Classes#

ArxivDownloader

Abstract class for downloading remote data to disk

ArxivExtractor

Abstract class for extracting text from records read from disk

ArxivIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

Functions#

download_arxiv

Download Arxiv tar files and extract the contained LaTeX projects.

API#

class download.arxiv.ArxivDownloader(download_dir: str, verbose: bool = False)#

Bases: nemo_curator.download.doc_builder.DocumentDownloader

Abstract class for downloading remote data to disk

Initialization

download(tarfile: str) → str#

class download.arxiv.ArxivExtractor#

Bases: nemo_curator.download.doc_builder.DocumentExtractor

Abstract class for extracting text from records read from disk

Initialization

extract(content: list[str]) → dict[str, str] | None#

class download.arxiv.ArxivIterator(log_frequency: int = 1000)#

Bases: nemo_curator.download.doc_builder.DocumentIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

Initialization

iterate(
file_path: str,
) → collections.abc.Iterator[tuple[dict[str, str], list[str]]]#

download.arxiv.download_arxiv(
output_path: str,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
raw_download_dir: str | None = None,
keep_raw_download: bool = False,
force_download: bool = False,
url_limit: int | None = None,
record_limit: int | None = None,
) → nemo_curator.datasets.DocumentDataset#

Download Arxiv tar files and extract the contained LaTeX projects.

This function obtains a list of Arxiv tar file URLs (via get_arxiv_urls), downloads the tar files, and then extracts the contained LaTeX source files. The resulting documents (after extraction) are assembled into a DocumentDataset.
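The download, iterate, and extract flow described above can be sketched using only the method signatures documented on this page. This is a minimal illustration, not the library's implementation: `process_urls` and the three `Fake*` classes are hypothetical stand-ins that work on in-memory data, and the way `url_limit` and `record_limit` are applied here is an assumption based on the argument descriptions.

```python
from collections.abc import Iterator
from typing import Optional


class FakeDownloader:
    """Hypothetical stand-in exposing the documented download() signature."""

    def download(self, tarfile: str) -> str:
        # A real downloader would fetch the tar file; this only returns a local path.
        return "/tmp/" + tarfile


class FakeIterator:
    """Hypothetical stand-in exposing the documented iterate() signature."""

    def iterate(self, file_path: str) -> Iterator[tuple[dict[str, str], list[str]]]:
        # Yield three (metadata, content) records; the second has no content.
        for i in range(3):
            content = [] if i == 1 else [f"\\section{{Intro {i}}}"]
            yield {"id": f"{file_path}-{i}"}, content


class FakeExtractor:
    """Hypothetical stand-in exposing the documented extract() signature."""

    def extract(self, content: list[str]) -> Optional[dict[str, str]]:
        # Returning None signals that the record produced no usable text.
        if not content:
            return None
        return {"text": "\n".join(content)}


def process_urls(urls, downloader, iterator, extractor,
                 url_limit=None, record_limit=None):
    """Chain download -> iterate -> extract and collect the resulting documents."""
    documents = []
    for url in urls[:url_limit]:  # a limit of None keeps every URL
        path = downloader.download(url)
        for count, (metadata, content) in enumerate(iterator.iterate(path)):
            if record_limit is not None and count >= record_limit:
                break  # assumed meaning: cap on records read per tar file
            extracted = extractor.extract(content)
            if extracted is None:
                continue  # skip records the extractor rejected
            documents.append({**metadata, **extracted})
    return documents


docs = process_urls(["a.tar", "b.tar"], FakeDownloader(), FakeIterator(), FakeExtractor())
print(len(docs))
```

In this sketch the metadata and the extracted fields are merged into one dictionary per document; how the real function assembles records into a DocumentDataset is not specified on this page.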

Args:

output_path (str): The root directory where both the final extracted files and the raw download subdirectory will be stored. The extracted files (in the format specified by output_type) are eventually saved in this directory.

output_type (Literal["jsonl", "parquet"], optional): The file format/extension used for saving the extracted documents (e.g., "jsonl" or "parquet"). Default is "jsonl". This is not used for the output file, but is used to check if an extracted output already exists and read it if so.

raw_download_dir (Optional[str], optional): The directory where the raw downloaded tar files will be kept. If None, a folder named "downloads" under output_path is used.

keep_raw_download (bool, optional): If True, the raw tar files (before extraction) are not removed after processing. Default is False.

force_download (bool, optional): If False, then if an output file already exists for a given URL, re-downloading and re-extraction will be skipped. Default is False.

url_limit (Optional[int], optional): Limits the maximum number of Arxiv tar file URLs to download and process. If None, all available URLs (from get_arxiv_urls) are processed.

record_limit (Optional[int], optional): Limits the maximum number of records to extract from each tar file. If None, all available records are extracted.

Returns:

DocumentDataset: A dataset object containing the extracted documents.
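As a usage sketch, the arguments above might be combined for a small trial run as follows. The helper names `small_run_kwargs` and `run_small_download` are hypothetical, the limits are arbitrary example values, and the import path `nemo_curator.download.arxiv` is an assumption inferred from the module and base-class names on this page. The import is deferred so that the keyword arguments can be inspected without the package installed.

```python
import os
from typing import Any


def small_run_kwargs(output_path: str) -> dict[str, Any]:
    """Build keyword arguments for a small trial run (hypothetical helper)."""
    return {
        "output_path": output_path,
        "output_type": "jsonl",
        # Same location the documentation gives as the default when None is passed.
        "raw_download_dir": os.path.join(output_path, "downloads"),
        "keep_raw_download": False,  # raw tar files are removed after processing
        "force_download": False,     # existing outputs are reused, not re-downloaded
        "url_limit": 2,              # example value: process at most two tar files
        "record_limit": 100,         # example value: at most 100 records per tar file
    }


def run_small_download(output_path: str):
    """Call download_arxiv with the trial settings (requires NeMo Curator)."""
    # Assumed import path; adjust it to match the installed package layout.
    from nemo_curator.download.arxiv import download_arxiv

    return download_arxiv(**small_run_kwargs(output_path))


print(small_run_kwargs("/tmp/arxiv")["raw_download_dir"])
```

Keeping `url_limit` and `record_limit` small is a convenient way to check the output layout before starting a full download.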