ArXiv
Download and extract text from ArXiv papers using NeMo Curator utilities.
ArXiv is a free distribution service and open-access archive for scholarly articles, primarily in physics, mathematics, computer science, and related fields. It hosts millions of papers, most of them available in LaTeX source format.
How it Works
NeMo Curator simplifies the process of:
Downloading ArXiv papers from S3
Extracting text from LaTeX source files
Converting the content to a standardized format for further processing
Before You Start
ArXiv papers are hosted on Amazon S3, so you’ll need to have:
Properly configured AWS credentials in ~/.aws/config
s5cmd installed (pre-installed in the NVIDIA NeMo Framework Container)
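Before launching a download, it can help to confirm both prerequisites up front. The following is a minimal sketch, not part of NeMo Curator: `check_arxiv_prerequisites` is a hypothetical helper that only checks that a file exists at `~/.aws/config` and that an `s5cmd` executable is on `PATH`. It does not validate the credentials themselves.

```python
import shutil
from pathlib import Path


def check_arxiv_prerequisites(aws_config=None):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper for illustration only (not a NeMo Curator API).
    """
    # Default to the location named in the prerequisites above.
    config_path = Path(aws_config) if aws_config else Path.home() / ".aws" / "config"
    problems = []
    if not config_path.is_file():
        problems.append(f"AWS config not found at {config_path}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("s5cmd") is None:
        problems.append("s5cmd is not on PATH")
    return problems


for problem in check_arxiv_prerequisites():
    print("Missing prerequisite:", problem)
```

An empty result only means the files and binary are present; a failed S3 request can still indicate invalid or expired credentials.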
Usage
Here’s how to download and extract ArXiv data using NeMo Curator:
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.download import download_arxiv
# Initialize a Dask client
client = get_client(cluster_type="cpu")
# Download and extract ArXiv papers
arxiv_dataset = download_arxiv(output_path="/extracted/output/folder")
# Write the dataset to disk
arxiv_dataset.to_json(output_path="/extracted/output/folder", write_to_filename=True)
From the command line, use the download_and_extract script instead:
download_and_extract \
--input-url-file=./arxiv_urls.txt \
--builder-config-file=./config/arxiv_builder.yaml \
--output-json-dir=/datasets/arxiv/json
The config file should look like:
download_module: nemo_curator.download.arxiv.ArxivDownloader
download_params: {}
iterator_module: nemo_curator.download.arxiv.ArxivIterator
iterator_params: {}
extract_module: nemo_curator.download.arxiv.ArxivExtractor
extract_params: {}
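Each `*_module` entry in the config is a dotted path that names a class. As a rough illustration of how such a path can be resolved in Python, the sketch below splits the path at its last dot into a module and an attribute. This shows the general technique only, not NeMo Curator's actual loader; `split_dotted_path` and `resolve_dotted_path` are hypothetical names.

```python
import importlib


def split_dotted_path(dotted_path):
    """Split 'package.module.ClassName' into ('package.module', 'ClassName')."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    return module_name, attr_name


def resolve_dotted_path(dotted_path):
    """Import the module and return the named attribute (usually a class)."""
    module_name, attr_name = split_dotted_path(dotted_path)
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# The module keys from the config above, written as a plain dict for illustration.
builder_config = {
    "download_module": "nemo_curator.download.arxiv.ArxivDownloader",
    "iterator_module": "nemo_curator.download.arxiv.ArxivIterator",
    "extract_module": "nemo_curator.download.arxiv.ArxivExtractor",
}

for key, path in builder_config.items():
    # Only split here; resolving would require NeMo Curator to be installed.
    print(key, "->", split_dotted_path(path))
```

A typo in a module path would surface as an `ImportError` or `AttributeError` in `resolve_dotted_path`, which is a quick way to sanity-check a config before a long run.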
If you’ve already downloaded and extracted ArXiv data to the specified output folder, NeMo Curator will read from those files instead of downloading them again.
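The reuse behavior described above follows a common skip-if-present pattern. The sketch below illustrates the idea only: `load_or_build` and its arguments are hypothetical and do not reflect NeMo Curator's internal code.

```python
import json
from pathlib import Path


def load_or_build(output_file, build_records):
    """Return records from output_file if it exists; otherwise build and save them.

    build_records is a zero-argument callable standing in for the expensive
    download-and-extract step.
    """
    output_file = Path(output_file)
    if output_file.exists():
        # Reuse earlier results: one JSON record per line.
        with output_file.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]

    # Nothing cached yet: do the work once and persist it for next time.
    records = list(build_records())
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return records
```

With this pattern, deleting the output file (or pointing at a fresh folder) is what triggers a new download.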
Text Processing with Stop Words
When processing academic papers from ArXiv, you may want to customize text extraction and analysis using stop words. Stop words can help identify section boundaries, distinguish main content from references, and support language-specific processing. For a comprehensive guide to stop words in NeMo Curator, see Stop Words in Text Processing.
Parameters
Parameter | Type | Description | Default |
---|---|---|---|
output_path | str | Path where the extracted files will be placed | Required |
output_type | Literal["jsonl", "parquet"] | File format for storing data | "jsonl" |
raw_download_dir | Optional[str] | Directory where the raw ArXiv files are downloaded | None |
keep_raw_download | bool | Whether to keep the raw downloaded files | False |
force_download | bool | Whether to force re-download even if files exist | False |
url_limit | Optional[int] | Limit the number of papers downloaded (useful for testing) | None |
record_limit | Optional[int] | Limit the number of records processed | None |
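For a quick smoke test, the two limits can be combined to keep a run small. The sketch below assumes the keyword names `output_path`, `output_type`, `raw_download_dir`, `url_limit`, and `record_limit` on `download_arxiv`; check them against the signature in your installed version. The helper names are hypothetical and the limit values are arbitrary.

```python
def small_run_kwargs(output_path, raw_download_dir=None):
    """Build keyword arguments for a deliberately small download (hypothetical helper)."""
    return {
        "output_path": output_path,
        "output_type": "jsonl",
        "raw_download_dir": raw_download_dir,
        "url_limit": 1,      # assumed name: caps how much is downloaded
        "record_limit": 10,  # assumed name: caps how many records are processed
    }


def run_small_download(output_path, raw_download_dir=None):
    """Call download_arxiv with the small-run settings.

    Requires NeMo Curator, AWS credentials, s5cmd, and a running Dask client.
    """
    # Imported lazily so small_run_kwargs can be used without NeMo Curator installed.
    from nemo_curator.download import download_arxiv

    return download_arxiv(**small_run_kwargs(output_path, raw_download_dir))
```

Once the small run produces sensible output, dropping the two limits (leaving them at `None`) removes the caps.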
Output Format
NeMo Curator extracts and processes the main text content from LaTeX source files. The extractor focuses on the body text of papers, automatically removing:
Comments and LaTeX markup
Content before the first section header
Bibliography and appendix sections
LaTeX macro definitions (while expanding their usage)
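To make the cleanup steps concrete, here is a heavily simplified sketch using regular expressions. It is not NeMo Curator's extractor: `rough_latex_body` is a hypothetical function, it neither expands macros nor strips general LaTeX markup, and it only recognizes `\section`, `\appendix`, `\bibliography`, and the `thebibliography` environment.

```python
import re

# A '%' starts a comment unless it is escaped as '\%'.
_COMMENT = re.compile(r"(?<!\\)%.*")
# The first section header marks the start of the body.
_FIRST_SECTION = re.compile(r"\\section\*?\s*\{")
# Anything from these markers onward is treated as back matter.
_BACK_MATTER = re.compile(r"\\appendix\b|\\bibliography\b|\\begin\{thebibliography\}")


def rough_latex_body(source):
    """Return an approximation of the body text of a LaTeX document."""
    # 1. Strip comments (each match runs to the end of its line).
    text = _COMMENT.sub("", source)
    # 2. Drop everything before the first section header, if there is one.
    first = _FIRST_SECTION.search(text)
    if first:
        text = text[first.start():]
    # 3. Drop the appendix and bibliography, if present.
    back = _BACK_MATTER.search(text)
    if back:
        text = text[:back.start()]
    return text.strip()


sample = r"""
\documentclass{article}
\title{Example} % preamble content is dropped
\begin{document}
\section{Introduction}
Accuracy improved by 5\% over the baseline. % a comment
\appendix
\section{Extra}
\bibliography{refs}
\end{document}
"""
print(rough_latex_body(sample))
```

A real extractor has to handle many more cases, such as multi-file projects, custom sectioning commands, and macro expansion, which is why this sketch should be read as an outline of the steps rather than a substitute.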
Limited Metadata Extraction
The current ArXiv implementation focuses on text extraction and does not parse document metadata like titles, authors, or categories from the LaTeX source. Only the processed text content and basic file identifiers are returned.
Field | Type | Description |
---|---|---|
text | str | The main text content extracted from LaTeX files (cleaned and processed) |
id | str | A unique identifier for the paper (formatted ArXiv ID) |
source_id | str | The source tar file name where the paper was found |
file_name | str | The filename used for the output file |
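Downstream code can read JSONL output one record per line. The sketch below assumes the four fields are named `text`, `id`, `source_id`, and `file_name`; the sample record is invented for illustration, so verify the names and values against your own output files.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample record; real output values will differ.
sample_jsonl = io.StringIO(
    json.dumps(
        {
            "text": "We study ...",
            "id": "example-id",
            "source_id": "example.tar",
            "file_name": "example.jsonl",
        }
    )
    + "\n"
)

for record in iter_records(sample_jsonl):
    # All four values are strings according to the table above.
    print(record["id"], record["source_id"], len(record["text"]))
```

Because `iter_records` accepts any iterable of lines, the same function works on an open file handle for a real output file.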