ArXiv
Download and extract text from ArXiv LaTeX source bundles using Curator.
ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside .tar archives under the s3://arxiv/src/ requester-pays bucket.
How it Works
The ArXiv pipeline in Curator consists of four stages:
URL Generation: Lists available ArXiv source tar files from the S3 bucket
Download: Downloads .tar archives using s5cmd (Requester Pays)
Iteration: Extracts LaTeX projects and yields per-paper records
Extraction: Cleans LaTeX and produces plain text
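The four stages can be pictured as a chain of small functions. The sketch below is a conceptual illustration only and is not Curator's implementation: the function names, the in-memory `FAKE_BUCKET`, and the bundle names are hypothetical stand-ins, and the "download" step is a dictionary lookup rather than an S3 transfer.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical in-memory stand-in for the S3 bucket: tar name -> {paper id -> .tex contents}.
FAKE_BUCKET: Dict[str, Dict[str, List[str]]] = {
    "bundle_a.tar": {"paper-1": [r"\section{A} Hello % note"]},
    "bundle_b.tar": {"paper-2": [r"\section{B} World"]},
}


def generate_urls(limit: Optional[int] = None) -> List[str]:
    """Stage 1: list the available tar files (here, the keys of a dict)."""
    urls = sorted(FAKE_BUCKET)
    return urls if limit is None else urls[:limit]


def download(url: str) -> Dict[str, List[str]]:
    """Stage 2: fetch one archive (here, a lookup instead of s5cmd)."""
    return FAKE_BUCKET[url]


def iterate(url: str, archive: Dict[str, List[str]]) -> Iterator[dict]:
    """Stage 3: yield one record per paper."""
    for paper_id, tex_files in archive.items():
        yield {"id": paper_id, "source_id": url, "content": tex_files}


def extract(record: dict) -> dict:
    """Stage 4: reduce LaTeX to text (here, only trailing comments are dropped)."""
    text = "\n".join(f.split("%")[0].strip() for f in record["content"])
    return {"text": text, "file_name": record["source_id"]}


def run(limit: Optional[int] = None) -> List[dict]:
    """Chain the four stages in order."""
    out = []
    for url in generate_urls(limit):
        for record in iterate(url, download(url)):
            out.append(extract(record))
    return out


results = run(limit=1)
print(results)
```

The `limit` argument plays the role that `url_limit` plays in the real stage: it caps how many archives enter the chain.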
Before You Start
You must have:
An AWS account with credentials configured (profile, environment, or instance role). Access to s3://arxiv/src/ uses S3 Requester Pays; you incur charges for listing and data transfer. If you use aws s3, include the flag --request-payer requester and ensure your AWS credentials are active.
# Install s5cmd for requester-pays S3 downloads
pip install s5cmd
The examples on this page use s5cmd, which supports Requester Pays automatically.
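If you want to fetch a single archive by hand before running the pipeline, the commands might look like the ones built below. This is a sketch under assumptions: the archive name `example.tar` and both helper functions are hypothetical, the commands are only constructed and never executed, and the flag placement reflects the tools' documented `--request-payer` option, so check `s5cmd --help` and `aws s3 cp help` for your installed versions.

```python
import shutil
from typing import List

ARXIV_PREFIX = "s3://arxiv/src/"  # bucket prefix named on this page


def s5cmd_copy_command(tar_name: str, download_dir: str) -> List[str]:
    """Build (but do not run) an s5cmd copy for one source archive."""
    return [
        "s5cmd",
        "--request-payer", "requester",  # global option: the caller pays
        "cp",
        ARXIV_PREFIX + tar_name,
        f"{download_dir}/{tar_name}",
    ]


def aws_cli_copy_command(tar_name: str, download_dir: str) -> List[str]:
    """Build (but do not run) the equivalent aws s3 copy."""
    return [
        "aws", "s3", "cp",
        ARXIV_PREFIX + tar_name,
        f"{download_dir}/{tar_name}",
        "--request-payer", "requester",
    ]


# True only if the s5cmd binary is on PATH.
s5cmd_available = shutil.which("s5cmd") is not None

# "example.tar" is a placeholder, not a real object key.
print(" ".join(s5cmd_copy_command("example.tar", "./arxiv_downloads")))
print("s5cmd on PATH:", s5cmd_available)
```

Building the command as a list keeps it ready for `subprocess.run` without shell quoting issues, should you decide to execute it.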
Usage
Create and run an ArXiv processing pipeline and write outputs to JSONL:
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import ArxivDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter


def main():
    pipeline = Pipeline(
        name="arxiv_pipeline",
        description="Download and process ArXiv LaTeX sources",
    )

    # Add ArXiv stage
    arxiv_stage = ArxivDownloadExtractStage(
        download_dir="./arxiv_downloads",
        url_limit=5,  # optional: number of tar files to process
        record_limit=1000,  # optional: max papers per tar
        add_filename_column=True,
        verbose=True,
    )
    pipeline.add_stage(arxiv_stage)

    # Add writer stage
    writer = JsonlWriter(path="./arxiv_output")
    pipeline.add_stage(writer)

    # Execute
    results = pipeline.run()
    print(f"Completed with {len(results) if results else 0} output files")


if __name__ == "__main__":
    main()
For executor options and configuration, refer to Pipeline Execution Backends.
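After a run, you may want to sanity-check what was written. The sketch below reads JSONL files back and counts papers and extracted characters. It assumes the writer produces files ending in `.jsonl` directly inside the output directory, which this page does not confirm; the helper names are hypothetical, and the demo writes its own throwaway sample so it runs without the pipeline.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_records(output_dir: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of every .jsonl file in output_dir."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def summarize(output_dir: str) -> dict:
    """Count papers and total extracted characters."""
    records = list(read_records(output_dir))
    return {
        "papers": len(records),
        "characters": sum(len(r.get("text", "")) for r in records),
    }


# Demo on a temporary directory; the rows are made-up sample data.
with tempfile.TemporaryDirectory() as tmp:
    sample = [
        {"text": "alpha", "file_name": "toy_bundle.tar"},
        {"text": "beta", "file_name": "toy_bundle.tar"},
    ]
    Path(tmp, "part_0.jsonl").write_text(
        "\n".join(json.dumps(row) for row in sample) + "\n", encoding="utf-8"
    )
    summary = summarize(tmp)

print(summary)
```

To inspect real output, call `summarize("./arxiv_output")` and adjust the glob pattern if your files are named differently.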
Parameters
Parameter | Type | Description | Default |
---|---|---|---|
download_dir | str | Directory to store downloaded .tar archives | "./arxiv_downloads" |
url_limit | int \| None | Maximum number of ArXiv tar files to download (useful for testing) | None |
record_limit | int \| None | Maximum number of papers to extract per tar file | None |
add_filename_column | bool \| str | Whether to add a source filename column to output; if str, use it as the column name | True (column name defaults to file_name) |
 | int | How often to log progress while iterating papers | 1000 |
verbose | bool | Enable verbose logging during download | False |
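Because add_filename_column accepts either a bool or a str, it helps to spell out how the three cases behave. The sketch below is an illustration of the semantics described above, not Curator's code: both helper functions are hypothetical, and the default column name `file_name` is taken from the example in the Output Format section.

```python
from typing import Optional, Union

DEFAULT_FILENAME_COLUMN = "file_name"  # name used in the Output Format example


def resolve_filename_column(add_filename_column: Union[bool, str]) -> Optional[str]:
    """Return the column name to write, or None if no column is added."""
    if isinstance(add_filename_column, str):
        return add_filename_column  # custom column name
    return DEFAULT_FILENAME_COLUMN if add_filename_column else None


def attach_filename(
    record: dict, source_file: str, add_filename_column: Union[bool, str] = True
) -> dict:
    """Return a copy of record with the source filename attached when enabled."""
    column = resolve_filename_column(add_filename_column)
    return dict(record) if column is None else {**record, column: source_file}


row = {"text": "Main body text"}
print(attach_filename(row, "toy_bundle.tar"))
print(attach_filename(row, "toy_bundle.tar", add_filename_column="source_tar"))
print(attach_filename(row, "toy_bundle.tar", add_filename_column=False))
```

Passing a string is useful when `file_name` would collide with a column produced by another stage.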
Output Format
The extractor returns per-paper text; the filename column is optionally added by the pipeline:
{
  "text": "Main body text extracted from LaTeX after cleaning...",
  "file_name": "arXiv_src_2024_01.tar"
}
Field | Description |
---|---|
text | Extracted and cleaned paper text (LaTeX macros inlined where supported, comments and references removed) |
file_name | Optional. Name of the source tar file (enabled by add_filename_column) |
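To make "comments and references removed" concrete, here is a minimal regex-based sketch. It is not Curator's extractor: `clean_latex` is a hypothetical helper, it does not inline macros or convert markup to plain text, and it assumes references live in a `thebibliography` environment. It also treats any `%` not preceded by a backslash as a comment start, which misreads rare cases such as `\\%`.

```python
import re

# A % that is not escaped as \% starts a comment running to the end of the line.
COMMENT = re.compile(r"(?<!\\)%.*")
# Everything between \begin{thebibliography} and \end{thebibliography}.
BIBLIOGRAPHY = re.compile(
    r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}", re.DOTALL
)


def clean_latex(source: str) -> str:
    """Strip comments, the bibliography block, and blank lines."""
    no_comments = COMMENT.sub("", source)
    no_refs = BIBLIOGRAPHY.sub("", no_comments)
    lines = [line.rstrip() for line in no_refs.splitlines()]
    return "\n".join(line for line in lines if line.strip())


sample = r"""\section{Intro} % draft heading
Accuracy improved by 5\% overall.
% a full-line comment
\begin{thebibliography}{9}
\bibitem{a} Some reference.
\end{thebibliography}
"""

cleaned = clean_latex(sample)
print(cleaned)
```

Comments are removed before the bibliography so that a commented-out `\end{thebibliography}` cannot end the match early.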
During iteration the pipeline yields id (ArXiv identifier), source_id (tar base name), and content (a list of LaTeX file contents as strings; one element per .tex file). The final extractor stage emits text plus the optional filename column.
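The shape of the intermediate records can be illustrated with Python's standard `tarfile` module. The sketch below is a toy under stated assumptions: the archive layout (one folder of files per paper), the paper identifiers, and both helper functions are invented for illustration and may not match how the real ArXiv bundles or Curator's iterator are organized.

```python
import io
import tarfile
from collections import defaultdict
from typing import Dict, Iterator, List


def build_toy_bundle(papers: Dict[str, Dict[str, str]]) -> bytes:
    """Create an in-memory tar with one folder of files per paper."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for paper_id, files in papers.items():
            for name, body in files.items():
                data = body.encode("utf-8")
                info = tarfile.TarInfo(name=f"{paper_id}/{name}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def iterate_papers(bundle: bytes, source_id: str) -> Iterator[dict]:
    """Yield one {id, source_id, content} record per paper folder."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r") as tar:
        for member in tar.getmembers():
            # Only .tex files contribute to content; other files are skipped.
            if member.isfile() and member.name.endswith(".tex"):
                paper_id = member.name.split("/", 1)[0]
                text = tar.extractfile(member).read().decode("utf-8")
                grouped[paper_id].append(text)
    for paper_id, content in grouped.items():
        yield {"id": paper_id, "source_id": source_id, "content": content}


bundle = build_toy_bundle({
    "2401.00001": {"main.tex": "\\section{Intro}", "appendix.tex": "\\section{Extra}"},
    "2401.00002": {"paper.tex": "\\title{Other}", "figure.png": "not latex"},
})
records = list(iterate_papers(bundle, source_id="toy_bundle"))
for record in records:
    print(record["id"], record["source_id"], len(record["content"]))
```

Each record keeps the .tex files as separate strings, so a later extraction step can decide how to order or merge them before producing text.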