ArXiv
Download and extract text from ArXiv LaTeX source bundles using Curator.
ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside .tar archives under the s3://arxiv/src/ prefix of a Requester Pays S3 bucket.
How it Works
The ArXiv pipeline in Curator consists of four stages:
- URL Generation: Lists available ArXiv source tar files from the S3 bucket
- Download: Downloads tar archives using s5cmd (Requester Pays)
- Iteration: Extracts LaTeX projects and yields per-paper records
- Extraction: Cleans LaTeX and produces plain text
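To give a feel for what the Extraction stage does, here is a minimal sketch of LaTeX cleanup using only the Python standard library. It is not Curator's extractor: the function name latex_to_text and the specific regex rules are assumptions made for illustration, and real LaTeX needs far more careful handling (macros, math, nested braces).

```python
import re


def latex_to_text(source: str) -> str:
    """Rough LaTeX-to-text cleanup (illustrative only, not Curator's extractor)."""
    # Keep only the document body when \begin{document} ... \end{document} is present.
    body = re.search(
        r"\\begin\{document\}(.*?)\\end\{document\}", source, flags=re.DOTALL
    )
    text = body.group(1) if body else source
    # Remove comments: an unescaped % through the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", text)
    # Unwrap simple one-argument commands, e.g. \emph{word} -> word.
    text = re.sub(r"\\[a-zA-Z]+\*?\{([^{}]*)\}", r"\1", text)
    # Drop remaining bare commands, e.g. \maketitle.
    text = re.sub(r"\\[a-zA-Z]+\*?", "", text)
    # Normalize whitespace: collapse runs of spaces, trim lines, limit blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.splitlines()]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip()


sample = (
    "\\documentclass{article}\n"
    "\\begin{document}\n"
    "\\section{Intro} % heading comment\n"
    "This is \\emph{important} text.\n"
    "\\end{document}\n"
)
cleaned = latex_to_text(sample)
```

The sketch unwraps one level of braces only, so nested commands and math environments pass through partly cleaned.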
Before You Start
You must have:
- An AWS account with credentials configured (profile, environment, or instance role). Access to s3://arxiv/src/ uses S3 Requester Pays; you incur charges for listing and data transfer. If you use aws s3, include the flag --request-payer requester and ensure your AWS credentials are active.
- s5cmd installed
The examples on this page use s5cmd and include Requester Pays when running the pipeline.
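The prerequisites above can be checked before running anything that costs money. The following is a small sketch, not part of Curator: the helper names aws_list_command and missing_tools are hypothetical. It builds the aws s3 listing command with the Requester Pays flag described above and reports whether s5cmd is on PATH.

```python
import shutil

# Bucket prefix taken from the text above.
BUCKET_PREFIX = "s3://arxiv/src/"


def aws_list_command(prefix: str = BUCKET_PREFIX) -> list[str]:
    """Build an `aws s3 ls` command that opts in to Requester Pays."""
    return ["aws", "s3", "ls", prefix, "--request-payer", "requester"]


def missing_tools(tools: tuple[str, ...] = ("s5cmd",)) -> list[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Passing the result of aws_list_command() to subprocess.run would issue a billable listing request, so the sketch only builds the argument list. Call missing_tools() first and install anything it returns.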
Usage
Create and run an ArXiv processing pipeline and write outputs to JSONL:
For executor options and configuration, refer to Execution Backends.
Parameters
Output Format
The extractor returns per-paper text; the filename column is optionally added by the pipeline:
During iteration the pipeline yields id (ArXiv identifier), source_id (tar base name), and content (a list of LaTeX file contents as strings; one element per .tex file). The final extractor stage emits text plus the optional filename column.
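The record shape described above can be illustrated with a standard-library sketch. It assumes a simplified archive layout of one directory per paper containing .tex files; that layout, the function names, the paper identifiers, and the source_id value are all hypothetical and may not match real ArXiv bundles or Curator's iterator.

```python
import io
import json
import tarfile


def iter_paper_records(tar_bytes: bytes, source_id: str):
    """Yield one record per paper from a tar laid out as <paper_id>/<name>.tex.

    The directory-per-paper layout is an assumption made for this sketch.
    """
    papers: dict[str, list[str]] = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            # Skip directories and non-LaTeX files such as figures.
            if not (member.isfile() and member.name.endswith(".tex")):
                continue
            paper_id = member.name.split("/", 1)[0]
            data = archive.extractfile(member).read()
            papers.setdefault(paper_id, []).append(
                data.decode("utf-8", errors="replace")
            )
    for paper_id, content in papers.items():
        # content holds one string per .tex file, as described above.
        yield {"id": paper_id, "source_id": source_id, "content": content}


def build_demo_tar(files: dict[str, str]) -> bytes:
    """Create a small in-memory tar so the sketch runs without any download."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_tar = build_demo_tar(
    {
        "2301.00001/main.tex": "\\section{Intro} First paper.",
        "2301.00001/appendix.tex": "\\section{Appendix} Details.",
        "2301.00001/figure.png": "not latex",
        "2301.00002/paper.tex": "\\section{Intro} Second paper.",
    }
)
records = list(iter_paper_records(demo_tar, source_id="demo_bundle"))
# One JSON object per line, mirroring a JSONL output file.
jsonl = "\n".join(json.dumps(record) for record in records)
```

Each element of records carries id, source_id, and content; a later extraction step would reduce content to a single text field.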