Download and extract text from ArXiv LaTeX source bundles using Curator.
ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside .tar archives under the s3://arxiv/src/ requester-pays bucket.
The ArXiv pipeline in Curator consists of four stages:
s5cmd (Requester Pays)
You must have s5cmd installed. The examples on this page use s5cmd and include Requester Pays when running the pipeline.
s3://arxiv/src/ uses S3 Requester Pays; you incur charges for listing and data transfer. If you use aws s3, include the flag --request-payer requester and ensure your AWS credentials are active.
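As a rough illustration of the Requester Pays flag mentioned above, the sketch below only assembles an aws s3 ls command line for the bucket prefix and prints it; it does not contact AWS or incur charges. The helper name build_aws_ls_command is hypothetical and is not part of Curator.

```python
import shlex

# Bucket prefix named in the text above.
ARXIV_SRC = "s3://arxiv/src/"


def build_aws_ls_command(prefix: str = ARXIV_SRC) -> list[str]:
    """Return an `aws s3 ls` argv that opts in to Requester Pays.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    return ["aws", "s3", "ls", prefix, "--request-payer", "requester"]


cmd = build_aws_ls_command()
# Print the command instead of running it, so nothing is billed.
print(shlex.join(cmd))
```

Running the printed command yourself requires active AWS credentials, and listing the bucket is billed to your account.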
Create and run an ArXiv processing pipeline and write outputs to JSONL:
For executor options and configuration, refer to Execution Backends.
The extractor returns per-paper text; the filename column is optionally added by the pipeline:
During iteration the pipeline yields id (ArXiv identifier), source_id (tar base name), and content (a list of LaTeX file contents as strings; one element per .tex file). The final extractor stage emits text plus the optional filename column.
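The record shapes described above can be sketched with the standard library alone. This is not Curator's implementation: the in-memory archive, the placeholder identifier, the source_id value, and the simple newline join are all assumptions made for illustration, and the real extractor stage presumably does more LaTeX cleanup than this.

```python
import io
import json
import tarfile


def make_demo_tar() -> bytes:
    """Build a tiny in-memory tar with two .tex files and one non-LaTeX file."""
    files = [
        ("main.tex", "\\section{Intro} Hello."),
        ("appendix.tex", "\\section{Appendix} More."),
        ("fig.png", "not latex"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, body in files:
            data = body.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_tex_files(tar_bytes: bytes) -> list[str]:
    """Return the contents of every .tex member, one string per file."""
    contents = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                extracted = tar.extractfile(member)
                contents.append(extracted.read().decode("utf-8", errors="replace"))
    return contents


# Intermediate record: same field names as in the text, values are placeholders.
record = {
    "id": "0000.00000",           # placeholder, not a real ArXiv identifier
    "source_id": "demo_archive",  # hypothetical tar base name
    "content": read_tex_files(make_demo_tar()),
}

# Final row: a single text field (assumed here to be a plain newline join).
final_row = {"text": "\n".join(record["content"])}

# One JSONL line per paper.
print(json.dumps(final_row))
```

The point of the sketch is the difference in shape: content is a list with one string per .tex file, while the final row carries a single text string.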