ArXiv

Download and extract text from ArXiv LaTeX source bundles using Curator.

ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside .tar archives under the s3://arxiv/src/ requester-pays bucket.

How it Works

The ArXiv pipeline in Curator consists of four stages:

  1. URL Generation: Lists available ArXiv source tar files from the S3 bucket
  2. Download: Downloads tar archives using s5cmd (Requester Pays)
  3. Iteration: Extracts LaTeX projects and yields per-paper records
  4. Extraction: Cleans LaTeX and produces plain text

Before You Start

You must have:

  • An AWS account with credentials configured (profile, environment, or instance role). Access to s3://arxiv/src/ uses S3 Requester Pays; you incur charges for listing and data transfer. If you use aws s3, include the flag --request-payer requester and ensure your AWS credentials are active.
  • s5cmd installed
    # Install s5cmd for requester-pays S3 downloads
    pip install s5cmd

The examples on this page use s5cmd and include Requester Pays when running the pipeline.
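
As a pre-flight check, the sketch below builds an s5cmd argument list for a Requester Pays copy and tests whether an s5cmd executable is on PATH. It is an illustration, not the command Curator constructs internally: the helper names and the archive name are hypothetical, and it assumes s5cmd's global `--request-payer=requester` flag, which goes before the subcommand.

```python
import shutil


def build_s5cmd_copy(source_url: str, dest_dir: str) -> list[str]:
    """Build an s5cmd argument list for a Requester Pays download."""
    return [
        "s5cmd",
        "--request-payer=requester",  # assumed global flag; placed before the subcommand
        "cp",
        source_url,
        dest_dir,
    ]


def s5cmd_available() -> bool:
    """Return True if an s5cmd executable can be found on PATH."""
    return shutil.which("s5cmd") is not None


if __name__ == "__main__":
    # Hypothetical archive name, used only for illustration.
    cmd = build_s5cmd_copy("s3://arxiv/src/example.tar", "./arxiv_downloads/")
    print(" ".join(cmd))
    if not s5cmd_available():
        print("s5cmd not found on PATH; install it before running the pipeline")
```

The argument list can be passed to `subprocess.run` once credentials are configured; running it incurs Requester Pays transfer charges.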


Usage

Create and run an ArXiv processing pipeline and write outputs to JSONL:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import ArxivDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    pipeline = Pipeline(
        name="arxiv_pipeline",
        description="Download and process ArXiv LaTeX sources"
    )

    # Add ArXiv stage
    arxiv_stage = ArxivDownloadExtractStage(
        download_dir="./arxiv_downloads",
        url_limit=5,  # optional: number of tar files to process
        record_limit=1000,  # optional: max papers per tar
        add_filename_column=True,
        verbose=True,
    )
    pipeline.add_stage(arxiv_stage)

    # Add writer stage
    writer = JsonlWriter(path="./arxiv_output")
    pipeline.add_stage(writer)

    # Execute
    results = pipeline.run()
    print(f"Completed with {len(results) if results else 0} output files")

    # Stop Ray client
    ray_client.stop()

if __name__ == "__main__":
    main()

For executor options and configuration, refer to Execution Backends.

Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| download_dir | str | Directory to store downloaded .tar files | "./arxiv_downloads" |
| url_limit | int \| None | Maximum number of ArXiv tar files to download (useful for testing) | None |
| record_limit | int \| None | Maximum number of papers to extract per tar file | None |
| add_filename_column | bool \| str | Whether to add a source filename column to output; if str, use it as the column name | True (column name defaults to file_name) |
| log_frequency | int | How often to log progress while iterating papers | 1000 |
| verbose | bool | Enable verbose logging during download | False |
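
The `bool | str` behavior of add_filename_column can be hard to picture from its description alone. The sketch below restates the documented rule as a small standalone function. It is not Curator's implementation: `resolve_filename_column` and `DEFAULT_FILENAME_COLUMN` are hypothetical names used only to show which column name each setting would produce.

```python
from typing import Optional, Union

# Default column name, as stated in the parameter table.
DEFAULT_FILENAME_COLUMN = "file_name"


def resolve_filename_column(add_filename_column: Union[bool, str]) -> Optional[str]:
    """Return the output column name, or None when no column is added."""
    if isinstance(add_filename_column, str):
        # A string is used directly as the column name.
        return add_filename_column
    # True selects the default name; False disables the column.
    return DEFAULT_FILENAME_COLUMN if add_filename_column else None


if __name__ == "__main__":
    for setting in (True, False, "source_tar"):
        print(repr(setting), "->", resolve_filename_column(setting))
```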

Output Format

The extractor returns per-paper text; the filename column is optionally added by the pipeline:

{
  "text": "Main body text extracted from LaTeX after cleaning...",
  "file_name": "arXiv_src_2024_01.tar"
}

| Field | Description |
|---|---|
| text | Extracted and cleaned paper text (LaTeX macros inlined where supported, comments and references removed) |
| file_name | Optional. Name of the source tar file (enabled by add_filename_column) |
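
To inspect the written records, a minimal reader such as the one below may be enough. It assumes the writer places one JSON object per line in files ending in `.jsonl` under the output directory; the `iter_records` helper is hypothetical and uses only the standard library.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(output_dir: str) -> Iterator[dict]:
    """Yield one dict per non-empty line from every *.jsonl file in output_dir."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


if __name__ == "__main__":
    for record in iter_records("./arxiv_output"):
        # file_name is present only when add_filename_column is enabled.
        source = record.get("file_name", "<no file_name column>")
        print(f"{source}: {len(record['text'])} characters")
```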

During iteration the pipeline yields id (ArXiv identifier), source_id (tar base name), and content (a list of LaTeX file contents as strings; one element per .tex file). The final extractor stage emits text plus the optional filename column.
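
The relationship between the intermediate record and the final output can be sketched as follows. This is a deliberately naive stand-in for the extraction stage, not Curator's cleaning logic: it only joins the .tex contents and strips `%` comments, and it does not handle macros, references, or an escaped backslash directly before `%`. The function names and the sample id are hypothetical.

```python
import re

# An unescaped % and everything after it on the same line.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_latex_comments(source: str) -> str:
    """Remove LaTeX comments line by line, leaving escaped \\% intact."""
    return "\n".join(_COMMENT.sub("", line) for line in source.splitlines())


def record_to_text(record: dict) -> dict:
    """Collapse an iteration record into the final {"text": ...} shape."""
    joined = "\n".join(record["content"])  # one element per .tex file
    return {"text": strip_latex_comments(joined).strip()}


if __name__ == "__main__":
    record = {
        "id": "2401.00001",  # hypothetical identifier
        "source_id": "arXiv_src_2024_01",
        "content": ["\\section{Intro} % draft note\nWe gain 5\\% accuracy."],
    }
    print(record_to_text(record)["text"])
```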