*** description: Download and extract text from arXiv using Curator. categories: * how-to-guides tags: * arxiv * academic-papers * latex * data-loading * scientific-data personas: * data-scientist-focused * mle-focused difficulty: intermediate content\_type: how-to modality: text-only *** # ArXiv Download and extract text from ArXiv LaTeX source bundles using Curator. ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside `.tar` archives under the `s3://arxiv/src/` requester-pays bucket. ## How it Works The ArXiv pipeline in Curator consists of four stages: 1. **URL Generation**: Lists available ArXiv source tar files from the S3 bucket 2. **Download**: Downloads tar archives using `s5cmd` (Requester Pays) 3. **Iteration**: Extracts LaTeX projects and yields per-paper records 4. **Extraction**: Cleans LaTeX and produces plain text ## Before You Start You must have: * An AWS account with credentials configured (profile, environment, or instance role). Access to `s3://arxiv/src/` uses S3 Requester Pays; you incur charges for listing and data transfer. If you use `aws s3`, include the flag `--request-payer requester` and ensure your AWS credentials are active. * [`s5cmd` installed](https://github.com/peak/s5cmd) ```bash # Install s5cmd for requester-pays S3 downloads pip install s5cmd ``` The examples on this page use `s5cmd` and include Requester Pays when running the pipeline. *** ## Usage Create and run an ArXiv processing pipeline and write outputs to JSONL: ```python from nemo_curator.core.client import RayClient from nemo_curator.pipeline import Pipeline from nemo_curator.stages.text.download import ArxivDownloadExtractStage from nemo_curator.stages.text.io.writer import JsonlWriter def main(): # Initialize Ray client ray_client = RayClient() ray_client.start() pipeline = Pipeline( name="arxiv_pipeline", description="Download and process ArXiv LaTeX sources" ) # Add ArXiv stage arxiv_stage = ArxivDownloadExtractStage( download_dir="./arxiv_downloads", url_limit=5, # optional: number of tar files to process record_limit=1000, # optional: max papers per tar add_filename_column=True, verbose=True, ) pipeline.add_stage(arxiv_stage) # Add writer stage writer = JsonlWriter(path="./arxiv_output") pipeline.add_stage(writer) # Execute results = pipeline.run() print(f"Completed with {len(results) if results else 0} output files") # Stop Ray client ray_client.stop() if __name__ == "__main__": main() ``` For executor options and configuration, refer to [Execution Backends](/reference/infra/execution-backends). ### Parameters | Parameter | Type | Description | Default | | --------------------- | ----------- | ------------------------------------------------------------------------------------ | ------------------------------------------ | | `download_dir` | str | Directory to store downloaded `.tar` files | "./arxiv\_downloads" | | `url_limit` | int \| None | Maximum number of ArXiv tar files to download (useful for testing) | None | | `record_limit` | int \| None | Maximum number of papers to extract per tar file | None | | `add_filename_column` | bool \| str | Whether to add a source filename column to output; if str, use it as the column name | True (column name defaults to `file_name`) | | `log_frequency` | int | How often to log progress while iterating papers | 1000 | | `verbose` | bool | Enable verbose logging during download | False | ## Output Format The extractor returns per-paper text; the filename column is optionally added by the pipeline: ```json { "text": "Main body text extracted from LaTeX after cleaning...", "file_name": "arXiv_src_2024_01.tar" } ``` | Field | Description | | ----------- | -------------------------------------------------------------------------------------------------------- | | `text` | Extracted and cleaned paper text (LaTeX macros inlined where supported, comments and references removed) | | `file_name` | Optional. Name of the source tar file (enabled by `add_filename_column`) | During iteration the pipeline yields `id` (ArXiv identifier), `source_id` (tar base name), and `content` (a list of LaTeX file contents as strings; one element per `.tex` file). The final extractor stage emits `text` plus the optional filename column.