***

title: ArXiv
description: Download and extract text from arXiv using Curator.
----------------------------------------------------------------

Download and extract text from ArXiv LaTeX source bundles using Curator.

ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside `.tar` archives under the `s3://arxiv/src/` requester-pays bucket.

## How it Works

The ArXiv pipeline in Curator consists of four stages:

1. **URL Generation**: Lists available ArXiv source tar files from the S3 bucket
2. **Download**: Downloads tar archives using `s5cmd` (Requester Pays)
3. **Iteration**: Extracts LaTeX projects and yields per-paper records
4. **Extraction**: Cleans LaTeX and produces plain text

## Before You Start

You must have:

* An AWS account with credentials configured (profile, environment, or instance role). Access to `s3://arxiv/src/` uses S3 Requester Pays; you incur charges for listing and data transfer. If you use `aws s3`, include the flag `--request-payer requester` and ensure your AWS credentials are active.
* [`s5cmd` installed](https://github.com/peak/s5cmd)

```bash
# Install s5cmd for requester-pays S3 downloads
pip install s5cmd
```

The examples on this page use `s5cmd` and include Requester Pays when running the pipeline.

***

## Usage

Create and run an ArXiv processing pipeline and write outputs to JSONL:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import ArxivDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    pipeline = Pipeline(
        name="arxiv_pipeline",
        description="Download and process ArXiv LaTeX sources"
    )

    # Add ArXiv stage
    arxiv_stage = ArxivDownloadExtractStage(
        download_dir="./arxiv_downloads",
        url_limit=5,        # optional: number of tar files to process
        record_limit=1000,  # optional: max papers per tar
        add_filename_column=True,
        verbose=True,
    )
    pipeline.add_stage(arxiv_stage)

    # Add writer stage
    writer = JsonlWriter(path="./arxiv_output")
    pipeline.add_stage(writer)

    # Execute
    results = pipeline.run()
    print(f"Completed with {len(results) if results else 0} output files")

    # Stop Ray client
    ray_client.stop()

if __name__ == "__main__":
    main()
```

For executor options and configuration, refer to [Reference Execution Backends](/reference/infra/execution-backends).

### Parameters

**ArxivDownloadExtractStage Parameters**

| Parameter             | Type | Description                                      | Default                                                                              |                                            |
| --------------------- | ---- | ------------------------------------------------ | ------------------------------------------------------------------------------------ | ------------------------------------------ |
| `download_dir`        | str  | Directory to store downloaded `.tar` files       | "./arxiv\_downloads"                                                                 |                                            |
| `url_limit`           | int  | None                                             | Maximum number of ArXiv tar files to download (useful for testing)                   | None                                       |
| `record_limit`        | int  | None                                             | Maximum number of papers to extract per tar file                                     | None                                       |
| `add_filename_column` | bool | str                                              | Whether to add a source filename column to output; if str, use it as the column name | True (column name defaults to `file_name`) |
| `log_frequency`       | int  | How often to log progress while iterating papers | 1000                                                                                 |                                            |
| `verbose`             | bool | Enable verbose logging during download           | False                                                                                |                                            |

## Output Format

The extractor returns per-paper text; the filename column is optionally added by the pipeline:

```json
{
  "text": "Main body text extracted from LaTeX after cleaning...",
  "file_name": "arXiv_src_2024_01.tar"
}
```

**Output Fields**

| Field       | Description                                                                                              |
| ----------- | -------------------------------------------------------------------------------------------------------- |
| `text`      | Extracted and cleaned paper text (LaTeX macros inlined where supported, comments and references removed) |
| `file_name` | Optional. Name of the source tar file (enabled by `add_filename_column`)                                 |

During iteration the pipeline yields `id` (ArXiv identifier), `source_id` (tar base name), and `content` (a list of LaTeX file contents as strings; one element per `.tex` file). The final extractor stage emits `text` plus the optional filename column.
