# Nemotron-Parse PDF Pipeline

Convert PDF datasets into interleaved Parquet output using NVIDIA's [Nemotron-Parse](https://huggingface.co/nvidia) vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the [interleaved dataset](/curate-text/process-data/interleaved) format.

## How It Works

`NemotronParsePDFReader` is a composite stage that expands into four underlying sub-stages:

1. **`PDFPartitioningStage`** — reads a JSONL manifest of PDF entries and packs them into `FileGroupTask` objects.
2. **`PDFPreprocessStage`** — extracts PDF bytes from the configured source and renders pages to images, with a scale-to-fit safeguard against OOM on very large pages.
3. **`NemotronParseInferenceStage`** — runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, exposing the `text_in_pic` and `enforce_eager` flags and retrying on a new free port when the server hits a bind collision.
4. **`NemotronParsePostprocessStage`** — parses model output, aligns images and captions, crops images, and emits the final interleaved rows.

The output is interleaved Parquet ready to be filtered with [Interleaved Filters](/curate-text/process-data/interleaved/filters) and written to MINT-1T-style WebDataset shards.

## Before You Start

Choose your PDF source and confirm the prerequisites:

* **GPU**: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
* **vLLM**: Strongly recommended for throughput. Set `backend="hf"` to use Hugging Face Transformers instead.
* **`pypdfium2`**: Required Python dependency for PDF rendering. Installed automatically with the `interleaved_cpu` or `interleaved_cuda12` extras (e.g., `uv sync --extra interleaved_cuda12`).
* **Manifest**: A JSONL file listing the PDFs to process. Each line specifies a PDF location relative to the source directory you choose (see the example after this list).
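
For the `pdf_dir` source, a minimal manifest might look like the following sketch. The field names match the `file_name_field` and `url_field` defaults from the parameters table below; adjust them if your manifest uses different keys:

```json
{"file_name": "pdf_42.pdf", "url": "https://example.com/pdf_42.pdf"}
{"file_name": "reports/annual_2021.pdf", "url": "https://example.com/reports/annual_2021.pdf"}
```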

### Choosing a PDF Source

Pass exactly one of `pdf_dir`, `zip_base_dir`, or `jsonl_base_dir` so the preprocess stage knows where to find the PDF bytes:

| Parameter        | Source Layout                                                    | When to Use                                          |
| ---------------- | ---------------------------------------------------------------- | ---------------------------------------------------- |
| `pdf_dir`        | A directory of `.pdf` files                                      | Local or mounted directories of standalone PDFs      |
| `zip_base_dir`   | A `CC-MAIN-2021-31-PDF-UNTRUNCATED` zip hierarchy                | Common Crawl PDF dumps                               |
| `jsonl_base_dir` | JSONL-encoded PDF datasets where each line carries the PDF bytes | GitHub-hosted PDF datasets, custom JSONL collections |

### Backend Selection

| Backend              | When to Use                                                                                          |
| -------------------- | ---------------------------------------------------------------------------------------------------- |
| `vllm` (recommended) | High-throughput GPU inference with batching. Set `enforce_eager=True` if you hit compilation issues. |
| `hf`                 | Hugging Face Transformers fallback when vLLM is unavailable or for debugging.                        |

The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
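
In practice, switching backends is a single constructor argument. A short sketch of the two configurations (see the full pipeline under Usage below):

```python
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

# Recommended: vLLM for batched, high-throughput GPU inference.
reader = NemotronParsePDFReader(
    manifest_path="./pdfs.jsonl",
    pdf_dir="/data/pdfs",
    backend="vllm",
    enforce_eager=True,  # optional: skip compilation if it fails on your host
)

# Fallback: Hugging Face Transformers, e.g., for debugging.
reader = NemotronParsePDFReader(
    manifest_path="./pdfs.jsonl",
    pdf_dir="/data/pdfs",
    backend="hf",
)
```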

***

## Usage

A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader
from nemo_curator.stages.interleaved.io.writers.tabular import InterleavedParquetWriter

pipeline = Pipeline(name="pdf_to_interleaved")

# 1. Parse PDFs into interleaved rows
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="./pdfs.jsonl",
        pdf_dir="/data/pdfs",
        backend="vllm",
        pdfs_per_task=10,
        max_pages=50,
        inference_batch_size=4,
    )
)

# 2. Write interleaved Parquet
pipeline.add_stage(InterleavedParquetWriter(output_dir="./parsed_pdfs"))

executor = XennaExecutor()
pipeline.run(executor)
```

For executor options and configuration, refer to [Execution Backends](/reference/infra/execution-backends).

### Example: CC-MAIN PDF Dump

Parse a Common Crawl PDF dump from its zip hierarchy:

```python
NemotronParsePDFReader(
    manifest_path="./cc_pdfs.jsonl",
    zip_base_dir="/data/CC-MAIN-2021-31-PDF-UNTRUNCATED",
    backend="vllm",
    file_names_field="cc_pdf_file_names",
    pdfs_per_task=20,
)
```
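
For this layout the manifest names a list of files per line under `file_names_field` (default `cc_pdf_file_names`). The paths below are purely illustrative; match them to the actual structure of your CC-MAIN zip hierarchy:

```json
{"cc_pdf_file_names": ["zipfiles/0000/pdfs_000.zip/doc_001.pdf"], "url": "https://example.com/doc_001.pdf"}
```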

### Example: JSONL-Encoded PDFs

Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):

```python
NemotronParsePDFReader(
    manifest_path="./github_pdfs.jsonl",
    jsonl_base_dir="/data/github_pdfs",
    backend="vllm",
)
```
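
Here each line of the JSONL files under `jsonl_base_dir` carries the PDF bytes itself. Raw bytes cannot appear in JSON, so they are typically stored in an encoded form such as base64. The line below is a hypothetical illustration; `pdf_bytes` is a placeholder field name, not a documented schema:

```json
{"file_name": "paper_001.pdf", "pdf_bytes": "<base64-encoded PDF bytes>", "url": "https://github.com/example/repo/raw/main/paper_001.pdf"}
```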

### Parameters

| Parameter              | Type        | Default               | Description                                                                          |
| ---------------------- | ----------- | --------------------- | ------------------------------------------------------------------------------------ |
| `manifest_path`        | str \| None | `None`                | JSONL manifest listing PDF entries.                                                  |
| `pdf_dir`              | str \| None | `None`                | Directory containing `.pdf` files.                                                   |
| `zip_base_dir`         | str \| None | `None`                | Root directory of CC-MAIN PDF zip hierarchy.                                         |
| `jsonl_base_dir`       | str \| None | `None`                | Root directory of JSONL-encoded PDF datasets.                                        |
| `model_path`           | str         | (default model)       | Local path or HF repo ID for the Nemotron-Parse weights.                             |
| `backend`              | str         | `"vllm"`              | Inference backend (`vllm` or `hf`).                                                  |
| `pdfs_per_task`        | int         | `10`                  | Number of PDFs grouped into each `FileGroupTask`.                                    |
| `max_pdfs`             | int \| None | `None`                | Hard cap on total PDFs processed (debug aid).                                        |
| `dpi`                  | int         | `300`                 | Render DPI for PDF pages.                                                            |
| `max_pages`            | int         | `50`                  | Maximum pages rendered per PDF; longer PDFs are truncated.                           |
| `inference_batch_size` | int         | `4`                   | vLLM/HF batch size.                                                                  |
| `max_num_seqs`         | int         | `64`                  | Maximum concurrent vLLM sequences.                                                   |
| `text_in_pic`          | bool        | `False`               | When `True`, treat embedded text within rendered images as part of the text content. |
| `enforce_eager`        | bool        | `False`               | Disable vLLM compilation for compatibility with restricted environments.             |
| `min_crop_px`          | int         | `10`                  | Minimum dimension (pixels) for cropped image regions.                                |
| `dataset_name`         | str         | `"pdf_dataset"`       | Logical dataset label written to output rows.                                        |
| `file_name_field`      | str         | `"file_name"`         | Manifest field naming a single PDF file.                                             |
| `file_names_field`     | str         | `"cc_pdf_file_names"` | Manifest field naming a list of PDF files (CC-MAIN layout).                          |
| `url_field`            | str         | `"url"`               | Manifest field for the source URL passthrough.                                       |

## Output Format

Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a `sample_id` belong to the same document. Example output rows, shown as JSON:

```json
{
  "sample_id": "doc_42",
  "position": 0,
  "modality": "text",
  "text_content": "# Introduction\n\nThis paper investigates...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"],
  "url": "https://example.com/pdf_42.pdf"
}
{
  "sample_id": "doc_42",
  "position": 1,
  "modality": "image",
  "text_content": null,
  "binary_content": "<bytes>",
  "source_files": ["pdf_42.pdf"]
}
{
  "sample_id": "doc_42",
  "position": 2,
  "modality": "text",
  "text_content": "Figure 1 shows the architecture...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"]
}
```

### Output Schema

| Column           | Type           | Description                                                               |
| ---------------- | -------------- | ------------------------------------------------------------------------- |
| `sample_id`      | string         | PDF identifier; rows sharing a `sample_id` belong to the same document.   |
| `position`       | int            | Zero-based item position within the sample, used to reconstruct ordering. |
| `modality`       | string         | One of `text`, `image`, or `metadata`.                                    |
| `text_content`   | string \| null | Text payload for `text` and `metadata` rows.                              |
| `binary_content` | bytes \| null  | Image payload for `image` rows.                                           |
| `source_files`   | list\[string]  | Source PDF files that produced this row (for lineage tracking).           |

The output is directly compatible with [Interleaved IO](/curate-text/process-data/interleaved/io) readers and writers — the schema matches `INTERLEAVED_SCHEMA` exactly.
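
To sanity-check a run, a minimal sketch that loads the Parquet output with pandas (assuming pandas and pyarrow are installed) and reconstructs one document's reading order:

```python
import pandas as pd

# Read every shard written by InterleavedParquetWriter.
df = pd.read_parquet("./parsed_pdfs")

# Rebuild one document in reading order via sample_id + position.
doc = df[df["sample_id"] == "doc_42"].sort_values("position")
for _, row in doc.iterrows():
    if row["modality"] == "text":
        print(row["text_content"][:80])
    elif row["modality"] == "image":
        print(f"<image: {len(row['binary_content'])} bytes>")
```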

## Render Timeout

The preprocess stage replaces `signal.SIGALRM` with a `multiprocessing` fork-based timeout (`_RENDER_TIMEOUT_S = 60` by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where `SIGALRM` raises `ValueError: signal only works in main thread`. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, reliably escaping any hung C-extension code inside `pypdfium2`.

You don't need to configure this; it works automatically. If you have legitimate PDFs that take longer than 60 seconds to render, adjust `_RENDER_TIMEOUT_S` in `nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py`.
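
For illustration, the fork-based timeout pattern looks roughly like this simplified sketch (not the actual implementation; `render_fn` stands in for the `pypdfium2` rendering call):

```python
import multiprocessing as mp

def render_with_timeout(render_fn, pdf_bytes: bytes, timeout_s: float = 60.0):
    """Run render_fn(pdf_bytes) in a forked child and kill it on timeout."""
    ctx = mp.get_context("fork")  # child inherits pdf_bytes via copy-on-write
    recv_conn, send_conn = ctx.Pipe(duplex=False)

    def _worker() -> None:
        send_conn.send(render_fn(pdf_bytes))

    proc = ctx.Process(target=_worker)
    proc.start()
    send_conn.close()  # parent keeps only the receiving end
    if recv_conn.poll(timeout_s):
        result = recv_conn.recv()
        proc.join()
        return result
    proc.kill()  # escapes hung C-extension code that no signal can interrupt
    proc.join()
    raise TimeoutError(f"PDF render exceeded {timeout_s}s")
```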

## Benchmarking

A standalone benchmark script ships at `benchmarking/scripts/nemotron_parse_pdf_benchmark.py`. Use it to measure throughput on representative datasets before scaling to your full corpus.

## Best Practices

* **Use vLLM unless you can't**: the `vllm` backend is substantially faster than `hf`. Only fall back to `hf` for debugging or in environments where vLLM is unavailable.
* **Cap `max_pages` for outliers**: very long PDFs (1000+ pages) can dominate runtime. The default 50 pages handles most academic papers and articles; raise to 200+ for book-length sources.
* **Tune `pdfs_per_task` for parallelism**: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
* **Set `enforce_eager=True` in restricted environments**: vLLM's torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
* **Pair with interleaved filters**: PDF parsing produces noisy output. Chain with the [Interleaved Filters](/curate-text/process-data/interleaved/filters) (blur, CLIP score) to drop low-quality samples before training.

## Related Topics

* **[Interleaved IO](/curate-text/process-data/interleaved/io)** — readers and writers that consume the Parquet output of this pipeline.
* **[Interleaved Filters](/curate-text/process-data/interleaved/filters)** — sample-level filters to apply after parsing.
* **[Common Crawl](/curate-text/load-data/common-crawl)** — companion source for web-scale PDF input via CC-MAIN dumps.