Nemotron-Parse PDF Pipeline

Convert PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the interleaved dataset format.

How it Works

NemotronParsePDFReader is a composite stage that expands into four underlying sub-stages:

  1. PDFPartitioningStage — reads a JSONL manifest of PDF entries and packs them into FileGroupTask objects.
  2. PDFPreprocessStage — extracts PDF bytes from the configured source and renders pages to images, scaling oversized pages to fit so they don't trigger OOM.
  3. NemotronParseInferenceStage — runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with text_in_pic and enforce_eager flags and free-port retry on collisions.
  4. NemotronParsePostprocessStage — parses model output, aligns images and captions, crops images, and emits the final interleaved rows.

The output is interleaved Parquet ready to be filtered with Interleaved Filters and written to MINT-1T-style WebDataset shards.

Before You Start

Choose your PDF source and confirm the prerequisites:

  • GPU: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
  • vLLM: Strongly recommended for throughput. Set backend="hf" to run on Hugging Face Transformers instead, e.g., when vLLM is unavailable.
  • pypdfium2: Required Python dependency for PDF rendering. Installed automatically with the interleaved_cpu or interleaved_cuda12 extras (e.g., uv sync --extra interleaved_cuda12).
  • Manifest: A JSONL file listing the PDFs to process. Each line should specify the PDF location relative to the source directory you choose.
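
For a pdf_dir source, the manifest can be produced with a few lines of Python. A minimal sketch: the field names shown (file_name, url) match the defaults for file_name_field and url_field, so adjust them if you override those parameters.

```python
import json

# A minimal manifest for use with pdf_dir: one JSON object per line,
# each naming a PDF relative to the source directory.
entries = [
    {"file_name": "papers/attention.pdf", "url": "https://example.com/attention.pdf"},
    {"file_name": "papers/resnet.pdf", "url": "https://example.com/resnet.pdf"},
]

with open("pdfs.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```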

Choosing a PDF Source

Pass exactly one of pdf_dir, zip_base_dir, or jsonl_base_dir so the preprocess stage knows where to find the PDF bytes:

| Parameter | Source Layout | When to Use |
| --- | --- | --- |
| `pdf_dir` | A directory of `.pdf` files | Local or mounted directories of standalone PDFs |
| `zip_base_dir` | A CC-MAIN-2021-31-PDF-UNTRUNCATED zip hierarchy | Common Crawl PDF dumps |
| `jsonl_base_dir` | JSONL-encoded PDF datasets where each line carries the PDF bytes | GitHub-hosted PDF datasets, custom JSONL collections |

Backend Selection

| Backend | When to Use |
| --- | --- |
| `vllm` (recommended) | High-throughput GPU inference with batching. Set `enforce_eager=True` if you hit compilation issues. |
| `hf` | Hugging Face Transformers fallback when vLLM is unavailable or for debugging. |

The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
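
The exact retry logic is internal to the inference stage, but the idea can be sketched as sequential port probing. Illustrative only: find_free_port is a hypothetical helper, not the library's API.

```python
import socket

def find_free_port(start: int = 8000, max_tries: int = 20) -> int:
    """Probe ports sequentially until one binds, mimicking the
    retry-on-collision behavior described above (sketch only)."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port  # bind succeeded: this port is free right now
            except OSError:
                continue  # taken by another replica; try the next one
    raise RuntimeError(f"no free port in [{start}, {start + max_tries})")
```

Each replica that loses the race for a port simply moves on to the next candidate, which is why co-located deployments don't need manual port assignment.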


Usage

A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader
from nemo_curator.stages.interleaved.io.writers.tabular import InterleavedParquetWriter

pipeline = Pipeline(name="pdf_to_interleaved")

# 1. Parse PDFs into interleaved rows
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="./pdfs.jsonl",
        pdf_dir="/data/pdfs",
        backend="vllm",
        pdfs_per_task=10,
        max_pages=50,
        inference_batch_size=4,
    )
)

# 2. Write interleaved Parquet
pipeline.add_stage(InterleavedParquetWriter(output_dir="./parsed_pdfs"))

executor = XennaExecutor()
pipeline.run(executor)
```

For executor options and configuration, refer to Execution Backends.

Example: CC-MAIN PDF Dump

Parse a Common Crawl PDF dump from its zip hierarchy:

```python
NemotronParsePDFReader(
    manifest_path="./cc_pdfs.jsonl",
    zip_base_dir="/data/CC-MAIN-2021-31-PDF-UNTRUNCATED",
    backend="vllm",
    file_names_field="cc_pdf_file_names",
    pdfs_per_task=20,
)
```

Example: JSONL-Encoded PDFs

Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):

```python
NemotronParsePDFReader(
    manifest_path="./github_pdfs.jsonl",
    jsonl_base_dir="/data/github_pdfs",
    backend="vllm",
)
```

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `manifest_path` | `str \| None` | `None` | JSONL manifest listing PDF entries. |
| `pdf_dir` | `str \| None` | `None` | Directory containing `.pdf` files. |
| `zip_base_dir` | `str \| None` | `None` | Root directory of CC-MAIN PDF zip hierarchy. |
| `jsonl_base_dir` | `str \| None` | `None` | Root directory of JSONL-encoded PDF datasets. |
| `model_path` | `str` | (default model) | Local path or HF repo ID for the Nemotron-Parse weights. |
| `backend` | `str` | `"vllm"` | Inference backend (`vllm` or `hf`). |
| `pdfs_per_task` | `int` | `10` | Number of PDFs grouped into each `FileGroupTask`. |
| `max_pdfs` | `int \| None` | `None` | Hard cap on total PDFs processed (debug aid). |
| `dpi` | `int` | `300` | Render DPI for PDF pages. |
| `max_pages` | `int` | `50` | Maximum pages rendered per PDF; longer PDFs are truncated. |
| `inference_batch_size` | `int` | `4` | vLLM/HF batch size. |
| `max_num_seqs` | `int` | `64` | Maximum concurrent vLLM sequences. |
| `text_in_pic` | `bool` | `False` | When `True`, treat embedded text within rendered images as part of the text content. |
| `enforce_eager` | `bool` | `False` | Disable vLLM compilation for compatibility with restricted environments. |
| `min_crop_px` | `int` | `10` | Minimum dimension (pixels) for cropped image regions. |
| `dataset_name` | `str` | `"pdf_dataset"` | Logical dataset label written to output rows. |
| `file_name_field` | `str` | `"file_name"` | Manifest field naming a single PDF file. |
| `file_names_field` | `str` | `"cc_pdf_file_names"` | Manifest field naming a list of PDF files (CC-MAIN layout). |
| `url_field` | `str` | `"url"` | Manifest field for the source URL passthrough. |

Output Format

Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a sample_id belong to the same document. Example output JSON:

```json
{
  "sample_id": "doc_42",
  "position": 0,
  "modality": "text",
  "text_content": "# Introduction\n\nThis paper investigates...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"],
  "url": "https://example.com/pdf_42.pdf"
}
{
  "sample_id": "doc_42",
  "position": 1,
  "modality": "image",
  "text_content": null,
  "binary_content": "<bytes>",
  "source_files": ["pdf_42.pdf"]
}
{
  "sample_id": "doc_42",
  "position": 2,
  "modality": "text",
  "text_content": "Figure 1 shows the architecture...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"]
}
```

Output Schema

| Column | Type | Description |
| --- | --- | --- |
| `sample_id` | string | PDF identifier; rows sharing a `sample_id` belong to the same document. |
| `position` | int | Zero-based item position within the sample, used to reconstruct ordering. |
| `modality` | string | One of `text`, `image`, or `metadata`. |
| `text_content` | string \| null | Text payload for `text` and `metadata` rows. |
| `binary_content` | bytes \| null | Image payload for `image` rows. |
| `source_files` | list[string] | Source PDF files that produced this row (for lineage tracking). |

The output is directly compatible with Interleaved IO readers and writers — the schema matches INTERLEAVED_SCHEMA exactly.
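
If you want to inspect the Parquet output directly, a document can be reconstructed by grouping on sample_id and sorting on position. A sketch with pandas, using toy rows in place of pd.read_parquet on the real output:

```python
import pandas as pd

# Toy rows in the interleaved schema (normally you would load them
# with pd.read_parquet("./parsed_pdfs/<shard>.parquet")).
rows = pd.DataFrame([
    {"sample_id": "doc_42", "position": 1, "modality": "image",
     "text_content": None, "binary_content": b"<bytes>"},
    {"sample_id": "doc_42", "position": 0, "modality": "text",
     "text_content": "# Introduction", "binary_content": None},
])

# Restore reading order within the document.
doc = rows[rows["sample_id"] == "doc_42"].sort_values("position")
ordered_modalities = list(doc["modality"])
print(ordered_modalities)  # ['text', 'image']
```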

Render Timeout

The preprocess stage replaces signal.SIGALRM with a multiprocessing fork-based timeout (_RENDER_TIMEOUT_S = 60 by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where SIGALRM raises ValueError: signal only works in main thread. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, reliably escaping any hung C-extension code inside pypdfium2.

You don’t need to configure this — it works automatically. If legitimate PDFs in your corpus need more than 60 seconds to render, adjust _RENDER_TIMEOUT_S in nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py.
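
The mechanism can be sketched as follows; render_with_timeout and _render are illustrative stand-ins, not the stage's actual functions.

```python
import multiprocessing as mp

def _render(pdf_bytes: bytes, out) -> None:
    # Stand-in for the pypdfium2 rendering call; the real stage
    # renders pages to images here. This stub just reports the size.
    out.put(len(pdf_bytes))

def render_with_timeout(pdf_bytes: bytes, timeout_s: float = 60.0):
    """Run rendering in a forked child and kill it on timeout,
    mirroring the fork-based mechanism described above (sketch only)."""
    ctx = mp.get_context("fork")  # child inherits pdf_bytes via copy-on-write
    queue = ctx.Queue()
    proc = ctx.Process(target=_render, args=(pdf_bytes, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.kill()   # escape hung C-extension code the parent can't interrupt
        proc.join()
        return None   # caller treats this PDF as failed
    return queue.get()

print(render_with_timeout(b"%PDF-1.4", timeout_s=10.0))  # 8 (stub result)
```

Because the work happens in a separate process rather than a signal handler, this approach is safe on the non-main threads that Ray actors use.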

Benchmarking

A standalone benchmark script ships at benchmarking/scripts/nemotron_parse_pdf_benchmark.py. Use it to measure throughput on representative datasets before scaling to your full corpus.

Best Practices

  • Use vLLM unless you can’t: the vllm backend is substantially faster than hf. Only fall back to hf for debugging or in environments where vLLM is unavailable.
  • Cap max_pages for outliers: very long PDFs (1000+ pages) can dominate runtime. The default 50 pages handles most academic papers and articles; raise to 200+ for book-length sources.
  • Tune pdfs_per_task for parallelism: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
  • Set enforce_eager=True in restricted environments: vLLM’s torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
  • Pair with interleaved filters: PDF parsing produces noisy output. Chain with the Interleaved Filters (blur, CLIP score) to drop low-quality samples before training.
Related Topics

  • Interleaved IO — readers and writers that consume the Parquet output of this pipeline.
  • Interleaved Filters — sample-level filters to apply after parsing.
  • Common Crawl — companion source for web-scale PDF input via CC-MAIN dumps.