Nemotron-Parse PDF Pipeline
Convert PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the interleaved dataset format.
How it Works
NemotronParsePDFReader is a composite stage that expands into four underlying sub-stages:
- PDFPartitioningStage — reads a JSONL manifest of PDF entries and packs them into FileGroupTask objects.
- PDFPreprocessStage — extracts PDF bytes from the configured source and renders pages to images, with scale-to-fit safeguarding against OOM on large pages.
- NemotronParseInferenceStage — runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with text_in_pic and enforce_eager flags and free-port retry on collisions.
- NemotronParsePostprocessStage — parses model output, aligns images with captions, crops images, and emits the final interleaved rows.
The output is interleaved Parquet ready to be filtered with Interleaved Filters and written to MINT-1T-style WebDataset shards.
Before You Start
Choose your PDF source and confirm the prerequisites:
- GPU: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
- vLLM: Strongly recommended for throughput. Falls back to HF Transformers if backend="hf" is set.
- pypdfium2: Required Python dependency for PDF rendering. Installed automatically with the interleaved_cpu or interleaved_cuda12 extras (e.g., uv sync --extra interleaved_cuda12).
- Manifest: A JSONL file listing the PDFs to process. Each line should specify the PDF location relative to the source directory you choose.
Choosing a PDF Source
Pass exactly one of pdf_dir, zip_base_dir, or jsonl_base_dir so the preprocess stage knows where to find the PDF bytes:
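As a hedged sketch (only the three parameter names come from this guide; the constructor shape, manifest_path, and the paths are illustrative assumptions), the choice looks like:

```python
# Pass exactly one of the three source arguments (the others stay None):
reader = NemotronParsePDFReader(
    manifest_path="manifest.jsonl",  # hypothetical name for the JSONL manifest
    pdf_dir="/data/pdfs",            # loose PDF files under a directory
    # zip_base_dir="/data/zips",     # or: PDFs packed inside a zip hierarchy
    # jsonl_base_dir="/data/jl",     # or: PDF bytes embedded in JSONL lines
)
```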
Backend Selection
The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
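Selecting a backend is a single constructor argument. A sketch, assuming the argument is passed to the reader; only the backend values "vllm" and "hf" come from this guide:

```python
reader = NemotronParsePDFReader(..., backend="vllm")  # recommended: vLLM server
reader = NemotronParsePDFReader(..., backend="hf")    # Hugging Face Transformers fallback
```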
Usage
A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:
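A sketch of such a pipeline, assuming the usual NeMo Curator pattern of composing stages into a Pipeline; the import paths, Pipeline/executor usage, and the manifest_path and output_dir arguments are assumptions — only NemotronParsePDFReader, pdf_dir, backend, and the interleaved Parquet output come from this guide:

```python
# Sketch only: see the Parameters section for the arguments documented here.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

pipeline = Pipeline(name="pdf_to_interleaved")
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="/data/manifest.jsonl",  # JSONL listing the PDFs
        pdf_dir="/data/pdfs",                  # read loose PDFs from a directory
        backend="vllm",                        # recommended backend
        output_dir="/data/interleaved",        # interleaved Parquet destination
    )
)
pipeline.run()  # pass an executor here per the Execution Backends docs
```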
For executor options and configuration, refer to Execution Backends.
Example: CC-MAIN PDF Dump
Parse a Common Crawl PDF dump from its zip hierarchy:
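A hedged sketch of the reader configuration; only zip_base_dir and the CC-MAIN context come from this guide, and the paths and other arguments are illustrative:

```python
reader = NemotronParsePDFReader(
    manifest_path="/data/cc_main/manifest.jsonl",  # entries point into the zips
    zip_base_dir="/data/cc_main/zips",             # root of the CC-MAIN zip hierarchy
    backend="vllm",
)
```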
Example: JSONL-Encoded PDFs
Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):
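A hedged sketch of the corresponding configuration; only jsonl_base_dir comes from this guide, and the paths, manifest field layout, and other arguments are assumptions:

```python
reader = NemotronParsePDFReader(
    manifest_path="/data/github_pdfs/manifest.jsonl",
    jsonl_base_dir="/data/github_pdfs/jsonl",  # each JSONL line carries the PDF bytes
    backend="vllm",
)
```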
Parameters
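The parameter names below are collected from elsewhere in this guide; the consolidated constructor shape and the defaults shown are assumptions, not the authoritative signature:

```python
reader = NemotronParsePDFReader(
    manifest_path="manifest.jsonl",  # JSONL manifest of PDF entries (name assumed)
    pdf_dir=None,                    # pass exactly one of these three sources
    zip_base_dir=None,
    jsonl_base_dir=None,
    backend="vllm",                  # "vllm" (recommended) or "hf"
    text_in_pic=False,               # inference-stage flag (default assumed)
    enforce_eager=False,             # True disables vLLM's torch.compile path
    max_pages=50,                    # default per Best Practices
    pdfs_per_task=10,                # PDFs packed per FileGroupTask (default assumed)
)
```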
Output Format
Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a sample_id belong to the same document. Example output JSON:
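An illustration of that row structure: sample_id and the document-grouping behavior come from this guide, but the remaining field names are assumptions rather than the actual INTERLEAVED_SCHEMA columns:

```python
# Hypothetical rows from two parsed PDFs; field names other than
# sample_id are illustrative only.
rows = [
    {"sample_id": "doc-0001", "text": "Figure 2 shows ...", "image": None},
    {"sample_id": "doc-0001", "text": None, "image": "<jpeg bytes>"},
    {"sample_id": "doc-0002", "text": "Abstract ...", "image": None},
]

# Rows sharing a sample_id belong to the same parsed document.
docs = {}
for row in rows:
    docs.setdefault(row["sample_id"], []).append(row)
```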
Output Schema
The output is directly compatible with Interleaved IO readers and writers — the schema matches INTERLEAVED_SCHEMA exactly.
Render Timeout
The preprocess stage replaces signal.SIGALRM with a multiprocessing fork-based timeout (_RENDER_TIMEOUT_S = 60 by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where SIGALRM raises ValueError: signal only works in main thread. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, reliably escaping any hung C-extension code inside pypdfium2.
You don’t need to configure this — it works automatically. If you find legitimate PDFs that take longer than 60 seconds to render, the constant lives at nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py.
Benchmarking
A standalone benchmark script ships at benchmarking/scripts/nemotron_parse_pdf_benchmark.py. Use it to measure throughput on representative datasets before scaling to your full corpus.
Best Practices
- Use vLLM unless you can’t: the vllm backend is substantially faster than hf. Only fall back to hf for debugging or in environments where vLLM is unavailable.
- Cap max_pages for outliers: very long PDFs (1000+ pages) can dominate runtime. The default 50 pages handles most academic papers and articles; raise to 200+ for book-length sources.
- Tune pdfs_per_task for parallelism: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
- Set enforce_eager=True in restricted environments: vLLM’s torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
- Pair with interleaved filters: PDF parsing produces noisy output. Chain with the Interleaved Filters (blur, CLIP score) to drop low-quality samples before training.
Related Topics
- Interleaved IO — readers and writers that consume the Parquet output of this pipeline.
- Interleaved Filters — sample-level filters to apply after parsing.
- Common Crawl — companion source for web-scale PDF input via CC-MAIN dumps.