Nemotron-Parse PDF Pipeline
Convert PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the interleaved dataset format.
NemotronParsePDFReader is a composite stage that expands into four underlying sub-stages:
PDFPartitioningStage: reads a JSONL manifest of PDF entries and packs them into FileGroupTask objects.
PDFPreprocessStage: extracts PDF bytes from the configured source, renders pages to images with scale-to-fit safeguarding against OOM on large pages.
NemotronParseInferenceStage: runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with text_in_pic and enforce_eager flags and free-port retry on collisions.
NemotronParsePostprocessStage: parses model output, aligns images and captions, crops images, and emits the final interleaved rows.
The output is interleaved Parquet ready to be filtered with Interleaved Filters and written to MINT-1T-style WebDataset shards.
Choose your PDF source and confirm the prerequisites:
backend="hf" is set.
pypdfium2: Required Python dependency for PDF rendering. Installed automatically with the interleaved_cpu or interleaved_cuda12 extras (e.g., uv sync --extra interleaved_cuda12).
Pass exactly one of pdf_dir, zip_base_dir, or jsonl_base_dir so the preprocess stage knows where to find the PDF bytes:
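The exactly-one rule for the source options can be illustrated with a small standalone check. This is a sketch only: resolve_pdf_source is a hypothetical helper written for this page, not a function in the library, and the real stage may validate its arguments differently. Only the three option names come from the text above.

```python
from typing import Optional


def resolve_pdf_source(
    pdf_dir: Optional[str] = None,
    zip_base_dir: Optional[str] = None,
    jsonl_base_dir: Optional[str] = None,
) -> tuple[str, str]:
    """Return (option_name, path) for the single source option that was set.

    Hypothetical helper: it mirrors the documented "exactly one" rule and is
    not part of the library's API.
    """
    candidates = {
        "pdf_dir": pdf_dir,
        "zip_base_dir": zip_base_dir,
        "jsonl_base_dir": jsonl_base_dir,
    }
    # Keep only the options the caller actually provided.
    chosen = [(name, path) for name, path in candidates.items() if path is not None]
    if len(chosen) != 1:
        provided = [name for name, _ in chosen] or ["none"]
        raise ValueError(
            "pass exactly one of pdf_dir, zip_base_dir, jsonl_base_dir; "
            f"got: {', '.join(provided)}"
        )
    return chosen[0]
```

Running a check like this before building the pipeline surfaces a misconfigured source immediately rather than partway through a long job.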
The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
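The general idea of retrying on a port collision can be sketched with the standard library. This is an illustration of the technique under simple assumptions (try successive ports on localhost), not the inference stage's actual code; bind_with_retry and its parameters are hypothetical names.

```python
import socket


def bind_with_retry(start_port: int, attempts: int = 10, host: str = "127.0.0.1"):
    """Bind a TCP socket to the first free port at or after start_port.

    Hypothetical helper for illustration. Returns (socket, port); the caller
    is responsible for closing the socket.
    """
    last_error = None
    for offset in range(attempts):
        port = start_port + offset
        if port > 65535:
            break  # ran past the valid TCP port range
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            # Typically "address already in use": another replica holds it.
            sock.close()
            last_error = exc
            continue
        return sock, port
    raise RuntimeError(f"no free port found starting at {start_port}") from last_error
```

With a scheme like this, two replicas that start from the same base port end up on different ports instead of one of them failing at startup.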
A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:
For executor options and configuration, refer to Execution Backends.
Parse a Common Crawl PDF dump from its zip hierarchy:
Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):
Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a sample_id belong to the same document. Example output JSON:
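To show how rows relate to documents, the sketch below groups rows by sample_id and orders them within each document. Only sample_id comes from the description above; the other field names (position, modality, text_content) and all values are assumptions made up for illustration and may not match the real schema.

```python
from collections import defaultdict

# Illustrative rows only. "sample_id" is documented; the remaining keys are
# hypothetical stand-ins for whatever the real schema provides.
rows = [
    {"sample_id": "doc-a", "position": 0, "modality": "text", "text_content": "Title"},
    {"sample_id": "doc-b", "position": 0, "modality": "text", "text_content": "Abstract"},
    {"sample_id": "doc-a", "position": 1, "modality": "image", "text_content": None},
]


def group_by_document(rows):
    """Collect rows per sample_id and sort each group by its position."""
    documents = defaultdict(list)
    for row in rows:
        documents[row["sample_id"]].append(row)
    for items in documents.values():
        items.sort(key=lambda r: r["position"])
    return dict(documents)


documents = group_by_document(rows)
```

Here documents["doc-a"] holds the text item followed by the image item, and documents["doc-b"] holds a single text item.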
The output is directly compatible with Interleaved IO readers and writers — the schema matches INTERLEAVED_SCHEMA exactly.
The preprocess stage replaces signal.SIGALRM with a multiprocessing fork-based timeout (_RENDER_TIMEOUT_S = 60 by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where SIGALRM raises ValueError: signal only works in main thread. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, reliably escaping any hung C-extension code inside pypdfium2.
You don’t need to configure this; it works automatically. If legitimate PDFs take longer than 60 seconds to render, adjust the _RENDER_TIMEOUT_S constant in nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py.
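The fork-based timeout pattern described above can be sketched with the standard library. This is a minimal illustration of the technique, not the code in preprocess.py: run_with_timeout and _child are hypothetical names, and it assumes a POSIX system where the fork start method is available.

```python
import multiprocessing as mp
import time


def _child(conn, fn, args):
    """Run fn in the forked child and send the outcome back over the pipe."""
    try:
        conn.send(("ok", fn(*args)))
    except Exception as exc:
        conn.send(("err", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(fn, args=(), timeout_s=60.0):
    """Run fn(*args) in a forked process and kill it if it exceeds timeout_s.

    Unlike signal.SIGALRM, this works from non-main threads, and killing the
    child also stops code that is stuck inside a C extension.
    """
    ctx = mp.get_context("fork")  # child shares parent memory copy-on-write
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, fn, args))
    proc.start()
    child_conn.close()  # the parent only reads from the pipe
    try:
        if not parent_conn.poll(timeout_s):
            raise TimeoutError(f"call exceeded {timeout_s} s")
        try:
            status, payload = parent_conn.recv()
        except EOFError:
            raise RuntimeError("child exited without returning a result")
    finally:
        if proc.is_alive():
            proc.kill()  # SIGKILL cannot be blocked by hung native code
        proc.join()
        parent_conn.close()
    if status == "err":
        raise RuntimeError(payload)
    return payload
```

A call such as run_with_timeout(render_fn, (pdf_bytes,), timeout_s=60) would then either return the rendered result or raise TimeoutError; render_fn and pdf_bytes are placeholders for whatever rendering function and input apply.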
A standalone benchmark script ships at benchmarking/scripts/nemotron_parse_pdf_benchmark.py. Use it to measure throughput on representative datasets before scaling to your full corpus.
The vllm backend is substantially faster than hf. Fall back to hf only for debugging or in environments where vLLM is unavailable.
max_pages for outliers: very long PDFs (1000+ pages) can dominate runtime. The default of 50 pages handles most academic papers and articles; raise it to 200+ for book-length sources.
pdfs_per_task for parallelism: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
enforce_eager=True in restricted environments: vLLM’s torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
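The effect of pdfs_per_task on parallelism can be seen with a simple chunking sketch. pack_into_groups is a hypothetical stand-in for how a partitioning step might pack manifest entries into tasks; the real stage may use a different strategy.

```python
def pack_into_groups(entries, pdfs_per_task):
    """Split entries into consecutive groups of at most pdfs_per_task items."""
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")
    return [
        entries[start:start + pdfs_per_task]
        for start in range(0, len(entries), pdfs_per_task)
    ]


# Hypothetical manifest of 100 PDFs.
entries = [f"doc_{i}.pdf" for i in range(100)]
many_small_tasks = pack_into_groups(entries, 5)   # 20 tasks of 5 PDFs each
few_large_tasks = pack_into_groups(entries, 50)   # 2 tasks of 50 PDFs each
```

With 100 PDFs, a value of 5 yields 20 tasks that can be spread across many GPUs, while a value of 50 yields only 2 tasks, which keeps scheduling overhead low but leaves any additional workers idle.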