Long-Document SDG#
The long-document SDG recipe generates synthetic VLM training data from PDF documents — improving long-document understanding capabilities measured against MMLongBench-Doc. Output is a corpus of question/answer pairs grounded in real document images, optionally judged for quality by a frontier model.
Built as nine PEP-723 uv run-able scripts that double as Nemotron CLI commands (nemotron data sdg long-document <stage>), and dispatchable to Slurm via NeMo-Run. Each producer stage can also auto-deploy its required vLLM endpoint via --serve — see below.
Pipeline#
01 seed ──┬─ 02 ocr ──── 03 text-qa ─────┐
├─ 04 classify ── 05 visual-qa ┤
├─ 06 single-page-qa ──────────┤── 09 judge
├─ 07 windowed-qa ─────────────┤
└─ 08 whole-doc-qa ────────────┘
Stage |
Description |
Model |
|---|---|---|
|
Download PDFs from FinePDFs, render pages to PNG, produce per-page / windowed / whole-document seed parquets |
CPU-only |
|
OCR with text + bbox metadata via Nemotron-Parse |
|
|
QA pairs from OCR-transcribed text |
|
|
Page-type and reasoning-complexity classification |
|
|
Visual QA grounded in page images |
|
|
Anchored single-page QA across Text/Table/Chart/Image/Layout |
same |
|
Multi-page sliding-window QA |
same |
|
Whole-document cross-page QA |
same |
|
LLM-as-a-judge scoring of any QA output |
any OpenAI-compatible frontier endpoint |
Stages 02–08 are CPU clients to a vLLM endpoint; stage 01 is CPU-only; stage 09 hits a third-party frontier API. Output of every stage is a parquet file consumable by downstream training recipes.
Two ways to run a stage#
Each stage works in two modes:
Standalone via
uv— drop into any environment withuvand a vLLM endpoint:uv run --no-project 02-nemotron-parse-ocr-sdg.py \ --config config/02-ocr.yaml \ vllm_endpoint=http://localhost:8000/v1 \ seed_path=./seed_data/seed_per_page.parquet \ num_records=100
Through the Nemotron CLI — same recipes, dispatched on Slurm:
nemotron data sdg long-document ocr --batch <profile> -c 02-ocr \ vllm_endpoint=http://compute-node:8000/v1 \ seed_path=/lustre/.../seed.parquet \ num_records=100
Configuration is YAML + Hydra-style key=value overrides validated by a Pydantic <Stage>Config class — nemotron data sdg long-document <stage> --help renders the full field table.
Auto-deploy with --serve#
Producer stages (ocr, text-qa, page-classification, visual-qa, single-page-qa, windowed-qa, whole-document-qa) accept --serve. When passed, the CLI composes a multi-task NeMo-Run experiment:
A serve task on a GPU partition brings vLLM up, picks a free TCP port at runtime, polls both
/healthand/v1/modelsto confirm the served model is registered, then publishes its endpoint to a sentinel file on shared storage.A client task (the recipe) waits on the sentinel, injects
vllm_endpoint=<url>into its config, runs the recipe, and on exit signals the serve task to clean up.
# OCR — auto-deploys nvidia/NVIDIA-Nemotron-Parse-v1.1 on a GPU node,
# runs the recipe against it, tears the deployment down on exit.
nemotron data sdg long-document ocr --batch prep --serve \
-c 02-ocr \
seed_path=/lustre/.../seed_per_page.parquet \
num_records=100
Override the default deployment with --serve-config <name> (configs live in recipes/data/sdg/long-document/deployment/).
--serve is not offered for seed (CPU-only, no model) or judge (frontier endpoint, third-party hosted).
Cluster operational guidance#
The seed stage and the --serve client tasks are CPU-only. On clusters whose default partitions require GPUs (e.g. NVIDIA’s dlw cluster, where interactive and batch reject CPU-only jobs), use a profile that extends the cluster profile with CPU partitions. dlw ships [prep]:
[prep]
extends = "dlw"
run_partition = "cpu"
batch_partition = "cpu"
Use --batch prep / --run prep for these recipes:
nemotron data sdg long-document seed --batch prep -c 01-seed ...
nemotron data sdg long-document ocr --batch prep --serve -c 02-ocr ...
The serve task always lands on a GPU partition (the cluster’s sdg_serve_partition from env.toml, defaulting to interactive); the client task uses the env profile’s regular run_partition / batch_partition.
Getting Started#
The recipe scripts live in:
src/nemotron/recipes/data/sdg/long-document/
Refer to the recipe README for full per-stage documentation, deployment-config schema, troubleshooting, and full-pipeline examples (both manual-vLLM and --serve styles).
Prerequisites#
uvinstalled (recipes resolve PEP 723 inline deps at run time).For producers (02–08): an OpenAI-compatible vLLM endpoint serving the recipe’s required model — either operator-launched, or auto-deployed by
--serve.For the judge (09): an OpenAI-compatible frontier endpoint with a valid API key in an env var.
For Slurm dispatch: the
nemotron-evaluator-launcher-style env.toml profile for your cluster.
After SDG#
Once the pipeline runs, the resulting parquet files can be:
Published to Hugging Face Hub as a public dataset.
Stored in internal Lustre and registered as a Nemotron / W&B artifact.
Consumed directly by training recipes via
dataset.pathor HF-dataset-id config.
The recipe README has copy-pasteable templates for both publish paths.
Further Reading#
Recipe README — comprehensive per-stage docs, config schemas, troubleshooting.
Deployment config schema —
--servedeployment YAML reference.MMLongBench-Doc paper — the benchmark this dataset targets.
Nemotron-CC — the sibling pretraining-data curation recipe.
Execution through NeMo-Run — how the
--run/--batchdispatch works.