
Whole-Document QA

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "data-designer>=0.5.6",
# ]
# ///
"""Long-Document Understanding Whole-Document QA Recipe

Generate multi-page question-answer pairs that require combining information
from across an entire PDF document. Targets the hardest long-document
understanding skills: counting across pages, list extraction, repeated-layout
aggregation, cross-page computation, and lookup chains. MMLongBench-Doc is
used to track progress toward these capabilities.

For each seed record the pipeline:

  1. Samples a question type (multiple choice, yes/no, string, layout,
     numerical int/float/percentage, list, not answerable)
  2. Generates a question that requires examining 2+ pages (preferably 4-8)
  3. Generates an answer with exhaustive page-by-page reasoning (captured
     separately via extract_reasoning_content)
  4. Evaluates overall quality including multi-page requirement, answer
     correctness, reasoning thoroughness, and format compliance (0/1/2 score)

Prerequisites:
  - A seed parquet file containing:
      * `png_images_base64` – JSON array of base64-encoded PNGs covering
        all pages of each document (one entry per page).
  - A vLLM-compatible deployment of the VLM
    (default: Qwen/Qwen3-VL-235B-A22B-Thinking-FP8).
    Recommended vLLM launch flags:
      --tensor-parallel-size 4
      --max-model-len 50000
      --gpu-memory-utilization 0.90
      --reasoning-parser deepseek_r1
      --limit-mm-per-prompt '{"video": 0}'
      --trust-remote-code

    Example launch script for 4× H100:
      docker run --gpus all \
        -p 8000:8000 \
        vllm/vllm-openai:latest \
        --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
        --tensor-parallel-size 4 \
        --max-model-len 50000 \
        --gpu-memory-utilization 0.90 \
        --reasoning-parser deepseek_r1 \
        --limit-mm-per-prompt '{"video": 0}' \
        --trust-remote-code

Run:
    # Basic usage (generates 5 records by default)
    uv run 08-whole-document-qa-sdg.py --vllm-endpoint http://localhost:8000/v1 --seed-path seed_data/seed_whole_document.parquet

    # Custom record count
    uv run 08-whole-document-qa-sdg.py --vllm-endpoint http://localhost:8000/v1 --seed-path seed_data/seed_whole_document.parquet --num-records 100

    # For help message and available options
    uv run 08-whole-document-qa-sdg.py --help
"""

from pathlib import Path

import data_designer.config as dd
from data_designer.interface import DataDesigner, DatasetCreationResults

DEFAULT_VLM_MODEL = "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8"
VLLM_PROVIDER_NAME = "vllm"

# =============================================================================
# Image context helper
# =============================================================================

IMAGE_CONTEXT = [
    dd.ImageContext(
        column_name="png_images_base64",
        data_type=dd.ModalityDataType.BASE64,
        image_format=dd.ImageFormat.PNG,
    ),
]

# =============================================================================
# Prompt templates
# =============================================================================

PROMPT_QUESTION = """\
<question-type>
{{ question_type }}
</question-type>

You are given ALL pages of a PDF document. Create a question of the given <question-type> that requires examining these pages.

Think step by step:
1. Scan ALL pages. Note every table, chart, figure, diagram, icon, and text section.
2. Identify connections ACROSS pages: a value on one page that relates to a value on another, a term defined on one page used in a chart on another, items to count across multiple pages.
3. Formulate a question that REQUIRES combining information from at least 2 pages (preferably 4-8). Verify: could someone answer this from any single page alone? If yes, revise.

CORE RULES:
- The question must require the ENTIRE document, including later pages. Do not create questions answerable from just the first half.
- Reject any question that would still be answered correctly if the reader stopped after the first cluster of matching pages.
- Use light grounding like the benchmark: "In this report, ...", "According to the slides, ...",
  or "In the Appendix, ...". Do NOT use explicit table/figure numbers (e.g., "In Table 3").
  Put the specificity in the CRITERIA of what to find, not where to find it.
- Prefer questions that use visual elements (charts, figures, icons, diagrams) over plain text.
- Do NOT include the answer or hints in the question. Output ONLY the question text.

MULTI-PAGE QUESTION PATTERNS
The model's accuracy drops with more evidence pages (1pg: 51%, 4pg: 23%, 6+: 15%).
Its dominant failure is systematic undercounting. Generate questions that train exhaustive scanning.

HIGH-VALUE TARGETS (based on failure analysis):
- COUNTING across pages — model undercounts by 2-3x. Use clear, visual, objective criteria.
- LIST EXTRACTION across pages — 81% zero rate on "list all". Items on later pages get missed.
- REPEATED-LAYOUT AGGREGATION — brochure cards, museum entries, cover-page student profiles, FAQ cards, chart panels, etc. Require scanning all matching pages, not just the first.
- CROSS-PAGE COMPUTATION — financial ratios, sums across tables.
- INFOGRAPHIC SPATIAL — binding numbers/labels to correct regions on maps/diagrams.
- LOOKUP CHAINS — find in one table, look up in another.

GROUNDING (match benchmark style — light references, heavy criteria):
The benchmark uses light grounding. Do NOT use explicit table/figure numbers.
Allowed grounding, from lightest to heaviest:
  1. Bare: "How many figures are there in total in the article?"
  2. Document type: "In the slides, how many charts compare ONLY US and Europe?"
  3. Section/part: "How many prompt examples are given in the Appendix?"
  4. Page range (rare): "How many figures are in this paper from Page 3 to Page 10?"
The CRITERIA (what to count/find) must be clear and visual — the LOCATION is left to the model.

QUALIFIER FIDELITY (critical)
- If multiple nearby answers exist, the question MUST include the qualifier that makes the target unique.
- Prefer qualifiers like: strongly / somewhat / overall / net, displayed / shown vs listed / mentioned,
  exact row / column / year / fiscal year / subgroup / legend item.
- The question must not be answerable by selecting a nearby but broader fact.

COUNTING DISCIPLINE: verify your own count by scanning every page. Enumerate, then recount.

{% if "not answerable" in question_type %}
Create a question relevant to the document whose answer is NOT present anywhere in the pages.
Prefer hard, benchmark-like negatives over trivial missing-info questions.
Good negative patterns:
  - wrong year/date not shown anywhere in the document
  - wrong subgroup/series not present anywhere in the document
  - a condition that no entity satisfies anywhere in the document
  - a lookup chain that never resolves anywhere in the document
Templates:
  - "In the Consolidated Balance Sheets, what is [METRIC] for [YEAR not shown]?"
  - "According to the report, which [ENTITY] satisfies [CONDITION not met by any row]?"
  - "In the chart comparing [REAL GROUPS], what is the value for [GROUP NOT IN THE CHART]?"

{% elif "numerical" in question_type %}
Use visible numbers from tables/charts. Require arithmetic or counting across pages.
{% if "int" in question_type %}
Add "Answer with an integer." to the question. Keep counts in range 3-30.
Use clear, VISUAL, objective criteria — each instance must be unambiguously identifiable.
Templates:
  - "How many charts in this report compare the general public with Latinos/Hispanics?
    Count each chart once, even if it has multiple sub-panels. Answer with an integer."
  - "How many pages include at least one photograph of a building exterior?"
  - "Across all tables in this paper, how many use 'F1' as a column header or metric?"
  - "How many charts use 'year' (e.g., 2010, 2015) as the horizontal axis?"
  - "How many distinct country/region names appear in the regulatory compliance sections?"
  - "What is the total [METRIC] across all tables in the document?"
  - "How many rows across all tables in this paper have [COLUMN] above [THRESHOLD]?"
  - "Using the percentage on one page and the sample size on another page, how many [GROUP] does that correspond to? Round to the nearest hundred and answer with an integer."
{% elif "float" in question_type %}
Specify rounding (e.g., "Round to two decimal places.").
Templates:
  - "Using the Consolidated Statements of Income and Consolidated Balance Sheets,
    what is the FY2021 inventory turnover ratio (Cost of Sales / Inventories)?
    Round to two decimal places."
  - "What is the sum of Accrued Liabilities (balance sheet) and Selling Expense
    (income statement) for FY2021? Answer in millions."
  - "What is the ratio of [METRIC A] from the income statement to [METRIC B] from the balance sheet?"
  - "What is the difference between the largest and smallest 'Change' values
    across all subgroup charts in this report?"
{% elif "percentage" in question_type %}
Add "Answer with a % sign."
Templates:
  - "According to the financial statements, what is [METRIC A] as a percentage of [METRIC B]?"
  - "What is the percentage difference between [GROUP A] and [GROUP B] in the chart titled 'X'?"
  - "How much did [METRIC] change between [YEAR A] and [YEAR B] according to the report?"
{% endif %}

{% elif "list" in question_type %}
Answer should be 2-8 short items gathered from DIFFERENT pages.
Add "Return a JSON array of strings, e.g., ["A", "B"]."
The model's worst pattern is "list all" (81% zero rate) — items on later pages get missed.
Specify scope and criteria explicitly.
The question itself must be a natural language sentence — NEVER output a JSON array as the question.
Templates:
  - "List all organisations introduced with at least one paragraph of description.
    Return a JSON array, e.g., ["Org A", "Org B"]."
  - "What are all the examples given for 'what does not make you data-driven'?
    Collect from every page. Return a JSON array."
  - "List every college that uses a Year 3 student on its brochure cover. Return a JSON array."
  - "What are the [FIELD] values for [ENTITY] mentioned across the document?"
  - "What colors represent [CATEGORIES] in charts titled 'X' and 'Y'? Return a JSON array."

{% elif "yes" in question_type %}
Templates:
  - "Is [METRIC] for [ENTITY A] greater than for [ENTITY B] according to the report?"
  - "Is the revenue growth rate in the financial highlights consistent with the detailed
    segment breakdown later in the report? Answer Yes or No."
  - "Does the document contain more than N [ITEMS with clear criteria]?"

{% elif "multiple choice" in question_type %}
Provide exactly 4 options (A-D), plausible and mutually exclusive.
Templates:
  - "Which [ENTITY] has the highest [METRIC] in the report?
    A. ... B. ... C. ... D. ..."
  - "Based on the financial data in this report, which statement is true?
    A. ... B. ... C. ... D. ..."

{% elif "string:" in question_type %}
Answer is a word, phrase, or short sentence requiring multi-page reasoning.
Templates:
  - "In the chart titled 'X', in the 'Change' column, which subgroup shows the largest increase?"
  - "Which method has the highest BLEU score in Table 3? Look up that method's parameter
    count in Table 1. What is it?"
  - "According to the report, which region has the largest number of [ENTITY]?"
  - "Compare the 2015 and 2008 values for 'Foreign born' Latinos in the economic optimism
    chart vs. the personal finance chart. In which is the gap larger?"
  - "In the support contacts table, what phone number is listed for the country mentioned
    most often in the Regulatory Compliance sections?"

{% elif "layout" in question_type %}
Answer requires understanding visual/spatial structure. Answer is a number, word, or phrase.
Templates:
  - "What range does [COLOR] represent in the legend of the chart titled 'X'?"
  - "In the network diagram, which nodes are connected to [LABEL]?"
  - "In the flowchart in Figure N, what step follows [LABEL]?"
  - "What text appears inside the [COLOR/POSITION] box in the slides?"
  - "What are the colors of the icons for [ACTION A] and [ACTION B] in the document?"
{% endif %}

These templates are for inspiration. Create a question specific to the actual visible content.
OUTPUT: Write ONLY the question. No answer, no explanation.\
"""


PROMPT_ANSWER = """\
Your question is: {{ question }}

You are given ALL pages of a PDF document. Answer the question using these pages.

You MUST use this exact output structure:
<think>
[all reasoning here]
</think>
[bare final answer here — no explanation, no labels, no extra text]

In your THINKING (inside <think> tags), follow this protocol.

QUALIFIER LOCK (critical)
Before extracting any answer, copy the restrictive qualifiers from the question and keep them fixed:
- year / date / fiscal year
- subgroup / series / legend item
- exact metric (count vs percentage vs percentage-point difference)
- displayed / shown / visible vs listed / mentioned
- first / second / last / nearest / highest / lowest

Do NOT substitute a nearby year, nearby subgroup, nearby series, nearby row, or nearby fact.
If the question asks for a specific subgroup or metric, read exactly that one and no other.

THINKING PROTOCOL:
1. State what you need to find.
2. Scan ALL pages first to last. For each page, note whether it contains relevant info.
   Do NOT stop at the first match — the same content type often continues on later pages
   (charts across pages 4-14, financial notes on pages 60-80, organisations on pages 10-35).
3. For each relevant page, enumerate findings with explicit numbering:
   "Page 28: Item 14 — Arven Gold..., Item 15 — Bergen Kunsthall..."
   "Target B found on page 59 in Table 5."
   Keep a running tally. Do NOT summarize or estimate.
4. If the answer requires combining values from multiple pages, show the computation:
   "Total Debt = Current portion (799) + Long-term (6,692) = 7,491"
5. After the LAST page, verify:
   - Counting: recount from the top. Compare tallies. Third pass if they differ.
   - Computation: re-read source values and redo the arithmetic.
   - List extraction: scan again for missed items, especially on later pages.
   - Lookup chains: confirm intermediate results match the correct row/entity.
   - Comparison: re-check you read from the correct charts/tables (match by title).

UNIT DISCIPLINE
- Preserve units exactly when present or requested (%, $, million, etc.).
- Financial reports: parentheses = negative; check table header for unit scale (e.g., "In millions");
  "how much higher/more" = positive number; "change" = positive for increase, negative for decrease.

COUNT / PERCENT / DENOMINATOR DISCIPLINE
- If one page provides a percentage and another page provides a sample size, the percentage is NOT the final answer until it is converted using the sample size.
- Distinguish carefully between count, percentage, percentage-point difference, and ratio.
- Only round after the final computation, never before.

THINKING STABILITY (critical)
- Follow the protocol once from top to bottom. Do NOT restart from step 1 after you already found the relevant pages.
- Do at most one scan pass and one verification pass.
- If there are two plausible candidates, compare them once using the question's qualifiers, choose the best-supported one, and continue. Do NOT keep generating new alternatives.
- Do NOT repeat the same scan, recount, or conclusion more than once.
- As soon as the answer is found and verified, stop thinking and produce the final answer.
- Do NOT use filler loops such as repeating a phrase, title, entity name, or page reference many times.
- If you have a complete answer supported by the required pages and qualifiers, commit to it. Do not reopen the search.

EXHAUSTIVE REPEATED-LAYOUT SCAN
If multiple pages in the document share the same layout or template, scan ALL matching pages before concluding.
Do not stop after the first valid hit.
For counts/lists, maintain a running page-by-page tally or item list until the last relevant page in the document.
Bad: "I see items 14-29 on pages 28-31. Count = 29 - 14 + 1 = 16." (stopped early, missed pages 32-34)
Good: "Page 28: items 14-21 (running total: 8). Page 30: items 22-29 (running total: 16). Page 32: items 30-37 (running total: 24). Page 34: items 38-44 (running total: 31). Final count: 31."

PAGE-BREAK CONTINUATION
If a sentence, paragraph, table row, caption, or figure explanation appears to continue onto the next page,
combine the text before deciding the answer is missing.

THINKING TRACE QUALITY (your thinking is used as training data):
- Cite which page and which element (by title/heading) you found evidence on.
- Quote the specific values you read.
- For computation, show the formula with named operands.
- For counting, enumerate every item, not just a total.

FINAL ANSWER: After </think>, output ONLY the bare answer. No reasoning, no step labels,
no "DECOMPOSE:", no "VERIFY:", no explanation. Just the answer value.

{% if "not answerable" in question_type %}
Only say "Not answerable" if the information is genuinely absent from ALL pages — you must have
scanned every single page before concluding this. Do NOT refuse because:
- The exact phrasing doesn't appear (look for equivalent information)
- You need to compute the answer from available data (do the computation)
- A figure or icon is hard to read (give your best reading)
- You only checked the first half of the document (later pages may have the answer)
{% endif %}

- For figures/diagrams: examine visual elements directly (nodes, arrows, colors, spatial groupings), not just captions.
- For list answers: format as ["item1", "item2"]. Include units when applicable.\
"""


PROMPT_QUALITY_SCORE = """\
<question-type>{{ question_type }}</question-type>
<question>{{ question }}</question>
<answer>{{ answer }}</answer>
<answer_reasoning>{{ answer__reasoning_content }}</answer_reasoning>

You are given ALL pages of a PDF document. Evaluate the question-answer pair AND its reasoning.

Filter out bad training data, especially:
- looping or repetitive reasoning traces that keep rescanning the same pages/evidence without converging
- unfinished or truncated reasoning traces that stop mid-thought
- reasoning that stops after the first cluster of results when later pages still matter

CHECKS (any failure => score 0):

1. PAGES READABLE — document pages are clear and not low quality.
2. QUESTION RELEVANT — question is about content in these pages.
{% if "not answerable" in question_type %}
   For "not answerable": question should be relevant but the answer must NOT be in the pages.
{% endif %}
3. ANSWER CORRECT — verify by examining the actual pages yourself. For counts, count the
   items yourself. For chart values, find the specific chart and check. For lists, verify
   each item exists.
{% if "not answerable" in question_type %}
   Correct answer must be exactly "Not answerable".
{% endif %}
4. QUESTION WELL-FORMED — unambiguous, doesn't contain the answer, not trivially easy.
5. MULTI-PAGE REQUIRED — would someone need 2+ pages to answer? Score 0 if answerable
   from a single page.
6. FORMAT + REFUSAL —
   - Lists must be JSON arrays, units included, correct specificity.
   - Score 0 if the answer contains reasoning steps, protocol labels (e.g., "DECOMPOSE:",
     "SCAN:", "VERIFY:"), explanations, or anything beyond the bare result.
   {% if "not answerable" in question_type %}
   - Answer must be exactly "Not answerable".
   - Score 0 if the question is unanswerable only because it asks a trivially absent detail
     (e.g., author's phone number). Must be a near-miss negative where a specific qualifier
     (year, subgroup, row, condition) is absent from the document.
   {% else %}
   - Answer MUST NOT be "Not answerable" or any refusal.
   {% endif %}
7. REASONING QUALITY (critical — reasoning is used as chain-of-thought training data):
   The <answer_reasoning> must demonstrate thorough multi-page scanning. Score 0 if:
   - Reasoning does NOT mention specific pages where evidence was found.
   - Reasoning stops scanning after the first cluster of results (e.g., finds 5 items on
     pages 10-12 and stops, when more exist on pages 15-20).
   - It repeats the same scan, recount, candidate answer, page reference, title, entity name,
     or conclusion without adding new evidence.
   - It restarts the reasoning process after already finding the relevant page(s) or elements.
   - It contains obvious loop markers such as repeated "Wait, let me", "Actually",
     "Let's look again", or "Let's look at" — especially if the same phrase appears
     more than 3 times with no new information between repetitions.
   - It keeps generating new alternatives after already having enough evidence to answer.
   - It ends in an unfinished or truncated way, or appears to stop mid-thought.
   - For repeated-layout or repeated-entry questions, it stops after the first valid hit instead
     of scanning all matching pages.
   - For counting: reasoning does NOT enumerate items explicitly (just states a total
     without listing each instance).
   - For computation: reasoning does NOT show the formula with named values from specific
     pages (e.g., must show "Revenue (44,538 from income statement) / ..." not just "44,538 / ...").
   - For cross-page computations, it does NOT clearly distinguish which page provides the key,
     target value, denominator, or comparison value.
   - It confuses count vs percentage vs percentage-point difference vs ratio, or rounds before
     the final computation rather than after it.
   - Reasoning uses only vague references ("the table", "the chart", "the page") without
     identifying which specific element on which page.

SCORING:
- Score 0: any check fails.
- Score 1: all checks pass.
- Score 2: all pass AND question requires non-trivial cross-page reasoning (computation,
  counting across 3+ pages, or lookup chain), AND reasoning demonstrates exhaustive scanning
  with explicit page-by-page enumeration.
Respond with ONLY: 0, 1, or 2.\
"""


# =============================================================================
# Pipeline configuration
# =============================================================================


def build_config(
    seed_path: str = "seed.parquet",
    model_alias: str = "vl",
    model_id: str = DEFAULT_VLM_MODEL,
) -> dd.DataDesignerConfigBuilder:
    model_configs = [
        dd.ModelConfig(
            alias=model_alias,
            model=model_id,
            provider=VLLM_PROVIDER_NAME,
            inference_parameters=dd.ChatCompletionInferenceParams(
                timeout=120,
                temperature=1.0,
                top_p=0.95,
                max_parallel_requests=32,
                extra_body={
                    "top_k": 20,
                    "min_p": 0.0,
                    "presence_penalty": 1.5,
                    "repetition_penalty": 1.0,
                },
            ),
        ),
    ]

    config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

    config_builder.with_seed_dataset(
        dd.LocalFileSeedSource(path=seed_path),
        sampling_strategy=dd.SamplingStrategy.ORDERED,
    )

    config_builder.add_column(
        dd.SamplerColumnConfig(
            name="question_type",
            sampler_type=dd.SamplerType.CATEGORY,
            params=dd.CategorySamplerParams(
                values=[
                    "multiple choice",
                    "yes or no",
                    "string: word, phrase or short sentence",
                    "layout",
                    "numerical (int)",
                    "numerical (float)",
                    "numerical (percentage)",
                    "list of items (int, string, float or mixed)",
                    "not answerable",
                ],
                weights=[
                    0.001858736059479554,
                    0.001858736059479554,
                    0.07434944237918216,
                    0.14869888475836432,
                    0.2230483271375465,
                    0.14869888475836432,
                    0.14869888475836432,
                    0.2230483271375465,
                    0.029739776951672865,
                ],
            ),
        )
    )

    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="question",
            model_alias=model_alias,
            prompt=PROMPT_QUESTION,
            multi_modal_context=IMAGE_CONTEXT,
        )
    )

    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="answer",
            model_alias=model_alias,
            prompt=PROMPT_ANSWER,
            multi_modal_context=IMAGE_CONTEXT,
            extract_reasoning_content=True,
        )
    )

    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="quality_score",
            model_alias=model_alias,
            prompt=PROMPT_QUALITY_SCORE,
            multi_modal_context=IMAGE_CONTEXT,
        )
    )

    return config_builder


def create_dataset(
    config_builder: dd.DataDesignerConfigBuilder,
    num_records: int,
    vllm_endpoint: str,
    artifact_path: Path | str | None = None,
) -> DatasetCreationResults:
    model_providers = [
        dd.ModelProvider(
            name=VLLM_PROVIDER_NAME,
            endpoint=vllm_endpoint,
        ),
    ]
    data_designer = DataDesigner(
        artifact_path=artifact_path,
        model_providers=model_providers,
    )
    data_designer.set_run_config(dd.RunConfig(progress_bar=True, disable_early_shutdown=True))
    results = data_designer.create(config_builder, num_records=num_records, dataset_name="whole_document_qa")
    return results


if __name__ == "__main__":
    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument(
        "--vllm-endpoint",
        type=str,
        required=True,
        help="Base URL of the vLLM server hosting the VLM (e.g. http://localhost:8000/v1)",
    )
    parser.add_argument("--seed-path", type=str, required=True, help="Path to the seed parquet file")
    parser.add_argument("--model-alias", type=str, default="vl")
    parser.add_argument("--model-id", type=str, default=DEFAULT_VLM_MODEL)
    parser.add_argument("--num-records", type=int, default=5)
    parser.add_argument("--artifact-path", type=str, default=None)
    args = parser.parse_args()

    config_builder = build_config(
        seed_path=args.seed_path,
        model_alias=args.model_alias,
        model_id=args.model_id,
    )
    results = create_dataset(
        config_builder,
        num_records=args.num_records,
        vllm_endpoint=args.vllm_endpoint,
        artifact_path=args.artifact_path,
    )

    print(f"Dataset saved to: {results.artifact_storage.final_dataset_path}")

    results.load_analysis().to_report()
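
The prerequisites in the recipe's docstring describe `png_images_base64` as a JSON array holding one base64-encoded PNG per page. A minimal sketch of building that cell value with only the standard library is shown here. The helper names `encode_pages` and `decode_pages` are hypothetical and not part of Data Designer, the page bytes are placeholders, and writing the value into a parquet file is left to whichever tool you already use.

```python
import base64
import json


def encode_pages(page_pngs: list) -> str:
    """Return a JSON array string with one base64-encoded PNG per page, in page order."""
    return json.dumps([base64.b64encode(png).decode("ascii") for png in page_pngs])


def decode_pages(cell: str) -> list:
    """Inverse of encode_pages; useful for spot-checking a seed row."""
    return [base64.b64decode(item) for item in json.loads(cell)]


# Placeholder bytes standing in for rendered pages (PNG signature plus one byte,
# not valid images). Real seed data would hold full PNG files here.
fake_pages = [b"\x89PNG\r\n\x1a\n" + bytes([i]) for i in range(3)]
cell = encode_pages(fake_pages)
```

Round-tripping a row through `decode_pages` before a long run is a cheap way to confirm that page order and page count survived preparation.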
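
The `weights` passed to the `question_type` sampler are long decimals. They are numerically consistent with normalizing the integer weights 1, 1, 40, 80, 120, 80, 80, 120, 16 (sum 538). The recipe does not say how the values were derived, so treat the integers in this sketch as an inference. It only illustrates how such a list could be produced and checked.

```python
import math

QUESTION_TYPES = [
    "multiple choice",
    "yes or no",
    "string: word, phrase or short sentence",
    "layout",
    "numerical (int)",
    "numerical (float)",
    "numerical (percentage)",
    "list of items (int, string, float or mixed)",
    "not answerable",
]

# Assumption: raw integer weights inferred from the recipe's decimals, not stated in it.
RAW_WEIGHTS = [1, 1, 40, 80, 120, 80, 80, 120, 16]

# Values copied from the recipe's CategorySamplerParams.
RECIPE_WEIGHTS = [
    0.001858736059479554,
    0.001858736059479554,
    0.07434944237918216,
    0.14869888475836432,
    0.2230483271375465,
    0.14869888475836432,
    0.14869888475836432,
    0.2230483271375465,
    0.029739776951672865,
]


def normalize(raw: list) -> list:
    """Scale non-negative weights so they sum to 1."""
    total = sum(raw)
    return [w / total for w in raw]


WEIGHTS = normalize(RAW_WEIGHTS)

for name, weight in zip(QUESTION_TYPES, WEIGHTS):
    print(f"{name:45s} {weight:.2%}")
```

Read this way, the mix leans heavily toward integer counting and list extraction, the two failure modes the question prompt calls out, while multiple-choice and yes/no questions are sampled only rarely.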
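
The judge prompt ends with "Respond with ONLY: 0, 1, or 2", so the `quality_score` column holds text, not a number. The recipe itself does not filter on it. A downstream filtering step might look like the sketch below, where `parse_score`, `keep_record`, the `min_score` threshold, and the example records are all hypothetical.

```python
from typing import Optional


def parse_score(raw: object) -> Optional[int]:
    """Return 0, 1, or 2 when raw is exactly one of those digits (whitespace ignored), else None."""
    text = str(raw).strip()
    return int(text) if text in {"0", "1", "2"} else None


def keep_record(record: dict, min_score: int = 1) -> bool:
    """Keep a record only if its judge output parsed cleanly and meets the threshold."""
    score = parse_score(record.get("quality_score"))
    return score is not None and score >= min_score


# Illustrative rows; only the column name "quality_score" comes from the recipe.
records = [
    {"question": "q1", "quality_score": "2"},
    {"question": "q2", "quality_score": " 1\n"},
    {"question": "q3", "quality_score": "0"},
    {"question": "q4", "quality_score": "Score: 2"},  # malformed judge output is dropped
]
kept = [r["question"] for r in records if keep_record(r)]
```

Treating anything other than a bare `0`, `1`, or `2` as a failure is deliberately strict: a judge response that ignores the output format is itself a weak signal about the record.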