Whole-Document QA
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "data-designer>=0.5.6",
# ]
# ///
"""Long-Document Understanding Whole-Document QA Recipe

Generate multi-page question-answer pairs that require combining information
from across an entire PDF document. Targets the hardest long-document
understanding skills: counting across pages, list extraction, repeated-layout
aggregation, cross-page computation, and lookup chains. MMLongBench-Doc is
used to track progress toward these capabilities.

For each seed record the pipeline:

1. Samples a question type (multiple choice, yes/no, string, layout,
   numerical int/float/percentage, list, not answerable)
2. Generates a question that requires examining 2+ pages (preferably 4-8)
3. Generates an answer with exhaustive page-by-page reasoning (captured
   separately via extract_reasoning_content)
4. Evaluates overall quality including multi-page requirement, answer
   correctness, reasoning thoroughness, and format compliance (0/1/2 score)

Prerequisites:
    - A seed parquet file containing:
        * `png_images_base64` – JSON array of base64-encoded PNGs covering
          all pages of each document (one entry per page).
    - A vLLM-compatible deployment of the VLM
      (default: Qwen/Qwen3-VL-235B-A22B-Thinking-FP8).
      Recommended vLLM launch flags:
        --tensor-parallel-size 4
        --max-model-len 50000
        --gpu-memory-utilization 0.90
        --reasoning-parser deepseek_r1
        --limit-mm-per-prompt '{"video": 0}'
        --trust-remote-code

      Example launch script for 4× H100:
        docker run --gpus all \
            -p 8000:8000 \
            vllm/vllm-openai:latest \
            --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
            --tensor-parallel-size 4 \
            --max-model-len 50000 \
            --gpu-memory-utilization 0.90 \
            --reasoning-parser deepseek_r1 \
            --limit-mm-per-prompt '{"video": 0}' \
            --trust-remote-code

Run:
    # Basic usage (generates 5 records by default)
    uv run 08-whole-document-qa-sdg.py --vllm-endpoint http://localhost:8000/v1 --seed-path seed_data/seed_whole_document.parquet

    # Custom model and record count
    uv run 08-whole-document-qa-sdg.py --vllm-endpoint http://localhost:8000/v1 --seed-path seed_data/seed_whole_document.parquet --num-records 100

    # For help message and available options
    uv run 08-whole-document-qa-sdg.py --help
"""

from pathlib import Path

import data_designer.config as dd
from data_designer.interface import DataDesigner, DatasetCreationResults

DEFAULT_VLM_MODEL = "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8"
VLLM_PROVIDER_NAME = "vllm"

# =============================================================================
# Image context helper
# =============================================================================

IMAGE_CONTEXT = [
    dd.ImageContext(
        column_name="png_images_base64",
        data_type=dd.ModalityDataType.BASE64,
        image_format=dd.ImageFormat.PNG,
    ),
]

# =============================================================================
# Prompt templates
# =============================================================================

PROMPT_QUESTION = """\
<question-type>
{{ question_type }}
</question-type>

You are given ALL pages of a PDF document. Create a question of the given <question-type> that requires examining these pages.

Think step by step:
1. Scan ALL pages. Note every table, chart, figure, diagram, icon, and text section.
2. Identify connections ACROSS pages: a value on one page that relates to a value on another, a term defined on one page used in a chart on another, items to count across multiple pages.
3. Formulate a question that REQUIRES combining information from at least 2 pages (preferably 4-8). Verify: could someone answer this from any single page alone? If yes, revise.

CORE RULES:
- The question must require the ENTIRE document, including later pages. Do not create questions answerable from just the first half.
- Reject any question that would still be answered correctly if the reader stopped after the first cluster of matching pages.
- Use light grounding like the benchmark: "In this report, ...", "According to the slides, ...",
  or "In the Appendix, ...". Do NOT use explicit table/figure numbers (e.g., "In Table 3").
  Put the specificity in the CRITERIA of what to find, not where to find it.
- Prefer questions that use visual elements (charts, figures, icons, diagrams) over plain text.
- Do NOT include the answer or hints in the question. Output ONLY the question text.

MULTI-PAGE QUESTION PATTERNS
The model's accuracy drops with more evidence pages (1pg: 51%, 4pg: 23%, 6+: 15%).
Its dominant failure is systematic undercounting. Generate questions that train exhaustive scanning.

HIGH-VALUE TARGETS (based on failure analysis):
- COUNTING across pages — model undercounts by 2-3x. Use clear, visual, objective criteria.
- LIST EXTRACTION across pages — 81% zero rate on "list all". Items on later pages get missed.
- REPEATED-LAYOUT AGGREGATION — brochure cards, museum entries, cover-page student profiles, FAQ cards, chart panels, etc. Require scanning all matching pages, not just the first.
- CROSS-PAGE COMPUTATION — financial ratios, sums across tables.
- INFOGRAPHIC SPATIAL — binding numbers/labels to correct regions on maps/diagrams.
- LOOKUP CHAINS — find in one table, look up in another.

GROUNDING (match benchmark style — light references, heavy criteria):
The benchmark uses light grounding. Do NOT use explicit table/figure numbers.
Allowed grounding, from lightest to heaviest:
1. Bare: "How many figures are there in total in the article?"
2. Document type: "In the slides, how many charts compare ONLY US and Europe?"
3. Section/part: "How many prompt examples are given in the Appendix?"
4. Page range (rare): "How many figures are in this paper from Page 3 to Page 10?"
The CRITERIA (what to count/find) must be clear and visual — the LOCATION is left to the model.

QUALIFIER FIDELITY (critical)
- If multiple nearby answers exist, the question MUST include the qualifier that makes the target unique.
- Prefer qualifiers like: strongly / somewhat / overall / net, displayed / shown vs listed / mentioned,
  exact row / column / year / fiscal year / subgroup / legend item.
- The question must not be answerable by selecting a nearby but broader fact.

COUNTING DISCIPLINE: verify your own count by scanning every page. Enumerate, then recount.

{% if "not answerable" in question_type %}
Create a question relevant to the document whose answer is NOT present anywhere in the pages.
Prefer hard, benchmark-like negatives over trivial missing-info questions.
Good negative patterns:
- wrong year/date not shown anywhere in the document
- wrong subgroup/series not present anywhere in the document
- a condition that no entity satisfies anywhere in the document
- a lookup chain that never resolves anywhere in the document
Templates:
- "In the Consolidated Balance Sheets, what is [METRIC] for [YEAR not shown]?"
- "According to the report, which [ENTITY] satisfies [CONDITION not met by any row]?"
- "In the chart comparing [REAL GROUPS], what is the value for [GROUP NOT IN THE CHART]?"

{% elif "numerical" in question_type %}
Use visible numbers from tables/charts. Require arithmetic or counting across pages.
{% if "int" in question_type %}
Add "Answer with an integer." to the question. Keep counts in range 3-30.
Use clear, VISUAL, objective criteria — each instance must be unambiguously identifiable.
Templates:
- "How many charts in this report compare the general public with Latinos/Hispanics?
  Count each chart once, even if it has multiple sub-panels. Answer with an integer."
- "How many pages include at least one photograph of a building exterior?"
- "Across all tables in this paper, how many use 'F1' as a column header or metric?"
- "How many charts use 'year' (e.g., 2010, 2015) as the horizontal axis?"
- "How many distinct country/region names appear in the regulatory compliance sections?"
- "What is the total [METRIC] across all tables in the document?"
- "How many rows across all tables in this paper have [COLUMN] above [THRESHOLD]?"
- "Using the percentage on one page and the sample size on another page, how many [GROUP] does that correspond to? Round to the nearest hundred and answer with an integer."
{% elif "float" in question_type %}
Specify rounding (e.g., "Round to two decimal places.").
Templates:
- "Using the Consolidated Statements of Income and Consolidated Balance Sheets,
  what is the FY2021 inventory turnover ratio (Cost of Sales / Inventories)?
  Round to two decimal places."
- "What is the sum of Accrued Liabilities (balance sheet) and Selling Expense
  (income statement) for FY2021? Answer in millions."
- "What is the ratio of [METRIC A] from the income statement to [METRIC B] from the balance sheet?"
- "What is the difference between the largest and smallest 'Change' values
  across all subgroup charts in this report?"
{% elif "percentage" in question_type %}
Add "Answer with a % sign."
Templates:
- "According to the financial statements, what is [METRIC A] as a percentage of [METRIC B]?"
- "What is the percentage difference between [GROUP A] and [GROUP B] in the chart titled 'X'?"
- "How much did [METRIC] change between [YEAR A] and [YEAR B] according to the report?"
{% endif %}

{% elif "list" in question_type %}
Answer should be 2-8 short items gathered from DIFFERENT pages.
Add "Return a JSON array of strings, e.g., ["A", "B"]."
The model's worst pattern is "list all" (81% zero rate) — items on later pages get missed.
Specify scope and criteria explicitly.
The question itself must be a natural language sentence — NEVER output a JSON array as the question.
Templates:
- "List all organisations introduced with at least one paragraph of description.
  Return a JSON array, e.g., ["Org A", "Org B"]."
- "What are all the examples given for 'what does not make you data-driven'?
  Collect from every page. Return a JSON array."
- "List every college that uses a Year 3 student on its brochure cover. Return a JSON array."
- "What are the [FIELD] values for [ENTITY] mentioned across the document?"
- "What colors represent [CATEGORIES] in charts titled 'X' and 'Y'? Return a JSON array."

{% elif "yes" in question_type %}
Templates:
- "Is [METRIC] for [ENTITY A] greater than for [ENTITY B] according to the report?"
- "Is the revenue growth rate in the financial highlights consistent with the detailed
  segment breakdown later in the report? Answer Yes or No."
- "Does the document contain more than N [ITEMS with clear criteria]?"

{% elif "multiple choice" in question_type %}
Provide exactly 4 options (A-D), plausible and mutually exclusive.
Templates:
- "Which [ENTITY] has the highest [METRIC] in the report?
  A. ... B. ... C. ... D. ..."
- "Based on the financial data in this report, which statement is true?
  A. ... B. ... C. ... D. ..."

{% elif "string:" in question_type %}
Answer is a word, phrase, or short sentence requiring multi-page reasoning.
Templates:
- "In the chart titled 'X', in the 'Change' column, which subgroup shows the largest increase?"
- "Which method has the highest BLEU score in Table 3? Look up that method's parameter
  count in Table 1. What is it?"
- "According to the report, which region has the largest number of [ENTITY]?"
- "Compare the 2015 and 2008 values for 'Foreign born' Latinos in the economic optimism
  chart vs. the personal finance chart. In which is the gap larger?"
- "In the support contacts table, what phone number is listed for the country mentioned
  most often in the Regulatory Compliance sections?"

{% elif "layout" in question_type %}
Answer requires understanding visual/spatial structure. Answer is a number, word, or phrase.
Templates:
- "What range does [COLOR] represent in the legend of the chart titled 'X'?"
- "In the network diagram, which nodes are connected to [LABEL]?"
- "In the flowchart in Figure N, what step follows [LABEL]?"
- "What text appears inside the [COLOR/POSITION] box in the slides?"
- "What are the colors of the icons for [ACTION A] and [ACTION B] in the document?"
{% endif %}

These templates are for inspiration. Create a question specific to the actual visible content.
OUTPUT: Write ONLY the question. No answer, no explanation.\
"""


PROMPT_ANSWER = """\
Your question is: {{ question }}

You are given ALL pages of a PDF document. Answer the question using these pages.

You MUST use this exact output structure:
<think>
[all reasoning here]
</think>
[bare final answer here — no explanation, no labels, no extra text]

In your THINKING (inside <think> tags), follow this protocol.

QUALIFIER LOCK (critical)
Before extracting any answer, copy the restrictive qualifiers from the question and keep them fixed:
- year / date / fiscal year
- subgroup / series / legend item
- exact metric (count vs percentage vs percentage-point difference)
- displayed / shown / visible vs listed / mentioned
- first / second / last / nearest / highest / lowest

Do NOT substitute a nearby year, nearby subgroup, nearby series, nearby row, or nearby fact.
If the question asks for a specific subgroup or metric, read exactly that one and no other.

THINKING PROTOCOL:
1. State what you need to find.
2. Scan ALL pages first to last. For each page, note whether it contains relevant info.
   Do NOT stop at the first match — the same content type often continues on later pages
   (charts across pages 4-14, financial notes on pages 60-80, organisations on pages 10-35).
3. For each relevant page, enumerate findings with explicit numbering:
   "Page 28: Item 14 — Arven Gold..., Item 15 — Bergen Kunsthall..."
   "Target B found on page 59 in Table 5."
   Keep a running tally. Do NOT summarize or estimate.
4. If the answer requires combining values from multiple pages, show the computation:
   "Total Debt = Current portion (799) + Long-term (6,692) = 7,491"
5. After the LAST page, verify:
   - Counting: recount from the top. Compare tallies. Third pass if they differ.
   - Computation: re-read source values and redo the arithmetic.
   - List extraction: scan again for missed items, especially on later pages.
   - Lookup chains: confirm intermediate results match the correct row/entity.
   - Comparison: re-check you read from the correct charts/tables (match by title).

UNIT DISCIPLINE
- Preserve units exactly when present or requested (%, $, million, etc.).
- Financial reports: parentheses = negative; check table header for unit scale (e.g., "In millions");
  "how much higher/more" = positive number; "change" = positive for increase, negative for decrease.

COUNT / PERCENT / DENOMINATOR DISCIPLINE
- If one page provides a percentage and another page provides a sample size, the percentage is NOT the final answer until it is converted using the sample size.
- Distinguish carefully between count, percentage, percentage-point difference, and ratio.
- Only round after the final computation, never before.

THINKING STABILITY (critical)
- Follow the protocol once from top to bottom. Do NOT restart from step 1 after you already found the relevant pages.
- Do at most one scan pass and one verification pass.
- If there are two plausible candidates, compare them once using the question's qualifiers, choose the best-supported one, and continue. Do NOT keep generating new alternatives.
- Do NOT repeat the same scan, recount, or conclusion more than once.
- As soon as the answer is found and verified, stop thinking and produce the final answer.
- Do NOT use filler loops such as repeating a phrase, title, entity name, or page reference many times.
- If you have a complete answer supported by the required pages and qualifiers, commit to it. Do not reopen the search.

EXHAUSTIVE REPEATED-LAYOUT SCAN
If multiple pages in the document share the same layout or template, scan ALL matching pages before concluding.
Do not stop after the first valid hit.
For counts/lists, maintain a running page-by-page tally or item list until the last relevant page in the document.
  Bad: "I see items 14-29 on pages 28-31. Count = 29 - 14 + 1 = 16." (stopped early, missed pages 32-34)
  Good: "Page 28: items 14-21 (running total: 8). Page 30: items 22-29 (running total: 16). Page 32: items 30-37 (running total: 24). Page 34: items 38-44 (running total: 31). Final count: 31."

PAGE-BREAK CONTINUATION
If a sentence, paragraph, table row, caption, or figure explanation appears to continue onto the next page,
combine the text before deciding the answer is missing.

THINKING TRACE QUALITY (your thinking is used as training data):
- Cite which page and which element (by title/heading) you found evidence on.
- Quote the specific values you read.
- For computation, show the formula with named operands.
- For counting, enumerate every item, not just a total.

FINAL ANSWER: After </think>, output ONLY the bare answer. No reasoning, no step labels,
no "DECOMPOSE:", no "VERIFY:", no explanation. Just the answer value.

{% if "not answerable" in question_type %}
Only say "Not answerable" if the information is genuinely absent from ALL pages — you must have
scanned every single page before concluding this. Do NOT refuse because:
- The exact phrasing doesn't appear (look for equivalent information)
- You need to compute the answer from available data (do the computation)
- A figure or icon is hard to read (give your best reading)
- You only checked the first half of the document (later pages may have the answer)
{% endif %}

- For figures/diagrams: examine visual elements directly (nodes, arrows, colors, spatial groupings), not just captions.
- For list answers: format as ["item1", "item2"]. Include units when applicable.\
"""


PROMPT_QUALITY_SCORE = """\
<question-type>{{ question_type }}</question-type>
<question>{{ question }}</question>
<answer>{{ answer }}</answer>
<answer_reasoning>{{ answer__reasoning_content }}</answer_reasoning>

You are given ALL pages of a PDF document. Evaluate the question-answer pair AND its reasoning.

Filter out bad training data, especially:
- looping or repetitive reasoning traces that keep rescanning the same pages/evidence without converging
- unfinished or truncated reasoning traces that stop mid-thought
- reasoning that stops after the first cluster of results when later pages still matter

CHECKS (any failure => score 0):

1. PAGES READABLE — document pages are clear and not low quality.
2. QUESTION RELEVANT — question is about content in these pages.
{% if "not answerable" in question_type %}
   For "not answerable": question should be relevant but the answer must NOT be in the pages.
{% endif %}
3. ANSWER CORRECT — verify by examining the actual pages yourself. For counts, count the
   items yourself. For chart values, find the specific chart and check. For lists, verify
   each item exists.
{% if "not answerable" in question_type %}
   Correct answer must be exactly "Not answerable".
{% endif %}
4. QUESTION WELL-FORMED — unambiguous, doesn't contain the answer, not trivially easy.
5. MULTI-PAGE REQUIRED — would someone need 2+ pages to answer? Score 0 if answerable
   from a single page.
6. FORMAT + REFUSAL —
   - Lists must be JSON arrays, units included, correct specificity.
   - Score 0 if the answer contains reasoning steps, protocol labels (e.g., "DECOMPOSE:",
     "SCAN:", "VERIFY:"), explanations, or anything beyond the bare result.
{% if "not answerable" in question_type %}
   - Answer must be exactly "Not answerable".
   - Score 0 if the question is unanswerable only because it asks a trivially absent detail
     (e.g., author's phone number). Must be a near-miss negative where a specific qualifier
     (year, subgroup, row, condition) is absent from the document.
{% else %}
   - Answer MUST NOT be "Not answerable" or any refusal.
{% endif %}
7. REASONING QUALITY (critical — reasoning is used as chain-of-thought training data):
   The <answer_reasoning> must demonstrate thorough multi-page scanning. Score 0 if:
   - Reasoning does NOT mention specific pages where evidence was found.
   - Reasoning stops scanning after the first cluster of results (e.g., finds 5 items on
     pages 10-12 and stops, when more exist on pages 15-20).
   - It repeats the same scan, recount, candidate answer, page reference, title, entity name,
     or conclusion without adding new evidence.
   - It restarts the reasoning process after already finding the relevant page(s) or elements.
   - It contains obvious loop markers such as repeated "Wait, let me", "Actually",
     "Let's look again", or "Let's look at" — especially if the same phrase appears
     more than 3 times with no new information between repetitions.
   - It keeps generating new alternatives after already having enough evidence to answer.
   - It ends in an unfinished or truncated way, or appears to stop mid-thought.
   - For repeated-layout or repeated-entry questions, it stops after the first valid hit instead
     of scanning all matching pages.
   - For counting: reasoning does NOT enumerate items explicitly (just states a total
     without listing each instance).
   - For computation: reasoning does NOT show the formula with named values from specific
     pages (e.g., must show "Revenue (44,538 from income statement) / ..." not just "44,538 / ...").
   - For cross-page computations, it does NOT clearly distinguish which page provides the key,
     target value, denominator, or comparison value.
   - It confuses count vs percentage vs percentage-point difference vs ratio, or rounds before
     the final computation rather than after it.
   - Reasoning uses only vague references ("the table", "the chart", "the page") without
     identifying which specific element on which page.

SCORING:
- Score 0: any check fails.
- Score 1: all checks pass.
- Score 2: all pass AND question requires non-trivial cross-page reasoning (computation,
  counting across 3+ pages, or lookup chain), AND reasoning demonstrates exhaustive scanning
  with explicit page-by-page enumeration.
Respond with ONLY: 0, 1, or 2.\
"""


# =============================================================================
# Pipeline configuration
# =============================================================================


def build_config(
    seed_path: str = "seed.parquet",
    model_alias: str = "vl",
    model_id: str = DEFAULT_VLM_MODEL,
) -> dd.DataDesignerConfigBuilder:
    model_configs = [
        dd.ModelConfig(
            alias=model_alias,
            model=model_id,
            provider=VLLM_PROVIDER_NAME,
            inference_parameters=dd.ChatCompletionInferenceParams(
                timeout=120,
                temperature=1.0,
                top_p=0.95,
                max_parallel_requests=32,
                extra_body={
                    "top_k": 20,
                    "min_p": 0.0,
                    "presence_penalty": 1.5,
                    "repetition_penalty": 1.0,
                },
            ),
        ),
    ]

    config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

    config_builder.with_seed_dataset(
        dd.LocalFileSeedSource(path=seed_path),
        sampling_strategy=dd.SamplingStrategy.ORDERED,
    )

    config_builder.add_column(
        dd.SamplerColumnConfig(
            name="question_type",
            sampler_type=dd.SamplerType.CATEGORY,
            params=dd.CategorySamplerParams(
                values=[
                    "multiple choice",
                    "yes or no",
                    "string: word, phrase or short sentence",
                    "layout",
                    "numerical (int)",
                    "numerical (float)",
                    "numerical (percentage)",
                    "list of items (int, string, float or mixed)",
                    "not answerable",
                ],
                weights=[
                    0.001858736059479554,
                    0.001858736059479554,
                    0.07434944237918216,
                    0.14869888475836432,
                    0.2230483271375465,
                    0.14869888475836432,
                    0.14869888475836432,
                    0.2230483271375465,
                    0.029739776951672865,
                ],
            ),
        )
    )

    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="question",
            model_alias=model_alias,
            prompt=PROMPT_QUESTION,
            multi_modal_context=IMAGE_CONTEXT,
        )
    )

    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="answer",
            model_alias=model_alias,
            prompt=PROMPT_ANSWER,
            multi_modal_context=IMAGE_CONTEXT,
            extract_reasoning_content=True,
        )
    )

    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="quality_score",
            model_alias=model_alias,
            prompt=PROMPT_QUALITY_SCORE,
            multi_modal_context=IMAGE_CONTEXT,
        )
    )

    return config_builder


def create_dataset(
    config_builder: dd.DataDesignerConfigBuilder,
    num_records: int,
    vllm_endpoint: str,
    artifact_path: Path | str | None = None,
) -> DatasetCreationResults:
    model_providers = [
        dd.ModelProvider(
            name=VLLM_PROVIDER_NAME,
            endpoint=vllm_endpoint,
        ),
    ]
    data_designer = DataDesigner(
        artifact_path=artifact_path,
        model_providers=model_providers,
    )
    data_designer.set_run_config(dd.RunConfig(progress_bar=True, disable_early_shutdown=True))
    results = data_designer.create(config_builder, num_records=num_records, dataset_name="whole_document_qa")
    return results


if __name__ == "__main__":
    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument(
        "--vllm-endpoint",
        type=str,
        required=True,
        help="Base URL of the vLLM server hosting the VLM (e.g. http://localhost:8000/v1)",
    )
    parser.add_argument("--seed-path", type=str, required=True, help="Path to the seed parquet file")
    parser.add_argument("--model-alias", type=str, default="vl")
    parser.add_argument("--model-id", type=str, default=DEFAULT_VLM_MODEL)
    parser.add_argument("--num-records", type=int, default=5)
    parser.add_argument("--artifact-path", type=str, default=None)
    args = parser.parse_args()

    config_builder = build_config(
        seed_path=args.seed_path,
        model_alias=args.model_alias,
        model_id=args.model_id,
    )
    results = create_dataset(
        config_builder,
        num_records=args.num_records,
        vllm_endpoint=args.vllm_endpoint,
        artifact_path=args.artifact_path,
    )

    print(f"Dataset saved to: {results.artifact_storage.final_dataset_path}")

    results.load_analysis().to_report()
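
The recipe stores the judge's reply in the `quality_score` column as plain text, and the scoring prompt asks for a bare 0, 1, or 2. A downstream step that keeps only passing records might look like the sketch below. It is not part of the recipe or of the data-designer API: the helper names `parse_quality_score` and `keep_record` are hypothetical, and it assumes the generated records have already been loaded as plain Python dicts. Replies that are not a bare score are treated as failures, which is one possible policy among several.

```python
import re

# Hypothetical helper (not part of data-designer): turn the judge's raw text
# into an integer score, or None when the reply is not a bare 0, 1, or 2.
def parse_quality_score(raw):
    if raw is None:
        return None
    match = re.fullmatch(r"\s*([012])\s*", str(raw))
    return int(match.group(1)) if match else None


# Hypothetical helper: keep a record only when its score parses and meets
# the threshold. Assumes each record is a dict with a "quality_score" key.
def keep_record(record, min_score=1):
    score = parse_quality_score(record.get("quality_score"))
    return score is not None and score >= min_score


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"question": "q1", "quality_score": "2"},
        {"question": "q2", "quality_score": " 0\n"},
        {"question": "q3", "quality_score": "Score: 1"},  # not a bare score
    ]
    kept = [r["question"] for r in records if keep_record(r)]
    print(kept)
```

Raising `min_score` to 2 would keep only the records the judge rated as requiring non-trivial cross-page reasoning with exhaustive enumeration.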