Use Case Recipes | NVIDIA NeMo Data Designer

Recipes are self-contained code examples that show how to use Data Designer for specific use cases. Each recipe can be run independently.

New to Data Designer?

Recipes provide working code for specific use cases without detailed explanations. If you’re learning Data Designer for the first time, start with our tutorial notebooks, which offer step-by-step guidance and explain core concepts. Once you’re familiar with the basics, return here for practical, ready-to-use implementations.

Prerequisite

These recipes use the OpenAI model provider by default. Ensure your OpenAI provider is set up via the Data Designer CLI before running a recipe.

Code Generation

Text to Python

Natural-language instructions paired with Python implementations across complexity levels and industries.

Python code generation · validation · LLM-as-judge

Text to SQL

Natural-language instructions paired with SQL implementations across complexity levels and industries.

SQL code generation · validation · LLM-as-judge

Nemotron Super Text to SQL

Enterprise-grade text-to-SQL training data — dialect-specific SQL, distractor injection, dirty data, 5 LLM judges with 15 scoring dimensions.

Multi-dialect SQL · SubcategorySamplerParams · 5 judges · 15 score columns

QA and Chat

Product Info QA

Product information records with accompanying question-answer pairs.

Structured outputs · expression columns · LLM-as-judge

Multi-Turn Chat

Multi-turn chat conversations between a user and an AI assistant.

Structured outputs · expression columns · LLM-as-judge

Trace Ingestion

Agent Rollout Trace Distillation

Read agent rollout traces from disk and turn each one into a structured workflow record inside a Data Designer pipeline. See the ingestion guide for the trace format.

AgentRolloutSeedSource · ATIF, Claude Code, Codex, Hermes formats · trace-aware prompts

MCP and Tool Use

Basic MCP Tool Use

Minimal example of MCP tool calling — defines a simple MCP server and generates data that requires tool calls to complete.

LocalStdioMCPProvider · simple tool server · tool-augmented text

PDF Document QA

Grounded Q&A pairs from PDF documents using MCP tool calls and BM25 search.

LocalStdioMCPProvider · BM25 retrieval · per-column trace capture
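For readers unfamiliar with BM25, the ranking idea behind the PDF Document QA recipe can be sketched in plain Python. This is not the recipe's code and does not use Data Designer or MCP. It is a minimal Okapi BM25 scorer that assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`. The function names `tokenize` and `bm25_scores` are made up for this illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumption; real pipelines often do more)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency, always positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    passages = [
        "the invoice total is due in march",
        "shipping policy and returns",
        "invoice numbers start with INV",
    ]
    ranked = sorted(zip(bm25_scores("invoice total", passages), passages), reverse=True)
    for score, passage in ranked:
        print(f"{score:.3f}  {passage}")
```

In a retrieval-grounded setup, the highest-scoring passages would be the ones handed to the model as context for question and answer generation.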

Nemotron Super Search Agent

Multi-turn search agent trajectories — Tavily web search via MCP, Wikidata KG seeding, BrowseComp-style question generation.

Tavily MCP · Wikidata seeding · two-stage question generation · trajectory capture

Plugin Development

Markdown Section Seed Reader

Define a custom FileSystemSeedReader inline and turn Markdown files into one seed row per heading section.

Single-file custom reader · hydrate_row() fanout · DirectorySeedSource customization
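The "one seed row per heading section" idea can be illustrated without any Data Designer classes. The sketch below is a standalone approximation, not the recipe's reader: it does not subclass `FileSystemSeedReader` or implement `hydrate_row()`, whose signatures are not shown on this page. The row fields (`source`, `level`, `heading`, `body`) and the function name are hypothetical, and it assumes ATX-style headings (`#` through `######`).

```python
import re

# Matches ATX headings such as "# Title" or "### Details".
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example easy to embed


def split_markdown_sections(text: str, source: str = "<memory>") -> list[dict]:
    """Return one row per heading section; text before the first heading is skipped."""
    rows = []
    current = None
    in_fence = False
    for line in text.splitlines():
        # Track fenced code blocks so "# comment" lines inside them are not headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if current is not None:
                rows.append(current)
            current = {
                "source": source,
                "level": len(match.group(1)),
                "heading": match.group(2),
                "body": [],
            }
        elif current is not None:
            current["body"].append(line)
    if current is not None:
        rows.append(current)
    # Join the collected body lines into a single string per row.
    for row in rows:
        row["body"] = "\n".join(row["body"]).strip()
    return rows


if __name__ == "__main__":
    sample = "# Intro\nHello\n\n## Usage\nRun it.\n"
    for row in split_markdown_sections(sample, source="README.md"):
        print(row)
```

One file fans out into several rows here, which is the same shape of transformation the recipe describes; the actual reader would emit rows through the library's own interface.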

VLM Long-Document Understanding

A 9-recipe pipeline for generating visual QA training data from long PDF documents: OCR, page classification, single-page / multi-page / whole-document QA, and frontier-model quality filtering.

Seed Dataset Preparation

Download PDFs, render page images, and prepare seed datasets for the downstream VLM recipes.

Nemotron Parse OCR

Run Nemotron Parse over document pages and save OCR transcripts for text-based QA generation.

Text QA from OCR Transcripts

Generate text-grounded question-answer pairs from OCR transcripts.

Page Classification

Classify pages by visual reasoning potential before running more expensive QA generation.

Visual QA

Generate visual question-answer pairs from classified page images.

Single-Page QA

Generate single-page VLM QA examples from page-level image seeds.

Multi-Page Windowed QA

Generate cross-page QA examples over fixed-size page windows.
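As a rough illustration of fixed-size windowing, the sketch below groups an ordered list of page identifiers into windows. It is not taken from the recipe: the function name is hypothetical, and it assumes non-overlapping windows with a shorter final window when the page count is not a multiple of the window size. The recipe itself may use overlap or drop the remainder.

```python
def page_windows(pages: list[str], size: int) -> list[list[str]]:
    """Split an ordered page list into consecutive, non-overlapping windows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Step through the list in strides of `size`; the last slice may be shorter.
    return [pages[start:start + size] for start in range(0, len(pages), size)]


if __name__ == "__main__":
    pages = [f"page_{n:03d}.png" for n in range(1, 6)]
    for window in page_windows(pages, size=2):
        print(window)
```

Each window would then act as one seed record, so a question can require evidence from more than one page in that window.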

Whole-Document QA

Generate document-level QA examples over grouped page images.

Frontier Judge QA Filter

Score and filter generated QA pairs with a stronger independent judge.