Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups, and insights from the team building NeMo Data Designer.

A Data Designer pipeline that generates thousands of regex-paired prompt preambles to manufacture format robustness: six diversity samplers, four quality judges, dual-purpose for SFT and RL.

The Data Designer plugin behind NVDocs and the bootstrap SDG stage for NeMo embedding and reranking recipes.

A plugin framework for the custom pieces every real project ends up needing.

An iterative SDG story: 11.4M synthetic visual QA pairs, document-reasoning pipelines, and long-context VLM training lessons.

Call .push_to_hub() and ship a generated dataset straight to a live Hugging Face dataset card. Done and dusted.

A pipeline with conditional sampling, three-stage LLM generation, code validators, and judge scoring that boosts Nemotron Super on BIRD from 26.77 to 41.80.

How async dispatch in the engine cuts wall time across deep dependency pipelines — same config, same prompts, 1.3× faster on average.

Adaptive concurrency, throttle keying, retry boundaries — owning the whole model client to discover provider capacity at runtime.

A CLI and skill workflow that lets agents drive Data Designer end-to-end — leaner context, fewer tool calls, the same output.

Multi-turn search agent trajectories for Nemotron Super post-training — Tavily web search, Wikidata KG seeding, BrowseComp-style obfuscation.

Schema-constrained outputs across CSV / JSON / TOML / XML / YAML: JSONSchemaBench and StructEval-Text results, plus the recipe.

MCP tool-use trajectories for training deep research agents: search, open, find, answer over a static BM25 corpus, no web APIs needed.

Why SDG is a systems problem, and the design principles behind a composable orchestration framework — declarative columns, imperative engine.

A massive collection of graduate-level reasoning samples seeded from Common Crawl — improves Nemotron 3 Nano on MMLU-Pro, Math 500, GSM8K.