Dev Notes

View as Markdown

Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups, and insights from the team building NeMo Data Designer.

Jun 1, 2026

Designing Nemotron-Personas

The compound-AI pipeline behind Nemotron-Personas seeding Nemotron training. Now open-sourced: fork it, add your data, ship personas for any use case.

May 28, 2026

Mitigating Prompt Sensitivity

A Data Designer pipeline that generates thousands of regex-paired prompt preambles to manufacture format robustness — six diversity samplers, four quality judges, dual-purpose for SFT and RL.

May 19, 2026

Retriever SDG Plugin

The Data Designer plugin behind NVDocs and the bootstrap SDG stage for NeMo embedding and reranking recipes.

May 5, 2026

Have It Your Way

A plugin framework for the custom pieces every real project ends up needing.

Apr 28, 2026

Training a VLM to Understand Long Documents

An iterative SDG story: 11.4M synthetic visual QA pairs, document-reasoning pipelines, and long-context VLM training lessons.

Apr 16, 2026

Push Datasets to Hugging Face Hub

Call .push_to_hub() and ship a generated dataset straight to a live HF dataset card. Done and dusted.

Apr 14, 2026

Engineering an Enterprise-Grade Text-to-SQL Dataset

A pipeline with conditional sampling, three-stage LLM generation, code validators, and judge scoring — boosting Nemotron Super on BIRD from 26.77 → 41.80.

Apr 2, 2026

Async All the Way Down

How async dispatch in the engine cuts wall time across deep dependency pipelines — same config, same prompts, 1.3× faster on average.

Mar 25, 2026

Owning the Model Stack

Adaptive concurrency, throttle keying, retry boundaries — owning the whole model client to discover provider capacity at runtime.

Mar 24, 2026

Data Designer Got Skills

A CLI and skill workflow that lets agents drive Data Designer end-to-end — leaner context, fewer tool calls, the same output.

Mar 12, 2026

Search Agent SFT Data

Multi-turn search agent trajectories for Nemotron Super post-training — Tavily web search, Wikidata KG seeding, BrowseComp-style obfuscation.

Feb 18, 2026

Structured Outputs from Nemotron

Schema-constrained outputs across CSV / JSON / TOML / XML / YAML — JSONSchemaBench and StructEval-Text results, plus the recipe.

Feb 10, 2026

Deep Research Trajectories

MCP tool-use trajectories for training deep research agents — search, open, find, answer over a static BM25 corpus, no web APIs needed.

Feb 10, 2026

Designing Data Designer

Why SDG is a systems problem, and the design principles behind a composable orchestration framework — declarative columns, imperative engine.

Feb 4, 2026

Graduate-Level Science Reasoning (RQA)

A massive collection of graduate-level reasoning samples seeded from Common Crawl — improves Nemotron 3 Nano on MMLU-Pro, Math 500, GSM8K.