Copyright © 2026, NVIDIA Corporation.

Dev Notes



Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups, and insights from the team building NeMo Data Designer.

May 28, 2026

Mitigating Prompt Sensitivity

A Data Designer pipeline that generates thousands of regex-paired prompt preambles to manufacture format robustness — six diversity samplers, four quality judges, dual-purpose for SFT and RL.

Dhruv Nathawani
May 19, 2026

Retriever SDG Plugin

The Data Designer plugin behind NVDocs and the bootstrap SDG stage for NeMo embedding and reranking recipes.

Steve Han +3
May 5, 2026

Have It Your Way

A plugin framework for the custom pieces every real project ends up needing.

Johnny Greco +1
Apr 28, 2026

Training a VLM to Understand Long Documents

An iterative SDG story: 11.4M synthetic visual QA pairs, document-reasoning pipelines, and long-context VLM training lessons.

Nabin Mulepati +3
Apr 16, 2026

Push Datasets to Hugging Face Hub

Call .push_to_hub() and ship a generated dataset straight to a live HF dataset card. Done and dusted.

Nabin Mulepati +1
Apr 14, 2026

Engineering an Enterprise-Grade Text-to-SQL Dataset

A pipeline with conditional sampling, three-stage LLM generation, code validators, and judge scoring — boosting Nemotron Super on BIRD from 26.77 → 41.80.

Dhruv Nathawani +2
Apr 2, 2026

Async All the Way Down

How async dispatch in the engine cuts wall time across deep dependency pipelines — same config, same prompts, 1.3× faster on average.

Andre Manoel +3
Mar 25, 2026

Owning the Model Stack

Adaptive concurrency, throttle keying, retry boundaries — owning the whole model client to discover provider capacity at runtime.

Nabin Mulepati
Mar 24, 2026

Data Designer Got Skills

A CLI and skill workflow that lets agents drive Data Designer end-to-end — leaner context, fewer tool calls, the same output.

Johnny Greco
Mar 12, 2026

Search Agent SFT Data

Multi-turn search agent trajectories for Nemotron Super post-training — Tavily web search, Wikidata KG seeding, BrowseComp-style obfuscation.

Dhruv Nathawani
Feb 18, 2026

Structured Outputs from Nemotron

Schema-constrained outputs across CSV / JSON / TOML / XML / YAML — JSONSchemaBench and StructEval-Text results, plus the recipe.

Dhruv Nathawani
Feb 10, 2026

Deep Research Trajectories

MCP tool-use trajectories for training deep research agents — search, open, find, answer over a static BM25 corpus, no web APIs needed.

Eric Tramel
Feb 10, 2026

Designing Data Designer

Why SDG is a systems problem, and the design principles behind a composable orchestration framework — declarative columns, imperative engine.

Kirit Thadaka
Feb 4, 2026

Graduate-Level Science Reasoning (RQA)

A massive collection of graduate-level reasoning samples seeded from Common Crawl — improves Nemotron 3 Nano on MMLU-Pro, Math 500, GSM8K.

Dane Corneil +1