Copyright © 2026, NVIDIA Corporation.

🎨 NeMo Data Designer


👋 Welcome! Data Designer is an orchestration framework for generating high-quality synthetic data. You provide LLM endpoints (NVIDIA, OpenAI, vLLM, etc.), and Data Designer handles batching, parallelism, validation, and more.

Configure columns and models → Preview samples and iterate → Create your full dataset at scale.

Unlike raw LLM calls, Data Designer gives you statistical diversity, field correlations, automated validation, and reproducible workflows. For details, see Architecture & Performance.

📝 Want to hear from the team? Check out our Dev Notes for deep dives, best practices, and insights.

Install

pip install data-designer

Setup

Get an API key from one of the default providers and set it as an environment variable:

# NVIDIA (build.nvidia.com) - recommended
export NVIDIA_API_KEY="your-api-key-here"

# OpenAI (platform.openai.com)
export OPENAI_API_KEY="your-openai-api-key-here"

# OpenRouter (openrouter.ai)
export OPENROUTER_API_KEY="your-openrouter-api-key-here"

Verify your configuration is ready:

data-designer config list

This displays the pre-configured model providers and models. See CLI Configuration to customize.
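
If the example below fails with an authentication error, it can help to confirm from Python that at least one provider key is actually visible to the process. The following is a minimal sketch using only the standard library. It is not part of Data Designer: the helper name `find_configured_providers` is hypothetical, and it checks only the three environment variable names listed above.

```python
import os

# Environment variable names taken from the setup step above.
PROVIDER_KEY_VARS = {
    "NVIDIA": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def find_configured_providers(environ=None):
    """Return provider names whose API key variable is set and non-empty.

    Hypothetical helper for illustration; not a Data Designer API.
    """
    env = os.environ if environ is None else environ
    return [
        provider
        for provider, var in PROVIDER_KEY_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    configured = find_configured_providers()
    if configured:
        print("API keys found for:", ", ".join(configured))
    else:
        print("No provider API key found; export one of:",
              ", ".join(PROVIDER_KEY_VARS.values()))
```

This only tells you that a variable is set, not that the key is valid; `data-designer config list` remains the check for the provider and model configuration itself.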

Your First Dataset

Let's generate multilingual greetings to see Data Designer in action:

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default model providers
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a sampler column to randomly select a language
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
    )
)

# Add an LLM text generation column
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual and formal greeting in {{ language }}.",
    )
)

# Generate a preview
results = data_designer.preview(config_builder)
results.display_sample_record()

🎉 That's it! You've just designed your first synthetic dataset.
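
To make the data flow in the example concrete: each record first receives a `language` value from the sampler column, and that value is then substituted into the `{{ language }}` placeholder of the prompt before the model is called. The plain-Python sketch below mimics those two steps with the standard library only. It is an illustration under simplifying assumptions, not Data Designer's implementation: sampling is assumed to be uniform via `random.choice`, a small regex stands in for Jinja rendering, the helper names `sample_language` and `render_prompt` are hypothetical, and no LLM is called.

```python
import random
import re

LANGUAGES = ["English", "Spanish", "French", "German", "Italian"]
PROMPT_TEMPLATE = "Write a casual and formal greeting in {{ language }}."


def sample_language(rng):
    """Pick one language at random (assumes uniform sampling)."""
    return rng.choice(LANGUAGES)


def render_prompt(template, record):
    """Replace each {{ name }} placeholder with the matching record value.

    A minimal stand-in for Jinja rendering; handles simple names only.
    """
    def substitute(match):
        return str(record[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rng = random.Random(0)  # fixed seed so repeated runs give the same records
records = [{"language": sample_language(rng)} for _ in range(3)]
for record in records:
    # In the real pipeline, this rendered prompt would be sent to the model.
    record["prompt"] = render_prompt(PROMPT_TEMPLATE, record)
    print(record["prompt"])
```

The point of the sketch is the dependency order: the `greeting` column can only be generated after `language` has a value, because its prompt references that column by name.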

🚀 Next Steps

  • Tutorials – Step-by-step notebooks covering core features
  • Recipes – Ready-to-use examples for common use cases
  • Concepts – Deep dive into columns, models, and configuration

Learn More

  • Deployment Options – Library vs. NeMo Microservice
  • Model Configuration – Configure LLM providers and models
  • Architecture & Performance – Optimize for throughput and scale