NeMo Data Designer (NDD) is a declarative data generation framework that integrates with NeMo Curator to scale synthetic data pipelines. Instead of writing imperative LLM call logic, you define a configuration that describes what columns to generate, how to sample structured fields, and which LLM to use. NDD handles execution, batching, and token metric collection automatically.
NeMo Curator wraps NDD through the DataDesignerStage, which accepts a DataDesignerConfigBuilder or a YAML config file. The stage:
- Takes a DocumentBatch as input
- Calls DataDesigner.preview() to generate new columns (samplers, expressions, LLM text)
- Returns a DocumentBatch with token usage metrics

Install the NDD dependency:
The data-designer package is included in the text extras. For local model serving, also install:
The DataDesignerStage is the core integration point between NeMo Curator and NDD.
DataDesignerStage automatically collects and reports:
- ndd_running_time: Wall-clock time for the NDD preview() call
- num_input_records / num_output_records: Record counts before and after generation
- input_tokens_median_per_record / output_tokens_median_per_record: Median token counts across all LLM columns

NDD configurations use a builder pattern. You add columns of three types: sampler columns, expression columns, and LLM text columns.
For full documentation on building NDD configurations, see the NDD config builder reference.
Generate structured data using built-in samplers (Faker names, UUIDs, dates):
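As a library-independent illustration of what a sampler column produces, the sketch below uses only the Python standard library. It does not use the NDD API: the function name, the field names, and the small name lists (a stand-in for Faker) are all hypothetical.

```python
import random
import uuid
from datetime import date, timedelta

# Hypothetical stand-in for a Faker-style name provider.
FIRST_NAMES = ["Ana", "Bo", "Chen", "Dara"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Moreau"]


def sample_records(n, seed=0):
    """Return n records with sampled name, ID, and date fields."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    start = date(2024, 1, 1)
    records = []
    for _ in range(n):
        records.append(
            {
                "patient_name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
                # Build a version-4 UUID from seeded random bits.
                "patient_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
                # Uniform date within 2024 (a leap year, so offsets 0..365).
                "visit_date": (start + timedelta(days=rng.randrange(366))).isoformat(),
            }
        )
    return records


rows = sample_records(3, seed=1)
```

No model call is involved at this step: sampled fields are cheap to produce and can be seeded for reproducibility.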
Derive values from other columns using Jinja templates:
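The idea behind an expression column can be sketched without NDD or Jinja installed. The snippet below is a simplified stand-in that handles only plain `{{ column }}` placeholders, not Jinja filters or control flow. The helper names are hypothetical and are not part of the NDD API.

```python
import re

# Matches simple placeholders such as {{ first }}; real Jinja supports far more.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, record):
    """Replace each {{ column }} placeholder with the record's value."""

    def substitute(match):
        # Raises KeyError if the template names a column the record lacks.
        return str(record[match.group(1)])

    return PLACEHOLDER.sub(substitute, template)


def add_expression_column(records, name, template):
    """Return new records with a derived column computed from existing ones."""
    return [{**r, name: render(template, r)} for r in records]


rows = [{"first": "Ana", "last": "Ito"}]
out = add_expression_column(rows, "full_name", "{{ first }} {{ last }}")
```

The derived value depends only on columns already present in the same record, so no model call is needed.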
Generate text using an LLM with prompts that reference other columns:
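To show how a prompt that references other columns turns into one model request per record, here is a minimal sketch with a stub in place of the model. It assumes simple `{{ column }}` placeholders. The function names and the `echo_llm` stub are hypothetical and do not reflect how NDD schedules or batches requests.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_columns(prompt):
    """Column names that a prompt template refers to, in first-use order."""
    seen = []
    for name in PLACEHOLDER.findall(prompt):
        if name not in seen:
            seen.append(name)
    return seen


def generate_text_column(records, name, prompt, llm):
    """Fill column `name` by sending one rendered prompt per record to `llm`."""
    # Fail early if the prompt depends on a column that has not been generated.
    missing = [c for c in referenced_columns(prompt) if any(c not in r for r in records)]
    if missing:
        raise KeyError(f"prompt references missing columns: {missing}")
    out = []
    for r in records:
        rendered = PLACEHOLDER.sub(lambda m: str(r[m.group(1)]), prompt)
        out.append({**r, name: llm(rendered)})
    return out


def echo_llm(prompt):
    """Stub model: returns the prompt itself instead of calling a real LLM."""
    return f"[generated from] {prompt}"


rows = [{"symptom": "cough", "age": 40}]
notes = generate_text_column(
    rows, "note", "Write a note for a {{ age }}-year-old with {{ symptom }}.", echo_llm
)
```

Because the prompt reads `age` and `symptom`, those columns have to exist before the text column can be generated, which is why the sketch checks for them first.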
This example generates synthetic medical notes from seed symptom data using a local InferenceServer:
To use NVIDIA NIM or another hosted endpoint instead of a local server, configure the ModelProvider with the remote URL and API key:
The Nemotron-CC synthetic data stages have NDD-backed equivalents that replace the AsyncOpenAIClient with NDD execution. These stages accept the same input_field, output_field, and prompt parameters, but route generation through DataDesignerStage internally.
These stages inherit from NDDBaseSyntheticStage, which auto-builds an NDD config from the prompt fields. You configure the LLM through model_configs and model_providers instead of an AsyncOpenAIClient:
Instead of building configs in Python, you can define the entire NDD configuration in a YAML file and pass it to DataDesignerStage:
This is useful for reproducible pipelines where the generation config is versioned alongside data artifacts.