NeMo Data Designer Integration

NeMo Data Designer (NDD) is a declarative data generation framework that integrates with NeMo Curator to scale synthetic data pipelines. Instead of writing imperative LLM call logic, you define a configuration that describes what columns to generate, how to sample structured fields, and which LLM to use. NDD handles execution, batching, and token metric collection automatically.

How It Works

NeMo Curator wraps NDD through the DataDesignerStage, which accepts a DataDesignerConfigBuilder or a YAML config file. The stage:

  1. Takes input records from a DocumentBatch
  2. Passes them to NDD as a seed dataset
  3. Calls DataDesigner.preview() to generate new columns (samplers, expressions, LLM text)
  4. Returns the enriched dataset as a new DocumentBatch with token usage metrics
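
The four steps above can be sketched in miniature. The classes below are simplified stand-ins for illustration only — the real `DocumentBatch`, `DataDesignerStage`, and `DataDesigner.preview()` live in `nemo_curator` and `data_designer` and have richer interfaces:

```python
from dataclasses import dataclass


@dataclass
class DocumentBatch:
    """Stand-in for NeMo Curator's DocumentBatch: a list of record dicts."""
    records: list


class FakeDataDesigner:
    """Stand-in mimicking DataDesigner.preview(): enrich seed records with new columns."""

    def preview(self, seed_records):
        # The real NDD would run samplers, expressions, and LLM calls here.
        return [{**r, "physician_notes": f"Notes for {r['diagnosis']}"} for r in seed_records]


def run_stage(batch: DocumentBatch) -> tuple[DocumentBatch, dict]:
    designer = FakeDataDesigner()
    enriched = designer.preview(batch.records)  # steps 2-3: seed dataset in, new columns out
    metrics = {
        "num_input_records": len(batch.records),
        "num_output_records": len(enriched),
    }
    return DocumentBatch(records=enriched), metrics  # step 4: enriched batch + metrics


out, metrics = run_stage(DocumentBatch(records=[{"diagnosis": "flu"}]))
print(metrics["num_output_records"])  # → 1
```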

Prerequisites

Install the NDD dependency:

$ uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[text_cuda12]"

The data-designer package is included in the text extras. For local model serving, also install:

$ uv pip install "nemo-curator[inference_server]"

DataDesignerStage

The DataDesignerStage is the core integration point between NeMo Curator and NDD.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `config_builder` | `DataDesignerConfigBuilder` | `None` | NDD configuration builder. Mutually exclusive with `data_designer_config_file`. |
| `data_designer_config_file` | `str` | `None` | Path to a YAML config file. Mutually exclusive with `config_builder`. |
| `model_providers` | `list` | `None` | Custom `ModelProvider` instances for local or test endpoints. If `None`, NDD uses its default providers. |
| `verbose` | `bool` | `False` | When `True`, show full NDD log output. |

Metrics

DataDesignerStage automatically collects and reports:

  • `ndd_running_time`: Wall-clock time for the NDD `preview()` call
  • `num_input_records` / `num_output_records`: Record counts before and after generation
  • `input_tokens_median_per_record` / `output_tokens_median_per_record`: Median token counts across all LLM columns
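
The per-record medians are straightforward to reproduce from raw counts. A minimal sketch — the token counts here are made-up numbers for illustration, not values NDD produces:

```python
import statistics

# Hypothetical per-record token counts aggregated across all LLM columns.
input_tokens_per_record = [120, 95, 110, 240, 101]
output_tokens_per_record = [512, 480, 530, 650, 495]

input_tokens_median_per_record = statistics.median(input_tokens_per_record)
output_tokens_median_per_record = statistics.median(output_tokens_per_record)

print(input_tokens_median_per_record, output_tokens_median_per_record)  # → 110 512
```

Medians are reported rather than means so that a few unusually long documents don't dominate the metric.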

Building a Configuration

NDD configurations use a builder pattern with three column types: samplers, expressions, and LLM text. For full documentation on building NDD configurations, see the NDD config builder reference.

Sampler Columns

Generate structured data using built-in samplers (Faker names, UUIDs, dates):

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder(model_configs=[model_config])

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_name",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_id",
        sampler_type=dd.SamplerType.UUID,
        params=dd.UUIDSamplerParams(prefix="PT-", short_form=True, uppercase=True),
    )
)
```

Expression Columns

Derive values from other columns using Jinja templates:

```python
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="first_name",
        expr="{{ patient_name.first_name }}",
    )
)
```
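
Expressions like `{{ patient_name.first_name }}` are Jinja templates evaluated against each row. A rough pure-Python illustration of the nested-attribute lookup (not NDD's actual template engine, which supports the full Jinja feature set):

```python
import re


def render(expr: str, row: dict) -> str:
    """Resolve '{{ a.b }}'-style references against a nested dict row."""

    def lookup(match: re.Match) -> str:
        value = row
        for part in match.group(1).strip().split("."):
            value = value[part]  # walk one level deeper per dotted segment
        return str(value)

    return re.sub(r"\{\{(.*?)\}\}", lookup, expr)


row = {"patient_name": {"first_name": "Ada", "last_name": "Lovelace"}}
print(render("{{ patient_name.first_name }}", row))  # → Ada
```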

LLM Text Columns

Generate text using an LLM with prompts that reference other columns:

```python
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }}.
{{ patient_summary }}
Write careful notes about your visit. Respond with only the notes.
""",
        model_alias="local-llm",
    )
)
```

End-to-End Example

This example generates synthetic medical notes from seed symptom data using a local InferenceServer:

```python
import data_designer.config as dd

from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage
from nemo_curator.stages.text.io.reader.jsonl import JsonlReader
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

# Start Ray cluster
client = RayClient(num_cpus=16, num_gpus=4)
client.start()

# Start local inference server
server_config = InferenceModelConfig(
    model_identifier="google/gemma-3-27b-it",
    deployment_config={"autoscaling_config": {"min_replicas": 1, "max_replicas": 1}},
    engine_kwargs={"tensor_parallel_size": 4},
)
inference_server = InferenceServer(models=[server_config])
inference_server.start()

# Configure NDD model
model_config = dd.ModelConfig(
    alias="local-llm",
    model="google/gemma-3-27b-it",
    provider="local",
    skip_health_check=True,
    inference_parameters=dd.ChatCompletionInferenceParams(
        temperature=1.0, top_p=1.0, max_tokens=2048,
    ),
)

model_provider = dd.ModelProvider(
    name="local",
    endpoint=inference_server.endpoint,
    api_key="unused",
)

# Build config with sampler and LLM columns
config_builder = dd.DataDesignerConfigBuilder(model_configs=[model_config])

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_name",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        prompt="You are a physician. Write notes for {{ patient_name.first_name }} "
        "who has {{ diagnosis }}. {{ patient_summary }}",
        model_alias="local-llm",
    )
)

# Build and run pipeline
pipeline = Pipeline(name="ndd_medical_notes")
pipeline.add_stage(JsonlReader(file_paths="seed_data/*.jsonl", fields=["diagnosis", "patient_summary"]))
pipeline.add_stage(DataDesignerStage(config_builder=config_builder, model_providers=[model_provider]))
pipeline.add_stage(JsonlWriter(path="./synthetic_output"))

pipeline.run(executor=RayDataExecutor())

inference_server.stop()
client.stop()
```

Using a Remote Provider

To use NVIDIA NIM or another hosted endpoint instead of a local server, configure the ModelProvider with the remote URL and API key:

```python
import os

import data_designer.config as dd

from nemo_curator.stages.synthetic.nemo_data_designer.data_designer import DataDesignerStage

model_config = dd.ModelConfig(
    alias="nim-llm",
    model="meta/llama-3.3-70b-instruct",
    provider="nvidia",
    inference_parameters=dd.ChatCompletionInferenceParams(
        temperature=0.5, top_p=0.9, max_tokens=1600,
    ),
)

model_provider = dd.ModelProvider(
    name="nvidia",
    endpoint="https://integrate.api.nvidia.com/v1",
    provider_type="openai",
    api_key=os.environ["NVIDIA_API_KEY"],
)

config_builder = dd.DataDesignerConfigBuilder(model_configs=[model_config])
# Add columns as needed...

stage = DataDesignerStage(
    config_builder=config_builder,
    model_providers=[model_provider],
)
```

NDD-Backed Nemotron-CC Stages

The Nemotron-CC synthetic data stages have NDD-backed equivalents that replace the AsyncOpenAIClient with NDD execution. These stages accept the same input_field, output_field, and prompt parameters, but route generation through DataDesignerStage internally.

| Stage | Import Path | Output Field |
| --- | --- | --- |
| `WikipediaParaphrasingStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `rephrased` |
| `DiverseQAStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `diverse_qa` |
| `DistillStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `distill` |
| `ExtractKnowledgeStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `extract_knowledge` |
| `KnowledgeListStage` | `nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc` | `knowledge_list` |

These stages inherit from NDDBaseSyntheticStage, which auto-builds an NDD config from the prompt fields. You configure the LLM through model_configs and model_providers instead of an AsyncOpenAIClient:

```python
import os

import data_designer.config as dd

from nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc import DiverseQAStage

model_config = dd.ModelConfig(
    alias="meta/llama-3.3-70b-instruct",
    model="meta/llama-3.3-70b-instruct",
    provider="nvidia",
    inference_parameters=dd.ChatCompletionInferenceParams(
        temperature=0.5, top_p=0.9, max_tokens=1600,
    ),
)

model_provider = dd.ModelProvider(
    name="nvidia",
    endpoint="https://integrate.api.nvidia.com/v1",
    provider_type="openai",
    api_key=os.environ["NVIDIA_API_KEY"],
)

stage = DiverseQAStage(
    input_field="text",
    output_field="diverse_qa",
    model_alias="meta/llama-3.3-70b-instruct",
    model_configs=[model_config],
    model_providers=[model_provider],
)
```

YAML Configuration

Instead of building configs in Python, you can define the entire NDD configuration in a YAML file and pass it to DataDesignerStage:

```python
stage = DataDesignerStage(data_designer_config_file="config.yaml")
```

This is useful for reproducible pipelines where the generation config is versioned alongside data artifacts.
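
Such a file mirrors the builder calls made in Python. The sketch below is hypothetical — the keys, column types, and field names are illustrative only, so check the NDD config builder reference for the exact YAML schema:

```yaml
# Hypothetical config.yaml sketch; field names are illustrative.
model_configs:
  - alias: nim-llm
    model: meta/llama-3.3-70b-instruct
    provider: nvidia
columns:
  - name: patient_name
    type: sampler
    sampler_type: person_from_faker
  - name: physician_notes
    type: llm-text
    prompt: "Write notes for {{ patient_name.first_name }}."
    model_alias: nim-llm
```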


Next Steps