About Designing Synthetic Data From Scratch or Seeds#

Important

NVIDIA NeMo Data Designer is available in early access with limited support, and its APIs may change in future releases.

NeMo Data Designer is purpose-built for AI developers who need to design high-quality, domain-specific synthetic data at scale. Prompting a general-purpose LLM directly often struggles to deliver consistent, reliable results; Data Designer instead lets you start from scratch or from your own seed datasets and control how each column is generated.

Getting started with Data Designer requires the following:

  1. Data Designer deployed on your laptop or compute instance.

  2. The NeMo Microservices SDK installed with the data-designer extra option (pip install nemo-microservices[data-designer]).

  3. Connectivity to models that are available via API or deployed in the same environment as Data Designer.
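As a quick check of the second requirement, the snippet below looks up whether the SDK distribution is installed in the current Python environment. It is a minimal sketch that uses only the standard library; the distribution name `nemo-microservices` is taken from the pip command above, and the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name taken from the pip command in the requirements list.
sdk_version = installed_version("nemo-microservices")
if sdk_version is None:
    print("SDK not found; run: pip install 'nemo-microservices[data-designer]'")
else:
    print(f"nemo-microservices {sdk_version} is installed")
```

This confirms only that the base package is present. It does not verify that the data-designer extra was selected or that a Data Designer deployment is reachable.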

Note

If you already have a dataset and want to remove PII from it or use differential privacy to create a synthetic version of your dataset, refer to About Generating Private Synthetic Data. The private synthetic data service provides enhanced security features for sensitive datasets.

Data Designer Architecture

Synthetic Data Generation Workflow#

Once you have access to a deployment of the NeMo Data Designer microservice, the synthetic data generation workflow consists of the following steps:

  1. Configure the models you want to use for synthetic data generation (SDG).

  2. Configure the seed datasets and columns you want to use to diversify your dataset.

  3. Configure your LLM-generated columns with prompts and structured outputs.

  4. Preview your dataset and iterate on your configuration.

  5. Generate data at scale.

  6. Evaluate the quality of your data.
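To illustrate how seed columns (step 2) can diversify the prompts behind LLM-generated columns (step 3), and why a small preview (step 4) is useful before a full run, here is a minimal standard-library sketch. It does not use the Data Designer SDK or its configuration schema: the column names, values, prompt template, and the `build_prompts` helper are all hypothetical, and no model is called.

```python
import random

# Hypothetical seed columns; in practice these would come from your own
# seed dataset or sampling configuration.
SEED_COLUMNS = {
    "topic": ["billing", "shipping", "returns"],
    "tone": ["polite", "frustrated"],
}

# Hypothetical prompt template that references the seed columns by name.
PROMPT_TEMPLATE = "Write a {tone} customer message about {topic}."


def build_prompts(num_records, seed=0):
    """Sample one value per seed column and render a prompt for each record."""
    rng = random.Random(seed)  # fixed seed keeps the preview reproducible
    records = []
    for _ in range(num_records):
        row = {name: rng.choice(values) for name, values in SEED_COLUMNS.items()}
        row["prompt"] = PROMPT_TEMPLATE.format(**row)
        records.append(row)
    return records


# A small preview to inspect before generating at scale.
for record in build_prompts(3):
    print(record["prompt"])
```

Inspecting a handful of rendered prompts first makes it cheap to adjust seed values or prompt wording before committing to a large generation job.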


Installation Options#

Try out this early access microservice by using Docker Compose or by deploying the NeMo Microservices Helm chart.

Docker Compose

Deploy the NeMo Data Designer microservice using Docker. This is the easiest option for local testing.

Deploy NeMo Data Designer with Docker

Helm Chart

Deploy the NeMo Microservices Helm Chart, which includes NeMo Data Designer.

NeMo Data Designer Deployment Guide

Task Guides#

Follow the synthetic data generation workflow from model setup to data production.

Tip

The tutorials reference a NEMO_MICROSERVICES_BASE_URL variable whose value depends on the ingress in your cluster. If you are using the minikube demo installation, the value is http://nemo.test. Otherwise, ask your cluster administrator for the ingress values.
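One way to handle this value in your own scripts is to read it from an environment variable and fall back to the minikube demo address. The sketch below uses only the standard library; the `resolve_base_url` helper and the choice of fallback behavior are assumptions for illustration, not part of the SDK.

```python
import os
from urllib.parse import urlparse

# Value for the minikube demo installation, as stated in the tip above.
DEFAULT_BASE_URL = "http://nemo.test"


def resolve_base_url(env=None):
    """Read NEMO_MICROSERVICES_BASE_URL from env, falling back to the demo URL."""
    env = os.environ if env is None else env
    url = env.get("NEMO_MICROSERVICES_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    parsed = urlparse(url)
    # Reject values that are not absolute http(s) URLs.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid NEMO_MICROSERVICES_BASE_URL: {url!r}")
    return url


print(resolve_base_url())
```

Falling back to the demo address is convenient for local experiments. On a shared cluster you may prefer to fail when the variable is unset so that a misconfiguration is noticed early.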


Configure Models

Set up AI models for synthetic data generation. Connect to NVIDIA-hosted models, manage model aliases, and tune the default inference parameters.

Configure Models

Define Your Data Columns

Create column definitions with various data types, constraints, and LLM-generated content using prompt templates and structured outputs.

Define Your Data Columns

Seeding Generation with External Data

Seed the SDG process with existing datasets to steer the content and diversity of the generated data.

Seeding SDG with External Data

Generate Realistic Personal Details

Create synthetic person entities with demographics, personality traits, and synthetic personas for comprehensive character modeling.

Generate Realistic Persons

Generate Data

Preview generations, create synthetic datasets at scale using jobs, and manage the data production process.

Generate Data

Data Quality

Validate and evaluate your synthetic data quality using automated checks and assessment metrics.

Data Quality

References#

Explore advanced configuration management, examples, and learning resources.

Data Designer Configuration

Save, load, and manage Data Designer configurations for reproducible synthetic data workflows.

Data Designer Configuration