About Designing Synthetic Data From Scratch or Seeds

NeMo Data Designer is purpose-built for AI developers who need to design high-quality, domain-specific synthetic data at scale, in contrast to one-size-fits-all LLMs that struggle to deliver consistent, reliable results. You can start from scratch or from your own seed datasets to accelerate AI development with greater accuracy and performance.

Getting started with Data Designer requires the following:

Data Designer deployed on your laptop or compute instance.
The NeMo Microservices SDK installed with the data-designer extra (pip install "nemo-microservices[data-designer]").
Connectivity to models that are available via API or deployed in the same environment as Data Designer.

Note

If you already have a dataset and want to remove PII from it or use differential privacy to create a synthetic version of your dataset, refer to About Generating Private Synthetic Data. The private synthetic data service provides enhanced security features for sensitive datasets.

Synthetic Data Generation Workflow

Once you have access to a deployment of the NeMo Data Designer microservice, the synthetic data generation workflow consists of the following steps:

Configure the models you want to use for synthetic data generation (SDG).
Configure the seed datasets and columns you want to use to diversify your dataset.
Configure your LLM-generated columns with prompts and structured outputs.
Preview your dataset and iterate on your configuration.
Generate data at scale.
Evaluate the quality of your data.
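
The steps above can be pictured with a small, library-agnostic sketch. It does not use the NeMo Data Designer SDK: every name in it (`SEED_ROWS`, `fake_llm`, `build_row`, `generate`, `evaluate`) is a hypothetical stand-in, and the "model" is a function that only echoes its prompt. It is meant to show how the six steps relate to each other, not how the service implements them.

```python
import random
from typing import Callable

# Hypothetical stand-ins for illustration only; none of these names
# come from the NeMo Data Designer SDK.
SEED_ROWS = [
    {"topic": "billing"},
    {"topic": "shipping"},
    {"topic": "returns"},
]


def fake_llm(prompt: str) -> str:
    """Step 1 stand-in: a 'model' that simply echoes the prompt it receives."""
    return f"[generated from: {prompt}]"


def build_row(rng: random.Random, model: Callable[[str], str]) -> dict:
    # Step 2: draw a seed row and a sampled column to diversify the record.
    row = dict(rng.choice(SEED_ROWS))
    row["tone"] = rng.choice(["formal", "casual"])
    # Step 3: fill an LLM-generated column from a prompt template.
    prompt = "Write a {tone} customer question about {topic}.".format(**row)
    row["question"] = model(prompt)
    return row


def generate(num_records: int, seed: int = 0) -> list:
    # A fixed seed makes the run repeatable while you iterate.
    rng = random.Random(seed)
    return [build_row(rng, fake_llm) for _ in range(num_records)]


def evaluate(rows: list) -> dict:
    # Step 6: toy quality checks (record count, empty outputs, topic coverage).
    return {
        "num_records": len(rows),
        "empty_questions": sum(1 for r in rows if not r["question"].strip()),
        "distinct_topics": len({r["topic"] for r in rows}),
    }


preview = generate(3)      # Step 4: a small preview to iterate on
dataset = generate(100)    # Step 5: the same configuration at a larger size
report = evaluate(dataset)
```

Because the preview and the full run share one configuration and seed in this sketch, the preview rows are simply the first rows of the larger dataset. Whether the real service behaves that way is not stated here, so treat it as a property of the sketch only.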

Installation Options

Deploy the NeMo Data Designer microservice using Docker Compose or the NeMo Microservices Helm chart.

Docker Compose

Deploy the NeMo Data Designer microservice using Docker. This is the easiest option for local testing.

standalone

Deploy NeMo Data Designer with Docker

Helm Chart

Deploy the NeMo Microservices Helm Chart, which includes NeMo Data Designer.

helm-chart

NeMo Data Designer Deployment Guide

Task Guides

Follow the synthetic data generation workflow from model setup to data production.

Tip

The tutorials reference a NEMO_MICROSERVICES_BASE_URL whose value depends on the ingress in your cluster. If you are using the minikube demo installation, the value is http://nemo.test. Otherwise, ask your cluster administrator for the ingress values.
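
One way to keep that value out of your notebooks is to resolve it once at startup. The sketch below assumes NEMO_MICROSERVICES_BASE_URL is supplied as an environment variable, which the tip does not state, and falls back to the minikube demo value. `resolve_base_url` is a hypothetical helper, not part of the NeMo Microservices SDK.

```python
import os
from urllib.parse import urlparse

# Minikube demo value mentioned in the tip; other clusters will differ.
DEFAULT_BASE_URL = "http://nemo.test"


def resolve_base_url(env=None) -> str:
    """Return the base URL from the environment, or the demo default.

    Assumption: the value is provided via an environment variable named
    NEMO_MICROSERVICES_BASE_URL. `env` can be any mapping, which makes
    the helper easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    url = env.get("NEMO_MICROSERVICES_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    parsed = urlparse(url)
    # Reject values without a scheme or host, such as a bare "nemo.test".
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a usable base URL: {url!r}")
    return url
```

Calling `resolve_base_url()` with no argument reads the process environment. Passing a plain dictionary lets you check a candidate value before you hand it to a client.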

Configure Models

Set up AI models for synthetic data generation. Connect to NVIDIA-hosted models, manage model aliases, and tune the default inference parameters.

model-setup aliases parameters

Configure Models

Define Your Data Columns

Create column definitions with various data types, constraints, and LLM-generated content using prompt templates and structured outputs.

column-types prompts constraints

Define Your Data Columns

Seeding Generation with External Data

Seed the SDG process with existing datasets to steer the content and diversity of the generated data.

seed-data sampling referencing

Seeding SDG with External Data

Generate Realistic Personal Details

Create synthetic person entities with demographics, personality traits, and synthetic personas for comprehensive character modeling.

person-data demographics personas

Generate Realistic Personal Details

Generate Data

Create synthetic datasets at scale using jobs, preview generations, and manage the data production process.

jobs preview production

Generate Data

Data Quality

Validate and evaluate your synthetic data quality using automated checks and assessment metrics.

validation evaluation quality-metrics

Data Quality

References

Explore advanced configuration management, examples, and learning resources.

Data Designer Configuration

Save, load, and manage Data Designer configurations for reproducible synthetic data workflows.

save-configs load-configs examples

Data Designer Configuration
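
To make the idea of reproducible configurations concrete, here is a minimal sketch that saves a configuration to JSON and loads it back. The dictionary layout and the `save_config` and `load_config` helpers are hypothetical. The actual Data Designer configuration format and its save and load calls are described in the guide linked above and may look quite different.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical configuration layout, for illustration only.
example_config = {
    "model_alias": "my-text-model",
    "columns": [
        {"name": "topic", "type": "seed"},
        {"name": "question", "type": "llm", "prompt": "Ask about {topic}."},
    ],
}


def save_config(config: dict, path: Path) -> None:
    # Sorted keys and fixed indentation keep diffs small under version control.
    path.write_text(json.dumps(config, indent=2, sort_keys=True), encoding="utf-8")


def load_config(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


# Round-trip the configuration through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "data_designer_config.json"
    save_config(example_config, config_path)
    reloaded_config = load_config(config_path)
```

Storing the configuration as a plain file means the same definition can be reviewed, versioned, and reused across preview and full generation runs.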