About Generating Synthetic Data

Important

NVIDIA NeMo Data Designer is an early access release. It has limited support, and its API may change in future releases.

The Data Designer Early Access release is available only via Docker Compose and is not yet part of the NeMo Microservices Platform Helm Chart for Kubernetes deployment.

NeMo Data Designer is purpose-built for AI developers who need to design high-quality, domain-specific synthetic data at scale, unlike one-size-fits-all LLMs that struggle to deliver consistent, reliable results. You can start from scratch or from your own seed datasets to accelerate AI development with greater accuracy and performance.

Getting started with Data Designer requires the following:

  1. Deploy Data Designer on your laptop or compute instance.

  2. Install the NeMo Microservices SDK with the data-designer extra option.

  3. Connect to models that are available via API or deployed in the same environment as Data Designer.

  4. Start generating synthetic data.
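
As a quick sanity check for step 2, the sketch below tests whether the SDK can be imported in the current Python environment. The import name nemo_microservices is an assumption based on the SDK's name and is not confirmed by this page; adjust it if your installation differs.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import name for the NeMo Microservices SDK (not confirmed by this page).
SDK_MODULE = "nemo_microservices"

if __name__ == "__main__":
    if is_module_available(SDK_MODULE):
        print(f"{SDK_MODULE} is installed.")
    else:
        print(f"{SDK_MODULE} not found; install the SDK with the data-designer extra.")
```

The check only confirms that the package is importable. It does not verify that the data-designer extra was installed or that a Data Designer deployment is reachable.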

Figure: Data Designer Architecture

Synthetic Data Generation Workflow

Once you have access to a deployment of the NeMo Data Designer microservice, the synthetic data generation workflow consists of the following steps:

  1. Configure the models you want to use for synthetic data generation (SDG).

  2. Configure the seed datasets and columns you want to use to diversify your dataset.

  3. Configure your LLM-generated columns with prompts and structured outputs.

  4. Preview your dataset and iterate on your configuration.

  5. Generate data at scale.

  6. Evaluate the quality of your data.
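
The six steps above can be sketched in plain Python to show how they fit together. This is a conceptual illustration only, not the Data Designer SDK: WorkflowConfig, generate, and stub_llm are hypothetical names, and the stub stands in for a real model call.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkflowConfig:
    """Hypothetical container mirroring workflow steps 1 to 3."""
    model_alias: str                       # step 1: which model to use
    seed_values: dict[str, list[str]]      # step 2: columns that add diversity
    prompts: dict[str, str] = field(default_factory=dict)  # step 3: LLM columns


def generate(config: WorkflowConfig, num_records: int,
             llm: Callable[[str, str], str], rng_seed: int = 0) -> list[dict]:
    """Build rows by sampling seed columns, then filling LLM columns."""
    rng = random.Random(rng_seed)
    rows = []
    for _ in range(num_records):
        row = {name: rng.choice(values) for name, values in config.seed_values.items()}
        for column, template in config.prompts.items():
            # Each prompt may reference columns that already exist in the row.
            row[column] = llm(config.model_alias, template.format(**row))
        rows.append(row)
    return rows


def stub_llm(model_alias: str, prompt: str) -> str:
    """Stand-in for a model call; it simply echoes the prompt."""
    return f"[{model_alias}] {prompt}"


config = WorkflowConfig(
    model_alias="demo-model",
    seed_values={"product": ["laptop", "kettle"], "tone": ["positive", "critical"]},
    prompts={"review": "Write a short {tone} review of a {product}."},
)

preview = generate(config, num_records=3, llm=stub_llm)    # step 4: preview
dataset = generate(config, num_records=100, llm=stub_llm)  # step 5: scale up
# Step 6: a trivial quality signal, the share of rows with a non-empty review.
non_empty_share = sum(1 for row in dataset if row["review"]) / len(dataset)
```

Previewing with a handful of records before scaling up keeps iteration on prompts and seed columns cheap, which is the purpose of step 4.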


Installation Options

Try out this early access microservice using Docker Compose.

Docker Compose

Deploy the NeMo Data Designer microservice using Docker Compose. This is the easiest option for local testing.

Deploy NeMo Data Designer Using Docker Compose

Task Guides

Follow the synthetic data generation workflow from model setup to data production.

Configure Models

Set up AI models for synthetic data generation. Connect to NVIDIA-hosted models, manage model aliases, and tune the default inference parameters.
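
To illustrate what a model alias with tunable default inference parameters might look like, here is a minimal sketch. The class names, fields, default values, and the placeholder model identifier are all hypothetical; they are not the Data Designer API.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class InferenceParams:
    """Hypothetical default inference parameters."""
    temperature: float = 0.7
    top_p: float = 0.9
    max_tokens: int = 512


@dataclass(frozen=True)
class ModelEntry:
    """Maps a short alias to a model identifier plus its default parameters."""
    alias: str
    model_id: str
    params: InferenceParams = field(default_factory=InferenceParams)


def with_overrides(entry: ModelEntry, **overrides) -> ModelEntry:
    """Return a copy of the entry with some inference parameters changed."""
    return replace(entry, params=replace(entry.params, **overrides))


# "example/model-name" is a placeholder, not a real model identifier.
base = ModelEntry(alias="text", model_id="example/model-name")
precise = with_overrides(base, temperature=0.2)
```

Referring to models by alias means column definitions do not need to change when the underlying model or its parameters are swapped.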

Define Your Data Columns

Create column definitions with various data types, constraints, and LLM-generated content using prompt templates and structured outputs.
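
The following sketch shows the two ideas in this guide: a prompt template that references other columns, and a structured output that is validated before use. It uses Python's str.format as a stand-in for the service's template syntax, which may differ, and the helper names render_prompt and parse_structured are hypothetical.

```python
import json


def render_prompt(template: str, row: dict) -> str:
    """Fill {column} placeholders in the template from an existing row."""
    return template.format(**row)


def parse_structured(raw: str, required: dict) -> dict:
    """Parse a JSON object and check that required fields have the expected types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in required.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return data


row = {"product": "headphones", "rating": 4}
prompt = render_prompt("Write a {rating}-star review of {product}.", row)

# A model response would normally arrive here; this string is a hand-written example.
parsed = parse_structured('{"title": "Great sound", "score": 4}',
                          {"title": str, "score": int})
```

Validating structured outputs at generation time makes it possible to reject or retry malformed records instead of discovering them later in the dataset.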

Seeding Generation with External Data

Seed the SDG process with existing datasets to steer the content and diversity of the generated data.
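
As a rough illustration of seeding, the sketch below samples rows from a small seed table so that each generated record starts from real values. The CSV content and the function names load_seed_rows and sample_seeds are hypothetical, and sampling with replacement is only one possible strategy.

```python
import csv
import io
import random


def load_seed_rows(csv_text: str) -> list[dict]:
    """Read seed rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def sample_seeds(rows: list[dict], num_records: int, rng_seed: int = 0) -> list[dict]:
    """Sample seed rows with replacement, one per record to generate."""
    rng = random.Random(rng_seed)
    return [dict(rng.choice(rows)) for _ in range(num_records)]


SEED_CSV = """topic,audience
billing,new customers
shipping,returning customers
returns,business accounts
"""

seed_rows = load_seed_rows(SEED_CSV)
seeded = sample_seeds(seed_rows, num_records=5)
```

Each sampled row can then be passed into prompt templates, so the generated text stays anchored to the topics and audiences present in the seed data.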

Seeding SDG with External Data

Generate Realistic Personal Details

Create synthetic person entities with demographics, personality traits, and synthetic personas for comprehensive character modeling.

Generate Realistic Persons

Generate Data

Create synthetic datasets at scale using jobs, preview generations, and manage the data production process.

Data Quality

Validate and evaluate your synthetic data quality using automated checks and assessment metrics.
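
Two simple automated checks, the share of missing values and the share of exact duplicates in a text column, can be sketched as follows. The function name quality_report and the chosen metrics are illustrative assumptions, not the checks that Data Designer itself runs.

```python
def quality_report(rows: list[dict], text_column: str) -> dict:
    """Compute missing and exact-duplicate rates for one text column."""
    total = len(rows)
    if total == 0:
        return {"rows": 0, "missing_rate": 0.0, "duplicate_rate": 0.0}
    # Treat absent, None, and whitespace-only values as missing.
    values = [(row.get(text_column) or "").strip() for row in rows]
    non_empty = [value for value in values if value]
    missing = total - len(non_empty)
    # Count every repeat beyond the first occurrence as a duplicate.
    duplicates = len(non_empty) - len(set(non_empty))
    return {
        "rows": total,
        "missing_rate": missing / total,
        "duplicate_rate": duplicates / total,
    }


sample = [
    {"review": "good"},
    {"review": "good"},
    {"review": ""},
    {"review": "bad"},
]
report = quality_report(sample, "review")
```

Cheap checks like these are useful as a first filter before running more expensive evaluation, such as model-based scoring.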


References

Explore advanced configuration management, examples, and learning resources.

Data Designer Configuration

Save, load, and manage Data Designer configurations for reproducible synthetic data workflows.
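
The idea behind reproducible workflows is that a configuration survives a save and load round trip unchanged. The sketch below shows this with plain JSON files; the configuration layout and the helper names save_config and load_config are hypothetical and do not reflect the actual Data Designer configuration format.

```python
import json
import tempfile
from pathlib import Path


def save_config(config: dict, path: Path) -> None:
    """Write the configuration as stable, human-readable JSON."""
    path.write_text(json.dumps(config, indent=2, sort_keys=True), encoding="utf-8")


def load_config(path: Path) -> dict:
    """Read a configuration previously written by save_config."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical configuration layout, for illustration only.
example_config = {
    "model_alias": "text",
    "columns": [
        {"name": "topic", "type": "seed"},
        {"name": "question", "type": "llm", "prompt": "Ask about {topic}."},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    save_config(example_config, config_path)
    reloaded = load_config(config_path)
```

Sorting keys on save keeps the file stable across runs, so configuration changes show up as small, readable diffs in version control.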
