Synthetic Data Generation
NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator’s built-in Inference Server (Ray Serve + vLLM), or other inference providers.
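Because every backend speaks the same OpenAI-compatible protocol, a single request helper works against NVIDIA NIM, a local Ray Serve + vLLM deployment, or any other provider. The sketch below uses only the Python standard library; the model name and endpoint are illustrative assumptions, not values mandated by NeMo Curator.

```python
import json
import os
import urllib.request

# Illustrative defaults: any OpenAI-compatible endpoint and chat model work.
DEFAULT_BASE_URL = "https://integrate.api.nvidia.com/v1"
DEFAULT_MODEL = "meta/llama-3.1-8b-instruct"


def build_chat_request(prompt: str, model: str = DEFAULT_MODEL, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """POST the request and return the generated text.

    Point base_url at a local Ray Serve + vLLM server to stay on-prem;
    the payload shape is identical.
    """
    headers = {"Content-Type": "application/json"}
    api_key = os.environ.get("NVIDIA_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In practice you would use an SDK client rather than raw HTTP; the point is that swapping providers only changes `base_url` and the credential.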
Use Cases
- Data Augmentation: Expand limited datasets by generating diverse variations
- Multilingual Generation: Create Q&A pairs and text in multiple languages
- Knowledge Extraction: Convert raw text into structured knowledge formats
- Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
- Training Data Creation: Generate instruction-following data for model fine-tuning
Core Concepts
Synthetic data generation in NeMo Curator operates in three modes:
Generation Mode
Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.
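Generation mode can be sketched as a prompt template plus a parser, with no documents read at any point. The template and parser below are hypothetical illustrations of the pattern, not the actual prompts or code used by QAMultilingualSyntheticStage.

```python
# Hypothetical template in the spirit of a multilingual Q&A stage.
QA_TEMPLATE = (
    "Write one question and its answer about {topic}, in {language}.\n"
    "Format:\nQ: <question>\nA: <answer>"
)


def build_generation_prompts(topics: list[str], language: str = "German") -> list[str]:
    """Generation mode: every prompt comes from the template alone;
    no seed documents are required."""
    return [QA_TEMPLATE.format(topic=t, language=language) for t in topics]


def parse_qa(completion: str) -> dict:
    """Pull the Q:/A: lines back out of a model completion."""
    record = {"question": None, "answer": None}
    for line in completion.splitlines():
        if line.startswith("Q:"):
            record["question"] = line[2:].strip()
        elif line.startswith("A:"):
            record["answer"] = line[2:].strip()
    return record
```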
Transformation Mode
Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:
- Paraphrased text in Wikipedia style
- Diverse Q&A pairs derived from document content
- Condensed knowledge distillations
- Extracted factual content
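In transformation mode, each prompt wraps an existing document instead of starting from a template alone. The following is a minimal sketch of that pattern; the prompt wording and the character-based truncation are assumptions (the Nemotron-CC stages use their own prompts and budget length in tokens via the transformers tokenizer).

```python
# Hypothetical paraphrasing prompt, illustrating the transformation pattern.
PARAPHRASE_TEMPLATE = (
    "Rewrite the following passage as clear, encyclopedic, Wikipedia-style prose. "
    "Preserve every fact; drop navigation text and boilerplate.\n\n{passage}"
)


def build_transform_prompts(docs: list[str], max_chars: int = 4000) -> list[str]:
    """Transformation mode: each prompt carries an input document,
    truncated to a rough length budget before it reaches the LLM."""
    return [PARAPHRASE_TEMPLATE.format(passage=doc[:max_chars]) for doc in docs]
```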
Declarative Mode (NeMo Data Designer)
Define data generation pipelines declaratively using NeMo Data Designer (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.
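To make the contrast with imperative code concrete, a declarative pipeline might look roughly like the fragment below. This is an illustrative sketch only: the column kinds mirror the ones named above (samplers, expressions, LLM text columns), but the field names and structure are assumptions, not the real NDD schema; consult the Data Designer documentation for the actual format.

```yaml
# Hypothetical NDD-style config -- illustrative, not the real schema.
columns:
  - name: topic
    kind: sampler
    values: [astronomy, cooking, economics]
  - name: prompt
    kind: expression
    template: "Write a short FAQ entry about {{ topic }}."
  - name: faq_text
    kind: llm_text
    model: meta/llama-3.1-8b-instruct
    prompt_column: prompt
num_records: 100
```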
Architecture
SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages.
Prerequisites
Before using synthetic data generation, ensure you have:
- NVIDIA API Key (for cloud endpoints)
  - Obtain from NVIDIA Build
  - Set as environment variable: export NVIDIA_API_KEY="your-key"
- NeMo Curator installed with the text extras
- Local inference (optional): to serve models alongside your pipeline, refer to the Inference Server guide for setup details.

Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator core dependencies.
Available SDG Stages
Topics
- Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints
- Serve LLMs locally via Ray Serve and vLLM alongside curation pipelines
- Generate synthetic Q&A pairs across multiple languages
- Declarative data generation with structured columns and NDD-backed Nemotron-CC stages
- Advanced text transformation and knowledge extraction workflows