Synthetic Data Generation#
NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, local vLLM servers, or other inference providers.
Use Cases#
Data Augmentation: Expand limited datasets by generating diverse variations
Multilingual Generation: Create Q&A pairs and text in multiple languages
Knowledge Extraction: Convert raw text into structured knowledge formats
Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
Training Data Creation: Generate instruction-following data for model fine-tuning
Core Concepts#
Synthetic data generation in NeMo Curator operates in two primary modes:
Generation Mode#
Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.
Transformation Mode#
Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:
Paraphrased text in Wikipedia style
Diverse Q&A pairs derived from document content
Condensed knowledge distillations
Extracted factual content
Architecture#
The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:
flowchart LR
A["Input Documents<br/>(Parquet/JSONL)"] --> B["Preprocessing<br/>(Tokenization,<br/>Segmentation)"]
B --> C["LLM Generation<br/>(OpenAI-compatible)"]
C --> D["Postprocessing<br/>(Cleanup, Filtering)"]
D --> E["Output Dataset<br/>(Parquet/JSONL)"]
F["LLM Client<br/>(NVIDIA API,<br/>vLLM, TGI)"] -.->|"API Calls"| C
classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
class A,B,C,D stage
class E output
class F infra
Prerequisites#
Before using synthetic data generation, ensure you have:
NVIDIA API Key (for cloud endpoints)
Obtain from NVIDIA Build
Set as environment variable:
export NVIDIA_API_KEY="your-key"
NeMo Curator with text extras
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
Note
Nemotron-CC pipelines use the
transformerslibrary for tokenization, which is included in NeMo Curator’s core dependencies.
Available SDG Stages#
Stage |
Purpose |
Input Type |
|---|---|---|
|
Generate multilingual Q&A pairs |
Empty (generates from scratch) |
|
Rewrite text as Wikipedia-style prose |
Document text |
|
Generate diverse Q&A pairs from documents |
Document text |
|
Create condensed, information-dense paraphrases |
Document text |
|
Extract knowledge as textbook-style passages |
Document text |
|
Extract structured fact lists |
Document text |
Topics#
Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints
Generate synthetic Q&A pairs across multiple languages
Advanced text transformation and knowledge extraction workflows