Overview | NeMo Curator

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, local vLLM servers, or other inference providers.

Use Cases

Data Augmentation: Expand limited datasets by generating diverse variations
Multilingual Generation: Create Q&A pairs and text in multiple languages
Knowledge Extraction: Convert raw text into structured knowledge formats
Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
Training Data Creation: Generate instruction-following data for model fine-tuning

Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

Generation Mode

Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.

Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

Paraphrased text in Wikipedia style
Diverse Q&A pairs derived from document content
Condensed knowledge distillations
Extracted factual content

Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

Prerequisites

Before using synthetic data generation, ensure you have:

NVIDIA API Key (for cloud endpoints)
- Obtain from NVIDIA Build
- Set as environment variable: export NVIDIA_API_KEY="your-key"

NeMo Curator with text extras

$ uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]

Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator’s core dependencies.

Available SDG Stages

Stage	Purpose	Input Type
`QAMultilingualSyntheticStage`	Generate multilingual Q&A pairs	Empty (generates from scratch)
`WikipediaParaphrasingStage`	Rewrite text as Wikipedia-style prose	Document text
`DiverseQAStage`	Generate diverse Q&A pairs from documents	Document text
`DistillStage`	Create condensed, information-dense paraphrases	Document text
`ExtractKnowledgeStage`	Extract knowledge as textbook-style passages	Document text
`KnowledgeListStage`	Extract structured fact lists	Document text

Topics

LLM Client Setup

Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints configuration performance

Multilingual Q&A Generation

Generate synthetic Q&A pairs across multiple languages quickstart tutorial

Nemotron-CC Pipelines

Advanced text transformation and knowledge extraction workflows advanced paraphrasing