Overview | NeMo Curator

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator’s built-in Inference Server (Ray Serve + vLLM), or other inference providers.

Use Cases

Data Augmentation: Expand limited datasets by generating diverse variations
Multilingual Generation: Create Q&A pairs and text in multiple languages
Knowledge Extraction: Convert raw text into structured knowledge formats
Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
Training Data Creation: Generate instruction-following data for model fine-tuning

Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

Generation Mode

Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.

Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

Paraphrased text in Wikipedia style
Diverse Q&A pairs derived from document content
Condensed knowledge distillations
Extracted factual content

Declarative Mode (NeMo Data Designer)

Define data generation pipelines declaratively using NeMo Data Designer (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.

Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

Prerequisites

Before using synthetic data generation, ensure you have:

NVIDIA API Key (for cloud endpoints)
- Obtain from NVIDIA Build
- Set as environment variable: export NVIDIA_API_KEY="your-key"

NeMo Curator with text extras

$ uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]

Local inference (optional) — to serve models alongside your pipeline:
```
$ uv pip install nemo-curator[inference_server]
```
Refer to the Inference Server guide for setup details.

Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator core dependencies.

Available SDG Stages

Stage	Purpose	Input Type
`QAMultilingualSyntheticStage`	Generate multilingual Q&A pairs	Empty (generates from scratch)
`WikipediaParaphrasingStage`	Rewrite text as Wikipedia-style prose	Document text
`DiverseQAStage`	Generate diverse Q&A pairs from documents	Document text
`DistillStage`	Create condensed, information-dense paraphrases	Document text
`ExtractKnowledgeStage`	Extract knowledge as textbook-style passages	Document text
`KnowledgeListStage`	Extract structured fact lists	Document text
`DataDesignerStage`	Declarative generation via NeMo Data Designer	Seed data (any schema)

Topics

LLM Client Setup

Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints configuration performance

Inference Server

Serve LLMs locally via Ray Serve and vLLM alongside curation pipelines ray-serve local-inference

Multilingual Q&A Generation

Generate synthetic Q&A pairs across multiple languages quickstart tutorial

NeMo Data Designer

Declarative data generation with structured columns and NDD-backed Nemotron-CC stages ndd declarative

Nemotron-CC Pipelines

Advanced text transformation and knowledge extraction workflows advanced paraphrasing