> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/curator/llms.txt.
> For full documentation content, see https://docs.nvidia.com/nemo/curator/llms-full.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/curator/_mcp/server.

> Generate and augment training data using LLMs with NeMo Curator's synthetic data generation pipeline

# Synthetic Data Generation

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator's built-in [Inference Server](/curate-text/synthetic/inference-server) (Ray Serve + vLLM), or other inference providers.

## Use Cases

* **Data Augmentation**: Expand limited datasets by generating diverse variations
* **Multilingual Generation**: Create Q\&A pairs and text in multiple languages
* **Knowledge Extraction**: Convert raw text into structured knowledge formats
* **Quality Improvement**: Paraphrase low-quality text into higher-quality Wikipedia-style prose
* **Training Data Creation**: Generate instruction-following data for model fine-tuning

## Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

### Generation Mode

Create new data from scratch without requiring input documents. The `QAMultilingualSyntheticStage` demonstrates this pattern—it generates Q\&A pairs based on a prompt template without needing seed documents.

### Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

* Paraphrased text in Wikipedia style
* Diverse Q\&A pairs derived from document content
* Condensed knowledge distillations
* Extracted factual content

### Declarative Mode (NeMo Data Designer)

Define data generation pipelines declaratively using [NeMo Data Designer](/curate-text/synthetic/nemo-data-designer) (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.

## Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

```mermaid
flowchart LR
    A["Input Documents<br />(Parquet/JSONL)"] --> B["Preprocessing<br />(Tokenization,<br />Segmentation)"]
    B --> C["LLM Generation<br />(OpenAI-compatible)"]
    C --> D["Postprocessing<br />(Cleanup, Filtering)"]
    D --> E["Output Dataset<br />(Parquet/JSONL)"]
    
    F["LLM Client<br />(NVIDIA API,<br />InferenceServer,<br />vLLM, TGI)"] -.->|"API Calls"| C
    
    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
    
    class A,B,C,D stage
    class E output
    class F infra
```

## Prerequisites

Before using synthetic data generation, ensure you have:

1. **NVIDIA API Key** (for cloud endpoints)
   * Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys)
   * Set as environment variable: `export NVIDIA_API_KEY="your-key"`

2. **NeMo Curator with text extras**

   ```bash
   uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
   ```

3. **Local inference** (optional) — to serve models alongside your pipeline:

   ```bash
   uv pip install nemo-curator[inference_server]
   ```

   Refer to the [Inference Server](/curate-text/synthetic/inference-server) guide for setup details.

Nemotron-CC pipelines use the `transformers` library for tokenization, which is included in NeMo Curator core dependencies.

## Available SDG Stages

| Stage                          | Purpose                                         | Input Type                     |
| ------------------------------ | ----------------------------------------------- | ------------------------------ |
| `QAMultilingualSyntheticStage` | Generate multilingual Q\&A pairs                | Empty (generates from scratch) |
| `WikipediaParaphrasingStage`   | Rewrite text as Wikipedia-style prose           | Document text                  |
| `DiverseQAStage`               | Generate diverse Q\&A pairs from documents      | Document text                  |
| `DistillStage`                 | Create condensed, information-dense paraphrases | Document text                  |
| `ExtractKnowledgeStage`        | Extract knowledge as textbook-style passages    | Document text                  |
| `KnowledgeListStage`           | Extract structured fact lists                   | Document text                  |
| `DataDesignerStage`            | Declarative generation via NeMo Data Designer   | Seed data (any schema)         |

***

## Topics

Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints
configuration
performance

Serve LLMs locally via Ray Serve and vLLM alongside curation pipelines
ray-serve
local-inference

Generate synthetic Q\&A pairs across multiple languages
quickstart
tutorial

Declarative data generation with structured columns and NDD-backed Nemotron-CC stages
ndd
declarative

Advanced text transformation and knowledge extraction workflows
advanced
paraphrasing