For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
DocumentationAPI Reference
DocumentationAPI Reference
  • Home
    • Welcome
  • About NeMo Curator
    • Overview
    • Key Features
  • Get Started
    • Overview
    • Install (All Modalities)
    • Text Quickstart
    • Image Quickstart
    • Video Quickstart
    • Audio Quickstart
  • Curate Text
    • Overview
    • Tutorials
    • Save and Export
      • Overview
      • LLM Client Setup
      • Inference Server
      • NeMo Data Designer
      • Multilingual Q&A
  • Curate Images
    • Overview
    • Save and Export
  • Curate Video
    • Overview
    • Load Data
    • Save and Export
  • Curate Audio
    • Overview
    • Save and Export
  • Setup & Deployment
    • Overview
  • Reference
    • Overview
    • Related Tools
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoNeMo Curator
On this page
  • Use Cases
  • Core Concepts
  • Generation Mode
  • Transformation Mode
  • Declarative Mode (NeMo Data Designer)
  • Architecture
  • Prerequisites
  • Available SDG Stages
  • Topics
Curate TextSynthetic Data

Synthetic Data Generation

||View as Markdown|
Previous

Save and Export

Next

LLM Client Setup

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator’s built-in Inference Server (Ray Serve + vLLM), or other inference providers.

Use Cases

  • Data Augmentation: Expand limited datasets by generating diverse variations
  • Multilingual Generation: Create Q&A pairs and text in multiple languages
  • Knowledge Extraction: Convert raw text into structured knowledge formats
  • Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
  • Training Data Creation: Generate instruction-following data for model fine-tuning

Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

Generation Mode

Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.

Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

  • Paraphrased text in Wikipedia style
  • Diverse Q&A pairs derived from document content
  • Condensed knowledge distillations
  • Extracted factual content

Declarative Mode (NeMo Data Designer)

Define data generation pipelines declaratively using NeMo Data Designer (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.

Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

Prerequisites

Before using synthetic data generation, ensure you have:

  1. NVIDIA API Key (for cloud endpoints)

    • Obtain from NVIDIA Build
    • Set as environment variable: export NVIDIA_API_KEY="your-key"
  2. NeMo Curator with text extras

    $uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
  3. Local inference (optional) — to serve models alongside your pipeline:

    $uv pip install nemo-curator[inference_server]

    Refer to the Inference Server guide for setup details.

Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator core dependencies.

Available SDG Stages

StagePurposeInput Type
QAMultilingualSyntheticStageGenerate multilingual Q&A pairsEmpty (generates from scratch)
WikipediaParaphrasingStageRewrite text as Wikipedia-style proseDocument text
DiverseQAStageGenerate diverse Q&A pairs from documentsDocument text
DistillStageCreate condensed, information-dense paraphrasesDocument text
ExtractKnowledgeStageExtract knowledge as textbook-style passagesDocument text
KnowledgeListStageExtract structured fact listsDocument text
DataDesignerStageDeclarative generation via NeMo Data DesignerSeed data (any schema)

Topics

LLM Client Setup

Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints configuration performance

Inference Server

Serve LLMs locally via Ray Serve and vLLM alongside curation pipelines ray-serve local-inference

Multilingual Q&A Generation

Generate synthetic Q&A pairs across multiple languages quickstart tutorial

NeMo Data Designer

Declarative data generation with structured columns and NDD-backed Nemotron-CC stages ndd declarative

Nemotron-CC Pipelines

Advanced text transformation and knowledge extraction workflows advanced paraphrasing