Synthetic Data Generation

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, NeMo Curator’s built-in Inference Server (Ray Serve + vLLM), or other inference providers.
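"OpenAI-compatible" means that every one of these providers accepts the same chat-completion request shape, so only the base URL and API key change between them. The sketch below assembles such a request body; the model name is a placeholder, not a value prescribed by this guide.

```python
import json


def build_chat_request(prompt: str, model: str = "meta/llama-3.1-8b-instruct") -> dict:
    """Assemble an OpenAI-compatible /v1/chat/completions request body.

    The same payload works against NVIDIA NIM endpoints, a local
    Ray Serve + vLLM server, or any other OpenAI-compatible provider.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }


body = build_chat_request("Write one question and answer about photosynthesis.")
print(json.dumps(body, indent=2))
```
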

Use Cases

  • Data Augmentation: Expand limited datasets by generating diverse variations
  • Multilingual Generation: Create Q&A pairs and text in multiple languages
  • Knowledge Extraction: Convert raw text into structured knowledge formats
  • Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
  • Training Data Creation: Generate instruction-following data for model fine-tuning

Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

Generation Mode

Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.
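The pattern can be sketched as follows, with a stubbed LLM call standing in for a real client. The prompt template and function names here are illustrative only, not NeMo Curator's actual API.

```python
from typing import Callable

# Illustrative prompt template: generation mode starts from a template,
# not from a seed document.
QA_PROMPT = (
    "Write one question and its answer about {topic}, in {language}. "
    "Format: Question: ... Answer: ..."
)


def generate_qa(llm: Callable[[str], str], topic: str, language: str) -> str:
    """Generation mode: fill in the template and call the model. No input doc."""
    return llm(QA_PROMPT.format(topic=topic, language=language))


# Stub LLM so the sketch runs without a live endpoint.
def stub_llm(prompt: str) -> str:
    return f"[completion for: {prompt[:40]}...]"


record = generate_qa(stub_llm, topic="volcanoes", language="Spanish")
print(record)
```
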

Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

  • Paraphrased text in Wikipedia style
  • Diverse Q&A pairs derived from document content
  • Condensed knowledge distillations
  • Extracted factual content
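The transformation pattern can be sketched the same way: the stage wraps an existing document in a task-specific prompt and returns the rewritten text. Again, the names and prompt are illustrative stand-ins, not the actual stage implementation.

```python
from typing import Callable

# Illustrative paraphrasing prompt in the spirit of WikipediaParaphrasingStage.
PARAPHRASE_PROMPT = (
    "Rewrite the following text as clear, Wikipedia-style prose, "
    "preserving all facts:\n\n{document}"
)


def paraphrase_document(llm: Callable[[str], str], document: str) -> str:
    """Transformation mode: an existing document is required as input."""
    if not document.strip():
        raise ValueError("transformation mode needs a non-empty input document")
    return llm(PARAPHRASE_PROMPT.format(document=document))


# Stub LLM: echoes the document portion of the prompt so the sketch runs offline.
def stub_llm(prompt: str) -> str:
    return "[paraphrased] " + prompt.rsplit("\n\n", 1)[-1]


out = paraphrase_document(stub_llm, "teh sun is a star. it is big")
print(out)
```
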

Declarative Mode (NeMo Data Designer)

Define data generation pipelines declaratively using NeMo Data Designer (NDD). Instead of writing imperative LLM call logic, you configure structured column generation (samplers, expressions, LLM text columns) through a builder API or YAML file. NDD handles execution, batching, and token metric collection. This mode supports both standalone generation and NDD-backed versions of Nemotron-CC stages.
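To make the declarative idea concrete, the sketch below evaluates a tiny column spec (a sampler column, an expression column, and a stubbed LLM text column) over a few rows. The spec format is invented for illustration and is not NDD's actual schema.

```python
import random

# Invented, NDD-like column spec: each column is (kind, config).
# Columns are evaluated in order, so later columns can reference earlier ones.
spec = {
    "topic": ("sampler", {"values": ["volcanoes", "tides", "glaciers"]}),
    "prompt": ("expression", {"template": "Ask a question about {topic}."}),
    "question": ("llm_text", {"input": "prompt"}),
}


def stub_llm(prompt: str) -> str:
    return f"Q: what is notable here? (from: {prompt})"


def run_spec(spec: dict, n_rows: int, seed: int = 0) -> list[dict]:
    """Materialize the spec row by row, column by column."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row: dict[str, str] = {}
        for name, (kind, cfg) in spec.items():
            if kind == "sampler":
                row[name] = rng.choice(cfg["values"])
            elif kind == "expression":
                row[name] = cfg["template"].format(**row)
            elif kind == "llm_text":
                row[name] = stub_llm(row[cfg["input"]])
        rows.append(row)
    return rows


rows = run_spec(spec, n_rows=3)
```
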

Architecture

SDG pipelines process data in three phases: preprocessing, LLM generation, and postprocessing.

Prerequisites

Before using synthetic data generation, ensure you have:

  1. NVIDIA API Key (for cloud endpoints)

    • Obtain from NVIDIA Build
    • Set as environment variable: export NVIDIA_API_KEY="your-key"
  2. NeMo Curator with text extras

    uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[text_cuda12]"
  3. Local inference (optional) — to serve models alongside your pipeline:

    uv pip install "nemo-curator[inference_server]"

    Refer to the Inference Server guide for setup details.

Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator core dependencies.

Available SDG Stages

| Stage | Purpose | Input Type |
| --- | --- | --- |
| QAMultilingualSyntheticStage | Generate multilingual Q&A pairs | Empty (generates from scratch) |
| WikipediaParaphrasingStage | Rewrite text as Wikipedia-style prose | Document text |
| DiverseQAStage | Generate diverse Q&A pairs from documents | Document text |
| DistillStage | Create condensed, information-dense paraphrases | Document text |
| ExtractKnowledgeStage | Extract knowledge as textbook-style passages | Document text |
| KnowledgeListStage | Extract structured fact lists | Document text |
| DataDesignerStage | Declarative generation via NeMo Data Designer | Seed data (any schema) |

Topics