Curate TextSynthetic Data

Synthetic Data Generation

View as Markdown

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, local vLLM servers, or other inference providers.

Use Cases

  • Data Augmentation: Expand limited datasets by generating diverse variations
  • Multilingual Generation: Create Q&A pairs and text in multiple languages
  • Knowledge Extraction: Convert raw text into structured knowledge formats
  • Quality Improvement: Paraphrase low-quality text into higher-quality Wikipedia-style prose
  • Training Data Creation: Generate instruction-following data for model fine-tuning

Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

Generation Mode

Create new data from scratch without requiring input documents. The QAMultilingualSyntheticStage demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents.

Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

  • Paraphrased text in Wikipedia style
  • Diverse Q&A pairs derived from document content
  • Condensed knowledge distillations
  • Extracted factual content

Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

Prerequisites

Before using synthetic data generation, ensure you have:

  1. NVIDIA API Key (for cloud endpoints)

    • Obtain from NVIDIA Build
    • Set as environment variable: export NVIDIA_API_KEY="your-key"
  2. NeMo Curator with text extras

    $uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]

    Nemotron-CC pipelines use the transformers library for tokenization, which is included in NeMo Curator’s core dependencies.

Available SDG Stages

StagePurposeInput Type
QAMultilingualSyntheticStageGenerate multilingual Q&A pairsEmpty (generates from scratch)
WikipediaParaphrasingStageRewrite text as Wikipedia-style proseDocument text
DiverseQAStageGenerate diverse Q&A pairs from documentsDocument text
DistillStageCreate condensed, information-dense paraphrasesDocument text
ExtractKnowledgeStageExtract knowledge as textbook-style passagesDocument text
KnowledgeListStageExtract structured fact listsDocument text

Topics