Data Generation Concepts

This document covers the core concepts for synthetic data generation in NVIDIA NeMo Curator.

Synthetic Data Architecture

NVIDIA NeMo Curator provides a modular architecture for generating synthetic text data. Its components cover model service integration, prompt templates, generation pipelines, and quality assessment.

What’s Synthetic Data?

Synthetic data refers to artificially created text that mimics the properties of natural text but is generated by algorithms rather than humans. In NeMo Curator, synthetic data is typically generated using large language models (LLMs) to create high-quality training examples.

Component Model

NeMo Curator’s synthetic data generation framework follows a component-based design defined in nemo_curator.synthetic with several core implementations:

| Component | Implementation | Responsibilities |
| --- | --- | --- |
| SyntheticDataGenerator | Abstract base class in generator.py | • Define the interface for text generation • Establish common generation parameters • Provide base functionality for all generators |
| NemotronGenerator | Implementation in nemotron.py | • Connect to Nemotron model endpoints • Handle prompt formatting and validation • Process generation outputs |
| AsyncNemotronGenerator | Implementation in async_nemotron.py | • Provide asynchronous generation capabilities • Enable parallel request handling • Optimize throughput for batch generation |
| Prompt Templates | Constants and templates in prompts.py | • Provide predefined prompt templates for different tasks • Support prompt parameterization • Enable consistent prompt formatting |
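
The split between an abstract interface and concrete generators can be illustrated with a minimal sketch. It does not reproduce the classes in nemo_curator.synthetic: `BaseGenerator`, `CallableGenerator`, and `fake_model` are hypothetical names, and the model is a stub function rather than a real endpoint.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class BaseGenerator(ABC):
    """Hypothetical stand-in for an abstract generator interface."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Send a prompt to a model and return the raw response text."""

    @abstractmethod
    def parse_response(self, response: str) -> List[str]:
        """Turn a raw response into a list of generated items."""


class CallableGenerator(BaseGenerator):
    """Concrete generator backed by any callable that maps prompt -> text."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self.complete = complete

    def generate(self, prompt: str) -> str:
        return self.complete(prompt)

    def parse_response(self, response: str) -> List[str]:
        # One generated item per non-empty line.
        return [line.strip() for line in response.splitlines() if line.strip()]


def fake_model(prompt: str) -> str:
    # Stub standing in for a real LLM endpoint (assumption for illustration).
    return "Topic A\nTopic B\n\nTopic C"


generator = CallableGenerator(fake_model)
topics = generator.parse_response(generator.generate("List three topics."))
```

Because callers depend only on the abstract interface, a synchronous or asynchronous backend could be swapped in without changing the code that consumes the parsed items.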

Why Synthetic Data Matters

Synthetic data serves multiple important purposes in language model training:

| Purpose | Implementation | Benefits |
| --- | --- | --- |
| Data augmentation | Scale enhancement | • Expand limited datasets for better model generalization • Create variations on existing examples • Fill gaps in data distribution |
| Domain adaptation | Domain-specific generation | • Create training examples for specialized or low-resource domains • Support adaptation to technical or niche fields • Enable custom domain knowledge infusion |
| Task-specific examples | Instruction tuning datasets | • Generate instruction-following examples for fine-tuning • Create diverse task formulations • Support multi-step reasoning examples |
| Privacy protection | Synthetic alternatives | • Create training data without exposing real-world private information • Support privacy-preserving learning • Reduce dependence on sensitive data sources |
| Balance and diversity | Distribution manipulation | • Address imbalances in existing datasets • Generate examples for underrepresented categories • Create controlled demographic distributions |
| Knowledge distillation | Model-to-model transfer | • Transfer capabilities from larger to smaller models • Extract knowledge from foundation models • Enable task-specific specialization |

Synthetic Data Generation Approaches

NeMo Curator supports several approaches to synthetic data generation:

Model-Driven Generation

Using LLMs to create new examples from scratch or based on prompts:

| Technique | Implementation | Capabilities |
| --- | --- | --- |
| Question-answer pairs | nemotron.py with QA prompt templates | • Generate diverse question formats • Create factually accurate answers • Support multiple knowledge domains |
| Conversational dialogues | nemotron.py with dialogue prompt templates | • Create multi-turn conversations • Support varied dialogue styles and formats • Generate role-based interactions |
| Code examples | nemotron.py with code generation templates | • Generate programming examples in multiple languages • Create code explanations and comments • Support various programming paradigms |
| Task-specific instructions | nemotron.py with instruction templates | • Create explicit instructions for models to follow • Generate instruction-response pairs • Support complex multi-step tasks |

Content Transformation

Converting existing content into new formats:

| Technique | Implementation | Capabilities |
| --- | --- | --- |
| Paraphrasing | prompts.py paraphrasing templates | • Transform text while preserving meaning • Generate stylistic variations • Create complexity-adjusted versions |
| Format conversion | prompts.py format transformation templates | • Transform web content into educational format • Convert unstructured text to structured formats • Create different document structures from source material |
| Structured data creation | prompts.py structure extraction templates | • Extract and organize information from text • Create JSON, tables, or other structured formats • Support schema-driven data extraction |
| Style transformation | prompts.py style templates | • Generate variations with different styles or tones • Adjust formality and technical complexity • Create voice and perspective variations |
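
Template-driven transformation amounts to filling a parameterized prompt with the source text and the desired target style. The sketch below shows one way to do this with Python's standard string formatting. The template text and the names `PARAPHRASE_TEMPLATE`, `template_fields`, and `build_prompt` are illustrative assumptions, not constants taken from prompts.py.

```python
from string import Formatter
from typing import Set

# Hypothetical paraphrasing template; real templates may be worded differently.
PARAPHRASE_TEMPLATE = (
    "Rewrite the following text in a {style} style while preserving its meaning.\n\n"
    "Text: {document}\n\n"
    "Rewritten text:"
)


def template_fields(template: str) -> Set[str]:
    """Return the names of the placeholders used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def build_prompt(template: str, **params: str) -> str:
    """Fill a template, failing early if a placeholder has no value."""
    missing = template_fields(template) - set(params)
    if missing:
        raise ValueError(f"Missing template parameters: {sorted(missing)}")
    return template.format(**params)


prompt = build_prompt(
    PARAPHRASE_TEMPLATE,
    style="formal",
    document="The meeting got pushed to next week.",
)
```

Checking for missing parameters before formatting gives a clearer error than the `KeyError` that `str.format` would raise, which helps when many templates share one pipeline.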

Augmentation and Enhancement

Enhancing existing data:

| Technique | Implementation | Capabilities |
| --- | --- | --- |
| Context addition | nemotron.py with context templates | • Add context or details to sparse examples • Expand brief descriptions into comprehensive content • Enrich existing data with additional information |
| Multilingual content | nemotron.py with translation templates | • Translate content to multiple languages • Create parallel multilingual datasets • Support cross-lingual learning |
| Simplification | prompts.py simplification templates | • Simplify complex text • Create reading-level adjusted versions • Generate accessible content variations |
| Explanation generation | prompts.py explanation templates | • Create explanations for examples • Generate step-by-step reasoning • Provide educational context for content |

Key Components in Synthetic Data Generation

NeMo Curator’s synthetic data generation system consists of several key components:

LLM Service Integration

NeMo Curator implements model service integration through dedicated components:

| Component | Implementation | Capabilities |
| --- | --- | --- |
| Model client | nemotron.py, async_nemotron.py | • Connect to LLM inference endpoints using standard APIs • Handle authentication and request formatting • Manage connection pooling and retries |
| Request manager | async_nemotron_cc.py | • Format generation parameters • Handle batching and request optimization • Support streaming and non-streaming interfaces |
| Response processor | nemotron.py, nemotron_cc.py, async_nemotron_cc.py | • Parse and validate model responses • Extract generated text from API responses • Apply post-processing to raw outputs |
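
Parallel request handling with retries can be sketched with the standard asyncio library. This is an illustration of the pattern, not the library's client code: `generate_with_retry`, `generate_batch`, and `flaky_model` are hypothetical names, and the model is a stub that fails once per prompt to exercise the retry path.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Completion = Callable[[str], Awaitable[str]]


async def generate_with_retry(
    complete: Completion,
    prompt: str,
    max_retries: int = 2,
    base_delay: float = 0.01,
) -> str:
    """Call the model, retrying transient connection failures with backoff."""
    attempt = 0
    while True:
        try:
            return await complete(prompt)
        except ConnectionError:
            if attempt >= max_retries:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
            attempt += 1


async def generate_batch(
    complete: Completion, prompts: List[str], max_concurrent: int = 4
) -> List[str]:
    """Run all prompts concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt: str) -> str:
        async with semaphore:
            return await generate_with_retry(complete, prompt)

    # gather preserves the order of the input prompts.
    return list(await asyncio.gather(*(bounded(p) for p in prompts)))


call_counts: Dict[str, int] = {}


async def flaky_model(prompt: str) -> str:
    # Stub endpoint (assumption): the first call for each prompt fails.
    call_counts[prompt] = call_counts.get(prompt, 0) + 1
    if call_counts[prompt] == 1:
        raise ConnectionError("transient failure")
    return prompt.upper()


results = asyncio.run(generate_batch(flaky_model, ["a", "b", "c"]))
```

The semaphore caps the number of in-flight requests so that a large batch does not overwhelm the inference endpoint, while each request still retries independently.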

Generation Pipelines

NeMo Curator frames generation in terms of the following approaches:

| Approach | Implementation | Capabilities |
| --- | --- | --- |
| Direct generation | SyntheticDataGenerator with prompt templates | • Create content from scratch with prompts • Support various completion styles • Control generation parameters (temperature, length, etc.) |
| Content transformation | nemotron.py with transformation prompts | • Convert existing content to new formats • Apply controlled modifications to source material • Preserve key information during transformation |
| Quality-focused generation | nemotron.py with quality constraints | • Use reinforcement learning or filtering to ensure quality • Implement multi-step generation with refinement • Support human-in-the-loop feedback integration |

Quality Assessment

NeMo Curator’s synthetic data quality assessment includes:

| Component | Implementation | Capabilities |
| --- | --- | --- |
| Output filters | Integration with nemo_curator.filters | • Filter low-quality outputs • Apply quality criteria to generated content • Remove unsuitable examples automatically |
| Model-based evaluation | nemotron.py evaluation modes | • Apply reward models to assess helpfulness and safety • Calculate quality scores for generated content • Rank outputs by quality metrics |
| Diversity analysis | Statistical analysis tools | • Ensure diversity in the generated dataset • Measure and control text similarity • Prevent repetitive or redundant outputs |
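
One simple way to measure and control text similarity is Jaccard similarity over word n-grams, dropping any output that overlaps too heavily with one already kept. The sketch below is a stand-alone illustration of that idea under assumed names (`shingles`, `jaccard`, `filter_near_duplicates`) and an assumed threshold of 0.6. It is not the analysis tooling the table refers to.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 2) -> Set[Shingle]:
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(texts: List[str], threshold: float = 0.6) -> List[str]:
    """Greedily keep texts whose similarity to every kept text is below threshold."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


samples = [
    "The cat sat on the mat",
    "The cat sat on the mat today",
    "Neural networks learn from data",
]
diverse = filter_near_duplicates(samples)
```

The pairwise comparison is quadratic in the number of kept texts, so this form suits small batches. Large datasets typically need an approximate method such as MinHash instead.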

Synthetic Data Generation Workflow

The synthetic data generation process in NeMo Curator follows these stages:

| Stage | Implementation | Description |
| --- | --- | --- |
| 1. Task definition | Configuration and parameters | • Define the generation objective • Select appropriate task templates • Configure generation parameters |
| 2. Prompt preparation | Prompt template constants in prompts.py | • Select predefined prompt templates • Parameterize templates with specific requirements • Prepare instruction formats |
| 3. Generation | nemotron.py or async_nemotron.py | • Connect to model endpoints • Execute generation requests • Receive and parse responses |
| 4. Post-processing | Data transformation utilities | • Extract relevant content from responses • Apply formatting and structure • Prepare for integration with datasets |
| 5. Quality filtering | Integration with filter modules | • Apply quality checks to generated content • Filter based on relevance and coherence • Remove problematic content |
| 6. Dataset integration | Integration with DocumentDataset | • Add generated content to DocumentDataset • Combine with authentic data • Apply appropriate weighting and balancing |
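
The six stages can be traced end to end in a small sketch. Everything here is a simplification under stated assumptions: `TaskConfig`, `run_pipeline`, and `stub_model` are hypothetical names, the model is a stub function, the quality check is a bare word-count threshold, and plain dictionaries stand in for a DocumentDataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class TaskConfig:
    """Stage 1: the generation objective and its parameters."""
    template: str
    topics: List[str]
    min_words: int = 4


def run_pipeline(
    config: TaskConfig, complete: Callable[[str], str], authentic: List[Record]
) -> List[Record]:
    # Stage 2: prompt preparation.
    prompts = [config.template.format(topic=topic) for topic in config.topics]
    # Stage 3: generation.
    responses = [complete(prompt) for prompt in prompts]
    # Stage 4: post-processing (here, only whitespace cleanup).
    cleaned = [response.strip() for response in responses]
    # Stage 5: quality filtering (here, a minimum word count).
    kept = [text for text in cleaned if len(text.split()) >= config.min_words]
    # Stage 6: dataset integration, tagging each record with its origin.
    synthetic = [{"text": text, "source": "synthetic"} for text in kept]
    return authentic + synthetic


def stub_model(prompt: str) -> str:
    # Stub endpoint (assumption): returns a weak answer for one topic.
    if "gravity" in prompt:
        return "  Too short. "
    return "  Photosynthesis converts light into chemical energy.  "


config = TaskConfig(
    template="Write one sentence about {topic}.",
    topics=["photosynthesis", "gravity"],
)
authentic_records = [
    {"text": "Water boils at 100 degrees Celsius at sea level.", "source": "authentic"}
]
dataset = run_pipeline(config, stub_model, authentic_records)
```

Tagging each record with its source keeps synthetic and authentic examples distinguishable later, which is useful when weighting or evaluating them separately.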

Conceptual Framework

Synthetic data generation in NeMo Curator follows these principles:

| Principle | Implementation | Capabilities |
| --- | --- | --- |
| Quality | Quality controls in generation pipelines | • Focus on generating high-quality, useful training data • Apply multi-stage quality verification • Implement filtering for substandard outputs |
| Integration | Pipeline compatibility | • Work seamlessly with data processing pipelines • Support dataset augmentation workflows • Enable integration with filtering and classification |
| Scale | Distributed generation capabilities | • Support large-scale generation for comprehensive datasets • Enable parallel processing across resources • Optimize for generation throughput |
| Control | Parameterized generation | • Provide fine-grained control over generation parameters • Enable task-specific configuration • Support varied generation strategies |
| Evaluation | Quality assessment integration | • Include mechanisms to assess generated content quality • Track generation success metrics • Support iterative quality improvement |

Considerations for Synthetic Data

When working with synthetic data, consider:

| Consideration | Implementation | Strategies |
| --- | --- | --- |
| Quality control | Quality filters and metrics | • Generated content may vary in quality • Implement robust filtering mechanisms • Apply multiple quality criteria |
| Diversity | Diversity analysis and constraints | • Generated examples might lack diversity without careful prompting • Use techniques to ensure varied outputs • Monitor and control similarity metrics |
| Model bias | Bias detection and mitigation | • Synthetic data may inherit biases from the generating model • Apply bias detection to outputs • Implement fairness criteria in generation |
| Evaluation | Separate evaluation protocols | • Use separate evaluation metrics for synthetic vs. natural data • Benchmark against authentic datasets • Measure performance impact of synthetic data inclusion |
| Blending strategy | Mixture configurations | • Determine optimal mix of synthetic and authentic data • Experiment with different weighting strategies • Apply domain-specific mixing ratios |
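
A blending strategy can be made concrete as a target fraction of synthetic records in the final mix. The sketch below keeps every authentic record and samples just enough synthetic ones to reach that fraction. `mix_datasets` and its parameters are hypothetical, and the right ratio for a given domain has to be found experimentally.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def mix_datasets(
    authentic: List[Record],
    synthetic: List[Record],
    synthetic_fraction: float,
    seed: int = 0,
) -> List[Record]:
    """Keep all authentic records and add synthetic ones up to the target share."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve s / (a + s) = f for s, where a is the number of authentic records.
    target = round(len(authentic) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_synthetic = min(target, len(synthetic))
    mixed = list(authentic) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


authentic_pool = [{"text": f"real {i}", "source": "authentic"} for i in range(8)]
synthetic_pool = [{"text": f"generated {i}", "source": "synthetic"} for i in range(5)]
mixed = mix_datasets(authentic_pool, synthetic_pool, synthetic_fraction=0.2)
```

Fixing the seed makes a mixture reproducible, so that runs with different fractions can be compared on the same downstream benchmark.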

Implementation Model

NeMo Curator’s synthetic data generation follows a layered implementation:

| Layer | Components | Responsibilities |
| --- | --- | --- |
| Interface layer | generator.py, API definitions | • Define consistent interfaces for generation • Establish common parameter conventions • Enable component interchangeability |
| Model integration layer | nemotron.py, async_nemotron.py, mixtral.py | • Connect to specific model endpoints • Handle model-specific requirements • Optimize for particular model capabilities |
| Prompt management layer | prompts.py | • Maintain constants and templates for effective prompts • Support prompt customization and parameterization • Enable domain-specific prompt adaptation |
| Error handling layer | error.py | • Manage generation failures gracefully • Provide informative error messages • Support retry and fallback strategies |
| Format adaptation layer | no_format.py | • Handle variations in output formats • Standardize generation results • Support multiple output structures |
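
The error handling and format adaptation layers meet when a model response does not have the expected shape. The sketch below shows one possible pattern: parse a numbered or bulleted list strictly, raise a descriptive error on failure, and offer a looser fallback. `ResponseFormatError`, `parse_list`, and `parse_with_fallback` are hypothetical names and are not taken from error.py or no_format.py.

```python
import re
from typing import List


class ResponseFormatError(ValueError):
    """Raised when a model response cannot be parsed into list items."""


# Matches lines such as "1. item", "2) item", "- item", "* item", or "• item".
_ITEM_PATTERN = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def parse_list(response: str) -> List[str]:
    """Strict parser: extract items from a numbered or bulleted list."""
    items = []
    for line in response.splitlines():
        match = _ITEM_PATTERN.match(line)
        if match:
            items.append(match.group(1))
    if not items:
        preview = response[:40]
        raise ResponseFormatError(f"No list items found in response: {preview!r}")
    return items


def parse_with_fallback(response: str) -> List[str]:
    """Try the strict parser, then fall back to one item per non-empty line."""
    try:
        return parse_list(response)
    except ResponseFormatError:
        return [line.strip() for line in response.splitlines() if line.strip()]


strict_items = parse_list("1. Alpha\n2) Beta\n- Gamma")
fallback_items = parse_with_fallback("Alpha\n\nBeta")
```

Keeping the strict parser separate from the fallback lets a pipeline choose per task whether a malformed response should be salvaged, retried with a new request, or discarded.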