Data Generation Concepts

This document covers the core concepts for synthetic data generation in NVIDIA NeMo Curator.

Synthetic Data Architecture

NVIDIA NeMo Curator provides a modular architecture for generating synthetic text data. Its components, described below, cover model access, prompt templates, generation pipelines, and quality assessment.

What Is Synthetic Data?

Synthetic data refers to artificially created text that mimics the properties of natural text but is generated by algorithms rather than humans. In NeMo Curator, synthetic data is typically generated using large language models (LLMs) to create high-quality training examples.

Component Model

NeMo Curator’s synthetic data generation framework follows a component-based design defined in `nemo_curator.synthetic`, with several core implementations:

Component	Implementation	Responsibilities
`SyntheticDataGenerator`	Abstract base class in `generator.py`	• Define the interface for text generation • Establish common generation parameters • Provide base functionality for all generators
`NemotronGenerator`	Implementation in `nemotron.py`	• Connect to Nemotron model endpoints • Handle prompt formatting and validation • Process generation outputs
`AsyncNemotronGenerator`	Implementation in `async_nemotron.py`	• Provide asynchronous generation capabilities • Enable parallel request handling • Optimize throughput for batch generation
Prompt Templates	Constants and templates in `prompts.py`	• Provide predefined prompt templates for different tasks • Support prompt parameterization • Enable consistent prompt formatting
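
The relationship between an abstract generator interface and a concrete implementation can be sketched as follows. This is a minimal illustration and not the NeMo Curator source: `TextGenerator`, `EchoClient`, and `TemplateGenerator` are hypothetical names, and the stub client stands in for a real model endpoint.

```python
from abc import ABC, abstractmethod
from typing import List


class TextGenerator(ABC):
    """Hypothetical stand-in for an abstract generator interface."""

    @abstractmethod
    def generate(self, prompt: str, **params) -> List[str]:
        """Return one or more completions for the prompt."""


class EchoClient:
    """Stub model client; a real one would call an inference endpoint."""

    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        return f"[completion for: {prompt}]"


class TemplateGenerator(TextGenerator):
    """Concrete generator: fills a prompt template, then queries the client."""

    def __init__(self, client: EchoClient, template: str) -> None:
        self.client = client
        self.template = template

    def generate(self, prompt: str, **params) -> List[str]:
        full_prompt = self.template.format(topic=prompt)
        return [self.client.complete(full_prompt, **params)]


generator = TemplateGenerator(EchoClient(), "Write a question about {topic}.")
outputs = generator.generate("photosynthesis", temperature=0.2)
```

Because callers depend only on the abstract interface, a different client or model can be swapped in without changing the code that consumes `outputs`.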

Why Synthetic Data Matters

Synthetic data serves multiple important purposes in language model training:

Purpose	Implementation	Benefits
Data augmentation	Scale enhancement	• Expand limited datasets for better model generalization • Create variations on existing examples • Fill gaps in data distribution
Domain adaptation	Domain-specific generation	• Create training examples for specialized or low-resource domains • Support adaptation to technical or niche fields • Enable custom domain knowledge infusion
Task-specific examples	Instruction tuning datasets	• Generate instruction-following examples for fine-tuning • Create diverse task formulations • Support multi-step reasoning examples
Privacy protection	Synthetic alternatives	• Create training data without exposing real-world private information • Support privacy-preserving learning • Reduce dependence on sensitive data sources
Balance and diversity	Distribution manipulation	• Address imbalances in existing datasets • Generate examples for underrepresented categories • Create controlled demographic distributions
Knowledge distillation	Model-to-model transfer	• Transfer capabilities from larger to smaller models • Extract knowledge from foundation models • Enable task-specific specialization

Synthetic Data Generation Approaches

NeMo Curator supports several approaches to synthetic data generation:

Model-Driven Generation

Using LLMs to create new examples from scratch or based on prompts:

Technique	Implementation	Capabilities
Question-answer pairs	`nemotron.py` with QA prompt templates	• Generate diverse question formats • Create factually accurate answers • Support multiple knowledge domains
Conversational dialogues	`nemotron.py` with dialogue prompt templates	• Create multi-turn conversations • Support varied dialogue styles and formats • Generate role-based interactions
Code examples	`nemotron.py` with code generation templates	• Generate programming examples in multiple languages • Create code explanations and comments • Support various programming paradigms
Task-specific instructions	`nemotron.py` with instruction templates	• Create explicit instructions for models to follow • Generate instruction-response pairs • Support complex multi-step tasks

Content Transformation

Converting existing content into new formats:

Technique	Implementation	Capabilities
Paraphrasing	`prompts.py` paraphrasing templates	• Transform text while preserving meaning • Generate stylistic variations • Create complexity-adjusted versions
Format conversion	`prompts.py` format transformation templates	• Transform web content into educational format • Convert unstructured text to structured formats • Create different document structures from source material
Structured data creation	`prompts.py` structure extraction templates	• Extract and organize information from text • Create JSON, tables, or other structured formats • Support schema-driven data extraction
Style transformation	`prompts.py` style templates	• Generate variations with different styles or tones • Adjust formality and technical complexity • Create voice and perspective variations
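
Template-driven transformation amounts to filling placeholders in a prompt string before sending it to a model. The sketch below assumes a hypothetical `PARAPHRASE_TEMPLATE` and helper `build_prompt`; the actual constants in `prompts.py` may use different wording and field names.

```python
# Hypothetical template; the real constants in prompts.py may differ.
PARAPHRASE_TEMPLATE = (
    "Rewrite the following text in a {style} style. "
    "Preserve the meaning and all facts.\n\n"
    "Text: {document}\n\nRewritten text:"
)


def build_prompt(template: str, **fields: str) -> str:
    """Fill a template, failing loudly if a placeholder is missing."""
    try:
        return template.format(**fields)
    except KeyError as missing:
        raise ValueError(f"missing template field: {missing}") from None


prompt = build_prompt(
    PARAPHRASE_TEMPLATE,
    style="formal",
    document="The cat sat on the mat.",
)
```

Validating the fields up front keeps a malformed prompt from reaching the model, where the failure would be harder to diagnose.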

Augmentation and Enhancement

Enhancing existing data:

Technique	Implementation	Capabilities
Context addition	`nemotron.py` with context templates	• Add context or details to sparse examples • Expand brief descriptions into comprehensive content • Enrich existing data with additional information
Multilingual content	`nemotron.py` with translation templates	• Translate content to multiple languages • Create parallel multilingual datasets • Support cross-lingual learning
Simplification	`prompts.py` simplification templates	• Simplify complex text • Create reading-level adjusted versions • Generate accessible content variations
Explanation generation	`prompts.py` explanation templates	• Create explanations for examples • Generate step-by-step reasoning • Provide educational context for content

Key Components in Synthetic Data Generation

NeMo Curator’s synthetic data generation system consists of several key components:

LLM Service Integration

NeMo Curator implements model service integration through dedicated components:

Component	Implementation	Capabilities
Model client	`nemotron.py`, `async_nemotron.py`	• Connect to LLM inference endpoints using standard APIs • Handle authentication and request formatting • Manage connection pooling and retries
Request manager	`async_nemotron_cc.py`	• Format generation parameters • Handle batching and request optimization • Support streaming and non-streaming interfaces
Response processor	`nemotron.py`, `nemotron_cc.py`, `async_nemotron_cc.py`	• Parse and validate model responses • Extract generated text from API responses • Apply post-processing to raw outputs
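
Parallel request handling with retries can be illustrated with plain `asyncio`. This is a sketch under stated assumptions, not the library's client: `FlakyEndpoint`, `complete_with_retry`, and `generate_batch` are hypothetical, and the stub endpoint fails the first call for each prompt to exercise the retry path.

```python
import asyncio
from typing import Dict, List


class FlakyEndpoint:
    """Stub endpoint that fails the first call for each prompt, then succeeds."""

    def __init__(self) -> None:
        self.calls: Dict[str, int] = {}

    async def complete(self, prompt: str) -> str:
        self.calls[prompt] = self.calls.get(prompt, 0) + 1
        await asyncio.sleep(0)  # yield control, as a real network call would
        if self.calls[prompt] == 1:
            raise ConnectionError("transient failure")
        return f"response to: {prompt}"


async def complete_with_retry(endpoint, prompt, limiter, max_attempts=3):
    """Send one request, retrying transient failures with a short backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with limiter:  # cap the number of in-flight requests
                return await endpoint.complete(prompt)
        except ConnectionError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(0.01 * attempt)


async def generate_batch(endpoint, prompts: List[str],
                         max_concurrency: int = 4) -> List[str]:
    limiter = asyncio.Semaphore(max_concurrency)
    tasks = [complete_with_retry(endpoint, p, limiter) for p in prompts]
    return await asyncio.gather(*tasks)  # results keep the order of `prompts`


endpoint = FlakyEndpoint()
results = asyncio.run(generate_batch(endpoint, ["a", "b", "c"]))
```

The semaphore bounds concurrency so that a large batch does not overwhelm the endpoint, while `asyncio.gather` preserves input order in the returned list.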

Generation Pipelines

NeMo Curator organizes generation into the following approaches:

Approach	Implementation	Capabilities
Direct generation	`SyntheticDataGenerator` with prompt templates	• Create content from scratch with prompts • Support various completion styles • Control generation parameters (temperature, length, etc.)
Content transformation	`nemotron.py` with transformation prompts	• Convert existing content to new formats • Apply controlled modifications to source material • Preserve key information during transformation
Quality-focused generation	`nemotron.py` with quality constraints	• Use reinforcement learning or filtering to ensure quality • Implement multi-step generation with refinement • Support human-in-the-loop feedback integration
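
One simple form of quality-focused generation is to produce several candidates, score each, and keep the best one only if it clears a threshold. The sketch below is illustrative: `toy_score` is a hypothetical placeholder for a reward model or quality classifier, and `select_best` is not a NeMo Curator function.

```python
from typing import Callable, List, Optional


def toy_score(text: str) -> float:
    """Placeholder scorer: favors longer answers that end with punctuation."""
    words = len(text.split())
    ends_cleanly = text.rstrip().endswith((".", "?", "!"))
    return min(words, 20) / 20 + (0.5 if ends_cleanly else 0.0)


def select_best(candidates: List[str], score: Callable[[str], float],
                min_score: float) -> Optional[str]:
    """Return the top-scoring candidate, or None if none reaches min_score."""
    ranked = sorted(candidates, key=score, reverse=True)
    if not ranked or score(ranked[0]) < min_score:
        return None
    return ranked[0]


candidates = [
    "Plants make food",
    "Plants use sunlight, water, and carbon dioxide to make sugar.",
    "idk",
]
best = select_best(candidates, toy_score, min_score=0.6)
```

Returning `None` when every candidate scores poorly lets the caller decide whether to regenerate or to skip the example.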

Quality Assessment

NeMo Curator’s synthetic data quality assessment includes:

Component	Implementation	Capabilities
Output filters	Integration with `nemo_curator.filters`	• Filter low-quality outputs • Apply quality criteria to generated content • Remove unsuitable examples automatically
Model-based evaluation	`nemotron.py` evaluation modes	• Apply reward models to assess helpfulness and safety • Calculate quality scores for generated content • Rank outputs by quality metrics
Diversity analysis	Statistical analysis tools	• Ensure diversity in the generated dataset • Measure and control text similarity • Prevent repetitive or redundant outputs
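
Measuring text similarity to prevent redundant outputs can be sketched with Jaccard similarity over word bigrams. This is one simple technique chosen for illustration, not necessarily the method NeMo Curator uses; the function names and the 0.5 threshold are assumptions.

```python
from typing import List, Set


def shingles(text: str, n: int = 2) -> Set[tuple]:
    """Return the set of word n-grams in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[tuple], b: Set[tuple]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(texts: List[str], threshold: float = 0.5) -> List[str]:
    """Greedily keep each text unless it is too similar to one already kept."""
    kept: List[str] = []
    kept_shingles: List[Set[tuple]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


samples = [
    "The capital of France is Paris.",
    "The capital of France is Paris!",
    "Photosynthesis converts light into chemical energy.",
]
unique = drop_near_duplicates(samples)
```

The pairwise comparison is quadratic in the number of kept texts, so at scale an approximate method such as MinHash would be a more practical choice.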

Synthetic Data Generation Workflow

The synthetic data generation process in NeMo Curator follows these stages:

Stage	Implementation	Description
1. Task definition	Configuration and parameters	• Define the generation objective • Select appropriate task templates • Configure generation parameters
2. Prompt preparation	Prompt template constants in `prompts.py`	• Select predefined prompt templates • Parameterize templates with specific requirements • Prepare instruction formats
3. Generation	`nemotron.py` or `async_nemotron.py`	• Connect to model endpoints • Execute generation requests • Receive and parse responses
4. Post-processing	Data transformation utilities	• Extract relevant content from responses • Apply formatting and structure • Prepare for integration with datasets
5. Quality filtering	Integration with filter modules	• Apply quality checks to generated content • Filter based on relevance and coherence • Remove problematic content
6. Dataset integration	Integration with `DocumentDataset`	• Add generated content to DocumentDataset • Combine with authentic data • Apply appropriate weighting and balancing
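
The six stages above can be traced end to end in a small, self-contained sketch. Everything here is hypothetical: `stub_model` replaces a real endpoint, records are plain dictionaries rather than a `DocumentDataset`, and the word-count filter stands in for real quality checks.

```python
from typing import Callable, Dict, List

# 1. Task definition: objective and parameters (all values are illustrative).
task = {"topics": ["gravity", "volcanoes"], "min_words": 5}

# 2. Prompt preparation.
TEMPLATE = "Write one short factual sentence about {topic}."


def stub_model(prompt: str) -> str:
    """3. Generation: placeholder for a call to a model endpoint."""
    if "gravity" in prompt:
        return "  Gravity pulls objects with mass toward each other.  "
    return "Lava."  # deliberately too short, so the filter has work to do


def run_pipeline(task: Dict, model: Callable[[str], str]) -> List[Dict[str, str]]:
    records = []
    for topic in task["topics"]:
        prompt = TEMPLATE.format(topic=topic)
        raw = model(prompt)
        text = raw.strip()  # 4. Post-processing
        if len(text.split()) < task["min_words"]:  # 5. Quality filtering
            continue
        records.append({"text": text, "source": "synthetic", "topic": topic})
    return records


authentic = [{"text": "Water boils at 100 degrees Celsius at sea level.",
              "source": "authentic", "topic": "water"}]

# 6. Dataset integration: combine synthetic records with authentic ones.
dataset = authentic + run_pipeline(task, stub_model)
```

Tagging each record with its `source` keeps synthetic and authentic examples distinguishable for later weighting or evaluation.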

Conceptual Framework

Synthetic data generation in NeMo Curator follows these principles:

Principle	Implementation	Capabilities
Quality	Quality controls in generation pipelines	• Focus on generating high-quality, useful training data • Apply multi-stage quality verification • Implement filtering for substandard outputs
Integration	Pipeline compatibility	• Work seamlessly with data processing pipelines • Support dataset augmentation workflows • Enable integration with filtering and classification
Scale	Distributed generation capabilities	• Support large-scale generation for comprehensive datasets • Enable parallel processing across resources • Optimize for generation throughput
Control	Parameterized generation	• Provide fine-grained control over generation parameters • Enable task-specific configuration • Support varied generation strategies
Evaluation	Quality assessment integration	• Include mechanisms to assess generated content quality • Track generation success metrics • Support iterative quality improvement

Considerations for Synthetic Data

When working with synthetic data, consider:

Consideration	Implementation	Strategies
Quality control	Quality filters and metrics	• Generated content may vary in quality • Implement robust filtering mechanisms • Apply multiple quality criteria
Diversity	Diversity analysis and constraints	• Generated examples might lack diversity without careful prompting • Use techniques to ensure varied outputs • Monitor and control similarity metrics
Model bias	Bias detection and mitigation	• Synthetic data may inherit biases from the generating model • Apply bias detection to outputs • Implement fairness criteria in generation
Evaluation	Separate evaluation protocols	• Use separate evaluation metrics for synthetic vs. natural data • Benchmark against authentic datasets • Measure performance impact of synthetic data inclusion
Blending strategy	Mixture configurations	• Determine optimal mix of synthetic and authentic data • Experiment with different weighting strategies • Apply domain-specific mixing ratios
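
A blending strategy can be expressed as a target fraction of synthetic examples in the final mix. The sketch below is an assumption-laden illustration: `blend` is a hypothetical helper, it keeps every authentic example, and the 0.25 ratio is arbitrary rather than a recommended value.

```python
import random
from typing import List


def blend(authentic: List[str], synthetic: List[str],
          synthetic_fraction: float, seed: int = 0) -> List[str]:
    """Keep all authentic examples and sample synthetic ones so that they
    make up roughly `synthetic_fraction` of the blended dataset."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Solve s / (a + s) = f for s, the number of synthetic examples needed.
    wanted = round(len(authentic) * synthetic_fraction / (1.0 - synthetic_fraction))
    rng = random.Random(seed)  # seeded for reproducible mixes
    sampled = rng.sample(synthetic, min(wanted, len(synthetic)))
    mixed = authentic + sampled
    rng.shuffle(mixed)
    return mixed


authentic = [f"real-{i}" for i in range(6)]
synthetic = [f"synth-{i}" for i in range(10)]
mixed = blend(authentic, synthetic, synthetic_fraction=0.25)
```

With 6 authentic examples and a target fraction of 0.25, the helper samples 2 synthetic examples, giving 8 in total. Sweeping `synthetic_fraction` and measuring downstream performance is one way to find a suitable ratio.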

Implementation Model

NeMo Curator’s synthetic data generation follows a layered implementation:

Layer	Components	Responsibilities
Interface layer	`generator.py`, API definitions	• Define consistent interfaces for generation • Establish common parameter conventions • Enable component interchangeability
Model integration layer	`nemotron.py`, `async_nemotron.py`, `mixtral.py`	• Connect to specific model endpoints • Handle model-specific requirements • Optimize for particular model capabilities
Prompt management layer	`prompts.py`	• Maintain constants and templates for effective prompts • Support prompt customization and parameterization • Enable domain-specific prompt adaptation
Error handling layer	`error.py`	• Manage generation failures gracefully • Provide informative error messages • Support retry and fallback strategies
Format adaptation layer	`no_format.py`	• Handle variations in output formats • Standardize generation results • Support multiple output structures
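
The interplay between error handling and format adaptation can be sketched as a parse-and-retry loop: raw model output is converted to a structured list, and a dedicated exception signals when conversion fails. The names `ConversionError`, `parse_string_list`, and `generate_list` are hypothetical, and JSON is used here only because it is in the standard library; the contents of `error.py` and `no_format.py` may differ.

```python
import json
from typing import Callable, List


class ConversionError(Exception):
    """Hypothetical error: model output could not be parsed into a list."""


def parse_string_list(raw: str) -> List[str]:
    """Parse a JSON array of strings from raw model output."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConversionError(f"not valid JSON: {exc}") from exc
    if not isinstance(parsed, list) or not all(isinstance(x, str) for x in parsed):
        raise ConversionError("expected a JSON array of strings")
    return parsed


def generate_list(model: Callable[[str], str], prompt: str,
                  max_attempts: int = 3) -> List[str]:
    """Retry until the output parses; raise ConversionError after max_attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_string_list(model(prompt))
        except ConversionError as exc:
            last_error = exc
    raise ConversionError(f"gave up after {max_attempts} attempts") from last_error


# Stub model: the first reply is free text, the second is well-formed JSON.
replies = iter(["Sure! Here are topics: math, art", '["math", "art"]'])
topics = generate_list(lambda prompt: next(replies), "List two topics as a JSON array.")
```

A specific exception type lets callers distinguish malformed output, which is often worth retrying, from connection or authentication failures.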