Generate Data#

Generate synthetic text data using large language models (LLMs) for pre-training, fine-tuning, and evaluation tasks. Create high-quality training data for low-resource languages and domains, or perform knowledge distillation from existing models.

How it Works#

NeMo Curator’s synthetic data generation capabilities are organized into several components:

Model Integration: Connect to OpenAI-compatible model endpoints or self-hosted models
Generation Pipelines: Use pre-built pipelines for common generation tasks
Custom Workflows: Combine components to create specialized generation pipelines
Quality Control: Filter and validate generated data using NeMo Curator’s processing tools

Service Connections#

Connect your data generation workflows to powerful language models and scoring services. Choose from cloud-based APIs or deploy models in your own infrastructure.

OpenAI Integration

Connect to OpenAI’s API endpoints for GPT models and other services

openai gpt api

OpenAI Compatible Services

NeMo Deploy Integration

Deploy and connect to models using NVIDIA NeMo Deploy

nemo-deploy self-hosted deployment

NeMo Deploy

Reward Model Integration

Integrate reward models for quality scoring and filtering

reward-model quality scoring

Reward Models

Generation Pipelines#

Transform your data needs into production-ready synthetic datasets using specialized generation pipelines.

Q&A Generation Pipelines#

Use these pipelines to generate question-and-answer data for training, evaluation, and comprehension tasks.

Closed Q&A Generation Pipeline

Generate closed-ended questions about a given document. Ideal for creating evaluation or comprehension datasets.

closed-qa document

Closed Q&A Generation Pipeline

Open Q&A Generation Pipeline

Generate open-ended questions (“openlines”) for dialogue data, including macro topics, subtopics, and detailed revisions.

open-qa question-generation

Open Q&A Generation Pipeline

Diverse QA Generation Pipeline

Generate diverse question-answer pairs from documents for QA datasets.

qa-pairs diverse

Diverse QA Generation Pipeline

Content Transformation & Summarization#

Transform, rewrite, and summarize documents to create clear, concise, and structured text data.

Wikipedia Style Rewrite Pipeline

Rewrite documents into a style similar to Wikipedia, improving clarity and scholarly tone.

wikipedia rewrite

Wikipedia Style Rewrite Pipeline

Distillation Pipeline

Distill documents to concise summaries, removing redundancy and focusing on key information.

distillation summarization

Distillation Pipeline

Knowledge Extraction Pipeline

Extract key knowledge and facts from documents for summarization and analysis.

knowledge extraction

Knowledge Extraction Pipeline

Knowledge List Generation Pipeline

Extract structured knowledge lists from documents for downstream use.

knowledge-lists extraction

Knowledge List Generation Pipeline

Dialogue & Writing#

Create synthetic dialogues and writing tasks to support conversational and creative data generation.

Dialogue Generation Pipeline

Generate multi-turn dialogues and two-turn prompts for preference data. Synthesize conversations where an LLM plays both user and assistant.

dialogue multi-turn

Dialogue Generation Pipeline

Writing Task Generation Pipeline

Generate writing prompts (essays, poems, etc.) and revise them for detail and diversity. Useful for creative and instructional datasets.

writing creative

Writing Task Generation Pipeline

STEM & Coding#

Generate math and coding problems, as well as classify entities for STEM-related datasets.

Math Generation Pipeline

Generate math questions for dialogue data, including macro topics, subtopics, and problems at various school levels.

math education

Math Generation Pipeline

Python Generation Pipeline

Generate Python coding problems for dialogue data, including macro topics, subtopics, and problems for various skill levels.

python coding

Python Generation Pipeline

Entity Classification Pipeline

Classify entities (for example, Wikipedia entries) as math- or Python-related using an LLM. Useful for filtering or labeling data for downstream tasks.

entity-classification math python

Entity Classification Pipeline

Infrastructure & Customization#

Leverage asynchronous pipelines and customizable prompts to scale and tailor your data generation workflows.

Asynchronous Generation Pipeline

Generate synthetic data in parallel using asynchronous pipelines for maximum efficiency. Ideal for large-scale prompt generation and working with rate-limited LLM APIs. Provides async alternatives to all major text data generation pipelines in NeMo Curator.

async parallel LLM

Asynchronous Generation Pipeline

Integrations#

Combine generation with powerful filtering and processing capabilities.

Integration with NeMo Curator

Combine synthetic data generation with other NeMo Curator modules for filtering and processing

filtering processing pipeline

Integration with NeMo Curator