Text Data Generation Pipelines

NeMo Curator provides pre-built pipelines for generating high-quality synthetic text data. These pipelines implement proven approaches for creating training data across different formats and styles.


Q&A Generation Pipelines

Use these pipelines to generate question-and-answer data for training, evaluation, and comprehension tasks.

Closed Q&A Generation Pipeline

Generate closed-ended questions about a given document. Ideal for creating evaluation or comprehension datasets.

Open Q&A Generation Pipeline

Generate open-ended questions (“openlines”) for dialogue data, including macro topics, subtopics, and detailed revisions.

Diverse QA Generation Pipeline

Generate diverse question-answer pairs from documents for QA datasets.


Content Transformation & Summarization

Transform, rewrite, and summarize documents to create clear, concise, and structured text data.

Wikipedia Style Rewrite Pipeline

Rewrite documents into a style similar to Wikipedia, improving clarity and scholarly tone.

Distillation Pipeline

Distill documents to concise summaries, removing redundancy and focusing on key information.

Knowledge Extraction Pipeline

Extract key knowledge and facts from documents for summarization and analysis.

Knowledge List Generation Pipeline

Extract structured knowledge lists from documents for downstream use.


Dialogue & Writing

Create synthetic dialogues and writing tasks to support conversational and creative data generation.

Dialogue Generation Pipeline

Generate multi-turn dialogues and two-turn prompts for preference data. Synthesize conversations where an LLM plays both user and assistant.

Writing Task Generation Pipeline

Generate writing prompts (essays, poems, etc.) and revise them for detail and diversity. Useful for creative and instructional datasets.


STEM & Coding

Generate math and coding problems, as well as classify entities for STEM-related datasets.

Math Generation Pipeline

Generate math questions for dialogue data, including macro topics, subtopics, and problems at various school levels.

Python Generation Pipeline

Generate Python coding problems for dialogue data, including macro topics, subtopics, and problems for various skill levels.

Entity Classification Pipeline

Classify entities (for example, Wikipedia entries) as math- or Python-related using an LLM. Useful for filtering or labeling data for downstream tasks.


Infrastructure & Customization

Leverage asynchronous pipelines and customizable prompts to scale and tailor your data generation workflows.

Asynchronous Generation Pipeline

Generate synthetic data concurrently using asynchronous pipelines, so that many LLM requests are in flight at once instead of running one after another. Well suited to large-scale prompt generation and to rate-limited LLM APIs. Provides async alternatives to all major text data generation pipelines in NeMo Curator.
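
The general idea behind asynchronous generation against a rate-limited API can be sketched with Python's standard `asyncio` module. This is a minimal illustration, not NeMo Curator's API: `fake_llm_call`, `generate_all`, and `max_concurrent` are hypothetical names, and the stub simply simulates network latency.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a real asynchronous LLM client call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"


async def generate_all(prompts, max_concurrent=4):
    """Run all prompts concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt):
        # The semaphore caps concurrency so a rate-limited API is not flooded.
        async with semaphore:
            return await fake_llm_call(prompt)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(p) for p in prompts))


prompts = [f"Write a question about topic {i}" for i in range(10)]
responses = asyncio.run(generate_all(prompts))
```

Capping concurrency with a semaphore is one common way to stay under a provider's rate limit; the right value for `max_concurrent` depends on the API being called.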


References

Find additional resources and guidance for customizing prompts and using generation pipelines effectively.

Customizing Prompts

Customize prompt templates for any generation step. Use built-in or user-defined templates to control LLM behavior.
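
As a rough illustration of what swapping a built-in template for a user-defined one looks like, the sketch below uses plain Python string formatting. The names `DEFAULT_TEMPLATE`, `CUSTOM_TEMPLATE`, and `build_prompt` are hypothetical and do not come from NeMo Curator; the library's own template names and parameters may differ.

```python
# Hypothetical default template; placeholders are filled in at call time.
DEFAULT_TEMPLATE = (
    "Generate {n_questions} closed-ended questions about the following "
    "document:\n\n{document}"
)

# A user-defined template that changes the instructions given to the LLM.
CUSTOM_TEMPLATE = (
    "You are a strict examiner. Write {n_questions} multiple-choice "
    "questions that can be answered only from this text:\n\n{document}"
)


def build_prompt(document: str, n_questions: int, template: str = DEFAULT_TEMPLATE) -> str:
    """Fill a prompt template with the document and question count."""
    return template.format(document=document, n_questions=n_questions)


doc = "The mitochondrion is the powerhouse of the cell."
default_prompt = build_prompt(doc, 3)
custom_prompt = build_prompt(doc, 3, template=CUSTOM_TEMPLATE)
```

Because the template is an ordinary argument, the same generation step can be rerun with different instructions without changing any other code.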

Customizing Prompt Templates