Nemotron-CC Task Reference
Nemotron-CC Task Reference
Nemotron-CC Task Reference
This reference documents each Nemotron-CC synthetic data generation stage, including prompt templates, configuration options, and post-processing details.
System Prompt Usage: Not all stages use a system prompt.
WikipediaParaphrasingStage and DistillStage include system promptsDiverseQAStage, ExtractKnowledgeStage, and KnowledgeListStage use only user prompts (no system prompt)Rewrites low-quality text in Wikipedia-style prose, improving readability and structure.
Transform noisy or poorly-written web data into high-quality, encyclopedic text suitable for training language models.
The stage uses a system prompt establishing the assistant persona and a user prompt requesting paraphrasing:
Refer to the full prompt in source.
The Wikipedia post-processing pipeline:
Generates diverse question-answer pairs from document content.
Create reading comprehension training data with varied question types and cognitive complexity levels.
The stage requests up to eight diverse Q&A pairs with specific formatting:
The DiverseQAPostProcessingStage performs specialized parsing:
Post-processing logic:
The number of Q&A pairs sampled is proportional to input length:
Creates condensed, information-dense paraphrases while preserving key concepts.
Generate training data that captures essential knowledge in a more accessible format, suitable for knowledge distillation.
Extracts and rewrites knowledge as textbook-style passages.
Convert raw text into educational-quality passages organized by domain, suitable for building knowledge bases.
Extracts structured fact lists from documents.
Generate bullet-pointed factual content for structured knowledge extraction.
Post-processing logic:
To use custom prompts while maintaining Nemotron-CC infrastructure, subclass BaseSyntheticStage:
The {document} placeholder is replaced with the content from input_field.
The following example shows the conceptual configuration structure for Nemotron-CC tasks. For production pipelines, see the tutorial examples.
nemo_curator/stages/synthetic/nemotron_cc/prompts.pynemo_curator/stages/synthetic/nemotron_cc/nemotron_cc.pynemo_curator/stages/synthetic/nemotron_cc/base.pytutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.pytutorials/synthetic/nemotron_cc/