Nemotron-CC Task Reference
This reference documents each Nemotron-CC synthetic data generation stage, including prompt templates, configuration options, and post-processing details.
System Prompt Usage: Not all stages use a system prompt.
WikipediaParaphrasingStageandDistillStageinclude system promptsDiverseQAStage,ExtractKnowledgeStage, andKnowledgeListStageuse only user prompts (no system prompt)
WikipediaParaphrasingStage
Rewrites low-quality text in Wikipedia-style prose, improving readability and structure.
Purpose
Transform noisy or poorly-written web data into high-quality, encyclopedic text suitable for training language models.
Configuration
Prompt Template
The stage uses a system prompt establishing the assistant persona and a user prompt requesting paraphrasing:
Refer to the full prompt in source.
Post-Processing
The Wikipedia post-processing pipeline:
- Filters by token count (max 510 tokens)
- Removes markdown formatting
- Validates prefix “Here is a paraphrased version:”
- Removes the prefix from output
- Removes quotation marks
- Joins document segments
- Filters documents below 50 tokens
DiverseQAStage
Generates diverse question-answer pairs from document content.
Purpose
Create reading comprehension training data with varied question types and cognitive complexity levels.
Configuration
Prompt Template
The stage requests up to eight diverse Q&A pairs with specific formatting:
Post-Processing with DiverseQAPostProcessingStage
The DiverseQAPostProcessingStage performs specialized parsing:
Post-processing logic:
- Parse Q&A pairs from bullet-formatted output
- Merge question and answer lines
- Shuffle pairs randomly
- Sample pairs based on input document length (using tokenizer)
- Concatenate original document with selected Q&A pairs
The number of Q&A pairs sampled is proportional to input length:
DistillStage
Creates condensed, information-dense paraphrases while preserving key concepts.
Purpose
Generate training data that captures essential knowledge in a more accessible format, suitable for knowledge distillation.
Configuration
Prompt Template
Post-Processing
- Filter by token count (max 1598 tokens)
- Remove markdown formatting
- Validate “Paraphrased Text:” prefix
- Remove the prefix
- Remove quotation marks
- Filter documents below 50 tokens
ExtractKnowledgeStage
Extracts and rewrites knowledge as textbook-style passages.
Purpose
Convert raw text into educational-quality passages organized by domain, suitable for building knowledge bases.
Configuration
Prompt Template
Post-Processing
- Filter by token count (max 1398 tokens)
- Remove markdown formatting
- Remove passage labels (“Passage:”, “Passage 1:”, etc.)
- Filter documents below 50 tokens
KnowledgeListStage
Extracts structured fact lists from documents.
Purpose
Generate bullet-pointed factual content for structured knowledge extraction.
Configuration
Prompt Template
Post-Processing with KnowledgeListPostProcessingStage
Post-processing logic:
- Skip the first line if it doesn’t start with a bullet marker
- Remove leading bullet markers (”- ”) and indentation prefixes (” ”)
- Join lines with newlines
Customizing Prompts
To use custom prompts while maintaining Nemotron-CC infrastructure, subclass BaseSyntheticStage:
The {document} placeholder is replaced with the content from input_field.
Complete Configuration Example
The following example shows the conceptual configuration structure for Nemotron-CC tasks. For production pipelines, see the tutorial examples.
Source Code References
- Prompts:
nemo_curator/stages/synthetic/nemotron_cc/prompts.py - Stages:
nemo_curator/stages/synthetic/nemotron_cc/nemotron_cc.py - Base Class:
nemo_curator/stages/synthetic/nemotron_cc/base.py - Pipeline Helpers:
tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py - Full Tutorial Examples:
tutorials/synthetic/nemotron_cc/