
Nemotron-CC Task Reference


This reference documents each Nemotron-CC synthetic data generation stage, including prompt templates, configuration options, and post-processing details.

System Prompt Usage: Not all stages use a system prompt.

  • WikipediaParaphrasingStage and DistillStage include system prompts
  • DiverseQAStage, ExtractKnowledgeStage, and KnowledgeListStage use only user prompts (no system prompt)

WikipediaParaphrasingStage

Rewrites low-quality text as Wikipedia-style prose, improving readability and structure.

Purpose

Transform noisy or poorly-written web data into high-quality, encyclopedic text suitable for training language models.

Configuration

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import WikipediaParaphrasingStage

stage = WikipediaParaphrasingStage(
    client=llm_client,
    model_name="meta/llama-3.3-70b-instruct",
    generation_config=generation_config,
    input_field="text",
    output_field="rephrased",
)

Prompt Template

The stage uses a system prompt establishing the assistant persona and a user prompt requesting paraphrasing:

System: A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the questions.
User: For the following paragraph give me a diverse paraphrase of the same in
high quality English language as in sentences on Wikipedia. Begin your answer
on a separate line with "Here is a paraphrased version:".
Text: {document}

Refer to the source code for the full prompt template.

Post-Processing

The Wikipedia post-processing pipeline:

  1. Filters by token count (max 510 tokens)
  2. Removes markdown formatting
  3. Validates prefix “Here is a paraphrased version:”
  4. Removes the prefix from output
  5. Removes quotation marks
  6. Joins document segments
  7. Filters documents below 50 tokens

DiverseQAStage

Generates diverse question-answer pairs from document content.

Purpose

Create reading comprehension training data with varied question types and cognitive complexity levels.

Configuration

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAStage

stage = DiverseQAStage(
    client=llm_client,
    model_name="meta/llama-3.3-70b-instruct",
    generation_config=generation_config,
    input_field="text",
    output_field="diverse_qa",
)

Prompt Template

The stage requests up to eight diverse Q&A pairs with specific formatting:

Task: Read the text, ask questions and answer them.
Follow these instructions:
1. Ask diverse questions that require different cognitive skills
2. Ask questions in various forms:
- Yes/No questions
- Open-ended questions (what, how, when, where, why, who)
- Multi-choice questions with options
- Comparison questions
- Reading comprehension questions
- Problem-solving questions
3. Focus on factual information and key concepts
4. Use clear and concise language
5. Use plain text (no Markdown)
6. Format: Question: [question] Answer: [answer]
Text: {document}

Post-Processing with DiverseQAPostProcessingStage

The DiverseQAPostProcessingStage performs specialized parsing:

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAPostProcessingStage

post_stage = DiverseQAPostProcessingStage(
    input_field="text",
    qa_field="diverse_qa",
    tokenizer=tokenizer,  # For length-based sampling
    prefix="Here are the questions and answers based on the provided text:",
    max_num_pairs=10,
)

Post-processing logic:

  1. Parse Q&A pairs from bullet-formatted output
  2. Merge question and answer lines
  3. Shuffle pairs randomly
  4. Sample pairs based on input document length (using tokenizer)
  5. Concatenate original document with selected Q&A pairs
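Steps 1–2 come down to pairing `Question:`/`Answer:` lines from the model's bullet-formatted output. The sketch below illustrates the idea; it is not the library's actual parser:

```python
def parse_qa_pairs(raw: str) -> list[tuple[str, str]]:
    """Pair up Question:/Answer: lines from bullet-formatted LLM output."""
    pairs = []
    question = None
    for line in raw.splitlines():
        line = line.lstrip("- ").strip()
        if line.startswith("Question:"):
            question = line[len("Question:"):].strip()
        elif line.startswith("Answer:") and question is not None:
            pairs.append((question, line[len("Answer:"):].strip()))
            question = None
    return pairs
```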

The number of Q&A pairs sampled is proportional to input length:

num_pairs = random.randint(1, max(1, int(max_num_pairs * num_tokens / 150)))
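For intuition: a 300-token input with `max_num_pairs=10` caps the draw at `int(10 * 300 / 150) = 20` pairs, while very short inputs fall back to a single pair. Wrapping the formula in a helper (illustrative, not the library's API):

```python
import random

def sample_num_pairs(num_tokens: int, max_num_pairs: int = 10) -> int:
    # Same formula as above; the upper bound never drops below 1.
    return random.randint(1, max(1, int(max_num_pairs * num_tokens / 150)))
```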

DistillStage

Creates condensed, information-dense paraphrases while preserving key concepts.

Purpose

Generate training data that captures essential knowledge in a more accessible format, suitable for knowledge distillation.

Configuration

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DistillStage

stage = DistillStage(
    client=llm_client,
    model_name="meta/llama-3.3-70b-instruct",
    generation_config=generation_config,
    input_field="text",
    output_field="distill",
)

Prompt Template

System: You are an artificial intelligence assistant. You carefully provide
accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning.
User: Your task is to read and paraphrase the provided text following these instructions:
- Create a condensed but accurate and informative version
- Preserve crucial information, key concepts, important values, factual details
- Retain technical terms and specialized vocabulary
- Retain examples and explanations of reasoning
- Only include information present in the original text
- Write in plain text without formatting
Text: {document}
Task: Paraphrase in high-quality English. Begin with "Paraphrased Text:".

Post-Processing

  1. Filter by token count (max 1598 tokens)
  2. Remove markdown formatting
  3. Validate “Paraphrased Text:” prefix
  4. Remove the prefix
  5. Remove quotation marks
  6. Filter documents below 50 tokens
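The two token-count filters (steps 1 and 6) can be expressed as a single predicate. A sketch assuming a HuggingFace-style tokenizer with an `encode` method; the function name is hypothetical:

```python
def passes_length_filters(text: str, tokenizer,
                          min_tokens: int = 50, max_tokens: int = 1598) -> bool:
    """Keep documents whose token count falls within [min_tokens, max_tokens]."""
    return min_tokens <= len(tokenizer.encode(text)) <= max_tokens
```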

ExtractKnowledgeStage

Extracts and rewrites knowledge as textbook-style passages.

Purpose

Convert raw text into educational-quality passages organized by domain, suitable for building knowledge bases.

Configuration

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import ExtractKnowledgeStage

stage = ExtractKnowledgeStage(
    client=llm_client,
    model_name="meta/llama-3.3-70b-instruct",
    generation_config=generation_config,
    input_field="text",
    output_field="extract_knowledge",
)

Prompt Template

Your task is to rewrite knowledge from the provided text following these instructions:
- Rewrite as passages using easy-to-understand, high-quality English
like sentences in textbooks and Wikipedia
- Focus on content in disciplines: humanities, social sciences, natural sciences,
technology, engineering, math, law, business, management, art, education,
agricultural sciences, politics, and history
- Disregard content without useful facts or knowledge
- Retain examples and supporting evidence
- Do not add or alter details
- Write in plain text
- Do not add titles or comments
Text: {document}
Task: Rewrite facts and knowledge as passages following the instructions.

Post-Processing

  1. Filter by token count (max 1398 tokens)
  2. Remove markdown formatting
  3. Remove passage labels (“Passage:”, “Passage 1:”, etc.)
  4. Filter documents below 50 tokens
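Step 3 is a line-anchored substitution. The regex below illustrates what such label removal can look like; it is not the library's exact pattern:

```python
import re

# Matches "Passage:" or "Passage 1:" (etc.) at the start of a line.
PASSAGE_LABEL = re.compile(r"^Passage(\s+\d+)?:\s*", flags=re.MULTILINE)

def remove_passage_labels(text: str) -> str:
    return PASSAGE_LABEL.sub("", text)
```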

KnowledgeListStage

Extracts structured fact lists from documents.

Purpose

Generate bullet-pointed factual content for structured knowledge extraction.

Configuration

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import KnowledgeListStage

stage = KnowledgeListStage(
    client=llm_client,
    model_name="meta/llama-3.3-70b-instruct",
    generation_config=generation_config,
    input_field="text",
    output_field="knowledge_list",
)

Prompt Template

Review the text and extract the key information. Follow these instructions:
- Provide a concise and organized list of factual information
- Include concrete details, key concepts, and important statistics
- Ensure each point is clear, specific, and supported by the original text
- Ensure extracted text is information-dense
- Do not add titles or headings
Text: {document}
Task: Extract factual information, concrete details, and key concepts.

Post-Processing with KnowledgeListPostProcessingStage

from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import KnowledgeListPostProcessingStage

post_stage = KnowledgeListPostProcessingStage(
    input_field="knowledge_list",
)

Post-processing logic:

  1. Skip the first line if it doesn’t start with a bullet marker
  2. Remove leading bullet markers (“- ”) and indentation prefixes (“ ”)
  3. Join lines with newlines
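The three steps above can be sketched as follows (assuming two-space indentation prefixes; not the library's exact implementation):

```python
def clean_knowledge_list(raw: str) -> str:
    lines = raw.splitlines()
    # 1. Drop a leading line that is not a bullet (often a preamble).
    if lines and not lines[0].startswith("- "):
        lines = lines[1:]
    # 2. Strip bullet markers and indentation prefixes.
    stripped = [line[2:] if line.startswith(("- ", "  ")) else line
                for line in lines]
    # 3. Rejoin with newlines.
    return "\n".join(stripped)
```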

Customizing Prompts

To use custom prompts while maintaining Nemotron-CC infrastructure, subclass BaseSyntheticStage:

from dataclasses import dataclass

from nemo_curator.stages.synthetic.nemotron_cc.base import BaseSyntheticStage

@dataclass
class CustomSyntheticStage(BaseSyntheticStage):
    system_prompt: str = "You are a helpful assistant specialized in..."
    prompt: str = """Your custom prompt template here.

Text: {document}

Instructions: ..."""
    input_field: str = "text"
    output_field: str = "custom_output"

    @property
    def name(self) -> str:
        return "CustomSyntheticStage"

The {document} placeholder is replaced with the content from input_field.
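Conceptually, the substitution is a standard str.format call over the record's input field (the prompt and record below are illustrative):

```python
prompt = "Summarize the following.\n\nText: {document}"
record = {"text": "Solar panels convert sunlight into electricity."}

# The stage fills {document} with the content of input_field.
rendered = prompt.format(document=record["text"])
```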


Complete Configuration Example

The following example shows the conceptual configuration structure for Nemotron-CC tasks. For production pipelines, see the tutorial examples.

TASK_CONFIG = {
    "diverse_qa": {
        "system_prompt": None,  # DiverseQAStage uses no system prompt
        "prompt_template": DIVERSE_QA_PROMPT_TEMPLATE,
        "min_document_tokens": 30,
        "min_segment_tokens": 30,
        "max_input_tokens": 1000,
        "max_output_tokens": 598,
    },
    "distill": {
        "system_prompt": NEMOTRON_CC_DISTILL_SYSTEM_PROMPT,
        "prompt_template": DISTILL_PROMPT_TEMPLATE,
        "min_document_tokens": 30,
        "min_segment_tokens": 10,
        "max_input_tokens": 2000,
        "max_output_tokens": 1598,
    },
    "extract_knowledge": {
        "system_prompt": None,  # ExtractKnowledgeStage uses no system prompt
        "prompt_template": EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE,
        "min_document_tokens": 30,
        "min_segment_tokens": 30,
        "max_input_tokens": 1400,
        "max_output_tokens": 1400,
    },
    "knowledge_list": {
        "system_prompt": None,  # KnowledgeListStage uses no system prompt
        "prompt_template": KNOWLEDGE_LIST_PROMPT_TEMPLATE,
        "min_document_tokens": 30,
        "min_segment_tokens": 30,
        "max_input_tokens": 1000,
        "max_output_tokens": 598,
    },
    "wikipedia_paraphrasing": {
        "system_prompt": NEMOTRON_CC_SYSTEM_PROMPT,
        "prompt_template": WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE,
        "min_document_tokens": 5,
        "min_segment_tokens": 5,
        "max_input_tokens": 512,
        "max_output_tokens": 510,
    },
}

GENERATION_CONFIG = {
    "MAX_INPUT_TOKENS": 2000,
    "MAX_OUTPUT_TOKENS": 1600,
    "TOP_K": 0,
    "TOP_P": 0.9,
    "TEMPERATURE": 0.5,
}

Source Code References