stages.synthetic.nemotron_cc.nemotron_cc

Module Contents

Classes

DistillStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

DiverseQAPostProcessingStage

Post-processing stage for DiverseQA outputs. It parses the raw generated QA list, normalizes bullets, optionally samples pairs based on input length/tokenizer, and concatenates the original document text with the selected QA pairs.

DiverseQAStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

ExtractKnowledgeStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

KnowledgeListPostProcessingStage

Post-processing stage that formats knowledge list outputs generated by the LLM. It normalizes leading bullet markers and trims indentation, producing a clean newline-separated list.

KnowledgeListStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

WikipediaParaphrasingStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

API

class stages.synthetic.nemotron_cc.nemotron_cc.DistillStage

Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

input_field: str

'text'

output_field: str

'distill'

prompt: str

None

system_prompt: str

None
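The generation stages in this module share the same four attributes (`input_field`, `output_field`, `prompt`, `system_prompt`). The sketch below illustrates how such attributes could fit together. It is not NeMo Curator code: `run_generation_stage`, the `generate` callable, and the `{document}` placeholder in the prompt are all hypothetical names chosen for this example, and plain dictionaries stand in for a DocumentBatch.

```python
from typing import Callable, Dict, List, Optional


def run_generation_stage(
    rows: List[Dict[str, str]],
    generate: Callable[[str, Optional[str]], str],
    prompt: str,
    system_prompt: Optional[str] = None,
    input_field: str = "text",
    output_field: str = "distill",
) -> List[Dict[str, str]]:
    """Fill the prompt with each row's input text and store the model reply."""
    out = []
    for row in rows:
        # Assumption: the prompt template exposes a {document} placeholder.
        user_message = prompt.format(document=row[input_field])
        reply = generate(user_message, system_prompt)
        # Copy the row so the input data is left untouched.
        out.append({**row, output_field: reply})
    return out


def fake_llm(user_message: str, system_prompt: Optional[str]) -> str:
    # Stand-in for a real model call; it only reports the prompt length.
    return f"summary of {len(user_message)} chars"


rows = [{"text": "Water boils at 100 C at sea level."}]
result = run_generation_stage(rows, fake_llm, prompt="Distill this text:\n{document}")
```

Swapping `output_field` for `'diverse_qa'`, `'extract_knowledge'`, `'knowledge_list'`, or `'rephrased'` would mirror the defaults listed for the other generation stages on this page.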

class stages.synthetic.nemotron_cc.nemotron_cc.DiverseQAPostProcessingStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Post-processing stage for DiverseQA outputs. It parses the raw generated QA list, normalizes bullets, optionally samples pairs based on input length/tokenizer, and concatenates the original document text with the selected QA pairs.

input_field: str

'text'

max_num_pairs: int

10

property name: str

prefix: str

'Here are the questions and answers based on the provided text:'

process(batch: nemo_curator.tasks.DocumentBatch) -> nemo_curator.tasks.DocumentBatch

Process a task and return the result.

Args:
    task (X): Input task to process

Returns (Y | list[Y]):
    - Single task: For 1-to-1 transformations
    - List of tasks: For 1-to-many transformations (e.g., readers)
    - None: If the task should be filtered out
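The three return shapes described for `process` (a single task, a list of tasks, or None) can be handled uniformly by a caller. The following is a minimal sketch of that contract using plain dictionaries as tasks; `run_stage`, `keep_long`, and `split_words` are hypothetical helpers for illustration and are not part of the documented API.

```python
from typing import Callable, List, Union

Task = dict  # hypothetical stand-in for a real task object


def run_stage(
    process: Callable[[Task], Union[Task, List[Task], None]],
    tasks: List[Task],
) -> List[Task]:
    """Apply `process` to each task and flatten the three return shapes."""
    results: List[Task] = []
    for task in tasks:
        out = process(task)
        if out is None:
            continue              # task was filtered out
        if isinstance(out, list):
            results.extend(out)   # 1-to-many transformation
        else:
            results.append(out)   # 1-to-1 transformation
    return results


def keep_long(task: Task):
    # Returns the task unchanged, or None to filter it out.
    return task if len(task["text"]) > 3 else None


def split_words(task: Task):
    # Returns one new task per whitespace-separated word.
    return [{"text": word} for word in task["text"].split()]


tasks = [{"text": "hi"}, {"text": "hello world"}]
kept = run_stage(keep_long, tasks)
words = run_stage(split_words, tasks)
```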

qa_field: str

'diverse_qa'

tokenizer: transformers.AutoTokenizer | None

None
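The post-processing steps described for this class (parse the QA list, normalize bullets, optionally sample pairs by input length, append the pairs to the document text) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the library's implementation: the `Question:` line marker, the `count_tokens` callable (standing in for a tokenizer), the `tokens_per_pair` budget, and the `seed` parameter are all hypothetical.

```python
import random
from typing import Callable, List, Optional

PREFIX = "Here are the questions and answers based on the provided text:"


def parse_qa_pairs(raw: str, prefix: str = PREFIX) -> List[str]:
    """Strip bullets and the prefix line, then group lines into QA pairs."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    # Normalize "- " and "* " bullets to plain lines.
    lines = [ln[2:].strip() if ln[:2] in ("- ", "* ") else ln for ln in lines]
    if lines and lines[0] == prefix:
        lines = lines[1:]
    pairs: List[str] = []
    for ln in lines:
        if ln.startswith("Question:"):  # assumed marker for a new pair
            pairs.append(ln)
        elif pairs:
            pairs[-1] += "\n" + ln      # answer or continuation line
    return pairs


def postprocess(
    text: str,
    raw_qa: str,
    max_num_pairs: int = 10,
    count_tokens: Optional[Callable[[str], int]] = None,
    tokens_per_pair: int = 150,  # hypothetical budget, not a documented value
    seed: int = 0,
) -> str:
    pairs = parse_qa_pairs(raw_qa)
    if not pairs:
        return text
    if count_tokens is None:
        selected = pairs[:max_num_pairs]
    else:
        # Shorter documents keep fewer pairs; always keep at least one.
        limit = max(1, min(max_num_pairs, count_tokens(text) // tokens_per_pair))
        selected = random.Random(seed).sample(pairs, min(limit, len(pairs)))
    return text + "\n\n" + "\n\n".join(selected)


doc = "Water boils at 100 C at sea level."
raw = (
    PREFIX
    + "\n- Question: What boils at 100 C?\n  Answer: Water."
    + "\n- Question: Where?\n  Answer: At sea level.\n"
)
full = postprocess(doc, raw)
sampled = postprocess(doc, raw, count_tokens=lambda s: len(s.split()))
```

Passing a fixed `seed` keeps the sampling reproducible in this sketch; whether the real stage seeds its sampling is not stated on this page.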

class stages.synthetic.nemotron_cc.nemotron_cc.DiverseQAStage

Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

input_field: str

'text'

max_num_pairs: int

10

output_field: str

'diverse_qa'

prefix: str

'Here are the questions and answers based on the provided text:'

prompt: str

None

system_prompt: str

None

tokenizer: transformers.AutoTokenizer

None

class stages.synthetic.nemotron_cc.nemotron_cc.ExtractKnowledgeStage

Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

input_field: str

'text'

output_field: str

'extract_knowledge'

prompt: str

None

system_prompt: str

None

class stages.synthetic.nemotron_cc.nemotron_cc.KnowledgeListPostProcessingStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Post-processing stage that formats knowledge list outputs generated by the LLM. It normalizes leading bullet markers and trims indentation, producing a clean newline-separated list.

input_field: str

'knowledge_list'

property name: str

process(batch: nemo_curator.tasks.DocumentBatch) -> nemo_curator.tasks.DocumentBatch

Process a task and return the result.

Args:
    task (X): Input task to process

Returns (Y | list[Y]):
    - Single task: For 1-to-1 transformations
    - List of tasks: For 1-to-many transformations (e.g., readers)
    - None: If the task should be filtered out
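The formatting this class is described as performing (normalize leading bullet markers, trim indentation, emit a newline-separated list) can be approximated with a short helper. This is a minimal sketch, not the library's code: `format_knowledge_list` is a hypothetical name, and the set of recognized bullet characters (`-`, `*`, `•`) and the dropping of blank lines are assumptions.

```python
import re

# Leading whitespace, one bullet character, then at least one space.
_BULLET = re.compile(r"^\s*[-*•]\s+")


def format_knowledge_list(raw: str) -> str:
    """Remove bullet markers and indentation; drop blank lines."""
    items = []
    for line in raw.splitlines():
        cleaned = _BULLET.sub("", line).strip()
        if cleaned:
            items.append(cleaned)
    return "\n".join(items)


raw = "- Water boils at 100 C.\n  * Ice melts at 0 C.\n\n- Steam is water vapor.\n"
formatted = format_knowledge_list(raw)
```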

class stages.synthetic.nemotron_cc.nemotron_cc.KnowledgeListStage

Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

input_field: str

'text'

output_field: str

'knowledge_list'

prompt: str

None

system_prompt: str

None

class stages.synthetic.nemotron_cc.nemotron_cc.WikipediaParaphrasingStage

Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage

A simple stage for generating synthetic data. It takes an empty task and a prompt, and produces its output in the form of a DocumentBatch.

input_field: str

'text'

output_field: str

'rephrased'

prompt: str

None

system_prompt: str

None