stages.synthetic.nemotron_cc.nemotron_cc
Module Contents
Classes
| Class | Description |
| --- | --- |
| DistillStage | A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch. |
| DiverseQAPostProcessingStage | Post-processing stage for DiverseQA outputs. It parses the raw generated QA list, normalizes bullets, optionally samples pairs based on input length/tokenizer, and concatenates the original document text with the selected QA pairs. |
| DiverseQAStage | A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch. |
| ExtractKnowledgeStage | A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch. |
| KnowledgeListPostProcessingStage | Post-processing stage that formats knowledge list outputs generated by the LLM. It normalizes leading bullet markers and trims indentation, producing a clean newline-separated list. |
| KnowledgeListStage | A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch. |
| WikipediaParaphrasingStage | A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch. |
API
- class stages.synthetic.nemotron_cc.nemotron_cc.DistillStage
Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage
A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch.
- input_field: str
'text'
- output_field: str
'distill'
- prompt: str
None
- system_prompt: str
None
- class stages.synthetic.nemotron_cc.nemotron_cc.DiverseQAPostProcessingStage
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]
Post-processing stage for DiverseQA outputs. It parses the raw generated QA list, normalizes bullets, optionally samples pairs based on input length/tokenizer, and concatenates the original document text with the selected QA pairs.
- input_field: str
'text'
- max_num_pairs: int
10
- property name: str
- prefix: str
'Here are the questions and answers based on the provided text:'
- process(batch: nemo_curator.tasks.DocumentBatch)
Process a task and return the result.
  Args:
    task (X): Input task to process
  Returns (Y | list[Y]):
    - Single task: For 1-to-1 transformations
    - List of tasks: For 1-to-many transformations (e.g., readers)
    - None: If the task should be filtered out
- qa_field: str
'diverse_qa'
- tokenizer: transformers.AutoTokenizer | None
None
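The post-processing described for DiverseQAPostProcessingStage can be sketched roughly as below. This is a standalone illustration, not the library's implementation: the function name `postprocess_diverse_qa`, the `seed` parameter, and the assumption that the generated text uses `Question:` / `Answer:` line prefixes are all hypothetical. The tokenizer-based length heuristic is omitted, and a plain cap of `max_num_pairs` stands in for it.

```python
import random
import re

# Default prefix taken from the stage's documented `prefix` field.
PREFIX = "Here are the questions and answers based on the provided text:"


def postprocess_diverse_qa(text, raw_qa, max_num_pairs=10, seed=0):
    """Illustrative only: parse QA lines, normalize bullets, sample, concatenate."""
    # Keep non-empty lines and strip one leading bullet marker from each.
    lines = [
        re.sub(r"^[-*•]\s+", "", line.strip())
        for line in raw_qa.splitlines()
        if line.strip()
    ]
    # Assumed format: a "Question:" line immediately followed by an "Answer:" line.
    pairs = [
        f"{q}\n{a}"
        for q, a in zip(lines, lines[1:])
        if q.startswith("Question:") and a.startswith("Answer:")
    ]
    # Assumption: cap the number of pairs by sampling without replacement,
    # keeping the surviving pairs in their original order.
    if len(pairs) > max_num_pairs:
        rng = random.Random(seed)
        keep = sorted(rng.sample(range(len(pairs)), max_num_pairs))
        pairs = [pairs[i] for i in keep]
    if not pairs:
        return text  # nothing parseable, so return the document unchanged
    return f"{text}\n\n{PREFIX}\n" + "\n\n".join(pairs)
```

Returning the document unchanged when no pairs are found is a choice made for this sketch; the actual stage may handle that case differently.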
- class stages.synthetic.nemotron_cc.nemotron_cc.DiverseQAStage
Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage
A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch.
- input_field: str
'text'
- max_num_pairs: int
10
- output_field: str
'diverse_qa'
- prefix: str
'Here are the questions and answers based on the provided text:'
- prompt: str
None
- system_prompt: str
None
- tokenizer: transformers.AutoTokenizer
None
- class stages.synthetic.nemotron_cc.nemotron_cc.ExtractKnowledgeStage
Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage
A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch.
- input_field: str
'text'
- output_field: str
'extract_knowledge'
- prompt: str
None
- system_prompt: str
None
- class stages.synthetic.nemotron_cc.nemotron_cc.KnowledgeListPostProcessingStage
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]
Post-processing stage that formats knowledge list outputs generated by the LLM. It normalizes leading bullet markers and trims indentation, producing a clean newline-separated list.
- input_field: str
'knowledge_list'
- property name: str
- process(batch: nemo_curator.tasks.DocumentBatch)
Process a task and return the result.
  Args:
    task (X): Input task to process
  Returns (Y | list[Y]):
    - Single task: For 1-to-1 transformations
    - List of tasks: For 1-to-many transformations (e.g., readers)
    - None: If the task should be filtered out
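The bullet normalization and indentation trimming described for KnowledgeListPostProcessingStage might look something like the sketch below. It is a minimal illustration under stated assumptions, not the stage's actual code: the helper name `format_knowledge_list`, the set of recognized bullet markers, and the choice to drop blank lines are all hypothetical.

```python
import re


def format_knowledge_list(raw: str) -> str:
    """Illustrative only: clean an LLM-generated bullet list into plain lines."""
    cleaned = []
    for line in raw.splitlines():
        stripped = line.strip()  # trim indentation and trailing whitespace
        if not stripped:
            continue  # assumption: blank lines are dropped
        # Remove one leading bullet marker ("-", "*", or "•") followed by whitespace.
        cleaned.append(re.sub(r"^[-*•]\s+", "", stripped))
    return "\n".join(cleaned)
```

For example, `format_knowledge_list("- a\n  * b\n\n• c")` returns the three items `a`, `b`, and `c`, one per line.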
- class stages.synthetic.nemotron_cc.nemotron_cc.KnowledgeListStage
Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage
A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch.
- input_field: str
'text'
- output_field: str
'knowledge_list'
- prompt: str
None
- system_prompt: str
None
- class stages.synthetic.nemotron_cc.nemotron_cc.WikipediaParaphrasingStage
Bases: nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage
A simple stage for generating synthetic data. It takes in an Empty task and a prompt and produces the output in the form of a DocumentBatch.
- input_field: str
'text'
- output_field: str
'rephrased'
- prompt: str
None
- system_prompt: str
None