nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc
Module Contents
Classes
API
Dataclass
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Post-processing stage for DiverseQA outputs. It parses the raw generated QA list, normalizes bullet markers, optionally samples QA pairs based on the input length (measured with the tokenizer), and concatenates the original document text with the selected QA pairs.
input_field
max_num_pairs
name
prefix
qa_field
tokenizer
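The post-processing described above (parse, normalize bullets, optionally sample, concatenate) can be sketched in plain Python. This is an illustrative sketch only, not the stage's actual implementation: the function name `postprocess_diverse_qa`, the `count_tokens` callable standing in for a tokenizer, the `tokens_per_pair` budget, and the `Question:` line convention are all assumptions made for the example.

```python
import random


def postprocess_diverse_qa(text, qa_raw, prefix="", max_num_pairs=1,
                           count_tokens=None, tokens_per_pair=150, rng=None):
    """Hypothetical sketch: append parsed QA pairs to the original text."""
    rng = rng or random.Random()

    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in qa_raw.splitlines() if ln.strip()]

    # Normalize bullets: remove a leading "- " or "* " marker.
    lines = [ln[2:].strip() if ln.startswith(("- ", "* ")) else ln for ln in lines]

    # Drop the prefix line if the model echoed it back.
    if prefix and lines and lines[0] == prefix:
        lines = lines[1:]

    # Group lines into pairs; each pair starts at a "Question:" line (assumed format).
    pairs = []
    for ln in lines:
        if ln.startswith("Question:"):
            pairs.append(ln)
        elif pairs:
            pairs[-1] += "\n" + ln

    if not pairs:
        return text

    # Optional sampling: longer inputs may keep more pairs (assumed rule).
    if count_tokens is not None:
        cap = min(max_num_pairs, max(1, count_tokens(text) // tokens_per_pair))
        pairs = rng.sample(pairs, min(len(pairs), cap))

    return text + "\n\n" + "\n\n".join(pairs)
```

Without a `count_tokens` callable the sketch keeps every parsed pair in its original order; with one, it keeps a random subset whose size is capped by `max_num_pairs` and the input length.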
Dataclass
Bases: BaseSyntheticStage
input_field
max_num_pairs
output_field
prefix
prompt
system_prompt
tokenizer
Dataclass
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Post-processing stage that formats knowledge list outputs generated by the LLM. It normalizes leading bullet markers and trims indentation, producing a clean newline-separated list.
input_field
name
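The knowledge-list clean-up described above can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions, not the stage's own code: the function name `format_knowledge_list` and the set of bullet markers it recognizes (`-`, `*`, `•`) are hypothetical.

```python
import re

# Assumed bullet markers: "-", "*", or "•", followed by whitespace.
_BULLET = re.compile(r"^[-*\u2022]\s+")


def format_knowledge_list(raw):
    """Hypothetical sketch: strip bullets and indentation, one item per line."""
    items = []
    for line in raw.splitlines():
        line = line.strip()              # trim indentation and trailing spaces
        if not line:
            continue                     # skip blank lines
        items.append(_BULLET.sub("", line))  # remove a leading bullet marker
    return "\n".join(items)
```

For example, calling `format_knowledge_list` on an indented, mixed-bullet list returns the same items as a flat newline-separated string with the markers removed.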