Nemotron-CC Pipelines
Nemotron-CC provides synthetic data generation workflows that transform existing text documents and extract knowledge from them. Unlike simple generation, these pipelines combine preprocessing, LLM-based transformation, and postprocessing to create high-quality training data.
Nemotron-CC stages follow a composable pattern with three distinct phases: preprocessing, LLM-based transformation, and postprocessing.
This separation enables fine-grained control over each phase while providing reusable helper functions for common patterns.
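As an illustration of this three-phase pattern, the following sketch chains a preprocessing step, a generation step, and a postprocessing step over a small list of documents. It is a minimal, library-free sketch: `preprocess`, `generate`, `postprocess`, `run_pipeline`, and the stub `fake_llm` are hypothetical names chosen for illustration and are not NeMo Curator APIs.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]


def preprocess(docs: List[Document], min_words: int = 3) -> List[Document]:
    """Phase 1: drop documents too short to be worth sending to the LLM."""
    return [d for d in docs if len(d["text"].split()) >= min_words]


def generate(docs: List[Document], llm: Callable[[str], str]) -> List[Document]:
    """Phase 2: call the LLM once per document and keep the raw response."""
    return [{**d, "generated": llm(d["text"])} for d in docs]


def postprocess(docs: List[Document]) -> List[Document]:
    """Phase 3: strip surrounding whitespace and drop empty generations."""
    cleaned = [{**d, "generated": d["generated"].strip()} for d in docs]
    return [d for d in cleaned if d["generated"]]


def run_pipeline(docs: List[Document], llm: Callable[[str], str]) -> List[Document]:
    """Compose the three phases in order."""
    return postprocess(generate(preprocess(docs), llm))


def fake_llm(text: str) -> str:
    """Stand-in for a real model call (assumption: any str -> str callable)."""
    return f"  Summary: {text.upper()}  "


docs = [
    {"text": "too short"},                  # removed in phase 1 (2 words)
    {"text": "the quick brown fox jumps"},  # survives all three phases
]
results = run_pipeline(docs, fake_llm)
print(results)
```

Because each phase is an ordinary function over a list of documents, any one of them can be swapped or tested in isolation, which is the kind of fine-grained control the separation is meant to provide.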
Before running a Nemotron-CC pipeline, prepare your input data as Parquet files with the required schema.
The bucketed_results field contains quality scores that determine which pipeline processes each document:
- High quality (bucketed_results > 11): Process with DiverseQA, Distill, ExtractKnowledge, or KnowledgeList tasks
- Low quality (bucketed_results <= 11): Process with WikipediaParaphrasing to improve text quality

Use NeMo Curator’s quality assessment tools to generate quality scores before running SDG pipelines:
The example above uses FineWebEduClassifier, which outputs scores from 0 to 5. For the Nemotron-CC threshold of 11, you can either:
For detailed information on quality scoring options, see Quality Assessment & Filtering.
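The routing rule described above can be sketched as follows. This is a minimal illustration, assuming each document is a dictionary whose bucketed_results value is a single integer score; the `route_by_quality` helper and the `id` field are hypothetical and not part of NeMo Curator.

```python
from typing import Dict, List, Tuple

# Threshold stated in the text: scores above 11 count as high quality.
QUALITY_THRESHOLD = 11

Document = Dict[str, object]


def route_by_quality(
    docs: List[Document], threshold: int = QUALITY_THRESHOLD
) -> Tuple[List[Document], List[Document]]:
    """Split documents into (high_quality, low_quality) by bucketed_results."""
    high: List[Document] = []
    low: List[Document] = []
    for doc in docs:
        if doc["bucketed_results"] > threshold:
            high.append(doc)  # candidates for DiverseQA, Distill, etc.
        else:
            low.append(doc)   # candidates for WikipediaParaphrasing
    return high, low


docs = [
    {"id": "a", "bucketed_results": 15},
    {"id": "b", "bucketed_results": 11},  # equal to the threshold: low bucket
    {"id": "c", "bucketed_results": 4},
]
high_quality, low_quality = route_by_quality(docs)
print([d["id"] for d in high_quality], [d["id"] for d in low_quality])
```

Note that a score exactly equal to the threshold falls into the low-quality bucket, matching the `<= 11` condition.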
An example Parquet file with the correct schema is available in the tutorials directory:
You can inspect its structure:
Nemotron-CC provides five specialized generation tasks (DiverseQA, Distill, ExtractKnowledge, KnowledgeList, and WikipediaParaphrasing), each designed for a specific data transformation need.
Nemotron-CC pipelines are designed to process data based on quality scores. The typical approach:
For documents with high quality scores, use tasks that leverage the existing quality:
For documents with lower quality scores, use Wikipedia Paraphrasing to improve text quality:
The recommended approach is to use the helper functions in nemotron_cc_pipelines.py:
The nemotron_cc_pipelines helper functions are provided in the tutorials directory, not as part of the installed package. Copy the nemotron_cc_pipelines.py file into your project, or use its patterns as a reference when building custom pipelines.
Each task has specific token count and preprocessing requirements:
All five Nemotron-CC tasks have equivalents backed by NeMo Data Designer (NDD), which replaces the AsyncOpenAIClient for execution. These stages share the same input_field, output_field, and prompt interface, but configure the LLM through NDD’s ModelConfig and ModelProvider.
Import the NDD-backed stages from nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc:
The NDD backend provides automatic token metric collection and supports both local InferenceServer and remote NVIDIA NIM endpoints. See the NeMo Data Designer guide for full configuration details.