Nemotron-CC Pipelines
Nemotron-CC provides synthetic data generation (SDG) workflows that transform existing text documents and extract knowledge from them. Unlike simple prompt-based generation, these pipelines combine preprocessing, LLM-based transformation, and postprocessing to produce high-quality training data.
The Composable Pipeline Pattern
Nemotron-CC stages follow a composable pattern with three distinct phases:
- Preprocessing: Segment documents, filter by length, and prepare inputs for the LLM
- Generation: Apply task-specific prompts to transform text using the LLM
- Postprocessing: Clean outputs, remove formatting artifacts, and filter low-quality results
This separation enables fine-grained control over each phase while providing reusable helper functions for common patterns.
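The three phases can be sketched as plain Python functions composed in sequence. This is only an illustration of the pattern, not the NeMo Curator API: `preprocess`, `generate`, `postprocess`, `run_pipeline`, and the stand-in `fake_llm` callable are hypothetical names, and the length limits and prompt are arbitrary placeholders.

```python
from typing import Callable, List


def preprocess(docs: List[str], min_words: int = 3) -> List[str]:
    """Split documents into paragraph segments and drop segments that are too short."""
    segments = []
    for doc in docs:
        for segment in doc.split("\n\n"):
            segment = segment.strip()
            if len(segment.split()) >= min_words:
                segments.append(segment)
    return segments


def generate(segments: List[str], llm: Callable[[str], str], prompt: str) -> List[str]:
    """Apply a task-specific prompt to every segment using the supplied LLM callable."""
    return [llm(f"{prompt}\n\n{segment}") for segment in segments]


def postprocess(outputs: List[str], min_chars: int = 10) -> List[str]:
    """Strip simple formatting artifacts and filter out very short results."""
    cleaned = [out.replace("**", "").strip() for out in outputs]
    return [out for out in cleaned if len(out) >= min_chars]


def run_pipeline(docs: List[str], llm: Callable[[str], str], prompt: str) -> List[str]:
    """Compose the three phases: preprocess, then generate, then postprocess."""
    return postprocess(generate(preprocess(docs), llm, prompt))


def fake_llm(text: str) -> str:
    """Hypothetical stand-in for a real LLM client: upper-cases the last input line."""
    return "**" + text.splitlines()[-1].upper() + "**"


docs = ["Too short.\n\nThe quick brown fox jumps over the lazy dog."]
results = run_pipeline(docs, fake_llm, prompt="Rewrite the passage:")
print(results)
```

Because each phase is an ordinary function, any one of them can be swapped out (for example, a different segmenter or a stricter output filter) without touching the other two.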
Pipeline Architecture
Input Data Requirements
Before running a Nemotron-CC pipeline, prepare your input data as Parquet files with the required schema.
Required Schema
Quality Score Field
The bucketed_results field contains quality scores that determine which pipeline processes each document:
- High-quality documents (bucketed_results > 11): Process with DiverseQA, Distill, ExtractKnowledge, or KnowledgeList tasks
- Low-quality documents (bucketed_results <= 11): Process with WikipediaParaphrasing to improve text quality
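The routing rule above can be sketched with plain Python dictionaries. The `id` field and the sample records are made up for illustration; only the `bucketed_results` field, the threshold of 11, and the task names come from this page.

```python
QUALITY_THRESHOLD = 11  # threshold stated in the list above

HIGH_QUALITY_TASKS = ["DiverseQA", "Distill", "ExtractKnowledge", "KnowledgeList"]
LOW_QUALITY_TASKS = ["WikipediaParaphrasing"]


def tasks_for(record: dict) -> list:
    """Return the task names that apply to a record, based on bucketed_results."""
    if record["bucketed_results"] > QUALITY_THRESHOLD:
        return HIGH_QUALITY_TASKS
    return LOW_QUALITY_TASKS


# Hypothetical records; a real input would be rows read from the Parquet files.
records = [
    {"id": "doc-1", "bucketed_results": 15},
    {"id": "doc-2", "bucketed_results": 11},  # exactly 11 counts as low quality
    {"id": "doc-3", "bucketed_results": 4},
]

high = [r["id"] for r in records if r["bucketed_results"] > QUALITY_THRESHOLD]
low = [r["id"] for r in records if r["bucketed_results"] <= QUALITY_THRESHOLD]
print("high-quality:", high)
print("low-quality:", low)
```

Note that the comparison is strict: a document scoring exactly 11 falls into the low-quality pipeline.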
Generating Quality Scores
Use NeMo Curator’s quality assessment tools to generate quality scores before running SDG pipelines:
The example above uses FineWebEduClassifier, which outputs scores from 0 to 5. To work with the Nemotron-CC threshold of 11, you can do one of the following:
- Scale the scores (e.g., multiply by 4)
- Adjust the filter threshold in your SDG pipeline
- Use a different classifier that outputs scores in the 0-20 range
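The first two options are arithmetically equivalent, as the short sketch below shows. It assumes a simple linear scaling by 4, as suggested in the list; whether scaled classifier scores match the meaning of the original Nemotron-CC buckets is not established here, and the names `scale_score` and `RAW_THRESHOLD` are illustrative only.

```python
NEMOTRON_CC_THRESHOLD = 11
SCALE_FACTOR = 4  # maps the 0-5 classifier range onto 0-20


def scale_score(raw_score: float) -> float:
    """Option 1: scale a 0-5 classifier score to the 0-20 range."""
    if not 0 <= raw_score <= 5:
        raise ValueError(f"expected a score between 0 and 5, got {raw_score}")
    return raw_score * SCALE_FACTOR


# Option 2: keep the raw scores and move the threshold instead.
RAW_THRESHOLD = NEMOTRON_CC_THRESHOLD / SCALE_FACTOR  # 2.75 on the 0-5 scale

for raw in (2, 2.75, 3, 5):
    scaled = scale_score(raw)
    is_high = scaled > NEMOTRON_CC_THRESHOLD
    # Both options must classify every score the same way.
    assert is_high == (raw > RAW_THRESHOLD)
    print(f"raw={raw} scaled={scaled} high_quality={is_high}")
```

With integer classifier scores, this means scores of 3 and above go to the high-quality pipeline and scores of 2 and below go to the low-quality pipeline.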
For detailed information on quality scoring options, see Quality Assessment & Filtering.
Example Data
An example Parquet file with the correct schema is available in the tutorials directory:
You can inspect its structure:
Available Tasks
Nemotron-CC provides five specialized generation tasks, each designed for specific data transformation needs:
Quality-Based Processing Strategy
Nemotron-CC pipelines process data according to its quality score. The typical approach is to split the data into two pipelines:
High-Quality Data Pipeline
For documents with high quality scores, use tasks that leverage the existing quality:
- DiverseQA: Generate Q&A pairs from well-structured content
- Distill: Create condensed versions preserving key information
- ExtractKnowledge: Extract factual passages
- KnowledgeList: Extract structured facts
Low-Quality Data Pipeline
For documents with lower quality scores, use WikipediaParaphrasing to improve text quality:
Using Helper Functions
The recommended approach is to use the helper functions in nemotron_cc_pipelines.py:
The nemotron_cc_pipelines helper functions are provided in the tutorials directory, not as part of the installed package. Copy the nemotron_cc_pipelines.py file to your project or reference the patterns when building custom pipelines.
Task Configuration
Each task has specific token count and preprocessing requirements: