---
description: >-
  Advanced synthetic data generation using Nemotron-CC pipelines for text
  transformation and knowledge extraction
categories:
  - workflows
tags:
  - nemotron-cc
  - paraphrasing
  - knowledge-extraction
  - distillation
personas:
  - data-scientist-focused
  - mle-focused
difficulty: advanced
content_type: workflow
modality: text-only
---

# Nemotron-CC Pipelines

Nemotron-CC provides advanced synthetic data generation workflows for transforming and extracting knowledge from existing text documents. Unlike simple generation, these pipelines combine preprocessing, LLM-based transformation, and postprocessing to create high-quality training data.

## The Composable Pipeline Pattern

Nemotron-CC stages follow a composable pattern with three distinct phases:

1. **Preprocessing**: Segment documents, filter by length, and prepare inputs for the LLM
2. **Generation**: Apply task-specific prompts to transform text using the LLM
3. **Postprocessing**: Clean outputs, remove formatting artifacts, and filter low-quality results

This separation enables fine-grained control over each phase while providing reusable helper functions for common patterns.

## Pipeline Architecture

```mermaid
flowchart TB
    subgraph "Preprocessing"
        A[Input Documents] --> B[Token Count Filter]
        B --> C[Document Splitter]
        C --> D[Segment Filter]
        D --> E[Document Joiner]
    end

    subgraph "LLM Generation"
        E --> F["Task-Specific Stage<br/>WikiPara/DiverseQA/Distill/etc."]
    end

    subgraph "Postprocessing"
        F --> G[Token Count Filter]
        G --> H[Markdown Remover]
        H --> I[Task-Specific Cleanup]
        I --> J[Quality Filter]
    end

    J --> K[Output Dataset]
```

## Input Data Requirements

Before running a Nemotron-CC pipeline, prepare your input data as Parquet files with the required schema.

### Required Schema

| Column             | Type     | Description                                                                                                                                             |
| ------------------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id`               | `int64`  | Unique document identifier. Required by the preprocessing pipeline to reassemble document segments after splitting.                                     |
| `text`             | `string` | Document content to transform. This is the primary input field for all Nemotron-CC stages.                                                              |
| `bucketed_results` | `int64`  | Quality score used to route documents to appropriate pipelines. Values typically range from 0 to 20, where higher scores indicate higher quality content. |

### Quality Score Field

The `bucketed_results` field contains quality scores that determine which pipeline processes each document:

* **High-quality documents** (`bucketed_results > 11`): Process with DiverseQA, Distill, ExtractKnowledge, or KnowledgeList tasks
* **Low-quality documents** (`bucketed_results <= 11`): Process with WikipediaParaphrasing to improve text quality

### Generating Quality Scores

Use NeMo Curator's quality assessment tools to generate quality scores before running synthetic data generation (SDG) pipelines:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import ParquetWriter
from nemo_curator.stages.text.classifiers import FineWebEduClassifier
from nemo_curator.stages.text.modules import AddId

# Create pipeline to score documents
pipeline = Pipeline(name="quality_scoring")

# Read raw documents
pipeline.add_stage(JsonlReader(file_paths="raw_data/*.jsonl", fields=["text"]))

# Add unique document IDs
pipeline.add_stage(AddId(id_field="id"))

# Score document quality (outputs int score 0-5)
pipeline.add_stage(
    FineWebEduClassifier(
        int_score_field="bucketed_results",  # Use this as quality score
    )
)

# Save as Parquet for SDG pipeline
pipeline.add_stage(ParquetWriter(path="scored_data/"))

results = pipeline.run()
```

The example above uses `FineWebEduClassifier`, which outputs scores from 0 to 5. For the Nemotron-CC threshold of 11, you can either:

* Scale the scores (e.g., multiply by 4)
* Adjust the filter threshold in your SDG pipeline
* Use a different classifier that outputs scores in the 0-20 range

For detailed information on quality scoring options, see [Quality Assessment & Filtering](/curate-text/process-data/quality-assessment/heuristic).
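To make the scaling option concrete, the following is a minimal sketch in plain Python of how 0-5 classifier scores could be mapped onto the 0-20 range and then routed using the threshold of 11 described above. The names `scale_score`, `route`, `SCALE_FACTOR`, and the sample documents are hypothetical and are not part of NeMo Curator; the factor of 4 is the example multiplier mentioned in the list above.

```python
# Hypothetical helpers for illustration only; not part of NeMo Curator.
NEMOTRON_CC_THRESHOLD = 11  # threshold quoted in this guide
SCALE_FACTOR = 4            # assumption: maps 0-5 scores onto 0-20


def scale_score(score: int, factor: int = SCALE_FACTOR) -> int:
    """Scale a 0-5 classifier score onto the 0-20 range."""
    if not 0 <= score <= 5:
        raise ValueError(f"expected a score in 0-5, got {score}")
    return score * factor


def route(bucketed_result: int, threshold: int = NEMOTRON_CC_THRESHOLD) -> str:
    """Return which pipeline family a document should go to."""
    return "high_quality" if bucketed_result > threshold else "low_quality"


# Made-up sample documents carrying raw 0-5 scores
docs = [{"id": 0, "score": 2}, {"id": 1, "score": 3}, {"id": 2, "score": 5}]
routed = {d["id"]: route(scale_score(d["score"])) for d in docs}
print(routed)
```

With a factor of 4, the scaled values are multiples of 4, so `> 11` is equivalent to a raw score of 3 or higher. Adjusting the filter threshold on the unscaled scores instead would therefore select the same documents.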
### Example Data

An example Parquet file with the correct schema is available in the tutorials directory:

```bash
tutorials/synthetic/nemotron_cc/example_data/data.parquet
```

You can inspect its structure:

```python
import pandas as pd

df = pd.read_parquet("tutorials/synthetic/nemotron_cc/example_data/data.parquet")
print(df.columns.tolist())  # ['id', 'text', 'bucketed_results']
print(df.head(2))
```

***

## Available Tasks

Nemotron-CC provides five specialized generation tasks, each designed for specific data transformation needs:

| Task                   | Stage Class                  | Purpose                                   | Use Case                       |
| ---------------------- | ---------------------------- | ----------------------------------------- | ------------------------------ |
| Wikipedia Paraphrasing | `WikipediaParaphrasingStage` | Rewrite text as Wikipedia-style prose     | Improving noisy web data       |
| Diverse QA             | `DiverseQAStage`             | Generate diverse Q\&A pairs               | Reading comprehension training |
| Distill                | `DistillStage`               | Create condensed, informative paraphrases | Knowledge distillation         |
| Extract Knowledge      | `ExtractKnowledgeStage`      | Extract factual content as passages       | Knowledge base creation        |
| Knowledge List         | `KnowledgeListStage`         | Extract structured fact lists             | Fact extraction                |

## Quality-Based Processing Strategy

Nemotron-CC pipelines are designed to process data based on quality scores.
The typical approach:

### High-Quality Data Pipeline

For documents with high quality scores, use tasks that leverage the existing quality:

* **DiverseQA**: Generate Q\&A pairs from well-structured content
* **Distill**: Create condensed versions preserving key information
* **ExtractKnowledge**: Extract factual passages
* **KnowledgeList**: Extract structured facts

```python
from nemo_curator.stages.text.modules.score_filter import Filter

# Filter for high-quality documents (score > 11)
pipeline.add_stage(
    Filter(
        filter_fn=lambda x: int(x) > 11,
        filter_field="bucketed_results",
    ),
)
```

### Low-Quality Data Pipeline

For documents with lower quality scores, use Wikipedia Paraphrasing to improve text quality:

```python
# Filter for low-quality documents (score <= 11)
pipeline.add_stage(
    Filter(
        filter_fn=lambda x: int(x) <= 11,
        filter_field="bucketed_results",
    ),
)
```

## Using Helper Functions

The recommended approach is to use the helper functions in `nemotron_cc_pipelines.py`. These helpers are provided in the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py), not as part of the installed package. Copy `nemotron_cc_pipelines.py` into your project, or use its patterns as a reference when building custom pipelines.
```python
from nemotron_cc_pipelines import (
    add_preprocessing_pipeline,
    add_diverse_qa_postprocessing_pipeline,
)
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAStage

# SYSTEM_PROMPT, PROMPT_TEMPLATE, llm_client, generation_config, and args
# are assumed to be defined elsewhere in your script.
pipeline = Pipeline(name="diverse_qa_pipeline")

# Add preprocessing
pipeline = add_preprocessing_pipeline(
    pipeline=pipeline,
    text_field="text",
    system_prompt=SYSTEM_PROMPT,
    user_prompt_template=PROMPT_TEMPLATE,
    min_document_tokens=30,
    min_segment_tokens=30,
    max_input_tokens=1000,
    args=args,  # Contains tokenizer config
)

# Add generation stage
pipeline.add_stage(
    DiverseQAStage(
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        generation_config=generation_config,
        input_field="text",
        output_field="diverse_qa",
    )
)

# Add postprocessing
pipeline = add_diverse_qa_postprocessing_pipeline(
    pipeline=pipeline,
    llm_response_field="diverse_qa",
    args=args,
)
```

## Task Configuration

Each task has specific token count and preprocessing requirements:

| Task                   | Min Doc Tokens | Min Segment Tokens | Max Input Tokens | Max Output Tokens |
| ---------------------- | -------------- | ------------------ | ---------------- | ----------------- |
| Diverse QA             | 30             | 30                 | 1000             | 600               |
| Distill                | 30             | 10                 | 2000             | 1600              |
| Extract Knowledge      | 30             | 30                 | 1400             | 1400              |
| Knowledge List         | 30             | 30                 | 1000             | 600               |
| Wikipedia Paraphrasing | 5              | 5                  | 512              | 512               |

## Quick Example

```python
import os

from transformers import AutoTokenizer

from nemo_curator.core.client import RayClient
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.models.client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAStage
from nemo_curator.stages.text.io.reader.parquet import ParquetReader
from nemo_curator.stages.text.io.writer.parquet import ParquetWriter

# Initialize
client = RayClient(include_dashboard=False)
client.start()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Create LLM client
llm_client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)

# Build pipeline (see "Using Helper Functions" section for preprocessing/postprocessing)
pipeline = Pipeline(name="nemotron_cc_diverse_qa")
pipeline.add_stage(ParquetReader(file_paths=["./input_data/*.parquet"]))

# Add preprocessing stages using helper function:
# pipeline = add_preprocessing_pipeline(pipeline, text_field="text", ...)

# Add generation stage
pipeline.add_stage(
    DiverseQAStage(
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        generation_config=GenerationConfig(temperature=0.5, top_p=0.9),
        input_field="text",
        output_field="diverse_qa",
    )
)

# Add postprocessing stages using helper function:
# pipeline = add_diverse_qa_postprocessing_pipeline(pipeline, llm_response_field="diverse_qa", ...)

pipeline.add_stage(ParquetWriter(path="./output/"))

# Execute
executor = XennaExecutor()
results = pipeline.run(executor)

client.stop()
```

***

## Detailed Reference

Detailed reference for each Nemotron-CC stage, prompts, and post-processing.