Specialized Processing#

Domain-specific processing for code, bitext, synthetic data, and advanced curation tasks using NeMo Curator’s specialized modules.

This section covers processing techniques for data types and use cases that require specialized handling beyond general text processing. These tools target specific domains: programming content, parallel text, AI-generated content, and benchmark contamination.

How it Works#

Specialized processing modules in NeMo Curator are designed for specific data types and use cases:

  • Code Processing: Handles programming languages with syntax-aware filtering

  • Bitext Processing: Manages parallel text for translation quality assessment

  • Synthetic Data Detection: Identifies AI-generated or synthetic content

  • Task Decontamination: Removes benchmark data from training sets

Each specialized processor understands the unique characteristics of its target domain and applies appropriate metrics and thresholds.
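
For illustration, here is a minimal sketch of that shared pattern, assuming a JSONL dataset under a hypothetical code_files/ directory with a "content" field: wrap the domain filter in ScoreFilter, which computes the metric, optionally records it, and drops documents outside the thresholds.

from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import NumberOfLinesOfCodeFilter

# Hypothetical input location and schema ("content" holds the source text)
dataset = DocumentDataset.read_json("code_files/")

# ScoreFilter computes the domain metric, stores it in score_field,
# and keeps only documents that pass the filter's thresholds
filtered = ScoreFilter(
    NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000),
    text_field="content",
    score_field="line_count",
)(dataset)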


Available Specialized Tools#

  • Code Processing: Specialized filters for programming content and source code (see Code Filtering)

  • Parallel Text (Bitext): Filter parallel text for translation quality and alignment (see Bitext Filtering)

  • Synthetic Data Detection: Identify AI-generated or synthetic content in datasets (see Synthetic Text Detection)

  • Task Decontamination: Remove downstream task data from training datasets (see Downstream Task Decontamination)

Usage#

Quick Examples#

Code filtering:

from nemo_curator import Sequential, ScoreFilter
from nemo_curator.filters import PythonCommentToCodeFilter, NumberOfLinesOfCodeFilter

# Filter Python code based on quality metrics
code_pipeline = Sequential([
    ScoreFilter(
        PythonCommentToCodeFilter(
            min_comment_to_code_ratio=0.01,
            max_comment_to_code_ratio=0.8
        ),
        text_field="content",
        score_field="comment_ratio"
    ),
    ScoreFilter(
        NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000),
        text_field="content", 
        score_field="line_count"
    )
])

filtered_code = code_pipeline(code_dataset)

Bitext quality filtering:

from nemo_curator.filters import QualityEstimationFilter

# Filter translation pairs for quality
qe_filter = QualityEstimationFilter(
    model_name="comet-qe",
    cutoff=0.5,
    mode="always_en_x",
    src_field="source",
    tgt_field="target",
    metadata_fields=["src_lang", "tgt_lang"]
)

high_quality_translations = qe_filter(parallel_dataset)

Synthetic data detection:

from nemo_curator import Sequential, ScoreFilter
from nemo_curator.filters.synthetic import EasinessFilter, AnswerabilityFilter

# Detect synthetic QA pairs
synthetic_pipeline = Sequential([
    ScoreFilter(
        EasinessFilter(
            base_url="https://api-endpoint",
            percentile=0.7,
            text_fields=["context", "question"]
        ),
        text_field=["context", "question"],
        score_field="easiness_score"
    ),
    ScoreFilter(
        AnswerabilityFilter(
            base_url="https://llm-endpoint",
            text_fields=["context", "question"]
        ),
        text_field=["context", "question"],
        score_field="answerability_score"
    )
])

authentic_qa = synthetic_pipeline(qa_dataset)

Task decontamination:

from nemo_curator import TaskDecontamination
from nemo_curator.tasks import Squad, TriviaQA, Winogrande

# Remove benchmark contamination
decontaminate = TaskDecontamination([
    Squad(),
    TriviaQA(), 
    Winogrande()
])

clean_dataset = decontaminate(training_dataset)

When to Use Specialized Processing#

  • Code datasets: When working with programming content that needs syntax-aware filtering

  • Multilingual datasets: When processing parallel text for machine translation

  • Synthetic data: When detecting AI-generated content in training datasets

  • Benchmark preparation: When ensuring training data doesn’t contain evaluation tasks

Performance Considerations#

  • Code processing: Fast, heuristic-based filtering suitable for large code repositories

  • Bitext processing: May require API calls for quality estimation, so consider rate limits

  • Synthetic detection: API-dependent and can be computationally expensive for large datasets

  • Task decontamination: One-time preprocessing step; cache results for reuse, as sketched below
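
As a sketch of that caching step, assuming DocumentDataset's JSON/Parquet I/O helpers and hypothetical input and output paths:

from nemo_curator import TaskDecontamination
from nemo_curator.datasets import DocumentDataset
from nemo_curator.tasks import Squad, TriviaQA, Winogrande

dataset = DocumentDataset.read_json("training_data/")

# Run decontamination once, then persist the result so later runs
# skip the expensive n-gram matching against the benchmark tasks
clean_dataset = TaskDecontamination([Squad(), TriviaQA(), Winogrande()])(dataset)
clean_dataset.to_parquet("decontaminated/")

# Subsequent pipelines start from the cached copy
cached = DocumentDataset.read_parquet("decontaminated/")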