Synthetic Text Detection#

NVIDIA NeMo Curator provides specialized filters for identifying and filtering synthetic or AI-generated content in your dataset. These filters help ensure the quality and authenticity of training data, particularly for question-answering systems and other applications where data provenance is important.

NVIDIA NeMo Curator’s synthetic text detection addresses several key challenges, including identifying content generated by language models, filtering out trivial questions, ensuring questions are actually answerable given their context, and maintaining diversity in training datasets. These filters are particularly valuable when creating high-quality datasets for question-answering systems, retrieval tasks, and other applications where the relationship between questions and answers is important.

How It Works#

Synthetic text detection identifies AI-generated content by targeting specific patterns and characteristics that are common in machine-generated text. Large language models often produce content that exhibits certain detectable properties:

  1. Lexical Similarity: AI-generated questions often have high lexical overlap with their contexts, making them too trivial for meaningful model training

  2. Superficial Questions: Synthetic text may contain questions that appear reasonable but lack substantive depth

  3. Context Mismatches: Generated QA pairs may include questions that can’t actually be answered from the provided context

NeMo Curator addresses these issues using two complementary approaches:

EasinessFilter identifies synthetic content by detecting excessive lexical similarity between questions and contexts. AI-generated questions often reuse phrases directly from the context with minor modifications, resulting in trivial retrieval tasks. These questions don’t provide meaningful training signals and can be identified using embedding-based similarity detection.
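Conceptually, the easiness check embeds the context and the question, scores each pair by cosine similarity, and drops the highest-scoring (easiest) pairs. The sketch below only illustrates the idea and is not NeMo Curator's implementation; embed() stands in for whatever embedding API you call, and the cutoff mirrors the percentile parameter described later, where a higher value filters more aggressively.

import numpy as np

def easiness_scores(pairs, embed):
    # embed() is a placeholder: it must return one vector per input string.
    scores = []
    for context, question in pairs:
        c = np.asarray(embed(context), dtype=float)
        q = np.asarray(embed(question), dtype=float)
        # High cosine similarity means the question largely restates the
        # context, i.e. the pair is "too easy" to be a useful training signal.
        scores.append(float(c @ q / (np.linalg.norm(c) * np.linalg.norm(q))))
    return scores

def keep_mask(scores, percentile=0.7):
    # With percentile=0.7, keep only the 30% of pairs with the lowest
    # similarity, mirroring the "filter out the easiest 70%" behavior
    # described in the examples below.
    threshold = np.percentile(scores, (1.0 - percentile) * 100)
    return [s <= threshold for s in scores]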

AnswerabilityFilter identifies questions that can’t be answered from their contexts, a common issue in synthetic data. This filter uses language models to determine if questions are genuinely answerable from the provided contexts, helping to identify content that appears superficially coherent but lacks real semantic relationships.
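In essence, the answerability check sends each context-question pair to an LLM with a judging prompt and keeps the pair only when the judge answers affirmatively. The sketch below uses the OpenAI-compatible chat client purely to illustrate that round trip; it is not NeMo Curator's internal code, and the endpoint, model name, and prompt wording are placeholders.

from openai import OpenAI

client = OpenAI(base_url="https://your-llm-api-endpoint", api_key="your-api-key")

def is_answerable(context: str, question: str) -> bool:
    # Ask the judging model whether the question can be answered from the context.
    response = client.chat.completions.create(
        model="gpt-model-name",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You judge whether questions are answerable from a given context."},
            {"role": "user",
             "content": f"Context: {context}\n\nQuestion: {question}\n\n"
                        "Is this question answerable from the given context? Answer Y or N."},
        ],
    )
    # Keep the pair only if the judge answers "Y".
    return response.choices[0].message.content.strip().upper().startswith("Y")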

Together, these filters provide a comprehensive approach to detecting and filtering synthetic text, preserving only authentic question-answer relationships that contribute meaningful training signals.


Usage#

Here’s a complete example of applying synthetic text filters to a dataset:

import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters.synthetic import EasinessFilter, AnswerabilityFilter

# Load your dataset with questions and contexts
dataset = DocumentDataset.read_json("qa_dataset/*.jsonl")

# Create an easiness filter to remove trivial questions
easiness_filter = nc.ScoreFilter(
    EasinessFilter(
        base_url="https://your-embedding-api-endpoint",
        api_key="your-api-key",
        model="embedding-model-name",
        percentile=0.7,  # Filter out easiest 70% of questions
        truncate="NONE",
        batch_size=1,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="easiness_score"
)

# Create an answerability filter to ensure questions can be answered
answerability_filter = nc.ScoreFilter(
    AnswerabilityFilter(
        base_url="https://your-llm-api-endpoint",
        api_key="your-api-key",
        model="gpt-model-name",
        answerability_system_prompt="You are an expert at determining if questions can be answered from the given context.",
        answerability_user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nIs this question answerable from the given context? Answer Y or N.",
        num_criteria=1,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="answerability_score"
)

# Apply the filters in sequence
filtered_dataset = easiness_filter(dataset)
filtered_dataset = answerability_filter(filtered_dataset)

# Save the results
filtered_dataset.to_json("filtered_qa_dataset/", write_to_filename=True)
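The example above assumes each JSONL record carries the fields named in text_fields ("text" for the context and "question" for the question); other fields are simply carried along. If you need a tiny test input, a sketch like the following would produce one (the records are made up):

import json
import os

os.makedirs("qa_dataset", exist_ok=True)

records = [
    {"text": "The Nile is the longest river in Africa.",
     "question": "Which river is the longest in Africa?"},
    {"text": "Photosynthesis converts sunlight into chemical energy.",
     "question": "What does photosynthesis convert sunlight into?"},
]

# One JSON object per line, as expected by DocumentDataset.read_json
with open("qa_dataset/sample.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")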

Available Filters#

| Filter | Description | Key Parameters |
| --- | --- | --- |
| EasinessFilter | Identifies questions that are too easy to retrieve | base_url, api_key, model, percentile, truncate, batch_size, text_fields |
| AnswerabilityFilter | Ensures questions are answerable from their context | base_url, api_key, model, answerability_system_prompt, answerability_user_prompt_template, num_criteria, text_fields |

EasinessFilter#

The EasinessFilter uses embedding models to identify questions that are too easily retrievable given their context. This helps filter out trivial questions that don’t provide meaningful training signal:

easiness_filter = nc.ScoreFilter(
    EasinessFilter(
        base_url="https://your-embedding-api-endpoint",
        api_key="your-api-key",
        model="embedding-model-name",
        percentile=0.7,  # Filter out easiest 70% of questions
        truncate="NONE",  # Options: "NONE", "START", "END"
        batch_size=10,    # Process 10 docs at once
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="easiness_score"
)

Key Parameters#

  • base_url: API endpoint for the embedding service

  • api_key: Authentication key for the API

  • model: Name of the embedding model to use

  • percentile: Percentile threshold for filtering (higher values filter more aggressively)

  • truncate: Text truncation strategy (NONE, START, END)

  • batch_size: Number of documents to process in each batch

  • text_fields: List of field names containing the context and question

AnswerabilityFilter#

The AnswerabilityFilter uses large language models to determine if a question can be answered based on the provided context. This helps ensure the quality of question-answer pairs:

answerability_filter = nc.ScoreFilter(
    AnswerabilityFilter(
        base_url="https://your-llm-api-endpoint",
        api_key="your-api-key",
        model="gpt-model-name",
        answerability_system_prompt="You are an expert at determining if questions can be answered from the given context.",
        answerability_user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nEvaluate the following criteria:\n1. Is the question answerable from the context? (Y/N)\n2. Is the answer clearly stated in the context? (Y/N)\n3. Would answering require external knowledge? (Y/N)\n\nFormat your response as JSON: {\"criterion_1\": \"Y\", \"criterion_2\": \"Y\", \"criterion_3\": \"N\"}",
        num_criteria=3,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="answerability_score"
)

Key Parameters#

  • base_url: API endpoint for the LLM service

  • api_key: Authentication key for the API

  • model: Name of the language model to use

  • answerability_system_prompt: System prompt for the LLM

  • answerability_user_prompt_template: Template for the user prompt with {text} and {question} placeholders (see the rendering sketch after this list)

  • num_criteria: Number of criteria to evaluate in the response

  • text_fields: List of field names containing the context and question
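The {text} and {question} placeholders are filled in like Python str.format fields, which is why literal braces in the template (for example, a requested JSON response format) are written doubled as {{ and }}, as in the multi-criteria example below. Here is a minimal sketch of the rendering, plus one way a num_criteria-style JSON response could be checked (the template, field values, and response are illustrative only):

import json

template = (
    "Context: {text}\n\nQuestion: {question}\n\n"
    "Is the question answerable from the context? (Y/N)\n"
    'Format your response as JSON: {{"criterion_1": "Y"}}'
)

# {text} and {question} are filled in; {{ and }} become literal braces.
prompt = template.format(
    text="The Nile is the longest river in Africa.",
    question="Which river is the longest in Africa?",
)

def passes_all_criteria(llm_response: str, num_criteria: int) -> bool:
    # Expect a JSON object with keys criterion_1 .. criterion_N, each "Y" or "N".
    answers = json.loads(llm_response)
    return all(answers.get(f"criterion_{i}") == "Y" for i in range(1, num_criteria + 1))

print(passes_all_criteria('{"criterion_1": "Y"}', num_criteria=1))  # True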

Advanced Configuration#

Multi-Criteria Answerability#

You can configure the AnswerabilityFilter with multiple evaluation criteria:

# Define a prompt with multiple criteria
multi_criteria_prompt = """
Context: {text}

Question: {question}

Evaluate the following criteria:
1. Is the question answerable from the context? (Y/N)
2. Is the answer clearly stated in the context? (Y/N) 
3. Does the question require reasoning or inference? (Y/N)
4. Is the question relevant to the main topic of the context? (Y/N)

Format your response as JSON: 
{{"criterion_1": "Y", "criterion_2": "Y", "criterion_3": "Y", "criterion_4": "Y"}}
"""

# Configure the filter with 4 criteria
advanced_filter = AnswerabilityFilter(
    base_url="https://your-llm-api-endpoint",
    api_key="your-api-key",
    model="gpt-4",
    answerability_system_prompt="You are an expert at evaluating question-context pairs.",
    answerability_user_prompt_template=multi_criteria_prompt,
    num_criteria=4,
    text_fields=["text", "question"]
)

Custom Embedding Models#

You can use different embedding models for the EasinessFilter:

# Using a domain-specific embedding model
domain_filter = EasinessFilter(
    base_url="https://your-embedding-api-endpoint",
    api_key="your-api-key",
    model="domain-specific-embedding-model",
    percentile=0.6,  # Less aggressive filtering
    truncate="END",
    batch_size=5,
    text_fields=["text", "question"]
)

Best Practices#

Balancing Dataset Size and Quality#

When using synthetic text filters, consider these best practices:

  1. Start with less aggressive filtering: Begin with lower percentile thresholds for the EasinessFilter

    # Start with a conservative threshold
    conservative_filter = EasinessFilter(
        base_url="https://api-endpoint",
        api_key="your-key",
        model="embedding-model",
        percentile=0.5,  # Filter out only the easiest 50%
        truncate="NONE",
        batch_size=1,
        text_fields=["text", "question"]
    )
    
  2. Evaluate filter impact: Analyze the distribution of filtered questions

    # Count rows before filtering (DocumentDataset wraps a Dask DataFrame)
    before_count = len(dataset.df)
    
    # Apply one of the ScoreFilter steps defined above, then count again
    filtered_dataset = easiness_filter(dataset)
    after_count = len(filtered_dataset.df)
    
    # Rejection rate
    rejection_rate = (before_count - after_count) / before_count
    print(f"Rejected {rejection_rate * 100:.2f}% of questions")
    
  3. Preserve diversity: Ensure your filters don’t eliminate valid question types

    # Materialize both datasets as pandas DataFrames for inspection
    # (assumes the data fits comfortably in memory)
    all_df = dataset.df.compute()
    kept_df = filtered_dataset.df.compute()
    
    # Sample rejected questions for manual review
    rejected = all_df.loc[~all_df.index.isin(kept_df.index)]
    sample = rejected.sample(min(100, len(rejected)))
    
    # Export for manual review
    sample.to_json("rejected_sample.jsonl", orient="records", lines=True)
    

Use Cases#

Building a high-quality QA dataset by chaining both filters:

# Pipeline for high-quality QA dataset creation
qa_pipeline = nc.Sequential([
    # First filter out questions that are too easy to retrieve
    nc.ScoreFilter(
        EasinessFilter(
            base_url="https://api-endpoint",
            api_key="your-key",
            model="embedding-model",
            percentile=0.7,
            truncate="NONE",
            batch_size=1,
            text_fields=["text", "question"]
        ),
        text_field=["text", "question"],
        score_field="easiness_score"
    ),
    
    # Then ensure remaining questions are answerable
    nc.ScoreFilter(
        AnswerabilityFilter(
            base_url="https://llm-endpoint",
            api_key="your-key",
            model="llm-model",
            answerability_system_prompt="You are a helpful assistant.",
            answerability_user_prompt_template="Context: {text}\nQuestion: {question}\nIs this answerable? (Y/N)",
            num_criteria=1,
            text_fields=["text", "question"]
        ),
        text_field=["text", "question"],
        score_field="answerability_score"
    )
])

# Apply the pipeline
high_quality_qa = qa_pipeline(dataset)

Screening a dataset for likely synthetic questions with a more aggressive easiness threshold:

# Filter out likely synthetic questions
synthetic_filter = nc.ScoreFilter(
    EasinessFilter(
        base_url="https://api-endpoint",
        api_key="your-key",
        model="embedding-model",
        percentile=0.8,  # More aggressive filtering
        truncate="NONE",
        batch_size=1,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="easiness_score"
)

# Apply to dataset
human_like_questions = synthetic_filter(dataset)

By applying these specialized synthetic text filters, you can create higher-quality datasets for question-answering systems and other applications where the quality of question-context relationships is critical.