Synthetic Text Detection#

NVIDIA NeMo Curator provides specialized filters for identifying and filtering synthetic or AI-generated content in your dataset. These filters help ensure the quality and authenticity of training data, particularly for question-answering systems and other applications where data provenance is important.

NVIDIA NeMo Curator’s synthetic text detection addresses several key challenges, including identifying content generated by language models, filtering out trivial questions, ensuring questions are actually answerable given their context, and maintaining diversity in training datasets. These filters are particularly valuable when creating high-quality datasets for question-answering systems, retrieval tasks, and other applications where the relationship between questions and answers is important.

How It Works#

Synthetic text detection identifies AI-generated content by targeting specific patterns and characteristics that are common in machine-generated text. Large language models often produce content that exhibits certain detectable properties:

  1. Lexical Similarity: AI-generated questions often have high lexical overlap with their contexts, making them too trivial for meaningful model training

  2. Superficial Questions: Synthetic text may contain questions that appear reasonable but lack substantive depth

  3. Context Mismatches: Generated QA pairs may include questions that can’t actually be answered from the provided context

NeMo Curator addresses these issues using two complementary approaches:

EasinessFilter identifies synthetic content by detecting excessive lexical similarity between questions and contexts. AI-generated questions often reuse phrases directly from the context with minor modifications, resulting in trivial retrieval tasks. These questions don’t provide meaningful training signals and can be identified using embedding-based similarity detection.
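Conceptually, the easiness check embeds the context and the question, scores each pair by cosine similarity, and drops the highest-scoring (easiest) pairs. The sketch below only illustrates the idea and is not NeMo Curator's implementation; embed() stands in for whatever embedding API you call, and the cutoff mirrors the percentile parameter described later, where a higher value filters more aggressively.

import numpy as np

def easiness_scores(pairs, embed):
    # embed() is a placeholder: it must return one vector per input string.
    scores = []
    for context, question in pairs:
        c = np.asarray(embed(context), dtype=float)
        q = np.asarray(embed(question), dtype=float)
        # High cosine similarity means the question largely restates the
        # context, i.e. the pair is "too easy" to be a useful training signal.
        scores.append(float(c @ q / (np.linalg.norm(c) * np.linalg.norm(q))))
    return scores

def keep_mask(scores, percentile=0.7):
    # With percentile=0.7, keep only the 30% of pairs with the lowest
    # similarity, mirroring the "filter out the easiest 70%" behavior
    # described in the examples below.
    threshold = np.percentile(scores, (1.0 - percentile) * 100)
    return [s <= threshold for s in scores]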

AnswerabilityFilter identifies questions that can’t be answered from their contexts, a common issue in synthetic data. This filter uses language models to determine if questions are genuinely answerable from the provided contexts, helping to identify content that appears superficially coherent but lacks real semantic relationships.
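In essence, the answerability check sends each context-question pair to an LLM with a judging prompt and keeps the pair only when the judge answers affirmatively. The sketch below uses the OpenAI-compatible chat client purely to illustrate that round trip; it is not NeMo Curator's internal code, and the endpoint, model name, and prompt wording are placeholders.

from openai import OpenAI

client = OpenAI(base_url="https://your-llm-api-endpoint", api_key="your-api-key")

def is_answerable(context: str, question: str) -> bool:
    # Ask the judging model whether the question can be answered from the context.
    response = client.chat.completions.create(
        model="gpt-model-name",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You judge whether questions are answerable from a given context."},
            {"role": "user",
             "content": f"Context: {context}\n\nQuestion: {question}\n\n"
                        "Is this question answerable from the given context? Answer Y or N."},
        ],
    )
    # Keep the pair only if the judge answers "Y".
    return response.choices[0].message.content.strip().upper().startswith("Y")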

Together, these filters provide a comprehensive approach to detecting and filtering synthetic text, preserving only authentic question-answer relationships that contribute meaningful training signals.


Usage#

Here’s a complete example of applying synthetic text filters to a dataset:

import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters.synthetic import EasinessFilter, AnswerabilityFilter

# Load your dataset with questions and contexts
dataset = DocumentDataset.read_json("qa_dataset/*.jsonl")

# Create an easiness filter to remove trivial questions
easiness_filter = nc.ScoreFilter(
    EasinessFilter(
        base_url="https://your-embedding-api-endpoint",
        api_key="your-api-key",
        model="embedding-model-name",
        percentile=0.7,  # Filter out easiest 70% of questions
        truncate="NONE",
        batch_size=1,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="easiness_score"
)

# Create an answerability filter to ensure questions can be answered
answerability_filter = nc.ScoreFilter(
    AnswerabilityFilter(
        base_url="https://your-llm-api-endpoint",
        api_key="your-api-key",
        model="gpt-model-name",
        answerability_system_prompt="You are an expert at determining if questions can be answered from the given context.",
        answerability_user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nIs this question answerable from the given context? Answer Y or N.",
        num_criteria=1,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="answerability_score"
)

# Apply the filters in sequence
filtered_dataset = easiness_filter(dataset)
filtered_dataset = answerability_filter(filtered_dataset)

# Save the results
filtered_dataset.to_json("filtered_qa_dataset/", write_to_filename=True)
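The example above assumes each JSONL record carries the fields named in text_fields ("text" for the context and "question" for the question); other fields are simply carried along. If you need a tiny test input, a sketch like the following would produce one (the records are made up):

import json
import os

os.makedirs("qa_dataset", exist_ok=True)

records = [
    {"text": "The Nile is the longest river in Africa.",
     "question": "Which river is the longest in Africa?"},
    {"text": "Photosynthesis converts sunlight into chemical energy.",
     "question": "What does photosynthesis convert sunlight into?"},
]

# One JSON object per line, as expected by DocumentDataset.read_json
with open("qa_dataset/sample.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")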

Available Filters#

| Filter | Description | Key Parameters |
| --- | --- | --- |
| EasinessFilter | Identifies questions that are too easy to retrieve | base_url, api_key, model, percentile, truncate, batch_size, text_fields |
| AnswerabilityFilter | Ensures questions are answerable from their context | base_url, api_key, model, answerability_system_prompt, answerability_user_prompt_template, num_criteria, text_fields |

EasinessFilter#

The EasinessFilter uses embedding models to identify questions that are too easily retrievable given their context. This helps filter out trivial questions that don’t provide meaningful training signal:

easiness_filter = nc.ScoreFilter(
    EasinessFilter(
        base_url="https://your-embedding-api-endpoint",
        api_key="your-api-key",
        model="embedding-model-name",
        percentile=0.7,  # Filter out easiest 70% of questions
        truncate="NONE",  # Options: "NONE", "START", "END"
        batch_size=10,    # Process 10 docs at once
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="easiness_score"
)

Key Parameters#

  • base_url: API endpoint for the embedding service

  • api_key: Authentication key for the API

  • model: Name of the embedding model to use

  • percentile: Percentile threshold for filtering (higher values filter more aggressively)

  • truncate: Text truncation strategy (NONE, START, END)

  • batch_size: Number of documents to process in each batch

  • text_fields: List of field names containing the context and question

AnswerabilityFilter#

The AnswerabilityFilter uses large language models to determine if a question can be answered based on the provided context. This helps ensure the quality of question-answer pairs:

answerability_filter = nc.ScoreFilter(
    AnswerabilityFilter(
        base_url="https://your-llm-api-endpoint",
        api_key="your-api-key",
        model="gpt-model-name",
        answerability_system_prompt="You are an expert at determining if questions can be answered from the given context.",
        answerability_user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nEvaluate the following criteria:\n1. Is the question answerable from the context? (Y/N)\n2. Is the answer clearly stated in the context? (Y/N)\n3. Would answering require external knowledge? (Y/N)\n\nFormat your response as JSON: {\"criterion_1\": \"Y\", \"criterion_2\": \"Y\", \"criterion_3\": \"N\"}",
        num_criteria=3,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="answerability_score"
)

Key Parameters#

  • base_url: API endpoint for the LLM service

  • api_key: Authentication key for the API

  • model: Name of the language model to use

  • answerability_system_prompt: System prompt for the LLM

  • answerability_user_prompt_template: Template for the user prompt with {text} and {question} placeholders (see the rendering sketch after this list)

  • num_criteria: Number of criteria to evaluate in the response

  • text_fields: List of field names containing the context and question
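The {text} and {question} placeholders are filled in like Python str.format fields, which is why literal braces in the template (for example, a requested JSON response format) are written doubled as {{ and }}, as in the multi-criteria example below. Here is a minimal sketch of the rendering, plus one way a num_criteria-style JSON response could be checked (the template, field values, and response are illustrative only):

import json

template = (
    "Context: {text}\n\nQuestion: {question}\n\n"
    "Is the question answerable from the context? (Y/N)\n"
    'Format your response as JSON: {{"criterion_1": "Y"}}'
)

# {text} and {question} are filled in; {{ and }} become literal braces.
prompt = template.format(
    text="The Nile is the longest river in Africa.",
    question="Which river is the longest in Africa?",
)

def passes_all_criteria(llm_response: str, num_criteria: int) -> bool:
    # Expect a JSON object with keys criterion_1 .. criterion_N, each "Y" or "N".
    answers = json.loads(llm_response)
    return all(answers.get(f"criterion_{i}") == "Y" for i in range(1, num_criteria + 1))

print(passes_all_criteria('{"criterion_1": "Y"}', num_criteria=1))  # True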

Advanced Configuration#

Multi-Criteria Answerability#

You can configure the AnswerabilityFilter with multiple evaluation criteria:

# Define a prompt with multiple criteria
multi_criteria_prompt = """
Context: {text}

Question: {question}

Evaluate the following criteria:
1. Is the question answerable from the context? (Y/N)
2. Is the answer clearly stated in the context? (Y/N) 
3. Does the question require reasoning or inference? (Y/N)
4. Is the question relevant to the main topic of the context? (Y/N)

Format your response as JSON: 
{{"criterion_1": "Y", "criterion_2": "Y", "criterion_3": "Y", "criterion_4": "Y"}}
"""

# Configure the filter with 4 criteria
advanced_filter = AnswerabilityFilter(
    base_url="https://your-llm-api-endpoint",
    api_key="your-api-key",
    model="gpt-4",
    answerability_system_prompt="You are an expert at evaluating question-context pairs.",
    answerability_user_prompt_template=multi_criteria_prompt,
    num_criteria=4,
    text_fields=["text", "question"]
)

Custom Embedding Models#

You can use different embedding models for the EasinessFilter:

# Using a domain-specific embedding model
domain_filter = EasinessFilter(
    base_url="https://your-embedding-api-endpoint",
    api_key="your-api-key",
    model="domain-specific-embedding-model",
    percentile=0.6,  # Less aggressive filtering
    truncate="END",
    batch_size=5,
    text_fields=["text", "question"]
)

Best Practices#

Balancing Dataset Size and Quality#

When using synthetic text filters, consider these best practices:

  1. Start with less aggressive filtering: Begin with lower percentile thresholds for the EasinessFilter

    # Start with a conservative threshold
    conservative_filter = EasinessFilter(
        base_url="https://api-endpoint",
        api_key="your-key",
        model="embedding-model",
        percentile=0.5,  # Filter out only the easiest 50%
        truncate="NONE",
        batch_size=1,
        text_fields=["text", "question"]
    )
    
  2. Evaluate filter impact: Analyze the distribution of filtered questions

    # Count rows before filtering (DocumentDataset wraps a Dask DataFrame)
    before_count = len(dataset.df)
    
    # Apply one of the ScoreFilter steps defined above, then count again
    filtered_dataset = easiness_filter(dataset)
    after_count = len(filtered_dataset.df)
    
    # Rejection rate
    rejection_rate = (before_count - after_count) / before_count
    print(f"Rejected {rejection_rate * 100:.2f}% of questions")
    
  3. Preserve diversity: Ensure your filters don’t eliminate valid question types

    # Materialize both datasets as pandas DataFrames for inspection
    # (assumes the data fits comfortably in memory)
    all_df = dataset.df.compute()
    kept_df = filtered_dataset.df.compute()
    
    # Sample rejected questions for manual review
    rejected = all_df.loc[~all_df.index.isin(kept_df.index)]
    sample = rejected.sample(min(100, len(rejected)))
    
    # Export for manual review
    sample.to_json("rejected_sample.jsonl", orient="records", lines=True)
    

Use Cases#

Building a high-quality QA dataset by chaining both filters:

# Pipeline for high-quality QA dataset creation
qa_pipeline = nc.Sequential([
    # First filter out questions that are too easy to retrieve
    nc.ScoreFilter(
        EasinessFilter(
            base_url="https://api-endpoint",
            api_key="your-key",
            model="embedding-model",
            percentile=0.7,
            truncate="NONE",
            batch_size=1,
            text_fields=["text", "question"]
        ),
        text_field=["text", "question"],
        score_field="easiness_score"
    ),
    
    # Then ensure remaining questions are answerable
    nc.ScoreFilter(
        AnswerabilityFilter(
            base_url="https://llm-endpoint",
            api_key="your-key",
            model="llm-model",
            answerability_system_prompt="You are a helpful assistant.",
            answerability_user_prompt_template="Context: {text}\nQuestion: {question}\nIs this answerable? (Y/N)",
            num_criteria=1,
            text_fields=["text", "question"]
        ),
        text_field=["text", "question"],
        score_field="answerability_score"
    )
])

# Apply the pipeline
high_quality_qa = qa_pipeline(dataset)

Screening a dataset for likely synthetic questions with a more aggressive easiness threshold:

# Filter out likely synthetic questions
synthetic_filter = nc.ScoreFilter(
    EasinessFilter(
        base_url="https://api-endpoint",
        api_key="your-key",
        model="embedding-model",
        percentile=0.8,  # More aggressive filtering
        truncate="NONE",
        batch_size=1,
        text_fields=["text", "question"]
    ),
    text_field=["text", "question"],
    score_field="easiness_score"
)

# Apply to dataset
human_like_questions = synthetic_filter(dataset)

By applying these specialized synthetic text filters, you can create higher-quality datasets for question-answering systems and other applications where the quality of question-context relationships is critical.