WER Filtering#

Filter audio samples based on Word Error Rate (WER) thresholds to ensure high-quality transcription accuracy in your speech datasets. WER filtering is the primary quality control mechanism for ASR-based audio curation.

Understanding WER#

What is Word Error Rate?#

Word Error Rate measures transcription accuracy by calculating the percentage of words that differ between ground truth and ASR predictions:

WER = (Substitutions + Deletions + Insertions) / Total_Reference_Words × 100

WER Quality Levels#

WER Range

Quality Level

Recommended Use

0-10%

Excellent

Production ASR training, high-quality datasets

10-25%

Good

General ASR training, most applications

25-50%

Moderate

Supplementary training data, domain adaptation

50-75%

Poor

Review required, potential filtering

75%+

Very Poor

Strong candidate for removal

Basic WER Filtering#

Calculate WER#

from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Calculate WER for audio samples
wer_stage = GetPairwiseWerStage(
    text_key="text",           # Ground truth transcription
    pred_text_key="pred_text", # ASR model prediction
    wer_key="wer"             # Output field for WER value
)

# Add to pipeline
pipeline.add_stage(wer_stage)

Apply WER Threshold#

from nemo_curator.stages.audio.common import PreserveByValueStage

# Keep samples with WER ≤ 30% (good quality)
wer_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le"  # less than or equal
)

pipeline.add_stage(wer_filter)

Advanced WER Filtering#

Statistical WER Filtering#

For statistical analysis-based threshold selection, you can analyze your dataset’s WER distribution and then apply the calculated threshold using NeMo Curator’s PreserveByValueStage:

# Apply calculated statistical threshold
statistical_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=calculated_threshold,  # From your statistical analysis
    operator="le"
)

Domain-Specific WER Filtering#

Conversational Speech#

# More lenient thresholds for conversational speech
conversational_wer_config = {
    "excellent_threshold": 15.0,  # vs. 10.0 for read speech
    "good_threshold": 35.0,       # vs. 25.0 for read speech  
    "acceptable_threshold": 60.0   # vs. 50.0 for read speech
}

conversational_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=conversational_wer_config["good_threshold"],
    operator="le"
)

Broadcast and News#

# Stricter thresholds for high-quality broadcast speech
broadcast_wer_config = {
    "excellent_threshold": 5.0,   # Very strict
    "good_threshold": 15.0,       # Stricter than general
    "acceptable_threshold": 25.0   # Maximum for broadcast quality
}

broadcast_filter = PreserveByValueStage(
    input_value_key="wer", 
    target_value=broadcast_wer_config["good_threshold"],
    operator="le"
)