WER Filtering
Filter audio samples based on Word Error Rate (WER) thresholds to ensure high-quality transcription accuracy in your speech datasets. WER filtering is the primary quality control mechanism for ASR-based audio curation.
Understanding WER
What is Word Error Rate?
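Word Error Rate (WER) measures transcription accuracy as the number of word-level errors in the ASR prediction relative to the number of words in the ground truth transcription, expressed as a percentage:
WER = (Substitutions + Deletions + Insertions) / Total_Reference_Words × 100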
Components:
- Substitutions: Words incorrectly replaced (for example, “cat” → “hat”)
- Deletions: Words omitted from the prediction
- Insertions: Extra words added to the prediction
- Total_Reference_Words: Total word count in ground truth transcription
A lower WER indicates higher transcription accuracy.
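As a small worked example (the sentences are invented for illustration), take the reference "the cat sat on the mat" (6 words) and the prediction "the hat sat on mat". Aligning the two gives one substitution ("cat" → "hat"), one deletion (the second "the"), and no insertions:

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100
             = \frac{1 + 1 + 0}{6} \times 100 \approx 33.3\%
```

Because insertions are counted against the reference length, WER can exceed 100% when the prediction contains many extra words.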
WER Quality Levels
The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:
Basic WER Filtering
Follow these steps to calculate WER values and apply threshold-based filtering to your audio dataset:
Step 1: Calculate WER
Use GetPairwiseWerStage to compute WER between ground truth transcriptions and ASR model predictions:
Parameters:
- text_key: Field name containing ground truth transcriptions in your manifest
- pred_text_key: Field name containing ASR predictions (from InferenceAsrNemoStage or similar)
- wer_key: Field name to store calculated WER values (default: "wer")
Prerequisites: Your audio manifest must contain both ground truth transcriptions and ASR predictions before calculating WER.
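To make the calculation concrete, here is a minimal pure-Python sketch of what a pairwise WER step does to each manifest entry. It is a stand-in for illustration, not the implementation of GetPairwiseWerStage: the helper names (`word_error_rate`, `add_wer`), the field values "text" and "pred_text", and the sample entries are assumptions. It splits on whitespace and compares words case-sensitively, with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from prediction
                curr[j - 1] + 1,     # insertion: extra word in prediction
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def add_wer(entries, text_key="text", pred_text_key="pred_text", wer_key="wer"):
    """Return copies of the manifest entries with a WER field added."""
    return [
        {**entry, wer_key: word_error_rate(entry[text_key], entry[pred_text_key])}
        for entry in entries
    ]


# Hypothetical manifest entries, already containing ASR predictions.
manifest = [
    {"audio_filepath": "a.wav", "text": "the cat sat on the mat",
     "pred_text": "the cat sat on the mat"},
    {"audio_filepath": "b.wav", "text": "the cat sat on the mat",
     "pred_text": "the hat sat on mat"},
]
scored = add_wer(manifest)
```

Each entry in `scored` keeps its original fields and gains a "wer" value expressed as a percentage, which is the form the threshold step expects.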
Step 2: Apply WER Threshold
Use PreserveByValueStage to filter audio samples based on the calculated WER values:
Parameters:
- input_value_key: Field containing WER values (matches wer_key from the previous stage)
- target_value: WER threshold (percentage as a float, for example, 30.0 for 30%)
- operator: Comparison operator ("le" for ≤, "lt" for <, "ge" for ≥, "gt" for >)
The stage preserves samples meeting the threshold criteria and filters out others.
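The filtering logic can be sketched in plain Python as follows. This is an illustrative stand-in rather than the PreserveByValueStage implementation: the function name `preserve_by_value`, its `op` argument, and the sample entries are assumptions, and only the four operators listed in this section are mapped.

```python
import operator

# The four operator names listed in this section, mapped to Python comparisons.
OPERATORS = {
    "le": operator.le,
    "lt": operator.lt,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries, input_value_key="wer", target_value=30.0, op="le"):
    """Keep entries whose value compares true against target_value."""
    compare = OPERATORS[op]
    return [entry for entry in entries if compare(entry[input_value_key], target_value)]


# Hypothetical entries that already carry a WER value.
samples = [
    {"audio_filepath": "a.wav", "wer": 4.2},
    {"audio_filepath": "b.wav", "wer": 30.0},
    {"audio_filepath": "c.wav", "wer": 61.5},
]
kept = preserve_by_value(samples, target_value=30.0, op="le")
```

With "le", a sample whose WER equals the threshold is kept; with "lt" it would be dropped, so the choice matters when many samples sit exactly on the boundary.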
Advanced WER Filtering
Statistical WER Filtering
Rather than using fixed thresholds, you can analyze your dataset’s WER distribution to determine optimal filtering thresholds. This approach is useful when working with domain-specific data or evaluating data quality.
Workflow:
- Calculate WER for all samples using GetPairwiseWerStage
- Export results and analyze WER distribution (mean, median, percentiles)
- Determine threshold based on your quality requirements (for example, keep samples below 75th percentile)
- Apply the calculated threshold using PreserveByValueStage
Example:
Use AudioToDocumentStage and JsonlWriter to export WER values for analysis in tools like pandas, numpy, or visualization libraries.
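One way to run the analysis step on exported results is sketched below using only the Python standard library. It assumes the export is JSONL with one record per line and a "wer" field; the helper names and the sample values are hypothetical, and the percentile is computed with `statistics.quantiles` using the inclusive method, which is one of several common percentile definitions.

```python
import json
import statistics


def load_wer_values(jsonl_lines, wer_key="wer"):
    """Parse JSONL lines and collect the WER value from each record."""
    return [json.loads(line)[wer_key] for line in jsonl_lines if line.strip()]


def percentile_threshold(wer_values, percentile=75):
    """Return the WER value at the given percentile (an integer from 1 to 99)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cut_points = statistics.quantiles(wer_values, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Hypothetical exported records; in practice, read these lines from the JSONL file.
exported = [
    json.dumps({"audio_filepath": f"sample_{i}.wav", "wer": wer})
    for i, wer in enumerate([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 40.0, 60.0, 80.0])
]

wer_values = load_wer_values(exported)
summary = {
    "mean": statistics.mean(wer_values),
    "median": statistics.median(wer_values),
    "p75": percentile_threshold(wer_values, 75),
}

# Use the 75th percentile as the threshold and keep samples at or below it.
threshold = summary["p75"]
kept = [value for value in wer_values if value <= threshold]
```

The resulting `threshold` can then be passed as the target_value of the filtering stage. Comparing the mean and median first is a quick check for a long tail of very poor transcriptions.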
Domain-Specific WER Filtering
Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain:
Conversational Speech
Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:
Use cases: Call center recordings, meeting transcriptions, casual interviews, social media audio
Broadcast and News
Broadcast speech features professional speakers, controlled environments, and clear articulation, enabling stricter quality standards:
Use cases: News broadcasts, audiobooks, podcasts, prepared presentations, voiceovers
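A simple way to keep per-domain thresholds in one place is a lookup table, sketched below. The numeric values are placeholders chosen only to show a lenient versus a strict setting; they are not recommendations from this guide, and the names `DOMAIN_WER_THRESHOLDS` and `filter_by_domain` are hypothetical.

```python
# Placeholder thresholds (percent). Tune these against your own WER distribution.
DOMAIN_WER_THRESHOLDS = {
    "conversational": 50.0,  # lenient: disfluencies, overlap, background noise
    "broadcast": 15.0,       # strict: professional speakers, clean audio
}


def filter_by_domain(entries, domain, wer_key="wer"):
    """Keep entries whose WER is at or below the threshold for the given domain."""
    threshold = DOMAIN_WER_THRESHOLDS[domain]
    return [entry for entry in entries if entry[wer_key] <= threshold]


# Hypothetical scored entries.
scored = [
    {"audio_filepath": "a.wav", "wer": 10.0},
    {"audio_filepath": "b.wav", "wer": 35.0},
    {"audio_filepath": "c.wav", "wer": 70.0},
]
conversational_kept = filter_by_domain(scored, "conversational")
broadcast_kept = filter_by_domain(scored, "broadcast")
```

The same samples yield different retained sets per domain, which is why a single global threshold tends to either discard usable conversational data or admit poor broadcast data.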
Complete WER Filtering Example
Here’s a complete pipeline demonstrating WER calculation and filtering:
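The end-to-end flow can be sketched as a self-contained script that reads a JSONL manifest, scores each entry, and writes the entries that pass the threshold. This is a pure-Python illustration of the data flow, not the library pipeline: the function names, the "text" and "pred_text" field names, the file names, and the demo entries are all assumptions, and it expects every reference transcription to be non-empty.

```python
import json
import tempfile
from pathlib import Path


def wer_percent(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def run_pipeline(input_path, output_path, max_wer=30.0):
    """Score every manifest entry and write those with WER <= max_wer."""
    total = kept = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            entry = json.loads(line)
            entry["wer"] = wer_percent(entry["text"], entry["pred_text"])
            total += 1
            if entry["wer"] <= max_wer:
                dst.write(json.dumps(entry) + "\n")
                kept += 1
    return total, kept


# Hypothetical manifest entries that already contain ASR predictions.
demo = [
    {"audio_filepath": "a.wav", "text": "turn the lights off",
     "pred_text": "turn the lights off"},
    {"audio_filepath": "b.wav", "text": "the cat sat on the mat",
     "pred_text": "the hat sat on mat"},
    {"audio_filepath": "c.wav", "text": "please call me back tomorrow",
     "pred_text": "please call back tomorrow"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.jsonl"
    filtered_path = Path(tmp) / "filtered.jsonl"
    manifest_path.write_text(
        "\n".join(json.dumps(entry) for entry in demo) + "\n", encoding="utf-8"
    )
    total, kept = run_pipeline(manifest_path, filtered_path, max_wer=30.0)
    filtered = [
        json.loads(line)
        for line in filtered_path.read_text(encoding="utf-8").splitlines()
    ]
```

Returning the total and retained counts makes it easy to see how much data a given threshold removes before committing to it.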
Best Practices
- Start with lenient thresholds: Begin with higher WER thresholds (for example, 50%) and progressively tighten based on dataset size and quality requirements.
- Consider domain characteristics: Adjust thresholds based on speech type (conversational, broadcast, or read speech).
- Analyze before filtering: Export WER distributions to understand your data before applying aggressive filters.
- Balance quality and quantity: Stricter thresholds improve data quality but reduce dataset size; find the right balance for your use case.
- Check ASR model: Ensure your ASR model is appropriate for the language and domain before using WER for filtering.
Related Topics
- Quality Assessment Overview - Complete guide to audio quality assessment
- Duration Filtering - Filter by audio length and speech rate
- ASR Inference - Generate ASR predictions for WER calculation