Filter audio samples based on Word Error Rate (WER) thresholds to ensure high-quality transcription accuracy in your speech datasets. WER filtering is the primary quality control mechanism for ASR-based audio curation.
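Word Error Rate (WER) measures transcription accuracy as the number of word-level errors in an ASR prediction, expressed as a percentage of the words in the ground truth:
WER = (S + D + I) / N × 100
Components:
S: Substitutions (ground truth words replaced by different words in the prediction)
D: Deletions (ground truth words missing from the prediction)
I: Insertions (words in the prediction that do not appear in the ground truth)
N: Total number of words in the ground truth transcription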
A lower WER indicates higher transcription accuracy.
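As a worked illustration (the sentences are made up for this example), take the ground truth "the cat sat on the mat", which has N = 6 words, and the prediction "the cat sit on mat". Aligning the two gives S = 1 substitution ("sat" became "sit"), D = 1 deletion (the second "the"), and I = 0 insertions:

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100
             = \frac{1 + 1 + 0}{6} \times 100
             \approx 33.3\%
```

Because insertions count as errors but do not increase N, WER can exceed 100% when the prediction contains many extra words.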
The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:
Follow these steps to calculate WER values and apply threshold-based filtering to your audio dataset:
Use GetPairwiseWerStage to compute WER between ground truth transcriptions and ASR model predictions:
Parameters:
text_key: Field name containing ground truth transcriptions in your manifest
pred_text_key: Field name containing ASR predictions (from InferenceAsrNemoStage or similar)
wer_key: Field name to store calculated WER values (default: "wer")

Prerequisites: Your audio manifest must contain both ground truth transcriptions and ASR predictions before calculating WER.
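To make the roles of the three field names concrete, the following is a minimal plain-Python sketch of what a pairwise WER step does to a single manifest entry. It does not call GetPairwiseWerStage. The helpers `word_edit_distance` and `add_wer`, the default field names "text" and "pred_text", and the sample entry are all hypothetical. The sketch splits on whitespace and applies no text normalization, which may differ from the behavior of the actual stage.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn ref_words into hyp_words (word-level Levenshtein)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a predicted word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def add_wer(entry, text_key="text", pred_text_key="pred_text", wer_key="wer"):
    """Return a copy of a manifest entry with WER (in percent) under wer_key."""
    # Prerequisite check: both transcriptions must already be present.
    missing = [k for k in (text_key, pred_text_key) if k not in entry]
    if missing:
        raise KeyError(f"manifest entry is missing required field(s): {missing}")
    ref_words = entry[text_key].split()
    hyp_words = entry[pred_text_key].split()
    if not ref_words:
        raise ValueError("ground truth transcription is empty; WER is undefined")
    wer = 100.0 * word_edit_distance(ref_words, hyp_words) / len(ref_words)
    return {**entry, wer_key: round(wer, 2)}


entry = {
    "audio_filepath": "sample_0001.wav",  # hypothetical path
    "text": "the cat sat on the mat",
    "pred_text": "the cat sit on mat",
}
annotated = add_wer(entry)
print(annotated["wer"])  # 33.33
```

The entry keeps all of its original fields and gains one new field named by wer_key, which is the value the filtering step reads later.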
Use PreserveByValueStage to filter audio samples based on the calculated WER values:
Parameters:
input_value_key: Field containing WER values (matches wer_key from previous stage)
target_value: WER threshold (percentage as float, e.g., 30.0 for 30%)
operator: Comparison operator ("le" for ≤, "lt" for <, "ge" for ≥, "gt" for >)

The stage preserves samples meeting the threshold criteria and filters out others.
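The comparison semantics can be sketched in plain Python as follows. This is an illustration of the filtering logic only, not the PreserveByValueStage implementation. The function `preserve_by_value` and the sample entries are hypothetical, and the operator argument is named `op` here to avoid shadowing Python's `operator` module.

```python
import operator

# Operator names from the parameter list, mapped to Python comparisons.
OPERATORS = {
    "le": operator.le,  # value <= target
    "lt": operator.lt,  # value <  target
    "ge": operator.ge,  # value >= target
    "gt": operator.gt,  # value >  target
}


def preserve_by_value(entries, input_value_key="wer", target_value=30.0, op="le"):
    """Keep entries whose value compares true against target_value."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    compare = OPERATORS[op]
    return [e for e in entries if compare(e[input_value_key], target_value)]


samples = [
    {"id": "a", "wer": 12.5},
    {"id": "b", "wer": 30.0},
    {"id": "c", "wer": 47.1},
]
kept = preserve_by_value(samples, target_value=30.0, op="le")
print([s["id"] for s in kept])  # ['a', 'b']
```

With "le", a sample whose WER equals the threshold is kept; with "lt", the same sample would be dropped. For WER filtering, "le" or "lt" is the usual choice because lower values mean better transcriptions.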
Rather than using fixed thresholds, you can analyze your dataset’s WER distribution to determine optimal filtering thresholds. This approach is useful when working with domain-specific data or evaluating data quality.
Workflow:
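1. Compute WER for all samples with GetPairwiseWerStage
2. Export the WER values and inspect their distribution
3. Apply the threshold you choose with PreserveByValueStage

Example: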
Use AudioToDocumentStage and JsonlWriter to export WER values for analysis in tools like pandas, numpy, or visualization libraries.
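Once the values are exported, a threshold can be read off the distribution. The sketch below uses only the Python standard library and assumes the export is a JSONL file with one object per line and a "wer" field. The helper names `load_wer_values` and `summarize` and the five WER values are made up for illustration.

```python
import io
import json
import statistics


def load_wer_values(jsonl_file, wer_key="wer"):
    """Read one JSON object per line and collect the WER values."""
    return [json.loads(line)[wer_key] for line in jsonl_file if line.strip()]


def summarize(values):
    """Return count, mean, median, and 90th percentile of the WER values."""
    # quantiles(n=10) returns the nine cut points between deciles;
    # index 8 is the 90th percentile.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": deciles[8],
    }


# In-memory stand-in for a real exported JSONL file (values are made up).
export = io.StringIO(
    "\n".join(json.dumps({"wer": w}) for w in [5.0, 10.0, 15.0, 20.0, 80.0])
)
stats = summarize(load_wer_values(export))
print(stats)  # {'count': 5, 'mean': 26.0, 'median': 15.0, 'p90': 56.0}
```

In this toy data a single outlier pulls the mean well above the median, which is one reason to look at percentiles rather than the mean when picking a cutoff. A chosen percentile can then be passed as the target_value of the filtering step.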
Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain:
Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:
Use cases: Call center recordings, meeting transcriptions, casual interviews, social media audio
Broadcast speech features professional speakers, controlled environments, and clear articulation, enabling stricter quality standards:
Use cases: News broadcasts, audiobooks, podcasts, prepared presentations, voiceovers
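One way to keep domain-specific cutoffs in a single place is a small lookup table, sketched below in plain Python. The numeric thresholds are illustrative placeholders rather than recommended values, and the names `DOMAIN_WER_THRESHOLDS`, `threshold_for`, and `filter_by_domain` are hypothetical. Tune the numbers against your own WER distribution.

```python
# Illustrative placeholder thresholds in percent; not recommended values.
DOMAIN_WER_THRESHOLDS = {
    "conversational": 50.0,  # lenient: disfluencies, overlap, background noise
    "broadcast": 20.0,       # strict: professional speakers, controlled audio
}


def threshold_for(domain, default=30.0):
    """Look up the WER cutoff for a domain, falling back to a default."""
    return DOMAIN_WER_THRESHOLDS.get(domain, default)


def filter_by_domain(entries, domain, wer_key="wer"):
    """Keep entries whose WER is at or below the cutoff for the domain."""
    limit = threshold_for(domain)
    return [e for e in entries if e[wer_key] <= limit]


samples = [{"id": "x", "wer": 18.0}, {"id": "y", "wer": 42.0}]
print([s["id"] for s in filter_by_domain(samples, "conversational")])  # ['x', 'y']
print([s["id"] for s in filter_by_domain(samples, "broadcast")])       # ['x']
```

The same sample with a WER of 42% survives the conversational cutoff but not the broadcast one, which reflects the different expectations for each domain.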
Here’s a complete pipeline demonstrating WER calculation and filtering: