WER Filtering | NeMo Curator

Filter audio samples by Word Error Rate (WER) threshold so that your speech dataset keeps only samples whose ground truth transcriptions closely match the ASR predictions. WER filtering is the primary quality control mechanism for ASR-based audio curation.

Understanding WER

What is Word Error Rate?

Word Error Rate (WER) measures transcription accuracy as the number of word-level edits needed to turn the ASR prediction into the ground truth transcription, expressed as a percentage of the ground truth word count:

WER = (Substitutions + Deletions + Insertions) / Total_Reference_Words × 100

Components:

Substitutions: Words incorrectly replaced (for example, “cat” → “hat”)
Deletions: Words omitted from the prediction
Insertions: Extra words added to the prediction
Total_Reference_Words: Total word count in ground truth transcription

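A lower WER indicates higher transcription accuracy. Because insertions count as errors, WER can exceed 100% when the prediction contains many more words than the reference.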
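The formula can be illustrated with a minimal pure-Python sketch based on word-level edit distance. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization, so its results may differ from what GetPairwiseWerStage reports:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from prediction
                curr[j - 1] + 1,     # insertion: extra word in prediction
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            )
        prev = curr
    return prev[-1] / len(ref) * 100


# One substitution ("cat" -> "hat") and one deletion ("the") over 6 reference words
wer = word_error_rate("the cat sat on the mat", "the hat sat on mat")
print(f"{wer:.1f}")  # prints 33.3
```

With a one-word reference and three inserted words, for example `word_error_rate("hello", "hello there big world")`, the same function returns 300.0.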

WER Quality Levels

The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:

WER Range	Quality Level	Recommended Use
0-10%	Excellent	Production ASR training, high-quality datasets
10-25%	Good	General ASR training, most applications
25-50%	Moderate	Supplementary training data, domain adaptation
50-75%	Poor	Review required, potential filtering
75%+	Very poor	Strong candidate for removal
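As a rough sketch, the table can be turned into a lookup that returns the recommended use for a given WER value. The function name `recommended_use` is hypothetical, and the sketch assumes each range includes its upper bound (so exactly 10% falls in the first row), which the table itself does not specify:

```python
from bisect import bisect_left

# Upper bounds of the first four ranges; treated as inclusive by assumption.
_UPPER_BOUNDS = [10.0, 25.0, 50.0, 75.0]
_RECOMMENDED_USE = [
    "Production ASR training, high-quality datasets",
    "General ASR training, most applications",
    "Supplementary training data, domain adaptation",
    "Review required, potential filtering",
    "Strong candidate for removal",
]


def recommended_use(wer: float) -> str:
    """Map a WER percentage to the recommended use from the guideline table."""
    if wer < 0:
        raise ValueError("WER cannot be negative")
    return _RECOMMENDED_USE[bisect_left(_UPPER_BOUNDS, wer)]


print(recommended_use(18.0))  # prints General ASR training, most applications
```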

Basic WER Filtering

Follow these steps to calculate WER values and apply threshold-based filtering to your audio dataset:

Step 1: Calculate WER

Use GetPairwiseWerStage to compute WER between ground truth transcriptions and ASR model predictions:

from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Calculate WER for audio samples
wer_stage = GetPairwiseWerStage(
    text_key="text",            # Ground truth transcription
    pred_text_key="pred_text",  # ASR model prediction
    wer_key="wer",              # Output field for WER value
)

# Add to an existing Pipeline instance
pipeline.add_stage(wer_stage)

Parameters:

text_key: Field name containing ground truth transcriptions in your manifest
pred_text_key: Field name containing ASR predictions (from InferenceAsrNemoStage or similar)
wer_key: Field name to store calculated WER values (default: "wer")

Prerequisites: Your audio manifest must contain both ground truth transcriptions and ASR predictions before calculating WER.
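One way to check this prerequisite before running the pipeline is to scan the manifest for entries that lack either field. The sketch below uses only the standard library and assumes a JSON Lines manifest with one entry per line. The helper name `entries_missing_keys` and the `audio_filepath` field in the sample data are illustrative assumptions:

```python
import json

REQUIRED_KEYS = ("text", "pred_text")


def entries_missing_keys(manifest_lines, required=REQUIRED_KEYS):
    """Return (line_number, missing_keys) for each entry lacking a required key."""
    problems = []
    for line_number, line in enumerate(manifest_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        missing = [key for key in required if key not in entry]
        if missing:
            problems.append((line_number, missing))
    return problems


# Hypothetical manifest lines; the second entry has no ASR prediction yet.
sample_manifest = [
    '{"audio_filepath": "a.wav", "text": "hello world", "pred_text": "hello word"}',
    '{"audio_filepath": "b.wav", "text": "good morning"}',
]
print(entries_missing_keys(sample_manifest))  # prints [(2, ['pred_text'])]
```

The function accepts any iterable of lines, so an open file handle can be passed in place of the sample list.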

Step 2: Apply WER Threshold

Use PreserveByValueStage to filter audio samples based on the calculated WER values:

from nemo_curator.stages.audio.common import PreserveByValueStage

# Keep samples with WER ≤ 30%
wer_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le",  # less than or equal
)

pipeline.add_stage(wer_filter)

Parameters:

input_value_key: Field containing WER values (matches wer_key from previous stage)
target_value: WER threshold (percentage as float, e.g., 30.0 for 30%)
operator: Comparison operator ("le" for ≤, "lt" for <, "ge" for ≥, "gt" for >)

The stage preserves samples meeting the threshold criteria and filters out others.
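To make the operator semantics concrete, the following sketch mimics the comparison on plain dictionaries using Python's `operator` module. It illustrates the behavior described above and is not the stage's actual implementation; the `keep_sample` helper and the sample data are hypothetical:

```python
import operator

# Operator strings listed above, mapped to standard comparison functions.
COMPARATORS = {
    "le": operator.le,
    "lt": operator.lt,
    "ge": operator.ge,
    "gt": operator.gt,
}


def keep_sample(entry: dict, input_value_key: str, target_value: float, op: str) -> bool:
    """Return True when entry[input_value_key] <op> target_value holds."""
    return COMPARATORS[op](entry[input_value_key], target_value)


samples = [{"wer": 12.5}, {"wer": 30.0}, {"wer": 48.0}]
kept = [s for s in samples if keep_sample(s, "wer", 30.0, "le")]
print(kept)  # prints [{'wer': 12.5}, {'wer': 30.0}]
```

Note that "le" keeps the sample sitting exactly on the threshold, while "lt" would drop it.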

Advanced WER Filtering

Statistical WER Filtering

Rather than using fixed thresholds, you can analyze your dataset’s WER distribution to determine optimal filtering thresholds. This approach is useful when working with domain-specific data or evaluating data quality.

Workflow:

Calculate WER for all samples using GetPairwiseWerStage
Export results and analyze WER distribution (mean, median, percentiles)
Determine a threshold based on your quality requirements (for example, keep samples at or below the 75th percentile)
Apply the calculated threshold using PreserveByValueStage

Example:

# Apply calculated statistical threshold
statistical_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=calculated_threshold,  # From your statistical analysis
    operator="le",
)

pipeline.add_stage(statistical_filter)

Use AudioToDocumentStage and JsonlWriter to export WER values for analysis in tools like pandas, numpy, or visualization libraries.
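The analysis step of the workflow can be sketched with the standard library alone. The example assumes the exported JSONL has one JSON object per line with a numeric "wer" field. The helper names and the demo values are made up for illustration, and `percentile_threshold` uses the inclusive interpolation method of `statistics.quantiles`, which is one of several reasonable percentile definitions:

```python
import json
import statistics


def read_wer_values(lines, wer_key="wer"):
    """Collect WER values from JSONL lines, skipping blank lines."""
    return [json.loads(line)[wer_key] for line in lines if line.strip()]


def percentile_threshold(values, percentile=75):
    """Return the given percentile (1-99) of the WER values."""
    # quantiles(n=100) yields 99 cut points; index p - 1 is the p-th percentile.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Demo data standing in for an exported file; pass an open file handle instead.
demo_lines = [json.dumps({"wer": v}) for v in (5, 10, 15, 20, 25, 30, 35, 40, 90)]
wer_values = read_wer_values(demo_lines)

print("mean:", statistics.mean(wer_values))
print("median:", statistics.median(wer_values))
calculated_threshold = percentile_threshold(wer_values, percentile=75)
print("75th percentile:", calculated_threshold)
```

The resulting `calculated_threshold` can then be passed as `target_value` to PreserveByValueStage. Comparing the mean with the median is a quick way to spot a long tail of very poor samples, as in the demo data where a single value of 90 pulls the mean above the median.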

Domain-Specific WER Filtering

Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain:

Conversational Speech

Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:

# More lenient thresholds for conversational speech
conversational_wer_config = {
    "excellent_threshold": 15.0,   # compared to 10.0 for read speech
    "good_threshold": 35.0,        # compared to 25.0 for read speech
    "acceptable_threshold": 60.0,  # compared to 50.0 for read speech
}

conversational_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=conversational_wer_config["good_threshold"],
    operator="le",
)

pipeline.add_stage(conversational_filter)

Use cases: Call center recordings, meeting transcriptions, casual interviews, social media audio

Broadcast and News

Broadcast speech features professional speakers, controlled environments, and clear articulation, enabling stricter quality standards:

# Stricter thresholds for high-quality broadcast speech
broadcast_wer_config = {
    "excellent_threshold": 5.0,    # Very strict
    "good_threshold": 15.0,        # Stricter than general
    "acceptable_threshold": 25.0,  # Maximum for broadcast quality
}

broadcast_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=broadcast_wer_config["good_threshold"],
    operator="le",
)

pipeline.add_stage(broadcast_filter)

Use cases: News broadcasts, audiobooks, podcasts, prepared presentations, voiceovers
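The per-domain configurations can also be combined to show how the same WER value lands in different tiers depending on the domain. This is a sketch only: the `quality_tier` helper and the "reject" label are hypothetical, and the read speech values are taken from the comparison comments in the conversational configuration:

```python
DOMAIN_WER_CONFIGS = {
    "read": {
        "excellent_threshold": 10.0,
        "good_threshold": 25.0,
        "acceptable_threshold": 50.0,
    },
    "conversational": {
        "excellent_threshold": 15.0,
        "good_threshold": 35.0,
        "acceptable_threshold": 60.0,
    },
    "broadcast": {
        "excellent_threshold": 5.0,
        "good_threshold": 15.0,
        "acceptable_threshold": 25.0,
    },
}


def quality_tier(wer: float, domain: str) -> str:
    """Return the first tier whose threshold the WER does not exceed."""
    config = DOMAIN_WER_CONFIGS[domain]
    for tier in ("excellent", "good", "acceptable"):
        if wer <= config[f"{tier}_threshold"]:
            return tier
    return "reject"  # hypothetical label for samples above every threshold


for domain in DOMAIN_WER_CONFIGS:
    print(domain, quality_tier(20.0, domain))
```

A WER of 20% counts as "good" for read and conversational speech but only "acceptable" for broadcast speech, which is why a single fixed threshold rarely suits mixed-domain datasets.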

Complete WER Filtering Example

Here’s a complete pipeline demonstrating WER calculation and filtering:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

# Create WER filtering pipeline
pipeline = Pipeline(name="wer_filtering")

# 1. Load audio data with ground truth transcriptions
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us",
    split="validation",
    raw_data_dir="./audio_data"
).with_(batch_size=8))

# 2. Run ASR inference to generate predictions
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text"
).with_(resources=Resources(gpus=1.0)))

# 3. Calculate WER
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))

# 4. Filter by WER threshold (keep WER ≤ 30%)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le"
))

# 5. Export filtered results
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./filtered_audio"))

# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)

Best Practices

Start with lenient thresholds: Begin with higher WER thresholds (for example, 50%) and progressively tighten based on dataset size and quality requirements.
Consider domain characteristics: Adjust thresholds based on speech type (conversational, broadcast, or read speech).
Analyze before filtering: Export WER distributions to understand your data before applying aggressive filters.
Balance quality and quantity: Stricter thresholds improve data quality but reduce dataset size; find the right balance for your use case.
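Check the ASR model: Ensure your ASR model is appropriate for the language and domain before using WER for filtering, because WER reflects errors in the ASR predictions as well as errors in the ground truth transcriptions.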
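The trade-off between quality and quantity can be estimated before committing to a threshold by measuring how much of the dataset each candidate would retain. The sketch below assumes you already have the WER values as a list of numbers; the `retention_by_threshold` helper and the sample values are hypothetical:

```python
def retention_by_threshold(wer_values, thresholds):
    """Return the fraction of samples with WER <= each candidate threshold."""
    if not wer_values:
        raise ValueError("wer_values must not be empty")
    total = len(wer_values)
    return {
        threshold: sum(wer <= threshold for wer in wer_values) / total
        for threshold in thresholds
    }


# Made-up WER values; start lenient (50%) and tighten progressively.
sample_wers = [4.0, 12.0, 18.0, 27.0, 33.0, 41.0, 58.0, 76.0]
retention = retention_by_threshold(sample_wers, thresholds=(50.0, 30.0, 20.0))
for threshold, fraction in retention.items():
    print(f"WER <= {threshold:.0f}%: keeps {fraction:.1%} of samples")
```

If tightening the threshold from 50% to 30% discards a large share of the data, that is a signal to inspect the borderline samples before filtering them out.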

Related Topics

Quality Assessment Overview - Complete guide to audio quality assessment
Duration Filtering - Filter by audio length and speech rate
ASR Inference - Generate ASR predictions for WER calculation