
WER Filtering


Filter audio samples based on Word Error Rate (WER) thresholds to ensure high transcription accuracy in your speech datasets. WER filtering is the primary quality control mechanism for ASR-based audio curation.

Understanding WER

What is Word Error Rate?

Word Error Rate (WER) measures transcription accuracy by calculating the percentage of words that differ between ground truth and ASR predictions:

WER = (Substitutions + Deletions + Insertions) / Total_Reference_Words × 100

Components:

  • Substitutions: Words incorrectly replaced (for example, “cat” → “hat”)
  • Deletions: Words omitted from the prediction
  • Insertions: Extra words added to the prediction
  • Total_Reference_Words: Total word count in ground truth transcription

A lower WER indicates higher transcription accuracy.
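As an illustration of the formula, WER can be computed as a word-level edit distance divided by the reference length. The sketch below is plain Python for explanatory purposes (it is not part of NeMo Curator, which computes WER internally in `GetPairwiseWerStage`):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / reference word count x 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref) * 100

# "cat" -> "hat" is one substitution, "down" is one insertion,
# against 3 reference words: WER = 2/3 x 100 ≈ 66.7%
print(word_error_rate("the cat sat", "the hat sat down"))
```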

WER Quality Levels

The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:

| WER Range | Quality Level | Recommended Use |
|-----------|---------------|-----------------|
| 0–10% | Excellent | Production ASR training, high-quality datasets |
| 10–25% | Good | General ASR training, most applications |
| 25–50% | Moderate | Supplementary training data, domain adaptation |
| 50–75% | Poor | Review required, potential filtering |
| 75%+ | Very poor | Strong candidate for removal |
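If you want to tag samples with these quality levels during analysis, a small helper like the one below can map WER values to the bands in the table (the function name and labels are illustrative, not part of NeMo Curator):

```python
def wer_quality_level(wer: float) -> str:
    """Map a WER percentage to the quality bands described above."""
    if wer < 10.0:
        return "excellent"
    if wer < 25.0:
        return "good"
    if wer < 50.0:
        return "moderate"
    if wer < 75.0:
        return "poor"
    return "very poor"

print(wer_quality_level(8.2))   # excellent
print(wer_quality_level(42.0))  # moderate
```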

Basic WER Filtering

Follow these steps to calculate WER values and apply threshold-based filtering to your audio dataset:

Step 1: Calculate WER

Use GetPairwiseWerStage to compute WER between ground truth transcriptions and ASR model predictions:

```python
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Calculate WER for audio samples
wer_stage = GetPairwiseWerStage(
    text_key="text",            # Ground truth transcription
    pred_text_key="pred_text",  # ASR model prediction
    wer_key="wer",              # Output field for WER value
)

# Add to pipeline
pipeline.add_stage(wer_stage)
```

Parameters:

  • text_key: Field name containing ground truth transcriptions in your manifest
  • pred_text_key: Field name containing ASR predictions (from InferenceAsrNemoStage or similar)
  • wer_key: Field name to store calculated WER values (default: "wer")

Prerequisites: Your audio manifest must contain both ground truth transcriptions and ASR predictions before calculating WER.

Step 2: Apply WER Threshold

Use PreserveByValueStage to filter audio samples based on the calculated WER values:

```python
from nemo_curator.stages.audio.common import PreserveByValueStage

# Keep samples with WER <= 30% (good quality)
wer_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le",  # less than or equal
)

pipeline.add_stage(wer_filter)
```

Parameters:

  • input_value_key: Field containing WER values (matches wer_key from previous stage)
  • target_value: WER threshold (percentage as float, e.g., 30.0 for 30%)
  • operator: Comparison operator ("le" for ≤, "lt" for <, "ge" for ≥, "gt" for >)

The stage preserves samples meeting the threshold criteria and filters out others.
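Conceptually, the operator strings correspond to Python's standard comparison operators. The sketch below shows the preserve-or-filter decision in isolation (the `preserve` helper is illustrative, not NeMo Curator's actual implementation):

```python
import operator

# Operator strings map onto Python's comparison functions
COMPARATORS = {
    "le": operator.le,  # <=
    "lt": operator.lt,  # <
    "ge": operator.ge,  # >=
    "gt": operator.gt,  # >
}

def preserve(sample: dict, input_value_key: str, target_value: float, op: str) -> bool:
    """Return True if the sample passes the threshold check and should be kept."""
    return COMPARATORS[op](sample[input_value_key], target_value)

print(preserve({"wer": 12.5}, "wer", 30.0, "le"))  # True  -> kept
print(preserve({"wer": 45.0}, "wer", 30.0, "le"))  # False -> filtered out
```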

Advanced WER Filtering

Statistical WER Filtering

Rather than using fixed thresholds, you can analyze your dataset’s WER distribution to determine optimal filtering thresholds. This approach is useful when working with domain-specific data or evaluating data quality.

Workflow:

  1. Calculate WER for all samples using GetPairwiseWerStage
  2. Export results and analyze WER distribution (mean, median, percentiles)
  3. Determine threshold based on your quality requirements (for example, keep samples below 75th percentile)
  4. Apply the calculated threshold using PreserveByValueStage

Example:

```python
# Apply calculated statistical threshold
statistical_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=calculated_threshold,  # From your statistical analysis
    operator="le",
)

pipeline.add_stage(statistical_filter)
```

Use AudioToDocumentStage and JsonlWriter to export WER values for analysis in tools like pandas, numpy, or visualization libraries.
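For step 2 of the workflow, the exported JSONL can be analyzed with the standard library alone. The sketch below (the helper name and the nearest-rank percentile method are illustrative choices, and it assumes one JSON record per line with a `"wer"` field) derives a 75th-percentile threshold you can then pass as `target_value`:

```python
import json

def wer_threshold_from_jsonl(lines, percentile: float = 75.0) -> float:
    """Compute a WER threshold at the given percentile from JSONL records."""
    wers = sorted(json.loads(line)["wer"] for line in lines)
    # Nearest-rank percentile over the sorted WER values
    k = max(0, min(len(wers) - 1, round(percentile / 100 * (len(wers) - 1))))
    return wers[k]

# Example records as they might appear in an exported manifest
records = ['{"wer": 5.0}', '{"wer": 20.0}', '{"wer": 35.0}', '{"wer": 80.0}']
calculated_threshold = wer_threshold_from_jsonl(records)
print(calculated_threshold)  # 35.0
```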

Domain-Specific WER Filtering

Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain:

Conversational Speech

Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:

```python
# More lenient thresholds for conversational speech
conversational_wer_config = {
    "excellent_threshold": 15.0,   # compared to 10.0 for read speech
    "good_threshold": 35.0,        # compared to 25.0 for read speech
    "acceptable_threshold": 60.0,  # compared to 50.0 for read speech
}

conversational_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=conversational_wer_config["good_threshold"],
    operator="le",
)

pipeline.add_stage(conversational_filter)
```

Use cases: Call center recordings, meeting transcriptions, casual interviews, social media audio

Broadcast and News

Broadcast speech features professional speakers, controlled environments, and clear articulation, enabling stricter quality standards:

```python
# Stricter thresholds for high-quality broadcast speech
broadcast_wer_config = {
    "excellent_threshold": 5.0,    # Very strict
    "good_threshold": 15.0,        # Stricter than general
    "acceptable_threshold": 25.0,  # Maximum for broadcast quality
}

broadcast_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=broadcast_wer_config["good_threshold"],
    operator="le",
)

pipeline.add_stage(broadcast_filter)
```

Use cases: News broadcasts, audiobooks, podcasts, prepared presentations, voiceovers

Complete WER Filtering Example

Here’s a complete pipeline demonstrating WER calculation and filtering:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

# Create WER filtering pipeline
pipeline = Pipeline(name="wer_filtering")

# 1. Load audio data with ground truth transcriptions
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us",
    split="validation",
    raw_data_dir="./audio_data",
).with_(batch_size=8))

# 2. Run ASR inference to generate predictions
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text",
).with_(resources=Resources(gpus=1.0)))

# 3. Calculate WER
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer",
))

# 4. Filter by WER threshold (keep WER <= 30%)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le",
))

# 5. Export filtered results
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./filtered_audio"))

# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)
```

Best Practices

  • Start with lenient thresholds: Begin with higher WER thresholds (for example, 50%) and progressively tighten based on dataset size and quality requirements.
  • Consider domain characteristics: Adjust thresholds based on speech type (conversational compared to broadcast compared to read speech).
  • Analyze before filtering: Export WER distributions to understand your data before applying aggressive filters.
  • Balance quality and quantity: Stricter thresholds improve data quality but reduce dataset size; find the right balance for your use case.
  • Check ASR model: Ensure your ASR model is appropriate for the language and domain before using WER for filtering.