---
description: >-
  Filter audio samples based on Word Error Rate thresholds to ensure
  high-quality transcription accuracy
categories:
  - audio-processing
tags:
  - wer-filtering
  - quality-filtering
  - transcription-accuracy
  - threshold-based
  - speech-quality
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: how-to
modality: audio-only
---

# WER Filtering

Filter audio samples based on Word Error Rate (WER) thresholds to ensure high-quality transcription accuracy in your speech datasets. WER filtering is the primary quality control mechanism for ASR-based audio curation.

## Understanding WER

### What is Word Error Rate?

Word Error Rate (WER) measures transcription accuracy by calculating the percentage of words that differ between ground truth and ASR predictions:

```text
WER = (Substitutions + Deletions + Insertions) / Total_Reference_Words × 100
```

**Components:**

* **Substitutions**: Words incorrectly replaced (for example, "cat" → "hat")
* **Deletions**: Words omitted from the prediction
* **Insertions**: Extra words added to the prediction
* **Total_Reference_Words**: Total word count in ground truth transcription

A lower WER indicates higher transcription accuracy.
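To make the formula concrete, here is a minimal, self-contained sketch of the calculation (illustrative only, not the NeMo Curator implementation): the numerator is the word-level edit distance between reference and hypothesis, computed with standard dynamic programming.

```python
# Illustrative WER computation: word-level edit distance between a
# reference transcription and an ASR hypothesis, as a percentage.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(
                d[i - 1][j] + 1,       # deletion
                d[i][j - 1] + 1,       # insertion
                d[i - 1][j - 1] + cost,
            )
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# "cat" -> "hat" is one substitution out of three reference words
wer = word_error_rate("the cat sat", "the hat sat")
```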

### WER Quality Levels

The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:

| WER Range | Quality Level | Recommended Use                                |
| --------- | ------------- | ---------------------------------------------- |
| 0-10%     | Excellent     | Production ASR training, high-quality datasets |
| 10-25%    | Good          | General ASR training, most applications        |
| 25-50%    | Moderate      | Supplementary training data, domain adaptation |
| 50-75%    | Poor          | Review required, potential filtering           |
| 75%+      | Very Poor     | Strong candidate for removal                   |
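For quick triage or reporting, the bands above can be expressed as a small helper. This function is illustrative (not part of NeMo Curator), and the cut-offs should be adjusted to your domain:

```python
# Illustrative helper: map a WER percentage to the quality bands
# from the table above. Cut-offs are general guidelines, not fixed rules.

def wer_quality_level(wer: float) -> str:
    if wer <= 10.0:
        return "Excellent"
    if wer <= 25.0:
        return "Good"
    if wer <= 50.0:
        return "Moderate"
    if wer <= 75.0:
        return "Poor"
    return "Very Poor"
```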

## Basic WER Filtering

Follow these steps to calculate WER values and apply threshold-based filtering to your audio dataset:

### Step 1: Calculate WER

Use `GetPairwiseWerStage` to compute WER between ground truth transcriptions and ASR model predictions:

```python
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Calculate WER for audio samples
wer_stage = GetPairwiseWerStage(
    text_key="text",           # Ground truth transcription
    pred_text_key="pred_text", # ASR model prediction
    wer_key="wer"             # Output field for WER value
)

# Add to pipeline
pipeline.add_stage(wer_stage)
```

**Parameters:**

* `text_key`: Field name containing ground truth transcriptions in your manifest
* `pred_text_key`: Field name containing ASR predictions (from `InferenceAsrNemoStage` or similar)
* `wer_key`: Field name to store calculated WER values (default: `"wer"`)

**Prerequisites:** Your audio manifest must contain both ground truth transcriptions and ASR predictions before calculating WER.
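For reference, a single manifest entry with both fields might look like the following. The field names match the stage parameters above; the audio path field name and the sample values are illustrative assumptions:

```python
# Illustrative manifest entry (the "audio_filepath" field name and all
# values are made-up examples, not taken from a real dataset):
sample = {
    "audio_filepath": "audio/sample_0001.wav",  # assumed path field
    "text": "the quick brown fox",              # ground truth transcription
    "pred_text": "the quick brown fox",         # ASR model prediction
}
```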

### Step 2: Apply WER Threshold

Use `PreserveByValueStage` to filter audio samples based on the calculated WER values:

```python
from nemo_curator.stages.audio.common import PreserveByValueStage

# Keep samples with WER ≤ 30% (good quality)
wer_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le"  # less than or equal
)

pipeline.add_stage(wer_filter)
```

**Parameters:**

* `input_value_key`: Field containing WER values (matches `wer_key` from previous stage)
* `target_value`: WER threshold (percentage as float, e.g., `30.0` for 30%)
* `operator`: Comparison operator (`"le"` for ≤, `"lt"` for <, `"ge"` for ≥, `"gt"` for >)

The stage preserves samples meeting the threshold criteria and filters out others.

## Advanced WER Filtering

### Statistical WER Filtering

Rather than using fixed thresholds, you can analyze your dataset's WER distribution to determine optimal filtering thresholds. This approach is useful when working with domain-specific data or evaluating data quality.

**Workflow:**

1. Calculate WER for all samples using `GetPairwiseWerStage`
2. Export results and analyze WER distribution (mean, median, percentiles)
3. Determine threshold based on your quality requirements (for example, keep samples below 75th percentile)
4. Apply the calculated threshold using `PreserveByValueStage`
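Step 3 can be sketched in plain Python as below; in practice you would typically use `numpy.percentile` or pandas `quantile` on the exported WER column. The WER values here are made-up examples:

```python
# Sketch of deriving a threshold from the WER distribution (step 3).
# The input values are illustrative; load your real exported WER column instead.

def percentile_threshold(wer_values: list, pct: float) -> float:
    """Return the WER value below which roughly `pct` percent of samples fall."""
    ordered = sorted(wer_values)
    # Nearest-rank index into the sorted values
    k = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[k]

wer_values = [4.2, 8.9, 12.5, 18.0, 22.3, 35.7, 48.1, 62.0]  # example data
calculated_threshold = percentile_threshold(wer_values, 75.0)
```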

**Example:**

```python
# Apply calculated statistical threshold
statistical_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=calculated_threshold,  # From your statistical analysis
    operator="le"
)

pipeline.add_stage(statistical_filter)
```

<Tip>
  Use `AudioToDocumentStage` and `JsonlWriter` to export WER values for analysis in tools like pandas, numpy, or visualization libraries.
</Tip>

## Domain-Specific WER Filtering

Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain:

### Conversational Speech

Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:

```python
# More lenient thresholds for conversational speech
conversational_wer_config = {
    "excellent_threshold": 15.0,  # compared to 10.0 for read speech
    "good_threshold": 35.0,       # compared to 25.0 for read speech  
    "acceptable_threshold": 60.0   # compared to 50.0 for read speech
}

conversational_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=conversational_wer_config["good_threshold"],
    operator="le"
)

pipeline.add_stage(conversational_filter)
```

**Use cases:** Call center recordings, meeting transcriptions, casual interviews, social media audio

### Broadcast and News

Broadcast speech features professional speakers, controlled environments, and clear articulation, enabling stricter quality standards:

```python
# Stricter thresholds for high-quality broadcast speech
broadcast_wer_config = {
    "excellent_threshold": 5.0,   # Very strict
    "good_threshold": 15.0,       # Stricter than general
    "acceptable_threshold": 25.0   # Maximum for broadcast quality
}

broadcast_filter = PreserveByValueStage(
    input_value_key="wer", 
    target_value=broadcast_wer_config["good_threshold"],
    operator="le"
)

pipeline.add_stage(broadcast_filter)
```

**Use cases:** News broadcasts, audiobooks, podcasts, prepared presentations, voiceovers

## Complete WER Filtering Example

Here's a complete pipeline demonstrating WER calculation and filtering:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

# Create WER filtering pipeline
pipeline = Pipeline(name="wer_filtering")

# 1. Load audio data with ground truth transcriptions
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us", 
    split="validation", 
    raw_data_dir="./audio_data"
).with_(batch_size=8))

# 2. Run ASR inference to generate predictions
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text"
).with_(resources=Resources(gpus=1.0)))

# 3. Calculate WER
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))

# 4. Filter by WER threshold (keep WER ≤ 30%)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le"
))

# 5. Export filtered results
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./filtered_audio"))

# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)
```

## Best Practices

* **Start with lenient thresholds**: Begin with higher WER thresholds (for example, 50%) and progressively tighten based on dataset size and quality requirements.
* **Consider domain characteristics**: Adjust thresholds based on speech type (conversational, broadcast, or read speech).
* **Analyze before filtering**: Export WER distributions to understand your data before applying aggressive filters.
* **Balance quality and quantity**: Stricter thresholds improve data quality but reduce dataset size; find the right balance for your use case.
* **Check ASR model**: Ensure your ASR model is appropriate for the language and domain before using WER for filtering.
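A quick way to balance quality and quantity is to sweep candidate thresholds and count how many samples each one retains before committing to a filter. The sketch below uses made-up WER values; substitute your exported WER column:

```python
# Illustrative threshold sweep: count retained samples at each candidate
# WER threshold to see the quality/quantity trade-off before filtering.

def retention_by_threshold(wer_values, thresholds):
    """Return {threshold: number of samples with WER <= threshold}."""
    return {t: sum(1 for w in wer_values if w <= t) for t in thresholds}

wer_values = [5.0, 12.0, 18.0, 27.0, 33.0, 41.0, 58.0, 80.0]  # example data
counts = retention_by_threshold(wer_values, [10.0, 25.0, 50.0])
```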

## Related Topics

* **[Quality Assessment Overview](/curate-audio/process-data/quality-assessment)** - Complete guide to audio quality assessment
* **[Duration Filtering](/curate-audio/process-data/quality-assessment/duration-filtering)** - Filter by audio length and speech rate
* **[ASR Inference](/curate-audio/process-data/asr-inference)** - Generate ASR predictions for WER calculation
