Audio Quality Metrics#
This guide covers the quality metrics used in NeMo Curator for evaluating speech transcription accuracy, audio characteristics, and overall dataset quality.
Transcription Accuracy Metrics#
Word Error Rate (WER)#
The primary metric for measuring ASR transcription quality:
Definition: The number of word-level edit operations (substitutions, deletions, insertions) needed to turn the predicted transcription into the ground truth, expressed as a percentage of ground-truth words. Because insertions count as errors, WER can exceed 100%.
Calculation:
WER = (Substitutions + Deletions + Insertions) / Total_Words × 100
Interpretation:
WER = 0%: Perfect transcription match
WER ≤ 10%: Excellent quality (production-ready)
WER ≤ 25%: Good quality (suitable for most training)
WER ≤ 50%: Moderate quality (may need review)
WER > 50%: Poor quality (consider filtering)
Example:
# Ground truth: "hello world example"
# Prediction: "hello word example"
# WER = 1/3 × 100 = 33.33% (1 substitution out of 3 words)
Note
WER and CER utilities depend on the editdistance package.
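The WER formula above can be sketched with a word-level edit distance in pure Python. This is an illustration only; NeMo Curator's own utilities rely on the editdistance package rather than this hand-rolled dynamic program:

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word sequences (substitutions, deletions, insertions)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(n + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[m][n]


def wer(reference: str, prediction: str) -> float:
    """WER as a percentage of reference words."""
    ref_words = reference.split()
    return word_edit_distance(ref_words, prediction.split()) / len(ref_words) * 100


print(round(wer("hello world example", "hello word example"), 2))  # 33.33
```

This reproduces the worked example above: one substitution ("word" for "world") out of three reference words.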
Character Error Rate (CER)#
More granular accuracy measurement at the character level:
Definition: The number of character-level edit operations (substitutions, deletions, insertions) needed to turn the predicted transcription into the ground truth, expressed as a percentage of ground-truth characters.
Calculation:
CER = (Character_Substitutions + Character_Deletions + Character_Insertions) / Total_Characters × 100
Use Cases:
Languages with complex morphology
Detailed accuracy analysis
Character-level model evaluation
Example:
# Ground truth: "hello"
# Prediction: "helo"
# CER = 1/5 × 100 = 20% (1 deletion out of 5 characters)
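The same dynamic-programming edit distance, applied at the character level, reproduces the CER example above (an illustrative sketch, not the shipped utility):

```python
def char_edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[m][n]


def cer(reference: str, prediction: str) -> float:
    """CER as a percentage of reference characters."""
    return char_edit_distance(reference, prediction) / len(reference) * 100


print(cer("hello", "helo"))  # 20.0
```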
Audio Characteristic Metrics#
Duration Analysis#
Audio Duration: Precise measurement of audio file length in seconds.
Speech Rate Metrics:
Words per Second:
word_count / duration
Characters per Second:
character_count / duration
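Both rates can be computed directly from a transcript and its duration. A minimal sketch; note that whether spaces count toward the character total is a convention choice (they are excluded here):

```python
def speech_rate(text: str, duration: float) -> dict:
    """Words per second and characters per second for one utterance.

    Spaces are excluded from the character count; adjust to your convention.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    words = text.split()
    chars = len(text.replace(" ", ""))
    return {
        "words_per_second": len(words) / duration,
        "chars_per_second": chars / duration,
    }


print(speech_rate("hello world example", 2.0))
# {'words_per_second': 1.5, 'chars_per_second': 8.5}
```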
Note
To enforce duration thresholds in a pipeline, use PreserveByValueStage.
Format and Technical Metrics#
Sample Rate: Audio sampling frequency (typically 16 kHz for ASR)
Bit Depth: Audio resolution (16-bit or 24-bit)
Channels: Mono (preferred) or stereo audio
Encoding Format: Compression format (WAV or FLAC preferred for quality)
Quality Assessment Strategies#
Threshold-Based Filtering#
Conservative Filtering (High Quality):
quality_thresholds = {
    "max_wer": 15.0,       # WER ≤ 15%
    "min_duration": 1.0,   # Duration ≥ 1 second
    "max_duration": 20.0,  # Duration ≤ 20 seconds
    "min_words": 3,        # At least 3 words
}
Balanced Filtering (Good Quality):
quality_thresholds = {
    "max_wer": 30.0,       # WER ≤ 30%
    "min_duration": 0.5,   # Duration ≥ 0.5 seconds
    "max_duration": 30.0,  # Duration ≤ 30 seconds
    "min_words": 2,        # At least 2 words
}
Lenient Filtering (Acceptable Quality):
quality_thresholds = {
    "max_wer": 50.0,       # WER ≤ 50%
    "min_duration": 0.3,   # Duration ≥ 0.3 seconds
    "max_duration": 60.0,  # Duration ≤ 60 seconds
    "min_words": 1,        # At least 1 word
}
Filtering mechanism reference: nemo_curator/stages/audio/common.py:71-116 (PreserveByValueStage supports lt, le, eq, ne, ge, gt over a value key)
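Outside a pipeline, the same thresholds can be applied to manifest entries in plain Python. An illustrative sketch only: the wer, duration, and text field names are assumed manifest keys, and PreserveByValueStage remains the supported in-pipeline mechanism:

```python
quality_thresholds = {
    "max_wer": 30.0,
    "min_duration": 0.5,
    "max_duration": 30.0,
    "min_words": 2,
}


def passes_thresholds(entry: dict, thresholds: dict) -> bool:
    """Check one manifest entry (assumed keys: wer, duration, text) against thresholds."""
    return (
        entry["wer"] <= thresholds["max_wer"]
        and thresholds["min_duration"] <= entry["duration"] <= thresholds["max_duration"]
        and len(entry["text"].split()) >= thresholds["min_words"]
    )


manifest = [
    {"text": "hello world example", "duration": 2.1, "wer": 12.5},
    {"text": "uh", "duration": 0.2, "wer": 60.0},
]
kept = [e for e in manifest if passes_thresholds(e, quality_thresholds)]
print(len(kept))  # 1
```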
Language-Specific Considerations#
Different languages require different quality thresholds:
High-Resource Languages (English, Spanish, French):
Lower WER thresholds (≤ 20%)
Standard duration ranges
Extensive ASR model availability
Medium-Resource Languages (German, Italian, Portuguese):
Moderate WER thresholds (≤ 30%)
Slightly more lenient filtering
Good ASR model availability
Low-Resource Languages (Armenian, Estonian, Maltese):
Higher WER thresholds (≤ 50%)
More lenient duration filtering
Limited ASR model options
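One way to encode these tiers is a lookup table from language code to a maximum-WER threshold. An illustrative sketch: the language codes, tier assignments, and strict-by-default fallback are assumptions to adapt to your dataset:

```python
# Resource tier per language (ISO 639-1 codes; assignments are illustrative).
LANGUAGE_TIERS = {
    "en": "high", "es": "high", "fr": "high",
    "de": "medium", "it": "medium", "pt": "medium",
    "hy": "low", "et": "low", "mt": "low",
}

# Maximum acceptable WER per tier, following the thresholds above.
TIER_MAX_WER = {"high": 20.0, "medium": 30.0, "low": 50.0}


def max_wer_for(lang: str) -> float:
    """Return the max-WER threshold for a language, defaulting to the strictest tier."""
    return TIER_MAX_WER[LANGUAGE_TIERS.get(lang, "high")]


print(max_wer_for("de"))  # 30.0
```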
Composite Quality Scores#
Weighted Quality Scoring#
Combine multiple metrics for overall quality assessment:
def calculate_composite_quality(wer: float, duration: float, text: str) -> float:
    """Calculate composite quality score (0-100)."""
    # WER component (50% weight)
    wer_score = max(0, 100 - wer)

    # Duration component (30% weight)
    if 1.0 <= duration <= 15.0:
        duration_score = 100
    elif 0.5 <= duration < 1.0 or 15.0 < duration <= 30.0:
        duration_score = 75
    else:
        duration_score = 25

    # Text length component (20% weight)
    word_count = len(text.split())
    if word_count >= 5:
        length_score = 100
    elif word_count >= 3:
        length_score = 75
    else:
        length_score = 50

    # Weighted combination
    composite_score = (
        0.5 * wer_score
        + 0.3 * duration_score
        + 0.2 * length_score
    )
    return round(composite_score, 2)
Note
This function is an example-only snippet to illustrate a possible scoring approach. It is not a built-in utility. To use it in a pipeline, implement a custom stage that writes a composite_quality field. For end-to-end examples, refer to the custom metrics guidance.
Domain-Specific Scoring#
Conversational Speech:
Emphasis on natural speech patterns
Tolerance for pauses and filler words
Speaker change detection importance
Broadcast Speech:
High accuracy requirements
Clear pronunciation expectations
Background noise considerations
Telephony Speech:
Bandwidth limitations consideration
Compression artifact tolerance
Channel-specific quality factors
Quality Monitoring#
Dataset Quality Distribution#
Monitor quality across your dataset:
import numpy as np


def analyze_quality_distribution(manifest_data: list) -> dict:
    """Analyze quality distribution across dataset."""
    wer_values = [item["wer"] for item in manifest_data]
    duration_values = [item["duration"] for item in manifest_data]
    return {
        "total_samples": len(manifest_data),
        "wer_stats": {
            "mean": np.mean(wer_values),
            "median": np.median(wer_values),
            "std": np.std(wer_values),
            "percentiles": np.percentile(wer_values, [25, 50, 75, 90, 95]),
        },
        "duration_stats": {
            "mean": np.mean(duration_values),
            "median": np.median(duration_values),
            "total_hours": sum(duration_values) / 3600,
        },
        "quality_bins": {
            "excellent": sum(1 for wer in wer_values if wer <= 10),
            "good": sum(1 for wer in wer_values if 10 < wer <= 25),
            "fair": sum(1 for wer in wer_values if 25 < wer <= 50),
            "poor": sum(1 for wer in wer_values if wer > 50),
        },
    }
Note
This distribution function is a documentation example, not part of the shipped API. It requires numpy (import numpy as np). Consider integrating it in analysis notebooks or a custom stage.
Best Practices#
Quality Threshold Selection#
Start Conservative: Begin with strict thresholds (WER ≤ 20%)
Analyze Distribution: Examine quality distribution of your dataset
Adjust Iteratively: Relax thresholds based on data availability
Domain Adaptation: Customize thresholds for your specific use case
Metric Combination#
Primary Metric: Use WER as the main quality indicator
Secondary Filters: Apply duration and text length filters
Value-based Filtering: Apply configurable threshold filtering
Validation: Cross-validate quality with human evaluation
Quality-Performance Trade-offs#
High Quality (Strict Filtering):
Pros: Better model training, higher accuracy
Cons: Reduced dataset size, potential bias
Balanced Quality (Moderate Filtering):
Pros: Good quality with reasonable dataset size
Cons: Some noise in training data
High Coverage (Lenient Filtering):
Pros: Maximum data utilization, diverse content
Cons: Lower average quality, potential model degradation