Audio Quality Metrics
This guide covers the quality metrics used in NeMo Curator for evaluating speech transcription accuracy, audio characteristics, and overall dataset quality.
Transcription Accuracy Metrics
Word Error Rate (WER)
The primary metric for measuring ASR transcription quality:
Definition: The proportion of word-level edit errors (substitutions, deletions, and insertions) relative to the number of words in the ground-truth transcription, expressed as a percentage.
Calculation:
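The standard formulation counts word-level edit operations against the reference length:

```
WER = (S + D + I) / N × 100
```

where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the number of words in the ground-truth transcription. Because insertions are counted, WER can exceed 100%.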
Interpretation:
- WER = 0%: Perfect transcription match
- WER ≤ 10%: Excellent quality (production-ready)
- WER ≤ 25%: Good quality (suitable for most training)
- WER ≤ 50%: Moderate quality (may need review)
- WER > 50%: Poor quality (consider filtering)
Example:
WER and CER utilities depend on the editdistance package.
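As a self-contained illustration of the computation (the shipped utilities rely on editdistance, as noted above; this sketch reimplements word-level edit distance directly so it has no dependencies):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as a percentage via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1) * 100

# One deleted word out of six reference words -> WER of about 16.67%
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```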
Character Error Rate (CER)
More granular accuracy measurement at the character level:
Definition: The proportion of character-level edit errors (substitutions, deletions, and insertions) relative to the number of characters in the ground-truth transcription, expressed as a percentage.
Calculation:
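The same formulation as WER, applied at the character level:

```
CER = (S + D + I) / C × 100
```

where S, D, and I count character substitutions, deletions, and insertions, and C is the number of characters in the ground-truth transcription.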
Use Cases:
- Languages with complex morphology
- Detailed accuracy analysis
- Character-level model evaluation
Example:
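A self-contained sketch, mirroring the word-level computation but operating on characters:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Compute CER as a percentage via character-level Levenshtein distance."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(reference)][len(hypothesis)] / max(len(reference), 1) * 100

# One dropped character out of eleven -> CER of about 9.09%
print(character_error_rate("hello world", "helo world"))
```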
Audio Characteristic Metrics
Duration Analysis
Audio Duration: Precise measurement of audio file length in seconds.
Speech Rate Metrics:
- Words per Second: word_count / duration
- Characters per Second: character_count / duration
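The two speech-rate ratios can be computed in a few lines. This is an illustrative helper, not a shipped utility; counting only non-space characters is an assumption here:

```python
def speech_rate(text: str, duration_seconds: float) -> dict:
    """Compute words-per-second and characters-per-second for a transcript.

    Counts non-space characters only (an assumption for this sketch).
    """
    word_count = len(text.split())
    character_count = len(text.replace(" ", ""))
    return {
        "words_per_second": word_count / duration_seconds,
        "characters_per_second": character_count / duration_seconds,
    }

# 3 words and 9 non-space characters over 2 seconds
print(speech_rate("the cat sat", 2.0))
```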
To enforce duration thresholds in a pipeline, use PreserveByValueStage.
Format and Technical Metrics
- Sample Rate: Audio sampling frequency (typically 16 kHz for ASR)
- Bit Depth: Audio resolution (16-bit or 24-bit)
- Channels: Mono (preferred) or stereo audio
- Encoding Format: Compression format (WAV or FLAC preferred for quality)
Quality Assessment Strategies
Threshold-Based Filtering
Conservative Filtering (High Quality):
Balanced Filtering (Good Quality):
Lenient Filtering (Acceptable Quality):
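The three strategies above might be encoded as threshold profiles like the following. The numeric values are illustrative assumptions tied to the WER bands earlier in this guide, not shipped defaults:

```python
# Illustrative threshold profiles; tune these to your dataset.
FILTER_PROFILES = {
    "conservative": {"max_wer": 10.0, "min_duration": 1.0, "max_duration": 20.0},
    "balanced":     {"max_wer": 25.0, "min_duration": 0.5, "max_duration": 30.0},
    "lenient":      {"max_wer": 50.0, "min_duration": 0.1, "max_duration": 60.0},
}

def passes_filter(profile_name: str, wer: float, duration: float) -> bool:
    """Return True if a sample satisfies the chosen profile's thresholds."""
    p = FILTER_PROFILES[profile_name]
    return wer <= p["max_wer"] and p["min_duration"] <= duration <= p["max_duration"]

print(passes_filter("balanced", wer=20.0, duration=5.0))      # accepted
print(passes_filter("conservative", wer=20.0, duration=5.0))  # rejected: WER too high
```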
Filtering mechanism reference: nemo_curator/stages/audio/common.py:71-116 (PreserveByValueStage supports lt, le, eq, ne, ge, gt over a value key)
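A minimal standalone sketch of that preserve-by-value mechanism. This is not the real stage API; the function and parameter names here are assumptions chosen for illustration:

```python
import operator

# Map operator names (as listed above) to comparison functions.
_OPS = {
    "lt": operator.lt, "le": operator.le, "eq": operator.eq,
    "ne": operator.ne, "ge": operator.ge, "gt": operator.gt,
}

def preserve_by_value(entries: list[dict], value_key: str,
                      target: float, op: str = "le") -> list[dict]:
    """Keep entries whose entry[value_key] satisfies `op` against `target`."""
    compare = _OPS[op]
    return [e for e in entries if compare(e[value_key], target)]

rows = [{"wer": 10.0}, {"wer": 40.0}]
print(preserve_by_value(rows, "wer", 25.0, "le"))  # keeps only the low-WER row
```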
Language-Specific Considerations
Different languages require different quality thresholds:
High-Resource Languages (English, Spanish, French):
- Lower WER thresholds (≤ 20%)
- Standard duration ranges
- Extensive ASR model availability
Medium-Resource Languages (German, Italian, Portuguese):
- Moderate WER thresholds (≤ 30%)
- Slightly more lenient filtering
- Good ASR model availability
Low-Resource Languages (Armenian, Estonian, Maltese):
- Higher WER thresholds (≤ 50%)
- More lenient duration filtering
- Limited ASR model options
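The tiered guidance above can be captured as a small lookup, useful when one pipeline processes multiple languages. The tier names and ceilings simply mirror the lists above:

```python
# Suggested WER ceilings per resource tier, mirroring the guidance above.
WER_CEILING = {"high": 20.0, "medium": 30.0, "low": 50.0}

def keep_sample(resource_tier: str, wer: float) -> bool:
    """Return True if a sample's WER is within its language tier's ceiling."""
    return wer <= WER_CEILING[resource_tier]

print(keep_sample("medium", wer=25.0))  # within the 30% ceiling
print(keep_sample("high", wer=25.0))    # over the 20% ceiling
```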
Composite Quality Scores
Weighted Quality Scoring
Combine multiple metrics for overall quality assessment:
This function is an example-only snippet to illustrate a possible scoring approach. It is not a built-in utility. To use it in a pipeline, implement a custom stage that writes a composite_quality field. For end-to-end examples, refer to the custom metrics guidance.
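One possible weighted combination is sketched below. Per the note above it is example-only; the weights and the "preferred" duration and speech-rate ranges are assumptions to adapt to your data:

```python
def composite_quality(wer: float, duration: float, words_per_second: float) -> float:
    """Example-only composite score in [0, 1]; higher is better.

    Weights and preferred ranges are illustrative assumptions.
    """
    wer_score = max(0.0, 1.0 - wer / 100.0)                  # lower WER -> higher score
    duration_score = 1.0 if 1.0 <= duration <= 30.0 else 0.5  # assumed preferred range
    rate_score = 1.0 if 1.0 <= words_per_second <= 4.0 else 0.5
    return 0.6 * wer_score + 0.2 * duration_score + 0.2 * rate_score

# Perfect transcription, typical duration and speech rate -> score of 1.0
print(composite_quality(wer=0.0, duration=5.0, words_per_second=2.0))
```

A custom stage would call a function like this per sample and write the result to a composite_quality field, as described above.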
Domain-Specific Scoring
Conversational Speech:
- Emphasis on natural speech patterns
- Tolerance for pauses and filler words
- Speaker change detection importance
Broadcast Speech:
- High accuracy requirements
- Clear pronunciation expectations
- Background noise considerations
Telephony Speech:
- Bandwidth limitations consideration
- Compression artifact tolerance
- Channel-specific quality factors
Quality Monitoring
Dataset Quality Distribution
Monitor quality across your dataset:
This distribution function is a documentation example, not part of the shipped API. It requires the numpy package (imported as import numpy as np). Consider integrating it into analysis notebooks or a custom stage.
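A sketch of such a distribution summary, using the quality bands defined earlier in this guide (the returned field names are illustrative):

```python
import numpy as np

def quality_distribution(wer_values: list[float]) -> dict:
    """Summarize the WER distribution of a dataset (documentation example only)."""
    arr = np.asarray(wer_values, dtype=float)
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "p90": float(np.percentile(arr, 90)),
        # Shares of samples falling in the interpretation bands above:
        "excellent_pct": float((arr <= 10).mean() * 100),              # WER <= 10%
        "moderate_pct": float(((arr > 25) & (arr <= 50)).mean() * 100),
        "poor_pct": float((arr > 50).mean() * 100),                    # WER > 50%
    }

print(quality_distribution([5.0, 15.0, 30.0, 60.0]))
```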
Best Practices
Quality Threshold Selection
- Start Conservative: Begin with strict thresholds (WER ≤ 20%)
- Analyze Distribution: Examine quality distribution of your dataset
- Adjust Iteratively: Relax thresholds based on data availability
- Domain Adaptation: Customize thresholds for your specific use case
Metric Combination
- Primary Metric: Use WER as the main quality indicator
- Secondary Filters: Apply duration and text length filters
- Value-based Filtering: Apply configurable threshold filtering
- Validation: Cross-validate quality with human evaluation
Quality-Performance Trade-offs
High Quality (Strict Filtering):
- Pros: Better model training, higher accuracy
- Cons: Reduced dataset size, potential bias
Balanced Quality (Moderate Filtering):
- Pros: Good quality with reasonable dataset size
- Cons: Some noise in training data
High Coverage (Lenient Filtering):
- Pros: Maximum data utilization, diverse content
- Cons: Lower average quality, potential model degradation