*** description: >- Core concepts for evaluating speech transcription quality using WER, CER, duration analysis, and speech rate metrics categories: * concepts-architecture tags: * quality-metrics * wer * cer * speech-quality * transcription-accuracy personas: * data-scientist-focused * mle-focused difficulty: beginner content\_type: concept modality: audio-only *** # Audio Quality Metrics This guide covers the quality metrics used in NeMo Curator for evaluating speech transcription accuracy, audio characteristics, and overall dataset quality. ## Transcription Accuracy Metrics ### Word Error Rate (WER) The primary metric for measuring ASR transcription quality: **Definition**: Percentage of words that differ between ground truth and predicted transcriptions. **Calculation**: ``` WER = (Substitutions + Deletions + Insertions) / Total_Words × 100 ``` **Interpretation**: * **WER = 0%**: Perfect transcription match * **WER ≤ 10%**: Excellent quality (production-ready) * **WER ≤ 25%**: Good quality (suitable for most training) * **WER ≤ 50%**: Moderate quality (may need review) * **WER >75%**: Poor quality (consider filtering) **Example**: ```python # Ground truth: "hello world example" # Prediction: "hello word example" # WER = 1/3 × 100 = 33.33% (1 substitution out of 3 words) ``` WER and CER utilities depend on the `editdistance` package. ### Character Error Rate (CER) More granular accuracy measurement at the character level: **Definition**: Percentage of characters that differ between ground truth and predicted transcriptions. **Calculation**: ``` CER = (Character_Substitutions + Character_Deletions + Character_Insertions) / Total_Characters × 100 ``` **Use Cases**: * Languages with complex morphology * Detailed accuracy analysis * Character-level model evaluation **Example**: ```python # Ground truth: "hello" # Prediction: "helo" # CER = 1/5 × 100 = 20% (1 deletion out of 5 characters) ``` ## Audio Characteristic Metrics ### Duration Analysis **Audio Duration**: Precise measurement of audio file length in seconds. **Speech Rate Metrics**: * **Words per Second**: `word_count / duration` * **Characters per Second**: `character_count / duration` To enforce duration thresholds in a pipeline, use `PreserveByValueStage`. ### Format and Technical Metrics **Sample Rate**: Audio sampling frequency (typically 16 kHz for ASR) **Bit Depth**: Audio resolution (16-bit or 24-bit) **Channels**: Mono (preferred) or stereo audio **Encoding format**: Compression format (WAV, FLAC preferred for quality) ## Quality Assessment Strategies ### Threshold-Based Filtering **Conservative Filtering** (High Quality): ```python quality_thresholds = { "max_wer": 15.0, # WER ≤ 15% "min_duration": 1.0, # Duration ≥ 1 second "max_duration": 20.0, # Duration ≤ 20 seconds "min_words": 3, # At least 3 words } ``` **Balanced Filtering** (Good Quality): ```python quality_thresholds = { "max_wer": 30.0, # WER ≤ 30% "min_duration": 0.5, # Duration ≥ 0.5 seconds "max_duration": 30.0, # Duration ≤ 30 seconds "min_words": 2, # At least 2 words } ``` **Lenient Filtering** (Acceptable Quality): ```python quality_thresholds = { "max_wer": 50.0, # WER ≤ 50% "min_duration": 0.3, # Duration ≥ 0.3 seconds "max_duration": 60.0, # Duration ≤ 60 seconds "min_words": 1, # At least 1 word } ``` Filtering mechanism reference: `nemo_curator/stages/audio/common.py:71-116` (`PreserveByValueStage` supports `lt`, `le`, `eq`, `ne`, `ge`, `gt` over a value key) ### Language-Specific Considerations Different languages require different quality thresholds: **High-Resource Languages** (English, Spanish, French): * Lower WER thresholds (≤ 20%) * Standard duration ranges * Extensive ASR model availability **Medium-Resource Languages** (German, Italian, Portuguese): * Moderate WER thresholds (≤ 30%) * Slightly more lenient filtering * Good ASR model availability **Low-Resource Languages** (Armenian, Estonian, Maltese): * Higher WER thresholds (≤ 50%) * More lenient duration filtering * Limited ASR model options ## Composite Quality Scores ### Weighted Quality Scoring Combine multiple metrics for overall quality assessment: ```python def calculate_composite_quality(wer: float, duration: float, text: str) -> float: """Calculate composite quality score (0-100).""" # WER component (50% weight) wer_score = max(0, 100 - wer) # Duration component (30% weight) if 1.0 <= duration <= 15.0: duration_score = 100 elif 0.5 <= duration < 1.0 or 15.0 < duration <= 30.0: duration_score = 75 else: duration_score = 25 # Text length component (20% weight) word_count = len(text.split()) if word_count >= 5: length_score = 100 elif word_count >= 3: length_score = 75 else: length_score = 50 # Weighted combination composite_score = ( 0.5 * wer_score + 0.3 * duration_score + 0.2 * length_score ) return round(composite_score, 2) ``` This function is an example-only snippet to illustrate a possible scoring approach. It is not a built-in utility. To use it in a pipeline, implement a custom stage that writes a `composite_quality` field. For end-to-end examples, refer to the custom metrics guidance. ### Domain-Specific Scoring **Conversational Speech**: * Emphasis on natural speech patterns * Tolerance for pauses and filler words * Speaker change detection importance **Broadcast Speech**: * High accuracy requirements * Clear pronunciation expectations * Background noise considerations **Telephony Speech**: * Bandwidth limitations consideration * Compression artifact tolerance * Channel-specific quality factors ## Quality Monitoring ### Dataset Quality Distribution Monitor quality across your dataset: ```python def analyze_quality_distribution(manifest_data: list) -> dict: """Analyze quality distribution across dataset.""" wer_values = [item["wer"] for item in manifest_data] duration_values = [item["duration"] for item in manifest_data] return { "total_samples": len(manifest_data), "wer_stats": { "mean": np.mean(wer_values), "median": np.median(wer_values), "std": np.std(wer_values), "percentiles": np.percentile(wer_values, [25, 50, 75, 90, 95]) }, "duration_stats": { "mean": np.mean(duration_values), "median": np.median(duration_values), "total_hours": sum(duration_values) / 3600 }, "quality_bins": { "excellent": sum(1 for wer in wer_values if wer <= 10), "good": sum(1 for wer in wer_values if 10 < wer <= 25), "fair": sum(1 for wer in wer_values if 25 < wer <= 50), "poor": sum(1 for wer in wer_values if wer >50) } } ``` This distribution function is a documentation example, not part of the shipped API. It requires `numpy` (such as `import numpy as np`). Consider integrating it in analysis notebooks or a custom stage. ## Best Practices ### Quality Threshold Selection 1. **Start Conservative**: Begin with strict thresholds (WER ≤ 20%) 2. **Analyze Distribution**: Examine quality distribution of your dataset 3. **Adjust Iteratively**: Relax thresholds based on data availability 4. **Domain Adaptation**: Customize thresholds for your specific use case ### Metric Combination 1. **Primary Metric**: Use WER as the main quality indicator 2. **Secondary Filters**: Apply duration and text length filters 3. **Value-based Filtering**: Apply configurable threshold filtering 4. **Validation**: Cross-validate quality with human evaluation ### Quality-Performance Trade-offs **High Quality (Strict Filtering)**: * Pros: Better model training, higher accuracy * Cons: Reduced dataset size, potential bias **Balanced Quality (Moderate Filtering)**: * Pros: Good quality with reasonable dataset size * Cons: Some noise in training data **High Coverage (Lenient Filtering)**: * Pros: Maximum data utilization, diverse content * Cons: Lower average quality, potential model degradation