
Audio Quality Metrics


This guide covers the quality metrics used in NeMo Curator for evaluating speech transcription accuracy, audio characteristics, and overall dataset quality.

Transcription Accuracy Metrics

Word Error Rate (WER)

The primary metric for measuring ASR transcription quality:

Definition: Percentage of words that differ between ground truth and predicted transcriptions.

Calculation:

WER = (Substitutions + Deletions + Insertions) / Total_Words × 100

Interpretation:

  • WER = 0%: Perfect transcription match
  • WER ≤ 10%: Excellent quality (production-ready)
  • WER ≤ 25%: Good quality (suitable for most training)
  • WER ≤ 50%: Moderate quality (may need review)
  • WER > 50%: Poor quality (consider filtering)

Example:

```python
# Ground truth: "hello world example"
# Prediction:   "hello word example"
# WER = 1/3 × 100 = 33.33% (1 substitution out of 3 words)
```

WER and CER utilities depend on the editdistance package.

Character Error Rate (CER)

A more granular accuracy measurement at the character level:

Definition: Percentage of characters that differ between ground truth and predicted transcriptions.

Calculation:

CER = (Character_Substitutions + Character_Deletions + Character_Insertions) / Total_Characters × 100

Use Cases:

  • Languages with complex morphology
  • Detailed accuracy analysis
  • Character-level model evaluation

Example:

```python
# Ground truth: "hello"
# Prediction:   "helo"
# CER = 1/5 × 100 = 20% (1 deletion out of 5 characters)
```
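The two worked examples above can be reproduced programmatically. NeMo Curator's own utilities depend on the editdistance package; the `levenshtein` helper below is a stdlib stand-in used purely for illustration:

```python
def levenshtein(ref, hyp):
    """Minimum edit distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: edit distance over word sequences, as a percentage."""
    ref_words = ref.split()
    return levenshtein(ref_words, hyp.split()) / len(ref_words) * 100

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over characters, as a percentage."""
    return levenshtein(ref, hyp) / len(ref) * 100

print(round(wer("hello world example", "hello word example"), 2))  # 33.33
print(round(cer("hello", "helo"), 2))  # 20.0
```

The same dynamic-programming distance works at both levels: word lists for WER, raw strings for CER.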

Audio Characteristic Metrics

Duration Analysis

Audio Duration: Precise measurement of audio file length in seconds.

Speech Rate Metrics:

  • Words per Second: word_count / duration
  • Characters per Second: character_count / duration
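Both rates are trivial to derive from a manifest entry. A minimal sketch, assuming each entry carries `text` and `duration` fields (note that counting characters with or without spaces is a convention choice; this sketch counts all characters):

```python
def speech_rates(text: str, duration: float) -> dict:
    """Compute words-per-second and characters-per-second for one entry."""
    return {
        "words_per_second": len(text.split()) / duration,
        "chars_per_second": len(text) / duration,  # includes spaces
    }

rates = speech_rates("hello world example", 2.0)
# three words over two seconds -> 1.5 words/sec
```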

To enforce duration thresholds in a pipeline, use PreserveByValueStage.

Format and Technical Metrics

  • Sample Rate: Audio sampling frequency (typically 16 kHz for ASR)
  • Bit Depth: Audio resolution (16-bit or 24-bit)
  • Channels: Mono (preferred) or stereo audio
  • Encoding Format: Compression format (WAV or FLAC preferred for quality)

Quality Assessment Strategies

Threshold-Based Filtering

Conservative Filtering (High Quality):

```python
quality_thresholds = {
    "max_wer": 15.0,       # WER ≤ 15%
    "min_duration": 1.0,   # Duration ≥ 1 second
    "max_duration": 20.0,  # Duration ≤ 20 seconds
    "min_words": 3,        # At least 3 words
}
```

Balanced Filtering (Good Quality):

```python
quality_thresholds = {
    "max_wer": 30.0,       # WER ≤ 30%
    "min_duration": 0.5,   # Duration ≥ 0.5 seconds
    "max_duration": 30.0,  # Duration ≤ 30 seconds
    "min_words": 2,        # At least 2 words
}
```

Lenient Filtering (Acceptable Quality):

```python
quality_thresholds = {
    "max_wer": 50.0,       # WER ≤ 50%
    "min_duration": 0.3,   # Duration ≥ 0.3 seconds
    "max_duration": 60.0,  # Duration ≤ 60 seconds
    "min_words": 1,        # At least 1 word
}
```
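Any of these threshold dictionaries can be applied with a small helper. The function below is an illustrative sketch, not a shipped utility, and assumes manifest entries expose `wer`, `duration`, and `text` fields:

```python
def passes_thresholds(entry: dict, thresholds: dict) -> bool:
    """Return True when a manifest entry satisfies every threshold."""
    word_count = len(entry["text"].split())
    return (
        entry["wer"] <= thresholds["max_wer"]
        and thresholds["min_duration"] <= entry["duration"] <= thresholds["max_duration"]
        and word_count >= thresholds["min_words"]
    )

balanced = {"max_wer": 30.0, "min_duration": 0.5, "max_duration": 30.0, "min_words": 2}
sample = {"wer": 12.0, "duration": 4.2, "text": "hello world example"}
print(passes_thresholds(sample, balanced))  # True
```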

Filtering mechanism reference: nemo_curator/stages/audio/common.py:71-116 (PreserveByValueStage supports lt, le, eq, ne, ge, gt over a value key)
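The comparison semantics of PreserveByValueStage can be mimicked with Python's operator module. The sketch below illustrates lt/le/eq/ne/ge/gt dispatch over a value key; it is not the stage's actual implementation:

```python
import operator

# Map the six supported operator names to their comparison functions.
OPERATORS = {
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
    "ge": operator.ge, "gt": operator.gt,
}

def preserve_by_value(entries: list[dict], key: str, target, op: str = "eq") -> list[dict]:
    """Keep entries whose entry[key] satisfies `op` against `target`."""
    compare = OPERATORS[op]
    return [e for e in entries if compare(e[key], target)]

manifest = [{"wer": 8.0}, {"wer": 42.0}, {"wer": 17.5}]
kept = preserve_by_value(manifest, "wer", 25.0, op="le")
# keeps the entries with WER 8.0 and 17.5
```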

Language-Specific Considerations

Different languages require different quality thresholds:

High-Resource Languages (English, Spanish, French):

  • Lower WER thresholds (≤ 20%)
  • Standard duration ranges
  • Extensive ASR model availability

Medium-Resource Languages (German, Italian, Portuguese):

  • Moderate WER thresholds (≤ 30%)
  • Slightly more lenient filtering
  • Good ASR model availability

Low-Resource Languages (Armenian, Estonian, Maltese):

  • Higher WER thresholds (≤ 50%)
  • More lenient duration filtering
  • Limited ASR model options
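One way to encode these tiers is a lookup table keyed by language code. The ISO 639-1 codes and the default below are illustrative assumptions, not values shipped with NeMo Curator:

```python
# Hypothetical mapping from ISO 639-1 codes to max-WER thresholds,
# following the resource tiers described above.
LANGUAGE_MAX_WER = {
    # high-resource
    "en": 20.0, "es": 20.0, "fr": 20.0,
    # medium-resource
    "de": 30.0, "it": 30.0, "pt": 30.0,
    # low-resource
    "hy": 50.0, "et": 50.0, "mt": 50.0,
}

def max_wer_for(lang: str, default: float = 30.0) -> float:
    """Return the max-WER threshold for a language, with a fallback default."""
    return LANGUAGE_MAX_WER.get(lang, default)
```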

Composite Quality Scores

Weighted Quality Scoring

Combine multiple metrics for overall quality assessment:

```python
def calculate_composite_quality(wer: float, duration: float, text: str) -> float:
    """Calculate composite quality score (0-100)."""

    # WER component (50% weight)
    wer_score = max(0, 100 - wer)

    # Duration component (30% weight)
    if 1.0 <= duration <= 15.0:
        duration_score = 100
    elif 0.5 <= duration < 1.0 or 15.0 < duration <= 30.0:
        duration_score = 75
    else:
        duration_score = 25

    # Text length component (20% weight)
    word_count = len(text.split())
    if word_count >= 5:
        length_score = 100
    elif word_count >= 3:
        length_score = 75
    else:
        length_score = 50

    # Weighted combination
    composite_score = (
        0.5 * wer_score
        + 0.3 * duration_score
        + 0.2 * length_score
    )

    return round(composite_score, 2)
```

This function is an illustrative example of one possible scoring approach; it is not a built-in utility. To use it in a pipeline, implement a custom stage that writes a composite_quality field. For end-to-end examples, refer to the custom metrics guidance.

Domain-Specific Scoring

Conversational Speech:

  • Emphasis on natural speech patterns
  • Tolerance for pauses and filler words
  • Speaker change detection importance

Broadcast Speech:

  • High accuracy requirements
  • Clear pronunciation expectations
  • Background noise considerations

Telephony Speech:

  • Bandwidth limitations consideration
  • Compression artifact tolerance
  • Channel-specific quality factors

Quality Monitoring

Dataset Quality Distribution

Monitor quality across your dataset:

```python
import numpy as np

def analyze_quality_distribution(manifest_data: list) -> dict:
    """Analyze quality distribution across dataset."""

    wer_values = [item["wer"] for item in manifest_data]
    duration_values = [item["duration"] for item in manifest_data]

    return {
        "total_samples": len(manifest_data),
        "wer_stats": {
            "mean": np.mean(wer_values),
            "median": np.median(wer_values),
            "std": np.std(wer_values),
            "percentiles": np.percentile(wer_values, [25, 50, 75, 90, 95]),
        },
        "duration_stats": {
            "mean": np.mean(duration_values),
            "median": np.median(duration_values),
            "total_hours": sum(duration_values) / 3600,
        },
        "quality_bins": {
            "excellent": sum(1 for wer in wer_values if wer <= 10),
            "good": sum(1 for wer in wer_values if 10 < wer <= 25),
            "fair": sum(1 for wer in wer_values if 25 < wer <= 50),
            "poor": sum(1 for wer in wer_values if wer > 50),
        },
    }
```

This distribution function is a documentation example, not part of the shipped API. It requires NumPy (`import numpy as np`). Consider integrating it in analysis notebooks or a custom stage.

Best Practices

Quality Threshold Selection

  1. Start Conservative: Begin with strict thresholds (WER ≤ 20%)
  2. Analyze Distribution: Examine quality distribution of your dataset
  3. Adjust Iteratively: Relax thresholds based on data availability
  4. Domain Adaptation: Customize thresholds for your specific use case
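Steps 2 and 3 can be combined: derive a candidate max-WER threshold from the dataset's own distribution so that a target fraction of samples is retained. A stdlib sketch, not a shipped utility:

```python
def wer_threshold_for_retention(wer_values: list[float], retention: float) -> float:
    """Pick a WER cutoff that keeps roughly `retention` of the samples."""
    ordered = sorted(wer_values)
    index = max(0, int(retention * len(ordered)) - 1)
    return ordered[index]

wers = [5.0, 8.0, 12.0, 20.0, 35.0, 60.0, 80.0, 95.0]
cutoff = wer_threshold_for_retention(wers, 0.75)
# filtering with max_wer = cutoff keeps 6 of the 8 samples (75%)
```

Start with a strict retention target, inspect what the resulting cutoff implies for quality, then relax it only as far as your data budget requires.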

Metric Combination

  1. Primary Metric: Use WER as the main quality indicator
  2. Secondary Filters: Apply duration and text length filters
  3. Value-based Filtering: Apply configurable threshold filtering
  4. Validation: Cross-validate quality with human evaluation

Quality-Performance Trade-offs

High Quality (Strict Filtering):

  • Pros: Better model training, higher accuracy
  • Cons: Reduced dataset size, potential bias

Balanced Quality (Moderate Filtering):

  • Pros: Good quality with reasonable dataset size
  • Cons: Some noise in training data

High Coverage (Lenient Filtering):

  • Pros: Maximum data utilization, diverse content
  • Cons: Lower average quality, potential model degradation