Audio Quality Metrics

This guide covers the quality metrics used in NeMo Curator for evaluating speech transcription accuracy, audio characteristics, and overall dataset quality.

Transcription Accuracy Metrics

Word Error Rate (WER)

The primary metric for measuring ASR transcription quality:

Definition: Percentage of words that differ between ground truth and predicted transcriptions.

Calculation:

WER = (Substitutions + Deletions + Insertions) / Total_Words × 100

Interpretation:

WER = 0%: Perfect transcription match
WER ≤ 10%: Excellent quality (production-ready)
WER ≤ 25%: Good quality (suitable for most training)
WER ≤ 50%: Moderate quality (may need review)
WER >75%: Poor quality (consider filtering)

Example:

1 # Ground truth: "hello world example"
2 # Prediction:   "hello word example"
3 # WER = 1/3 × 100 = 33.33% (1 substitution out of 3 words)

WER and CER utilities depend on the editdistance package.

Character Error Rate (CER)

More granular accuracy measurement at the character level:

Definition: Percentage of characters that differ between ground truth and predicted transcriptions.

Calculation:

CER = (Character_Substitutions + Character_Deletions + Character_Insertions) / Total_Characters × 100

Use Cases:

Languages with complex morphology
Detailed accuracy analysis
Character-level model evaluation

Example:

1 # Ground truth: "hello"
2 # Prediction:   "helo" 
3 # CER = 1/5 × 100 = 20% (1 deletion out of 5 characters)

Audio Characteristic Metrics

Duration Analysis

Audio Duration: Precise measurement of audio file length in seconds.

Speech Rate Metrics:

Words per Second: word_count / duration
Characters per Second: character_count / duration

To enforce duration thresholds in a pipeline, use PreserveByValueStage.

Format and Technical Metrics

Sample Rate: Audio sampling frequency (typically 16 kHz for ASR) Bit Depth: Audio resolution (16-bit or 24-bit) Channels: Mono (preferred) or stereo audio Encoding format: Compression format (WAV, FLAC preferred for quality)

Quality Assessment Strategies

Threshold-Based Filtering

Conservative Filtering (High Quality):

1 quality_thresholds = {
2     "max_wer": 15.0,        # WER ≤ 15%
3     "min_duration": 1.0,     # Duration ≥ 1 second
4     "max_duration": 20.0,    # Duration ≤ 20 seconds
5     "min_words": 3,          # At least 3 words
6 }

Balanced Filtering (Good Quality):

1 quality_thresholds = {
2     "max_wer": 30.0,        # WER ≤ 30%
3     "min_duration": 0.5,     # Duration ≥ 0.5 seconds
4     "max_duration": 30.0,    # Duration ≤ 30 seconds
5     "min_words": 2,          # At least 2 words
6 }

Lenient Filtering (Acceptable Quality):

1 quality_thresholds = {
2     "max_wer": 50.0,        # WER ≤ 50%
3     "min_duration": 0.3,     # Duration ≥ 0.3 seconds
4     "max_duration": 60.0,    # Duration ≤ 60 seconds
5     "min_words": 1,          # At least 1 word
6 }

Filtering mechanism reference: nemo_curator/stages/audio/common.py:71-116 (PreserveByValueStage supports lt, le, eq, ne, ge, gt over a value key)

Language-Specific Considerations

Different languages require different quality thresholds:

High-Resource Languages (English, Spanish, French):

Lower WER thresholds (≤ 20%)
Standard duration ranges
Extensive ASR model availability

Medium-Resource Languages (German, Italian, Portuguese):

Moderate WER thresholds (≤ 30%)
Slightly more lenient filtering
Good ASR model availability

Low-Resource Languages (Armenian, Estonian, Maltese):

Higher WER thresholds (≤ 50%)
More lenient duration filtering
Limited ASR model options

Composite Quality Scores

Weighted Quality Scoring

Combine multiple metrics for overall quality assessment:

1 def calculate_composite_quality(wer: float, duration: float, text: str) -> float:
2     """Calculate composite quality score (0-100)."""
3     
4     # WER component (50% weight)
5     wer_score = max(0, 100 - wer)
6     
7     # Duration component (30% weight) 
8     if 1.0 &lt;= duration &lt;= 15.0:
9         duration_score = 100
10     elif 0.5 &lt;= duration < 1.0 or 15.0 < duration &lt;= 30.0:
11         duration_score = 75
12     else:
13         duration_score = 25
14     
15     # Text length component (20% weight)
16     word_count = len(text.split())
17     if word_count &gt;= 5:
18         length_score = 100
19     elif word_count &gt;= 3:
20         length_score = 75
21     else:
22         length_score = 50
23     
24     # Weighted combination
25     composite_score = (
26         0.5 * wer_score +
27         0.3 * duration_score + 
28         0.2 * length_score
29     )
30     
31     return round(composite_score, 2)

This function is an example-only snippet to illustrate a possible scoring approach. It is not a built-in utility. To use it in a pipeline, implement a custom stage that writes a composite_quality field. For end-to-end examples, refer to the custom metrics guidance.

Domain-Specific Scoring

Conversational Speech:

Emphasis on natural speech patterns
Tolerance for pauses and filler words
Speaker change detection importance

Broadcast Speech:

High accuracy requirements
Clear pronunciation expectations
Background noise considerations

Telephony Speech:

Bandwidth limitations consideration
Compression artifact tolerance
Channel-specific quality factors

Quality Monitoring

Dataset Quality Distribution

Monitor quality across your dataset:

1 def analyze_quality_distribution(manifest_data: list) -> dict:
2     """Analyze quality distribution across dataset."""
3     
4     wer_values = [item["wer"] for item in manifest_data]
5     duration_values = [item["duration"] for item in manifest_data]
6     
7     return {
8         "total_samples": len(manifest_data),
9         "wer_stats": {
10             "mean": np.mean(wer_values),
11             "median": np.median(wer_values), 
12             "std": np.std(wer_values),
13             "percentiles": np.percentile(wer_values, [25, 50, 75, 90, 95])
14         },
15         "duration_stats": {
16             "mean": np.mean(duration_values),
17             "median": np.median(duration_values),
18             "total_hours": sum(duration_values) / 3600
19         },
20         "quality_bins": {
21             "excellent": sum(1 for wer in wer_values if wer &lt;= 10),
22             "good": sum(1 for wer in wer_values if 10 < wer &lt;= 25),
23             "fair": sum(1 for wer in wer_values if 25 < wer &lt;= 50),
24             "poor": sum(1 for wer in wer_values if wer &gt;50)
25         }
26     }

This distribution function is a documentation example, not part of the shipped API. It requires numpy (such as import numpy as np). Consider integrating it in analysis notebooks or a custom stage.

Best Practices

Quality Threshold Selection

Start Conservative: Begin with strict thresholds (WER ≤ 20%)
Analyze Distribution: Examine quality distribution of your dataset
Adjust Iteratively: Relax thresholds based on data availability
Domain Adaptation: Customize thresholds for your specific use case

Metric Combination

Primary Metric: Use WER as the main quality indicator
Secondary Filters: Apply duration and text length filters
Value-based Filtering: Apply configurable threshold filtering
Validation: Cross-validate quality with human evaluation

Quality-Performance Trade-offs

High Quality (Strict Filtering):

Pros: Better model training, higher accuracy
Cons: Reduced dataset size, potential bias

Balanced Quality (Moderate Filtering):

Pros: Good quality with reasonable dataset size
Cons: Some noise in training data

High Coverage (Lenient Filtering):

Pros: Maximum data utilization, diverse content
Cons: Lower average quality, potential model degradation

1	# Ground truth: "hello world example"
2	# Prediction: "hello word example"
3	# WER = 1/3 × 100 = 33.33% (1 substitution out of 3 words)

1	# Ground truth: "hello"
2	# Prediction: "helo"
3	# CER = 1/5 × 100 = 20% (1 deletion out of 5 characters)

1	quality_thresholds = {
2	"max_wer": 15.0, # WER ≤ 15%
3	"min_duration": 1.0, # Duration ≥ 1 second
4	"max_duration": 20.0, # Duration ≤ 20 seconds
5	"min_words": 3, # At least 3 words
6	}

1	quality_thresholds = {
2	"max_wer": 30.0, # WER ≤ 30%
3	"min_duration": 0.5, # Duration ≥ 0.5 seconds
4	"max_duration": 30.0, # Duration ≤ 30 seconds
5	"min_words": 2, # At least 2 words
6	}

1	quality_thresholds = {
2	"max_wer": 50.0, # WER ≤ 50%
3	"min_duration": 0.3, # Duration ≥ 0.3 seconds
4	"max_duration": 60.0, # Duration ≤ 60 seconds
5	"min_words": 1, # At least 1 word
6	}

1	def calculate_composite_quality(wer: float, duration: float, text: str) -> float:
2	"""Calculate composite quality score (0-100)."""
3
4	# WER component (50% weight)
5	wer_score = max(0, 100 - wer)
6
7	# Duration component (30% weight)
8	if 1.0 <= duration <= 15.0:
9	duration_score = 100
10	elif 0.5 <= duration < 1.0 or 15.0 < duration <= 30.0:
11	duration_score = 75
12	else:
13	duration_score = 25
14
15	# Text length component (20% weight)
16	word_count = len(text.split())
17	if word_count >= 5:
18	length_score = 100
19	elif word_count >= 3:
20	length_score = 75
21	else:
22	length_score = 50
23
24	# Weighted combination
25	composite_score = (
26	0.5 * wer_score +
27	0.3 * duration_score +
28	0.2 * length_score
29	)
30
31	return round(composite_score, 2)

1	def analyze_quality_distribution(manifest_data: list) -> dict:
2	"""Analyze quality distribution across dataset."""
3
4	wer_values = [item["wer"] for item in manifest_data]
5	duration_values = [item["duration"] for item in manifest_data]
6
7	return {
8	"total_samples": len(manifest_data),
9	"wer_stats": {
10	"mean": np.mean(wer_values),
11	"median": np.median(wer_values),
12	"std": np.std(wer_values),
13	"percentiles": np.percentile(wer_values, [25, 50, 75, 90, 95])
14	},
15	"duration_stats": {
16	"mean": np.mean(duration_values),
17	"median": np.median(duration_values),
18	"total_hours": sum(duration_values) / 3600
19	},
20	"quality_bins": {
21	"excellent": sum(1 for wer in wer_values if wer <= 10),
22	"good": sum(1 for wer in wer_values if 10 < wer <= 25),
23	"fair": sum(1 for wer in wer_values if 25 < wer <= 50),
24	"poor": sum(1 for wer in wer_values if wer >50)
25	}
26	}