This guide covers the quality metrics used in NeMo Curator for evaluating speech transcription accuracy, audio characteristics, and overall dataset quality.
Word Error Rate (WER) is the primary metric for measuring ASR transcription quality:
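Definition: The word-level edit distance (substitutions, deletions, and insertions) between the predicted and ground truth transcriptions, expressed as a percentage of the number of ground truth words.
Calculation: WER = (substitutions + deletions + insertions) / (number of words in the ground truth) × 100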
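Interpretation: Lower is better. A WER of 0% means the prediction matches the ground truth exactly, and WER can exceed 100% when the prediction contains many inserted words.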
Example:
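A minimal sketch of a WER computation, assuming whitespace tokenization and no text normalization. The helper names `edit_distance` and `word_error_rate` are hypothetical and are not NeMo Curator APIs; this version uses a pure-Python Levenshtein distance instead of the editdistance package so that it runs without extra dependencies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent, relative to the number of reference words (hypothetical helper)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2f}%")  # WER: 33.33%
```

In practice, both strings are usually normalized (case, punctuation) in the same way before scoring, since the result is sensitive to such differences.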
WER and CER utilities depend on the editdistance package.
Character Error Rate (CER) is a more granular accuracy measurement at the character level:
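Definition: The character-level edit distance (substitutions, deletions, and insertions) between the predicted and ground truth transcriptions, expressed as a percentage of the number of ground truth characters.
Calculation: CER = (substitutions + deletions + insertions) / (number of characters in the ground truth) × 100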
Use Cases:
Example:
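A minimal sketch of a CER computation, assuming that spaces count as characters and that no normalization is applied. The helper names `edit_distance` and `char_error_rate` are hypothetical, not NeMo Curator APIs, and the pure-Python Levenshtein distance stands in for the editdistance package.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """CER in percent, relative to the number of reference characters (hypothetical helper)."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


reference, hypothesis = "hello world", "helo wrld"

# Two deleted characters over 11 reference characters.
cer = char_error_rate(reference, hypothesis)
print(f"CER: {cer:.2f}%")  # CER: 18.18%

# The same pair scored at the word level: both words are wrong.
word_errors = edit_distance(reference.split(), hypothesis.split())
print(f"WER: {100.0 * word_errors / len(reference.split()):.2f}%")  # WER: 100.00%
```

The contrast between the two scores shows why CER is the more granular metric: small spelling slips count as full word errors in WER but only as single-character errors in CER.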
Audio Duration: Precise measurement of audio file length in seconds.
Speech Rate Metrics:
Words per second: word_count / duration
Characters per second: character_count / duration
To enforce duration thresholds in a pipeline, use PreserveByValueStage.
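The two speech rate ratios can be sketched as a small helper. This is an illustration only: the function name `speech_rate` and the output keys are hypothetical, words are split on whitespace, and the character count includes spaces, which is an assumption the source does not spell out.

```python
def speech_rate(text, duration_s):
    """Return words and characters per second for one transcribed utterance."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    word_count = len(text.split())  # assumption: whitespace tokenization
    character_count = len(text)     # assumption: spaces are counted
    return {
        "words_per_second": word_count / duration_s,
        "chars_per_second": character_count / duration_s,
    }


rates = speech_rate("the quick brown fox", duration_s=2.0)
print(rates)  # {'words_per_second': 2.0, 'chars_per_second': 9.5}
```

Unusually high or low rates often point to a transcript that does not match its audio, or to a wrong duration value.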
Sample Rate: Audio sampling frequency (typically 16 kHz for ASR).
Bit Depth: Audio resolution (16-bit or 24-bit).
Channels: Mono (preferred) or stereo audio.
Encoding Format: Compression format (WAV or FLAC preferred for quality).
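For PCM WAV files, these format properties can be read with Python's standard `wave` module. The sketch below is not part of NeMo Curator; the helper name `inspect_wav` and the output keys are hypothetical, and the script writes a short silent file so that it is runnable as is. Other formats such as FLAC need a third-party audio library.

```python
import os
import tempfile
import wave


def inspect_wav(path):
    """Return basic format properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "sample_rate_hz": sample_rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / sample_rate,
        }


# Write one second of 16 kHz, 16-bit mono silence as test input.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

info = inspect_wav(path)
print(info)
# {'sample_rate_hz': 16000, 'bit_depth': 16, 'channels': 1, 'duration_s': 1.0}
```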
Conservative Filtering (High Quality):
Balanced Filtering (Good Quality):
Lenient Filtering (Acceptable Quality):
Filtering mechanism reference: nemo_curator/stages/audio/common.py:71-116 (PreserveByValueStage supports lt, le, eq, ne, ge, gt over a value key)
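The comparison semantics can be illustrated with a plain-Python stand-in. This is not the actual `PreserveByValueStage` implementation or its signature: the function `preserve_by_value`, the entry fields, and the threshold values are all hypothetical, and only the six operator names come from the reference above.

```python
import operator

# The six comparison names map directly onto functions in the operator module.
OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries, key, target, op="eq"):
    """Keep entries where `entry[key] <op> target` holds (hypothetical stand-in)."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    compare = OPERATORS[op]
    return [entry for entry in entries if compare(entry[key], target)]


entries = [
    {"audio_filepath": "a.wav", "wer": 12.5, "duration": 3.2},
    {"audio_filepath": "b.wav", "wer": 48.0, "duration": 7.9},
    {"audio_filepath": "c.wav", "wer": 30.0, "duration": 0.4},
]

# Illustrative thresholds: keep WER <= 30, then keep duration >= 1 second.
kept = preserve_by_value(entries, "wer", 30.0, op="le")
kept = preserve_by_value(kept, "duration", 1.0, op="ge")
print([entry["audio_filepath"] for entry in kept])  # ['a.wav']
```

Chaining one comparison per call mirrors how a range filter (a minimum and a maximum) would be built from two single-operator steps.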
Different languages require different quality thresholds:
High-Resource Languages (English, Spanish, French):
Medium-Resource Languages (German, Italian, Portuguese):
Low-Resource Languages (Armenian, Estonian, Maltese):
Combine multiple metrics for overall quality assessment:
This function is an example-only snippet to illustrate a possible scoring approach. It is not a built-in utility. To use it in a pipeline, implement a custom stage that writes a composite_quality field. For end-to-end examples, refer to the custom metrics guidance.
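One possible shape for such a scoring function is sketched here. Everything in it is an assumption made for illustration: the name `composite_quality`, the two sub-scores, the 0.7/0.3 weights, the WER cap, and the duration range are hypothetical and are not recommended values.

```python
def clamp01(x):
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def composite_quality(wer, duration_s, weights=(0.7, 0.3),
                      max_wer=100.0, duration_range=(1.0, 30.0)):
    """Weighted average of an accuracy score and a duration score, in [0, 1]."""
    # Accuracy: 1.0 at 0% WER, falling linearly to 0.0 at max_wer.
    accuracy_score = clamp01(1.0 - wer / max_wer)
    # Duration: 1.0 inside the accepted range, 0.0 outside it.
    low, high = duration_range
    duration_score = 1.0 if low <= duration_s <= high else 0.0
    w_acc, w_dur = weights
    return (w_acc * accuracy_score + w_dur * duration_score) / (w_acc + w_dur)


# A custom stage could write the result back onto each entry like this.
entry = {"wer": 20.0, "duration": 5.0}
entry["composite_quality"] = composite_quality(entry["wer"], entry["duration"])
print(f"{entry['composite_quality']:.2f}")  # 0.86
```

Normalizing each sub-score to [0, 1] before weighting keeps the metrics comparable, so the weights express relative importance rather than compensating for different units.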
Conversational Speech:
Broadcast Speech:
Telephony Speech:
Monitor quality across your dataset:
This distribution function is a documentation example, not part of the shipped API. It requires numpy (for example, import numpy as np). Consider integrating it in analysis notebooks or a custom stage.
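A stdlib-only variant of such a distribution summary might look like the following sketch, which uses the `statistics` module in place of numpy. The function name `quality_distribution` and the output keys are hypothetical, and the input is assumed to be a list of per-utterance WER values.

```python
import statistics


def quality_distribution(values):
    """Summarize a list of metric values (for example, per-utterance WER)."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute quartiles")
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between the observed data points.
    p25, median, p75 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": median,
        "p25": p25,
        "p75": p75,
        "min": ordered[0],
        "max": ordered[-1],
    }


summary = quality_distribution([10.0, 50.0, 30.0, 20.0, 40.0])
print(summary)
```

Comparing the quartiles against a candidate threshold gives a quick estimate of how much data a filter would remove before it is applied to the full dataset.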
High Quality (Strict Filtering):
Balanced Quality (Moderate Filtering):
High Coverage (Lenient Filtering):