Filter audio samples by quality, using transcription accuracy metrics, duration analysis, and custom quality measures, to build high-quality speech datasets for ASR training.
Audio quality assessment in NeMo Curator focuses on speech-specific metrics that correlate with training data quality:
Word Error Rate (WER) is the primary metric for assessing transcription quality:
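WER measures the percentage of words that differ between the ground truth and predicted transcriptions, counting word substitutions, deletions, and insertions relative to the number of words in the ground truth: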
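As an illustration of the metric itself, here is a minimal pure-Python sketch of WER as a percentage. The names `edit_distance` and `word_error_rate` are hypothetical helpers written for this example; they are not NeMo Curator functions, and the sketch assumes simple whitespace tokenization with no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edits divided by reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcription must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2f}%")
```

Because insertions count as errors, WER can exceed 100% when the prediction is much longer than the ground truth.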
Character Error Rate (CER) provides a more granular accuracy measurement at the character level. The get_cer() function is a utility for calculating CER programmatically:
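To show what a character-level measurement computes, the following is a minimal pure-Python sketch. `levenshtein` and `char_error_rate` are hypothetical names for this example and do not reproduce the signature or implementation of get_cer(); the sketch assumes that spaces count as characters and that no normalization is applied.

```python
def levenshtein(ref, hyp):
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j - 1] + (ref_char != hyp_char),  # substitution
                    previous[j] + 1,                           # deletion
                    current[j - 1] + 1,                        # insertion
                )
            )
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """CER in percent: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


# A single wrong character out of 11 gives a much smaller penalty than at word level.
cer = char_error_rate("hello world", "hallo world")
print(f"CER: {cer:.2f}%")
```

For the same pair of strings, the word-level error would be 50% (one of two words wrong), which is why CER is useful for spotting near-miss transcriptions.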
The WER and CER utilities depend on the editdistance package. These are utility functions typically used within custom stages rather than directly in pipelines.
NeMo Curator provides utility functions for analyzing speaking speed and content density. These functions are designed for use in custom processing stages:
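As a rough sketch of what such metrics measure, speaking speed can be expressed as words per second and content density as characters per second. The function names below are hypothetical and written for illustration only; they are not the NeMo Curator utilities, and the choice to count every character (including spaces) is an assumption.

```python
def words_per_second(text, duration_seconds):
    """Speaking speed: whitespace-separated words divided by audio duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(text.split()) / duration_seconds


def chars_per_second(text, duration_seconds):
    """Content density: characters (assumption: spaces included) per second."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(text) / duration_seconds


transcript = "hello world this is a test"
print(words_per_second(transcript, 3.0))  # 6 words over 3 seconds
print(chars_per_second(transcript, 3.0))
```

Values far outside the typical range for a language often point to a misaligned transcript or a truncated audio file, which makes these rates useful inputs to a custom filtering stage.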
For a complete example of using speech rate metrics in a pipeline, refer to the Duration Filtering guide.
Filter audio samples based on transcription accuracy:
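The idea can be sketched without any pipeline machinery: keep only the entries whose WER is at or below a threshold. This is an illustration, not the NeMo Curator stage API. The entry keys (`audio_filepath`, `wer`), the helper name `filter_by_max_wer`, and the 30% threshold are all assumptions chosen for the example.

```python
def filter_by_max_wer(entries, max_wer):
    """Keep entries whose WER (in percent) does not exceed max_wer."""
    return [entry for entry in entries if entry["wer"] <= max_wer]


# Hypothetical manifest entries with a precomputed WER value.
samples = [
    {"audio_filepath": "a.wav", "wer": 12.5},
    {"audio_filepath": "b.wav", "wer": 48.0},
    {"audio_filepath": "c.wav", "wer": 30.0},
]

kept = filter_by_max_wer(samples, max_wer=30.0)
print([entry["audio_filepath"] for entry in kept])
```

A stricter threshold yields cleaner data at the cost of a smaller dataset, so the value is usually tuned per language and domain.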
Filter by audio length to remove samples that are too short or too long:
The PreserveByValueStage supports several comparison operators:
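To make the role of a comparison operator concrete, here is a small standalone sketch of value-based filtering built on Python's `operator` module. It is not the implementation of PreserveByValueStage: the function `preserve_by_value`, the operator names in `COMPARATORS`, and the `duration` key are assumptions for illustration, so check the stage's reference documentation for the names it actually accepts.

```python
import operator

# Assumed operator names, mapped to standard comparison functions.
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries, key, target, op="eq"):
    """Keep entries where `entry[key] <op> target` holds."""
    if op not in COMPARATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    compare = COMPARATORS[op]
    return [entry for entry in entries if compare(entry[key], target)]


# Chain two filters to keep clips between 1 and 30 seconds long.
clips = [{"duration": 0.4}, {"duration": 5.0}, {"duration": 42.0}]
in_range = preserve_by_value(clips, "duration", 1.0, op="ge")
in_range = preserve_by_value(in_range, "duration", 30.0, op="le")
print(in_range)
```

Chaining one lower-bound and one upper-bound filter is a simple way to express a range check when each filter supports only a single comparison.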
Here’s a complete working example that demonstrates quality assessment: