Duration Filtering
Filter audio samples by duration range, speech rate, and other temporal characteristics to build datasets suited to ASR training and speech processing applications.
Duration-Based Quality Control
Why Duration Matters
Training Efficiency: Duration filtering can improve ASR training by removing samples whose length makes them problematic, such as clips too short to contain usable speech or too long to batch efficiently.
Processing Performance: Duration affects computational requirements:
- Memory usage scales with audio length
- Batch processing efficiency varies with duration variance
- GPU utilization is highest when the samples in a batch have similar lengths
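To make the effect of duration variance concrete, the sketch below estimates how much of a padded batch is padding rather than audio. It assumes every sample in a batch is padded to the length of the longest one, which is a common batching strategy but not something this page specifies. The function name `padding_fraction` is illustrative and not part of NeMo Curator.

```python
def padding_fraction(durations_s: list[float]) -> float:
    """Return the fraction of a padded batch that is padding, not audio.

    Assumes each sample is padded to the longest duration in the batch.
    """
    if not durations_s:
        return 0.0
    longest = max(durations_s)
    if longest <= 0:
        return 0.0
    padded_total = longest * len(durations_s)
    return 1.0 - sum(durations_s) / padded_total


# Two illustrative batches with four samples each (durations in seconds).
uniform_batch = [9.0, 10.0, 10.0, 9.0]
mixed_batch = [1.0, 2.0, 3.0, 30.0]

print(f"uniform batch padding: {padding_fraction(uniform_batch):.2f}")
print(f"mixed batch padding:   {padding_fraction(mixed_batch):.2f}")
```

In this toy case the mixed batch spends most of its compute on padding, which is the cost that a tighter duration range is meant to reduce.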
Basic Duration Filtering
Simple Duration Range
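As a minimal sketch of the idea in plain Python (not the NeMo Curator stage API), a duration range filter keeps only samples whose duration falls between a lower and an upper bound. The field names `audio_filepath` and `duration`, the helper name `filter_by_duration`, and the 1.0 to 20.0 second bounds are assumptions chosen for illustration.

```python
def filter_by_duration(samples, min_s=1.0, max_s=20.0, key="duration"):
    """Keep samples whose duration lies in [min_s, max_s].

    Samples with no duration value are dropped rather than guessed at.
    """
    kept = []
    for sample in samples:
        duration = sample.get(key)
        if duration is None:
            continue
        if min_s <= duration <= max_s:
            kept.append(sample)
    return kept


# Hypothetical manifest entries; durations are in seconds.
samples = [
    {"audio_filepath": "a.wav", "duration": 0.4},
    {"audio_filepath": "b.wav", "duration": 5.2},
    {"audio_filepath": "c.wav", "duration": 20.0},
    {"audio_filepath": "d.wav", "duration": 45.0},
]

kept = filter_by_duration(samples)
print([s["audio_filepath"] for s in kept])  # ['b.wav', 'c.wav']
```

Both bounds are inclusive here, so a sample of exactly 20.0 seconds is kept. Whether the built-in stages treat bounds as inclusive is not stated on this page.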
Use Case-Specific Ranges
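One way to keep use case-specific ranges manageable is to name them once and look them up by use case. The sketch below is plain Python, and every preset name and bound in it is a hypothetical placeholder rather than a recommended value. Choose real bounds after inspecting your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationRange:
    """Inclusive duration bounds in seconds."""
    min_s: float
    max_s: float

    def contains(self, duration_s: float) -> bool:
        return self.min_s <= duration_s <= self.max_s


# Hypothetical presets for illustration only; not values from NeMo Curator.
PRESETS = {
    "asr_training": DurationRange(1.0, 20.0),
    "voice_commands": DurationRange(0.5, 5.0),
    "long_form": DurationRange(10.0, 60.0),
}


def select_durations(durations_s, use_case):
    """Return the durations that fall inside the preset for use_case."""
    preset = PRESETS[use_case]
    return [d for d in durations_s if preset.contains(d)]


durations = [0.3, 0.8, 4.0, 15.0, 45.0]
print(select_durations(durations, "voice_commands"))  # [0.8, 4.0]
print(select_durations(durations, "long_form"))       # [15.0, 45.0]
```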
Speech Rate Analysis
Speech rate metrics (words per second, characters per second) help identify samples with speaking speeds appropriate for your use case.
Calculate Speech Rate Metrics
The built-in speech rate utilities, get_wordrate() and get_charrate(), can be called from a custom processing stage to measure speaking speed and add the resulting metrics to your pipeline data.
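The calculation itself is simple: divide the number of words or characters in the transcript by the audio duration. The sketch below uses stand-in helpers written in plain Python so that it runs without NeMo Curator installed. They are not the built-in get_wordrate() and get_charrate() utilities and may differ from them in details such as rounding or how spaces are counted. The field names `text`, `duration`, `word_rate`, and `char_rate` are assumptions.

```python
def words_per_second(text: str, duration_s: float) -> float:
    """Whitespace-separated word count divided by duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(text.split()) / duration_s


def chars_per_second(text: str, duration_s: float) -> float:
    """Character count (spaces included) divided by duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(text) / duration_s


def add_rate_metrics(sample: dict) -> dict:
    """Return a copy of sample with hypothetical word_rate and char_rate fields."""
    enriched = dict(sample)
    enriched["word_rate"] = words_per_second(sample["text"], sample["duration"])
    enriched["char_rate"] = chars_per_second(sample["text"], sample["duration"])
    return enriched


sample = {"text": "the quick brown fox", "duration": 2.0}
enriched = add_rate_metrics(sample)
print(enriched["word_rate"])  # 2.0
print(enriched["char_rate"])  # 9.5
```

Returning a copy instead of mutating the input keeps the helper safe to reuse when the same sample passes through more than one stage.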
Speech Rate Filtering
If you have pre-calculated speech rate metrics in your data, you can filter based on them:
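A minimal plain-Python sketch of such a filter follows. It is not the built-in stage API. The `word_rate` field name and the 1.0 to 4.0 words-per-second bounds are hypothetical and should be replaced with values that suit your data.

```python
def filter_by_rate(samples, key, min_rate, max_rate):
    """Keep samples whose stored rate metric lies in [min_rate, max_rate].

    Samples that lack the metric are dropped.
    """
    return [
        s for s in samples
        if key in s and min_rate <= s[key] <= max_rate
    ]


# Hypothetical samples with a pre-calculated word_rate (words per second).
samples = [
    {"audio_filepath": "slow.wav", "word_rate": 0.5},
    {"audio_filepath": "normal.wav", "word_rate": 2.5},
    {"audio_filepath": "fast.wav", "word_rate": 6.0},
    {"audio_filepath": "unknown.wav"},
]

kept = filter_by_rate(samples, key="word_rate", min_rate=1.0, max_rate=4.0)
print([s["audio_filepath"] for s in kept])  # ['normal.wav']
```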
This example assumes you have already calculated and stored speech rate metrics in your audio data. The built-in stages do not calculate speech rates automatically; you need to create a custom stage for that.
Filtering by Speech Rate
After you calculate speech rate metrics, filter samples to keep those with appropriate speaking speeds:
Normal Speech Rate Range
These examples assume you have pre-calculated speech rate metrics in your audio data. Use the get_wordrate() and get_charrate() utility functions to calculate these values in a custom processing stage.
Normal Speech Rate Ranges
Typical speech rates for different contexts:
Best Practices
Duration Filtering Strategy
- Analyze First: Understand your dataset’s duration distribution
- Use Case Alignment: Align duration ranges with intended use
- Progressive Filtering: Apply duration filters before computationally expensive stages
- Quality Correlation: Consider correlation between duration and other quality metrics
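For the "Analyze First" step, a short summary of the duration distribution is usually enough to choose sensible bounds. The sketch below uses only the Python standard library. The helper name `duration_summary` and the sample durations are illustrative assumptions, and it needs at least two durations because `statistics.quantiles` requires that.

```python
import statistics


def duration_summary(durations_s):
    """Summarize a list of durations (seconds) with percentiles and totals."""
    ordered = sorted(durations_s)
    # n=100 yields 99 cut points; index 4 is the 5th percentile and
    # index 94 is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "count": len(ordered),
        "min": ordered[0],
        "p5": cuts[4],
        "median": statistics.median(ordered),
        "p95": cuts[94],
        "max": ordered[-1],
        "total_hours": sum(ordered) / 3600.0,
    }


# Illustrative durations in seconds.
durations = [0.4, 2.1, 3.5, 4.0, 5.2, 6.8, 7.5, 12.0, 18.3, 45.0]
for name, value in duration_summary(durations).items():
    print(f"{name}: {value}")
```

Comparing the 5th and 95th percentiles with the minimum and maximum shows how much data a candidate range would discard, which helps avoid both of the pitfalls described below.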
Common Pitfalls
Over-Filtering: Removing too much data, which shrinks the dataset and can narrow the range of utterance lengths it covers.
Under-Filtering: Keeping problematic samples that may negatively impact training or processing efficiency.
Real Working Example
Here’s a complete working example from the NeMo Curator tutorials showing actual duration filtering in practice:
This example comes directly from tutorials/audio/fleurs/pipeline.py and shows the correct parameter names and usage patterns for the built-in stages.
Related Topics
- Quality Assessment Overview: Complete quality filtering workflow
- WER Filtering: Transcription accuracy filtering
- Audio Analysis: Duration calculation and analysis