Duration Filtering

Filter audio samples by duration ranges, speech rate metrics, and temporal characteristics to create optimal datasets for ASR training and speech processing applications.

Duration-Based Quality Control

Why Duration Matters

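Training Efficiency: Duration filtering can improve ASR training by removing samples at the extremes. Very short clips often contain too little speech to be useful, and very long clips increase memory use and padding within a batch.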

Processing Performance: Duration affects computational requirements:

  • Memory usage scales with audio length
  • Batch processing is less efficient when durations within a batch vary widely
  • GPU utilization is best when sample lengths are consistent
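
To make the effect of duration variance concrete, the sketch below estimates how much of a batch is padding when every sample is padded to the longest clip in that batch. The helper `padding_fraction` and the sample durations are illustrative assumptions, not part of NeMo Curator.

```python
def padding_fraction(durations: list[float]) -> float:
    """Fraction of a padded batch that is padding rather than audio.

    Assumes every sample is padded to the longest duration in the batch.
    """
    if not durations:
        return 0.0
    padded_total = max(durations) * len(durations)
    if padded_total == 0:
        return 0.0
    return 1.0 - sum(durations) / padded_total


# Illustrative batches (durations in seconds)
uniform_batch = [9.0, 10.0, 10.0, 9.5]
mixed_batch = [1.0, 2.0, 10.0, 30.0]

print(f"uniform: {padding_fraction(uniform_batch):.1%} padding")
print(f"mixed:   {padding_fraction(mixed_batch):.1%} padding")
```

Under this padding assumption, the uniform batch wastes only a few percent of its length, while the mixed batch is mostly padding.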

Basic Duration Filtering

Simple Duration Range

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage

# Pipeline to attach the stages to (the name is an arbitrary label)
pipeline = Pipeline(name="duration_filtering")

# Calculate duration for each audio file
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
)

# Filter for optimal duration range (1-15 seconds)
min_duration_filter = PreserveByValueStage(
    input_value_key="duration",
    target_value=1.0,
    operator="ge",  # greater than or equal
)

max_duration_filter = PreserveByValueStage(
    input_value_key="duration",
    target_value=15.0,
    operator="le",  # less than or equal
)

# Add to pipeline: duration must be computed before it can be filtered on
pipeline.add_stage(duration_stage)
pipeline.add_stage(min_duration_filter)
pipeline.add_stage(max_duration_filter)
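
For intuition, the two filters above behave like a comparison applied to each manifest entry. The sketch below reproduces that behavior in plain Python on a few hypothetical entries. `keep_by_value` and the operator table are illustrative helpers written for this example, not NeMo Curator's implementation.

```python
import operator

# Map operator names to comparison functions.
# "ge" and "le" mirror the strings used above; the others are assumed for illustration.
OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "ge": operator.ge,
    "gt": operator.gt,
}


def keep_by_value(entries, key, target, op):
    """Keep entries whose value under `key` compares true against `target`."""
    compare = OPERATORS[op]
    return [entry for entry in entries if compare(entry[key], target)]


# Hypothetical manifest entries with precomputed durations (seconds)
entries = [
    {"audio_filepath": "a.wav", "duration": 0.4},
    {"audio_filepath": "b.wav", "duration": 1.0},
    {"audio_filepath": "c.wav", "duration": 7.2},
    {"audio_filepath": "d.wav", "duration": 15.0},
    {"audio_filepath": "e.wav", "duration": 31.5},
]

kept = keep_by_value(entries, "duration", 1.0, "ge")
kept = keep_by_value(kept, "duration", 15.0, "le")
print([entry["audio_filepath"] for entry in kept])
# → ['b.wav', 'c.wav', 'd.wav']
```

Both bounds are inclusive in this sketch, so clips of exactly 1.0 and 15.0 seconds are kept.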

Use Case-Specific Ranges

from nemo_curator.stages.audio.common import PreserveByValueStage

# Duration ranges (in seconds) for different applications
duration_configs = {
    "asr_training": {
        "min_duration": 1.0,
        "max_duration": 20.0,
        "optimal_range": (2.0, 10.0),
    },
    "voice_cloning": {
        "min_duration": 3.0,
        "max_duration": 10.0,
        "optimal_range": (4.0, 8.0),
    },
    "speech_synthesis": {
        "min_duration": 2.0,
        "max_duration": 15.0,
        "optimal_range": (3.0, 12.0),
    },
    "keyword_spotting": {
        "min_duration": 0.5,
        "max_duration": 3.0,
        "optimal_range": (1.0, 2.0),
    },
}


def create_use_case_duration_filter(use_case: str) -> list[PreserveByValueStage]:
    """Create min/max duration filters for a specific use case."""
    # Unknown use cases fall back to the ASR training ranges
    config = duration_configs.get(use_case, duration_configs["asr_training"])

    return [
        PreserveByValueStage(
            input_value_key="duration",
            target_value=config["min_duration"],
            operator="ge",
        ),
        PreserveByValueStage(
            input_value_key="duration",
            target_value=config["max_duration"],
            operator="le",
        ),
    ]
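
The `optimal_range` entries are not used by `create_use_case_duration_filter`. One possible use, sketched below as an assumption rather than documented behavior, is to label each sample as rejected, acceptable, or optimal. `classify_duration` is a hypothetical helper, and the configuration is a subset of the one above, repeated so the sketch is self-contained.

```python
# Subset of the configuration above, repeated so this sketch runs on its own
duration_configs = {
    "asr_training": {"min_duration": 1.0, "max_duration": 20.0, "optimal_range": (2.0, 10.0)},
    "keyword_spotting": {"min_duration": 0.5, "max_duration": 3.0, "optimal_range": (1.0, 2.0)},
}


def classify_duration(duration: float, use_case: str) -> str:
    """Return "optimal", "acceptable", or "rejected" for a duration in seconds."""
    config = duration_configs.get(use_case, duration_configs["asr_training"])
    if not config["min_duration"] <= duration <= config["max_duration"]:
        return "rejected"
    low, high = config["optimal_range"]
    return "optimal" if low <= duration <= high else "acceptable"


for seconds in (0.8, 1.5, 6.0, 25.0):
    print(seconds, classify_duration(seconds, "asr_training"))
```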

Speech Rate Analysis

Speech rate metrics (words per second, characters per second) help identify samples with speaking speeds appropriate for your use case.

Calculate Speech Rate Metrics

The built-in get_wordrate() and get_charrate() utility functions can be called from a custom processing stage to compute speaking speed and add the resulting metrics to your pipeline data.
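
As a rough illustration of what these metrics measure, the sketch below computes words per second and characters per second from a transcript and a duration. The helpers `words_per_second` and `chars_per_second` are hypothetical stand-ins written for this example. The library's own utilities may count whitespace or punctuation differently, so check their definitions before relying on exact values.

```python
def words_per_second(text: str, duration: float) -> float:
    """Whitespace-separated words divided by duration in seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return len(text.split()) / duration


def chars_per_second(text: str, duration: float) -> float:
    """Characters (including spaces) divided by duration in seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return len(text) / duration


# Hypothetical manifest entry
entry = {"text": "the quick brown fox jumps over the lazy dog", "duration": 3.0}
entry["word_rate"] = words_per_second(entry["text"], entry["duration"])
entry["char_rate"] = chars_per_second(entry["text"], entry["duration"])
print(entry["word_rate"], round(entry["char_rate"], 2))
```

Storing the results under `word_rate` and `char_rate` makes them available to value-based filters later in the pipeline.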

Speech Rate Filtering

If you have pre-calculated speech rate metrics in your data, you can filter based on them:

from nemo_curator.stages.audio.common import PreserveByValueStage
from nemo_curator.pipeline import Pipeline

# Example: Filter by speech rate if you have a word_rate field in your data
pipeline = Pipeline(name="speech_rate_filtering")

# Filter by speech rate (1.5-5 words per second)
pipeline.add_stage(
    PreserveByValueStage(
        input_value_key="word_rate",  # Assumes this field exists in your data
        target_value=1.5,
        operator="ge",
    )
)

pipeline.add_stage(
    PreserveByValueStage(
        input_value_key="word_rate",
        target_value=5.0,
        operator="le",
    )
)

This example assumes you have already calculated and stored speech rate metrics in your audio data. The built-in stages do not calculate speech rates automatically; you need to create a custom stage for that.

Filtering by Speech Rate

After you calculate speech rate metrics, filter samples to keep those with appropriate speaking speeds:

Normal Speech Rate Range

from nemo_curator.stages.audio.common import PreserveByValueStage

# Filter by word rate (assumes word_rate field exists in your data)
word_rate_min_filter = PreserveByValueStage(
    input_value_key="word_rate",
    target_value=1.5,
    operator="ge",
)

word_rate_max_filter = PreserveByValueStage(
    input_value_key="word_rate",
    target_value=5.0,
    operator="le",
)

# Filter by character rate (assumes char_rate field exists in your data)
char_rate_min_filter = PreserveByValueStage(
    input_value_key="char_rate",
    target_value=8.0,
    operator="ge",
)

char_rate_max_filter = PreserveByValueStage(
    input_value_key="char_rate",
    target_value=30.0,
    operator="le",
)

These examples assume you have pre-calculated speech rate metrics in your audio data. Use the get_wordrate() and get_charrate() utility functions to calculate these values in a custom processing stage.

Normal Speech Rate Ranges

Typical speech rates for different contexts:

| Context | Words/Second | Characters/Second | Use Case |
| --- | --- | --- | --- |
| Slow/Clear Speech | 1.5 - 2.5 | 8 - 15 | Educational content, accessibility |
| Normal Conversation | 2.5 - 4.0 | 15 - 24 | General ASR training |
| Fast Speech | 4.0 - 5.0 | 24 - 30 | News, presentations |
| Very Fast | >5.0 | >30 | May indicate errors or problematic samples |
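
The table can be turned into a simple lookup, sketched below. `speech_rate_context` is a hypothetical helper. Because adjacent ranges in the table share an endpoint (for example 2.5), this sketch assigns shared boundaries to the faster bucket, which is an arbitrary choice, and it adds a "below range" label for values under 1.5 that the table does not cover.

```python
def speech_rate_context(word_rate: float) -> str:
    """Map a words-per-second value to a context label from the table above."""
    if word_rate < 1.5:
        return "below range"  # not listed in the table; added for completeness
    if word_rate < 2.5:
        return "slow/clear"
    if word_rate < 4.0:
        return "normal"
    if word_rate <= 5.0:
        return "fast"
    return "very fast"  # may indicate transcript or duration errors


for rate in (1.0, 2.0, 3.2, 4.5, 6.1):
    print(rate, speech_rate_context(rate))
```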

Best Practices

Duration Filtering Strategy

  1. Analyze First: Understand your dataset’s duration distribution
  2. Use Case Alignment: Align duration ranges with intended use
  3. Progressive Filtering: Apply duration filters before computationally expensive stages
  4. Quality Correlation: Consider correlation between duration and other quality metrics
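
For the "Analyze First" step, a minimal sketch using only the Python standard library is shown below. It assumes you have already collected durations (in seconds) into a list, for example from a manifest. `summarize_durations` and the sample values are illustrative.

```python
import statistics


def summarize_durations(durations: list[float]) -> dict:
    """Return basic statistics (in seconds) for a list of clip durations."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    # quantiles(n=20) returns 19 cut points: index 0 is p5, index 18 is p95
    cuts = statistics.quantiles(durations, n=20, method="inclusive")
    return {
        "count": len(durations),
        "min": min(durations),
        "p5": cuts[0],
        "median": statistics.median(durations),
        "p95": cuts[18],
        "max": max(durations),
        "total_hours": sum(durations) / 3600,
    }


# Hypothetical durations in seconds
sample_durations = [0.6, 1.8, 2.4, 3.1, 4.7, 5.0, 6.3, 8.9, 12.5, 27.0]
for name, value in summarize_durations(sample_durations).items():
    print(f"{name}: {value:.2f}")
```

Percentiles such as p5 and p95 are often a more useful starting point for thresholds than the minimum and maximum, which a few outliers can dominate.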

Common Pitfalls

Over-Filtering: Removing too much data

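# Check retention rates before applying filters.
# original_count and filtered_count are the sample counts before and after
# filtering; the values below are placeholders.
original_count = 10_000
filtered_count = 4_200
retention_rate = filtered_count / original_count
if retention_rate < 0.5:  # Less than 50% retained
    print("Warning: Very aggressive filtering - consider relaxing thresholds")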
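
To compare candidate thresholds before committing to one, the retention check can be run over several duration ranges at once. The sketch below uses a hypothetical `retention_rate_for_range` helper and made-up durations.

```python
def retention_rate_for_range(durations, min_duration, max_duration):
    """Fraction of samples with min_duration <= duration <= max_duration."""
    if not durations:
        return 0.0
    kept = sum(1 for d in durations if min_duration <= d <= max_duration)
    return kept / len(durations)


# Hypothetical durations (seconds) and candidate ranges to compare
durations = [0.4, 0.9, 1.2, 3.5, 6.0, 8.8, 12.0, 14.5, 19.0, 42.0]
candidates = [(1.0, 15.0), (1.0, 20.0), (2.0, 10.0)]

for low, high in candidates:
    rate = retention_rate_for_range(durations, low, high)
    flag = "  <- aggressive" if rate < 0.5 else ""
    print(f"{low:.1f}-{high:.1f}s keeps {rate:.0%}{flag}")
```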

Under-Filtering: Keeping problematic samples that may negatively impact training or processing efficiency.

Real Working Example

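Here’s a complete working example from the NeMo Curator tutorials. It calculates each sample's duration and filters by WER; to filter on duration as well, add PreserveByValueStage stages on the "duration" key as shown in Basic Duration Filtering.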

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.resources import Resources


def create_audio_pipeline(raw_data_dir: str, wer_threshold: float = 75.0) -> Pipeline:
    """Real working pipeline from NeMo Curator tutorials."""

    pipeline = Pipeline(name="audio_inference", description="Inference audio and filter by WER threshold.")

    # Load FLEURS dataset
    pipeline.add_stage(
        CreateInitialManifestFleursStage(
            lang="hy_am",
            split="dev",
            raw_data_dir=raw_data_dir,
        ).with_(batch_size=4)
    )

    # ASR inference
    pipeline.add_stage(
        InferenceAsrNemoStage(
            model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc"
        ).with_(resources=Resources(gpus=1.0))
    )

    # Calculate WER
    pipeline.add_stage(
        GetPairwiseWerStage(
            text_key="text",
            pred_text_key="pred_text",
            wer_key="wer",
        )
    )

    # Calculate duration
    pipeline.add_stage(
        GetAudioDurationStage(
            audio_filepath_key="audio_filepath",
            duration_key="duration",
        )
    )

    # Filter by WER threshold
    pipeline.add_stage(
        PreserveByValueStage(
            input_value_key="wer",
            target_value=wer_threshold,
            operator="le",
        )
    )

    # Convert to document format
    pipeline.add_stage(AudioToDocumentStage().with_(batch_size=1))

    return pipeline

This example comes directly from tutorials/audio/fleurs/pipeline.py and shows the correct parameter names and usage patterns for the built-in stages.