AudioDataFilterStage Composite Pipeline

AudioDataFilterStage is a CompositeStage that decomposes into a configurable sequence of audio sub-stages for extracting clean single-speaker segments from raw audio files. Use it when you want the full quality-filtering pipeline driven by a single YAML config instead of wiring stages individually.

Understanding the Composite

What It Does

A CompositeStage is a stage that, at pipeline build time, expands into a sequence of underlying stages. AudioDataFilterStage expands into the audio quality-filtering chain — preprocessing, VAD, band, UTMOS, SIGMOS, concatenation, speaker separation, per-speaker filters, and timestamp mapping — using parameters loaded from a YAML config.

This serves two purposes:

  1. Single-config pipelines: tune the entire pipeline in one place instead of editing many pipeline.add_stage(...) calls.
  2. Resource declarations live with the stage: each sub-stage’s CPU/GPU allocation is set in the same YAML, alongside its functional parameters.
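
If you want to see what the composite will expand into before running, the following is a minimal sketch. It assumes the CompositeStage base class exposes a decompose() method that returns the expanded stage list; verify the method name against your installed NeMo Curator version.

from nemo_curator.stages.audio.advanced_pipelines.audio_data_filter import AudioDataFilterStage

# Build the composite from the bundled default config and list the
# sub-stages it will expand into at pipeline build time.
audio_filter = AudioDataFilterStage()
for stage in audio_filter.decompose():  # assumed CompositeStage API
    print(type(stage).__name__)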

Default Pipeline Order

When all sub-stages are enabled, the composite expands into:

  1. MonoConversionStage — normalize channels and sample rate.
  2. VADSegmentationStage — split into speech segments.
  3. BandFilterStage — drop segments not matching the target bandwidth.
  4. UTMOSFilterStage — drop segments below the MOS threshold.
  5. SIGMOSFilterStage — drop segments failing any active SIGMOS dimension.
  6. SegmentConcatenationStage — concatenate surviving segments with silence gaps.
  7. SpeakerSeparationStage — diarize and fan out one task per speaker.
  8. Per-speaker filters — rerun VAD + Band + UTMOS + SIGMOS on each speaker’s audio.
  9. TimestampMapperStage — project final boundaries back to original-file timestamps.

When to Use the Composite vs Individual Stages

| Approach | Use when |
| --- | --- |
| AudioDataFilterStage (composite) | Standard end-to-end curation; you want YAML-driven configuration; you want all sub-stages enabled. |
| Individual stages | You only need part of the pipeline (e.g., VAD + UTMOS without speaker separation; see the sketch below), or you need to interleave audio stages with custom code. |
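
As a sketch of the individual-stage route, the following wires VAD and UTMOS directly. The import paths are assumptions modeled on the composite's module layout (verify against your installed version); the constructor arguments mirror the YAML keys documented below.

# Hypothetical import paths -- confirm against your NeMo Curator version.
from nemo_curator.stages.audio.vad import VADSegmentationStage
from nemo_curator.stages.audio.utmos import UTMOSFilterStage

# VAD + UTMOS only: segment speech, then drop low-MOS segments,
# with no band filter, SIGMOS, or speaker separation.
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))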

Basic Usage

Step 1: Pick a Config Source

Construct the stage from a YAML config file:

from nemo_curator.stages.audio.advanced_pipelines.audio_data_filter import AudioDataFilterStage

audio_filter = AudioDataFilterStage(config_path="./audio_filter.yaml")
pipeline.add_stage(audio_filter)

Or pass a config dict inline:

audio_filter = AudioDataFilterStage(
    config={
        "vad": {"enable": True, "min_duration_sec": 1.0},
        "utmos": {"enable": True, "mos_threshold": 3.0},
        "sigmos": {"enable": False},
        "speaker_separation": {"enable": True},
    },
)

If neither config_path nor config is provided, the bundled default config is used.
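
For a quick start with the bundled defaults, construct the stage with no arguments:

audio_filter = AudioDataFilterStage()  # falls back to the bundled default config
pipeline.add_stage(audio_filter)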

Step 2: Customize the YAML

Each top-level key maps to one sub-stage; set enable: false to skip it. The default configuration shipped with the stage:

mono_conversion:
  output_sample_rate: 48000
  strict_sample_rate: true
  cpus: 1.0

vad:
  enable: true
  min_duration_sec: 2.0
  max_duration_sec: 60.0
  threshold: 0.5
  min_interval_ms: 500
  speech_pad_ms: 300
  cpus: 1.0
  gpus: 0.1

band_filter:
  enable: true
  band_value: full_band
  cpus: 1.0
  gpus: 0.0

utmos:
  enable: true
  mos_threshold: 3.4
  cpus: 1.0
  gpus: 0.1

sigmos:
  enable: true
  noise_threshold: 4.0
  ovrl_threshold: 3.5
  sig_threshold: null
  col_threshold: null
  disc_threshold: null
  loud_threshold: null
  reverb_threshold: null
  cpus: 1.0
  gpus: 0.1

concatenation:
  silence_duration_sec: 0.5
  cpus: 1.0

speaker_separation:
  enable: true
  exclude_overlaps: true
  min_duration: 0.8
  gap_threshold: 0.1
  buffer_time: 0.5
  cpus: 1.0
  gpus: 0.3

timestamp_mapper:
  passthrough_keys: null
  cpus: 1.0

Each parameter mirrors the corresponding sub-stage's constructor argument. See the per-stage pages linked at the bottom for parameter details.

The default UTMOS threshold in the YAML config is 3.4, while the standalone UTMOSFilterStage class default is 3.5. The composite uses the YAML value when constructed from the bundled config; tune as needed for your data.
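
To get the same behavior regardless of how the stage is constructed, pin the threshold explicitly in your YAML:

utmos:
  enable: true
  mos_threshold: 3.4  # set explicitly rather than relying on either default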

Step 3: Disable Unneeded Stages

Any sub-stage that exposes an enable key can be skipped by setting it to false. Common partial pipelines (a YAML sketch of the first follows the table):

| Pipeline | Disable |
| --- | --- |
| VAD-only | band_filter.enable: false, utmos.enable: false, sigmos.enable: false, speaker_separation.enable: false |
| Quality-only | speaker_separation.enable: false (keeps audio whole instead of fanning out per speaker) |
| Single-speaker known | speaker_separation.enable: false (substantial GPU savings when input has one speaker) |
| No bandwidth filtering | band_filter.enable: false |
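
For example, a minimal sketch of the VAD-only variant (every key shown appears in the default config above):

vad:
  enable: true
band_filter:
  enable: false
utmos:
  enable: false
sigmos:
  enable: false
speaker_separation:
  enable: false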

Common Configurations

High-Quality TTS Training Data

Strict thresholds across all dimensions, no narrowband, no high-reverb:

mono_conversion:
  output_sample_rate: 48000
vad:
  enable: true
  min_duration_sec: 3.0
band_filter:
  enable: true
  band_value: full_band  # full-band only
utmos:
  enable: true
  mos_threshold: 4.0  # strict
sigmos:
  enable: true
  noise_threshold: 4.5
  ovrl_threshold: 4.0
  reverb_threshold: 3.5
  disc_threshold: 4.0
speaker_separation:
  enable: true

Permissive Web-Crawl Curation

Looser thresholds; preserve more data; rely on downstream training to filter further:

mono_conversion:
  output_sample_rate: 16000  # narrow-band acceptable
  strict_sample_rate: false  # auto-resample
vad:
  enable: true
  threshold: 0.4  # lenient
band_filter:
  enable: false  # accept any bandwidth
utmos:
  enable: true
  mos_threshold: 3.0
sigmos:
  enable: true
  noise_threshold: 3.5
  ovrl_threshold: 3.0
speaker_separation:
  enable: true

ASR Training (Single-Speaker Read Speech)

Skip speaker separation since each file is known to have one speaker:

mono_conversion:
  output_sample_rate: 16000
vad:
  enable: true
  min_duration_sec: 2.0
band_filter:
  enable: true
  band_value: narrow_band  # match deployment
utmos:
  enable: true
  mos_threshold: 3.5
sigmos:
  enable: true
speaker_separation:
  enable: false  # skip: single speaker

Complete Pipeline Example

A pipeline that uses AudioDataFilterStage as the entire processing chain:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.advanced_pipelines.audio_data_filter import AudioDataFilterStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_data_filter")

# Reads from a manifest and produces filtered AudioTask per speaker
pipeline.add_stage(AudioDataFilterStage(config_path="./audio_filter.yaml"))

# Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)

For a complete end-to-end walkthrough including dataset download, see the ReadSpeech Tutorial.

Best Practices

  • Start from the default config and tune one knob at a time: don’t tighten thresholds on five dimensions at once. You’ll lose visibility into which one rejected each dropped segment.
  • Disable speaker separation when you can: it’s the most expensive sub-stage. If your input has known single-speaker audio, set speaker_separation.enable: false for a substantial speedup.
  • Match resources to hardware: the cpus / gpus keys per sub-stage control parallelism. On a 16-CPU / 4-GPU node, the defaults work well; tune up for larger nodes.
  • Use strict_sample_rate: false only when needed: auto-resampling can mask data-quality bugs (unexpected 8 kHz audio in a 48 kHz dataset). Default to strict and disable only when heterogeneity is expected.
  • Inspect distributions before tightening thresholds: route a small sample through with most filters disabled to score the data, then pick thresholds from the percentile distributions (see the sketch after this list).
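
A minimal sketch of that threshold-picking step, assuming you have exported scored segments to JSONL and that each record carries a numeric UTMOS score under a utmos key. Both the ./scored_audio directory and the field name are illustrative; adjust them to match your export stages.

import json
import statistics
from pathlib import Path

# Collect UTMOS scores from the exported JSONL shards.
scores = []
for shard in Path("./scored_audio").glob("*.jsonl"):
    with shard.open() as f:
        for line in f:
            record = json.loads(line)
            if "utmos" in record:  # assumed field name; match your manifest
                scores.append(record["utmos"])

# Quartiles of the score distribution; pick a threshold from these rather
# than guessing. statistics.quantiles(n=4) returns the three cut points.
q1, median, q3 = statistics.quantiles(scores, n=4)
print(f"n={len(scores)}  q1={q1:.2f}  median={median:.2f}  q3={q3:.2f}")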