AudioDataFilterStage Composite Pipeline

AudioDataFilterStage is a CompositeStage that decomposes into a configurable sequence of audio sub-stages for extracting clean single-speaker segments from raw audio files. Use it when you want the full quality-filtering pipeline driven by a single YAML config instead of wiring stages individually.

Understanding the Composite

What It Does

A CompositeStage is a stage that, at pipeline build time, expands into a sequence of underlying stages. AudioDataFilterStage expands into the audio quality-filtering chain — preprocessing, VAD, band, UTMOS, SIGMOS, concatenation, speaker separation, per-speaker filters, and timestamp mapping — using parameters loaded from a YAML config.

This serves two purposes:

  1. Single-config pipelines: tune the entire pipeline in one place instead of editing many pipeline.add_stage(...) calls.
  2. Resource declarations live with the stage: each sub-stage’s CPU/GPU allocation is set in the same YAML, alongside its functional parameters.
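
If you want to see what the composite will expand into before running, the following is a minimal sketch. It assumes the CompositeStage base class exposes a decompose() method that returns the expanded stage list; verify the method name against your installed NeMo Curator version.

from nemo_curator.stages.audio.advanced_pipelines.audio_data_filter import AudioDataFilterStage

# Build the composite from the bundled default config and list the
# sub-stages it will expand into at pipeline build time.
audio_filter = AudioDataFilterStage()
for stage in audio_filter.decompose():  # assumed CompositeStage API
    print(type(stage).__name__)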

Default Pipeline Order

When all sub-stages are enabled, the composite expands into:

  1. MonoConversionStage — normalize channels and sample rate.
  2. VADSegmentationStage — split into speech segments.
  3. BandFilterStage — drop segments not matching the target bandwidth.
  4. UTMOSFilterStage — drop segments below the MOS threshold.
  5. SIGMOSFilterStage — drop segments failing any active SIGMOS dimension.
  6. SegmentConcatenationStage — concatenate surviving segments with silence gaps.
  7. SpeakerSeparationStage — diarize and fan out one task per speaker.
  8. Per-speaker filters — rerun VAD + Band + UTMOS + SIGMOS on each speaker’s audio.
  9. TimestampMapperStage — project final boundaries back to original-file timestamps.

When to Use the Composite vs Individual Stages

| Approach | Use when |
| --- | --- |
| AudioDataFilterStage (composite) | Standard end-to-end curation; you want YAML-driven configuration; you want all sub-stages enabled. |
| Individual stages | You only need part of the pipeline (e.g., VAD + UTMOS without speaker separation; see the sketch below), or you need to interleave audio stages with custom code. |
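
As a sketch of the individual-stage route, the following wires VAD and UTMOS directly. The import paths are assumptions modeled on the composite's module layout (verify against your installed version); the constructor arguments mirror the YAML keys documented below.

# Hypothetical import paths -- confirm against your NeMo Curator version.
from nemo_curator.stages.audio.vad import VADSegmentationStage
from nemo_curator.stages.audio.utmos import UTMOSFilterStage

# VAD + UTMOS only: segment speech, then drop low-MOS segments,
# with no band filter, SIGMOS, or speaker separation.
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))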

Basic Usage

Step 1: Pick a Config Source

Construct the stage from a YAML config file:

from nemo_curator.stages.audio.advanced_pipelines.audio_data_filter import AudioDataFilterStage

audio_filter = AudioDataFilterStage(config_path="./audio_filter.yaml")
pipeline.add_stage(audio_filter)

Or pass a config dict inline:

audio_filter = AudioDataFilterStage(
    config={
        "vad": {"enable": True, "min_duration_sec": 1.0},
        "utmos": {"enable": True, "mos_threshold": 3.0},
        "sigmos": {"enable": False},
        "speaker_separation": {"enable": True},
    },
)

If neither config_path nor config is provided, the bundled default config is used.
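
For a quick start with the bundled defaults, construct the stage with no arguments:

audio_filter = AudioDataFilterStage()  # falls back to the bundled default config
pipeline.add_stage(audio_filter)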

Step 2: Customize the YAML

Each top-level key maps to one sub-stage; set enable: false to skip it. The default configuration shipped with the stage:

mono_conversion:
  output_sample_rate: 48000
  strict_sample_rate: true
  cpus: 1.0

vad:
  enable: true
  min_duration_sec: 2.0
  max_duration_sec: 60.0
  threshold: 0.5
  min_interval_ms: 500
  speech_pad_ms: 300
  cpus: 1.0
  gpus: 0.1

band_filter:
  enable: true
  band_value: full_band
  cpus: 1.0
  gpus: 0.0

utmos:
  enable: true
  mos_threshold: 3.4
  cpus: 1.0
  gpus: 0.1

sigmos:
  enable: true
  noise_threshold: 4.0
  ovrl_threshold: 3.5
  sig_threshold: null
  col_threshold: null
  disc_threshold: null
  loud_threshold: null
  reverb_threshold: null
  cpus: 1.0
  gpus: 0.1

concatenation:
  silence_duration_sec: 0.5
  cpus: 1.0

speaker_separation:
  enable: true
  exclude_overlaps: true
  min_duration: 0.8
  gap_threshold: 0.1
  buffer_time: 0.5
  cpus: 1.0
  gpus: 0.3

timestamp_mapper:
  passthrough_keys: null
  cpus: 1.0

Each parameter mirrors the corresponding sub-stage's constructor argument. See the per-stage pages linked at the bottom for parameter details.

The default UTMOS threshold in the YAML config is 3.4, while the standalone UTMOSFilterStage class default is 3.5. The composite uses the YAML value when constructed from the bundled config; tune as needed for your data.
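
To get the same behavior regardless of how the stage is constructed, pin the threshold explicitly in your YAML:

utmos:
  enable: true
  mos_threshold: 3.4  # set explicitly rather than relying on either default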

Step 3: Disable Unneeded Stages

Any sub-stage that exposes an enable key can be skipped by setting it to false. Common partial pipelines (a YAML sketch of the first follows the table):

| Pipeline | Disable |
| --- | --- |
| VAD-only | band_filter.enable: false, utmos.enable: false, sigmos.enable: false, speaker_separation.enable: false |
| Quality-only | speaker_separation.enable: false (keeps audio whole instead of fanning out per speaker) |
| Single-speaker known | speaker_separation.enable: false (substantial GPU savings when input has one speaker) |
| No bandwidth filtering | band_filter.enable: false |
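
For example, a minimal sketch of the VAD-only variant (every key shown appears in the default config above):

vad:
  enable: true
band_filter:
  enable: false
utmos:
  enable: false
sigmos:
  enable: false
speaker_separation:
  enable: false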

Common Configurations

High-Quality TTS Training Data

Strict thresholds across all dimensions, no narrowband, no high-reverb:

mono_conversion:
  output_sample_rate: 48000
vad:
  enable: true
  min_duration_sec: 3.0
band_filter:
  enable: true
  band_value: full_band  # full-band only
utmos:
  enable: true
  mos_threshold: 4.0  # strict
sigmos:
  enable: true
  noise_threshold: 4.5
  ovrl_threshold: 4.0
  reverb_threshold: 3.5
  disc_threshold: 4.0
speaker_separation:
  enable: true

Permissive Web-Crawl Curation

Looser thresholds; preserve more data; rely on downstream training to filter further:

mono_conversion:
  output_sample_rate: 16000  # narrow-band acceptable
  strict_sample_rate: false  # auto-resample
vad:
  enable: true
  threshold: 0.4  # lenient
band_filter:
  enable: false  # accept any bandwidth
utmos:
  enable: true
  mos_threshold: 3.0
sigmos:
  enable: true
  noise_threshold: 3.5
  ovrl_threshold: 3.0
speaker_separation:
  enable: true

ASR Training (Single-Speaker Read Speech)

Skip speaker separation since each file is known to have one speaker:

mono_conversion:
  output_sample_rate: 16000
vad:
  enable: true
  min_duration_sec: 2.0
band_filter:
  enable: true
  band_value: narrow_band  # match deployment
utmos:
  enable: true
  mos_threshold: 3.5
sigmos:
  enable: true
speaker_separation:
  enable: false  # skip: single speaker

Complete Pipeline Example

A pipeline that uses AudioDataFilterStage as the entire processing chain:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.advanced_pipelines.audio_data_filter import AudioDataFilterStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_data_filter")

# Reads from a manifest and produces filtered AudioTask per speaker
pipeline.add_stage(AudioDataFilterStage(config_path="./audio_filter.yaml"))

# Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)

For a complete end-to-end walkthrough including dataset download, see the ReadSpeech Tutorial.

Best Practices

  • Start from the default config and tune one knob at a time: don’t tighten thresholds on five dimensions at once. You’ll lose visibility into which one rejected each dropped segment.
  • Disable speaker separation when you can: it’s the most expensive sub-stage. If your input has known single-speaker audio, set speaker_separation.enable: false for a substantial speedup.
  • Match resources to hardware: the cpus / gpus keys per sub-stage control parallelism. On a 16-CPU / 4-GPU node, the defaults work well; tune up for larger nodes.
  • Use strict_sample_rate: false only when needed: auto-resampling can mask data-quality bugs (unexpected 8 kHz audio in a 48 kHz dataset). Default to strict and disable only when heterogeneity is expected.
  • Inspect distributions before tightening thresholds: route a small sample through with most filters disabled to score the data, then pick thresholds from the percentile distributions (see the sketch after this list).
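
A minimal sketch of that threshold-picking step, assuming you have exported scored segments to JSONL and that each record carries a numeric UTMOS score under a utmos key. Both the ./scored_audio directory and the field name are illustrative; adjust them to match your export stages.

import json
import statistics
from pathlib import Path

# Collect UTMOS scores from the exported JSONL shards.
scores = []
for shard in Path("./scored_audio").glob("*.jsonl"):
    with shard.open() as f:
        for line in f:
            record = json.loads(line)
            if "utmos" in record:  # assumed field name; match your manifest
                scores.append(record["utmos"])

# Quartiles of the score distribution; pick a threshold from these rather
# than guessing. statistics.quantiles(n=4) returns the three cut points.
q1, median, q3 = statistics.quantiles(scores, n=4)
print(f"n={len(scores)}  q1={q1:.2f}  median={median:.2f}  q3={q3:.2f}")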