VAD Segmentation

Split a continuous audio waveform into discrete speech segments using Silero VAD. Voice activity detection is the first transformative step in most audio curation pipelines — every downstream filter operates on segments, not whole files.

Understanding Voice Activity Detection

What VAD Does

A VAD model classifies short audio frames (typically 30 ms) as either speech or non-speech. VADSegmentationStage runs the classifier across the input waveform, groups contiguous speech frames into segments, and emits one AudioTask per detected segment. Each emitted task carries:

  • start_ms / end_ms — segment boundaries in the original file
  • segment_num — sequential index of the segment
  • duration_sec — segment length
  • waveform — the segment’s torch tensor
  • A PyDub AudioSegment for visualization or export

This fan-out behavior means downstream stages score each segment independently, and individual bad segments can be dropped without losing the rest of the file.
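
To make the fan-out concrete, here is a minimal sketch of how contiguous speech frames become (start_ms, end_ms) segments. This is illustrative only, not the stage's actual implementation; the frame size and function name are assumptions:

```python
# Illustrative frame-to-segment grouping; not the actual VADSegmentationStage code.
FRAME_MS = 30  # assumed frame hop

def group_frames(frame_probs: list[float], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Collapse runs of speech frames into (start_ms, end_ms) segments."""
    segments, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i * FRAME_MS                    # a speech run begins
        elif p < threshold and start is not None:
            segments.append((start, i * FRAME_MS))  # a speech run ends
            start = None
    if start is not None:                           # file ended mid-speech
        segments.append((start, len(frame_probs) * FRAME_MS))
    return segments
```

Each returned pair becomes one AudioTask downstream.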

Threshold Guidelines

Silero VAD produces a confidence score from 0.0 to 1.0 for each frame. The threshold parameter controls how confident the model must be before classifying a frame as speech. The following table provides starting points; tune based on your dataset:

| Threshold | Speech Recall | Use Case |
|---|---|---|
| 0.3 | High (lenient) | Noisy field recordings, quiet speakers, podcast audio |
| 0.5 | Balanced (default) | General-purpose curation, clean studio audio |
| 0.7 | Low (strict) | High-precision curation; isolating a single speaker against background chatter |

Lower thresholds keep more borderline audio (potentially more false positives); higher thresholds keep only confident speech (potentially missing quieter passages).
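
The tradeoff is monotone: every frame a strict threshold accepts, a lenient one accepts too. A toy illustration with made-up confidence scores:

```python
# Toy illustration of threshold vs. recall; confidence values are made up.
frame_probs = [0.92, 0.35, 0.61, 0.08, 0.74]
for t in (0.3, 0.5, 0.7):
    kept = sum(p >= t for p in frame_probs)
    print(f"threshold={t}: {kept}/{len(frame_probs)} frames classified as speech")
```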

Segment Length

min_duration_sec and max_duration_sec define the acceptable segment-length window:

  • Segments shorter than min_duration_sec are dropped (typically too short to contain useful content).
  • Segments longer than max_duration_sec are split (downstream models often have context-length limits).

Typical training-segment durations: 2–60 seconds for ASR, 5–30 seconds for TTS, 10–120 seconds for ALM.
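
A sketch of the length policy under these two parameters (illustrative, not the stage's code): too-short segments are dropped, too-long ones are chopped into max-length pieces:

```python
# Illustrative length policy; not the actual stage implementation.
def apply_length_window(start_ms: int, end_ms: int,
                        min_sec: float = 2.0, max_sec: float = 60.0) -> list[tuple[int, int]]:
    if (end_ms - start_ms) / 1000.0 < min_sec:
        return []  # too short: drop
    # Too long: split into max-length chunks. (A real implementation may
    # also re-check the final, possibly short, chunk against min_sec.)
    step = int(max_sec * 1000)
    return [(s, min(s + step, end_ms)) for s in range(start_ms, end_ms, step)]
```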

Basic VAD Segmentation

Step 1: Configure the Stage

```python
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage

vad = VADSegmentationStage(
    min_duration_sec=2.0,
    max_duration_sec=60.0,
    threshold=0.5,
    speech_pad_ms=300,
)

pipeline.add_stage(vad)
```

The default resource allocation is Resources(cpus=1.0, gpus=0.0). Silero VAD is small enough that fractional-GPU sharing works well; set gpus=0.1 on the stage’s Resources if you have GPU headroom.
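
A sketch of that override; the Resources import path and the attachment mechanism are assumptions, so verify both against your NeMo Curator version:

```python
# Assumed import path and attachment style; check your version's API.
from nemo_curator.stages.resources import Resources

vad = VADSegmentationStage(threshold=0.5)
vad.resources = Resources(cpus=1.0, gpus=0.1)  # fractional GPU share; Silero VAD is small
```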

Prerequisites: each input AudioTask must carry waveform and sample_rate fields. Place a MonoConversionStage upstream to guarantee both.

Step 2: Tune the Padding

speech_pad_ms adds padding to either side of each detected segment. Without padding, segments often clip the leading and trailing phonemes of an utterance. Defaults:

```python
VADSegmentationStage(
    speech_pad_ms=300,  # 300 ms each side; preserves natural breathing and onsets
)
```

For training data going to TTS, increase to 400–500 ms to keep more natural prosody. For ASR transcription work, the default 300 ms is usually sufficient.
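
Conceptually, the padding is boundary arithmetic clamped to the file's extent (an illustrative sketch, not the stage's code):

```python
# Illustrative padding arithmetic; not the actual stage implementation.
def pad_segment(start_ms: int, end_ms: int, pad_ms: int, file_len_ms: int) -> tuple[int, int]:
    return (max(0, start_ms - pad_ms),          # never pad past the file start
            min(file_len_ms, end_ms + pad_ms))  # never pad past the file end
```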

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| min_interval_ms | int | 500 | Minimum silence (in ms) between speech runs; gaps shorter than this cause adjacent segments to merge. |
| min_duration_sec | float | 2.0 | Drop segments shorter than this duration. |
| max_duration_sec | float | 60.0 | Cap on segment duration; longer segments are split. |
| threshold | float | 0.5 | Silero VAD confidence threshold (0.0–1.0). See Threshold Guidelines. |
| speech_pad_ms | int | 300 | Padding (in ms) added to either side of each detected segment. |
| waveform_key | str | "waveform" | Input task key containing the waveform tensor. |
| sample_rate_key | str | "sample_rate" | Input task key containing the sample rate. |
| nested | bool | False | When True, segments inherit the parent task's mappings field for nested re-segmentation (used in per-speaker filter chains). |
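
For example, when an upstream stage stores audio under nonstandard task keys, point the stage at them explicitly (the key name below is hypothetical):

```python
VADSegmentationStage(
    waveform_key="separated_waveform",  # hypothetical key written by an upstream stage
    sample_rate_key="sample_rate",
    nested=True,  # inherit the parent segment's mappings for timestamp resolution
)
```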

Domain-Specific Tuning

Conversational Audio

Conversational speech includes overlapping speakers, disfluencies, and back-channel responses (“uh-huh”). Keep more of it:

```python
VADSegmentationStage(
    min_duration_sec=1.0,  # keep short backchannels
    threshold=0.4,         # lenient; accept marginal speech
    speech_pad_ms=400,     # extra padding for natural turns
)
```

Studio / Read-Speech Audio

Studio audio has cleanly separated utterances and minimal noise. Use stricter parameters to avoid splitting on internal pauses:

```python
VADSegmentationStage(
    min_duration_sec=2.0,
    threshold=0.6,        # strict; only confident speech
    min_interval_ms=800,  # don't split on short pauses inside a sentence
)
```

Long-Form Audio (Podcasts, Audiobooks)

Long files benefit from larger max_duration_sec so individual sentences and paragraphs aren’t fragmented:

```python
VADSegmentationStage(
    min_duration_sec=3.0,
    max_duration_sec=30.0,  # natural sentence/paragraph length
    threshold=0.5,
)
```

Complete VAD Pipeline Example

A minimum-viable pipeline that loads audio, segments it, and writes a manifest of segments:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="vad_segmentation")

# 1. Normalize input audio
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Segment into speech chunks
pipeline.add_stage(
    VADSegmentationStage(
        min_duration_sec=2.0,
        max_duration_sec=60.0,
        threshold=0.5,
        speech_pad_ms=300,
    )
)

# 3. Export the segment manifest
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./segments"))

executor = XennaExecutor()
pipeline.run(executor)
```
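
Each line of the resulting manifest describes one segment. An illustrative record, using the task fields listed earlier (exact output fields may vary by version):

```json
{"start_ms": 12400, "end_ms": 18900, "segment_num": 3, "duration_sec": 6.5}
```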

Best Practices

  • Start with defaults: threshold=0.5 and min_duration_sec=2.0 cover most use cases. Tune only after inspecting a sample of output segments.
  • Pair with quality filters: VAD alone keeps anything that sounds like speech, including low-quality, noisy, or distorted segments. Chain a UTMOS filter and/or SIGMOS filter after VAD to drop low-quality segments.
  • Use nested=True for per-speaker pipelines: when running VAD again on speaker-separated audio downstream of Speaker Separation, set nested=True so the inner VAD inherits the outer segment’s mappings chain — required for correct timestamp resolution at the end.
  • Inspect distributions before filtering aggressively: export a sample manifest with VAD only, then plot the distribution of duration_sec across segments (see the sketch below). Use that distribution to choose realistic min_duration_sec and max_duration_sec for your data.
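
A minimal sketch of that inspection step, assuming the JSONL manifest written by the pipeline above carries a duration_sec field per record:

```python
# Plot the segment-duration distribution from the JSONL manifest.
# Assumes ./segments contains .jsonl files with a "duration_sec" field.
import json
from pathlib import Path

import matplotlib.pyplot as plt

durations = [
    json.loads(line)["duration_sec"]
    for path in Path("./segments").glob("*.jsonl")
    for line in path.open()
    if line.strip()
]

plt.hist(durations, bins=50)
plt.xlabel("segment duration (s)")
plt.ylabel("count")
plt.title(f"{len(durations)} segments")
plt.show()
```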