VAD Segmentation
Split a continuous audio waveform into discrete speech segments using Silero VAD. Voice activity detection is the first transformative step in most audio curation pipelines — every downstream filter operates on segments, not whole files.
Understanding Voice Activity Detection
What VAD Does
A VAD model classifies short audio frames (typically 30 ms) as either speech or non-speech. VADSegmentationStage runs the classifier across the input waveform, groups contiguous speech frames into segments, and emits one AudioTask per detected segment. Each emitted task carries:
- start_ms / end_ms — segment boundaries in the original file
- segment_num — sequential index of the segment
- duration_sec — segment length
- waveform — the segment's torch tensor
- A PyDub AudioSegment for visualization or export
This fan-out behavior means downstream stages score each segment independently, and individual bad segments can be dropped without losing the rest of the file.
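The frame-grouping and fan-out behavior can be sketched in pure Python. This is a conceptual illustration, not the library's implementation; `segments_from_scores` and `FRAME_MS` are hypothetical names, and the record fields mirror the task fields listed above:

```python
FRAME_MS = 30  # typical Silero VAD frame size, per the description above

def segments_from_scores(scores, threshold=0.5, min_duration_sec=2.0):
    """Group contiguous speech frames into segment records.

    scores: per-frame speech confidences in [0.0, 1.0].
    Frames at or above `threshold` count as speech; runs shorter than
    `min_duration_sec` are dropped, mirroring the stage's behavior.
    """
    segments, start = [], None
    for i, s in enumerate(list(scores) + [0.0]):  # sentinel flushes a trailing run
        if s >= threshold and start is None:
            start = i  # a new speech run begins
        elif s < threshold and start is not None:
            start_ms, end_ms = start * FRAME_MS, i * FRAME_MS
            if (end_ms - start_ms) / 1000.0 >= min_duration_sec:
                segments.append({
                    "segment_num": len(segments),
                    "start_ms": start_ms,
                    "end_ms": end_ms,
                    "duration_sec": (end_ms - start_ms) / 1000.0,
                })
            start = None
    return segments
```

Each returned record corresponds to one emitted AudioTask; dropping a bad record leaves the others untouched, which is the fan-out property described above.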
Threshold Guidelines
Silero VAD produces a confidence score from 0.0 to 1.0 for each frame. The threshold parameter controls how confident the model must be before classifying a frame as speech. Start from the default of 0.5 and tune based on your dataset: lower thresholds keep more borderline audio (potentially more false positives); higher thresholds keep only confident speech (potentially missing quieter passages).
Segment Length
min_duration_sec and max_duration_sec define the acceptable segment-length window:
- Segments shorter than min_duration_sec are dropped (typically too short to contain useful content).
- Segments longer than max_duration_sec are split (downstream models often have context-length limits).
Typical training-segment durations: 2–60 seconds for ASR, 5–30 seconds for TTS, 10–120 seconds for ALM.
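The splitting side of the length window can be sketched as a greedy chunker. This is illustrative; the library's actual split strategy is not documented on this page, and `split_segment` is a hypothetical helper:

```python
def split_segment(start_ms, end_ms, max_duration_sec=30.0):
    """Split an over-long segment into pieces no longer than max_duration_sec.

    Greedy left-to-right split (an assumption, not the library's algorithm);
    returns a list of (start_ms, end_ms) pairs covering the original span.
    """
    max_ms = int(max_duration_sec * 1000)
    pieces = []
    while end_ms - start_ms > max_ms:
        pieces.append((start_ms, start_ms + max_ms))
        start_ms += max_ms
    pieces.append((start_ms, end_ms))  # remainder (may be shorter than max)
    return pieces
```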
Basic VAD Segmentation
Step 1: Configure the Stage
The default resource allocation is Resources(cpus=1.0, gpus=0.0). Silero VAD is small enough that fractional-GPU sharing works well; set gpus=0.1 on the stage’s Resources if you have GPU headroom.
Prerequisites: each input AudioTask must carry a waveform and sample_rate field. Place a MonoConversionStage upstream to guarantee both.
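A minimal sketch of the Step 1 checks, assuming tasks are dict-like. The AudioTask fields and Resources defaults come from this page; the validation helper and dict-based resource spec below are illustrative, not part of the library:

```python
def has_vad_prerequisites(task: dict) -> bool:
    """Each input task must carry 'waveform' and 'sample_rate'
    (guaranteed when MonoConversionStage runs upstream)."""
    return "waveform" in task and "sample_rate" in task

# Stage default per this page: CPU-only.
default_resources = {"cpus": 1.0, "gpus": 0.0}
# Fractional GPU share when headroom exists, as suggested above.
shared_gpu_resources = {"cpus": 1.0, "gpus": 0.1}
```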
Step 2: Tune the Padding
speech_pad_ms adds padding to either side of each detected segment. Without padding, segments often clip the leading and trailing phonemes of an utterance. The default is 300 ms.
For training data going to TTS, increase to 400–500 ms to keep more natural prosody. For ASR transcription work, the default 300 ms is usually sufficient.
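The effect of the padding can be shown in a few lines. The real padding happens inside the stage; `pad_segment` is a hypothetical helper that just demonstrates the arithmetic, including clamping at the file boundaries:

```python
def pad_segment(start_ms, end_ms, file_end_ms, speech_pad_ms=300):
    """Widen a segment by speech_pad_ms on each side, clamped to the file.

    300 ms is the default stated on this page; TTS work may use 400-500 ms.
    """
    return (max(0, start_ms - speech_pad_ms),
            min(file_end_ms, end_ms + speech_pad_ms))
```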
Parameters
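For reference, the stage's parameters discussed on this page can be collected in one place. Defaults are those stated elsewhere on this page where given; the max_duration_sec default is not stated here, so 60.0, like the nested default of False, is an assumption:

```python
# Parameter summary (values marked "assumed" are not stated on this page).
vad_params = {
    "threshold": 0.5,         # speech confidence cutoff, 0.0-1.0
    "min_duration_sec": 2.0,  # shorter segments are dropped
    "max_duration_sec": 60.0, # longer segments are split (assumed default)
    "speech_pad_ms": 300,     # padding on each side of a segment
    "nested": False,          # True when re-running VAD on separated speakers (assumed default)
}
```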
Domain-Specific Tuning
Conversational Audio
Conversational speech includes overlapping speakers, disfluencies, and back-channel responses (“uh-huh”). Keep more of it:
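One hedged starting point, following the "keep more of it" guidance above. The specific numbers are illustrative assumptions, not documented defaults:

```python
# Conversational audio: permissive settings (illustrative values).
conversational_vad = {
    "threshold": 0.4,         # lower: keep back-channels and soft speech
    "min_duration_sec": 1.0,  # shorter: keep brief responses like "uh-huh"
    "speech_pad_ms": 300,
}
```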
Studio / Read-Speech Audio
Studio audio has cleanly separated utterances and minimal noise. Use stricter parameters to avoid splitting on internal pauses:
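A hedged sketch of stricter settings, per the guidance above. The numbers are illustrative assumptions:

```python
# Studio / read speech: strict settings (illustrative values).
studio_vad = {
    "threshold": 0.6,         # higher: clean speech is confidently detected
    "min_duration_sec": 2.0,
    "speech_pad_ms": 400,     # wider padding helps bridge short internal pauses
}
```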
Long-Form Audio (Podcasts, Audiobooks)
Long files benefit from larger max_duration_sec so individual sentences and paragraphs aren’t fragmented:
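A hedged sketch with a larger max_duration_sec, staying inside the 10-120 second ALM range given earlier. The numbers are illustrative assumptions:

```python
# Long-form audio: keep sentences and paragraphs whole (illustrative values).
longform_vad = {
    "threshold": 0.5,
    "min_duration_sec": 2.0,
    "max_duration_sec": 120.0,  # larger window so paragraphs are not fragmented
}
```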
Complete VAD Pipeline Example
A minimum-viable pipeline that loads audio, segments it, and writes a manifest of segments:
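The pipeline wiring itself is library-specific (MonoConversionStage feeding VADSegmentationStage, as described above), so the sketch below covers only the last step: writing one manifest line per emitted segment. `write_manifest` is a hypothetical helper, and the tasks are modeled as plain dicts carrying the fields listed in "What VAD Does":

```python
import io
import json

def write_manifest(tasks, f):
    """Write one JSON line per detected segment to the file-like object f.

    Each task is assumed to expose the per-segment fields described on
    this page (start_ms, end_ms, segment_num, duration_sec).
    """
    for t in tasks:
        f.write(json.dumps({
            "start_ms": t["start_ms"],
            "end_ms": t["end_ms"],
            "segment_num": t["segment_num"],
            "duration_sec": t["duration_sec"],
        }) + "\n")
```

In a real run, `f` would be an open file handle for the manifest; a StringIO buffer works for inspection during development.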
Best Practices
- Start with defaults: threshold=0.5 and min_duration_sec=2.0 cover most use cases. Tune only after inspecting a sample of output segments.
- Pair with quality filters: VAD alone keeps anything that sounds like speech, including low-quality, noisy, or distorted segments. Chain a UTMOS filter and/or SIGMOS filter after VAD to drop low-quality segments.
- Use nested=True for per-speaker pipelines: when running VAD again on speaker-separated audio downstream of Speaker Separation, set nested=True so the inner VAD inherits the outer segment's mappings chain — required for correct timestamp resolution at the end.
- Inspect distributions before filtering aggressively: export a sample manifest with VAD only, then plot the distribution of duration_sec across segments. Use that distribution to choose realistic min_duration_sec and max_duration_sec for your data.
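The distribution inspection can be done without plotting at all. This sketch assumes a JSONL manifest with one JSON object per segment carrying a duration_sec field; `duration_percentiles` is a hypothetical helper:

```python
import json

def duration_percentiles(manifest_lines, qs=(5, 50, 95)):
    """Summarize duration_sec across segments at the given integer percentiles.

    A quick way to choose realistic min/max duration cutoffs before
    filtering aggressively.
    """
    durations = sorted(json.loads(line)["duration_sec"] for line in manifest_lines)
    n = len(durations)
    return {q: durations[min(n - 1, q * n // 100)] for q in qs}
```

If, say, the 5th percentile sits well below your planned min_duration_sec, a strict cutoff will discard a meaningful fraction of segments.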
Related Topics
- Preprocessing Stages — MonoConversionStage (typically upstream) and SegmentConcatenationStage (typically downstream).
- Speaker Separation — typical next stage when you need diarization in addition to segmentation.
- AudioDataFilterStageComposite — wraps VAD with the rest of the audio quality pipeline.