Split a continuous audio waveform into discrete speech segments using Silero VAD. Voice activity detection is the first transformative step in most audio curation pipelines — every downstream filter operates on segments, not whole files.
A VAD model classifies short audio frames (typically 30 ms) as either speech or non-speech. VADSegmentationStage runs the classifier across the input waveform, groups contiguous speech frames into segments, and emits one AudioTask per detected segment. Each emitted task carries:
start_ms / end_ms — segment boundaries in the original filesegment_num — sequential index of the segmentduration_sec — segment lengthwaveform — the segment’s torch tensorAudioSegment for visualization or exportThis fan-out behavior means downstream stages score each segment independently, and individual bad segments can be dropped without losing the rest of the file.
Silero VAD produces a confidence score from 0.0 to 1.0 for each frame. The threshold parameter controls how confident the model must be before classifying a frame as speech. The following table provides starting points; tune based on your dataset:
Lower thresholds keep more borderline audio (potentially more false positives); higher thresholds keep only confident speech (potentially missing quieter passages).
min_duration_sec and max_duration_sec define the acceptable segment-length window:
min_duration_sec are dropped (typically too short to contain useful content).max_duration_sec are split (downstream models often have context-length limits).Typical training-segment durations: 2–60 seconds for ASR, 5–30 seconds for TTS, 10–120 seconds for ALM.
The default resource allocation is Resources(cpus=1.0, gpus=0.0). Silero VAD is small enough that fractional-GPU sharing works well; set gpus=0.1 on the stage’s Resources if you have GPU headroom.
Prerequisites: each input AudioTask must carry a waveform and sample_rate field. Place a MonoConversionStage upstream to guarantee both.
speech_pad_ms adds padding to either side of each detected segment. Without padding, segments often clip the leading and trailing phonemes of an utterance. Defaults:
For training data going to TTS, increase to 400–500 ms to keep more natural prosody. For ASR transcription work, the default 300 ms is usually sufficient.
Conversational speech includes overlapping speakers, disfluencies, and back-channel responses (“uh-huh”). Keep more of it:
Studio audio has cleanly separated utterances and minimal noise. Use stricter parameters to avoid splitting on internal pauses:
Long files benefit from larger max_duration_sec so individual sentences and paragraphs aren’t fragmented:
A minimum-viable pipeline that loads audio, segments it, and writes a manifest of segments:
threshold=0.5 and min_duration_sec=2.0 cover most use cases. Tune only after inspecting a sample of output segments.nested=True for per-speaker pipelines: when running VAD again on speaker-separated audio downstream of Speaker Separation, set nested=True so the inner VAD inherits the outer segment’s mappings chain — required for correct timestamp resolution at the end.duration_sec across segments. Use that distribution to choose realistic min_duration_sec and max_duration_sec for your data.MonoConversionStage (typically upstream) and SegmentConcatenationStage (typically downstream).AudioDataFilterStage Composite — wraps VAD with the rest of the audio quality pipeline.