nemo_curator.stages.audio.segmentation.vad_segmentation
VAD (Voice Activity Detection) segmentation stage.
Segments audio into speech chunks using the Silero VAD model, filtering out silence and creating manageable segments for further processing.
Supports both CPU and GPU execution. GPU is used when available and requested via _resources configuration.
Bases: ProcessingStage[AudioTask, AudioTask]
Stage to segment audio using Voice Activity Detection (VAD).
This stage takes a single AudioTask and segments it into speech chunks based on VAD, filtering out silence and creating manageable segments for further processing. It uses the Silero VAD model, loaded via torch.hub.
By default, returns a list[AudioTask] with one AudioTask per detected speech segment (fan-out).
Parameters:
Minimum silence interval between speech segments in milliseconds.
Minimum segment duration in seconds.
Maximum segment duration in seconds.
Voice activity detection threshold (0.0-1.0).
Padding in milliseconds added before and after each speech segment.
Key to get waveform data.
Key to get sample rate.
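To illustrate how the duration and padding parameters above could interact, here is a minimal sketch in plain Python. It is not the stage's implementation: the function name `postprocess_segments`, its argument names, and the choice to split over-long segments into fixed-size pieces are all assumptions made for illustration.

```python
from typing import Dict, List


def postprocess_segments(
    raw: List[Dict[str, float]],
    *,
    total_duration: float,
    min_duration_sec: float,
    max_duration_sec: float,
    speech_pad_ms: int,
) -> List[Dict[str, float]]:
    """Pad, split, and filter raw VAD segments (hypothetical helper).

    `raw` holds dicts with "start" and "end" times in seconds.
    """
    pad = speech_pad_ms / 1000.0  # milliseconds -> seconds
    out: List[Dict[str, float]] = []
    for seg in raw:
        # Apply padding, clamped to the bounds of the audio.
        start = max(0.0, seg["start"] - pad)
        end = min(total_duration, seg["end"] + pad)
        # Assumption: over-long segments are cut into max-length pieces.
        while end - start > max_duration_sec:
            out.append({"start": start, "end": start + max_duration_sec})
            start += max_duration_sec
        # Drop whatever remains if it is shorter than the minimum.
        if end - start >= min_duration_sec:
            out.append({"start": start, "end": end})
    return out


raw = [{"start": 1.0, "end": 1.5}, {"start": 5.0, "end": 12.0}]
result = postprocess_segments(
    raw,
    total_duration=20.0,
    min_duration_sec=2.0,
    max_duration_sec=5.0,
    speech_pad_ms=250,
)
```

In this example the first raw segment is dropped because it is still shorter than the minimum after padding, and the second is split into two pieces because it exceeds the maximum.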
Build a single segment item dict from a VAD result.
Get speech segments using VAD.
Resolve waveform and sample_rate from task data. Returns None on failure.
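The "returns None on failure" contract can be sketched as follows. This is an illustration only: the function name `resolve_waveform`, the default key names `"waveform"` and `"sample_rate"`, and the specific validity checks are assumptions, not the stage's actual code.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_waveform(
    data: Dict[str, Any],
    waveform_key: str = "waveform",        # hypothetical default key
    sample_rate_key: str = "sample_rate",  # hypothetical default key
) -> Optional[Tuple[Any, int]]:
    """Return (waveform, sample_rate), or None if either is unusable."""
    waveform = data.get(waveform_key)
    sample_rate = data.get(sample_rate_key)
    # A missing entry counts as a failure.
    if waveform is None or sample_rate is None:
        return None
    # Assumption: the sample rate must be a positive integer.
    if not isinstance(sample_rate, int) or sample_rate <= 0:
        return None
    return waveform, sample_rate


ok = resolve_waveform({"waveform": [0.0, 0.1], "sample_rate": 16000})
missing = resolve_waveform({"waveform": [0.0, 0.1]})
```

Returning None instead of raising lets the caller skip a malformed task without aborting the whole pipeline.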
Process a single AudioTask.
When nested=False (default), returns list[AudioTask] with one task per speech segment (fan-out).
When nested=True, returns a single AudioTask with all segment dicts stored in task.data["segments"] (no fan-out).
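The difference between the two return shapes can be sketched with a stand-in task type. `FakeAudioTask`, `emit`, and the `"audio_filepath"` key are hypothetical and used only for illustration. The real AudioTask class and the way segment fields are merged into each task may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class FakeAudioTask:
    """Stand-in for AudioTask; only the `data` dict is modelled."""
    data: Dict[str, Any] = field(default_factory=dict)


def emit(
    task: FakeAudioTask,
    segments: List[Dict[str, Any]],
    nested: bool = False,
) -> Union[FakeAudioTask, List[FakeAudioTask]]:
    if nested:
        # No fan-out: one task carrying every segment dict.
        return FakeAudioTask(data={**task.data, "segments": segments})
    # Fan-out: one task per segment (assumed to inherit the parent's data).
    return [FakeAudioTask(data={**task.data, **seg}) for seg in segments]


task = FakeAudioTask(data={"audio_filepath": "a.wav"})
segs = [{"start": 0.0, "end": 1.0}, {"start": 2.0, "end": 3.5}]

fan_out = emit(task, segs)              # list of two tasks
nested = emit(task, segs, nested=True)  # single task holding data["segments"]
```

Fan-out suits downstream stages that operate on one segment at a time, while the nested form keeps all segments of a recording together in a single task.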