Copyright © 2026, NVIDIA Corporation.


nemo_curator.stages.audio.segmentation.vad_segmentation

VAD (Voice Activity Detection) segmentation stage.

Segments audio into speech chunks using the Silero VAD model, filtering out silence and producing manageable segments for further processing.

Supports both CPU and GPU execution. GPU is used when available and requested via _resources configuration.

Module Contents

Classes

Name | Description
VADSegmentationStage | Stage to segment audio using Voice Activity Detection (VAD).

Data

SILERO_SUPPORTED_RATES

SILERO_TARGET_RATE

API

class nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage(
min_interval_ms: int = 500,
min_duration_sec: float = 2.0,
max_duration_sec: float = 60.0,
threshold: float = 0.5,
speech_pad_ms: int = 300,
waveform_key: str = 'waveform',
sample_rate_key: str = 'sample_rate',
nested: bool = False,
name: str = 'VADSegmentation',
batch_size: int = 1,
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=1.0...
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Stage to segment audio using Voice Activity Detection (VAD).

This stage takes a single AudioTask and segments it into speech chunks based on VAD, filtering out silence and producing manageable segments for further processing. It uses the Silero VAD model, loaded via torch.hub.

By default (nested=False), returns a list[AudioTask] with one AudioTask per detected speech segment (fan-out).

Parameters:

min_interval_ms
int, defaults to 500

Minimum silence interval between speech segments, in milliseconds.

min_duration_sec
float, defaults to 2.0

Minimum segment duration, in seconds.

max_duration_sec
float, defaults to 60.0

Maximum segment duration, in seconds.

threshold
float, defaults to 0.5

Voice activity detection threshold (0.0-1.0).

speech_pad_ms
int, defaults to 300

Padding, in milliseconds, added before and after each speech segment.

waveform_key
str, defaults to 'waveform'

Key used to look up the waveform data.

sample_rate_key
str, defaults to 'sample_rate'

Key used to look up the sample rate.
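
The documented defaults can be collected into a keyword dict and passed to the constructor. The following is a minimal sketch, assuming nemo_curator and its audio dependencies are installed; VAD_KWARGS and build_vad_stage are hypothetical names introduced here for illustration and are not part of the library.

```python
from typing import Any

# Defaults copied from the documented signature. The resources argument is
# omitted so the stage falls back to its own default.
VAD_KWARGS: dict[str, Any] = {
    "min_interval_ms": 500,
    "min_duration_sec": 2.0,
    "max_duration_sec": 60.0,
    "threshold": 0.5,
    "speech_pad_ms": 300,
    "waveform_key": "waveform",
    "sample_rate_key": "sample_rate",
    "nested": False,
}


def build_vad_stage(**overrides: Any):
    """Hypothetical helper: build the stage from the defaults plus overrides."""
    # Imported lazily so the snippet can be loaded without the package installed.
    from nemo_curator.stages.audio.segmentation.vad_segmentation import (
        VADSegmentationStage,
    )

    return VADSegmentationStage(**{**VAD_KWARGS, **overrides})
```

For example, build_vad_stage(threshold=0.6, max_duration_sec=30.0) would request a stricter detection threshold and shorter segments while leaving the other defaults untouched.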

batch_size: int = 1
max_duration_sec: float = 60.0
min_duration_sec: float = 2.0
min_interval_ms: int = 500
name: str = 'VADSegmentation'
nested: bool = False
resources: Resources
sample_rate_key: str = 'sample_rate'
speech_pad_ms: int = 300
threshold: float = 0.5
waveform_key: str = 'waveform'
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.__post_init__()
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage._build_segment_item(
item: dict[str, typing.Any],
waveform: torch.Tensor,
sample_rate: int,
segment: dict[str, float],
segment_num: int
) -> dict[str, typing.Any]

Build a single segment item dict from a VAD result.

nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage._check_gpu_availability(
gpus: float
) -> None
staticmethod
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage._get_vad_segments(
waveform: torch.Tensor,
sample_rate: int
) -> list[dict[str, float]]

Get speech segments using VAD.

nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage._initialize_model() -> None
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage._resolve_audio(
item: dict[str, typing.Any]
) -> tuple[torch.Tensor, int] | None

Resolve waveform and sample_rate from task data. Returns None on failure.
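
The "returns None on failure" contract can be illustrated without the library. This is a sketch of one plausible reading, not the stage's actual code: resolve_audio is a hypothetical standalone function, the specific failure conditions (missing key, non-numeric or non-positive sample rate) are assumptions, and the waveform is treated as an opaque object rather than a torch.Tensor.

```python
from typing import Any, Optional


def resolve_audio(
    item: dict[str, Any],
    waveform_key: str = "waveform",
    sample_rate_key: str = "sample_rate",
) -> Optional[tuple[Any, int]]:
    """Hypothetical stand-in: look up waveform and sample rate, or return None."""
    waveform = item.get(waveform_key)
    sample_rate = item.get(sample_rate_key)
    # Assumed failure case 1: either entry is missing.
    if waveform is None or sample_rate is None:
        return None
    # Assumed failure case 2: the sample rate is not a usable positive integer.
    try:
        rate = int(sample_rate)
    except (TypeError, ValueError):
        return None
    if rate <= 0:
        return None
    return waveform, rate
```

Returning None instead of raising lets a caller skip a malformed item and continue with the rest of the data.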

nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask | list[nemo_curator.tasks.AudioTask]

Process a single AudioTask.

When nested=False (default), returns list[AudioTask] with one task per speech segment (fan-out).

When nested=True, returns a single AudioTask with all segment dicts stored in task.data["segments"] (no fan-out).
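
The difference between the two return shapes can be sketched with a stand-in task type. StandInTask and emit_segments are hypothetical names used only for illustration, and the "start" and "end" keys in the segment dicts are assumptions; the only detail taken from the description above is that nested output is stored under data["segments"].

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class StandInTask:
    """Hypothetical stand-in for AudioTask; only the data dict is modeled."""

    data: dict[str, Any] = field(default_factory=dict)


def emit_segments(
    task: StandInTask,
    segment_items: list[dict[str, Any]],
    nested: bool = False,
) -> Union[StandInTask, list[StandInTask]]:
    """Return one task per segment (fan-out) or one task holding all segments."""
    if nested:
        # No fan-out: keep the original data and attach every segment dict.
        return StandInTask(data={**task.data, "segments": segment_items})
    # Fan-out: each segment dict becomes the data of its own task.
    return [StandInTask(data=dict(item)) for item in segment_items]
```

Downstream stages therefore see either N tasks for N detected segments, or a single task whose data["segments"] list has N entries.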

nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.ray_stage_spec() -> dict[str, typing.Any]
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.audio.segmentation.vad_segmentation.VADSegmentationStage.teardown() -> None
nemo_curator.stages.audio.segmentation.vad_segmentation.SILERO_SUPPORTED_RATES = {8000, 16000, 32000, 48000, 64000, 96000}
nemo_curator.stages.audio.segmentation.vad_segmentation.SILERO_TARGET_RATE = 16000
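
This page does not describe how the stage uses these two constants. One plausible reading, sketched below as an assumption, is that audio at a supported rate is passed to the model unchanged and anything else is resampled to the target rate. choose_vad_rate is a hypothetical helper, and the constants are redefined locally so the snippet is self-contained.

```python
# Values copied from the module-level constants documented above.
SILERO_SUPPORTED_RATES = {8000, 16000, 32000, 48000, 64000, 96000}
SILERO_TARGET_RATE = 16000


def choose_vad_rate(sample_rate: int) -> int:
    """Hypothetical helper: pick the sample rate to feed the VAD model.

    Assumption: supported rates pass through; all others fall back to the target.
    """
    if sample_rate in SILERO_SUPPORTED_RATES:
        return sample_rate
    return SILERO_TARGET_RATE
```

Under this assumption, 48000 Hz audio would be used as-is, while 44100 Hz audio would be resampled to 16000 Hz before detection.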