nemo_curator.stages.audio.inference.vad.whisperx_vad


WhisperX VAD for NeMo Curator.

Provides WhisperXVADModel (shared VAD logic for pyannote and standalone VAD) and WhisperXVADStage (ProcessingStage for VAD-only pipeline use).

Module Contents

Classes

| Name | Description |
| --- | --- |
| WhisperXVADModel | Shared VAD model and get_vad_segments logic for PyAnnote and standalone VAD. |
| WhisperXVADStage | Stage that performs Voice Activity Detection (VAD) using WhisperX’s VAD model. |

API

class nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADModel(
device: str = 'cuda',
vad_onset: float = 0.5,
vad_offset: float = 0.363,
use_auth_token: str | None = None
)

Shared VAD model and get_vad_segments logic for PyAnnote and standalone VAD.

Used by PyAnnoteDiarizationStage for sub-segment VAD and by WhisperXVADStage for VAD-only processing.
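As a minimal sketch of how the constructor above might be called, the following collects the documented keyword arguments and builds the model lazily. `VAD_MODEL_KWARGS` and `build_vad_model` are hypothetical names introduced for illustration, and `device="cpu"` is an illustrative override of the documented default `'cuda'`.

```python
from importlib import import_module

# Module path taken from the class signature documented above.
MODULE_PATH = "nemo_curator.stages.audio.inference.vad.whisperx_vad"

# Keyword arguments mirroring the documented constructor parameters.
# device="cpu" is an illustrative override; the documented default is "cuda".
VAD_MODEL_KWARGS = {
    "device": "cpu",
    "vad_onset": 0.5,
    "vad_offset": 0.363,
    "use_auth_token": None,
}


def build_vad_model(**overrides):
    """Hypothetical helper: import the module on demand and construct the model."""
    kwargs = {**VAD_MODEL_KWARGS, **overrides}
    # Raises ImportError if NeMo Curator is not installed in this environment.
    module = import_module(MODULE_PATH)
    return module.WhisperXVADModel(**kwargs)
```

Deferring the import keeps the sketch loadable on machines without the audio dependencies; calling `build_vad_model()` still requires a working NeMo Curator installation.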

_model
nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADModel.get_vad_segments(
audio: numpy.ndarray,
merge_max_length: float,
sample_rate: int = SAMPLE_RATE
) -> list[dict]

Get voice activity detection segments for the given audio.

Parameters:

audio
np.ndarray

NumPy array of shape (C, N).

merge_max_length
float

Maximum length for merging chunks in seconds.

sample_rate
int, defaults to SAMPLE_RATE

Sample rate of the audio.

Returns: list[dict]

List of VAD segment dicts with “start” and “end” keys.
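To make the return shape and the role of merge_max_length concrete, here is a rough, library-free sketch that greedily merges consecutive segment dicts while the merged span stays within a maximum length. It is not the merge routine used by WhisperX or NeMo Curator; `merge_segments` and `total_speech_seconds` are hypothetical helpers, and the sample values are made up.

```python
def merge_segments(segments, merge_max_length):
    """Greedily merge consecutive segments while the merged span fits the limit.

    Assumption: the merged span includes any silence between the segments.
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged and seg["end"] - merged[-1]["start"] <= merge_max_length:
            # Extend the current chunk to cover this segment.
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            # Start a new chunk; copy so the input dicts are not mutated.
            merged.append({"start": seg["start"], "end": seg["end"]})
    return merged


def total_speech_seconds(segments):
    """Sum the durations of segment dicts with "start" and "end" keys."""
    return sum(seg["end"] - seg["start"] for seg in segments)


# Made-up segments in seconds, shaped like the documented return value.
example = [
    {"start": 0.0, "end": 2.0},
    {"start": 2.5, "end": 4.0},
    {"start": 30.0, "end": 31.0},
]
merged = merge_segments(example, merge_max_length=10.0)
```

With these inputs the first two segments collapse into one chunk and the third stays separate, because adding it would exceed the 10-second limit.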

nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADModel.to(
device: str
) -> None

Move the model to the given device.

class nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADStage(
min_length: float = 0.5,
max_length: float = 40.0,
vad_onset: float = 0.5,
vad_offset: float = 0.363,
segments_key: str = 'vad_segments',
audio_filepath_key: str = 'resampled_audio_filepath',
name: str = 'WhisperXVAD',
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(gpus=1))(),
_vad_model: typing.Any = None
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Stage that performs Voice Activity Detection (VAD) using WhisperX’s VAD model.

Adds VAD segments to each entry under segments_key (e.g. “vad_segments”). Entries shorter than min_length are skipped (not emitted).
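The behavior described above can be sketched with plain dictionaries. This is only an illustration of the documented contract (segments stored under segments_key, short entries dropped), not the stage's actual code; `attach_vad_segments` and the `duration` key are assumptions made for the example.

```python
def attach_vad_segments(entry, segments, segments_key="vad_segments", min_length=0.5):
    """Return a copy of the entry with segments attached, or None if it is too short.

    Assumption: the entry carries its length in seconds under a "duration" key.
    """
    if entry["duration"] < min_length:
        return None  # skipped: nothing is emitted for this entry
    return {**entry, segments_key: segments}


# Made-up entries; the path key matches the documented audio_filepath_key default.
entries = [
    {"resampled_audio_filepath": "a.wav", "duration": 0.2},
    {"resampled_audio_filepath": "b.wav", "duration": 3.0},
]
fake_segments = [{"start": 0.1, "end": 2.9}]

kept = []
for entry in entries:
    result = attach_vad_segments(entry, fake_segments)
    if result is not None:
        kept.append(result)
```

Only the 3.0-second entry survives here, and it gains a "vad_segments" key alongside its original fields.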

_device
str

Derive device from resources configuration.

_vad_model
Any = field(default=None, repr=False)
audio_filepath_key
str = 'resampled_audio_filepath'
max_length
float = 40.0
min_length
float = 0.5
name
str = 'WhisperXVAD'
resources
Resources = field(default_factory=(lambda: Resources(gpus=1)))
segments_key
str = 'vad_segments'
vad_offset
float = 0.363
vad_onset
float = 0.5
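The `_device` attribute above is described as derived from the resources configuration. One plausible reading, shown here purely as an assumption, is that a stage requesting GPUs runs on `"cuda"` and otherwise falls back to `"cpu"`. `FakeResources` and `derive_device` are hypothetical stand-ins, not NeMo Curator APIs.

```python
from dataclasses import dataclass


@dataclass
class FakeResources:
    """Hypothetical stand-in for nemo_curator.stages.resources.Resources."""

    gpus: float = 0.0


def derive_device(resources):
    """Assumed rule: use CUDA when any GPU is requested, otherwise the CPU."""
    return "cuda" if resources.gpus > 0 else "cpu"


gpu_device = derive_device(FakeResources(gpus=1))
cpu_device = derive_device(FakeResources())
```

Under this assumed rule, the documented default of `Resources(gpus=1)` would map to `"cuda"`.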
nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask
nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.audio.inference.vad.whisperx_vad.WhisperXVADStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Set up the stage on a node.