nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep


Module Contents

Classes

| Name | Description |
| --- | --- |
| SpeakerResult | Result for a single speaker from get_speaker_audio_data. |
| SpeakerSeparator | Class for separating speakers in an audio file using diarization. |

Functions

| Name | Description |
| --- | --- |
| load_audio | Load audio file using soundfile. |

API

class nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerResult()

Bases: NamedTuple

Result for a single speaker from get_speaker_audio_data.

audio: AudioSegment
diar_segments: list[tuple[float, float]]
duration: float

class nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator(
model_name: str | None = None,
config: dict | None = None
)

Class for separating speakers in an audio file using diarization.

config = config or {}
device = 'cpu'

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator._get_param(
param_name: str,
default_value: object
) -> object

Helper to get a parameter from config, handling different config structures.

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator._load_model() -> None

Load the diarization model from Hugging Face Hub.

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.clean_cut_overlapping_segments(
speaker_segments: dict[str, list[tuple[float, float]]]
) -> dict[str, list[tuple[float, float]]]

Handle overlaps by cutting segments at overlap points.
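
One possible reading of "cutting segments at overlap points" is that each segment is trimmed or split so that only the parts where no other speaker is active remain. The sketch below illustrates that reading with a standalone, hypothetical helper (`cut_overlaps` is not part of the module), and the actual method may resolve overlaps differently.

```python
def cut_overlaps(speaker_segments):
    """Hypothetical helper: keep only the parts of each segment that no
    other speaker's segment covers. This is an assumed interpretation,
    not the module's confirmed behavior."""
    result = {}
    for speaker, segments in speaker_segments.items():
        # Collect every segment that belongs to a different speaker.
        others = sorted(
            seg
            for other, other_segments in speaker_segments.items()
            if other != speaker
            for seg in other_segments
        )
        pieces = []
        for start, end in segments:
            cursor = start  # left edge of the part not yet examined
            for o_start, o_end in others:
                if o_end <= cursor or o_start >= end:
                    continue  # no overlap with the remaining part
                if o_start > cursor:
                    pieces.append((cursor, o_start))  # clean part before the overlap
                cursor = max(cursor, o_end)  # skip past the overlapped region
                if cursor >= end:
                    break
            if cursor < end:
                pieces.append((cursor, end))  # clean tail after the last overlap
        result[speaker] = pieces
    return result


segments = {
    "speaker_0": [(0.0, 4.0)],
    "speaker_1": [(1.0, 2.0), (3.5, 5.0)],
}
print(cut_overlaps(segments))
```

Under this reading, a segment that is fully covered by another speaker disappears, while a partially covered one is shortened or split into several pieces.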

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.diarize_audio(
audio_path_or_waveform: str | torch.Tensor,
sample_rate: int | None = None
) -> list[str]

Run speaker diarization on an audio file or waveform.

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.exclude_overlapping_segments(
speaker_segments: dict[str, list[tuple[float, float]]],
buffer_time: float | None = None
) -> dict[str, list[tuple[float, float]]]

Completely exclude any segments where multiple speakers are talking simultaneously.
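
As a rough illustration of full exclusion, the standalone sketch below drops every segment that overlaps a segment from another speaker. It assumes `buffer_time` widens each competing segment by that many seconds on both sides, which the reference does not confirm. The helper name `drop_overlapping` is hypothetical.

```python
def drop_overlapping(speaker_segments, buffer_time=0.0):
    """Hypothetical helper: discard any segment that overlaps another
    speaker's segment, after padding the other segment by buffer_time
    seconds on each side (an assumed meaning of the parameter)."""
    result = {}
    for speaker, segments in speaker_segments.items():
        others = [
            seg
            for other, other_segments in speaker_segments.items()
            if other != speaker
            for seg in other_segments
        ]
        kept = []
        for start, end in segments:
            # Standard interval-overlap test against each padded segment.
            clashes = any(
                start < o_end + buffer_time and o_start - buffer_time < end
                for o_start, o_end in others
            )
            if not clashes:
                kept.append((start, end))
        result[speaker] = kept
    return result


segments = {
    "speaker_0": [(0.0, 2.0), (5.0, 6.0)],
    "speaker_1": [(1.5, 3.0), (8.0, 9.0)],
}
print(drop_overlapping(segments))
```

With no buffer, only the two segments that overlap between 1.5 s and 2.0 s are removed. A larger buffer removes segments that merely sit close to another speaker's turn.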

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.filter_short_segments(
speaker_segments: dict[str, list[tuple[float, float]]],
min_duration: float | None = None
) -> dict[str, list[tuple[float, float]]]

Filter out segments that are shorter than the minimum duration.
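
A minimal sketch of duration filtering follows, using a hypothetical standalone helper (`keep_long_segments`). It assumes durations are measured in seconds and that a segment exactly at the threshold is kept. Whether the real method also removes speakers left with no segments is not stated, so this sketch keeps them with an empty list.

```python
def keep_long_segments(speaker_segments, min_duration=1.0):
    """Hypothetical helper: keep segments lasting at least min_duration seconds."""
    return {
        speaker: [(start, end) for start, end in segments if end - start >= min_duration]
        for speaker, segments in speaker_segments.items()
    }


segments = {
    "speaker_0": [(0.0, 0.4), (1.0, 3.0)],
    "speaker_1": [(3.0, 3.2)],
}
print(keep_long_segments(segments, min_duration=0.5))
```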

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.get_speaker_audio_data(
audio_path_or_waveform: str | torch.Tensor,
sample_rate: int | None = None,
gap_threshold: float | None = None,
exclude_overlaps: bool | None = None,
min_duration: float | None = None,
buffer_time: float | None = None
) -> dict[str, nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerResult]

Process an audio file or waveform and return AudioSegment objects for each speaker.

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.get_speaker_segments(
predicted_segments: list[str]
) -> dict[str, list[tuple[float, float]]]

Parse predicted segments and organize by speaker.
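
The format of the predicted segment strings is not documented here. The sketch below assumes each string looks like `"<start> <end> <speaker_label>"` with times in seconds, and groups the parsed intervals by speaker. Both the format and the helper name `group_by_speaker` are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_speaker(predicted_segments):
    """Hypothetical helper: parse "start end speaker" strings (an assumed
    format) into a mapping of speaker label to (start, end) tuples."""
    by_speaker = defaultdict(list)
    for line in predicted_segments:
        start, end, speaker = line.split()
        by_speaker[speaker].append((float(start), float(end)))
    return dict(by_speaker)


predicted = [
    "0.00 1.50 speaker_0",
    "1.50 3.25 speaker_1",
    "3.25 4.00 speaker_0",
]
print(group_by_speaker(predicted))
```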

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.merge_adjacent_segments(
segments: list[tuple[float, float]],
gap_threshold: float | None = None
) -> list[tuple[float, float]]

Merge adjacent segments for the same speaker if they are close enough.
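
The merging step can be sketched as a single pass over the sorted segments. The standalone helper below (`merge_close_segments`, a hypothetical name) assumes "close enough" means the silence between two consecutive segments is at most `gap_threshold` seconds. The module's exact comparison and default threshold are not given in this reference.

```python
def merge_close_segments(segments, gap_threshold=0.5):
    """Hypothetical helper: merge consecutive segments whose gap is at
    most gap_threshold seconds (an assumed definition of "close enough")."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap_threshold:
            prev_start, prev_end = merged[-1]
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


print(merge_close_segments([(0.0, 1.0), (1.2, 2.0), (5.0, 6.0)], gap_threshold=0.5))
```

Sorting first makes the helper tolerant of unordered input, and taking the `max` of the end times handles segments that are nested inside or overlap the previous one.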

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.SpeakerSeparator.process_audio(
audio_path_or_waveform: str | torch.Tensor,
sample_rate: int | None = None,
gap_threshold: float | None = None,
exclude_overlaps: bool | None = None,
min_duration: float | None = None,
buffer_time: float | None = None
) -> dict[str, list[tuple[float, float]]]

Process an audio file or waveform to get speaker segments.

nemo_curator.stages.audio.segmentation.speaker_separation_module.speaker_sep.load_audio(
audio_path: str
) -> tuple[torch.Tensor, int]

Load audio file using soundfile.

Uses soundfile directly to avoid torchaudio/torchcodec/FFmpeg dependency issues.

Parameters:

audio_path: str

Path to the audio file.

Returns: tuple[torch.Tensor, int]

Tuple of (waveform tensor, sample_rate)