Speaker Separation
Diarize multi-speaker audio and fan out one task per detected speaker so that downstream stages can score each speaker’s audio independently. NeMo Curator ships two diarization stages built on NVIDIA’s Sortformer family. Both target up to 4 speakers per file; choose between them based on whether your workload is offline batch curation or streaming/online inference.
Understanding Diarization
What Diarization Does
Diarization answers “who spoke when?” — it segments an audio stream into per-speaker regions, identifying that speaker A talks 0.0–3.5s, speaker B talks 3.5–7.0s, speaker A returns at 7.0–9.0s, and so on. The output is one AudioTask per speaker, each containing only that speaker’s audio.
This unlocks per-speaker filtering: pipelines can rerun VAD, UTMOS, SIGMOS, and the band filter separately on each speaker, dropping individual low-quality speakers without losing the rest of the recording.
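To make the fan-out concrete, here is a sketch of a downstream per-speaker filter. The duration_sec attribute follows the offline stage outputs documented below; the threshold and the filter logic itself are illustrative, not a real NeMo Curator API.

```python
def keep_speaker(task) -> bool:
    # Each fanned-out task holds exactly one speaker's audio, so a score
    # computed here applies to that speaker alone. duration_sec is
    # documented below; the 1-second threshold is made up.
    return task.duration_sec >= 1.0

# diarized_tasks would be the fan-out produced by the diarization stage.
curated = [task for task in diarized_tasks if keep_speaker(task)]
```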
Choosing a Stage
For most curation pipelines, SpeakerSeparationStage (offline) is the right choice. Use the streaming variant only when you need bounded latency or RTTM output.
Offline Speaker Separation
Step 1: Configure the Stage
The stage produces a fan-out list of AudioTask objects, one per detected speaker, each carrying:
- speaker_id — speaker identifier (0, 1, 2, …)
- num_speakers — total speakers found in this file
- duration_sec — duration of this speaker’s audio
- waveform — that speaker’s torch tensor with overlapping regions removed (when exclude_overlaps=True)
A GPU is required (Resources(cpus=1.0, gpus=1.0) by default).
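A minimal construction sketch follows. The import path is an assumption; locate SpeakerSeparationStage in your installed NeMo Curator version.

```python
# Sketch: construct the offline diarization stage. The import path below
# is an assumption -- find SpeakerSeparationStage in your installed package.
from nemo_curator.stages.audio.speaker_separation import SpeakerSeparationStage

diarize = SpeakerSeparationStage(
    exclude_overlaps=True,  # strip regions where multiple speakers overlap
)

# Input: one AudioTask per file. Output: one AudioTask per detected
# speaker (up to 4), each carrying speaker_id, num_speakers,
# duration_sec, and that speaker's waveform.
```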
Step 2: Tune Overlap and Gap Handling
Speaker overlap regions (where multiple speakers talk simultaneously) and short gaps between same-speaker turns affect output quality:
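As an illustration, overlap and gap behavior might be tuned as shown below. Only exclude_overlaps is confirmed by this page; the gap parameter name is a placeholder for the real one in the parameter table.

```python
# Hypothetical tuning sketch, continuing the construction above. Only
# exclude_overlaps is confirmed by this page; min_gap_duration is a
# placeholder for the documented gap parameter.
diarize = SpeakerSeparationStage(
    exclude_overlaps=True,   # True gives cleaner per-speaker training audio
    min_gap_duration=0.3,    # placeholder: bridge same-speaker gaps < 0.3 s
)
```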
SpeakerSeparationStage Parameters
Streaming Speaker Diarization
When to Use Streaming
The streaming variant (InferenceSortformerStage) is purpose-built for two use cases:
- Online / chunked workloads — bounded latency requirements that can’t tolerate waiting for the full utterance.
- RTTM output — downstream tooling (Kaldi, ESPnet, evaluation harnesses) consumes RTTM-format diarization output.
For pure offline curation, SpeakerSeparationStage is faster and more accurate.
Step 1: Configure the Stage
This stage does not fan out per speaker — instead it writes a diar_segments list onto the input AudioTask. Use it as a metadata-enriching stage; downstream code consumes the diar_segments field directly.
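A sketch of the stage in that metadata-enriching role. The import path and the per-entry layout of diar_segments are assumptions; inspect real output to confirm the schema.

```python
# Sketch: streaming diarization as a metadata-enriching stage. The import
# path is an assumption; the stage name matches this page.
from nemo_curator.stages.audio.inference import InferenceSortformerStage

diarize = InferenceSortformerStage()  # defaults; see chunk_len below


def summarize(task) -> None:
    # diar_segments is attached to the input AudioTask by the stage.
    # The (start, end, speaker) entry layout shown here is an assumed
    # schema -- inspect a real task to confirm.
    for start, end, speaker in task.diar_segments:
        print(f"speaker {speaker}: {start:.2f}-{end:.2f}s")
```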
Step 2: Tune Latency
chunk_len controls the trade-off between latency and accuracy: shorter chunks return results sooner but give the model less audio context per update.
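For example, continuing the sketch above (the value and its unit are illustrative; take the real default and unit from the parameter table below):

```python
# Illustrative: trade some accuracy for lower latency by shrinking the
# chunk. The value here is a placeholder -- see the parameter table.
low_latency = InferenceSortformerStage(chunk_len=6)
```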
Streaming-mode evaluation on CallHome-eng0 (139 files) at the default settings: 6.2% macro DER, 6.0% weighted DER at a 0.25-second collar.
InferenceSortformerStage Parameters
Default resource allocation: Resources(cpus=1.0, gpu_memory_gb=8.0).
Complete Speaker Separation Pipeline
A pipeline that diarizes, then runs per-speaker quality filters. The sketch below is illustrative: only SpeakerSeparationStage is documented on this page, and the Pipeline wrapper, import paths, and filter-stage names are assumptions to verify against your NeMo Curator version:
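```python
# Illustrative end-to-end sketch. Import paths and the filter stages are
# assumptions standing in for the VAD, band, UTMOS, and SIGMOS stages
# described elsewhere in these docs.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.speaker_separation import SpeakerSeparationStage

pipeline = Pipeline(
    name="speaker_separation_curation",
    stages=[
        # Upstream (not shown): VAD segmentation + segment concatenation,
        # so diarization sees clean, speech-only audio.
        SpeakerSeparationStage(exclude_overlaps=True),  # fan out per speaker
        # Per-speaker quality chain (placeholders for the real stages):
        # VADStage() -> BandFilterStage() -> UTMOSStage() -> SIGMOSStage()
        # Each fanned-out task is scored independently, so one bad speaker
        # is dropped without losing the others.
    ],
)
pipeline.run()
```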
Best Practices
- Use offline mode unless you specifically need streaming: SpeakerSeparationStage is faster and more accurate than InferenceSortformerStage for batch curation.
- Run VAD + concat before diarization: feeding diarization a clean, concatenated, speech-only waveform (no long silences) is cheaper and more reliable than feeding raw audio.
- Pair with per-speaker quality filters: place the filtering chain (VAD → Band → UTMOS → SIGMOS) after speaker separation so each speaker’s audio is scored independently. Bad speakers get dropped; good speakers from the same file are kept.
- Mind the 4-speaker model limit: both stages target up to 4 speakers per file. Files with more speakers will likely produce degraded diarization.
- Don’t set exclude_overlaps=False for training data: overlapping speech is hard for downstream models; only disable overlap exclusion when you explicitly want to preserve natural conversation.
Related Topics
- Preprocessing Stages — SegmentConcatenationStage and TimestampMapperStage are typically paired with speaker separation.
- VAD Segmentation — a typical upstream stage producing the segments fed into diarization.
- AudioDataFilterStageComposite — bundles offline speaker separation with per-speaker filters into the standard pipeline.