Speaker Separation

Diarize multi-speaker audio and fan out one task per detected speaker so that downstream stages can score each speaker’s audio independently. NeMo Curator ships two diarization stages built on NVIDIA’s Sortformer family. Both target up to 4 speakers per file; choose based on whether your workload is offline batch curation or streaming/online.

Understanding Diarization

What Diarization Does

Diarization answers “who spoke when?” — it segments an audio stream into per-speaker regions, identifying that speaker A talks 0.0–3.5s, speaker B talks 3.5–7.0s, speaker A returns at 7.0–9.0s, and so on. The output is one AudioTask per speaker, each containing only that speaker’s audio.

This unlocks per-speaker filtering: pipelines can rerun VAD, UTMOS, SIGMOS, and the band filter separately on each speaker, dropping individual low-quality speakers without losing the rest of the recording.
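
To make the fan-out concrete, here is a minimal sketch in plain Python (illustrative only, not a NeMo Curator API) of how the example timeline above groups into per-speaker segment lists before each speaker becomes its own task:

# Illustrative only: diarization output for the example timeline above.
segments = [
    ("speaker_0", 0.0, 3.5),
    ("speaker_1", 3.5, 7.0),
    ("speaker_0", 7.0, 9.0),
]

# Fan-out groups segments by speaker so each downstream task sees one voice.
per_speaker = {}
for spk, start, end in segments:
    per_speaker.setdefault(spk, []).append((start, end))
# {'speaker_0': [(0.0, 3.5), (7.0, 9.0)], 'speaker_1': [(3.5, 7.0)]}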

Choosing a Stage

| Stage | Model | Best For |
| --- | --- | --- |
| SpeakerSeparationStage | nvidia/diar_sortformer_4spk-v1 (offline) | Bulk offline curation. Used inside AudioDataFilterStage. Higher accuracy because it sees the whole utterance. |
| InferenceSortformerStage | nvidia/diar_streaming_sortformer_4spk-v2.1 (streaming) | Online/chunked workloads with bounded latency. Supports RTTM output for downstream tools. |

For most curation pipelines, SpeakerSeparationStage (offline) is the right choice. Use the streaming variant only when you need bounded latency or RTTM output.

Offline Speaker Separation

Step 1: Configure the Stage

from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage

speaker_sep = SpeakerSeparationStage(
    model_path="nvidia/diar_sortformer_4spk-v1",
    exclude_overlaps=True,
    min_duration=0.8,
    gap_threshold=0.1,
    buffer_time=0.5,
)
pipeline.add_stage(speaker_sep)

The stage produces a fan-out list of AudioTask objects, one per detected speaker, each carrying:

  • speaker_id — speaker identifier (0, 1, 2, …)
  • num_speakers — total speakers found in this file
  • duration_sec — duration of this speaker’s audio
  • waveform — that speaker’s torch tensor with overlapping regions removed (when exclude_overlaps=True)

GPU is required (Resources(cpus=1.0, gpus=1.0) by default).
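
These fields let downstream code branch per speaker. A minimal sketch, assuming AudioTask exposes them as attributes (the field names come from the list above; the access pattern is illustrative, not a guaranteed API):

# Hypothetical per-speaker gate; adjust to however your stage reads task metadata.
def keep_speaker(task) -> bool:
    # Keep speakers with at least 1 s of audio in files the model can handle.
    return task.duration_sec >= 1.0 and task.num_speakers <= 4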

Step 2: Tune Overlap and Gap Handling

Speaker overlap regions (where multiple speakers talk simultaneously) and short gaps between same-speaker turns affect output quality:

| Parameter | Effect |
| --- | --- |
| exclude_overlaps=True (default) | Drops overlapping regions. Better for clean per-speaker training data. |
| exclude_overlaps=False | Includes overlapping regions in each speaker’s audio. Useful when you want to preserve natural conversation. |
| gap_threshold=0.1 (default) | Same-speaker turns separated by < 100 ms are merged. Increase to 0.3–0.5 for more aggressive merging on fragmented diarization. |
| min_duration=0.8 (default) | Drops speakers whose total audio is shorter than 0.8 seconds. Filters out spurious speaker detections. |
| buffer_time=0.5 (default) | Buffer (in seconds) added around each merged speaker segment to avoid clipping turn boundaries. |
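
For example, if diarization output looks fragmented (many short turns from the same speaker), a more aggressive merge configuration might look like the sketch below; the parameter names are the documented ones, the values are illustrative:

from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage

speaker_sep = SpeakerSeparationStage(
    exclude_overlaps=True,  # keep per-speaker audio clean
    gap_threshold=0.4,      # merge same-speaker turns separated by < 400 ms
    min_duration=1.0,       # drop speakers with under 1 s of total audio
    buffer_time=0.5,        # keep the default half-second boundary buffer
)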

SpeakerSeparationStage Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_path | str | "nvidia/diar_sortformer_4spk-v1" | Hugging Face model ID or path to a .nemo checkpoint. |
| exclude_overlaps | bool | True | Drop regions where multiple speakers overlap. |
| min_duration | float | 0.8 | Minimum per-speaker segment duration (seconds). |
| gap_threshold | float | 0.1 | Gap threshold (seconds) for merging adjacent same-speaker segments. |
| buffer_time | float | 0.5 | Buffer (seconds) added around each merged speaker segment. |
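
Because model_path also accepts a local checkpoint, an air-gapped setup can point at a downloaded .nemo file instead of a Hugging Face ID (the path below is a placeholder):

from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage

# Placeholder path: use wherever you stored the downloaded checkpoint.
speaker_sep = SpeakerSeparationStage(
    model_path="/models/diar_sortformer_4spk-v1.nemo",
)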

Streaming Speaker Diarization

When to Use Streaming

The streaming variant (InferenceSortformerStage) is purpose-built for two use cases:

  1. Online / chunked workloads — bounded latency requirements that can’t tolerate waiting for the full utterance.
  2. RTTM output — downstream tooling (Kaldi, ESPnet, evaluation harnesses) consumes RTTM-format diarization output.

For pure offline curation, SpeakerSeparationStage is faster and more accurate.
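
For reference, RTTM is a plain-text format with one SPEAKER line per turn: type, file ID, channel, turn onset (seconds), turn duration (seconds), and speaker label, with <NA> placeholders for unused fields. An illustrative file (file name and speaker labels are made up) looks like:

SPEAKER call_001 1 0.00 3.50 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER call_001 1 3.50 3.50 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER call_001 1 7.00 2.00 <NA> <NA> speaker_0 <NA> <NA>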

Step 1: Configure the Stage

from nemo_curator.stages.audio.inference.sortformer import InferenceSortformerStage

streaming = InferenceSortformerStage(
    model_name="nvidia/diar_streaming_sortformer_4spk-v2.1",
    rttm_out_dir="./rttm",
    chunk_len=340,  # ~30.4 seconds latency in 80 ms frames
    inference_batch_size=1,
)
pipeline.add_stage(streaming)

This stage does not fan out per speaker — instead it writes a diar_segments list onto the input AudioTask. Use it as a metadata-enriching stage; downstream code consumes the diar_segments field directly.
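
A minimal sketch of such a consumer, assuming each diar_segments entry records its speaker label (the field name is the documented default; the element shape is an assumption):

# Hypothetical consumer of the metadata-enriching stage.
def summarize_diarization(task) -> dict:
    segments = task.diar_segments  # written by InferenceSortformerStage
    speakers = {seg["speaker"] for seg in segments}  # assumed element shape
    return {"num_segments": len(segments), "num_speakers": len(speakers)}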

Step 2: Tune Latency

chunk_len controls latency vs accuracy:

| chunk_len | Latency | Accuracy |
| --- | --- | --- |
| 100 (~8 s) | Low | Lower (less context) |
| 340 (default, ~30.4 s) | Medium | Good |
| 600 (~48 s) | High | Best |
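
As a rule of thumb, each frame covers 80 ms, so chunk_len alone spans 8 s, 27.2 s, and 48 s of audio for the three rows above; the ~30.4 s quoted for the default also counts the default chunk_right_context of 40 frames (3.2 s) that must arrive before the chunk can be scored.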

Streaming-mode evaluation on CallHome-eng0 (139 files) at the default settings: 6.2% macro DER, 6.0% weighted DER at a 0.25-second collar.

InferenceSortformerStage Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "nvidia/diar_streaming_sortformer_4spk-v2.1" | Hugging Face model ID. |
| model_path | str \| None | None | Local .nemo checkpoint; overrides model_name when set. |
| cache_dir | str \| None | None | Cache directory for downloaded model weights. |
| filepath_key | str | "audio_filepath" | Manifest key with the audio path. |
| diar_segments_key | str | "diar_segments" | Output manifest key for the diarization segment list. |
| rttm_out_dir | str \| None | None | Optional directory for per-file RTTM output. |
| chunk_len | int | 340 | Streaming chunk size in 80 ms frames. |
| chunk_left_context | int | 1 | Left-context frames retained between chunks. |
| chunk_right_context | int | 40 | Right-context frames retained between chunks. |
| fifo_len | int | 40 | FIFO queue size in frames. |
| spkcache_update_period | int | 300 | Speaker-cache update period in frames. |
| spkcache_len | int | 188 | Speaker-cache size in frames. |
| inference_batch_size | int | 1 | Batch size passed to diarize(). |

Default resource allocation: Resources(cpus=1.0, gpu_memory_gb=8.0).

Complete Speaker Separation Pipeline

A pipeline that diarizes, then runs per-speaker quality filters:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="speaker_diarization")

# 1. Normalize and segment
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))

# 2. Concatenate surviving segments per file
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 3. Diarize and fan out per speaker
pipeline.add_stage(SpeakerSeparationStage(exclude_overlaps=True))

# 4. Per-speaker quality filter
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 5. Resolve final timestamps
pipeline.add_stage(
    TimestampMapperStage(
        passthrough_keys=["speaker_id", "num_speakers", "utmos_mos"]
    )
)

# 6. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./per_speaker_audio"))

executor = XennaExecutor()
pipeline.run(executor)
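
After the run, each JSONL record carries the passthrough keys, so per-speaker results can be grouped with plain Python (the output path matches the writer above; the record keys are assumed to match the passthrough list):

import json
from collections import defaultdict
from pathlib import Path

# Group exported records by speaker; assumes each JSONL line carries the
# passthrough keys configured in TimestampMapperStage above. Adjust the
# glob if the writer shards output differently.
by_speaker = defaultdict(list)
for path in Path("./per_speaker_audio").glob("*.jsonl"):
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            by_speaker[record["speaker_id"]].append(record)

print({spk: len(recs) for spk, recs in by_speaker.items()})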

Best Practices

  • Use offline mode unless you specifically need streaming: SpeakerSeparationStage is faster and more accurate than InferenceSortformerStage for batch curation.
  • Run VAD + concat before diarization: feeding diarization a clean concatenated speech-only waveform (no long silences) is cheaper and more reliable than feeding raw audio.
  • Pair with per-speaker quality filters: place the filtering chain (VAD → Band → UTMOS → SIGMOS) after speaker separation so each speaker’s audio is scored independently. Bad speakers get dropped; good speakers from the same file are kept.
  • Mind the 4-speaker model limit: both stages target up to 4 speakers per file. Files with more speakers will likely produce degraded diarization.
  • Don’t set exclude_overlaps=False for training data: overlapping speech is hard for downstream models; disable overlap exclusion only when you explicitly want to preserve natural conversation.

Related

  • Preprocessing Stages — SegmentConcatenationStage and TimestampMapperStage are typically paired with speaker separation.
  • VAD Segmentation — typical upstream stage producing the segments fed into diarization.
  • AudioDataFilterStage Composite — bundles offline speaker separation with per-speaker filters into the standard pipeline.