NeMo Speaker Diarization Configuration Files

Because the speaker diarization model here is an inference pipeline rather than a fully trainable end-to-end model, the configuration uses a diarizer section instead of the model section used in other tasks.

The diarizer section generally requires information about the dataset(s) being used, the models used in this pipeline, and inference-related parameters such as the postprocessing applied for each model. The sections on this page cover each of these in more detail.

An example configuration file for speaker diarization can be found at <NeMo_git_root>/examples/speaker_recognition/conf/speaker_diarization.yaml.

Note

For model details and a deeper understanding of the configs, fine-tuning, threshold tuning, and evaluation, please refer to <NeMo_git_root>/tutorials/speaker_recognition/Speaker_Diarization_Inference.ipynb; for other applications, such as possible integration with ASR, have a look at <NeMo_git_root>/tutorials/speaker_recognition/ASR_with_SpeakerDiarization.ipynb.

Dataset Configuration

In contrast to other ASR-related tasks or models in NeMo, speaker diarization in NeMo is a modular inference pipeline. The dataset here denotes the data on which you would like to perform speaker diarization.

An example Speaker Diarization dataset configuration could look like:

diarizer:
  num_speakers: 2 # for each recording
  out_dir: ???
  paths2audio_files: null # either a list of audio file paths or a file containing the paths of the audio files to diarize.
  path2groundtruth_rttm_files: null # (Optional) either a list of RTTM file paths or a file containing paths to RTTM files (pass this to calculate the DER against ground-truth RTTM files).
  ...

Note

We expect each audio file and its corresponding RTTM file to share the same base name, and each base name should be unique.
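
The naming expectation above can be checked before running the pipeline. The following is a minimal sketch, not part of NeMo: the helper name check_base_names is hypothetical, and it assumes that "base name" means the file name without its directory and final extension.

```python
from collections import Counter
from pathlib import Path


def check_base_names(audio_paths, rttm_paths):
    """Return a list of problems; an empty list means the naming looks consistent.

    Hypothetical helper (not a NeMo API). Assumes the base name is the
    file name without its directory and final extension.
    """
    problems = []
    audio_names = [Path(p).stem for p in audio_paths]
    rttm_names = [Path(p).stem for p in rttm_paths]

    # Each audio base name should appear only once.
    for name, count in Counter(audio_names).items():
        if count > 1:
            problems.append(f"duplicate audio base name: {name}")

    # Each RTTM file should pair with an audio file, and vice versa.
    for name in sorted(set(rttm_names) - set(audio_names)):
        problems.append(f"RTTM without matching audio: {name}")
    for name in sorted(set(audio_names) - set(rttm_names)):
        problems.append(f"audio without matching RTTM: {name}")
    return problems
```

Since the ground-truth RTTM files are optional, the "audio without matching RTTM" check is only meaningful when RTTM paths are actually supplied.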

Diarizer Architecture Configurations

diarizer:
  ...
  vad:
    model_path: null #.nemo local model path or pretrained model name or none
    window_length_in_sec: 0.15
    shift_length_in_sec: 0.01
    threshold: 0.5 # tune threshold on dev set. Check <NeMo_git_root>/scripts/voice_activity_detection/vad_tune_threshold.py
    vad_decision_smoothing: True
    smoothing_params:
      method: "median"
      overlap: 0.875

  speaker_embeddings:
    oracle_vad_manifest: null # leave this null to perform diarization with the VAD model above; otherwise set it to the path of a manifest file generated as shown in the Datasets section
    model_path: ??? #.nemo local model path or pretrained model name
    window_length_in_sec: 1.5
    shift_length_in_sec: 0.75
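
To illustrate how window_length_in_sec and shift_length_in_sec interact, here is a minimal sketch that splits one speech region into overlapping windows. It assumes the two parameters describe a plain sliding window; the function name sliding_windows is hypothetical, and truncating the final window at the end of the region is an assumption, not a statement about how NeMo handles short trailing segments.

```python
def sliding_windows(start, end, window=1.5, shift=0.75):
    """Split the region [start, end) in seconds into overlapping (start, end) windows.

    Hypothetical illustration (not a NeMo API). The last window is
    truncated at `end`, which is an assumption made for this sketch.
    """
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must be positive")
    segments = []
    index = 0
    while True:
        seg_start = start + index * shift
        if seg_start >= end:
            break
        seg_end = min(seg_start + window, end)
        segments.append((round(seg_start, 3), round(seg_end, 3)))
        if seg_end >= end:
            # The region is fully covered; stop before emitting redundant windows.
            break
        index += 1
    return segments


# With the speaker_embeddings values above, a 3-second speech region
# yields three windows, each overlapping its neighbor by 0.75 s.
print(sliding_windows(0.0, 3.0))
# prints [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0)]
```

A shift equal to half the window, as in the speaker_embeddings values above, means every point in the interior of a region is covered by two windows.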