NeMo Speaker Diarization Configuration Files
============================================

Since the speaker diarization model here is not a fully trainable end-to-end model but an inference pipeline, we use a **diarizer** section instead of the **model** section used in other tasks. The diarizer section generally requires information about the dataset(s) being used, the models used in the pipeline, and inference-related parameters such as the postprocessing applied to each model's output. The sections on this page cover each of these in more detail.

Example configuration files for speaker diarization can be found in ``/examples/speaker_recognition/conf/speaker_diarization.yaml``.

.. note::
    For model details and a deeper understanding of configs, finetuning, threshold tuning, and evaluation, please refer to ``/tutorials/speaker_recognition/Speaker_Diarization_Inference.ipynb``; for other applications, such as possible integration with ASR, have a look at ``/tutorials/speaker_recognition/ASR_with_SpeakerDiarization.ipynb``.

Dataset Configuration
---------------------

In contrast to other ASR-related tasks or models in NeMo, speaker diarization in NeMo is supported as a modular inference pipeline. The dataset here denotes the data on which you would like to perform speaker diarization.

An example speaker diarization dataset configuration could look like:

.. code-block:: yaml

  diarizer:
    num_speakers: 2 # for each recording
    out_dir: ???
    paths2audio_files: null # either a list of audio file paths or a file containing paths to the audio files on which diarization should be performed
    path2groundtruth_rttm_files: null # (Optional) either a list of RTTM file paths or a file containing paths to RTTM files (pass this if DER should be computed against ground-truth RTTM files)
    ...

.. note::
    We expect each audio file and its corresponding RTTM file to share the same base name, and the base names should be unique.

Diarizer Architecture Configurations
------------------------------------

.. code-block:: yaml

  diarizer:
    ...
    vad:
      model_path: null #.nemo local model path or pretrained model name or none
      window_length_in_sec: 0.15
      shift_length_in_sec: 0.01
      threshold: 0.5 # tune threshold on dev set. Check /scripts/voice_activity_detection/vad_tune_threshold.py
      vad_decision_smoothing: True
      smoothing_params:
        method: "median"
        overlap: 0.875

    speaker_embeddings:
      oracle_vad_manifest: null # leave this null to perform diarization with the VAD model above; otherwise provide the path to a manifest file generated as shown in the Datasets section
      model_path: ??? #.nemo local model path or pretrained model name
      window_length_in_sec: 1.5
      shift_length_in_sec: 0.75
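
A common way to consume such a config in a script is to load the YAML with ``omegaconf``, override the dataset- and model-specific fields, and hand the resulting config to the diarization pipeline. The sketch below is a minimal, illustrative example: the file paths, the pretrained model names, and the ``ClusteringDiarizer`` entry point are assumptions and may differ across NeMo versions; the example script under ``/examples/speaker_recognition`` shows the canonical usage for your release.

.. code-block:: python

  from omegaconf import OmegaConf

  # Load the example diarization config referenced above (path is illustrative).
  cfg = OmegaConf.load("examples/speaker_recognition/conf/speaker_diarization.yaml")

  # Point the diarizer at your own data: a text file listing audio paths (one per line)
  # and, optionally, a matching list of ground-truth RTTM files for DER scoring.
  cfg.diarizer.paths2audio_files = "audio_files.txt"           # hypothetical path
  cfg.diarizer.path2groundtruth_rttm_files = "rttm_files.txt"  # hypothetical path; may stay null
  cfg.diarizer.out_dir = "diarization_results"
  cfg.diarizer.num_speakers = 2

  # Select the models used in the pipeline. These names are examples of pretrained
  # checkpoints; a local .nemo path works as well.
  cfg.diarizer.vad.model_path = "vad_marblenet"
  cfg.diarizer.speaker_embeddings.model_path = "speakerverification_speakernet"

  # Run the inference pipeline. ClusteringDiarizer is assumed here; consult the
  # example script for the entry point matching your NeMo version.
  from nemo.collections.asr.models import ClusteringDiarizer

  diarizer = ClusteringDiarizer(cfg=cfg)
  diarizer.diarize()  # writes predicted RTTM files to cfg.diarizer.out_dir

If ground-truth RTTM files are provided, the Diarization Error Rate (DER) can be computed against the predicted RTTMs written to ``out_dir``; otherwise only the predicted segments are produced.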