Speaker Diarization#

Speaker diarization is the process of segmenting an audio recording by speaker label, answering the question “who spoke when?”. It is distinct from speech recognition: as shown in the figure below, speech recognition tells us “what is spoken” but not “who spoke it”. Speaker diarization is therefore an essential feature for a speech recognition system, enriching the transcription with speaker labels.

Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering

To figure out “who spoke when”, speaker diarization systems need to capture the characteristics of unseen speakers and determine which regions of the audio recording belong to which speaker. To achieve this, they extract voice characteristics, estimate the number of speakers, and assign each audio segment to its corresponding speaker index.

The following figure shows the overall data flow of the NeMo speaker diarization pipeline.

Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering

The NeMo speaker diarization system consists of the following modules:

- Voice Activity Detection (VAD): a trainable model that detects the presence or absence of speech in a given audio recording.
- Speaker Embedding Extractor: a trainable model that extracts speaker embedding vectors capturing voice characteristics from the raw audio signal.
- Clustering Module: a non-trainable module that groups speaker embedding vectors into a number of clusters.
- Neural Diarizer: a trainable model that estimates speaker labels from the given features.
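As a rough illustration of the clustering step, the sketch below groups segment-level speaker embeddings by cosine similarity and discovers the number of speakers on the fly. This is not NeMo code; the greedy threshold-based grouping, the `cluster_embeddings` helper, and the 0.6 threshold are illustrative assumptions.

```python
import numpy as np

def cluster_embeddings(embeddings, threshold=0.6):
    """Toy greedy clustering of speaker embeddings by cosine similarity.
    A stand-in for the non-trainable clustering module, not NeMo's algorithm."""
    # L2-normalize so dot products equal cosine similarity
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids = []  # running sum of embeddings per discovered speaker
    labels = []
    for e in embs:
        if centroids:
            sims = [c @ e / np.linalg.norm(c) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # Close enough to an existing speaker: reuse its index
                labels.append(best)
                centroids[best] = centroids[best] + e
                continue
        # No sufficiently similar speaker yet: open a new cluster
        centroids.append(e.copy())
        labels.append(len(centroids) - 1)
    return labels

# Two synthetic "speakers" with distinct embedding directions plus noise
rng = np.random.default_rng(0)
a = np.tile([1.0, 0.0, 0.0, 0.0], (3, 1)) + 0.05 * rng.standard_normal((3, 4))
b = np.tile([0.0, 1.0, 0.0, 0.0], (3, 1)) + 0.05 * rng.standard_normal((3, 4))
segments = np.vstack([a[0], b[0], a[1], b[1], a[2], b[2]])
print(cluster_embeddings(segments))  # → [0, 1, 0, 1, 0, 1]
```

Note that the number of speakers is not fixed in advance; it emerges from how many clusters the similarity threshold admits, mirroring the speaker-counting role of the clustering module.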

Speaker diarization inference can be run in two different modes:

- oracle VAD: speaker diarization based on ground-truth VAD timestamps
- system VAD: speaker diarization based on the results from a VAD model
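The difference between the two modes can be sketched as follows: in oracle mode the ground-truth speech timestamps are used directly, while in system mode per-frame VAD scores are thresholded and converted into speech segments. The `speech_segments` helper, the 0.1 s frame hop, and the 0.5 threshold are hypothetical choices for illustration, not NeMo's actual API.

```python
import numpy as np

FRAME_SEC = 0.1  # hypothetical VAD frame hop in seconds

def frames_to_segments(speech_mask, frame_sec=FRAME_SEC):
    """Convert per-frame speech/non-speech decisions into (start, end) times."""
    segments, start = [], None
    for i, is_speech in enumerate(speech_mask):
        if is_speech and start is None:
            start = round(i * frame_sec, 2)
        elif not is_speech and start is not None:
            segments.append((start, round(i * frame_sec, 2)))
            start = None
    if start is not None:  # speech runs to the end of the recording
        segments.append((start, round(len(speech_mask) * frame_sec, 2)))
    return segments

def speech_segments(vad_scores=None, oracle_segments=None, threshold=0.5):
    """Oracle mode: trust the ground-truth timestamps as-is.
    System mode: threshold the VAD model's frame-level scores."""
    if oracle_segments is not None:  # oracle VAD
        return oracle_segments
    return frames_to_segments(np.asarray(vad_scores) > threshold)  # system VAD

scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.1]
print(speech_segments(vad_scores=scores))  # → [(0.1, 0.3), (0.5, 0.7)]
```

Oracle VAD is useful for isolating clustering and embedding quality from VAD errors when evaluating a diarization system.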

The full documentation tree is as follows:

Resource and Documentation Guide#

Hands-on speaker diarization tutorial notebooks can be found under <NeMo_git_root>/tutorials/speaker_tasks.

There are tutorials for performing speaker diarization inference with MarbleNet (VAD), the TitaNet model, and the Multi-scale Diarization Decoder model. We also provide tutorials on obtaining ASR transcriptions combined with speaker labels and voice activity timestamps using the NeMo ASR collections.
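One common way to combine ASR output with diarization output is to assign each recognized word to the diarization segment that contains the word's midpoint. The `assign_speakers` helper below is a hypothetical sketch of that idea, not the NeMo implementation.

```python
def assign_speakers(words, diar_segments):
    """Attach a speaker label to each ASR word by midpoint lookup.

    words: list of (word, start_sec, end_sec) from ASR word timestamps.
    diar_segments: list of (start_sec, end_sec, speaker) from diarization.
    """
    labeled = []
    for word, start, end in words:
        mid = (start + end) / 2
        # Find the diarization segment whose span covers the word midpoint
        speaker = next(
            (spk for s, e, spk in diar_segments if s <= mid < e), "unknown"
        )
        labeled.append((word, speaker))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
segments = [(0.0, 1.0, "speaker_0"), (1.0, 2.0, "speaker_1")]
print(assign_speakers(words, segments))
# → [('hello', 'speaker_0'), ('there', 'speaker_0'), ('hi', 'speaker_1')]
```

Midpoint lookup is a simple heuristic; words straddling a speaker change land with whichever speaker covers more than half of the word.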

Most of the tutorials can be run in Google Colab by opening the notebooks’ GitHub URLs in Colab.

If you are looking for information about a particular model used for speaker diarization inference, or would like to find out more about the model architectures available in the nemo_asr collection, check out the Models page.

Documentation on dataset preprocessing can be found on the Datasets page. NeMo includes preprocessing scripts for several common ASR datasets, and this page contains instructions on running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.

Information about how to load model checkpoints (either local files or pretrained ones from NGC) and perform inference, as well as a list of the checkpoints available on NGC, can be found on the Checkpoints page.

Documentation for configuration files specific to the nemo_asr models can be found on the Configuration Files page.