Datasets

See the Speech Classification Datasets and Speaker Recognition Datasets pages for preparing datasets to train and validate VAD and speaker embedding models, respectively.

For speaker diarization inference, the diarizer expects either a list of paths to audio files or a file containing absolute paths to audio files.

To generate a file that contains paths to audio files (which we call an scp file), you can use the find command in bash as shown below:

find $PWD/{data_dir} -iname '*.wav' > path_to_audiofiles.scp
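Where a shell is not available, the same listing can be produced from Python. The following is a minimal sketch using only the standard library; the function name write_scp and its parameters are hypothetical and not part of NeMo. It sorts the paths for reproducibility, which find does not do.

```python
from pathlib import Path


def write_scp(data_dir, scp_path, extension=".wav"):
    """Write the absolute path of every matching audio file under data_dir, one per line."""
    root = Path(data_dir).resolve()
    # rglob walks all subdirectories; comparing lowercased suffixes makes the
    # match case-insensitive, similar to `find -iname`.
    paths = sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == extension.lower()
    )
    with open(scp_path, "w", encoding="utf-8") as f:
        for p in paths:
            f.write(p + "\n")
    return paths


# Example call (the directory name is a placeholder):
# write_scp("data_dir", "path_to_audiofiles.scp")
```

The returned list can also be passed to the diarizer directly, since it accepts either a list of paths or an scp file.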

Preparing Evaluation Dataset

To score a diarizer model, we need to provide an scp-like file that lists the paths to the ground-truth label files. Each ground-truth label file should be in NIST Rich Transcription Time Marked (RTTM) format. Take one line from an RTTM file as an example:

SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>
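In the RTTM format, the whitespace-separated fields of a SPEAKER line are, in order: type, file ID, channel, turn onset in seconds, turn duration in seconds, orthography, speaker type, speaker name, confidence, and signal lookahead time, with <NA> marking unused fields. The sketch below parses the example line into the fields relevant for diarization; RttmSegment and parse_rttm_line are illustrative names, not NeMo functions.

```python
from typing import NamedTuple


class RttmSegment(NamedTuple):
    file_id: str
    channel: str
    onset: float      # turn start time in seconds
    duration: float   # turn length in seconds
    speaker: str


def parse_rttm_line(line):
    """Parse one SPEAKER line of an RTTM file into an RttmSegment."""
    fields = line.split()
    # The speaker name is the 8th field, so at least 8 fields are required.
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"not an RTTM SPEAKER line: {line!r}")
    return RttmSegment(
        file_id=fields[1],
        channel=fields[2],
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


segment = parse_rttm_line(
    "SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>"
)
print(segment.file_id, segment.speaker, segment.onset, segment.duration)
```

For the example line, this reads as speaker MTD046ID talking in recording TS3012d.Mix-Headset for 0.671 seconds starting at 331.573 seconds.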

Preparing ORACLE Manifest

To perform oracle diarization only, that is, taking speech activity timestamps from the ground truth instead of from VAD output, the diarizer expects an oracle manifest file that contains paths to audio files along with the start-time offset and duration of each segment.
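To make the shape of such a manifest concrete, here is a rough sketch that turns RTTM SPEAKER lines into one JSON line per segment. The key names audio_filepath, offset, and duration are assumptions borrowed from the usual NeMo manifest convention and are not confirmed by this page; the function rttm_to_manifest_lines is hypothetical and does not merge overlapping turns, so it is an illustration rather than a description of what the library writes.

```python
import json


def rttm_to_manifest_lines(audio_path, rttm_lines):
    """Return one JSON string per SPEAKER line, holding the path, offset and duration."""
    entries = []
    for line in rttm_lines:
        fields = line.split()
        # Skip blank lines and any non-SPEAKER record types.
        if len(fields) < 5 or fields[0] != "SPEAKER":
            continue
        entry = {
            "audio_filepath": audio_path,   # assumed key name
            "offset": float(fields[3]),     # segment start time in seconds
            "duration": float(fields[4]),   # segment length in seconds
        }
        entries.append(json.dumps(entry))
    return entries
```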

To prepare an oracle manifest file, use the helper function from speaker_utils as shown below:

import os

from nemo.collections.asr.parts.speaker_utils import write_rttm2manifest

# Write the oracle manifest to the current working directory.
oracle_manifest = os.path.join(os.getcwd(), 'oracle_manifest.json')
write_rttm2manifest(paths2audio_files=paths2audio_files,
                    paths2rttm_files=path2groundtruth_rttm_files,
                    manifest_file=oracle_manifest)

Here paths2audio_files and path2groundtruth_rttm_files are lists containing paths to the audio files and to their ground-truth RTTM files, respectively.
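If the audio and RTTM paths are stored in scp files, the two lists can be built as sketched below. This assumes each RTTM file shares its base name with its audio file (for example TS3012d.Mix-Headset.wav and TS3012d.Mix-Headset.rttm); read_scp and pair_audio_with_rttm are hypothetical helper names. This page does not say how the helper matches audio to RTTM files, so keeping both lists aligned by index is a cautious choice.

```python
import os


def read_scp(scp_path):
    """Read an scp file and return its non-empty lines as a list of paths."""
    with open(scp_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def pair_audio_with_rttm(audio_paths, rttm_paths):
    """Return (audio, rttm) lists where entries at the same index share a base name."""

    def base(path):
        # Strip the directory and the final extension only.
        return os.path.splitext(os.path.basename(path))[0]

    rttm_by_name = {base(p): p for p in rttm_paths}
    missing = [p for p in audio_paths if base(p) not in rttm_by_name]
    if missing:
        raise ValueError(f"no RTTM file found for: {missing}")
    return list(audio_paths), [rttm_by_name[base(p)] for p in audio_paths]


# Example (file names are placeholders):
# paths2audio_files, path2groundtruth_rttm_files = pair_audio_with_rttm(
#     read_scp("path_to_audiofiles.scp"), read_scp("path_to_rttm_files.scp"))
```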