Datasets¶
See the Speech Classification Datasets and Speaker Recognition Datasets pages for preparing datasets to train and validate VAD and speaker embedding models, respectively.
For speaker diarization inference, the diarizer expects either a list of paths to audio files or a file containing absolute paths to audio files.
To generate a file that contains paths to audio files (which we call an scp file), you can use the find bash command as shown below:
find $PWD/{data_dir} -iname '*.wav' > path_to_audiofiles.scp
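The same list can also be built from Python when a shell is not convenient. The following is a minimal sketch using only the standard library; the function name write_scp and its arguments are hypothetical and not part of NeMo.

```python
from pathlib import Path


def write_scp(data_dir, scp_path, suffix=".wav"):
    """Write absolute paths of all audio files under data_dir, one per line."""
    root = Path(data_dir).resolve()
    # Match the suffix case-insensitively, like `find -iname`.
    paths = sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )
    Path(scp_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The returned list can be passed to the diarizer directly, since it accepts either a list of paths or a file of absolute paths.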
Preparing Evaluation Dataset¶
To score a diarizer model, we need to provide an scp-like file that lists the paths to the ground-truth label files. Each ground-truth label file should be in NIST Rich Transcription Time Marked (RTTM) format. For example, here is one line from an RTTM file:
SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>
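In the line above, the whitespace-separated fields are, in order: type, file ID, channel, turn onset (seconds), turn duration (seconds), orthography, speaker type, speaker name, confidence, and signal lookahead time. The sketch below shows one way to read such a line in Python; RttmSegment and parse_rttm_line are hypothetical names used for illustration and are not part of NeMo.

```python
from typing import NamedTuple


class RttmSegment(NamedTuple):
    file_id: str
    channel: str
    onset: float      # turn start time, in seconds
    duration: float   # turn length, in seconds
    speaker: str


def parse_rttm_line(line):
    """Parse one SPEAKER line of an RTTM file into an RttmSegment."""
    fields = line.split()
    # The speaker name is the 8th field, so at least 8 fields are required.
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"not a valid SPEAKER line: {line!r}")
    return RttmSegment(
        file_id=fields[1],
        channel=fields[2],
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


seg = parse_rttm_line(
    "SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>"
)
print(seg.file_id, seg.speaker, seg.onset, seg.duration)
```

Here the segment belongs to speaker MTD046ID in recording TS3012d.Mix-Headset, starting at 331.573 seconds and lasting 0.671 seconds.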
Preparing ORACLE manifest¶
To perform oracle diarization only, that is, to take speech activity timestamps from the ground truth instead of from VAD output, the diarizer expects an oracle manifest file that contains paths to audio files along with the offset (start time) and duration of each segment.
To prepare an oracle manifest file, use the helper function from speaker_utils as shown below:
import os
from nemo.collections.asr.parts.speaker_utils import write_rttm2manifest

# Write the oracle manifest to the current working directory.
oracle_manifest = os.path.join(os.getcwd(), 'oracle_manifest.json')
write_rttm2manifest(paths2audio_files=paths2audio_files,
                    paths2rttm_files=path2groundtruth_rttm_files,
                    manifest_file=oracle_manifest)
Here paths2audio_files and path2groundtruth_rttm_files are lists containing paths to the audio files and to their ground-truth RTTM files, respectively.
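If the audio and RTTM paths are stored in scp files, the two lists can be read back with a few lines of Python. This is a minimal sketch: read_scp and check_pairing are hypothetical helpers, and the check assumes that each RTTM file shares its base name with its audio file and that both lists are in the same order, which the text above does not state.

```python
from pathlib import Path


def read_scp(scp_path):
    """Return the non-empty lines of an scp file as a list of paths."""
    with open(scp_path) as f:
        return [line.strip() for line in f if line.strip()]


def check_pairing(audio_paths, rttm_paths):
    """Raise if the lists differ in length or base names do not line up."""
    if len(audio_paths) != len(rttm_paths):
        raise ValueError("audio and RTTM lists have different lengths")
    for audio, rttm in zip(audio_paths, rttm_paths):
        # Assumption: foo.wav is labeled by foo.rttm.
        if Path(audio).stem != Path(rttm).stem:
            raise ValueError(f"mismatched pair: {audio} vs {rttm}")
```

For example, paths2audio_files could be obtained with read_scp('path_to_audiofiles.scp'), and the RTTM list from a similar scp file of your own, before both are checked with check_pairing and passed to write_rttm2manifest.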