Datasets ======== Check out page :doc:`Speech Classification Datasets <../speech_classification/datasets>` and :doc:`Speaker Recogniton Datasets <../speaker_recognition/datasets>` for preparing datasets for training and validating VAD and speaker embedding models respectively. For Speaker Diarization inference, ``diarizer`` expects either list of paths to audio files or a file containing absolute paths to audio files. For generating a file that contains paths to audio files (which we call as ``scp file``), you can simply use ``find`` bash command as shown below: .. code-block:: bash find $PWD/{data_dir} -iname '*.wav' > path_to_audiofiles.scp Preparing Evaluation Dataset ---------------------------- To score with a diarizer model, we need to provide an scp like file for groundtruth label file. Each groundtruth label file should be in NIST Rich Transcription Time Marked (RTTM) format. Take one line from a RTTM file for example: .. code-block:: bash SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 MTD046ID Prepraing ORACLE manifest ------------------------- To perform just oracle diarization, that is taking speech activity time stamps from groundtruths instead from VAD output, ``diarizer`` expects an orcale manifest file that contains paths to audio files with offset for start time and duration of segment. To prepare an oracle manifest file, use the helper function from ``speaker_utils`` as shown below: .. code-block:: python from nemo.collections.asr.parts.speaker_utils import write_rttm2manifest oracle_manifest = os.path.join(os.getcwd(),'oracle_manifest.json') write_rttm2manifest(paths2audio_files=paths2audio_files, paths2rttm_files=path2groundtruth_rttm_files, manifest_file=oracle_manifest) Here ``paths2audio_files`` and ``path2groundtruth_rttm_files`` are lists containing paths to audio files.