Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
Datasets
HI-MIA
Run the script below to download and process the HI-MIA dataset and generate files in the format supported by nemo_asr. Set the HI-MIA data folder with --data_root. The script is located in <nemo_root>/scripts.
python get_hi-mia_data.py --data_root=<data directory>
After download and conversion, your data folder should contain directories with the following set of files:
data/<set>/train.json
data/<set>/dev.json
data/<set>/{set}_all.json
data/<set>/utt2spk
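A quick way to confirm the layout above is to check each expected file for a given split. The following is a minimal sketch, not part of NeMo: the function name `missing_himia_files` is hypothetical, and it assumes that `{set}` in `{set}_all.json` expands to the same name as the `<set>` directory.

```python
from pathlib import Path


def missing_himia_files(data_root, set_name):
    """Return the expected files that are absent for one data split."""
    split_dir = Path(data_root) / set_name
    expected = [
        split_dir / "train.json",
        split_dir / "dev.json",
        # Assumption: {set} expands to the split directory name.
        split_dir / f"{set_name}_all.json",
        split_dir / "utt2spk",
    ]
    return [str(p) for p in expected if not p.is_file()]
```

An empty list means all four files were found for that split; the split name you pass in must match a directory that the download script actually created.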
All Other Datasets
These methods can be applied to any dataset to get similar training or inference manifest files.
The filelist_to_manifest.py script in the $<NeMo_root>/scripts/speaker_tasks/ folder generates a manifest file from a text file containing paths to audio files.
Sample filelist.txt contents:
/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav
/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav
/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav
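If you do not already have such a list, one can be produced by walking the audio directory. This is a minimal sketch using only the Python standard library; the helper name `write_filelist` and the default `*.wav` pattern are assumptions for illustration, not something NeMo provides.

```python
from pathlib import Path


def write_filelist(audio_root, out_path, pattern="*.wav"):
    """Write one absolute audio path per line; return how many files were listed."""
    # rglob searches audio_root and all of its subdirectories.
    paths = sorted(p.resolve() for p in Path(audio_root).rglob(pattern))
    with open(out_path, "w", encoding="utf-8") as f:
        for p in paths:
            f.write(f"{p}\n")
    return len(paths)
```

For example, `write_filelist("/data/datasets/voxceleb/data/dev/aac_wav", "filelist.txt")` would list every .wav file under that directory, sorted by path.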
This list file is used to generate the manifest file. The script has optional arguments to split the whole manifest file into train and dev sets and to cut audio files into smaller segments for robust training (for testing, we don't need to create segments for each utterance).
Sample usage:
python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=speaker_manifest.json
This would create a manifest containing file contents as shown below:
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav", "offset": 0, "duration": 4.16, "label": "id00179"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav", "offset": 0, "duration": 12.288, "label": "id00806"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav", "offset": 0, "duration": 4.608, "label": "id01510"}
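The sample output suggests how --id=-3 relates to the label: counting "/"-separated fields from the end of the path, the third-from-last field is the speaker directory. The sketch below illustrates that reading. It is an assumption inferred from the example, not the script's actual code; the function names are hypothetical, and the duration is passed in by hand, whereas the real script would have to measure it from the audio.

```python
import json


def speaker_label(audio_path, id_index=-3):
    """Pick the speaker label as one '/'-separated field of the path."""
    return audio_path.split("/")[id_index]


def manifest_line(audio_path, duration, id_index=-3, offset=0):
    """Build one JSON manifest line with the same keys as the sample output."""
    entry = {
        "audio_filepath": audio_path,
        "offset": offset,
        "duration": duration,  # supplied by the caller in this sketch
        "label": speaker_label(audio_path, id_index),
    }
    return json.dumps(entry)


path = "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav"
line = manifest_line(path, 4.16)
print(line)
```

For a dataset with a different directory layout, the index would need to point at whichever path field holds the speaker name.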
For other optional arguments, such as splitting the manifest file into train and dev sets and creating segments from each utterance, refer to the arguments described in the script.
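To make the idea of a train/dev split concrete, here is a minimal sketch that divides manifest entries speaker by speaker. It is not the logic used by filelist_to_manifest.py; the function name `split_manifest`, the `dev_fraction` parameter, and the fixed seed are all assumptions chosen for illustration.

```python
import json
import random
from collections import defaultdict


def split_manifest(lines, dev_fraction=0.1, seed=0):
    """Split JSON-lines manifest entries into train and dev lists, per speaker."""
    by_speaker = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            entry = json.loads(line)
            by_speaker[entry["label"]].append(entry)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, dev = [], []
    for label in sorted(by_speaker):
        entries = by_speaker[label]
        rng.shuffle(entries)
        # Rounds down, so speakers with very few utterances stay entirely in train.
        n_dev = int(len(entries) * dev_fraction)
        dev.extend(entries[:n_dev])
        train.extend(entries[n_dev:])
    return train, dev
```

Splitting within each speaker keeps every speaker label present in the training set, which matters when the labels are used as classification targets.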
Tarred Datasets
Similarly to ASR, you can tar your audio files and use the ASR dataset class TarredAudioToSpeechLabelDataset (corresponding to AudioToSpeechLabelDataset) for this case.
If you want to use a tarred dataset, have a look at ASR Tarred Datasets.