Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Datasets
HI-MIA
Run the script below to download and process the HI-MIA dataset and generate files in the format supported by nemo_asr. Set the HI-MIA data folder with --data_root. These scripts are located in <nemo_root>/scripts.
python get_hi-mia_data.py --data_root=<data directory>
After download and conversion, your data folder should contain directories with the following set of files:
data/<set>/train.json
data/<set>/dev.json
data/<set>/<set>_all.json
data/<set>/utt2spk
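As a quick sanity check after conversion, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of NeMo: the helper name `missing_himia_files` is hypothetical, and it assumes the four files listed above sit directly under `<data directory>/<set>`.

```python
from pathlib import Path

# File names taken from the listing above; "{set}" is filled in with the subset name.
EXPECTED_FILES = ("train.json", "dev.json", "{set}_all.json", "utt2spk")


def missing_himia_files(data_root, set_name):
    """Return the expected files that are absent from <data_root>/<set_name>."""
    set_dir = Path(data_root) / set_name
    expected = [name.format(set=set_name) for name in EXPECTED_FILES]
    return [name for name in expected if not (set_dir / name).is_file()]
```

An empty list means every expected file is present; anything else names the files to investigate before training.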
All Other Datasets
These methods can be applied to any dataset to get similar training or inference manifest files.
The filelist_to_manifest.py script in the $<NeMo_root>/scripts/speaker_tasks/ folder generates a manifest file from a text file containing paths to audio files.
Sample filelist.txt file contents:
/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav
/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav
/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav
This list file is used to generate the manifest file. The script has optional arguments to split the whole manifest file into train and dev sets and to cut audio files into smaller segments for robust training (for testing, there is no need to create segments for each utterance).
Sample usage:
python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=speaker_manifest.json
This would create a manifest with contents as shown below:
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav", "offset": 0, "duration": 4.16, "label": "id00179"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav", "offset": 0, "duration": 12.288, "label": "id00806"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav", "offset": 0, "duration": 4.608, "label": "id01510"}
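To make the mapping from file list to manifest concrete, here is a minimal sketch of how such entries could be produced. It is not the implementation of filelist_to_manifest.py. It assumes, based on the sample above, that `--id=-3` selects the path component at index -3 as the speaker label (`id00179` in the first sample path), and it reads durations with the standard-library `wave` module, so it only handles PCM WAV files. The function names `manifest_entry` and `write_manifest` are hypothetical.

```python
import json
import wave


def manifest_entry(audio_path, label_index=-3):
    """Build one manifest record for a PCM WAV file.

    label_index picks the "/"-separated path component used as the speaker
    label (assumed to mirror the --id argument in the sample usage).
    """
    label = audio_path.split("/")[label_index]
    with wave.open(audio_path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return {"audio_filepath": audio_path, "offset": 0, "duration": duration, "label": label}


def write_manifest(filelist_path, manifest_path, label_index=-3):
    """Write one JSON object per line, one line per path in the file list."""
    with open(filelist_path) as filelist, open(manifest_path, "w") as out:
        for line in filelist:
            path = line.strip()
            if path:  # skip blank lines
                out.write(json.dumps(manifest_entry(path, label_index)) + "\n")
```

Each output line is an independent JSON object, which is why the manifest can be read back one line at a time with `json.loads`.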
For other optional arguments, such as splitting the manifest file into train and dev sets or creating segments from each utterance, refer to the arguments described in the script.
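For intuition about what a train/dev split of a manifest involves, the sketch below holds out a fraction of each speaker's utterances for dev. It is an illustration under stated assumptions, not the script's own logic: the function name `split_manifest`, the `dev_fraction` and `seed` parameters, and the per-speaker strategy are all hypothetical choices, made on the assumption that every dev speaker should also appear in train.

```python
import random
from collections import defaultdict


def split_manifest(entries, dev_fraction=0.1, seed=0):
    """Split manifest records into (train, dev), holding out utterances per speaker."""
    by_label = defaultdict(list)
    for entry in entries:
        by_label[entry["label"]].append(entry)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, dev = [], []
    for label in sorted(by_label):
        utterances = by_label[label]
        rng.shuffle(utterances)
        # Always keep at least one utterance of each speaker in train.
        n_dev = min(int(len(utterances) * dev_fraction), len(utterances) - 1)
        dev.extend(utterances[:n_dev])
        train.extend(utterances[n_dev:])
    return train, dev
```

The input is a list of manifest records such as those shown earlier, for example the result of applying `json.loads` to each manifest line.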
Tarred Datasets
Similarly to ASR, you can tar your audio files and use the ASR dataset class TarredAudioToSpeechLabelDataset (corresponding to AudioToSpeechLabelDataset) for this case.
If you want to use a tarred dataset, see ASR Tarred Datasets.