Datasets

HI-MIA

Run the script to download and process hi-mia dataset in order to generate files in the supported format of nemo_asr. You should set the data folder of hi-mia using --data_root. These scripts are present in <nemo_root>/scripts

python get_hi-mia_data.py --data_root=<data directory>

After download and conversion, your data folder should contain directories with following set of files as:

  • data/<set>/train.json

  • data/<set>/dev.json

  • data/<set>/{set}_all.json

  • data/<set>/utt2spk

All-other Datasets

These methods can be applied to any dataset to get similar training manifest files.

First we prepare scp file(s) containing absolute paths to all the wav files required for each of the train, dev, and test set. This can be easily prepared by using find bash command as follows:

!find {data_dir}/{train_dir}  -iname "*.wav" > data/train_all.scp
!head -n 3 data/train_all.scp

Since we created the scp file for the train, we use scp_to_manifest.py to convert this scp file to a manifest file and then optionally split the files to train & dev for evaluating the models while training by using the –split flag. We wont be needing the –split option for the test folder. Accordingly please mention the id number, which is the field num separated by / to be considered as the speaker label. After the download and conversion, your data folder should contain directories with manifest files as:

  • data/<path>/train.json

  • data/<path>/dev.json

  • data/<path>/train_all.json

Each line in the manifest file describes a training sample - audio_filepath contains the path to the wav file, duration it’s duration in seconds, and label is the speaker class label:

{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}

Tarred Datasets

Similarly to ASR, you can tar your audio files and use ASR Dataset class TarredAudioToSpeechLabelDataset (corresponding to the AudioToSpeechLabelDataset) for this case.

If you want to use tarred dataset, have a look at ASR Tarred Datasets.