Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
Data Preprocessing
NeMo TTS recipes support most public TTS datasets, which span multiple languages, emotions, and speakers. Current recipes cover English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), and support for more languages is planned. NeMo provides corpus-specific data preprocessing scripts under scripts/dataset_processing/tts/ that convert common public TTS datasets into the format expected by the dataloaders defined in nemo/collections/tts/data/dataset.py. The nemo_tts collection expects each dataset to consist of a set of utterances in individual audio files plus a JSON manifest that describes the dataset, with information about one utterance per line. The audio files can be in any format supported by Pydub, though we recommend WAV files because they are the default and have been tested most thoroughly. NeMo accepts audio at any original sampling rate, although our scripts for supplementary data extraction and model training all set the target sampling rate to either 44100 Hz or 22050 Hz. If the original sampling rate differs from the target, feature preprocessing automatically resamples the audio to the target rate.
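To see ahead of time which files will be resampled, the header of a WAV file can be inspected with Python's standard `wave` module. The sketch below is not part of NeMo: the helper names `wav_sampling_rate` and `needs_resampling` and the default target of 22050 Hz are illustrative assumptions, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave


def wav_sampling_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def needs_resampling(path, target_rate=22050):
    """Return True when the file's sampling rate differs from the target rate.

    Hypothetical helper: target_rate=22050 is one of the two common targets
    mentioned in the text, chosen here only as a default for illustration.
    """
    return wav_sampling_rate(path) != target_rate
```

A file for which `needs_resampling` returns True is still usable; it only indicates that preprocessing will have to resample it.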
There should be one JSON manifest file per dataset that is passed in. Therefore, if you want separate training and validation datasets, you need separate manifests; otherwise, validation data will be loaded together with training data and vice versa. Each line of the manifest should be in the following format:
{
"audio_filepath": "/path/to/audio.wav",
"text": "the transcription of the utterance",
"normalized_text": "the normalized transcription of the utterance",
"speaker": 5,
"duration": 15.4
}
where "audio_filepath" provides an absolute path to the .wav file corresponding to the utterance, so that audio files can be located anywhere and need not sit in the same directory as the manifest; "text" contains the full transcript of the utterance (graphemes, phonemes, or a mixture of both); "normalized_text" is optional and contains the normalized "text", which lets the normalization step be skipped; "speaker" is the integer speaker ID; and "duration" is the duration of the utterance in seconds.
Each entry in the manifest (describing one audio file) should be enclosed in "{" and "}" and must be placed on one line (the example above is wrapped only for readability). The "key": value pairs should be separated by commas as shown above. NeMo requires that the manifest contain no blank lines, so that the total number of lines equals the total number of audio files in the dataset.
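The one-entry-per-line rule is easy to satisfy with Python's standard `json` module. The following is a minimal sketch, not NeMo's own code: the function names and the choice of `REQUIRED_KEYS` are assumptions made for illustration (the text only states that "normalized_text" is optional).

```python
import json

# Assumption for this sketch: these keys are treated as mandatory.
REQUIRED_KEYS = ("audio_filepath", "text", "duration")


def write_manifest(entries, manifest_path):
    """Write one JSON object per line, with no blank lines in between."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            missing = [key for key in REQUIRED_KEYS if key not in entry]
            if missing:
                raise ValueError(f"entry is missing keys: {missing}")
            # json.dumps produces a single line; ensure_ascii=False keeps
            # non-ASCII transcripts human-readable in the file.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Parse a manifest back into a list of dicts, one per line."""
    with open(manifest_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Because every entry occupies exactly one line, the length of the list returned by `read_manifest` equals the number of audio files described by the manifest.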
Once there is a manifest that describes each audio file in the dataset, set the path to the JSON manifest file in the experiment config file, for example, training_ds.manifest_filepath=<path/to/manifest.json>.
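Since training and validation data each need their own manifest, a single manifest can be divided into two disjoint files before the paths are set in the config. This is a rough sketch under stated assumptions: `split_manifest` is a hypothetical helper, and the defaults of 100 validation utterances and seed 100 are illustrative values, not NeMo defaults.

```python
import random


def split_manifest(manifest_path, train_path, val_path, val_size=100, seed=100):
    """Shuffle manifest lines with a fixed seed and write two disjoint manifests.

    Returns (number of training lines, number of validation lines).
    """
    with open(manifest_path, encoding="utf-8") as f:
        # Normalize line endings so the last line cannot merge with another
        # one after shuffling, and drop blank lines.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    if val_size >= len(lines):
        raise ValueError("val_size must be smaller than the number of utterances")
    random.Random(seed).shuffle(lines)
    with open(val_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:val_size])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[val_size:])
    return len(lines) - val_size, val_size
```

Fixing the seed makes the split reproducible, so the same utterances end up in the validation manifest on every run.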
By following the instructions below on how to run the corpus-specific scripts, you can get started either by processing those public datasets directly or by writing your own scripts to preprocess your custom datasets.
Public TTS Datasets
The table below summarizes the statistics for a collection of high-quality public datasets used by NeMo TTS. For a custom dataset, we recommend starting from the script of the public dataset whose sampling rate and number of speakers are closest to yours.
Language | Locale | Dataset Name | #spk-total | #spk-F | #spk-M | #hours-total | #hours-F | #hours-M | Sampling Rate | URL |
---|---|---|---|---|---|---|---|---|---|---|
English | en-US | LJSpeech | 1 | 1 | 0 | 23.92 | 23.92 | 0.00 | 22,050 Hz | |
English | en-US | LibriTTS (clean) | 1230 | 592 | 638 | 262.62 | 133.97 | 128.65 | 24,000 Hz | |
English | en-US | HiFiTTS | 10 | 6 | 4 | 291.60 | 158.30 | 133.30 | 44,100 Hz | |
German | de-DE | Thorsten Müller Neutral 21.02 dataset | 1 | 0 | 1 | 20.91 | 0.00 | 20.91 | 22,050 Hz | https://zenodo.org/record/5525342/files/thorsten-neutral_v03.tgz?download=1 |
German | de-DE | Thorsten Müller Neutral 22.10 dataset | 1 | 0 | 1 | 11.21 | 0.00 | 11.21 | 22,050 Hz | https://zenodo.org/record/7265581/files/ThorstenVoice-Dataset_2022.10.zip?download=1 |
German | de-DE | HUI-Audio-Corpus-German (clean) | 118 | n/a | n/a | 253.85 | 0.00 | 0.00 | 44,100 Hz | |
Spanish | es-AR | Crowdsourced high-quality Argentinian Spanish | 44 | 31 | 13 | 8.03 | 5.61 | 2.42 | 48,000 Hz | |
Spanish | es-CL | Crowdsourced high-quality Chilean Spanish | 31 | 13 | 18 | 7.15 | 2.84 | 4.31 | 48,000 Hz | |
Spanish | es-CO | Crowdsourced high-quality Colombian Spanish | 33 | 16 | 17 | 7.58 | 3.74 | 3.84 | 48,000 Hz | |
Spanish | es-PE | Crowdsourced high-quality Peruvian Spanish | 38 | 18 | 20 | 9.22 | 4.35 | 4.87 | 48,000 Hz | |
Spanish | es-PR | Crowdsourced high-quality Puerto Rico Spanish | 5 | 5 | 0 | 1.00 | 1.00 | 0.00 | 48,000 Hz | |
Spanish | es-VE | Crowdsourced high-quality Venezuelan Spanish | 23 | 11 | 12 | 4.81 | 2.41 | 2.40 | 48,000 Hz | |
Chinese | zh-CN | SFSpeech Chinese/English Bilingual Speech | 1 | 1 | 0 | 4.50 | 4.50 | 0.00 | 22,050 Hz | https://catalog.ngc.nvidia.com/orgs/nvidia/resources/sf_bilingual_speech_zh_en |
Corpus-Specific Data Preprocessing
NeMo implements model-agnostic data preprocessing scripts that wrap up the steps of downloading raw datasets, extracting files, normalizing raw texts where needed, and generating data manifest files. Most scripts can be reused for other datasets with only minor adaptations. Most TTS models work out of the box with the LJSpeech dataset, so the LJSpeech script is a straightforward starting point for your custom script. Some models require supplementary data for training and validation, such as speech/text alignment priors, pitch, speaker ID, emotion ID, or energy; for these you may need an extra supplementary data extraction step, run by calling scripts/dataset_processing/tts/extract_sup_data.py. The following subsections give detailed instructions for running the data preprocessing scripts.
LJSpeech
Dataset URL: https://keithito.com/LJ-Speech-Dataset/
Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/ljspeech/get_data.py
Command Line Instruction:
$ python scripts/dataset_processing/tts/ljspeech/get_data.py \
--data-root <your_local_dataset_root>
$ python scripts/dataset_processing/tts/extract_sup_data.py \
--config-path ljspeech/ds_conf \
--config-name ds_for_fastpitch_align.yaml \
manifest_filepath=<your_path_to_train_manifest> \
sup_data_path=<your_path_to_where_to_save_supplementary_data>
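For a custom dataset laid out like LJSpeech (a pipe-separated metadata.csv with file ID, raw transcript, and normalized transcript, plus a wavs/ directory), manifest entries could be derived roughly as follows. This is a sketch and not a copy of what NeMo's get_data.py does: the function name `ljspeech_style_entries` is hypothetical, every row is assumed to have exactly three fields, and durations are read with the standard `wave` module, which handles PCM WAV only.

```python
import os
import wave


def ljspeech_style_entries(dataset_dir):
    """Yield manifest entries from <dataset_dir>/metadata.csv and <dataset_dir>/wavs/."""
    metadata_path = os.path.join(dataset_dir, "metadata.csv")
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            # Assumed row layout: file id | raw transcript | normalized transcript
            file_id, text, normalized_text = line.rstrip("\n").split("|", 2)
            audio_path = os.path.abspath(
                os.path.join(dataset_dir, "wavs", file_id + ".wav")
            )
            # Duration in seconds = number of frames / frames per second.
            with wave.open(audio_path, "rb") as wav:
                duration = wav.getnframes() / wav.getframerate()
            yield {
                "audio_filepath": audio_path,
                "text": text,
                "normalized_text": normalized_text,
                "duration": duration,
            }
```

Each yielded dict can then be serialized with `json.dumps` and written as one line of the manifest.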
LibriTTS
Dataset URL: https://www.openslr.org/60/
Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/libritts/get_data.py
Command Line Instruction:
$ python scripts/dataset_processing/tts/libritts/get_data.py \
--data-root <your_local_dataset_root> \
--data-sets "ALL" \
--num-workers 4
$ python scripts/dataset_processing/tts/extract_sup_data.py \
--config-path ljspeech/ds_conf \
--config-name ds_for_fastpitch_align.yaml \
manifest_filepath=<your_path_to_train_manifest> \
sup_data_path=<your_path_to_where_to_save_supplementary_data>
Note
The original sampling rate of LibriTTS is 24000 Hz; we reuse LJSpeech’s config to downsample it to 22050 Hz.
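As a worked example of what this downsampling means for the data, the ratio 22050/24000 reduces to 147/160, so every 160 input samples become 147 output samples. The helper below is a hypothetical illustration of that arithmetic only; real resamplers may differ by a sample at the boundaries.

```python
from fractions import Fraction


def resampled_length(num_samples, orig_rate, target_rate):
    """Return (exact rate ratio, approximate sample count after resampling)."""
    ratio = Fraction(target_rate, orig_rate)
    # Rounding to the nearest integer is an assumption of this sketch.
    return ratio, round(num_samples * ratio)


# One second of 24000 Hz audio becomes 22050 samples at the target rate.
ratio, length = resampled_length(24000, orig_rate=24000, target_rate=22050)
print(ratio, length)  # 147/160 22050
```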
HiFiTTS
The texts of this dataset have already been normalized, so no further text preprocessing is needed. A script that downloads the dataset and splits it into manifests is still needed.
Dataset URL: http://www.openslr.org/109/
Dataset Processing Script: TBD
Command Line Instruction: TBD
Thorsten Müller’s German Neutral-TTS Datasets
Thorsten Müller has released two German neutral datasets so far, 21.02 and 22.10. Version 22.10 was recorded with a better setup, including a recording chamber and a better microphone. We therefore advise training models on version 22.10, because its audio quality is better and its speech has a much more natural flow and a higher rate of characters per second. The two datasets are described below and defined in scripts/dataset_processing/tts/thorsten_neutral/get_data.py:THORSTEN_NEUTRAL.
# Thorsten Müller published two neutral voice datasets, 21.02 and 22.10.
THORSTEN_NEUTRAL = {
"21_02": {
"url": "https://zenodo.org/record/5525342/files/thorsten-neutral_v03.tgz?download=1",
"dir_name": "thorsten-de_v03",
"metadata": ["metadata.csv"],
},
"22_10": {
"url": "https://zenodo.org/record/7265581/files/ThorstenVoice-Dataset_2022.10.zip?download=1",
"dir_name": "ThorstenVoice-Dataset_2022.10",
"metadata": ["metadata_train.csv", "metadata_dev.csv", "metadata_test.csv"],
},
}
Thorsten Müller’s German Datasets repo: https://github.com/thorstenMueller/Thorsten-Voice
Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/thorsten_neutral/get_data.py
Command Line Instruction:
# Version 22.10
$ python scripts/dataset_processing/tts/thorsten_neutral/get_data.py \
--data-root <your_local_dataset_root> \
--manifests-root <your_local_manifest_root> \
--data-version "22_10" \
--val-size 100 \
--test-size 100 \
--seed-for-ds-split 100 \
--normalize-text
# Version 21.02
$ python scripts/dataset_processing/tts/thorsten_neutral/get_data.py \
--data-root <your_local_dataset_root> \
--manifests-root <your_local_manifest_root> \
--data-version "21_02" \
--val-size 100 \
--test-size 100 \
--seed-for-ds-split 100 \
--normalize-text
# extract pitch and compute pitch normalization params for each version.
$ python scripts/dataset_processing/tts/extract_sup_data.py \
--config-path thorsten_neutral/ds_conf \
--config-name ds_for_fastpitch_align.yaml \
manifest_filepath=<your_path_to_train_manifest> \
sup_data_path=<your_path_to_where_to_save_supplementary_data>
HUI Audio Corpus German
Dataset URL: https://opendata.iisys.de/datasets.html
Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/hui_acg/get_data.py
Command Line Instruction:
$ python scripts/dataset_processing/tts/hui_acg/get_data.py \
--data-root <your_local_dataset_root> \
--manifests-root <your_local_manifest_root> \
--set-type clean \
--min-duration 0.1 \
--max-duration 15 \
--val-num-utts-per-speaker 1 \
--test-num-utts-per-speaker 1 \
--seed-for-ds-split 100
$ python scripts/dataset_processing/tts/hui_acg/phonemizer.py \
--json-manifests <your_path_to_train_manifest> <your_path_to_val_manifest> <your_path_to_test_manifest> \
--preserve-punctuation
$ python scripts/dataset_processing/tts/extract_sup_data.py \
--config-path hui_acg/ds_conf \
--config-name ds_for_fastpitch_align.yaml \
manifest_filepath=<your_path_to_train_manifest> \
sup_data_path=<your_path_to_where_to_save_supplementary_data>
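The duration bounds and per-speaker hold-out options passed to get_data.py above can be pictured with the sketch below. It is a plausible reading of those flags rather than the script's actual implementation: the function names are hypothetical, the bounds are assumed to be inclusive, and each entry is assumed to carry integer "speaker" and numeric "duration" fields.

```python
import random
from collections import defaultdict


def filter_by_duration(entries, min_duration=0.1, max_duration=15.0):
    """Keep entries whose duration lies within [min_duration, max_duration] seconds."""
    return [e for e in entries if min_duration <= e["duration"] <= max_duration]


def hold_out_per_speaker(entries, num_utts_per_speaker=1, seed=100):
    """Move a fixed number of randomly chosen utterances per speaker to a held-out set."""
    by_speaker = defaultdict(list)
    for entry in entries:
        by_speaker[entry["speaker"]].append(entry)
    rng = random.Random(seed)
    kept, held_out = [], []
    # Iterate speakers in sorted order so the result is reproducible for a given seed.
    for speaker in sorted(by_speaker):
        utterances = list(by_speaker[speaker])
        rng.shuffle(utterances)
        held_out.extend(utterances[:num_utts_per_speaker])
        kept.extend(utterances[num_utts_per_speaker:])
    return kept, held_out
```

Holding out utterances per speaker, instead of sampling from the pooled data, ensures that every speaker of a multi-speaker corpus is represented in the held-out set.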
SFSpeech Chinese/English Bilingual Speech
Dataset URL: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/sf_bilingual_speech_zh_en
Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/sfbilingual/get_data.py
Command Line Instruction: please refer to Section 1 (NGC Registry CLI installation), Section 2 (Downloading SFSpeech Dataset), and Section 3 (Creating Data Manifests) of https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_ChineseTTS_Training.ipynb for details. The code block below briefly describes the steps.
# [prerequisite] Install and set up the 'ngc' CLI tool by following the documentation at https://docs.ngc.nvidia.com/cli/cmd.html
$ ngc registry resource download-version "nvidia/sf_bilingual_speech_zh_en:v1"
$ unzip sf_bilingual_speech_zh_en_vv1/SF_bilingual.zip -d <your_local_dataset_root>
$ python scripts/dataset_processing/tts/sfbilingual/get_data.py \
--data-root <your_local_dataset_root>/SF_bilingual \
--val-size 0.005 \
--test-size 0.01 \
--seed-for-ds-split 100
$ python scripts/dataset_processing/tts/extract_sup_data.py \
--config-path sfbilingual/ds_conf \
--config-name ds_for_fastpitch_align.yaml \
manifest_filepath=<your_path_to_train_manifest> \
sup_data_path=<your_path_to_where_to_save_supplementary_data>