Data Preprocessing#

NeMo TTS recipes support most public TTS datasets, spanning multiple languages, emotions, and speakers. Current recipes cover English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (work in progress), and support for many other languages is planned. NeMo provides corpus-specific data preprocessing scripts, located in scripts/dataset_processing/tts/, to convert common public TTS datasets into the format expected by the dataloaders defined in nemo/collections/tts/torch/data.py. The nemo_tts collection expects each dataset to consist of a set of utterances in individual audio files plus a JSON manifest that describes the dataset, with the information for one utterance per line. The audio files can be in any format supported by Pydub, though we recommend WAV files because they are the default and have been the most thoroughly tested.

There should be one JSON manifest file per dataset that will be passed in; therefore, if you want separate training and validation datasets, you should also have separate manifests. Otherwise, validation data will be loaded alongside training data and vice versa. Each line of the manifest should be in the following format:

{
  "audio_filepath": "/path/to/audio.wav",
  "text": "the transcription of the utterance",
  "normalized_text": "the normalized transcription of the utterance",
  "speaker": 5,
  "duration": 15.4
}

where "audio_filepath" provides an absolute path to the .wav file corresponding to the utterance so that audio files can be located anywhere without the constraint of being organized in the same directory as the manifest itself; "text" contains the full transcript (either graphemes or phonemes or their mixer) for the utterance; "normalized_text" contains normalized "text" that helps to bypass the normalization steps but it is fully optional; "speaker" refers to the integer speaker ID; "duration" describes the duration of the utterance in seconds.

Each entry in the manifest (describing one audio file) should be bracketed by "{" and "}" and must be placed on a single line; the entry above is spread across multiple lines only for readability. The "key": value pairs should be separated by commas, as shown above. NeMo enforces no blank lines in the manifest, so the total number of lines equals the total number of audio files in the dataset.
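
As a quick check, here is a minimal sketch (plain Python, standard library only; the manifest file name is a placeholder) that verifies these constraints on an existing manifest:

import json

# Every line must be a complete JSON object with no blank lines in between,
# and each entry needs at least an audio path and a transcript.
with open("train_manifest.json", encoding="utf-8") as f:
    for line_num, line in enumerate(f, start=1):
        assert line.strip(), f"blank line at line {line_num}"
        entry = json.loads(line)  # fails if one entry spans multiple lines
        assert "audio_filepath" in entry, f"line {line_num} missing audio_filepath"
        assert "text" in entry or "normalized_text" in entry, f"line {line_num} missing transcript"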

Once there is a manifest that describes each audio file in the dataset, assign the JSON manifest file path in the experiment config file, for example, training_ds.manifest_filepath=<path/to/manifest.json>.
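
Since training and validation need separate manifests (see above), both paths can be supplied, for example as Hydra-style command-line overrides. The script name below is a placeholder, and validation_ds.manifest_filepath is assumed to mirror the training_ds key; check your model's experiment config for the exact names:

$ python <your_training_script>.py \
    training_ds.manifest_filepath=<path/to/train_manifest.json> \
    validation_ds.manifest_filepath=<path/to/val_manifest.json>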

Following the instructions below on how to run the corpus-specific scripts, you can get started either by directly processing those public datasets or by creating custom scripts to preprocess your own datasets.

Public TTS Datasets#

The table below summarizes the statistics for a collection of high-quality public datasets used by NeMo TTS. We recommend starting by customizing the script for a dataset whose sampling rate and number of speakers are close to those of your own dataset.

| Language | Locale | Dataset Name | #spk-total | #spk-F | #spk-M | #hours-total | #hours-F | #hours-M | Sampling Rate | URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| English | en-US | LJSpeech | 1 | 1 | 0 | 23.92 | 23.92 | 0.00 | 22,050Hz | https://keithito.com/LJ-Speech-Dataset/ |
| English | en-US | LibriTTS (clean) | 1230 | 592 | 638 | 262.62 | 133.97 | 128.65 | 24,000Hz | https://www.openslr.org/60/ |
| English | en-US | HiFiTTS | 10 | 6 | 4 | 291.60 | 158.30 | 133.30 | 44,100Hz | http://www.openslr.org/109/ |
| German | de-DE | Thorsten Müller (German Neutral-TTS dataset) | 1 | 0 | 1 | 22.96 | 0.00 | 22.96 | 22,050Hz | https://www.openslr.org/95/ |
| German | de-DE | HUI-Audio-Corpus-German (clean) | 118 | 0 | 0 | 253.85 | 0.00 | 0.00 | 44,100Hz | https://opendata.iisys.de/datasets.html |
| Spanish | es-AR | Crowdsourced high-quality Argentinian Spanish | 44 | 31 | 13 | 8.03 | 5.61 | 2.42 | 48,000Hz | https://www.openslr.org/61/ |
| Spanish | es-CL | Crowdsourced high-quality Chilean Spanish | 31 | 13 | 18 | 7.15 | 2.84 | 4.31 | 48,000Hz | https://www.openslr.org/71/ |
| Spanish | es-CO | Crowdsourced high-quality Colombian Spanish | 33 | 16 | 17 | 7.58 | 3.74 | 3.84 | 48,000Hz | https://www.openslr.org/72/ |
| Spanish | es-PE | Crowdsourced high-quality Peruvian Spanish | 38 | 18 | 20 | 9.22 | 4.35 | 4.87 | 48,000Hz | https://www.openslr.org/73/ |
| Spanish | es-PR | Crowdsourced high-quality Puerto Rico Spanish | 5 | 5 | 0 | 1.00 | 1.00 | 0.00 | 48,000Hz | https://www.openslr.org/74/ |
| Spanish | es-VE | Crowdsourced high-quality Venezuelan Spanish | 23 | 11 | 12 | 4.81 | 2.41 | 2.40 | 48,000Hz | https://www.openslr.org/75/ |

Corpus-Specific Data Preprocessing#

NeMo implements model-agnostic data preprocessing scripts that wrap the steps of downloading raw datasets, extracting files, normalizing raw texts, and generating data manifest files. Most scripts can be reused for other datasets with only minor adaptations. Most TTS models work out of the box with the LJSpeech dataset, so the LJSpeech script is a straightforward starting point for adapting your own. For models that require supplementary data for training and validation, such as speech/text alignment priors, pitch, speaker IDs, emotion IDs, energy, etc., you may need an extra supplementary-data extraction step by calling scripts/dataset_processing/tts/extract_sup_data.py. The following subsections give detailed instructions for running the data preprocessing scripts; a sketch of a custom preprocessing script is provided at the end of this section.

LJSpeech#

$ python scripts/dataset_processing/tts/ljspeech/get_data.py \
    --data-root <your_local_dataset_root> \
    --whitelist-path <your_local_whitelist_filepath>

If --whitelist-path is omitted, it defaults to nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv.

$ python scripts/dataset_processing/tts/extract_sup_data.py \
    --config-path ljspeech/ds_conf \
    --config-name ds_for_fastpitch_align.yaml \
    manifest_filepath=<your_path_to_train_manifest> \
    sup_data_path=<your_path_to_where_to_save_supplementary_data>
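
Once extraction finishes, training can point at the same locations. The override keys below are illustrative (training_ds.manifest_filepath echoes the earlier config example, and sup_data_path names the directory populated above); verify them against your model's experiment config:

$ python <your_training_script>.py \
    training_ds.manifest_filepath=<your_path_to_train_manifest> \
    sup_data_path=<your_path_to_where_to_save_supplementary_data>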

LibriTTS#

$ python scripts/dataset_processing/tts/libritts/get_data.py \
    --data-root <your_local_dataset_root> \
    --data-sets dev_clean

$ python scripts/dataset_processing/tts/extract_sup_data.py \
    --config-path ljspeech/ds_conf \
    --config-name ds_for_fastpitch_align.yaml \
    manifest_filepath=<your_path_to_train_manifest> \
    sup_data_path=<your_path_to_where_to_save_supplementary_data>

HiFiTTS#

The text of this dataset has already been normalized, so no extra text preprocessing is needed. But we still need to download the data and split it into manifests, as sketched below.
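
A download invocation would presumably mirror the other get_data.py scripts in scripts/dataset_processing/tts/; the script path and flag below are assumptions, so check the script's --help for the actual interface:

$ python scripts/dataset_processing/tts/hifitts/get_data.py \
    --data-root <your_local_dataset_root>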

Thorsten Müller (German Neutral-TTS dataset)#

$ python scripts/dataset_processing/tts/openslr/get_data.py \
    --data-root <your_local_dataset_root> \
    --val-size 0.1 \
    --test-size 0.2 \
    --seed-for-ds-split 100

$ python scripts/dataset_processing/tts/extract_sup_data.py \
    --config-path openslr/ds_conf \
    --config-name ds_for_fastpitch_align.yaml \
    manifest_filepath=<your_path_to_train_manifest> \
    sup_data_path=<your_path_to_where_to_save_supplementary_data>

HUI Audio Corpus German#

$ python scripts/dataset_processing/tts/hui_acg/get_data.py \
    --data-root <your_local_dataset_root> \
    --set-type clean \
    --min-duration 0.1 \
    --max-duration 15 \
    --val-num-utts-per-speaker 1 \
    --test-num-utts-per-speaker 1 \
    --seed-for-ds-split 100

$ python scripts/dataset_processing/tts/hui_acg/phonemizer.py \
    --json-manifests <your_path_to_train_manifest> <your_path_to_val_manifest> <your_path_to_test_manifest> \
    --preserve-punctuation

$ python scripts/dataset_processing/tts/extract_sup_data.py \
    --config-path hui_acg/ds_conf \
    --config-name ds_for_fastpitch_align.yaml \
    manifest_filepath=<your_path_to_train_manifest> \
    sup_data_path=<your_path_to_where_to_save_supplementary_data>
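
Custom Datasets

As promised above, here is a minimal sketch of a custom preprocessing script that produces NeMo-style manifests. The directory layout, the metadata.csv format, the 90/10 split, and the soundfile dependency are all assumptions about a hypothetical dataset, not NeMo requirements:

import json
import random
from pathlib import Path

import soundfile as sf  # NeMo already depends on soundfile

# Assumed layout: <data_root>/wavs/*.wav plus a metadata.csv of
# "<wav_stem>|<transcript>" lines, similar to LJSpeech.
data_root = Path("/path/to/MyCustomDataset")

entries = []
for line in (data_root / "metadata.csv").read_text(encoding="utf-8").splitlines():
    stem, text = line.split("|", maxsplit=1)
    wav_path = (data_root / "wavs" / f"{stem}.wav").resolve()
    entries.append({
        "audio_filepath": str(wav_path),  # absolute path, as the manifest format expects
        "text": text,
        "duration": round(sf.info(str(wav_path)).duration, 2),
    })

# Shuffle deterministically, split 90/10, and write one JSON object per line.
random.Random(100).shuffle(entries)
num_val = max(1, int(0.1 * len(entries)))
for name, subset in (("val", entries[:num_val]), ("train", entries[num_val:])):
    with open(data_root / f"{name}_manifest.json", "w", encoding="utf-8") as f:
        for entry in subset:
            f.write(json.dumps(entry) + "\n")

From here, the resulting manifests can be fed to extract_sup_data.py and to training exactly as in the recipes above.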