Text-to-Speech (TTS)

Text-to-Speech (TTS) synthesis converts text input into human speech that is expected to sound both intelligible and natural. With the resurgence of deep neural networks, TTS research has made rapid progress. NeMo's implementation focuses on state-of-the-art neural TTS and covers both cascaded and end-to-end (upcoming) systems:

  1. Cascaded TTS follows a three-stage process. The text analysis stage converts grapheme inputs into phonemes, either by looking them up in a canonical dictionary or by applying grapheme-to-phoneme (G2P) conversion. The acoustic modeling stage generates acoustic features from phoneme inputs or from a mixture of graphemes and phonemes. NeMo uses mel-spectrograms as its acoustic features, so this documentation uses the terms mel-spectrogram generator and acoustic model interchangeably. The vocoder stage synthesizes audio waveforms from the acoustic features.

  2. End-to-End TTS instead integrates the three stages above into a single model, so that it synthesizes audio directly from grapheme or phoneme inputs without any intermediate processing.
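The data flow through the three cascaded stages can be sketched with toy stand-ins. This is only an illustration under stated assumptions, not NeMo code: the lexicon entries, the constants, and the function names (text_analysis, acoustic_model, vocoder, synthesize) are all hypothetical, and the "models" are simple deterministic functions rather than neural networks.

```python
from typing import Dict, List

# Hypothetical toy lexicon with ARPABET-like phoneme symbols.
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

N_MELS = 4            # toy number of mel bands (assumption for illustration)
FRAMES_PER_TOKEN = 2  # toy fixed duration: spectrogram frames per input token
HOP_LENGTH = 3        # toy number of waveform samples produced per frame


def text_analysis(text: str) -> List[str]:
    """Stage 1: map each word to phonemes via dictionary lookup.

    Words missing from the lexicon fall back to their graphemes, which
    produces a mixed grapheme/phoneme token sequence.
    """
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(LEXICON.get(word, list(word)))
    return tokens


def acoustic_model(tokens: List[str]) -> List[List[float]]:
    """Stage 2: stand-in for a mel-spectrogram generator.

    Emits FRAMES_PER_TOKEN frames per token, each holding N_MELS values
    derived deterministically from the token's characters.
    """
    frames: List[List[float]] = []
    for token in tokens:
        level = (sum(ord(c) for c in token) % 10) / 10.0
        for _ in range(FRAMES_PER_TOKEN):
            frames.append([level] * N_MELS)
    return frames


def vocoder(mel: List[List[float]]) -> List[float]:
    """Stage 3: stand-in for a vocoder; emits HOP_LENGTH samples per frame."""
    audio: List[float] = []
    for frame in mel:
        mean = sum(frame) / len(frame)
        audio.extend([mean] * HOP_LENGTH)
    return audio


def synthesize(text: str) -> List[float]:
    """Chain the three stages: text -> tokens -> mel frames -> waveform."""
    return vocoder(acoustic_model(text_analysis(text)))
```

In this sketch, an end-to-end system would correspond to replacing the body of synthesize with a single model that maps text to a waveform, with no intermediate tokens or spectrogram exposed.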

The following sections describe these systems in detail.

Resources and Documentation

Hands-on TTS tutorial notebooks can be found under the TTS tutorials folder. If you are new to NeMo, consider starting with the NeMo Primer and NeMo Model tutorials. If you are also new to TTS, consider trying the NeMo TTS Primer Tutorial. These tutorials can be run on Google Colab by providing Colab with the link to each notebook's GitHub page.

If you are looking for information about a particular TTS model, or would like to learn more about the model architectures available in nemo.collections.tts, refer to the Models section.

NeMo includes preprocessing scripts for several common TTS datasets. The Data Preprocessing section contains instructions on how to run those scripts. You can also create your own NeMo-compatible dataset preprocessing script by following the guidance there.

Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list of the checkpoints available on NGC, can be found in the Checkpoints section.

Documentation on the configuration files specific to the NeMo TTS models can be found in the Configuration Files section.