Speech synthesis, or text-to-speech (TTS), is the artificial production of human speech. Its main use is to convert written text into spoken audio automatically. TAO Toolkit supports a two-stage pipeline for TTS:

  1. A spectrogram model to generate a Mel spectrogram from text

  2. A vocoder model to generate audio from a Mel spectrogram
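
The two stages above can be pictured as two functions composed end to end: the first maps text to a mel spectrogram (a sequence of frames, each holding one value per mel band), and the second maps that spectrogram to audio samples. The sketch below only illustrates this data flow. It is not TAO Toolkit code: the function names, the constants (80 mel bands, 256 samples per frame, a 22050 Hz sample rate), and the toy logic inside each stage are illustrative assumptions standing in for trained models.

```python
import math
from typing import List

# All constants below are assumptions chosen for illustration only.
N_MELS = 80           # mel bands per spectrogram frame
FRAMES_PER_CHAR = 5   # toy "duration": frames emitted per input character
HOP_LENGTH = 256      # audio samples produced per spectrogram frame
SAMPLE_RATE = 22050   # samples per second


def spectrogram_model(text: str) -> List[List[float]]:
    """Stage 1 stand-in (hypothetical): text -> mel spectrogram [frames][N_MELS]."""
    mel = []
    for ch in text:
        band = ord(ch) % N_MELS  # toy mapping from character to one active band
        for _ in range(FRAMES_PER_CHAR):
            frame = [0.0] * N_MELS
            frame[band] = 1.0
            mel.append(frame)
    return mel


def vocoder_model(mel: List[List[float]]) -> List[float]:
    """Stage 2 stand-in (hypothetical): mel spectrogram -> waveform samples."""
    audio = []
    for frame in mel:
        band = frame.index(max(frame))  # strongest band in this frame
        freq = 100.0 + 50.0 * band      # toy band-to-frequency mapping
        for n in range(HOP_LENGTH):
            audio.append(math.sin(2.0 * math.pi * freq * n / SAMPLE_RATE))
    return audio


def synthesize(text: str) -> List[float]:
    """Run both stages in order: text -> mel spectrogram -> audio."""
    return vocoder_model(spectrogram_model(text))


if __name__ == "__main__":
    mel = spectrogram_model("hi")
    audio = synthesize("hi")
    print(len(mel), len(mel[0]), len(audio))
```

Because the mel spectrogram is the only thing passed between the stages, either stand-in could be swapped for a different model without changing the other, provided both agree on the spectrogram's shape and frame rate.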