Overview

Speech synthesis, or text-to-speech (TTS), is the artificial production of human speech. Its main use is to convert written text into spoken audio automatically. TAO Toolkit supports a two-stage pipeline for TTS:

A spectrogram model to generate a Mel spectrogram from text
A vocoder model to generate audio from a Mel spectrogram
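
To make the data flow between the two stages concrete, the following is a minimal sketch in plain Python. It is not the TAO Toolkit API: the function names (toy_spectrogram_model, toy_vocoder, synthesize) and the constants (80 mel bands, 5 frames per character, 256 audio samples per frame) are assumptions chosen purely for illustration, and the "models" are trivial stand-ins that only reproduce the shapes of the intermediate data, not real speech.

```python
from typing import List

# All values below are illustrative assumptions, not TAO Toolkit defaults.
N_MELS = 80          # mel bands per spectrogram frame
FRAMES_PER_CHAR = 5  # toy "duration": frames emitted per input character
HOP_LENGTH = 256     # audio samples produced per spectrogram frame


def toy_spectrogram_model(text: str) -> List[List[float]]:
    """Stage 1 stand-in: map text to a mel spectrogram (frames x N_MELS)."""
    frames: List[List[float]] = []
    for ch in text:
        # A real model predicts spectral energy; here each character
        # is mapped to a constant level in [0, 1).
        level = (ord(ch) % 32) / 32.0
        frames.extend([level] * N_MELS for _ in range(FRAMES_PER_CHAR))
    return frames


def toy_vocoder(mel: List[List[float]]) -> List[float]:
    """Stage 2 stand-in: map a mel spectrogram to waveform samples."""
    audio: List[float] = []
    for frame in mel:
        # A real vocoder synthesizes a waveform; here each frame is
        # expanded into HOP_LENGTH samples at the frame's mean level.
        level = sum(frame) / len(frame)
        audio.extend([level] * HOP_LENGTH)
    return audio


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> mel spectrogram -> audio."""
    return toy_vocoder(toy_spectrogram_model(text))


mel = toy_spectrogram_model("hi")
audio = toy_vocoder(mel)
print(len(mel), len(mel[0]), len(audio))
```

The point of the split is that the mel spectrogram is the only interface between the stages, so either stand-in could be replaced by a trained model without changing the other, as long as the frame layout stays the same.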