Speech Synthesis

Mel Spectrogram Generators

FastPitch

Recommended

FastPitch is a non-autoregressive, transformer-based spectrogram generator that predicts duration and pitch, introduced in the paper FastPitch: Parallel Text-to-Speech with Pitch Prediction. It is the recommended fully parallel TTS model; it is based on FastSpeech and conditioned on fundamental frequency (F0) contours. During inference the model predicts pitch contours, and modifying those predicted contours gives further control over the generated speech. FastPitch can therefore change the perceived emotional state of the speaker or put emphasis on certain lexical units.
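The kind of control that predicted pitch contours allow can be illustrated with a minimal sketch. It assumes the contour is available as one F0 value in Hz per input symbol, with 0.0 marking unvoiced symbols, plus one integer duration (in spectrogram frames) per symbol. The function names `transform_pitch` and `expand_to_frames` and the sample values are hypothetical and are not part of any FastPitch or Riva API; the sketch only shows the arithmetic of shifting, flattening, and upsampling a contour.

```python
from statistics import mean


def transform_pitch(pitch_hz, shift_semitones=0.0, range_scale=1.0):
    """Shift and widen or flatten a per-symbol pitch contour (hypothetical helper).

    pitch_hz: one F0 value in Hz per input symbol; 0.0 marks unvoiced symbols.
    shift_semitones: transposes the voiced values up (+) or down (-).
    range_scale: 0.0 flattens the contour to its mean, 1.0 keeps it,
                 values above 1.0 exaggerate the pitch movement.
    """
    voiced = [p for p in pitch_hz if p > 0.0]
    if not voiced:
        return list(pitch_hz)
    center = mean(voiced)
    ratio = 2.0 ** (shift_semitones / 12.0)  # 12 semitones double the frequency
    out = []
    for p in pitch_hz:
        if p <= 0.0:
            out.append(0.0)  # leave unvoiced symbols untouched
        else:
            out.append((center + (p - center) * range_scale) * ratio)
    return out


def expand_to_frames(values, durations):
    """Repeat each per-symbol value for its predicted number of frames."""
    frames = []
    for value, duration in zip(values, durations):
        frames.extend([value] * duration)
    return frames


# Illustrative values only, not real model output.
predicted_pitch = [0.0, 180.0, 220.0, 200.0]
predicted_durations = [2, 3, 4, 1]

flat = transform_pitch(predicted_pitch, range_scale=0.0)         # monotone delivery
raised = transform_pitch(predicted_pitch, shift_semitones=12.0)  # one octave up
frame_pitch = expand_to_frames(raised, predicted_durations)
```

Flattening the contour tends to produce a monotone delivery, while scaling the range above 1.0 exaggerates pitch movement, which is one way emphasis might be placed on a particular word.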

Vocoders

HiFi-GAN

Recommended

HiFi-GAN is a GAN-based vocoder introduced in the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. It is the recommended vocoder architecture, achieving both efficient and high-fidelity speech synthesis.
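As a rough illustration of how a vocoder of this kind relates spectrogram length to audio length, the sketch below computes the waveform size produced from a given number of mel frames. The upsampling factors (8, 8, 2, 2) and the 22050 Hz sample rate are assumptions taken from the V1 configuration of the reference HiFi-GAN implementation; a deployed model may use different values. The function names are hypothetical.

```python
from math import prod

# Assumption: V1 configuration of the reference HiFi-GAN implementation.
UPSAMPLE_RATES = (8, 8, 2, 2)
SAMPLE_RATE_HZ = 22050


def samples_per_frame(upsample_rates=UPSAMPLE_RATES):
    """Audio samples generated per mel frame (product of the upsampling factors)."""
    return prod(upsample_rates)


def waveform_length(num_mel_frames, upsample_rates=UPSAMPLE_RATES):
    """Number of audio samples produced for a mel spectrogram of the given length."""
    return num_mel_frames * samples_per_frame(upsample_rates)


def duration_seconds(num_mel_frames, sample_rate_hz=SAMPLE_RATE_HZ,
                     upsample_rates=UPSAMPLE_RATES):
    """Playback duration of the generated waveform in seconds."""
    return waveform_length(num_mel_frames, upsample_rates) / sample_rate_hz


hop = samples_per_frame()        # samples per mel frame
n_samples = waveform_length(100)  # samples for a 100-frame spectrogram
seconds = duration_seconds(100)   # a little over one second at 22050 Hz
```

Under these assumptions the total upsampling factor must match the hop length used when the mel spectrograms were computed, so the spectrogram generator and the vocoder need to agree on that setting.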