Speech Synthesis#

Mel Spectrogram Generators#

FastPitch#

Recommended

A non-autoregressive transformer-based spectrogram generator that predicts duration and pitch from the FastPitch: Parallel Text-to-speech with Pitch Prediction paper. FastPitch is the recommended fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference and generates speech that can be further controlled with predicted contours. FastPitch can therefore change the perceived emotional state of the speaker or put emphasis on certain lexical units.

Tacotron 2#

Recommended

Caution

Tacotron 2 is not recommended for new deployments. FastPitch achieves higher quality, improved stability, and faster performance.

A modified Tacotron 2 model for mel-generation from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper. Tacotron 2 is a sequence-to-sequence model that generates mel-spectrograms from text and was originally designed to be used either with a mel-spectrogram inversion algorithm such as the Griffin-Limalgorithm or a neural decoder such as WaveNet.

Vocoders#

HiFi-GAN#

Recommended

A GAN-based vocoder from the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis paper. HiFi-GAN is the recommended model architecture that achieves both efficient and high-fidelity speech synthesis.

WaveGlow#

Recommended

Caution

WaveGlow is not recommended for new deployments. HiFi-GAN achieves higher quality and faster performance.

A flow-based vocoder from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper. Riva uses WaveGlow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. Unlike other neural vocoders, WaveGlow is not auto-regressive, which makes it more performant when running on GPUs.