TTS Overview#

The Riva text-to-speech (TTS) service is built on a two-stage pipeline. The first model generates a mel-spectrogram from the input text, and the second model generates speech from that mel-spectrogram. Together, the two models form a TTS system that synthesizes natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
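The difference in time to first audio between the two modes can be illustrated with a small simulation. This is only a sketch: `fake_tts_chunks` and `CountingSource` are hypothetical stand-ins for a TTS backend and do not call Riva. The sketch counts how many chunks must be generated before the caller receives any audio.

```python
from typing import Iterator, Tuple


def fake_tts_chunks(text: str, words_per_chunk: int = 4) -> Iterator[bytes]:
    """Hypothetical TTS backend: yields one 'audio' chunk per group of words."""
    words = text.split()
    for start in range(0, len(words), words_per_chunk):
        group = words[start:start + words_per_chunk]
        # Placeholder payload; a real service would yield encoded audio bytes.
        yield " ".join(group).encode("utf-8")


class CountingSource:
    """Wraps the fake backend and counts how many chunks have been generated."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.generated = 0

    def chunks(self) -> Iterator[bytes]:
        for chunk in fake_tts_chunks(self.text):
            self.generated += 1
            yield chunk


def first_audio_batch(source: CountingSource) -> Tuple[bytes, int]:
    """Batch mode: nothing is returned until every chunk exists."""
    audio = b"".join(source.chunks())
    return audio, source.generated


def first_audio_streaming(source: CountingSource) -> Tuple[bytes, int]:
    """Streaming mode: the first chunk is available as soon as it is generated."""
    first = next(source.chunks(), b"")
    return first, source.generated


text = "one two three four five six seven eight nine ten eleven twelve"
_, batch_cost = first_audio_batch(CountingSource(text))
_, stream_cost = first_audio_streaming(CountingSource(text))
print(f"chunks generated before first audio: batch={batch_cost}, streaming={stream_cost}")
```

With the Riva Python client (the `nvidia-riva-client` package, assumed here), the streaming call returns an iterator of responses that could take the place of `fake_tts_chunks` in a loop like this one.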

Pretrained TTS Models#

| Language | Model | Dataset | G2P | Gender | Voices |
|---|---|---|---|---|---|
| en-US | FastPitch HiFi-GAN | English-US | IPA | Multi-speaker | English-US.Female-1, English-US.Male-1, English-US.Female-Calm, English-US.Female-Neutral, English-US.Female-Happy, English-US.Female-Angry, English-US.Female-Fearful, English-US.Female-Sad, English-US.Male-Calm, English-US.Male-Neutral, English-US.Male-Happy, English-US.Male-Angry |
| en-US | Rad-TTS HiFi-GAN | English-US | IPA | Multi-speaker | English-US-RadTTS.Female-1, English-US-RadTTS.Male-1, English-US-RadTTS.Female-Calm, English-US-RadTTS.Female-Neutral, English-US-RadTTS.Female-Happy, English-US-RadTTS.Female-Angry, English-US-RadTTS.Female-Fearful, English-US-RadTTS.Female-Sad, English-US-RadTTS.Male-Calm, English-US-RadTTS.Male-Neutral, English-US-RadTTS.Male-Happy, English-US-RadTTS.Male-Angry |
| en-US | FastPitch HiFi-GAN | LJSpeech | ARPABET | | ljspeech |
| en-US | FastPitch HiFi-GAN (Deprecated) | English-US | ARPABET | Multi-speaker | English-US.Female-1, English-US.Male-1 |
| zh-CN | FastPitch HiFi-GAN | Mandarin-CN | IPA | Multi-speaker | Mandarin-CN.Female-1, Mandarin-CN.Male-1, Mandarin-CN.Female-Calm, Mandarin-CN.Female-Neutral, Mandarin-CN.Male-Happy, Mandarin-CN.Male-Fearful, Mandarin-CN.Male-Sad, Mandarin-CN.Male-Calm, Mandarin-CN.Male-Neutral, Mandarin-CN.Male-Angry |
| es-ES | FastPitch HiFi-GAN | Public/Proprietary | IPA | Female | Spanish-ES-Female-1 |
| es-ES | FastPitch HiFi-GAN | Public/Proprietary | IPA | Male | Spanish-ES-Male-1 |
| es-US | FastPitch HiFi-GAN | Public/Proprietary | IPA | Multi-speaker | Spanish-US.Female-1, Spanish-US.Male-1, Spanish-US.Female-Calm, Spanish-US.Male-Calm, Spanish-US.Female-Narrator, Spanish-US.Male-Narrator, Spanish-US.Female-Angry, Spanish-US.Male-Angry, Spanish-US.Female-Neutral, Spanish-US.Male-Neutral, Spanish-US.Female-Sad, Spanish-US.Male-Happy |
| it-IT | FastPitch HiFi-GAN | Public/Proprietary | IPA | Female | Italian-IT-Female-1 |
| it-IT | FastPitch HiFi-GAN | Public/Proprietary | IPA | Male | Italian-IT-Male-1 |
| de-DE | FastPitch HiFi-GAN | Public/Proprietary | IPA | Male | German-DE-Male-1 |

Language Support#

Riva Speech AI Skills provides pretrained models across a variety of languages. Upgraded models and new languages are released regularly.

The currently supported languages are English (en-US), Mandarin (zh-CN), Spanish (es-US), Spanish (es-ES), Italian (it-IT), and German (de-DE).

To select which language to deploy, set the tts_language_code variable in the config.sh file in the quickstart directory of the Quick Start scripts.
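If you script your deployments, the edit to config.sh can be automated. The following is a minimal sketch, not part of the Quick Start scripts: `set_tts_language` is a hypothetical helper, and it assumes that config.sh contains a single-line assignment beginning with `tts_language_code=` followed by a language code (optionally wrapped in parentheses or quotes). It keeps whatever quoting the file already uses and only swaps the code.

```python
import re

# Language codes listed as supported in this section.
SUPPORTED_TTS_LANGUAGES = ("en-US", "zh-CN", "es-US", "es-ES", "it-IT", "de-DE")

# Assumed shape of the assignment: tts_language_code=en-US, ="en-US", or =("en-US").
_ASSIGNMENT = re.compile(
    r'^([ \t]*tts_language_code=\(?["\']?)[A-Za-z]{2}-[A-Za-z]{2}',
    re.MULTILINE,
)


def set_tts_language(config_text: str, language_code: str) -> str:
    """Return config_text with the tts_language_code value replaced."""
    if language_code not in SUPPORTED_TTS_LANGUAGES:
        raise ValueError(f"unsupported TTS language: {language_code}")
    updated, count = _ASSIGNMENT.subn(
        lambda match: match.group(1) + language_code, config_text, count=1
    )
    if count == 0:
        raise ValueError("no tts_language_code assignment found")
    return updated


# Example on an in-memory string; the file contents here are illustrative only.
example = 'asr_language_code=("en-US")\ntts_language_code=("en-US")\n'
print(set_tts_language(example, "es-US"))
```

To apply the change to a real file, read config.sh into a string, pass it through the helper, and write the result back before running the Quick Start scripts.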

Checking Deployed Models#

Once a server is running, you can retrieve the available models with the GetRivaSynthesisConfig RPC. For each model that is available for inference requests, the RPC returns the parameters that were used when the model was deployed.
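As a sketch of how this RPC might be called from Python, the example below assumes the `nvidia-riva-client` package and a server at `localhost:50051`. The parameter keys `language_code` and `voice_name` are also assumptions about what a deployment reports, so the helper falls back to an empty string when a key is missing. `summarize_models` and `fetch_deployed_models` are illustrative names, not part of the Riva API.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, List


def summarize_models(model_configs: Iterable) -> List[Dict[str, str]]:
    """Collect the model name, language, and voice from each config entry."""
    summary = []
    for cfg in model_configs:
        params = dict(cfg.parameters)  # works for a protobuf map or a plain dict
        summary.append(
            {
                "model_name": cfg.model_name,
                # Assumed keys; an empty string is used when a key is absent.
                "language_code": params.get("language_code", ""),
                "voice_name": params.get("voice_name", ""),
            }
        )
    return summary


def fetch_deployed_models(uri: str = "localhost:50051") -> List[Dict[str, str]]:
    """Query a running server (assumed address) for its deployed TTS models."""
    # Imported lazily so summarize_models works without the client installed.
    import riva.client
    from riva.client.proto import riva_tts_pb2

    auth = riva.client.Auth(uri=uri)
    service = riva.client.SpeechSynthesisService(auth)
    response = service.stub.GetRivaSynthesisConfig(
        riva_tts_pb2.RivaSynthesisConfigRequest()
    )
    return summarize_models(response.model_config)


# Offline demonstration with stand-in objects shaped like the response entries.
fake_entries = [
    SimpleNamespace(
        model_name="example-tts-model",  # hypothetical model name
        parameters={"language_code": "en-US", "voice_name": "English-US.Female-1"},
    )
]
print(summarize_models(fake_entries))
```

Calling `fetch_deployed_models()` requires a reachable server; the offline demonstration at the end runs without one.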

Output Audio Encoding#

Besides the default Pulse-Code Modulation (PCM) output stream, you can request an Opus-encoded, compressed stream. Compression can significantly reduce the required network bandwidth.
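To get a rough sense of the bandwidth savings, the arithmetic can be sketched as follows. The figures are illustrative assumptions, not Riva defaults or measurements: 16-bit mono PCM at 44100 Hz and an Opus stream at 64 kbit/s.

```python
def pcm_bytes_per_second(
    sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1
) -> int:
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample // 8 * channels


def bandwidth_ratio(sample_rate_hz: int, opus_bitrate_bps: int) -> float:
    """How many times more bandwidth raw PCM needs than the Opus stream."""
    pcm_bits_per_second = pcm_bytes_per_second(sample_rate_hz) * 8
    return pcm_bits_per_second / opus_bitrate_bps


# Assumed values for illustration only.
SAMPLE_RATE_HZ = 44100
OPUS_BITRATE_BPS = 64_000

print(f"PCM: {pcm_bytes_per_second(SAMPLE_RATE_HZ)} bytes/s")
print(f"PCM needs {bandwidth_ratio(SAMPLE_RATE_HZ, OPUS_BITRATE_BPS):.1f}x the Opus bandwidth")
```

The actual savings depend on the sample rate you request and the bitrate the encoder uses, so treat the ratio as an order-of-magnitude estimate.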