TTS Overview#
The text-to-speech (TTS) pipeline implemented for the Riva TTS service is a two-stage pipeline. Riva models like FastPitch and RadTTS++ first generate a mel-spectrogram and then generate speech using the HifiGAN model, while MagpieTTS Multilingual first generates tokens and then generates speech using the Audio Codec model. This pipeline forms a TTS system that enables you to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.

Riva TTS supports both streaming and offline inference modes. In offline mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. When making a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests.
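The following sketch shows both modes with the Riva Python client (`nvidia-riva-client`). The server address and voice name are assumptions; substitute the values for your deployment.

```python
import riva.client

# Connect to a running Riva server (address is an assumption; adjust as needed).
auth = riva.client.Auth(uri="localhost:50051")
tts = riva.client.SpeechSynthesisService(auth)

text = "Riva text-to-speech turns raw transcripts into natural sounding speech."

# Offline mode: a single response containing the complete audio (raw LINEAR_PCM by default).
resp = tts.synthesize(
    text=text,
    voice_name="English-US.Female-1",  # assumed voice name; use one deployed on your server
    language_code="en-US",
    sample_rate_hz=44100,
)
with open("offline.pcm", "wb") as f:
    f.write(resp.audio)

# Streaming mode: audio chunks arrive as soon as they are generated.
with open("streaming.pcm", "wb") as f:
    for chunk in tts.synthesize_online(
        text=text,
        voice_name="English-US.Female-1",
        language_code="en-US",
        sample_rate_hz=44100,
    ):
        f.write(chunk.audio)
```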
Try It Out#
Experience Riva TTS on our demo platform: https://build.nvidia.com/explore/speech
Pretrained TTS Models#
The .riva models used to generate the RMIRs in the Quick Start scripts can be found on NGC. The following table lists the supported languages, the dataset each model was trained on, the grapheme-to-phoneme (G2P) representation, and the voice gender.
| Language | Dataset | G2P | Gender |
|---|---|---|---|
| English, French, Spanish (en-US, fr-FR, es-US) | Multilingual | IPA | Multi-speaker |
| English (en-US) | English-US | IPA | Multi-speaker |
| English (en-US) | English-US | IPA | Multi-speaker |
| English (en-US) | LJSpeech | ARPABET | Female |
| English (en-US) | English-US | ARPABET | Multi-speaker |
| Mandarin (zh-CN) | Mandarin-CN | IPA | Multi-speaker |
| Spanish (es-ES) | Public/Proprietary | IPA | Female |
| Spanish (es-ES) | Public/Proprietary | IPA | Male |
| Spanish-US (es-US) | Public/Proprietary | IPA | Multi-speaker |
| Italian (it-IT) | Public/Proprietary | IPA | Female |
| Italian (it-IT) | Public/Proprietary | IPA | Male |
| German (de-DE) | Public/Proprietary | IPA | Male |
Features#
Riva TTS supports the following features:
Streaming and offline inference modes
Output audio encoding
SSML input
Zero Shot TTS (Beta Feature)
Emotion mixing (Beta Feature)
Custom pronunciation dictionary
Language and Model Support#
Riva Speech AI Skills provides pretrained models across a variety of languages, listed in the section above. Upgraded models and new languages are released regularly.
To select which language and model to deploy, change the `tts_model` and `tts_language_code` variables in the `config.sh` file within the `quickstart` directory of the Quick Start scripts, as shown in the sketch below.
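For example, to deploy the German male voice, the relevant lines in `config.sh` might look like the following; the variable values shown are assumptions, so use the model and language code names documented for your Riva release.

```sh
# quickstart/config.sh (excerpt)
# Select the TTS model and language to deploy.
tts_model="fastpitch_hifigan"   # assumed value; see the release notes for valid model names
tts_language_code="de-DE"
```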
Zero Shot TTS (Beta Feature)#
Riva introduces Zero Shot TTS as a beta feature. This feature allows users to provide a short speech prompt; the model adapts to the voice in the prompt and synthesizes speech with that voice.
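A minimal sketch with the Riva Python client, assuming a client version whose `synthesize` call accepts an `audio_prompt_file` argument for the zero-shot model; the server address, voice name, and parameter name are assumptions, so check your client's signature.

```python
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed server address
tts = riva.client.SpeechSynthesisService(auth)

# Provide a short speech prompt whose voice the zero-shot model adapts to.
resp = tts.synthesize(
    text="This sentence is synthesized in the voice of the prompt.",
    voice_name="English-US.Female-1",        # assumed; use a zero-shot capable voice
    language_code="en-US",
    sample_rate_hz=44100,
    audio_prompt_file="my_voice_prompt.wav",  # assumed parameter name; a short .wav prompt
)
with open("zero_shot.pcm", "wb") as f:
    f.write(resp.audio)
```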
SSML input#
Riva TTS supports SSML input for both streaming and offline inference modes. SSML tags provide finer control over the generated speech: use the `phoneme` tag to specify the pronunciation of a given word, the `sub` tag to specify a substitution for a given word, and the `prosody` tag to specify prosody attributes such as `pitch`, `rate`, and `volume` for a given text. Refer to the TTS notebook for more details. The table below shows which models support each tag, followed by an example SSML input.
| Model | `prosody` | `phoneme` | `sub` |
|---|---|---|---|
| FastPitch | ✓ | ✓ | ✓ |
| RadTTS++ | ✓ | ✓ | ✓ |
| MagpieTTS Multilingual | ✓ | | |
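An illustrative SSML input using the three supported tags; the attribute values below are assumptions, and the valid ranges are listed in the SSML documentation.

```xml
<speak>
  <prosody rate="85%" volume="+1dB">
    The weather today is pleasant.
  </prosody>
  The word <phoneme alphabet="ipa" ph="təˈmeɪˌtoʊ">tomato</phoneme> can be pronounced differently,
  and <sub alias="World Wide Web">WWW</sub> is substituted before synthesis.
</speak>
```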
Emotion mixing (Beta feature)#
Riva now supports mixing emotional intensities as a beta feature, allowing users to control the emotion of the generated audio. The feature is accessible through the SSML `emotion` attribute. Quantized emotion intensities are currently supported for `calm`, `angry`, `fearful`, `neutral`, and `happy` for the female voice, and for `calm`, `happy`, and `neutral` for the male voice. Emotion mixing is currently supported only in the RadTTS++ model.
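As a sketch only, assuming the `emotion` attribute is carried on the `prosody` tag and takes an emotion name with a quantized intensity; the exact attribute syntax is defined in the Riva SSML documentation.

```xml
<speak>
  <prosody emotion="happy:high">
    I am delighted to hear the good news!
  </prosody>
</speak>
```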
Custom pronunciation dictionary#
Riva TTS supports providing a text dictionary to obtain the desired pronunciation for specific words synthesized by the server. Each entry in this custom dictionary is a word (grapheme) followed by the desired pronunciation (phoneme), separated by two spaces, with each word and pronunciation pair on its own line. The dictionary is passed in the `custom_dictionary` field when configuring a request from the client, for example from the Python client.
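For example, a dictionary file with two entries; the pronunciations below are illustrative only.

```
Riva  ɹˈiːvɑ
GPU  ˌdʒiːpiːˈjuː
```

A sketch of passing it with the Python client, assuming a client version whose `synthesize` call accepts a `custom_dictionary` argument; the parameter name and type are assumptions, so check your client's signature.

```python
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed server address
tts = riva.client.SpeechSynthesisService(auth)

# Read "grapheme  phoneme" pairs (two-space separated, one pair per line).
custom_dictionary = {}
with open("pronunciations.txt", encoding="utf-8") as f:
    for line in f:
        grapheme, _, phoneme = line.rstrip("\n").partition("  ")
        if phoneme:
            custom_dictionary[grapheme] = phoneme

resp = tts.synthesize(
    text="Riva runs on a GPU.",
    voice_name="English-US.Female-1",     # assumed voice name
    language_code="en-US",
    custom_dictionary=custom_dictionary,  # assumed parameter name
)
```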
Checking deployed models#
Once a server is running, you can retrieve the available models with the `GetRivaSynthesisConfig` RPC. For each model available for inference requests, the RPC returns the parameters used when the model was deployed.
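A minimal sketch that calls the RPC through the generated gRPC stub; the message and field names below are assumed from the `riva_tts.proto` definition, so verify them against your proto version.

```python
import grpc
from riva.client.proto import riva_tts_pb2, riva_tts_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")  # assumed server address
stub = riva_tts_pb2_grpc.RivaSpeechSynthesisStub(channel)

# Ask the server which TTS models are deployed and how they were configured.
response = stub.GetRivaSynthesisConfig(riva_tts_pb2.RivaSynthesisConfigRequest())
for model in response.model_config:  # assumed field names
    print(model.model_name)
    for key, value in model.parameters.items():
        print(f"  {key} = {value}")
```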
Output Audio Encoding#
Besides the default Pulse-Code Modulation (PCM) output stream, you can choose an Opus-encoded, compressed stream. Compression enables you to significantly reduce the required network bandwidth.
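A sketch requesting Opus output with the Python client, assuming the `OGGOPUS` value in the client's `AudioEncoding` enum; verify the enum name for your client version.

```python
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed server address
tts = riva.client.SpeechSynthesisService(auth)

# Request an Opus-encoded stream instead of the default LINEAR_PCM.
resp = tts.synthesize(
    text="This audio is compressed with Opus.",
    voice_name="English-US.Female-1",  # assumed voice name
    language_code="en-US",
    encoding=riva.client.AudioEncoding.OGGOPUS,
)
with open("output.opus", "wb") as f:
    f.write(resp.audio)
```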