TTS Overview#

The Riva text-to-speech (TTS) service is built on a two-stage pipeline. Riva models such as FastPitch and RadTTS++ first generate a mel-spectrogram and then synthesize speech from it using the HiFi-GAN model, while MagpieTTS Multilingual first generates tokens and then synthesizes speech using the Audio Codec model. Together, these stages form a TTS system that synthesizes natural-sounding speech from raw transcripts without any additional information, such as patterns or rhythms of speech.

Riva TTS Pipeline

Riva TTS supports both streaming and offline inference modes. In offline mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
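The latency difference between the two modes can be illustrated without a Riva server. The following is a minimal sketch, not Riva code: fake_synthesizer, offline, and measure are hypothetical names, and the chunk size and delays are arbitrary. It simulates a backend that produces audio chunk by chunk and compares time to first audio when the chunks are streamed versus buffered into one response.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_synthesizer(n_chunks: int, seconds_per_chunk: float) -> Iterator[bytes]:
    """Stand-in for a TTS backend (hypothetical): yields one chunk at a time."""
    for _ in range(n_chunks):
        time.sleep(seconds_per_chunk)  # pretend to generate audio
        yield b"\x00\x00" * 160  # 160 silent 16-bit samples


def offline(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Offline mode: buffer every chunk, then return the audio in one piece."""
    yield b"".join(chunks)


def measure(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time to first audio, total time, total bytes) for a stream."""
    start = time.perf_counter()
    first = None
    total_bytes = 0
    for chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, total_bytes


streaming_stats = measure(fake_synthesizer(5, 0.02))
offline_stats = measure(offline(fake_synthesizer(5, 0.02)))
print("streaming: first audio %.3fs, total %.3fs" % streaming_stats[:2])
print("offline:   first audio %.3fs, total %.3fs" % offline_stats[:2])
```

Both runs return the same number of bytes; only the moment at which the first audio becomes available differs. The simulation does not model the throughput advantage of offline mode.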

Try It Out#

Experience Riva TTS on our demo platform: https://build.nvidia.com/explore/speech

Pretrained TTS Models#

The .riva models used to generate the RMIRs in the Quick Start scripts can be found at the NGC locations in the following table. The table also lists the supported voice names and samples generated with these models.

| Language | Model | Dataset | G2P | Gender | Voices | Voice Samples |
|---|---|---|---|---|---|---|
| English, French, Spanish (en-US, fr-FR, es-US) | MagpieTTS Multilingual AudioCodec | Multilingual | IPA | Multi-speaker | Magpie-Multilingual.EN-US.Female.Neutral, Magpie-Multilingual.EN-US.Female.Calm, Magpie-Multilingual.EN-US.Female.Fearful, Magpie-Multilingual.EN-US.Female.Happy, Magpie-Multilingual.EN-US.Female.Angry, Magpie-Multilingual.EN-US.Female.Female-1, Magpie-Multilingual.EN-US.Male.Calm, Magpie-Multilingual.EN-US.Male.Fearful, Magpie-Multilingual.EN-US.Male.Happy, Magpie-Multilingual.EN-US.Male.Neutral, Magpie-Multilingual.EN-US.Male.Angry, Magpie-Multilingual.EN-US.Male.Disgusted, Magpie-Multilingual.EN-US.Male.Male-1, Magpie-Multilingual.FR-FR.Male.Male-1, Magpie-Multilingual.FR-FR.Female.Female-1, Magpie-Multilingual.FR-FR.Female.Angry, Magpie-Multilingual.FR-FR.Female.Calm, Magpie-Multilingual.FR-FR.Female.Disgust, Magpie-Multilingual.FR-FR.Female.Sad, Magpie-Multilingual.FR-FR.Female.Happy, Magpie-Multilingual.FR-FR.Female.Fearful, Magpie-Multilingual.FR-FR.Female.Neutral, Magpie-Multilingual.FR-FR.Male.Neutral, Magpie-Multilingual.FR-FR.Male.Angry, Magpie-Multilingual.FR-FR.Male.Calm, Magpie-Multilingual.FR-FR.Male.Sad, Magpie-Multilingual.ES-US.Male.Male-1, Magpie-Multilingual.ES-US.Female.Female-1, Magpie-Multilingual.ES-US.Female.Neutral, Magpie-Multilingual.ES-US.Male.Neutral, Magpie-Multilingual.ES-US.Male.Angry, Magpie-Multilingual.ES-US.Female.Angry, Magpie-Multilingual.ES-US.Female.Happy, Magpie-Multilingual.ES-US.Male.Happy, Magpie-Multilingual.ES-US.Female.Calm, Magpie-Multilingual.ES-US.Male.Calm, Magpie-Multilingual.ES-US.Female.Pleasant_Surprise, Magpie-Multilingual.ES-US.Male.Pleasant_Surprise, Magpie-Multilingual.ES-US.Female.Sad, Magpie-Multilingual.ES-US.Male.Sad, Magpie-Multilingual.ES-US.Male.Disgust | 🔉 EN-US.Female.Female-1, 🔉 EN-US.Male.Male-1, 🔉 FR-FR.Male.Male-1, 🔉 FR-FR.Female.Female-1, 🔉 ES-US.Male.Male-1, 🔉 ES-US.Female.Female-1 |
| English (en-US) | FastPitch HiFi-GAN | English-US | IPA | Multi-speaker | English-US.Female-1, English-US.Male-1, English-US.Female-Calm, English-US.Female-Neutral, English-US.Female-Happy, English-US.Female-Angry, English-US.Female-Fearful, English-US.Female-Sad, English-US.Male-Calm, English-US.Male-Neutral, English-US.Male-Happy, English-US.Male-Angry | 🔉 🔉 🔉 🔉 🔉 🔉 🔉 🔉 🔉 🔉 🔉 🔉 |
| English (en-US) | Rad-TTS HiFi-GAN | English-US | IPA | Multi-speaker | English-US-RadTTS.Female-1, English-US-RadTTS.Male-1, English-US-RadTTS.Female-Calm, English-US-RadTTS.Female-Neutral, English-US-RadTTS.Female-Happy, English-US-RadTTS.Female-Angry, English-US-RadTTS.Female-Fearful, English-US-RadTTS.Female-Sad, English-US-RadTTS.Male-Calm, English-US-RadTTS.Male-Neutral, English-US-RadTTS.Male-Happy, English-US-RadTTS.Male-Angry | 🔉 🔉 |
| English (en-US) | FastPitch HiFi-GAN | LJSpeech | ARPABET | | ljspeech | |
| English (en-US) | FastPitch HiFi-GAN (Deprecated) | English-US | ARPABET | Multi-speaker | English-US.Female-1, English-US.Male-1 | |
| Mandarin (zh-CN) | FastPitch HiFi-GAN | Mandarin-CN | IPA | Multi-speaker | Mandarin-CN.Female-1, Mandarin-CN.Male-1, Mandarin-CN.Female-Calm, Mandarin-CN.Female-Neutral, Mandarin-CN.Male-Happy, Mandarin-CN.Male-Fearful, Mandarin-CN.Male-Sad, Mandarin-CN.Male-Calm, Mandarin-CN.Male-Neutral, Mandarin-CN.Male-Angry | |
| Spanish (es-ES) | FastPitch HiFi-GAN | Public/Proprietary | IPA | Female | Spanish-ES-Female-1 | |
| Spanish (es-ES) | FastPitch HiFi-GAN | Public/Proprietary | IPA | Male | Spanish-ES-Male-1 | |
| Spanish-US (es-US) | FastPitch HiFi-GAN | Public/Proprietary | IPA | Multi-speaker | Spanish-US.Female-1, Spanish-US.Male-1, Spanish-US.Female-Calm, Spanish-US.Male-Calm, Spanish-US.Female-Angry, Spanish-US.Male-Angry, Spanish-US.Female-Neutral, Spanish-US.Male-Neutral, Spanish-US.Female-Sad, Spanish-US.Male-Happy, Spanish-US.Male-Fearful, Spanish-US.Male-Sad | |
| Italian (it-IT) | FastPitch HiFi-GAN | Public/Proprietary | IPA | Female | Italian-IT-Female-1 | |
| Italian (it-IT) | FastPitch HiFi-GAN | Public/Proprietary | IPA | Male | Italian-IT-Male-1 | |
| German (de-DE) | FastPitch HiFi-GAN | Public/Proprietary | IPA | Male | German-DE-Male-1 | |

Features#

Riva TTS supports the following features:

  • Streaming and offline inference modes

  • Output audio encoding

  • SSML input

  • Zero Shot TTS (Beta Feature)

  • Emotion mixing (Beta Feature)

  • Custom pronunciation dictionary

Language and Model Support#

Riva Speech AI Skills provides pretrained models across a variety of languages, listed in the section above. Upgraded models and new languages are released regularly.

To select which language and model to deploy, simply change the variables tts_model and tts_language_code in the config.sh file within the quickstart directory of the Quick Start scripts.

Zero Shot TTS (Beta Feature)#

Riva introduces Zero Shot TTS as a beta feature. This feature lets users provide a speech prompt; the model adapts to the voice in the prompt and synthesizes speech in that voice.

SSML input#

Riva TTS supports SSML input in both streaming and offline inference modes. SSML tags give finer control over the generated speech. The supported tags are prosody, phoneme, and sub. Use the phoneme tag to specify the phonemes for a given word, the sub tag to specify a substitution for a given word, and the prosody tag to specify prosody attributes such as pitch, rate, and volume for a given text. Refer to the linked notebook for more details.

| Model | prosody | phoneme | sub |
|---|---|---|---|
| FastPitch | ✅ | ✅ | ✅ |
| RadTTS++ | ✅ | ✅ | ✅ |

MagpieTTS Multilingual

✅
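As an illustration of the three tags, the sketch below builds one SSML string with Python's standard xml.etree.ElementTree module. The element and attribute names (speak, phoneme with alphabet and ph, sub with alias, prosody with pitch, rate, and volume) follow the W3C SSML convention. The attribute values, the IPA string, and the sample sentence are illustrative assumptions; which values a given Riva model accepts is not stated here.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build an SSML document that uses the phoneme, sub, and prosody tags."""
    speak = ET.Element("speak")
    speak.text = "You say "

    # phoneme: spell out the pronunciation of one word (illustrative IPA).
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": "təˈmeɪtoʊ"})
    phoneme.text = "tomato"
    phoneme.tail = ", and the "

    # sub: speak the alias instead of the written text.
    sub = ET.SubElement(speak, "sub", {"alias": "World Wide Web Consortium"})
    sub.text = "W3C"
    sub.tail = " defines SSML. "

    # prosody: adjust pitch, rate, and volume for a span of text.
    # These values are assumptions taken from the W3C SSML vocabulary.
    prosody = ET.SubElement(
        speak, "prosody", {"pitch": "high", "rate": "slow", "volume": "loud"}
    )
    prosody.text = "This part is spoken differently."

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml()
print(ssml)
```

The resulting string would be sent as the request text in place of a plain transcript. Building the markup with an XML library rather than by string concatenation keeps it well formed and escapes special characters in the text.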

Emotion mixing (Beta feature)#

Riva supports mixing emotional intensities as a beta feature, which lets users control the emotion of the synthesized audio. The feature is accessed through the SSML emotion attribute. Quantization is currently supported only for calm, angry, fearful, neutral, and happy for female voices, and for calm, happy, and neutral for male voices. Emotion mixing is currently supported only in the RadTTS++ model.
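A client can check a requested emotion against the combinations listed above before building a request. The following is a minimal sketch with hypothetical names (SUPPORTED_EMOTIONS, unsupported_emotions); it encodes only the lists given in this section and does not call any Riva API.

```python
from typing import Dict, Iterable, List, Set

# Emotions for which quantization is supported, per the paragraph above.
SUPPORTED_EMOTIONS: Dict[str, Set[str]] = {
    "female": {"calm", "angry", "fearful", "neutral", "happy"},
    "male": {"calm", "happy", "neutral"},
}


def unsupported_emotions(gender: str, emotions: Iterable[str]) -> List[str]:
    """Return the requested emotions that are not supported for this gender."""
    allowed = SUPPORTED_EMOTIONS.get(gender.lower())
    if allowed is None:
        raise ValueError(f"unknown gender: {gender!r}")
    return [e for e in emotions if e.lower() not in allowed]


print(unsupported_emotions("Male", ["happy", "angry"]))  # ['angry']
```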

Custom pronunciation dictionary#

Riva TTS accepts a text dictionary that specifies the desired pronunciation for specific words synthesized by the server. Each entry in this custom dictionary consists of a word (grapheme) followed by the desired pronunciation (phoneme), separated by two spaces. Each word and pronunciation pair goes on its own line in the input dictionary file. Pass the dictionary in the custom_dictionary field when configuring a request from the client; the linked Python client example shows how.
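The file format described above (one entry per line, word and pronunciation separated by two spaces) can be written and checked with a few lines of Python. This is a sketch under that description only: the function names are hypothetical, the pronunciations are illustrative IPA strings, and the way the client serializes the dictionary into the custom_dictionary field is not shown.

```python
from typing import Dict

SEPARATOR = "  "  # two spaces between grapheme and phoneme


def format_custom_dictionary(entries: Dict[str, str]) -> str:
    """Serialize word -> pronunciation pairs, one pair per line."""
    lines = []
    for grapheme, phoneme in entries.items():
        if SEPARATOR in grapheme or "\n" in grapheme or "\n" in phoneme:
            raise ValueError(f"invalid entry for {grapheme!r}")
        lines.append(f"{grapheme}{SEPARATOR}{phoneme}")
    return "\n".join(lines)


def parse_custom_dictionary(text: str) -> Dict[str, str]:
    """Parse the dictionary text back into a mapping, validating each line."""
    entries: Dict[str, str] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        grapheme, sep, phoneme = line.partition(SEPARATOR)
        if not sep or not grapheme.strip() or not phoneme.strip():
            raise ValueError(f"line {line_no}: expected '<word>  <pronunciation>'")
        entries[grapheme.strip()] = phoneme.strip()
    return entries


# Illustrative entries; the pronunciations are assumptions, not Riva output.
entries = {"tomato": "təˈmeɪtoʊ", "NVIDIA": "ɛnˈvɪdiə"}
dictionary_text = format_custom_dictionary(entries)
print(dictionary_text)
```

Validating the file before sending it catches the most likely mistake, a single space instead of two, on the client side.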

Checking deployed models#

Once a server is running, you can retrieve the available models with the GetRivaSynthesisConfig RPC. For each model available for inference requests, the RPC returns the parameters used when the model was deployed.
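The sketch below shows one way a Python client might call this RPC and summarize the result. It rests on several assumptions that are not confirmed by this page: that the nvidia-riva-client package is installed, that the server listens on localhost:50051, and that each model's parameters include the keys language_code, voice_name, and subvoices (a comma-separated list of name:id pairs). The grouping helper is plain Python and runs without a server.

```python
from typing import Dict, Iterable, List, Mapping


def voices_by_language(
    model_parameters: Iterable[Mapping[str, str]],
) -> Dict[str, List[str]]:
    """Group full voice names by language code from per-model parameter maps."""
    result: Dict[str, List[str]] = {}
    for params in model_parameters:
        language = params.get("language_code", "unknown")
        voice = params.get("voice_name", "")
        # Assumed format: "Female-1:0,Male-1:1" -> ["Female-1", "Male-1"]
        subvoices = [s.split(":")[0] for s in params.get("subvoices", "").split(",") if s]
        names = [f"{voice}.{s}" for s in subvoices] or ([voice] if voice else [])
        result.setdefault(language, []).extend(names)
    return result


def fetch_model_parameters(uri: str = "localhost:50051") -> List[Dict[str, str]]:
    """Query a running server (needs nvidia-riva-client; the URI is an assumption)."""
    import riva.client
    import riva.client.proto.riva_tts_pb2 as rtts

    service = riva.client.SpeechSynthesisService(riva.client.Auth(uri=uri))
    response = service.stub.GetRivaSynthesisConfig(rtts.RivaSynthesisConfigRequest())
    return [dict(cfg.parameters) for cfg in response.model_config]


# Offline demonstration with hand-written parameters instead of a live server.
sample = [{"language_code": "en-US", "voice_name": "English-US",
           "subvoices": "Female-1:0,Male-1:1"}]
print(voices_by_language(sample))
```

With a server available, voices_by_language(fetch_model_parameters()) would produce the same kind of summary from live data.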

Output Audio Encoding#

Besides the default Pulse-Code Modulation (PCM) output stream, you can choose an Opus-encoded, compressed stream. Compression significantly reduces the required network bandwidth.
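A short calculation gives a sense of the saving. Uncompressed PCM bandwidth is the sample rate times the bit depth times the channel count. The 48 kHz sample rate, 16-bit mono format, and 64 kbit/s Opus bitrate below are illustrative assumptions, not Riva defaults.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Bitrate of an uncompressed PCM stream in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


pcm_kbps = pcm_bitrate_kbps(48000)  # 48000 * 16 * 1 / 1000 = 768.0
opus_kbps = 64.0  # hypothetical Opus target bitrate
ratio = pcm_kbps / opus_kbps
print(f"PCM: {pcm_kbps:.1f} kbit/s, Opus: {opus_kbps:.1f} kbit/s, ratio: {ratio:.1f}x")
```

Under these assumptions, the Opus stream needs one twelfth of the PCM bandwidth. The actual ratio depends on the sample rate and the bitrate the encoder is configured with.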