TTS Overview#

The Riva text-to-speech (TTS) service is built on a two-stage pipeline. The first model generates a mel-spectrogram from the input text, and the second model generates speech audio from that mel-spectrogram. Together, the two models form a TTS system that synthesizes natural-sounding speech from raw transcripts, without requiring any additional information such as the patterns or rhythms of speech.

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
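
The difference in time to first audio between the two modes can be illustrated with a toy simulation. The sketch below does not call Riva: `FakeClock`, `synthesize_chunks`, `batch_request`, and `streaming_request` are hypothetical names, and the cost of 0.5 seconds per sentence is an arbitrary assumption chosen to make the arithmetic easy to follow.

```python
from typing import Iterator, List, Optional, Tuple


class FakeClock:
    """Simulated clock, so the comparison is deterministic (no real sleeping)."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


def synthesize_chunks(sentences: List[str], clock: FakeClock,
                      cost: float) -> Iterator[bytes]:
    """Hypothetical backend: yields one audio chunk per sentence.

    Each chunk is assumed to take `cost` simulated seconds to generate.
    """
    for sentence in sentences:
        clock.advance(cost)  # pretend both pipeline stages ran for this chunk
        yield f"<audio:{sentence}>".encode("utf-8")


def batch_request(sentences: List[str], clock: FakeClock,
                  cost: float = 0.5) -> Tuple[bytes, float]:
    """Batch mode: the caller sees nothing until every chunk is generated."""
    audio = b"".join(synthesize_chunks(sentences, clock, cost))
    return audio, clock.now  # first audio arrives only at the very end


def streaming_request(sentences: List[str], clock: FakeClock,
                      cost: float = 0.5) -> Tuple[bytes, Optional[float]]:
    """Streaming mode: the caller receives each chunk as soon as it exists."""
    received = []
    first_audio_at = None
    for chunk in synthesize_chunks(sentences, clock, cost):
        if first_audio_at is None:
            first_audio_at = clock.now  # time to first audio
        received.append(chunk)
    return b"".join(received), first_audio_at


if __name__ == "__main__":
    text = ["one", "two", "three", "four"]
    _, batch_ttfa = batch_request(text, FakeClock())
    _, stream_ttfa = streaming_request(text, FakeClock())
    print(f"batch time to first audio:     {batch_ttfa:.1f}s")
    print(f"streaming time to first audio: {stream_ttfa:.1f}s")
```

Both functions produce the same audio bytes; only the moment the first chunk becomes available differs, and that gap grows with the length of the request.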

Try It Out#

Voices#

Language	Language Code	Gender	Voice Name	Sample
English	en-US	Female	`English-US.Female-1`	`🔉`
English	en-US	Male	`English-US.Male-1`	`🔉`

Checking deployed models#

Once a server is running, you can retrieve the available models with the `GetRivaSynthesisConfig` RPC. For each model available to serve inference requests, the RPC returns the parameters used when the model was deployed.
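
A minimal sketch of how the response might be consumed from Python is shown below. It rests on several assumptions not stated on this page: that the `nvidia-riva-client` package is installed, that the server listens on `localhost:50051`, that each returned entry exposes `model_name` and a string-to-string `parameters` map, and that the map uses the keys `language_code` and `voice_name`. The helper names `fetch_model_configs` and `summarize_models` are hypothetical. The summarizing step is kept separate from the network call so it can be tried with stand-in objects.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def summarize_models(model_configs: Iterable[Any]) -> List[Dict[str, str]]:
    """Reduce each model config to its name, language code, and voice name.

    Assumes every entry has `model_name` and a mapping-like `parameters`.
    Missing keys are reported as empty strings rather than raising.
    """
    summary = []
    for cfg in model_configs:
        params = dict(cfg.parameters)
        summary.append({
            "model_name": cfg.model_name,
            "language_code": params.get("language_code", ""),  # assumed key
            "voice_name": params.get("voice_name", ""),        # assumed key
        })
    return summary


def fetch_model_configs(uri: str = "localhost:50051") -> Iterable[Any]:
    """Query a running server (assumption: nvidia-riva-client is installed)."""
    # Imported lazily so the rest of this sketch runs without the package.
    import riva.client
    import riva.client.proto.riva_tts_pb2 as rtts

    auth = riva.client.Auth(uri=uri)
    service = riva.client.SpeechSynthesisService(auth)
    response = service.stub.GetRivaSynthesisConfig(
        rtts.RivaSynthesisConfigRequest()
    )
    return response.model_config


if __name__ == "__main__":
    # Offline demonstration with stand-in objects instead of a live server.
    fake_configs = [
        SimpleNamespace(
            model_name="example-tts-model",  # hypothetical model name
            parameters={"language_code": "en-US",
                        "voice_name": "English-US.Female-1"},
        ),
    ]
    for row in summarize_models(fake_configs):
        print(row)
```

With a live server, `summarize_models(fetch_model_configs())` would replace the stand-in list; the exact parameter keys returned depend on how each model was deployed.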
