TTS Overview#

The Riva text-to-speech (TTS) service uses a two-stage pipeline. The first model generates a mel spectrogram from the input text, and the second model generates speech audio from that spectrogram. Together, the two stages let you synthesize natural-sounding speech from raw transcripts, without supplying additional information such as the patterns or rhythms of speech.
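The data flow through the two stages can be sketched in a few lines of Python. This is only a toy illustration: `spectrogram_stage`, `vocoder_stage`, `N_MELS`, and `HOP_LENGTH` are hypothetical stand-ins chosen for the example, not Riva's models or parameters, and the values they produce are dummy numbers rather than real audio.

```python
from typing import List

# Assumed sizes for illustration only; not taken from Riva.
N_MELS = 80        # mel bins per spectrogram frame
HOP_LENGTH = 256   # audio samples produced per spectrogram frame


def spectrogram_stage(text: str) -> List[List[float]]:
    """Stand-in for the first model: one dummy mel frame per character."""
    return [[float(ord(ch) % N_MELS)] * N_MELS for ch in text]


def vocoder_stage(mel: List[List[float]]) -> List[float]:
    """Stand-in for the second model: HOP_LENGTH dummy samples per frame."""
    samples: List[float] = []
    for frame in mel:
        # Average the frame and scale it into the range [0, 1).
        level = sum(frame) / (len(frame) * N_MELS)
        samples.extend([level] * HOP_LENGTH)
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> mel spectrogram -> audio samples."""
    return vocoder_stage(spectrogram_stage(text))
```

The point of the sketch is the shape of the intermediate data: the first stage emits a sequence of frames, and the second stage expands each frame into a fixed number of audio samples.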

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
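The difference between the two modes can be illustrated with a small Python sketch. It does not call Riva: `fake_synthesize_chunk`, `CHUNK_WORDS`, and `work_log` are hypothetical names invented for this example, and the "audio" is just zero bytes. The sketch only shows why a streaming caller receives its first audio after a fraction of the total work.

```python
from typing import Iterator, List

CHUNK_WORDS = 2            # assumed chunk size, in words, for this toy example
work_log: List[str] = []   # records each chunk of text as it is synthesized


def fake_synthesize_chunk(words: List[str]) -> bytes:
    """Hypothetical stand-in for synthesis: 4 dummy bytes per character."""
    chunk_text = " ".join(words)
    work_log.append(chunk_text)
    return b"\x00" * (4 * len(chunk_text))


def synthesize_streaming(text: str) -> Iterator[bytes]:
    """Yield each audio chunk as soon as it has been generated."""
    words = text.split()
    for i in range(0, len(words), CHUNK_WORDS):
        yield fake_synthesize_chunk(words[i:i + CHUNK_WORDS])


def synthesize_batch(text: str) -> bytes:
    """Return audio only after the whole request has been generated."""
    return b"".join(synthesize_streaming(text))
```

With the five-word input `"one two three four five"`, calling `next()` on the streaming generator returns audio after a single chunk has been synthesized, whereas `synthesize_batch` returns nothing until all three chunks are done.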

Voices#

Language   Language Code   Gender   Voice Name
English    en-US           Female   English-US-Female-1
English    en-US           Male     English-US-Male-1

Note

The Riva Quick Start scripts download only the female voice by default. To enable the male voice, uncomment the male model download link in config.sh. If Riva is currently running, stop it with riva_stop.sh, then run riva_init.sh followed by riva_start.sh. Finally, specify English-US-Male-1 as the voice name in the Riva TTS request.
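As a rough illustration of selecting a voice, the sketch below builds the parameters for a request and rejects voice names that are not listed in the Voices section. The helper `build_tts_request` and the dictionary keys `text`, `language_code`, and `voice_name` are assumptions made for this example; check the Riva API reference for the actual request fields before relying on them.

```python
from typing import Dict

# Voices listed in the Voices section: voice name -> language code.
VOICES: Dict[str, str] = {
    "English-US-Female-1": "en-US",
    "English-US-Male-1": "en-US",
}


def build_tts_request(text: str, voice_name: str = "English-US-Female-1") -> Dict[str, str]:
    """Hypothetical helper: assemble request parameters for a chosen voice."""
    if voice_name not in VOICES:
        known = ", ".join(sorted(VOICES))
        raise ValueError(f"Unknown voice {voice_name!r}; expected one of: {known}")
    return {
        "text": text,
        "language_code": VOICES[voice_name],  # assumed field name
        "voice_name": voice_name,             # assumed field name
    }


# Select the male voice once its model has been downloaded and deployed.
request = build_tts_request("Hello from Riva.", voice_name="English-US-Male-1")
```

Validating the name on the client side is a design choice for the sketch; it catches typos early but does not confirm that the corresponding model is actually deployed on the server.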