TTS Overview#
The Riva TTS service implements text-to-speech (TTS) as a two-stage pipeline. The first model generates a mel-spectrogram from the input text, and the second model generates speech from that mel-spectrogram. Together, the two models form a TTS system that synthesizes natural-sounding speech from raw transcripts without any additional information, such as patterns or rhythms of speech.
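The two-stage structure can be pictured as a composition of two functions. The following is a toy sketch only: `spectrogram_stage`, `vocoder_stage`, `N_MELS`, `HOP_LENGTH`, and `FRAMES_PER_CHAR` are hypothetical stand-ins, not Riva APIs or Riva's actual model parameters, and the stand-in "models" emit zeros rather than real audio.

```python
from typing import List

N_MELS = 80          # hypothetical number of mel bands per frame
HOP_LENGTH = 256     # hypothetical number of audio samples per mel frame
FRAMES_PER_CHAR = 5  # hypothetical; a real model predicts durations itself


def spectrogram_stage(text: str) -> List[List[float]]:
    """Stage 1 stand-in: map text to a mel-spectrogram (frames x mel bands)."""
    n_frames = len(text) * FRAMES_PER_CHAR
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocoder_stage(mel: List[List[float]]) -> List[float]:
    """Stage 2 stand-in: map each mel frame to HOP_LENGTH audio samples."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Run both stages back to back: text -> mel-spectrogram -> waveform."""
    return vocoder_stage(spectrogram_stage(text))
```

The point of the sketch is the data flow: the only input is raw text, and the intermediate mel-spectrogram is the sole interface between the two models.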
Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
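The difference in time to first audio can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Riva client API: `fake_audio_chunks`, `batch_request`, `streaming_request`, and the per-chunk cost `CHUNK_COST_MS` are all hypothetical, and elapsed time is simulated rather than measured.

```python
from typing import Iterator, Tuple

CHUNK_COST_MS = 40  # hypothetical time needed to generate one audio chunk


def fake_audio_chunks(text: str, chars_per_chunk: int = 20) -> Iterator[bytes]:
    """Stand-in for synthesis: emit one 'audio' chunk per slice of text."""
    for start in range(0, len(text), chars_per_chunk):
        yield text[start:start + chars_per_chunk].encode("utf-8")


def batch_request(text: str) -> Tuple[bytes, int]:
    """Return the full audio and the simulated time at which it is available."""
    chunks = list(fake_audio_chunks(text))
    return b"".join(chunks), len(chunks) * CHUNK_COST_MS


def streaming_request(text: str) -> Iterator[Tuple[bytes, int]]:
    """Yield each chunk with the simulated time at which it becomes available."""
    elapsed_ms = 0
    for chunk in fake_audio_chunks(text):
        elapsed_ms += CHUNK_COST_MS
        yield chunk, elapsed_ms


text = "x" * 100  # five chunks of 20 characters
_, batch_first_audio_ms = batch_request(text)
_, stream_first_audio_ms = next(streaming_request(text))
```

In this toy model the batch caller waits for all five chunks before hearing anything, while the streaming caller receives the first chunk after a single chunk's worth of work. The total audio is identical in both cases.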
Pretrained TTS Models#
The .riva models used to generate the RMIRs in the Quick Start scripts can be found at the following NGC locations. The following table also lists the supported voice names and samples generated with these models.
Language | Model | Dataset | G2P | Gender | Voices | Voice Samples |
---|---|---|---|---|---|---|
English (en-US) | | English-US | IPA | Multi-speaker | | |
English (en-US) | | English-US | IPA | Multi-speaker | | |
English (en-US) | | LJSpeech | ARPABET | | | |
English (en-US) | | English-US | ARPABET | Multi-speaker | | |
Mandarin (zh-CN) | | Mandarin-CN | IPA | Multi-speaker | | |
Spanish (es-ES) | | Public/Proprietary | IPA | Female | | |
Spanish (es-ES) | | Public/Proprietary | IPA | Male | | |
Spanish-US (es-US) | | Public/Proprietary | IPA | Multi-speaker | | |
Italian (it-IT) | | Public/Proprietary | IPA | Female | | |
Italian (it-IT) | | Public/Proprietary | IPA | Male | | |
German (de-DE) | | Public/Proprietary | IPA | Male | | |
Language Support#
Riva Speech AI Skills provides pretrained models across a variety of languages, which are listed in the section above. Upgraded models and new languages are released regularly.
To select which language to deploy, change the variable tts_language_code in the config.sh file within the quickstart directory of the Quick Start scripts.
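If you prefer to script this change, a minimal sketch follows. It assumes that tts_language_code is assigned on a single line of config.sh and that a Bash array of the form `tts_language_code=("en-US")` is an acceptable value; both are assumptions about the file layout, so check your copy of config.sh first. The helper name `set_tts_language` is hypothetical.

```python
import re


def set_tts_language(config_text: str, language_code: str) -> str:
    """Return config_text with the tts_language_code assignment replaced.

    Assumes the assignment occupies exactly one line (an assumption).
    """
    pattern = re.compile(r"^tts_language_code=.*$", flags=re.MULTILINE)
    new_line = f'tts_language_code=("{language_code}")'
    # A callable replacement avoids any backslash interpretation in new_line.
    updated, count = pattern.subn(lambda _match: new_line, config_text)
    if count == 0:
        raise ValueError("tts_language_code assignment not found")
    return updated


sample = 'asr_language_code=("en-US")\ntts_language_code=("en-US")\n'
updated_sample = set_tts_language(sample, "es-US")
```

The function works on text rather than on a file path so that it can be checked without touching a real deployment. Read config.sh, pass its contents through the function, and write the result back once you are satisfied with it.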
Zero Shot TTS (Beta Feature)#
Riva introduces Zero Shot TTS as a beta feature. This feature lets you provide a speech prompt; the model adapts to the voice in the prompt and uses that voice to synthesize speech.
Emotion mixing (Beta feature)#
Riva now supports mixing of emotional intensities as a beta feature, which lets you control the emotion in the synthesized audio. The feature is accessible through the SSML emotion attribute. Currently, quantization is supported only for calm, angry, fearful, neutral, and happy for Female, and for calm, happy, and neutral for male.
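The support matrix above can be encoded as a lookup so that a client rejects unsupported combinations before sending a request. This is a minimal sketch: the `SUPPORTED_EMOTIONS` table mirrors the list in this section, while the helper name `is_emotion_supported` and the lowercase normalization are illustrative choices, not part of Riva.

```python
# Emotions listed in this section, keyed by lowercase voice gender.
SUPPORTED_EMOTIONS = {
    "female": {"calm", "angry", "fearful", "neutral", "happy"},
    "male": {"calm", "happy", "neutral"},
}


def is_emotion_supported(gender: str, emotion: str) -> bool:
    """Return True if the section lists this emotion for this voice gender."""
    emotions = SUPPORTED_EMOTIONS.get(gender.strip().lower(), set())
    return emotion.strip().lower() in emotions
```

For example, `is_emotion_supported("Female", "angry")` is true, while `is_emotion_supported("male", "angry")` is false. Because the feature is in beta, the table may need updating as support changes.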
Checking deployed models#
Once a server is running, you can retrieve the available models with the GetRivaSynthesisConfig RPC.
For each model available for inference requests, the RPC returns the parameters used when the model was deployed.
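Once you have the per-model parameters, a common next step is to group the deployed voices by language. The sketch below works on plain dictionaries rather than on the actual RPC response. The parameter keys `language_code` and `voice_name`, the helper `voices_by_language`, and the sample values are all assumptions for illustration, so adapt them to the parameters your server actually returns.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def voices_by_language(
    model_parameters: Iterable[Mapping[str, str]],
) -> Dict[str, List[str]]:
    """Group voice names by language code from per-model parameter maps."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for params in model_parameters:
        # Both keys are assumed names; fall back to placeholders if absent.
        language = params.get("language_code", "unknown")
        voice = params.get("voice_name", "unnamed")
        grouped[language].append(voice)
    return dict(grouped)


# Stand-in data shaped like "one parameter map per deployed model".
deployed = [
    {"language_code": "en-US", "voice_name": "voice-a"},
    {"language_code": "en-US", "voice_name": "voice-b"},
    {"language_code": "de-DE", "voice_name": "voice-c"},
]
available = voices_by_language(deployed)
```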
Output Audio Encoding#
Besides the default Pulse-Code Modulation (PCM) output stream, you can choose an Opus-encoded, compressed stream. Compression can significantly reduce the network bandwidth required.
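A quick calculation shows the scale of the saving. Uncompressed PCM bitrate is the sample rate times the bits per sample times the channel count. The figures below (44.1 kHz, 16-bit, mono PCM and a 64 kbit/s Opus stream) are illustrative assumptions, not Riva defaults, and the function names are hypothetical.

```python
def pcm_bitrate_kbps(
    sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1
) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


def bandwidth_saving(pcm_kbps: float, compressed_kbps: float) -> float:
    """Fraction of bandwidth saved by the compressed stream (0.0 to 1.0)."""
    return 1.0 - compressed_kbps / pcm_kbps


pcm_kbps = pcm_bitrate_kbps(44100)         # 705.6 kbit/s
saving = bandwidth_saving(pcm_kbps, 64.0)  # roughly 0.91
```

Under these assumptions, the compressed stream uses about one eleventh of the bandwidth of the PCM stream. The actual saving depends on the sample rate and on the Opus bitrate in use.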