Speech Synthesis¶
The Riva TTS service implements a two-stage text-to-speech (TTS) pipeline: Riva first generates a mel spectrogram from text using the first model, and then generates speech from that spectrogram using the second model. Together, these models form a text-to-speech system that synthesizes natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.
For new users, it is recommended to start with the FastPitch + HiFi-GAN models.
Model Architectures - Mel Spectrogram Generators¶
FastPitch: A non-autoregressive, transformer-based spectrogram generator that predicts phoneme durations and pitch, described in the FastPitch: Parallel Text-to-Speech with Pitch Prediction paper. FastPitch is the recommended fully-parallel text-to-speech model; it is based on FastSpeech and conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and the generated speech can be further controlled by modifying these contours. FastPitch can therefore change the perceived emotional state of the speaker or place emphasis on specific lexical units.
Tacotron 2: A modified Tacotron 2 model for mel spectrogram generation, from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper. Tacotron 2 is a sequence-to-sequence model that generates mel spectrograms from text and was originally designed to be used with either a mel spectrogram inversion algorithm, such as the Griffin-Lim algorithm, or a neural vocoder such as WaveNet.
Model Architectures - Vocoders¶
HiFi-GAN: A GAN-based vocoder from the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis paper. HiFi-GAN is the recommended model architecture, achieving both efficient and high-fidelity speech synthesis.
WaveGlow: A flow-based vocoder from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper. Riva uses WaveGlow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. Unlike other neural vocoders, WaveGlow is not auto-regressive, which makes it more performant when running on GPUs.
Services¶
Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text is generated and can achieve higher throughput. When making a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests.
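The latency tradeoff between the two modes can be illustrated with a toy simulation. This is not the Riva API; the chunk count and per-chunk generation time are invented for illustration only:

```python
import time

def generate_chunks(n_chunks=5, gen_time=0.01):
    """Toy stand-in for a TTS engine that synthesizes audio chunk by chunk."""
    for i in range(n_chunks):
        time.sleep(gen_time)  # time spent synthesizing this chunk
        yield f"chunk-{i}"

def time_to_first_audio_batch():
    """Batch mode: nothing is returned until the full sequence is generated."""
    start = time.monotonic()
    list(generate_chunks())  # wait for every chunk
    return time.monotonic() - start

def time_to_first_audio_streaming():
    """Streaming mode: the first chunk is returned as soon as it is ready."""
    start = time.monotonic()
    next(iter(generate_chunks()))  # stop after the first chunk arrives
    return time.monotonic() - start
```

With these made-up timings, streaming's time to first audio is roughly one chunk-generation interval, while batch waits for all chunks, which is why streaming helps most on long requests.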
Model Deployment¶
Like all Riva models, Riva TTS requires the following steps: creating .riva files from trained model checkpoints, building a pipeline configuration (RMIR) with riva-build, and deploying the resulting models into a model repository.
The following sections walk through examples of these steps.
Creating Riva files¶
Riva files can be created from .nemo or .tao files. The following is an example of how a HiFi-GAN model can be converted to a .riva file from a .nemo file. First, download the .nemo file from NGC onto the host system. Then, run the NeMo container, sharing the .nemo file with the container via the -v option.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
-v $(pwd):/NeMo \
--shm-size=8g \
-p 8888:8888 \
-p 6006:6006 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/snd \
nvcr.io/nvidia/nemo:1.4.0
After the container has launched, run:
pip3 install nvidia-pyindex
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/riva/riva_quickstart/versions/1.7.0-beta/files/riva_api-1.7.0b0-py3-none-any.whl -O riva_api-1.7.0b0-py3-none-any.whl
pip3 install nemo2riva-1.7.0_beta-py3-none-any.whl
nemo2riva --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo
You can repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. Be sure that you are using the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva-{version}_beta-py3-none-any.whl version when performing the step above.
Riva Build: FastPitch and HiFi-GAN Pipeline Configuration¶
Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<FastPitch_riva>:<encryption_key> \
/servicemaker-dev/<HiFi-GAN_riva>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
- <rmir_filename> is the Riva rmir file that is generated
- <encryption_key> is the encryption key used during the export of the .riva file
- pipeline_name is an optional user-defined name for the components in the model repository
- <FastPitch_riva> is the name of the riva file for FastPitch
- <HiFi-GAN_riva> is the name of the riva file for HiFi-GAN
- <abbr_file> is the name of the file containing abbreviations and their corresponding expansions
- <dictionary_file> is the name of the file containing the pronunciation dictionary mapping words to their phonetic representation in ARPABET
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
Riva Build: Tacotron 2 and WaveGlow Pipeline Configuration¶
In the simplest use case, you can deploy a Tacotron 2 and WaveGlow TTS pipeline as follows:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<tacotron_nemo_filename> \
/servicemaker-dev/<waveglow_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
- <rmir_filename> is the Riva rmir file that is generated
- <encryption_key> is the encryption key used during the export of the .riva file
- pipeline_name is an optional user-defined name for the components in the model repository
- <tacotron_nemo_filename> is the name of the nemo checkpoint file for Tacotron 2
- <waveglow_riva_filename> is the name of the riva file for the universal WaveGlow model
- <abbr_file> is the name of the file containing abbreviations and their corresponding expansions
- <dictionary_file> is the name of the file containing the pronunciation dictionary mapping words to their phonetic representation in ARPABET
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
Speech Synthesis Markup Language (SSML)¶
Riva 1.8 adds preliminary support for SSML. Only the FastPitch model is supported at this time; there are no plans to add this functionality to Tacotron 2. The FastPitch model must be exported using NeMo 1.5.1 and the nemo2riva 1.8.0 tool. All SSML inputs must be valid XML documents that use the <speak> root tag. Non-valid XML, and valid XML with a different root tag, is treated as raw input text. Riva currently supports the following in a limited capacity:
- prosody tag
  - pitch attribute
  - rate attribute
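As a rough sketch of the rule above, a client could pre-check its input with Python's standard XML parser. This is illustrative only; the speak-root behavior is as documented above, but Riva's internal handling may differ:

```python
import xml.etree.ElementTree as ET

def is_valid_ssml(text: str) -> bool:
    """Return True if `text` is well-formed XML with a <speak> root tag.

    Anything else would be treated by Riva as raw input text.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag == "speak"

ssml = '<speak>Hello <prosody pitch="+1.8" rate="35%">world</prosody></speak>'
```

Here `is_valid_ssml(ssml)` is True, while plain text or a document with a different root tag (for example `<voice>hi</voice>`) would fall through to raw-text handling.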
Pitch Attribute¶
Riva currently supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio returned. Note that the attribute value is multiplied by the speaker's pitch standard deviation, computed when the FastPitch model was trained, to produce the pitch shift in Hz. For the pretrained checkpoint trained on LJSpeech, the standard deviation is 52.185, so a pitch attribute of 1.25 results in a pitch shift up of approximately 1.25 * 52.185 = 65.23 Hz. The pitch attribute is expressed in the following formats:
pitch="1"
pitch="+1.8"
pitch="-0.65"
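The arithmetic above can be checked directly. This helper is hypothetical (it is not part of the Riva API); it simply applies the multiply-by-standard-deviation rule and the documented [-3, 3] range:

```python
# Pitch standard deviation of the pretrained LJSpeech FastPitch checkpoint,
# as stated above.
LJSPEECH_PITCH_STD = 52.185

def pitch_shift_hz(attr_value: float, pitch_std: float = LJSPEECH_PITCH_STD) -> float:
    """Convert an SSML pitch attribute value to an approximate shift in Hz."""
    if not -3.0 <= attr_value <= 3.0:
        raise ValueError("pitch attribute must be in [-3, 3]")
    return attr_value * pitch_std

print(round(pitch_shift_hz(1.25), 2))  # 65.23
```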
Rate Attribute¶
Riva currently supports a “%” relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged, and no audio returned. The rate attribute is expressed in the following formats:
rate="35%"
rate="+200%"
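A hypothetical client-side check of these formats and the documented [25%, 250%] range (this is not Riva's actual parser) could look like:

```python
import re

def parse_rate(attr: str) -> float:
    """Parse a rate attribute such as '35%' or '+200%' and enforce the
    documented [25%, 250%] range. Illustrative only; not Riva's parser."""
    m = re.fullmatch(r"([+-]?\d+(?:\.\d+)?)%", attr)
    if m is None:
        raise ValueError(f"unsupported rate format: {attr!r}")
    value = float(m.group(1))
    if not 25.0 <= value <= 250.0:
        raise ValueError("rate must be within [25%, 250%]")
    return value
```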
Warning
The pitch attribute is not currently in compliance with the SSML specs, and does not support “Hz”, “st”, “%” changes, nor does it support “x-low”, “low”, “medium”, “high”, “x-high”, or “default”. Support is planned for a future Riva release.
The rate attribute does not currently support “x-low”, “low”, “medium”, “high”, “x-high”, or “default”. Support is planned for a future Riva release.
For SSML examples with sample audio, refer to the Riva_speech_API_demo notebook.
Pretrained Models¶
| Task | Architecture | Language | Dataset | Compatibility with TAO Toolkit 3.0-21.08 | Compatibility with NeMo 1.5.1 | Link |
|---|---|---|---|---|---|---|
| Mel Spectrogram Generation | FastPitch | English | LJSpeech | No | Yes | |
| Mel Spectrogram Generation | Tacotron 2 | English | LJSpeech | No | Yes | |
| Vocoder | HiFi-GAN | English | LJSpeech | No | Yes | |
| Vocoder | WaveGlow | English | LJSpeech | No | Yes | |