Speech Synthesis

The Riva TTS service implements a two-stage pipeline: a first model generates a mel spectrogram from the input text, and a second model generates speech audio from that spectrogram. Together, they form a text-to-speech system that synthesizes natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.
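
The two-stage hand-off can be illustrated with a toy sketch. The functions below are stand-ins, not the actual Riva models; they only show the shapes that flow between the stages (the mel count, frames-per-character, and hop length are illustrative values):

```python
import numpy as np

# Stage 1 stand-in: a "spectrogram generator" mapping text to a mel
# spectrogram of shape (n_mels, n_frames). Riva uses FastPitch or
# Tacotron 2 here; this toy version just fabricates silent frames.
def text_to_mel(text: str, n_mels: int = 80, frames_per_char: int = 10) -> np.ndarray:
    n_frames = len(text) * frames_per_char
    return np.zeros((n_mels, n_frames), dtype=np.float32)

# Stage 2 stand-in: a "vocoder" mapping mel frames to waveform samples.
# Riva uses HiFi-GAN or WaveGlow here; hop_length controls the
# frames-to-samples upsampling factor.
def mel_to_audio(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    n_samples = mel.shape[1] * hop_length
    return np.zeros(n_samples, dtype=np.float32)

mel = text_to_mel("Hello world")   # shape (80, 110)
audio = mel_to_audio(mel)          # 110 frames * 256 samples/frame
print(mel.shape, audio.shape)
```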

For new users, it is recommended to start with the FastPitch + HiFi-GAN models.

Model Architectures - Mel Spectrogram Generators

FastPitch: A non-autoregressive, transformer-based spectrogram generator from the paper FastPitch: Parallel Text-to-Speech with Pitch Prediction. FastPitch is the recommended mel spectrogram generator. It is a fully-parallel model based on FastSpeech, conditioned on fundamental frequency contours: the model predicts pitch contours during inference, and the predicted contours can be modified to further control the generated speech. FastPitch can therefore change the perceived emotional state of the speaker or put emphasis on certain lexical units.

Tacotron 2: A modified Tacotron 2 model for mel spectrogram generation from the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Tacotron 2 is a sequence-to-sequence model that generates mel spectrograms from text. It was originally designed to be used either with a mel spectrogram inversion algorithm, such as the Griffin-Lim algorithm, or with a neural decoder such as WaveNet.

Model Architectures - Vocoders

HiFi-GAN: A GAN-based vocoder from the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. HiFi-GAN is the recommended vocoder architecture, achieving both efficient and high-fidelity speech synthesis.

WaveGlow: A flow-based vocoder from the paper WaveGlow: A Flow-based Generative Network for Speech Synthesis. As a neural vocoder, WaveGlow converts frame-level acoustic features into a waveform at audio rates. Unlike many other neural vocoders, WaveGlow is not autoregressive, which makes it more performant when running on GPUs.

Services

Riva TTS supports both streaming and batch inference modes. In batch mode, no audio is returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, significantly reducing latency (as measured by time to first audio) for large requests.
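
The latency difference can be sketched with a toy synthesizer that yields audio chunks from a generator. These are illustrative stand-ins, not the Riva client API; the chunk size and per-chunk delay are invented for the demonstration:

```python
import time
from typing import Iterator

CHUNK_SECONDS = 0.01  # pretend each chunk takes 10 ms to synthesize

# Stand-in for streaming TTS: yields each audio chunk as soon as it
# is "synthesized" rather than waiting for the whole utterance.
def synthesize_streaming(text: str, chunk_chars: int = 8) -> Iterator[bytes]:
    for i in range(0, len(text), chunk_chars):
        time.sleep(CHUNK_SECONDS)               # simulated synthesis work
        yield text[i:i + chunk_chars].encode()  # stand-in for PCM audio

# Stand-in for batch TTS: same work, but nothing is returned until
# the entire audio sequence is ready.
def synthesize_batch(text: str) -> bytes:
    return b"".join(synthesize_streaming(text))

text = "The quick brown fox jumps over the lazy dog."

t0 = time.perf_counter()
first_chunk = next(synthesize_streaming(text))
streaming_ttfa = time.perf_counter() - t0   # time to first audio

t0 = time.perf_counter()
audio = synthesize_batch(text)
batch_ttfa = time.perf_counter() - t0       # full-utterance latency

print(f"streaming TTFA: {streaming_ttfa:.3f}s, batch TTFA: {batch_ttfa:.3f}s")
```

The longer the requested text, the larger the gap between the two times grows, which is why streaming mode matters most for large requests.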

Model Deployment

Like all Riva models, Riva TTS requires the following steps:

  • Create .riva files for each model from either a .tao or .nemo file as outlined in the respective TAO and NeMo sections

  • Create .rmir files for each Riva skill (for example, ASR, NLP, and TTS) using riva-build

  • Create model directories using riva_deploy

  • Deploy the model directory using riva_server

The following sections walk through examples of the steps outlined above.

Creating Riva files

Riva files can be created from .nemo or .tao files. The following example converts a HiFi-GAN model from a .nemo file to a .riva file. First, download the .nemo file from NGC onto the host system. Then run the NeMo container, sharing the directory containing the .nemo file with the container using the -v option.

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
    -v $(pwd):/NeMo \
    --shm-size=8g \
    -p 8888:8888 \
    -p 6006:6006 \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --device=/dev/snd \
    nvcr.io/nvidia/nemo:1.4.0

After the container has launched, run:

pip3 install nvidia-pyindex
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/riva/riva_quickstart/versions/1.7.0-beta/files/nemo2riva-1.7.0_beta-py3-none-any.whl -O nemo2riva-1.7.0_beta-py3-none-any.whl
pip3 install nemo2riva-1.7.0_beta-py3-none-any.whl
nemo2riva --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo

You can repeat this process for each .nemo model to generate .riva files; it is suggested that you do so for FastPitch before continuing to the next step. When performing these steps, be sure to use the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva-{version}_beta-py3-none-any.whl version.

Riva Build: FastPitch and HiFi-GAN Pipeline Configuration

Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:

riva-build speech_synthesis \
        /servicemaker-dev/<rmir_filename>:<encryption_key> \
        /servicemaker-dev/<FastPitch_riva>:<encryption_key> \
        /servicemaker-dev/<HiFi-GAN_riva>:<encryption_key> \
        --voice_name=<pipeline_name> \
        --abbreviations_file=/servicemaker-dev/<abbr_file> \
        --arpabet_file=/servicemaker-dev/<dictionary_file>

where:

  • <rmir_filename> is the Riva rmir file that is generated

  • <encryption_key> is the encryption key used during the export of the .riva file

  • <pipeline_name> is an optional user-defined name for the components in the model repository

  • <FastPitch_riva> is the name of the riva file for FastPitch

  • <HiFi-GAN_riva> is the name of the riva file for HiFi-GAN

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions

  • <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET.

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and each riva filename; otherwise, the key can be omitted.

Riva Build: Tacotron 2 and WaveGlow Pipeline Configuration

In the simplest use case, you can deploy a Tacotron 2 and WaveGlow TTS pipeline as follows:

riva-build speech_synthesis \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<tacotron_nemo_filename> \
    /servicemaker-dev/<waveglow_riva_filename>:<encryption_key> \
    --voice_name=<pipeline_name> \
    --abbreviations_file=/servicemaker-dev/<abbr_file> \
    --arpabet_file=/servicemaker-dev/<dictionary_file>

where:

  • <rmir_filename> is the Riva rmir file that is generated

  • <encryption_key> is the encryption key used during the export of the .riva file

  • <pipeline_name> is an optional user-defined name for the components in the model repository

  • <tacotron_nemo_filename> is the name of the nemo checkpoint file for Tacotron 2

  • <waveglow_riva_filename> is the name of the riva file for the universal WaveGlow model

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions

  • <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET.

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and each riva filename; otherwise, the key can be omitted.

Speech Synthesis Markup Language (SSML)

Riva 1.8.0 adds preliminary support for SSML. Only the FastPitch model is supported at this time; there are no plans to add this functionality to Tacotron 2. The FastPitch model must be exported using NeMo 1.5.1 and the nemo2riva 1.8.0 tool. All SSML inputs must be valid XML documents that use the <speak> root tag. Invalid XML, and valid XML with a different root tag, is treated as raw input text. Riva currently supports the following in a limited capacity:

  • prosody tag

    • pitch attribute

    • rate attribute
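
For example, a well-formed SSML input using the prosody tag looks like the string below. The root-tag rule above can be mirrored with a standard XML parser; this is an illustrative check, not Riva's actual parsing code:

```python
import xml.etree.ElementTree as ET

# Example SSML input using the prosody tag with the supported
# pitch and rate attributes.
ssml = '<speak>Hello, <prosody pitch="1.0" rate="75%">how are you?</prosody></speak>'

# Mirror the documented rule: an input is handled as SSML only if it
# parses as XML AND its root tag is <speak>; anything else is
# treated as raw input text.
def is_ssml(text: str) -> bool:
    try:
        return ET.fromstring(text).tag == "speak"
    except ET.ParseError:
        return False

print(is_ssml(ssml))                  # True
print(is_ssml("<voice>hi</voice>"))   # valid XML, but wrong root tag
print(is_ssml("plain text, not XML"))
```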

Pitch Attribute

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]; values outside this range result in an error being logged and no audio returned. The attribute value is multiplied by the speaker's pitch standard deviation (computed when the FastPitch model is trained) to produce the pitch shift in Hz. For the pretrained checkpoint trained on LJSpeech, the standard deviation is 52.185, so a pitch attribute of 1.25 results in a pitch shift up of approximately 1.25 * 52.185 ≈ 65.23 Hz. Riva also supports the prosody pitch tags from the SSML specification: x-low, low, medium, high, x-high, and default. The pitch attribute can be expressed in the following formats:

  • pitch="1"

  • pitch="+1.8"

  • pitch="-0.65"

  • pitch="high"

  • pitch="default"
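
Following the arithmetic above, the resulting shift in Hz for a numeric pitch attribute can be sketched as below. The 52.185 value is the LJSpeech standard deviation quoted above, and the range check mirrors the documented [-3, 3] limit:

```python
LJSPEECH_PITCH_STD_HZ = 52.185  # per-speaker value computed during FastPitch training

def pitch_shift_hz(pitch_attr: float, speaker_std_hz: float = LJSPEECH_PITCH_STD_HZ) -> float:
    # Riva rejects out-of-range values: an error is logged and no audio is returned.
    if not -3.0 <= pitch_attr <= 3.0:
        raise ValueError(f"pitch {pitch_attr} outside supported range [-3, 3]")
    # The additive shift is the attribute value times the speaker's
    # pitch standard deviation.
    return pitch_attr * speaker_std_hz

print(round(pitch_shift_hz(1.25), 2))  # ~65.23 Hz, matching the example above
```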

Rate Attribute

Riva supports a percentage-based relative change to the rate. The rate attribute has a range of [25%, 250%]; values outside this range result in an error being logged and no audio returned. Riva also supports the prosody rate tags from the SSML specification: x-low, low, medium, high, x-high, and default. The rate attribute can be expressed in the following formats:

  • rate="35%"

  • rate="+200%"

  • rate="low"

  • rate="default"
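
A small sketch of validating a rate attribute against the documented [25%, 250%] range. Named prosody tags pass through unchanged; this is illustrative only, not Riva's actual validation code:

```python
# Named prosody tags supported by Riva per the section above.
NAMED_RATES = {"x-low", "low", "medium", "high", "x-high", "default"}

def validate_rate(rate_attr: str) -> str:
    # Named prosody tags are accepted as-is.
    if rate_attr in NAMED_RATES:
        return rate_attr
    # Numeric rates are percentages, optionally signed, e.g. "35%" or "+200%".
    if not rate_attr.endswith("%"):
        raise ValueError(f"unsupported rate format: {rate_attr!r}")
    value = float(rate_attr[:-1])
    if not 25.0 <= value <= 250.0:
        raise ValueError(f"rate {rate_attr!r} outside supported range [25%, 250%]")
    return rate_attr

print(validate_rate("35%"))
print(validate_rate("+200%"))
print(validate_rate("low"))
```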

Warning

The pitch attribute currently does not support Hz, st, and % changes. Support is planned for a future Riva release.

For SSML examples with sample audio, refer to the Riva_speech_API_demo notebook section.

Pretrained Models

Task                         Architecture   Language   Dataset    Compatible with TAO Toolkit 3.0-21.08   Compatible with NeMo 1.5.1   Link
Mel Spectrogram Generation   FastPitch      English    LJSpeech   No                                      Yes                          TAO, NeMo
Mel Spectrogram Generation   Tacotron 2     English    LJSpeech   No                                      Yes                          TAO, NeMo
Vocoder                      HiFi-GAN       English    LJSpeech   No                                      Yes                          TAO, NeMo
Vocoder                      WaveGlow       English    LJSpeech   No                                      Yes                          TAO, NeMo