Speech Synthesis

The text-to-speech (TTS) pipeline implemented for the Riva TTS service is based on two neural network models: Tacotron 2 and WaveGlow. Together, they form a text-to-speech system that synthesizes natural-sounding speech from raw transcripts, without additional information such as patterns or rhythms of speech.

Model Architectures

FastPitch: A non-autoregressive, transformer-based spectrogram generator that predicts duration and pitch, from the paper "FastPitch: Parallel Text-to-speech with Pitch Prediction". FastPitch is the recommended spectrogram generator. It is a fully parallel text-to-speech model based on FastSpeech and conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and the generated speech can be further controlled through those predicted contours. FastPitch can therefore change the perceived emotional state of the speaker or put emphasis on certain lexical units.

HiFi-GAN: A GAN-based vocoder from the paper "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis". HiFi-GAN is the recommended vocoder architecture; it achieves both efficient and high-fidelity speech synthesis.

Tacotron 2: A modified Tacotron 2 model for mel-spectrogram generation, from the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions". Tacotron 2 is a sequence-to-sequence model that generates mel-spectrograms from text. It was originally designed to be used with either a mel-spectrogram inversion algorithm, such as the Griffin-Lim algorithm, or a neural decoder, such as WaveNet.

WaveGlow: A flow-based vocoder from the paper "WaveGlow: A Flow-based Generative Network for Speech Synthesis". Riva uses WaveGlow as the neural vocoder, which converts frame-level acoustic features into a waveform at audio rates. Unlike autoregressive neural vocoders such as WaveNet, WaveGlow is not autoregressive, which makes it more performant when running on GPUs.

Services

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.

Converting FastPitch and HiFi-GAN from NeMo

Convert a checkpoint from NeMo into a .riva file by running a NeMo container. The following example downloads a HiFi-GAN model from NGC onto the host system and uses a shared volume (the -v option) to expose the downloaded model to the container.

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
    -v $(pwd):/NeMo \
    --shm-size=8g \
    -p 8888:8888 \
    -p 6006:6006 \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --device=/dev/snd \
    nvcr.io/nvidia/nemo:1.1.0

After the container has launched, run:

pip3 install nvidia-pyindex
pip3 install nemo2riva-1.3.0_beta-py3-none-any.whl
nemo2riva --out /NeMo/hifi.riva /NeMo/tts_hifigan.nemo
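
The commands above convert only the HiFi-GAN checkpoint. When several checkpoints need converting, the same nemo2riva call can be repeated per file. The following is a minimal sketch, not part of the Riva tooling: riva_name is a hypothetical helper that derives the output path from the checkpoint path, tts_fastpitch.nemo is an assumed file name, and the loop only prints the commands (a dry run) rather than running them.

```shell
# Hypothetical helper: map a .nemo checkpoint path to a .riva path
# in the same directory by swapping the file extension.
riva_name() {
    printf '%s\n' "${1%.nemo}.riva"
}

# Dry run: print one nemo2riva command per checkpoint.
# tts_fastpitch.nemo is an assumed file name used for illustration only.
for ckpt in /NeMo/tts_fastpitch.nemo /NeMo/tts_hifigan.nemo; do
    echo "nemo2riva --out $(riva_name "$ckpt") $ckpt"
done
```

Note that this sketch names each output after its checkpoint, whereas the example above writes to hifi.riva; either naming works as long as the same file name is later passed to riva-build.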

FastPitch and HiFi-GAN Pipeline Configuration

Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:

riva-build speech_synthesis \
        /servicemaker-dev/<rmir_filename>:<encryption_key> \
        /servicemaker-dev/<FastPitch_riva>:<encryption_key> \
        /servicemaker-dev/<HiFi-GAN_riva>:<encryption_key> \
        --name=<pipeline_name> \
        --abbreviations_file=/servicemaker-dev/<abbr_file>

where:

  • <encryption_key> is the encryption key used during the export of the .riva file

  • <pipeline_name> is an optional user-defined name for the components in the model repository

  • <FastPitch_riva> is the name of the .riva file for FastPitch

  • <HiFi-GAN_riva> is the name of the .riva file for HiFi-GAN

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions

  • <rmir_filename> is the name of the Riva RMIR file that is generated

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, append :<encryption_key> to the RMIR filename and to each .riva filename; otherwise, omit it.
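
Because the :<encryption_key> suffix is needed only for encrypted archives, a small wrapper can add it conditionally. The sketch below is an illustration under stated assumptions: with_key is a hypothetical helper, RIVA_KEY is a hypothetical environment variable, and the file names are placeholders. It prints the riva-build command instead of running it.

```shell
# Hypothetical helper: append ":<key>" to a path only when a key is given.
with_key() {
    if [ -n "${2:-}" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# RIVA_KEY is an assumed variable name; empty means the archives are unencrypted.
KEY="${RIVA_KEY:-}"

# Dry run: print the command rather than executing it.
# The file names below are placeholders for illustration.
echo riva-build speech_synthesis \
    "$(with_key /servicemaker-dev/tts.rmir "$KEY")" \
    "$(with_key /servicemaker-dev/fastpitch.riva "$KEY")" \
    "$(with_key /servicemaker-dev/hifigan.riva "$KEY")"
```

With RIVA_KEY unset, the printed paths carry no suffix; with it set, each path ends in a colon followed by the key.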

Tacotron 2 and WaveGlow Pipeline Configuration

In the simplest use case, you can deploy a Tacotron 2 and WaveGlow TTS pipeline as follows:

riva-build speech_synthesis \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<tacotron_nemo_filename> \
    /servicemaker-dev/<waveglow_riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --abbreviations_file=/servicemaker-dev/<abbr_file> \
    --arpabet_file=/servicemaker-dev/<dictionary_file>

where:

  • <encryption_key> is the encryption key used during the export of the .riva file

  • <pipeline_name> is an optional user-defined name for the components in the model repository

  • <tacotron_nemo_filename> is the name of the .nemo checkpoint file for Tacotron 2

  • <waveglow_riva_filename> is the name of the .riva file for the universal WaveGlow model

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions

  • <dictionary_file> is the name of the file containing the pronunciation dictionary that maps words to their phonetic representation in ARPABET

  • <rmir_filename> is the name of the Riva RMIR file that is generated

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, append :<encryption_key> to the RMIR filename and to each .riva filename; otherwise, omit it.
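
This pipeline takes four input files (two models, an abbreviations file, and a pronunciation dictionary), so it can help to confirm that all of them are present before starting the build. The following is a minimal sketch of such a pre-flight check; check_inputs is a hypothetical helper, not a Riva command, and the file names in the usage comment are placeholders.

```shell
# Hypothetical helper: report every argument that is not a readable
# regular file, and return non-zero if at least one is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (placeholder file names):
# check_inputs /servicemaker-dev/tacotron2.nemo \
#              /servicemaker-dev/waveglow.riva \
#              /servicemaker-dev/abbr.txt \
#              /servicemaker-dev/dictionary.txt || exit 1
```

Pass plain paths to the check, without the :<encryption_key> suffix, since the suffix is not part of the file name on disk.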

Pretrained Models

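| Task                       | Architecture | Language | Dataset  | Compatibility with TAO Toolkit 3.0-21.08 | Compatibility with NeMo 1.0.0b4 | Link |
|----------------------------|--------------|----------|----------|------------------------------------------|---------------------------------|------|
| Mel Spectrogram Generation | Tacotron 2   | English  | LJSpeech | No                                       | Yes                             | RIVA |
| Mel Spectrogram Generation | FastPitch    | English  | LJSpeech | No                                       | Yes                             | RIVA |
| Vocoder                    | WaveGlow     | English  | LJSpeech | No                                       | No                              | RIVA |
| Vocoder                    | HiFi-GAN     | English  | LJSpeech | No                                       | No                              | NeMo |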