Speech Synthesis

The text-to-speech (TTS) pipeline implemented for the Jarvis TTS service is based on Tacotron 2 and WaveGlow.

Together, these two neural network models form a text-to-speech system that synthesizes natural-sounding speech from raw transcripts, without any additional information such as the patterns or rhythms of speech.

Model Architectures

Tacotron 2: A modified Tacotron 2 model for mel-spectrogram generation, from the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions".

Tacotron 2 is a sequence-to-sequence model that generates mel-spectrograms from text and was originally designed to be used either with a mel-spectrogram inversion algorithm such as the Griffin-Lim algorithm or a neural decoder such as WaveNet.

WaveGlow: A flow-based vocoder from the paper "WaveGlow: A Flow-based Generative Network for Speech Synthesis".

Jarvis uses WaveGlow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. Unlike autoregressive neural vocoders such as WaveNet, WaveGlow is not autoregressive: it can generate audio samples in parallel, which makes it faster when running on GPUs.
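To make the gap between frame rate and audio rate concrete, the following minimal sketch estimates how many waveform samples a vocoder has to produce for a given number of mel-spectrogram frames. The hop length of 256 samples and the sample rate of 22,050 Hz are assumptions (values commonly used with LJSpeech-trained Tacotron 2 models), not settings stated in this document, and the exact output length of a real vocoder may differ slightly.

```python
def frames_to_samples(num_frames: int, hop_length: int = 256) -> int:
    """Approximate number of audio samples covered by num_frames mel frames.

    hop_length is an assumed value, not a documented Jarvis setting.
    """
    return num_frames * hop_length


def samples_to_seconds(num_samples: int, sample_rate: int = 22050) -> float:
    """Duration in seconds of num_samples at the (assumed) sample rate."""
    return num_samples / sample_rate


if __name__ == "__main__":
    frames = 500  # hypothetical length of a mel-spectrogram
    samples = frames_to_samples(frames)
    print(f"{frames} frames -> {samples} samples "
          f"({samples_to_seconds(samples):.2f} s of audio)")
```

Under these assumptions each frame expands into hundreds of samples, which is why a vocoder that generates samples in parallel benefits from GPU execution.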

Services

Jarvis TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
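The difference between the two modes can be illustrated with a toy sketch. Nothing here is the Jarvis client API: synthesize_chunks is a hypothetical stand-in that emits fake audio (one zero-valued sample per character) so that the control flow of a batch request and a streaming request can be compared.

```python
from typing import Iterator, List


def synthesize_chunks(text: str, chunk_words: int = 3) -> Iterator[List[float]]:
    """Hypothetical synthesizer: yields one fake audio chunk per group of words."""
    words = text.split()
    for start in range(0, len(words), chunk_words):
        group = words[start:start + chunk_words]
        # Fake audio: one silent sample per character in the group.
        yield [0.0] * sum(len(word) for word in group)


def batch_request(text: str) -> List[float]:
    """Batch mode: return nothing until every chunk has been generated."""
    audio: List[float] = []
    for chunk in synthesize_chunks(text):
        audio.extend(chunk)
    return audio


def streaming_request(text: str) -> Iterator[List[float]]:
    """Streaming mode: hand each chunk to the caller as soon as it exists."""
    yield from synthesize_chunks(text)


if __name__ == "__main__":
    text = "one two three four five six seven"
    stream = streaming_request(text)
    first_chunk = next(stream)  # available after only the first group is processed
    print("first streamed chunk:", len(first_chunk), "samples")
    print("full batch result:", len(batch_request(text)), "samples")
```

In the streaming case the caller can start playback after the first chunk, while the batch caller waits for the whole utterance; the total audio produced is the same.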

Pipeline Configuration

In the simplest use case, you can deploy a TTS model as follows:

jarvis-build speech_synthesis \
    /servicemaker-dev/<jmir_filename>:<encryption_key>  \
    /servicemaker-dev/<tacotron_nemo_filename> \
    /servicemaker-dev/<waveglow_ejrvs_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --abbreviations_file=/servicemaker-dev/<abbr_file> \
    --arpabet_file=/servicemaker-dev/<dictionary_file>

where:

  • <encryption_key> is the encryption key used during the export of the .ejrvs file.

  • <pipeline_name> is an optional user-defined name for the components in the model repository.

  • <tacotron_nemo_filename> is the name of the NeMo checkpoint file for Tacotron 2.

  • <waveglow_ejrvs_filename> is the name of the .ejrvs file for the universal WaveGlow model.

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions.

  • <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET.

  • <jmir_filename> is the name of the Jarvis JMIR file that will be generated.

Upon successful completion of this command, a file named <jmir_filename> is created in the /servicemaker-dev/ folder. If your .ejrvs archives are encrypted, you need to append :<encryption_key> to both the JMIR filename and the .ejrvs filename. Otherwise, the key can be omitted.
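As an illustration of the encryption-key rule, the following sketch assembles the argument list for the command shown above, appending the key only when one is supplied. The helper names (with_key, build_command) and the file names in the usage example are hypothetical; only the jarvis-build subcommand and the --name, --abbreviations_file, and --arpabet_file options come from the command in this section.

```python
import shlex
from typing import List, Optional


def with_key(path: str, key: Optional[str]) -> str:
    """Append ':<encryption_key>' to a path only when a key is given."""
    return f"{path}:{key}" if key else path


def build_command(jmir: str, tacotron_nemo: str, waveglow_ejrvs: str,
                  abbreviations_file: str, arpabet_file: str,
                  key: Optional[str] = None,
                  name: Optional[str] = None) -> List[str]:
    """Build the jarvis-build argument list in the order used in this section."""
    cmd = [
        "jarvis-build", "speech_synthesis",
        with_key(jmir, key),            # output JMIR file
        tacotron_nemo,                  # the command above passes no key here
        with_key(waveglow_ejrvs, key),  # WaveGlow .ejrvs archive
    ]
    if name:
        cmd.append(f"--name={name}")
    cmd.append(f"--abbreviations_file={abbreviations_file}")
    cmd.append(f"--arpabet_file={arpabet_file}")
    return cmd


if __name__ == "__main__":
    # All file names below are hypothetical placeholders.
    cmd = build_command(
        "/servicemaker-dev/tts.jmir",
        "/servicemaker-dev/tacotron2.nemo",
        "/servicemaker-dev/waveglow.ejrvs",
        abbreviations_file="/servicemaker-dev/abbr.txt",
        arpabet_file="/servicemaker-dev/dictionary.txt",
        key=None,  # unencrypted archives: no ':<encryption_key>' suffix
    )
    print(shlex.join(cmd))
```

Passing a key adds the suffix to the JMIR and .ejrvs arguments only; leaving it as None produces the unencrypted form of the command.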

Pretrained Models

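Task                        Architecture  Language  Dataset   Compatibility with TLT 3.0  Compatibility with Nemo 1.0.0b4  Link
--------------------------  ------------  --------  --------  --------------------------  -------------------------------  -----
Mel Spectrogram Generation  Tacotron2     English   LJSpeech  No                          Yes                              EJRVS
Vocoder                     Waveglow      English   LJSpeech  No                          No                               EJRVS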