Speech Synthesis
The text-to-speech (TTS) pipeline implemented for the Jarvis TTS service combines two neural network models: Tacotron 2 and WaveGlow. Together, these models form a text-to-speech system that enables you to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech.
Model Architectures
Tacotron 2: A modified Tacotron 2 model for mel-spectrogram generation, from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper. Tacotron 2 is a sequence-to-sequence model that generates mel-spectrograms from text. It was originally designed to be used with either a mel-spectrogram inversion algorithm such as the Griffin-Lim algorithm or a neural decoder such as WaveNet.
WaveGlow: A flow-based vocoder from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper. Jarvis uses WaveGlow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. Unlike other neural vocoders, WaveGlow is not auto-regressive, which makes it more performant when running on GPUs.
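To make the gap between frame-level features and audio-rate samples concrete, the following is a minimal arithmetic sketch. The sample rate, hop length, and number of mel bands are assumptions chosen for illustration (values commonly seen in public Tacotron 2 and WaveGlow setups); the source does not state the values Jarvis uses, and the function names are hypothetical.

```python
import math
from typing import Tuple

# Assumed values for illustration only; not taken from the Jarvis documentation.
SAMPLE_RATE_HZ = 22050  # audio samples per second
HOP_LENGTH = 256        # audio samples covered by one mel-spectrogram frame
N_MELS = 80             # mel bands per frame


def mel_shape(duration_s: float) -> Tuple[int, int]:
    """Shape (mel bands, frames) of the spectrogram the first stage would emit."""
    n_frames = math.ceil(duration_s * SAMPLE_RATE_HZ / HOP_LENGTH)
    return (N_MELS, n_frames)


def waveform_samples(n_frames: int) -> int:
    """Number of audio samples the vocoder stage produces from n_frames frames."""
    return n_frames * HOP_LENGTH


if __name__ == "__main__":
    bands, frames = mel_shape(2.0)
    print(bands, frames, waveform_samples(frames))
```

Under these assumptions, the vocoder has to expand each frame into hundreds of samples, which is why a vocoder that is not auto-regressive maps well onto GPUs.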
Services
Jarvis TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, which significantly reduces latency (measured as time to first audio) for large requests.
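The difference between the two modes can be illustrated with a small simulation. This is only a sketch: SimulatedEngine and the request helpers below are hypothetical stand-ins, not part of the Jarvis client API, and the "audio" is placeholder bytes. It counts how many chunks must be synthesized before the caller receives any audio.

```python
from typing import Iterator, Tuple


class SimulatedEngine:
    """Hypothetical stand-in for a TTS engine (not a Jarvis class)."""

    def __init__(self, chunk_chars: int = 20) -> None:
        self.chunk_chars = chunk_chars
        self.chunks_generated = 0  # synthesis work done so far

    def generate(self, text: str) -> Iterator[bytes]:
        """Lazily produce one fake audio chunk per slice of the input text."""
        for start in range(0, len(text), self.chunk_chars):
            piece = text[start:start + self.chunk_chars]
            self.chunks_generated += 1
            yield b"\x00\x00" * len(piece)  # pretend: 2 bytes of audio per character


def batch_request(engine: SimulatedEngine, text: str) -> bytes:
    """Batch mode: return nothing until every chunk has been generated."""
    return b"".join(engine.generate(text))


def streaming_request(engine: SimulatedEngine, text: str) -> Iterator[bytes]:
    """Streaming mode: hand each chunk to the caller as soon as it exists."""
    yield from engine.generate(text)


def chunks_generated_before_first_audio(text: str) -> Tuple[int, int]:
    """Return (batch, streaming) chunk counts at the moment audio first arrives."""
    batch_engine = SimulatedEngine()
    batch_request(batch_engine, text)

    stream_engine = SimulatedEngine()
    next(streaming_request(stream_engine, text), None)  # take the first chunk only

    return batch_engine.chunks_generated, stream_engine.chunks_generated


if __name__ == "__main__":
    print(chunks_generated_before_first_audio("a" * 100))  # prints (5, 1)
```

In this simulation both modes produce the same total audio; streaming simply lets the caller start playback after the first chunk rather than the last.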
Pipeline Configuration
In the simplest use case, you can deploy a TTS model as follows:
jarvis-build speech_synthesis \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<tacotron_nemo_filename> \
/servicemaker-dev/<waveglow_ejrvs_filename>:<encryption_key> \
--name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
<encryption_key> is the encryption key used during the export of the .ejrvs file
<pipeline_name> is an optional user-defined name for the components in the model repository
<tacotron_nemo_filename> is the name of the nemo checkpoint file for Tacotron 2
<waveglow_ejrvs_filename> is the name of the ejrvs file for the universal WaveGlow model
<abbr_file> is the name of the file containing abbreviations and their corresponding expansions
<dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
<jmir_filename> is the Jarvis jmir file that is generated
Upon successful completion of this command, a file named <jmir_filename> is created in the /servicemaker-dev/ folder. If your .ejrvs archives are encrypted, you need to include :<encryption_key> at the end of the JMIR filename and the ejrvs filename; otherwise, this is unnecessary.
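The rule for the optional encryption key can be sketched as a small helper that assembles the argument list for the command shown above. The function name, its parameters, and the example file names are hypothetical; only the jarvis-build subcommand, positional order, and flags come from the command in this section. The helper builds the arguments and does not run anything.

```python
from typing import List, Optional


def build_speech_synthesis_command(
    jmir_filename: str,
    tacotron_nemo_filename: str,
    waveglow_ejrvs_filename: str,
    abbr_file: str,
    dictionary_file: str,
    pipeline_name: Optional[str] = None,
    encryption_key: Optional[str] = None,
    base_dir: str = "/servicemaker-dev",
) -> List[str]:
    """Assemble the jarvis-build arguments (hypothetical helper, not a Jarvis API)."""
    # The key suffix is appended only to the JMIR and ejrvs paths, and only
    # when the archives are encrypted.
    suffix = f":{encryption_key}" if encryption_key else ""
    cmd = [
        "jarvis-build",
        "speech_synthesis",
        f"{base_dir}/{jmir_filename}{suffix}",
        f"{base_dir}/{tacotron_nemo_filename}",
        f"{base_dir}/{waveglow_ejrvs_filename}{suffix}",
    ]
    if pipeline_name:  # --name is optional
        cmd.append(f"--name={pipeline_name}")
    cmd.append(f"--abbreviations_file={base_dir}/{abbr_file}")
    cmd.append(f"--arpabet_file={base_dir}/{dictionary_file}")
    return cmd


if __name__ == "__main__":
    # Placeholder file names, chosen only for illustration.
    args = build_speech_synthesis_command(
        "tts.jmir", "tacotron2.nemo", "waveglow.ejrvs",
        "abbr.txt", "dict.txt", encryption_key="my_key",
    )
    print(" ".join(args))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed to a process launcher without shell quoting concerns.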