Text-to-Speech¶
Abstract: This NVIDIA Jarvis Text-To-Speech (TTS) 0.2 Early Access (EA) User Guide provides step-by-step instructions for training and deploying your model, as well as for using the TTS service with Jarvis. TTS is an application based on the Tacotron 2 + Waveglow model and represents a full text-to-speech pipeline that is GPU accelerated, with optimized performance and accuracy in terms of generated speech quality. Specifically, TTS takes as input an English-language text string and one of a set of available speaker voices, and returns an audio waveform of that voice speaking the string.
Introduction¶
NVIDIA Jarvis Text-To-Speech (TTS) is an application based on the Tacotron 2 + Waveglow model, served with Triton Inference Server, with a focus on performance and ease of use. Specifically, TTS represents a full text-to-speech pipeline that is GPU accelerated, with optimized performance and accuracy in terms of generated speech quality. We use Triton Inference Server as the serving framework for this speech synthesis model.
Benefits Of Jarvis TTS¶
The Triton Inference Server implementation of the text-to-speech pipeline based on the Tacotron 2 + Waveglow model provides the following benefits:
Easy to use
Multiple models can be trained in cases where you want to provide multiple voices and/or languages in the same server; simply choose a language/voice pair and a text to be synthesized.
Fast
Because Jarvis is an NVIDIA product that leverages GPUs, the Triton Inference Server text-to-speech pipeline achieves state-of-the-art performance. This implementation is a reference for users who want to efficiently implement text-to-speech pipelines on GPUs.
Accurate
The quality of speech generated by the models we provide is state-of-the-art among TTS systems that use neural vocoders.
Modular
Even though this implementation of text-to-speech uses the Tacotron 2 + Waveglow neural network, it is modular such that you can easily replace one or many components of this pipeline.
Tacotron 2 + Waveglow¶
The TTS pipeline implemented for the Jarvis TTS service is based on Tacotron 2 and Waveglow.
This TTS system is a combination of two neural network models:
A modified Tacotron 2 model from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper.
A flow-based neural network model from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper.
Tacotron 2 is a sequence-to-sequence model that generates mel-spectrograms from text and was originally designed to be used either with a mel-spectrogram inversion algorithm such as the Griffin-Lim algorithm or a neural decoder such as Wavenet.
In our implementation, we use Waveglow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. One advantage of Waveglow when compared to other recently proposed neural vocoders is that it is not auto-regressive, which makes it more performant when running on GPUs.
The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech. Both models are based on the implementations in the NVIDIA repositories GitHub: Tacotron 2 and GitHub: WaveGlow.
Modules¶
The TTS framework is based on the Tacotron 2 + Waveglow model and can be separated into the following major components:
Table x: Major components of TTS
Component Description
Encoder model Converts a sequence of characters and/or phonemes into a spectral representation.
**Tacotron 2** - The Tacotron 2 model is a recurrent sequence-to-sequence model that predicts mel-spectrograms from the text. The encoder transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the neural decoder that produces one spectrogram frame at a time.
Neural vocoder Responsible for converting from a spectral representation to a time-domain speech signal.
**Waveglow** - The WaveGlow model is a flow-based generative model that generates audio samples from Gaussian distribution using mel-spectrogram conditioning. During training, the model learns to transform the dataset distribution into spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution.
Post-processing Responsible for processing the output signal. Waveglow has a post-processing denoising module that improves speech quality. Other examples of post-processing modules include a speech codec or a sample rate converter.
**Denoiser** - The Waveglow model can use an optional denoising step that improves speech quality by removing background noise from the generated speech. This denoising filter is specific for each Waveglow model and can be configured to perform weak or strong denoising.
Training A Model With Your Data¶
Both Tacotron 2 and Waveglow can be trained using Neural Modules (NeMo). Before training on your own data, ensure you follow the step-by-step NeMo documentation on training Tacotron 2 on the LJSpeech dataset. Then, to train on your own data:
Resample all your audio to one consistent sample rate using a tool like sox, if necessary.
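If you prefer to resample from Python instead of sox, the following is a minimal sketch; the use of librosa and soundfile, the file names, and the 22050 Hz target rate are illustrative assumptions rather than requirements of the NeMo workflow.

import librosa
import soundfile

# Resample a single clip to a common rate (22050 Hz here, matching the
# rate used by the LJSpeech recipe; adjust to your own dataset).
target_sr = 22050
audio, _ = librosa.load("clip_0001.wav", sr=target_sr)  # load and resample
soundfile.write("clip_0001_22k.wav", audio, target_sr, subtype="PCM_16")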
Create a manifest file that describes your data in JSON lines format. Each line should look similar to the following:
{"audio_filepath": "<PATH TO WAV>", "duration": <duration in seconds>, "text": "WAV TRANSCRIPTION"}
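As a concrete illustration, a manifest in this format could be produced with a short Python script such as the sketch below; the file names, transcripts, and use of soundfile to measure durations are illustrative assumptions rather than part of the NeMo tooling.

import json
import soundfile

# Hypothetical (wav path, transcription) pairs for your dataset.
clips = [
    ("clips/clip_0001_22k.wav", "THE TRANSCRIPTION OF THE FIRST CLIP"),
    ("clips/clip_0002_22k.wav", "THE TRANSCRIPTION OF THE SECOND CLIP"),
]

with open("manifest.json", "w") as manifest:
    for wav_path, text in clips:
        info = soundfile.info(wav_path)            # sample rate and frame count
        duration = info.frames / info.samplerate   # duration in seconds
        entry = {"audio_filepath": wav_path, "duration": duration, "text": text}
        manifest.write(json.dumps(entry) + "\n")   # one JSON object per line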
Split the .json file into train and eval subsets. For example, you could shuffle your file and then split it into a training and evaluation subset as follows:
shuf manifest.json > manifest_shuffled.json
head -n <number of training files> manifest_shuffled.json > train.json
tail -n <number of evaluation files> manifest_shuffled.json > eval.json
Where <number of training files> and <number of evaluation files> add up to the total number of samples in your dataset (typical values are 95% and 5% of the total number of samples, respectively).
Run the Tacotron 2 training command:*
python tacotron2.py --train_dataset=<train.json> --eval_datasets <eval.json> --model_config=configs/tacotron.yaml --max_steps=30000
Optional: Run the Waveglow training command:*
python waveglow.py --train_dataset=<train.json> --eval_datasets <eval.json> --model_config=configs/waveglow.yaml --num_epochs=1500
* These commands run training on a single GPU. For multi-GPU runs, refer to the NeMo TTS documentation.
Exporting The Checkpoints From NeMo To Jarvis¶
When using models trained in NeMo with Jarvis, you must convert the checkpoints into a format consumable by the Jarvis TTS service. The conversion first exports the PyTorch checkpoints to JSON and ONNX files, which are then built into TensorRT engine plans. This is needed in order to get the best performance from our optimized TensorRT implementation of Tacotron 2 and Waveglow.
The Docker container nvcr.io/ea-2-jarvis/jarvis-model-tool:ea2 contains the tools required to perform this conversion.
docker run --gpus device=<your GPU device> --rm \
    -v <absolute path to Tacotron2 NeMo folder>:/tacotron \
    -v <absolute path to Waveglow NeMo folder>:/waveglow \
    -v <absolute path to output folder>:/models \
    jarvis-model-tool \
    ./convert/scripts/tts/export_nemo_models.sh --tacotron_dir /tacotron \
        --waveglow_dir /waveglow /models
Where <your GPU device> is the device ID as output by nvidia-smi. You should choose the same device as the one that will be used for deployment, since TensorRT engines are optimized for a specific device.
The TensorRT engine files tacotron2.eng, waveglow.eng, and denoiser.eng are stored in the provided output path.
Generating The Triton Inference Server Model Repository¶
There are two ways to generate a Triton Inference Server model repository depending on the source of your model:
You can use one of our pre-trained models from NGC, or
You can use a custom model fine-tuned with Neural Modules (NeMo).
Creating A Model Repository Using A Pre-Trained NGC Model¶
To deploy the TTS models and generate a model repository configured for the TTS service, first download the quick start scripts from NGC. The config.sh configuration file should be edited to configure the deployment. To disable services other than TTS, simply set:
service_enabled_asr=false
service_enabled_nlp=false
service_enabled_tts=true
service_enabled_vision=false
Users should also set the ID of the GPU to use and the type of GPU being used (t4 for compute capability 7.5 GPUs and v100 for compute capability 7.0 GPUs). To deploy the pre-built streaming TTS model, the following TTS model configuration should be used:
models_tts=(
"ea-2-jarvis::jarvis_tts_ljspeech:config_streaming_prebuilt_{GPU_TYPE}.yaml:ea2"
)
After the quickstart/config.sh file is properly configured, generate the Triton model repository by running:
quickstart/jarvis_init.sh
For more information, refer to Models Available For Deployment in the Jarvis AI Services Quick Start Guide.
Creating A Model Repository Using A Fine-Tuned Model From NeMo¶
To generate a model repository using the models exported from NeMo, move the tacotron2.eng, waveglow.eng, and denoiser.eng files into a subdirectory called nemo at the path where you want to generate the model repository.
Note that when using a fine-tuned NeMo model, the variable jarvis_model_loc in the quickstart/config.sh file must be a local folder, such as /tmp/jarvis. For example, if you are generating the model repository at /tmp/jarvis, run the following commands:
NEMO_MODEL_DIR=/tmp/jarvis/nemo/jarvis_tts_nemo_streaming/1/
mkdir -p $NEMO_MODEL_DIR
cp <path to tacotron2.eng> $NEMO_MODEL_DIR/tacotron2.eng
cp <path to waveglow.eng> $NEMO_MODEL_DIR/waveglow.eng
cp <path to denoiser.eng> $NEMO_MODEL_DIR/denoiser.eng
Then, modify quickstart/config.sh by replacing the configuration file for your model as follows:
models_tts=(
"ea-2-jarvis::jarvis_tts_nemo:config_streaming_nemo.yaml:ea2"
)
Finally, generate the model repository by running:
quickstart/jarvis_init.sh
Deploying Your Model¶
Whether you generated the Triton Inference Server model repository from a pre-trained NGC model or from a fine-tuned model trained in NeMo, the deployment process is the same.
To deploy your model, you can choose from the following:
You can launch a Docker container manually, or
You can use a Helm chart to deploy on Kubernetes.
Deploying A Model Using A Docker Container¶
To deploy your model using a Docker container, perform the steps in Quick Deployment Using Quick Start Scripts in the Jarvis Services Quick Start Guide.
Deploying A Model Using A Helm Chart¶
The Helm chart provided for Jarvis is responsible for downloading model artifacts (if necessary), setting up a model repository, and launching the required services. The Using Helm To Deploy Jarvis AI Services on Kubernetes section in the Jarvis Services Quick Start Guide describes in detail how to retrieve the Helm chart from NGC and how to install it.
When deploying to Kubernetes via Helm, it is possible to disable components that are not required. If Jarvis services other than TTS are not required, modify the values.yaml file before installing the Helm chart.
If NLP and/or TTS is not required, set jarvis.speechServices.[nlp|tts] = false in values.yaml. Optionally, you may remove any subset of NLP-related and/or TTS-related models from modelRepoGenerator.ngcModelConfigs while keeping the services enabled. Models that are not needed (due to a service being disabled) will not be downloaded and installed.
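As an optional illustration of the same setting, the sketch below shows how these flags could be toggled programmatically from Python; it assumes PyYAML is installed and that values.yaml exposes the jarvis.speechServices keys exactly as described above.

import yaml  # PyYAML, assumed to be installed

with open("values.yaml") as f:
    values = yaml.safe_load(f)

# Keep only the TTS speech service enabled; key layout assumed from the
# jarvis.speechServices.[nlp|tts] settings described above.
values["jarvis"]["speechServices"]["nlp"] = False
values["jarvis"]["speechServices"]["tts"] = True

with open("values.yaml", "w") as f:
    yaml.safe_dump(values, f, default_flow_style=False)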
If deploying fine-tuned models, configure modelTemplateVolume to map to a persistent storage device. This volume will be made available to the trtis-model-repo container in /templates.
When building your custom model deployments, use absolute paths including /templates to link to model artifacts stored in this persistent volume. Concretely, the yaml file used for the model generator should be stored in /templates/<name of your model>/config.yaml, along with any other model artifacts. These config paths are then specified in values.yaml in the localModelConfigs array.
Using The Jarvis TTS Service¶
There are two ways users can interact with the Jarvis TTS service: through the gRPC API, from any language with gRPC bindings, or through the provided Python API.
Interacting With The Jarvis TTS Service Using The gRPC API¶
Client applications interact with the Jarvis TTS Service using the gRPC protocol which supports multiple programming languages. For more information on the API, refer to the TTS API document.
We provide protobuf files so you can generate bindings for your language of choice. These files are located in the /work/src/jarvis_proto directory in the jarvis-api-client container. For more information, refer to the gRPC documentation for the respective programming language. We also provide pre-generated bindings for Python in the same folder.
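If you want to regenerate the Python bindings yourself, or use this as a template for another language's toolchain, a minimal sketch using the grpcio-tools package follows; the proto file name jarvis_tts.proto and the output directory are assumptions for illustration.

import os
from grpc_tools import protoc  # pip install grpcio-tools

PROTO_DIR = "/work/src/jarvis_proto"  # proto files inside the jarvis-api-client container
OUT_DIR = "./jarvis_bindings"         # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

# Invoke protoc to generate the message classes (*_pb2.py) and the
# gRPC service stubs (*_pb2_grpc.py) for the TTS proto definition.
protoc.main([
    "grpc_tools.protoc",
    f"-I{PROTO_DIR}",
    f"--python_out={OUT_DIR}",
    f"--grpc_python_out={OUT_DIR}",
    os.path.join(PROTO_DIR, "jarvis_tts.proto"),  # assumed proto file name
])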
Interacting With The Jarvis TTS Service Using The Python API¶
To interact with the Jarvis TTS Service using Python, use the provided Python bindings, which come pre-installed in the jarvis-api-client container or can be installed from the Python wheel available from NGC as part of the Jarvis Quick Start scripts:
pip install jarvis_api-0.10.0_ea2-py3-none-any.whl
The following sample code shows how to interact with the Jarvis TTS Service using its gRPC interface.
import numpy as np
import soundfile
import grpc
import jarvis_api.jarvis_tts_pb2 as jtts
import jarvis_api.jarvis_tts_pb2_grpc as jtts_srv
import jarvis_api.audio_pb2 as ja
# Establish connection to Jarvis API server and TTS service
jarvis_api_uri = 'localhost:50051'
channel = grpc.insecure_channel(jarvis_api_uri)
jarvis_tts = jtts_srv.JarvisTTSStub(channel)
# Create TTS request
req = jtts.SynthesizeSpeechRequest()
req.text = "Is it recognize speech or wreck a nice beach?"
req.language_code = "en-US" # currently required to be "en-US"
req.encoding = ja.AudioEncoding.LINEAR_PCM # only PCM encoding supported
req.sample_rate_hz = 22050  # ignored, audio returned will be 22.05 kHz
req.voice_name = "ljspeech"
response = jarvis_tts.Synthesize(req)
# Extract samples from response object and save as WAV file
audio_samples = np.frombuffer(response.audio, dtype=np.float32)
soundfile.write('jarvis_tts_output.wav', audio_samples, 22050, 'PCM_16')
# Or, request using the streaming API
responses = jarvis_tts.SynthesizeOnline(req)
samples = []
for resp in responses:
    samples.append(np.frombuffer(resp.audio, dtype=np.float32))
soundfile.write('jarvis_tts_output_streaming.wav', np.concatenate(samples), 22050, 'PCM_16')
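The streaming SynthesizeOnline call returns the audio in chunks as they become available, which allows a client to start playback before the full utterance has been synthesized; the example above simply collects the chunks and concatenates them before writing a single WAV file.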
Integrating TTS With Jarvis¶
Within the jarvis-api-client container, refer to the Jupyter notebook located at /work/notebooks/Jarvis_AI_services_demo.ipynb for an example of how to integrate the TTS service with Jarvis. To run it, launch the jarvis-api-client container with bash jarvis_start_client.sh and run the following command inside the container:
jupyter notebook --ip=0.0.0.0 --allow-root --notebook-dir=/work/notebooks
Then, follow the link shown on the screen to access the notebook in your browser.
Troubleshooting And Support¶