How to train Riva TTS models (FastPitch and HiFiGAN) with TAO Toolkit#

This tutorial walks you through the steps to train Riva TTS models (FastPitch and HiFiGAN) from scratch on the LJSpeech dataset using the TAO Toolkit.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will customize the Riva TTS pipeline by training Riva TTS models with NVIDIA’s TAO Toolkit.

NVIDIA TAO Toolkit Overview#

NVIDIA Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for transfer learning that takes purpose-built pre-trained AI models and customizes them on your own data. TAO enables developers with limited AI expertise to create highly accurate AI models for production deployments.
TAO follows a zero-coding paradigm: there is no need to write any code to train models with TAO. Training can be done by running just a few commands with the TAO command-line interface.

Riva supports fine-tuning with TAO. The fine-tuned TAO model can easily be deployed for real-time inference on the Riva Speech Skills server.

For more information about the NVIDIA TAO framework, refer to the TAO Toolkit documentation.

Text to Speech#

Text to Speech (TTS) is often the last step in building a conversational AI model. A TTS model converts text into audible speech. The main objective is to synthesize reasonable and natural speech for a given text. Since there are no universal standards to measure the quality of synthesized speech, you will need to listen to some synthesized samples to tell whether a TTS model is well trained.

The TTS pipeline consists of two models: FastPitch and HiFi-GAN.

  • FastPitch is a spectrogram model that generates a Mel spectrogram from text input

  • HiFi-GAN is a vocoder model that generates audio output from the Mel spectrograms produced by FastPitch


TTS using TAO#

In this tutorial, we will train Riva TTS models (FastPitch and HiFi-GAN) on the LJSpeech dataset from scratch.

Installing and setting up TAO#

Install TAO inside a Python virtual environment. We recommend performing this step first and then launching the tutorial from the virtual environment.

In addition to installing the TAO Python package, ensure you meet the following software requirements (a quick verification sketch follows the list):

  1. python 3.8.13

  2. docker-ce > 19.03.5

  3. docker-API 1.40

  4. nvidia-container-toolkit > 1.3.0-1

  5. nvidia-container-runtime > 3.4.0-1

  6. nvidia-docker2 > 2.5.0-1

  7. nvidia-driver >= 470.57
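
You can verify several of these requirements from the notebook before proceeding. The checks below are a minimal sketch and assume that docker and nvidia-smi are already on your PATH:

# Optional environment checks: Python, Docker, and the NVIDIA driver version.
! python3 --version
! docker --version
! nvidia-smi --query-gpu=driver_version --format=csv,noheader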

Installing TAO is a simple pip install.

! pip install nvidia-tao

After installing TAO, the next step is to set up the mounts for TAO. The TAO launcher uses Docker containers under the hood, and for our data and results directories to be visible to Docker, they need to be mapped. The launcher can be configured using the config file ~/.tao_mounts.json. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO launcher.

Replace the FIXME variables with the required paths enclosed in "" as a string.

IMPORTANT NOTE: The following code creates a sample ~/.tao_mounts.json file. Here, we can map directories in which we save the data, specs, results, and cache. You should configure it for your specific use case so these directories are correctly visible to the Docker container.

# please define these paths on your local host machine
import os

os.environ["HOST_DATA_DIR"] = FIXME
os.environ["HOST_SPECS_DIR"] = FIXME
os.environ["HOST_RESULTS_DIR"] = FIXME
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR
# Mapping the local directories into the TAO Docker container.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tao_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tao_configs, mfile, indent=4)

You can check the Docker image versions and the tasks they perform by issuing tao --help or:

! tao info --verbose

Set Relevant Paths#

# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here:
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key and use the same key for all commands:
KEY = 'tlt_encode'

The command structure for the TAO interface can be broken down as follows: tao <task name> <subcommand>

Let’s see this in further detail.

Downloading Specs#

TAO’s conversational AI toolkit works off spec files, which make it easy to edit hyperparameters on the fly. You may choose to modify/rewrite these specs, or even override them individually through the launcher. You can download the default spec files by using the download_specs command.

The -o argument indicates the folder where the default specification files will be downloaded. The -r argument instructs the script on where to save the logs. Ensure the -o points to an empty folder.

# download spec files for FastPitch
! tao spectro_gen download_specs \
    -r $RESULTS_DIR/spectro_gen \
    -o $SPECS_DIR/spectro_gen
# download spec files for HiFiGAN
! tao vocoder download_specs \
    -r $RESULTS_DIR/vocoder \
    -o $SPECS_DIR/vocoder

Download Data#

In this tutorial we will use the popular LJSpeech dataset. Let’s download it.

! wget -O $HOST_DATA_DIR/ljspeech.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

After downloading, untar the dataset and move it to the correct directory.

! tar -xvf $HOST_DATA_DIR/ljspeech.tar.bz2
! rm -rf $HOST_DATA_DIR/ljspeech
! mv LJSpeech-1.1 $HOST_DATA_DIR/ljspeech

Using your own dataset#

If you want to use your own dataset, you’ll have to organize it following the LJSpeech format.

Pre-Processing#

This step downloads audio-to-text file lists from NVIDIA for LJSpeech and generates the manifest files. If you use your own dataset, you’ll have to generate the three files ljs_audio_text_train_filelist.txt, ljs_audio_text_val_filelist.txt, and ljs_audio_text_test_filelist.txt yourself. Those files correspond to your train / val / test split. For each text file, the number of rows should be equal to the number of samples in that split. Each row should look similar to:

DUMMY/<file_name>.wav|<text_of_the_audio>

An example row is:

DUMMY/LJ045-0096.wav|Mrs. De Mohrenschildt thought that Oswald,
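
If you are starting from an LJSpeech-style metadata.csv (pipe-delimited rows of ID|transcription|normalized transcription), a minimal Python sketch like the one below could produce the three filelists. The paths, file names used for writing, and split sizes here are illustrative assumptions, not part of the TAO tooling.

# Hypothetical helper: build train/val/test filelists from an LJSpeech-style metadata.csv.
# Adjust data_dir and the split sizes for your own dataset.
import random

data_dir = "/path/to/your_dataset"   # expected to contain wavs/ and metadata.csv
rows = []
with open(f"{data_dir}/metadata.csv", encoding="utf-8") as f:
    for line in f:
        file_id, _, normalized_text = line.rstrip("\n").split("|")
        rows.append(f"DUMMY/{file_id}.wav|{normalized_text}")

random.seed(0)
random.shuffle(rows)
n_val, n_test = 100, 100             # assumed split sizes
splits = {
    "ljs_audio_text_val_filelist.txt": rows[:n_val],
    "ljs_audio_text_test_filelist.txt": rows[n_val:n_val + n_test],
    "ljs_audio_text_train_filelist.txt": rows[n_val + n_test:],
}
for name, lines in splits.items():
    with open(name, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")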

After placing those three files in your data_dir, run the following command as you would for the LJSpeech dataset:

Be patient! This step can take several minutes.

! tao spectro_gen dataset_convert \
    -e $SPECS_DIR/spectro_gen/dataset_convert_ljs.yaml \
    -r $RESULTS_DIR/spectro_gen/dataset_convert \
    data_dir=$DATA_DIR/ljspeech \
    dataset_name=ljspeech

Training#

The TAO interface enables you to configure the training parameters from the command-line interface.

The process of opening the training script, finding the parameters of interest (which might be spread across multiple files), and making the changes needed is replaced by a simple command-line interface.

For example, if you need to modify the number of epochs along with a change in the learning rate, you can add trainer.max_epochs=10 and optim.lr=0.02 to the command and train the model. Sample commands are given below.

For training TTS models in TAO, we use the tao spectro_gen train and tao vocoder train commands with the following arguments:

  • `-e`: Path to the spec file
  • `-g`: Number of GPUs to use
  • `-r`: Path to the results folder
  • `-k`: User specified encryption key to use while saving/loading the model
  • Any overrides to the spec file. For example, `trainer.max_epochs`.

NOTE: In order to get a TTS pipeline, you need to train BOTH FastPitch (spectro_gen) and HiFi-GAN (vocoder). For HiFi-GAN, since it is largely universal for a given language, you might just download pretrained weights from NGC, which will give you good performance.

Training FastPitch#

# A prior is needed for FastPitch training. If an empty folder is provided, the prior will be generated on the fly.
! mkdir -p $RESULTS_DIR/spectro_gen/train/prior_folder

If you provided an empty prior folder, this may take some time.

!tao spectro_gen train \
     -e $SPECS_DIR/spectro_gen/train.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/spectro_gen/train \
     train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
     validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \
     prior_folder=$RESULTS_DIR/spectro_gen/train/prior_folder \
     trainer.max_epochs=5

Training HiFi-GAN#

Instead of passing trainer.max_epochs, HiFi-GAN requires the definition of trainer.max_steps. Defining trainer.max_epochs for HiFi-GAN has no effect.

!tao vocoder train \
     -e $SPECS_DIR/vocoder/train.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/vocoder/train \
     train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
     validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \
     trainer.max_steps=10000

TTS model export#

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The same command used for exporting to ONNX can be used here; the only small variation is the export_format configuration in the spec file.

Export to RIVA#

!tao spectro_gen export \
     -e $SPECS_DIR/spectro_gen/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/spectro_gen/export \
     export_format=RIVA \
     export_to=spectro_gen.riva
!tao vocoder export \
     -e $SPECS_DIR/vocoder/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/vocoder/export \
     export_format=RIVA \
     export_to=vocoder.riva

Export to ONNX (not needed for Riva deployment)#

!tao spectro_gen export \
     -e $SPECS_DIR/spectro_gen/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/spectro_gen/export \
     export_format=ONNX \
     export_to=spectro_gen.eonnx
!tao vocoder export \
     -e $SPECS_DIR/vocoder/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/vocoder/export \
     export_format=ONNX \
     export_to=vocoder.eonnx

TTS Inference with TAO Toolkit#

In this section, we are going to run inference on the trained TTS models. As previously mentioned, since there are no universal standards to measure the quality of synthesized speech, you will need to listen to some synthesized samples to tell whether a TTS model is well trained. Therefore, TAO Toolkit provides only infer functionality for TTS, not evaluate functionality.

The inference in the following cells is not optimized for real-time performance. For real-time inference and best latency, you should deploy this model using Riva. Refer to the How to deploy custom TTS models (FastPitch and HiFiGAN) trained with TAO Toolkit on Riva tutorial.

TTS Inference with TLT checkpoint#

In this section, we will run inference on the .tlt checkpoint trained with TAO Toolkit.

Generate spectrogram#

The first step for inference is generating a spectrogram. That’s a NumPy array (saved as a .npy file) for a sentence which can be converted to voice by a vocoder. We use the FastPitch model we just trained to generate a spectrogram.

You may have to work with the infer.yaml file to set the texts you want for inference.

!tao spectro_gen infer \
     -e $SPECS_DIR/spectro_gen/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/spectro_gen/infer \
     output_path=$RESULTS_DIR/spectro_gen/infer/spectro
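
As an optional sanity check, you can load and visualize one of the generated spectrograms from the host. This is a minimal sketch assuming NumPy and Matplotlib are installed; the file name 0.npy is an assumption, so list the output folder to find the actual names.

# Optional: inspect one generated spectrogram (file name is an assumption).
import os
import numpy as np
import matplotlib.pyplot as plt

spec_path = os.environ["HOST_RESULTS_DIR"] + "/spectro_gen/infer/spectro/0.npy"
spec = np.squeeze(np.load(spec_path))
print("Spectrogram shape:", spec.shape)

plt.figure(figsize=(10, 4))
plt.imshow(spec, aspect="auto", origin="lower")
plt.xlabel("Frames")
plt.ylabel("Mel bins")
plt.title("Generated Mel spectrogram")
plt.show()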

Generate sound file#

The second step for inference is generating a .wav sound file based on a spectrogram you generated in the previous step.

!tao vocoder infer \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/vocoder/infer \
     input_path=$RESULTS_DIR/spectro_gen/infer/spectro \
     output_path=$RESULTS_DIR/vocoder/infer/wav
import os
import IPython.display as ipd
# change path of the file here
ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer/wav/0.wav')

Debug#

If the above sound file does not have good quality, you probably need to first figure out whether it’s a FastPitch or a HiFi-GAN problem, and then retrain or fine-tune the problematic model. For this purpose, you can download a pretrained HiFi-GAN from NVIDIA NGC and (1) generate the spectrogram with your trained FastPitch, and (2) generate the .wav file with the NVIDIA pretrained HiFi-GAN. If the .wav file generated in this manner is good, you know your HiFi-GAN is not well trained. Otherwise, the problem is with FastPitch.
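
A minimal sketch of that check is shown below. It assumes you have the NGC CLI configured; the NGC model path, version, and downloaded checkpoint file name are placeholders, not the actual identifiers, so substitute the ones you find on NGC.

# Download a pretrained HiFi-GAN checkpoint from NGC (model path, version, and file name below are placeholders).
! ngc registry model download-version "nvidia/tao/speechsynthesis_hifigan:trainable_v1.0" \
    --dest $HOST_RESULTS_DIR/pretrained_hifigan

# Re-run vocoder inference on the spectrograms from your trained FastPitch,
# this time with the pretrained HiFi-GAN checkpoint instead of your own.
!tao vocoder infer \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/pretrained_hifigan/hifigan.tlt \
     -r $RESULTS_DIR/vocoder/infer_pretrained \
     input_path=$RESULTS_DIR/spectro_gen/infer/spectro \
     output_path=$RESULTS_DIR/vocoder/infer_pretrained/wav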

TTS Inference using ONNX#

TAO Toolkit also provides the capability to run inference with the exported .eonnx model. The commands are very similar to the inference commands for .tlt models. Again, the inputs in the spec file are just for demo purposes; you may choose to try out your own custom input.

Generate spectrogram#

The first step for inference is generating a spectrogram. That’s a NumPy array (saved as a .npy file) for a sentence which can be converted to voice by a vocoder. We use the FastPitch model we just trained to generate a spectrogram.

You may have to work with the infer.yaml file to set the texts you want for inference.

!tao spectro_gen infer_onnx \
     -e $SPECS_DIR/spectro_gen/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/export/spectro_gen.eonnx \
     -r $RESULTS_DIR/spectro_gen/infer_onnx \
     output_path=$RESULTS_DIR/spectro_gen/infer_onnx/spectro

Generate the Sound File#

The second step for inference is generating a .wav sound file based on the spectrogram you generated in the previous step.

!tao vocoder infer_onnx \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/export/vocoder.eonnx \
     -r $RESULTS_DIR/vocoder/infer_onnx \
     input_path=$RESULTS_DIR/spectro_gen/infer_onnx/spectro \
     output_path=$RESULTS_DIR/vocoder/infer_onnx/wav

If everything works properly, the .wav file below should sound exactly the same as the .wav file in the previous section.

import os
import IPython.display as ipd
# change path of the file here
ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_onnx/wav/0.wav')
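
If you want a rough numerical check in addition to listening, a small sketch like the following compares the two generated files sample by sample. It assumes SciPy is installed and that both runs produced a 0.wav file; small numerical differences can still be inaudible.

# Optional: compare the ONNX-generated audio with the .tlt-generated audio.
import os
import numpy as np
from scipy.io import wavfile

rate_tlt, wav_tlt = wavfile.read(os.environ["HOST_RESULTS_DIR"] + "/vocoder/infer/wav/0.wav")
rate_onnx, wav_onnx = wavfile.read(os.environ["HOST_RESULTS_DIR"] + "/vocoder/infer_onnx/wav/0.wav")

assert rate_tlt == rate_onnx, "Sample rates differ"
n = min(len(wav_tlt), len(wav_onnx))
diff = np.abs(wav_tlt[:n].astype(np.float64) - wav_onnx[:n].astype(np.float64))
print("Max absolute sample difference:", diff.max())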

What’s Next?#

You can use TAO to build custom models for your own applications, or you can deploy the custom model to NVIDIA Riva.