Speech Synthesis (TTS)

Speech Synthesis or Text-to-Speech (TTS) involves turning text into human speech. The NeMo TTS collection currently supports two pipelines for TTS:

1) The two stage pipeline. First, a model is used to generate a mel spectrogram from text. Second, a model is used to generate audio from a mel spectrogram. 2) The “end-to-end” approach that uses one model to generate audio straight from text.

Quick Start:

import soundfile as sf
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")
# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# They then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
# Note vocoder return a batch of audio. In this example, we just take the first and only sample.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)

Note

For an interactive version of the quick start above, refer to the TTS inference notebook that can be found on the github readme.

The full documentation tree is as follows:

Available Models

NeMo supports a variety of models that can be used for TTS. For beginners, we recommend starting with the Tacotron2 + WaveGlow combination. Then we suggest trying using (TalkNet or FastPitch) + HiFiGAN if you want to continue exploring the two stage pipeline, or the FastPitch_HifiGan_E2E model for the end-to-end pipeline.

TTS Models

Model

Base Class

Pretrained Checkpoint

Description

Tacotron2

SpectrogramGenerator

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_tacotron2

LSTM encoder decoder based model that generates spectrograms

GlowTTS

SpectrogramGenerator

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts

Glow-based spectrogram generator

FastSpeech 2

SpectrogramGenerator

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2

Non-autoregressive transformer-based spectrogram generator that predicts duration, energy, and pitch

FastPitch

SpectrogramGenerator

(Coming soon)

Non-autoregressive transformer-based spectrogram generator that predicts duration and pitch

TalkNet

SpectrogramGenerator

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet

Non-autoregressive convolution-based spectrogram generator

WaveGlow

Vocoder

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_waveglow_88m

Glow-based vocoder

SqueezeWave

Vocoder

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_squeezewave

Glow-based vocoder based on WaveGlow but with fewer parameters

UniGlow

Vocoder

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_uniglow

Glow-based vocoder based on WaveGlow but shares 1 set of parameters across all flow steps

MelGAN

Vocoder

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_melgan

GAN-based vocoder

HiFiGAN

Vocoder

https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_hifigan

GAN-based vocoder

FastPitch_HifiGan_E2E

TextToWaveform

(Coming soon)

End-to-end model based on composing FastPitch and HiFiGAN

FastSpeech2_HifiGan_E2E

TextToWaveform

(Coming soon)

End-to-end model based on composing FastSpeech2 and HiFiGAN

Base Classes

Two Stage Pipeline

NeMo TTS has two base classes corresponding to the two stage pipeline:

The SpectrogramGenerator class has two important functions: parse which accepts raw python strings and returns a torch.tensor that represents tokenized text ready to pass to, and generate_spectrogram which accepts a batch of tokenized text and returns a torch.tensor that represents a batch of spectrograms

The Vocoder class has one important functions convert_spectrogram_to_audio which accepts a batch of spectrograms and returns a torch.tensor that represents a batch of raw audio.

End-to-End Pipeline

Correspondingly, NeMo TTS has one base class for the end-to-end pipeline:

Similarly to the SpectrogramGenerator, TextToWaveform implements two functions: parse which accepts raw python strings and returns a torch.tensor that represents tokenized text ready to pass to, and generate_spectrogram which accepts a batch of tokenized text and returns a torch.tensor that represents a batch of audio.

TTS Training

Training of TTS models can be done using the scripts inside the NeMo examples/tts folders. The majority of the TTS YAML configurations should work out of the box with the LJSpeech dataset. If you want to train on other data, it is recommended that you walk through the Tacotron 2 Training notebook. Please pay special attention to the sample rate and FFT parameters for new data.

Some models, like FastSpeech 2, require supplementary data such as phoneme durations, pitches, and energies. For more information about how to preprocess datasets for such models, see the TTS Datasets page.