Speech Synthesis (TTS)

Speech Synthesis, or Text-to-Speech (TTS), is the task of turning text into human speech. The NeMo TTS collection currently supports a two-stage pipeline: first, a spectrogram generator model converts text into a mel spectrogram; second, a vocoder model converts the mel spectrogram into audio.

Quick Start:

import soundfile as sf
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")
# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# They then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
# (detach from the graph and take the single item out of the batch)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
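
Both models are PyTorch modules, so inference can be moved to a GPU in the usual way before running the steps above. A minimal sketch, assuming a CUDA device is available:

# Move both models to the GPU for faster inference (assumes CUDA is available)
spec_generator = spec_generator.eval().to("cuda")
vocoder = vocoder.eval().to("cuda")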

Note

For an interactive version of the quick start above, refer to the TTS inference notebook linked from the README on GitHub.

Available Models

NeMo supports a variety of models that can be used for TTS.

TTS Models

Tacotron2
  Base Class: SpectrogramGenerator
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_tacotron2
  Description: LSTM encoder-decoder model that generates spectrograms

GlowTTS
  Base Class: SpectrogramGenerator
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts
  Description: Glow-based spectrogram generator

WaveGlow
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_waveglow_88m
  Description: Glow-based vocoder

SqueezeWave
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_squeezewave
  Description: Glow-based vocoder derived from WaveGlow but with fewer parameters

UniGlow
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_uniglow
  Description: Glow-based vocoder derived from WaveGlow but sharing one set of parameters across all flow steps

MelGAN
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_melgan
  Description: GAN-based vocoder

HiFiGAN
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_hifigan
  Description: GAN-based vocoder
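
Because all spectrogram generators and vocoders share the same interfaces, the two stages can be mixed and matched, for example GlowTTS with HiFiGAN instead of Tacotron 2 with WaveGlow. A minimal sketch, assuming the from_pretrained() names match the NGC model names shown in the table above (tts_en_glowtts and tts_hifigan):

from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Checkpoint names below are assumed from the NGC URLs in the table above
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_glowtts")
vocoder = Vocoder.from_pretrained("tts_hifigan")

parsed = spec_generator.parse("Spectrogram generators and vocoders can be mixed and matched.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)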

Base Classes

The NeMo TTS collection has two base classes, corresponding to the two stages of the pipeline:

The SpectrogramGenerator class has two important functions: parse, which accepts a raw Python string and returns a torch.tensor of tokenized text, and generate_spectrogram, which accepts a batch of tokenized text and returns a torch.tensor representing a batch of spectrograms.

The Vocoder class has one important function: convert_spectrogram_to_audio, which accepts a batch of spectrograms and returns a torch.tensor representing a batch of raw audio.
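
To make this contract concrete, here is a sketch of the tensor shapes that flow through the pipeline, using the models loaded in the quick start. The exact sizes (for example, the number of mel channels) vary by model and are shown only as illustrative assumptions:

parsed = spec_generator.parse("Hello world.")
# parsed: shape (1, T_text), a batch containing one tokenized string

spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# spectrogram: shape (1, n_mels, T_spec), e.g. n_mels is often 80

audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
# audio: shape (1, T_audio), raw waveform samples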

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) → torch.tensor

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

classmethod list_available_models() → List[PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Returns

List of available pre-trained models.

abstract parse(str_input: str, **kwargs) → torch.tensor

A helper function that accepts a raw Python string and turns it into a tensor. The tensor should have 2 dimensions. The first is the batch dimension, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for all TTS models that generate audio conditioned on a spectrogram

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) → torch.tensor

Accepts a batch of spectrograms and returns a batch of audio

Parameters

spec – A torch tensor representing the spectrograms to be vocoded

Returns

audio

classmethod list_available_models() → List[PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Returns

List of available pre-trained models.
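
Both base classes expose list_available_models(), which can be used to discover valid names to pass to from_pretrained(). A short sketch, assuming each returned PretrainedModelInfo carries a pretrained_model_name attribute:

from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Print every pretrained checkpoint name that from_pretrained() will accept
for model_info in SpectrogramGenerator.list_available_models():
    print(model_info.pretrained_model_name)
for model_info in Vocoder.list_available_models():
    print(model_info.pretrained_model_name)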

Training

Training of TTS models can be done using the scripts inside the NeMo examples/tts folder. The majority of the TTS YAML configurations should work out of the box with the LJSpeech dataset. To train on other data, it is recommended that you walk through the Tacotron 2 Training notebook, paying special attention to the sample rate and FFT parameters for your new data.
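
After training, the resulting .nemo checkpoint can be loaded for inference in the same way as the pretrained checkpoints above. A minimal sketch, assuming a Tacotron 2 model was trained and saved (the file name my_tacotron2.nemo is hypothetical):

from nemo.collections.tts.models import Tacotron2Model

# Restore a locally trained model from its .nemo file
# ("my_tacotron2.nemo" is a placeholder for your own checkpoint path)
spec_generator = Tacotron2Model.restore_from("my_tacotron2.nemo")
spec_generator.eval()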