Speech Synthesis (TTS)

Speech Synthesis, or Text-to-Speech (TTS), is the task of turning text into human speech. The NeMo TTS collection currently supports a two-stage pipeline: first, a spectrogram generator model converts text into a mel spectrogram; second, a vocoder model converts the mel spectrogram into audio.

Quick Start:

import soundfile as sf
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")
# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# They then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
# (detach from the graph and take the single item out of the batch)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
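
Both models are PyTorch modules, so inference can be moved to a GPU in the usual way before running the steps above. A minimal sketch, assuming a CUDA device is available:

# Move both models to the GPU for faster inference (assumes CUDA is available)
spec_generator = spec_generator.eval().to("cuda")
vocoder = vocoder.eval().to("cuda")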

Note

For an interactive version of the quick start above, refer to the TTS inference notebook linked from the README on GitHub.

Available Models

NeMo supports a variety of models that can be used for TTS.

TTS Models

Tacotron2
  Base Class: SpectrogramGenerator
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_tacotron2
  Description: LSTM encoder-decoder model that generates spectrograms

GlowTTS
  Base Class: SpectrogramGenerator
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts
  Description: Glow-based spectrogram generator

WaveGlow
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_waveglow_88m
  Description: Glow-based vocoder

SqueezeWave
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_squeezewave
  Description: Glow-based vocoder derived from WaveGlow but with fewer parameters

UniGlow
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_uniglow
  Description: Glow-based vocoder derived from WaveGlow but sharing one set of parameters across all flow steps

MelGAN
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_melgan
  Description: GAN-based vocoder

HiFiGAN
  Base Class: Vocoder
  Pretrained Checkpoint: https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_hifigan
  Description: GAN-based vocoder
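
Because all spectrogram generators and vocoders share the same interfaces, the two stages can be mixed and matched, for example GlowTTS with HiFiGAN instead of Tacotron 2 with WaveGlow. A minimal sketch, assuming the from_pretrained() names match the NGC model names shown in the table above (tts_en_glowtts and tts_hifigan):

from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Checkpoint names below are assumed from the NGC URLs in the table above
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_glowtts")
vocoder = Vocoder.from_pretrained("tts_hifigan")

parsed = spec_generator.parse("Spectrogram generators and vocoders can be mixed and matched.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)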

Base Classes

The NeMo TTS collection has two base classes, corresponding to the two stages of the pipeline:

The SpectrogramGenerator class has two important functions: parse, which accepts a raw Python string and returns a torch.tensor of tokenized text, and generate_spectrogram, which accepts a batch of tokenized text and returns a torch.tensor representing a batch of spectrograms.

The Vocoder class has one important function: convert_spectrogram_to_audio, which accepts a batch of spectrograms and returns a torch.tensor representing a batch of raw audio.
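
To make this contract concrete, here is a sketch of the tensor shapes that flow through the pipeline, using the models loaded in the quick start. The exact sizes (for example, the number of mel channels) vary by model and are shown only as illustrative assumptions:

parsed = spec_generator.parse("Hello world.")
# parsed: shape (1, T_text), a batch containing one tokenized string

spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# spectrogram: shape (1, n_mels, T_spec), e.g. n_mels is often 80

audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
# audio: shape (1, T_audio), raw waveform samples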

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) → torch.tensor

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

classmethod list_available_models() → List[PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Returns

List of available pre-trained models.

abstract parse(str_input: str, **kwargs) → torch.tensor

A helper function that accepts a raw Python string and turns it into a tensor. The tensor should have 2 dimensions. The first is the batch dimension, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for all TTS models that generate audio conditioned on a spectrogram

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) → torch.tensor

Accepts a batch of spectrograms and returns a batch of audio

Parameters

spec – A torch tensor representing the spectrograms to be vocoded

Returns

audio

classmethod list_available_models() → List[PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Returns

List of available pre-trained models.
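
Both base classes expose list_available_models(), which can be used to discover valid names to pass to from_pretrained(). A short sketch, assuming each returned PretrainedModelInfo carries a pretrained_model_name attribute:

from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

# Print every pretrained checkpoint name that from_pretrained() will accept
for model_info in SpectrogramGenerator.list_available_models():
    print(model_info.pretrained_model_name)
for model_info in Vocoder.list_available_models():
    print(model_info.pretrained_model_name)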

Training

Training of TTS models can be done using the scripts inside the NeMo examples/tts folder. The majority of the TTS YAML configurations should work out of the box with the LJSpeech dataset. To train on other data, it is recommended that you walk through the Tacotron 2 Training notebook, paying special attention to the sample rate and FFT parameters for your new data.
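
After training, the resulting .nemo checkpoint can be loaded for inference in the same way as the pretrained checkpoints above. A minimal sketch, assuming a Tacotron 2 model was trained and saved (the file name my_tacotron2.nemo is hypothetical):

from nemo.collections.tts.models import Tacotron2Model

# Restore a locally trained model from its .nemo file
# ("my_tacotron2.nemo" is a placeholder for your own checkpoint path)
spec_generator = Tacotron2Model.restore_from("my_tacotron2.nemo")
spec_generator.eval()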