Speech Synthesis (TTS)
======================

Speech Synthesis, or Text-to-Speech (TTS), is the task of turning text into human speech. The NeMo TTS collection currently supports a two-stage pipeline. First, a model is used to generate a mel spectrogram from text. Second, a model is used to generate audio from a mel spectrogram.

Quick Start:

.. code-block:: python

    import soundfile as sf

    from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

    # Download and load the pretrained tacotron2 model
    spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")
    # Download and load the pretrained waveglow model
    vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

    # All spectrogram generators start by parsing raw strings to a tokenized version of the string
    parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
    # They then take the tokenized string and produce a spectrogram
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    # Finally, a vocoder converts the spectrogram to audio
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    # Save the audio to disk in a file called speech.wav
    # (index [0] takes the first, and only, item out of the returned batch)
    sf.write("speech.wav", audio.to('cpu').numpy()[0], 22050)

.. note::

    For an interactive version of the quick start above, refer to the TTS inference notebook that can be found on the GitHub README.

Available Models
################

NeMo supports a variety of models that can be used for TTS.
.. list-table:: *TTS Models*
   :widths: 5 5 10 25
   :header-rows: 1

   * - Model
     - Base Class
     - Pretrained Checkpoint
     - Description
   * - Tacotron2
     - :class:`SpectrogramGenerator`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_tacotron2
     - LSTM encoder-decoder based model that generates spectrograms
   * - GlowTTS
     - :class:`SpectrogramGenerator`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts
     - Glow-based spectrogram generator
   * - WaveGlow
     - :class:`Vocoder`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_waveglow_88m
     - Glow-based vocoder
   * - SqueezeWave
     - :class:`Vocoder`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_squeezewave
     - Glow-based vocoder based on WaveGlow but with fewer parameters
   * - UniGlow
     - :class:`Vocoder`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_uniglow
     - Glow-based vocoder based on WaveGlow but shares one set of parameters across all flow steps
   * - MelGAN
     - :class:`Vocoder`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_melgan
     - GAN-based vocoder
   * - HiFiGAN
     - :class:`Vocoder`
     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_hifigan
     - GAN-based vocoder

Base Classes
############

The NeMo TTS collection has two base classes corresponding to the two-stage pipeline:

- :class:`SpectrogramGenerator`
- :class:`Vocoder`

The :class:`SpectrogramGenerator` class has two important methods: :py:meth:`parse`, which accepts a raw Python string and returns a ``torch.tensor`` representing the tokenized text, ready to be passed to :py:meth:`generate_spectrogram`, which accepts a batch of tokenized text and returns a ``torch.tensor`` representing a batch of spectrograms.

The :class:`Vocoder` class has one important method, :py:meth:`convert_spectrogram_to_audio`, which accepts a batch of spectrograms and returns a ``torch.tensor`` representing a batch of raw audio.

.. autoclass:: nemo.collections.tts.models.base.SpectrogramGenerator
    :show-inheritance:
    :members:
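The point of these base classes is that any :class:`SpectrogramGenerator` can be paired with any :class:`Vocoder` through the same three calls. The following stdlib-only sketch uses toy stand-in classes (not NeMo code; real NeMo models are loaded with ``from_pretrained`` and return torch tensors) purely to illustrate the shape of that contract:

.. code-block:: python

    # Toy stand-ins that mirror the shape of NeMo's two TTS base classes.
    # These are illustrative stubs, NOT NeMo code.

    class ToySpectrogramGenerator:
        """Mimics SpectrogramGenerator: parse() then generate_spectrogram()."""

        def parse(self, text):
            # Real models tokenize text into a tensor; here chars become ints.
            return [ord(c) for c in text.lower()]

        def generate_spectrogram(self, tokens):
            # Real models produce a [batch, n_mels, time] tensor; here we fake
            # one spectrogram "frame" of 4 mel bins per token.
            return [[t % 4 for _ in range(4)] for t in tokens]

    class ToyVocoder:
        """Mimics Vocoder: convert_spectrogram_to_audio()."""

        def convert_spectrogram_to_audio(self, spec):
            # Real vocoders upsample each spectrogram frame into many audio
            # samples (hop_length samples per frame); here we emit 2 per frame.
            return [bin_val for frame in spec for bin_val in frame[:2]]

    # Any generator pairs with any vocoder via this shared interface.
    spec_gen = ToySpectrogramGenerator()
    vocoder = ToyVocoder()

    parsed = spec_gen.parse("hi")
    spectrogram = spec_gen.generate_spectrogram(tokens=parsed)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    print(len(parsed), len(spectrogram), len(audio))  # 2 tokens -> 2 frames -> 4 samples

Because the quick start above only touches ``parse``, ``generate_spectrogram``, and ``convert_spectrogram_to_audio``, swapping Tacotron 2 for GlowTTS, or WaveGlow for HiFiGAN, is just a different ``from_pretrained`` string.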
.. autoclass:: nemo.collections.tts.models.base.Vocoder
    :show-inheritance:
    :members:

Training
########

TTS models can be trained using the scripts inside the NeMo ``examples/tts`` folder. The majority of the TTS YAML configurations should work out of the box with the LJSpeech dataset. If you want to train on other data, it is recommended that you walk through the Tacotron 2 Training notebook. Please pay special attention to the sample rate and FFT parameters for new data.
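To make the sample-rate caveat concrete, the sketch below works through the frame math. The ``n_fft`` and ``hop_length`` values are assumed Tacotron 2-style defaults for LJSpeech, not values read from any particular NeMo config, so check them against your own YAML:

.. code-block:: python

    # Assumed LJSpeech-style defaults -- verify against your own config.
    sample_rate = 22050   # audio samples per second
    n_fft = 1024          # FFT window size, in samples
    hop_length = 256      # samples between successive spectrogram frames

    # One second of audio becomes roughly sample_rate / hop_length frames.
    frames_per_second = sample_rate / hop_length
    print(round(frames_per_second, 2))  # ~86.13 spectrogram frames per second

    # If a new dataset is recorded at a different rate (e.g. 16 kHz) and the
    # hop_length is kept, the model sees a different time resolution:
    other_rate = 16000
    print(round(other_rate / hop_length, 2))  # ~62.5 frames per second

    # Writing audio out at the wrong sample rate also stretches its duration:
    num_samples = 22050 * 3               # 3 seconds of audio at 22050 Hz
    print(num_samples / 22050)            # played at the training rate: 3.0 s
    print(round(num_samples / 16000, 2))  # played at 16 kHz: 4.13 s (too slow)

This is why a spectrogram generator and vocoder trained with mismatched sample rates or FFT parameters produce distorted audio: the two models disagree about how many samples each spectrogram frame represents.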