Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
Checkpoints
There are two main ways to load pretrained checkpoints in NeMo, as described in Checkpoints:
Using the restore_from() method to load a local checkpoint file (.nemo), or
Using the from_pretrained() method to download and set up a checkpoint from NGC.
Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. To resume an unfinished training experiment, use the Experiment Manager and set the resume_if_exists flag to True.
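The two loading paths described above can be wrapped in one small helper. This is only a sketch: load_checkpoint is a hypothetical name, not part of NeMo, and it assumes the model class passed in exposes restore_from(restore_path=...) and from_pretrained(model_name=...) as shown on this page.

```python
def load_checkpoint(model_cls, name_or_path):
    """Load a checkpoint from a local .nemo file or by NGC model name.

    model_cls is any model class exposing restore_from() and
    from_pretrained(), e.g. nemo_tts.models.FastPitchModel.
    (load_checkpoint itself is a hypothetical helper, not a NeMo API.)
    """
    if str(name_or_path).endswith(".nemo"):
        # A path ending in .nemo is treated as a local checkpoint file
        return model_cls.restore_from(restore_path=str(name_or_path))
    # Anything else is treated as an NGC pretrained model name
    return model_cls.from_pretrained(model_name=name_or_path)
```

With NeMo installed, a call such as load_checkpoint(nemo_tts.models.FastPitchModel, "tts_en_fastpitch") would take the NGC path, while passing a path ending in .nemo would take the local path.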
Local Checkpoints
Save Model Checkpoints: NeMo automatically saves final model checkpoints with the .nemo suffix. You can also manually save any model checkpoint using model.save_to(<checkpoint_path>.nemo).
Load Model Checkpoints: to load a checkpoint saved at <path/to/checkpoint/file.nemo>, use the restore_from() method below, where <MODEL_BASE_CLASS> is the TTS model class of the original checkpoint.
import nemo.collections.tts as nemo_tts
model = nemo_tts.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
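Before restoring a local checkpoint, it can help to look at what the file contains. The sketch below assumes that a .nemo file is a tar archive bundling the model configuration and weights. That packaging detail is not stated on this page, so verify it against your NeMo version. The function name list_nemo_contents is hypothetical.

```python
import tarfile


def list_nemo_contents(checkpoint_path):
    """Return the member names stored inside a .nemo archive.

    Assumption: the .nemo file is a (possibly compressed) tar archive.
    """
    # mode "r:*" lets tarfile detect the compression transparently
    with tarfile.open(checkpoint_path, mode="r:*") as archive:
        return archive.getnames()
```

If the assumption holds, calling list_nemo_contents("<path/to/checkpoint/file.nemo>") returns the file names packed into the checkpoint without loading the model.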
NGC Pretrained Checkpoints
The NGC NeMo Text to Speech collection aggregates model cards that contain detailed information about checkpoints of various models trained on various datasets. The tables below in Checkpoints list some of the TTS models available from NGC, including speech/text aligners, acoustic models, and vocoders.
Load Model Checkpoints
The models can be accessed via the from_pretrained() method of the TTS model class. In general, you can load any of these models with code in the following format:
import nemo.collections.tts as nemo_tts
model = nemo_tts.models.<MODEL_BASE_CLASS>.from_pretrained(model_name="<MODEL_NAME>")
where <MODEL_NAME> is the value in the Model Name column of the tables in Checkpoints. These names are predefined in each model's list_available_models() method. For example, the available NGC FastPitch model names can be listed as follows:
In [1]: import nemo.collections.tts as nemo_tts
In [2]: nemo_tts.models.FastPitchModel.list_available_models()
Out[2]:
[PretrainedModelInfo(
pretrained_model_name=tts_en_fastpitch,
description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is ARPABET-based.,
location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo,
class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
),
PretrainedModelInfo(
pretrained_model_name=tts_en_fastpitch_ipa,
description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is IPA-based.,
location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo,
class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
),
PretrainedModelInfo(
pretrained_model_name=tts_en_fastpitch_multispeaker,
description=This model is trained on HiFITTS sampled at 44100Hz with and can be used to generate male and female English voices with an American accent.,
location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_fastpitch_multispeaker.nemo,
class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
),
PretrainedModelInfo(
pretrained_model_name=tts_de_fastpitch_singlespeaker,
description=This model is trained on a single male speaker data in OpenSLR Neutral German Dataset sampled at 22050Hz and can be used to generate male German voices.,
location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.10.0/files/tts_de_fastpitch_align.nemo,
class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
),
PretrainedModelInfo(
pretrained_model_name=tts_de_fastpitch_multispeaker_5,
description=This model is trained on 5 speakers in HUI-Audio-Corpus-German clean subset sampled at 44100Hz with and can be used to generate male and female German voices.,
location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_fastpitch_multispeaker_5.nemo,
class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
)]
From the key-value pair pretrained_model_name=tts_en_fastpitch above, you can take the model name tts_en_fastpitch and load it by running:
model = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")
To list the models available for a particular base class programmatically, use the list_available_models() method:
nemo_tts.models.<MODEL_BASE_CLASS>.list_available_models()
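The list returned by list_available_models() can be reduced to plain model names, for example to keep only one locale. This is a minimal sketch: pretrained_names is a hypothetical helper, and it relies only on the pretrained_model_name field visible in the output shown earlier.

```python
def pretrained_names(model_infos, prefix=""):
    """Return the pretrained_model_name of each entry, optionally filtered by prefix.

    model_infos is the list returned by <MODEL_BASE_CLASS>.list_available_models().
    (pretrained_names is a hypothetical helper, not a NeMo API.)
    """
    return [
        info.pretrained_model_name
        for info in model_infos
        if info.pretrained_model_name.startswith(prefix)
    ]
```

For instance, pretrained_names(nemo_tts.models.FastPitchModel.list_available_models(), prefix="tts_de_") would keep only the German FastPitch entries from the listing above.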
Inference and Audio Generation
NeMo TTS supports both cascaded and end-to-end models for audio synthesis. Most of the steps are the same, except that cascaded models need to load an additional vocoder model before generating audio. The code snippet below demonstrates how to generate an audio sample from a text input using the cascaded FastPitch and HiFi-GAN models. Please refer to the NeMo TTS Collection API for detailed implementations of the model classes.
import soundfile as sf
import nemo.collections.tts as nemo_tts
# Load mel spectrogram generator
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")
# Load vocoder
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan")
# Generate audio: text -> tokens -> mel spectrogram -> waveform
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
# Save the audio to disk in a file called speech.wav
# audio is batched (batch x time): detach it and write the first (only) item
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
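The snippet above writes the file at 22050Hz, which matches the sampling rate listed for tts_en_fastpitch and tts_en_hifigan in the tables on this page. When pairing other models, a small check like the following sketch can catch a mismatch early. The lookup table is copied by hand from those tables and covers only four models. Both the table name and the function name are hypothetical.

```python
# Sampling rates (Hz) copied from the NGC TTS Models tables on this page.
# Only a few entries are included here for illustration.
SAMPLING_RATES_HZ = {
    "tts_en_fastpitch": 22050,
    "tts_en_fastpitch_multispeaker": 44100,
    "tts_en_hifigan": 22050,
    "tts_en_hifitts_hifigan_ft_fastpitch": 44100,
}


def shared_sampling_rate(spec_generator_name, vocoder_name):
    """Return the common sampling rate of a model pair, or raise if they differ."""
    spec_rate = SAMPLING_RATES_HZ[spec_generator_name]
    vocoder_rate = SAMPLING_RATES_HZ[vocoder_name]
    if spec_rate != vocoder_rate:
        raise ValueError(
            f"{spec_generator_name} ({spec_rate}Hz) and "
            f"{vocoder_name} ({vocoder_rate}Hz) use different sampling rates"
        )
    return spec_rate
```

The returned value can then be passed to sf.write in place of the hardcoded 22050.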
Fine-Tuning on Different Datasets
Multiple TTS tutorials are provided in the tutorials/tts/ directory. Most of them demonstrate how to instantiate a pre-trained model and prepare it for fine-tuning on datasets in the same or a different language, with the same or different speakers.
cross-lingual fine-tuning: https://github.com/NVIDIA/NeMo/tree/stable/tutorials/tts/FastPitch_GermanTTS_Training.ipynb
cross-speaker fine-tuning: https://github.com/NVIDIA/NeMo/tree/stable/tutorials/tts/FastPitch_Finetuning.ipynb
NGC TTS Models
This section lists the NeMo TTS models that have been released in the NGC NeMo Text to Speech Collection. You can download the model checkpoints you are interested in with either of the commands below:
wget '<CHECKPOINT_URL_IN_THE_TABLE>'
curl -LO '<CHECKPOINT_URL_IN_THE_TABLE>'
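The same download can be scripted in Python with the standard library. The sketch below assumes checkpoint URLs follow the .../versions/<version>/files/<file>.nemo layout seen in the list_available_models() output earlier on this page. All three function names are hypothetical, and download_checkpoint needs network access.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def checkpoint_filename(url):
    """Return the file name at the end of a checkpoint URL."""
    return posixpath.basename(urlparse(url).path)


def checkpoint_version(url):
    """Return the path segment that follows 'versions' in a checkpoint URL."""
    parts = urlparse(url).path.split("/")
    return parts[parts.index("versions") + 1]


def download_checkpoint(url, dest_dir="."):
    """Download the checkpoint into dest_dir and return the local path."""
    dest = os.path.join(dest_dir, checkpoint_filename(url))
    urlretrieve(url, dest)
    return dest
```

The path returned by download_checkpoint can be passed directly to restore_from(restore_path=...).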
Speech/Text Aligners
Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
---|---|---|---|---|---|---|---|---|
en-US | tts_en_radtts_aligner | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.aligner.AlignerModel | | |
en-US | tts_en_radtts_aligner_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.aligner.AlignerModel | | |
Mel-Spectrogram Generators
Locale | Model Name | Dataset | Sampling Rate | #Spk | Symbols | Model Class | Overview | Checkpoint |
---|---|---|---|---|---|---|---|---|
en-US | tts_en_fastpitch | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
en-US | tts_en_fastpitch_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
en-US | tts_en_fastpitch_multispeaker | HiFiTTS | 44100Hz | 10 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
en-US | tts_en_lj_mixertts | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | | |
en-US | tts_en_lj_mixerttsx | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | | |
en-US | RAD-TTS | TBD | TBD | TBD | ARPABET | nemo.collections.tts.models.radtts.RadTTSModel | TBD | |
en-US | tts_en_tacotron2 | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.tacotron2.Tacotron2Model | | |
de-DE | tts_de_fastpitch_multispeaker_5 | HUI Audio Corpus German | 44100Hz | 5 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2102 | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2210 | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
es | tts_es_fastpitch_multispeaker | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
zh-CN | tts_zh_fastpitch_sfspeech | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | pinyin | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
Vocoders
Locale | Model Name | Spectrogram Generator | Dataset | Sampling Rate | #Spk | Model Class | Overview | Checkpoint |
---|---|---|---|---|---|---|---|---|
en-US | tts_en_hifigan | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
en-US | tts_en_lj_hifigan_ft_mixertts | Mixer-TTS | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
en-US | tts_en_lj_hifigan_ft_mixerttsx | Mixer-TTS-X | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
en-US | tts_en_hifitts_hifigan_ft_fastpitch | FastPitch | HiFiTTS | 44100Hz | 10 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
en-US | tts_en_lj_univnet | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | | |
en-US | tts_en_libritts_univnet | librosa.filters.mel | LibriTTS | 24000Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | | |
en-US | tts_en_waveglow_88m | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.waveglow.WaveGlowModel | | |
de-DE | tts_de_hui_hifigan_ft_fastpitch_multispeaker_5 | FastPitch | HUI Audio Corpus German | 44100Hz | 5 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2102 | FastPitch | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2210 | FastPitch | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
es | tts_es_hifigan_ft_fastpitch_multispeaker | FastPitch | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
zh-CN | tts_zh_hifigan_sfspeech | FastPitch | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
End2End models
Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
---|---|---|---|---|---|---|---|---|
en-US | tts_en_lj_vits | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.vits.VitsModel | | |
en-US | tts_en_hifitts_vits | HiFiTTS | 44100Hz | 10 | IPA | nemo.collections.tts.models.vits.VitsModel | | |
Codec models
Model Name | Dataset | Sampling Rate | Model Class | Overview | Checkpoint |
---|---|---|---|---|---|
audio_codec_16khz_small | Libri-Light | 16000Hz | nemo.collections.tts.models.AudioCodecModel | | |
mel_codec_22khz_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
mel_codec_44khz_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | | |
mel_codec_22khz_fullband_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
mel_codec_44khz_fullband_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | | |