Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
NeMo TTS API
Model Classes
Mel-Spectrogram Generators
- class nemo.collections.tts.models.FastPitchModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable, nemo.collections.tts.parts.mixins.fastpitch_adapter_mixins.FastPitchAdapterModelMixin
FastPitch model (https://arxiv.org/abs/2006.06873) that is used to generate mel spectrograms from text.
- property disabled_deployment_input_names
Implement this method to return a set of input names disabled for export
- generate_spectrogram(tokens: torch.tensor, speaker: Optional[int] = None, pace: float = 1.0, reference_spec: Optional[torch.tensor] = None, reference_spec_lens: Optional[torch.tensor] = None) → torch.tensor
Accepts a batch of text or text_tokens and returns a batch of spectrograms
- Parameters
tokens – A torch tensor representing the text to be generated
- Returns
spectrograms
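A minimal sketch of how the documented `parse` and `generate_spectrogram` calls fit together. The helper `synthesize_spectrogram` and the class `_StubFastPitch` are hypothetical: the stub only mimics the documented method signatures so that the snippet runs without NeMo installed, and its return value is a placeholder rather than a real spectrogram.

```python
from typing import Any, Optional


def synthesize_spectrogram(model: Any, text: str,
                           speaker: Optional[int] = None,
                           pace: float = 1.0) -> Any:
    """Hypothetical helper: raw text -> tokens -> mel spectrogram.

    `model` is any object exposing the documented `parse` and
    `generate_spectrogram` methods, for example a loaded FastPitchModel.
    """
    tokens = model.parse(text)  # documented as a 2-D tensor with batch size 1
    return model.generate_spectrogram(tokens=tokens, speaker=speaker, pace=pace)


class _StubFastPitch:
    """Stand-in (not NeMo code) that records the arguments it receives."""

    def __init__(self) -> None:
        self.calls = []

    def parse(self, str_input: str, normalize: bool = True):
        # Toy tokenization: one integer per character, shape (1, T).
        return [[ord(ch) for ch in str_input]]

    def generate_spectrogram(self, tokens, speaker=None, pace=1.0,
                             reference_spec=None, reference_spec_lens=None):
        self.calls.append({"tokens": tokens, "speaker": speaker, "pace": pace})
        # Placeholder result describing the input instead of a real spectrogram.
        return {"batch": len(tokens), "tokens": len(tokens[0])}


stub = _StubFastPitch()
result = synthesize_spectrogram(stub, "hi", speaker=3, pace=0.9)
```

With NeMo installed, a loaded FastPitchModel instance could be passed in place of the stub; the arguments are passed by keyword to match the documented parameter names.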
- input_example(max_batch=1, max_dim=44)
Generates input examples for tracing etc.
- Returns
A tuple of input examples.
- property input_types
Define these to enable input neural type checks
- interpolate_speaker(original_speaker_1, original_speaker_2, weight_speaker_1, weight_speaker_2, new_speaker_id)
This method performs speaker interpolation between two original speakers that the model was trained on.
- Inputs:
original_speaker_1: Integer speaker ID of the first existing speaker in the model
original_speaker_2: Integer speaker ID of the second existing speaker in the model
weight_speaker_1: Floating-point weight applied to the first speaker during weight combination
weight_speaker_2: Floating-point weight applied to the second speaker during weight combination
new_speaker_id: Integer speaker ID of the new interpolated speaker in the model
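One way to picture the weight combination is as a weighted sum of two speaker embedding vectors, stored under a new speaker ID. The sketch below illustrates only that arithmetic. It is an assumption that the method combines embeddings this way, the function `interpolate_embeddings` is hypothetical, and plain Python lists stand in for the model's embedding rows.

```python
from typing import Dict, List


def interpolate_embeddings(table: Dict[int, List[float]],
                           original_speaker_1: int,
                           original_speaker_2: int,
                           weight_speaker_1: float,
                           weight_speaker_2: float,
                           new_speaker_id: int) -> None:
    """Hypothetical illustration: store w1 * e1 + w2 * e2 under a new speaker ID."""
    e1 = table[original_speaker_1]
    e2 = table[original_speaker_2]
    table[new_speaker_id] = [
        weight_speaker_1 * a + weight_speaker_2 * b for a, b in zip(e1, e2)
    ]


# Toy two-dimensional "embeddings" for speakers 0 and 1.
speaker_table = {0: [1.0, 0.0], 1: [0.0, 2.0]}
interpolate_embeddings(speaker_table, 0, 1, 0.5, 0.5, new_speaker_id=2)
```

The parameter names mirror the documented inputs so that the correspondence with `interpolate_speaker` is easy to follow.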
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
- parse(str_input: str, normalize=True) → torch.tensor
A helper function that accepts a raw Python string and turns it into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.
Note that some models have a normalize parameter in this function, which applies the normalizer if one is available.
- class nemo.collections.tts.models.MixerTTSModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable
Mixer-TTS and Mixer-TTS-X models (https://arxiv.org/abs/2110.03584) that are used to generate mel spectrograms from text.
- generate_spectrogram(tokens: Optional[torch.Tensor] = None, tokens_len: Optional[torch.Tensor] = None, lm_tokens: Optional[torch.Tensor] = None, raw_texts: Optional[List[str]] = None, norm_text_for_lm_model: bool = True, lm_model: str = 'albert')
Accepts a batch of text or text_tokens and returns a batch of spectrograms
- Parameters
tokens – A torch tensor representing the text to be generated
- Returns
spectrograms
- property input_types
Define these to enable input neural type checks
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
- parse(text: str, normalize=True) → torch.Tensor
A helper function that accepts a raw Python string and turns it into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.
Note that some models have a normalize parameter in this function, which applies the normalizer if one is available.
- class nemo.collections.tts.models.RadTTSModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable
- generate_spectrogram()
- parse()
- class nemo.collections.tts.models.Tacotron2Model(*args: Any, **kwargs: Any)
Bases:
nemo.collections.tts.models.base.SpectrogramGenerator
Tacotron 2 Model that is used to generate mel spectrograms from text
- generate_spectrogram(*, tokens)
Accepts a batch of text or text_tokens and returns a batch of spectrograms
- Parameters
tokens – A torch tensor representing the text to be generated
- Returns
spectrograms
- property input_types
Define these to enable input neural type checks
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
- parse(text: str, normalize=True) → torch.Tensor
A helper function that accepts a raw Python string and turns it into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.
Note that some models have a normalize parameter in this function, which applies the normalizer if one is available.
- class nemo.collections.tts.models.SpectrogramEnhancerModel(*args: Any, **kwargs: Any)
Bases: nemo.core.classes.modelPT.ModelPT, nemo.core.classes.exportable.Exportable
GAN-based model to add details to blurry spectrograms from TTS models like Tacotron or FastPitch. Based on StyleGAN 2 [1].
[1] Karras et al. - Analyzing and Improving the Image Quality of StyleGAN (https://arxiv.org/abs/1912.04958)
- forward(*, input_spectrograms: torch.Tensor, lengths: torch.Tensor, mixing: bool = False, normalize: bool = True)
Generator forward pass. Noise inputs will be generated.
input_spectrograms: batch of spectrograms, typically synthetic
lengths: length for every spectrogram in the batch
mixing: style mixing, usually True during training
normalize: normalize spectrogram range to ~[0, 1], True for normal use
returns: batch of enhanced spectrograms
For an explanation of style mixing, refer to [1].
[1] Karras et al. - A Style-Based Generator Architecture for Generative Adversarial Networks, 2018 (https://arxiv.org/abs/1812.04948)
- forward_with_custom_noise(input_spectrograms: torch.Tensor, lengths: torch.Tensor, zs: Optional[List[torch.Tensor]] = None, ws: Optional[List[torch.Tensor]] = None, noise: Optional[torch.Tensor] = None, mixing: bool = False, normalize: bool = True)
Generator forward pass. Noise inputs will be generated if None.
input_spectrograms: batch of spectrograms, typically synthetic
lengths: length for every spectrogram in the batch
zs: latent noise inputs on the unit sphere (either this or ws or neither)
ws: latent noise inputs in the style space (either this or zs or neither)
noise: per-pixel independent Gaussian noise
mixing: style mixing, usually True during training
normalize: normalize spectrogram range to ~[0, 1], True for normal use
returns: batch of enhanced spectrograms
For an explanation of style mixing, refer to [1]. For definitions of z and w, refer to [2].
[1] Karras et al. - A Style-Based Generator Architecture for Generative Adversarial Networks, 2018 (https://arxiv.org/abs/1812.04948)
[2] Karras et al. - Analyzing and Improving the Image Quality of StyleGAN, 2019 (https://arxiv.org/abs/1912.04958)
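The "either zs or ws or neither" rule for the latent inputs can be sketched as a small argument check. This is an illustration only: the function `select_latent_source` is hypothetical, and raising an error when both arguments are supplied is an assumption, since the description above does not say how that case is handled.

```python
from typing import Optional, Sequence


def select_latent_source(zs: Optional[Sequence] = None,
                         ws: Optional[Sequence] = None) -> str:
    """Hypothetical check mirroring the 'zs or ws or neither' rule."""
    if zs is not None and ws is not None:
        # Assumption: supplying both is treated as a caller error.
        raise ValueError("pass either zs or ws, not both")
    if zs is not None:
        return "zs"          # caller supplied latents on the unit sphere
    if ws is not None:
        return "ws"          # caller supplied latents in the style space
    return "generated"       # neither given: noise inputs are generated
```

For example, `select_latent_source()` reports that the noise would be generated, while `select_latent_source(ws=my_ws)` reports that caller-provided style-space latents would be used (`my_ws` being whatever list of tensors the caller prepared).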
- classmethod list_available_models()
Should list all pre-trained models available via NVIDIA NGC cloud. Note: There is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in the returned list will be instantiated.
- Returns
A list of PretrainedModelInfo entries
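The collision behavior described above amounts to a first-match lookup over the returned list. The sketch below illustrates that rule with a hypothetical `ModelEntry` dataclass standing in for a PretrainedModelInfo entry; its field names (`name`, `location`, `aliases`) and the function `resolve_first_match` are assumptions made for the example, not the NeMo API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelEntry:
    """Hypothetical stand-in for one entry of the returned list."""
    name: str
    location: str
    aliases: List[str] = field(default_factory=list)


def resolve_first_match(entries: List[ModelEntry], requested: str) -> Optional[ModelEntry]:
    """Return the first entry whose name or alias equals `requested`."""
    for entry in entries:
        if requested == entry.name or requested in entry.aliases:
            return entry
    return None  # nothing matched


# Two entries share the alias "default"; the one listed first wins.
catalog = [
    ModelEntry(name="enhancer_a", location="location-a", aliases=["default"]),
    ModelEntry(name="enhancer_b", location="location-b", aliases=["default"]),
]
chosen = resolve_first_match(catalog, "default")
```

Because nothing enforces uniqueness, the order of the list decides which model a shared name or alias resolves to.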
Speech-to-Text Aligner Models
- class nemo.collections.tts.models.AlignerModel(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.modelPT.ModelPT
Speech-to-text alignment model (https://arxiv.org/pdf/2108.10447.pdf) that is used to learn alignments between mel spectrogram and text.
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
Two-Stage Models
- class nemo.collections.tts.models.TwoStagesModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.tts.models.base.Vocoder
Two-stage model used to convert mel spectrograms to linear spectrograms, and then to audio.
- convert_spectrogram_to_audio(spec: torch.Tensor, **kwargs) → torch.Tensor
Accepts a batch of spectrograms and returns a batch of audio.
- Parameters
spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.
- Returns
audio
- cuda(*args, **kwargs)
PTL overrides this method and changes the PyTorch behavior of a module.
The PTL LightningModule override moves the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113
Here we override it to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728
Moves all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.
Note
This method modifies the module in-place.
- Parameters
device (int, optional) – if specified, all parameters will be copied to that device
- Returns
self
- Return type
Module
- property input_types
Define these to enable input neural type checks
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
Vocoders
- class nemo.collections.tts.models.GriffinLimModel(*args: Any, **kwargs: Any)
Bases:
nemo.collections.tts.models.base.Vocoder
- convert_spectrogram_to_audio(spec, Ts=None)
Accepts a batch of spectrograms and returns a batch of audio.
- Parameters
spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.
- Returns
audio
- cuda(*args, **kwargs)
PTL overrides this method and changes the PyTorch behavior of a module.
The PTL LightningModule override moves the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113
Here we override it to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728
Moves all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.
Note
This method modifies the module in-place.
- Parameters
device (int, optional) – if specified, all parameters will be copied to that device
- Returns
self
- Return type
Module
- class nemo.collections.tts.models.HifiGanModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable
HiFi-GAN model (https://arxiv.org/abs/2010.05646) that is used to generate audio from mel spectrograms.
- convert_spectrogram_to_audio(spec: torch.tensor) → torch.tensor
Accepts a batch of spectrograms and returns a batch of audio.
- Parameters
spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.
- Returns
audio
- forward(*, spec)
Runs the generator. For inputs and outputs, see input_types and output_types.
- forward_for_export(spec)
Runs the generator. For inputs and outputs, see input_types and output_types.
- input_example(max_batch=1, max_dim=256)
Generates input examples for tracing etc.
- Returns
A tuple of input examples.
- property input_types
Define these to enable input neural type checks
- classmethod list_available_models() → Optional[Dict[str, str]]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
- class nemo.collections.tts.models.UnivNetModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable
UnivNet model (https://arxiv.org/abs/2106.07889) that is used to generate audio from mel spectrograms.
- convert_spectrogram_to_audio(spec: torch.tensor) → torch.tensor
Accepts a batch of spectrograms and returns a batch of audio.
- Parameters
spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.
- Returns
audio
- forward(*, spec)
Runs the generator. For inputs and outputs, see input_types and output_types.
- forward_for_export(spec)
Runs the generator. For inputs and outputs, see input_types and output_types.
- input_example(max_batch=1, max_dim=256)
Generates input examples for tracing etc.
- Returns
A tuple of input examples.
- property input_types
Define these to enable input neural type checks
- classmethod list_available_models() → Optional[Dict[str, str]]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
- class nemo.collections.tts.models.WaveGlowModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.tts.models.base.GlowVocoder, nemo.core.classes.exportable.Exportable
WaveGlow model (https://arxiv.org/abs/1811.00002) that is used to generate audio from mel spectrograms.
- convert_spectrogram_to_audio(spec: torch.Tensor, sigma: float = 1.0, denoise: bool = True, denoiser_strength: float = 0.01) → torch.Tensor
Accepts a batch of spectrograms and returns a batch of audio.
- Parameters
spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.
- Returns
audio
- property input_types
Define these to enable input neural type checks
- classmethod list_available_models() → List[PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types
Define these to enable output neural type checks
Codecs
Base Classes
The classes below are the base of the TTS pipeline. To read more about them, see the Base Classes section of the intro page.
- class nemo.collections.tts.models.base.MelToSpec(*args: Any, **kwargs: Any)
Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC
A base class for models that convert mel spectrograms to linear (magnitude) spectrograms.
- abstract convert_mel_spectrogram_to_linear(mel: torch.tensor, **kwargs) → torch.tensor
Accepts a batch of spectrograms and returns a batch of linear spectrograms
- Parameters
mel – A torch tensor representing the mel spectrograms [‘B’, ‘mel_freqs’, ‘T’]
- Returns
A torch tensor representing the linear spectrograms [‘B’, ‘n_freqs’, ‘T’]
- Return type
spec
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)
Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC
Base class for all TTS models that turn text into a spectrogram.
- abstract generate_spectrogram(tokens: torch.tensor, **kwargs) → torch.tensor
Accepts a batch of text or text_tokens and returns a batch of spectrograms
- Parameters
tokens – A torch tensor representing the text to be generated
- Returns
spectrograms
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- abstract parse(str_input: str, **kwargs) → torch.tensor
A helper function that accepts a raw Python string and turns it into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.
Note that some models have a normalize parameter in this function, which applies the normalizer if one is available.
- class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)
Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC
A base class for models that convert spectrograms to audio. Note that this class takes either linear or mel spectrograms as input.
- abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) → torch.tensor
Accepts a batch of spectrograms and returns a batch of audio.
- Parameters
spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.
- Returns
audio
- classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo]
Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
Dataset Processing Classes
- class nemo.collections.tts.data.dataset.MixerTTSXDataset(*args: Any, **kwargs: Any)
- class nemo.collections.tts.data.dataset.TTSDataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset
- class nemo.collections.tts.data.dataset.VocoderDataset(*args: Any, **kwargs: Any)
Bases:
nemo.core.classes.dataset.Dataset