Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

NeMo TTS API

Model Classes

Mel-Spectrogram Generators

class nemo.collections.tts.models.FastPitchModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable, nemo.collections.tts.parts.mixins.fastpitch_adapter_mixins.FastPitchAdapterModelMixin

FastPitch model (https://arxiv.org/abs/2006.06873) that is used to generate mel spectrograms from text.

property disabled_deployment_input_names

Implement this method to return a set of input names disabled for export

generate_spectrogram(tokens: torch.tensor, speaker: Optional[int] = None, pace: float = 1.0, reference_spec: Optional[torch.tensor] = None, reference_spec_lens: Optional[torch.tensor] = None) torch.tensor

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

input_example(max_batch=1, max_dim=44)

Generates input examples for tracing, etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

interpolate_speaker(original_speaker_1, original_speaker_2, weight_speaker_1, weight_speaker_2, new_speaker_id)

This method performs speaker interpolation between two original speakers the model is trained on.

Inputs:

original_speaker_1: Integer speaker ID of the first existing speaker in the model
original_speaker_2: Integer speaker ID of the second existing speaker in the model
weight_speaker_1: Floating point weight applied to the first speaker during weight combination
weight_speaker_2: Floating point weight applied to the second speaker during weight combination
new_speaker_id: Integer speaker ID of the new interpolated speaker in the model
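As a rough illustration of the weight combination described above, the sketch below blends two speaker embedding vectors with a weighted sum and stores the result under a new ID. It assumes the interpolation is a plain weighted sum of embeddings; the `interpolate_embeddings` helper and the dictionary-based `speaker_table` are hypothetical stand-ins, not the NeMo implementation.

```python
from typing import Dict, List


def interpolate_embeddings(
    table: Dict[int, List[float]],
    speaker_1: int,
    speaker_2: int,
    weight_1: float,
    weight_2: float,
    new_speaker_id: int,
) -> List[float]:
    """Store a weighted sum of two existing embeddings under new_speaker_id.

    Hypothetical helper: assumes interpolation is a plain weighted sum.
    """
    emb_1 = table[speaker_1]
    emb_2 = table[speaker_2]
    table[new_speaker_id] = [weight_1 * a + weight_2 * b for a, b in zip(emb_1, emb_2)]
    return table[new_speaker_id]


# Toy embedding table with two "trained" speakers (values are made up).
speaker_table = {0: [1.0, 0.0, 2.0], 1: [0.0, 1.0, 4.0]}
mixed = interpolate_embeddings(speaker_table, 0, 1, 0.5, 0.5, new_speaker_id=2)
```

With equal weights the new entry sits halfway between the two original speakers; weights that sum to 1 keep the result on the line segment between them.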

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

parse(str_input: str, normalize=True) torch.tensor

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which applies the normalizer if it is available.
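The shape contract described above (batch of size 1, then time) can be pictured with a toy character-level tokenizer. This is only a sketch: `toy_parse` and its vocabulary are hypothetical, nested lists stand in for a tensor, and the real tokenization and normalization depend on the model.

```python
from typing import List


def toy_parse(text: str, normalize: bool = True) -> List[List[int]]:
    """Turn a raw string into a [1, T] nested list of token IDs (illustrative only)."""
    if normalize:
        # Stand-in for text normalization: lowercase and collapse whitespace.
        text = " ".join(text.lower().split())
    # Hypothetical vocabulary: space is 0, then a..z are 1..26.
    vocab = {ch: i for i, ch in enumerate(" abcdefghijklmnopqrstuvwxyz")}
    token_ids = [vocab[ch] for ch in text if ch in vocab]
    return [token_ids]  # outer list is the batch dimension of size 1


tokens = toy_parse("Hi  there")
```

The outer list has exactly one entry (the batch), and the inner list has one ID per character kept after normalization (the time axis).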

class nemo.collections.tts.models.MixerTTSModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

Mixer-TTS and Mixer-TTS-X models (https://arxiv.org/abs/2110.03584) that are used to generate mel spectrograms from text.

generate_spectrogram(tokens: Optional[torch.Tensor] = None, tokens_len: Optional[torch.Tensor] = None, lm_tokens: Optional[torch.Tensor] = None, raw_texts: Optional[List[str]] = None, norm_text_for_lm_model: bool = True, lm_model: str = 'albert')

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

property input_types

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which applies the normalizer if it is available.

class nemo.collections.tts.models.RadTTSModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

generate_spectrogram()

parse()

class nemo.collections.tts.models.Tacotron2Model(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.SpectrogramGenerator

Tacotron 2 Model that is used to generate mel spectrograms from text

generate_spectrogram(*, tokens)

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

property input_types

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which applies the normalizer if it is available.

class nemo.collections.tts.models.SpectrogramEnhancerModel(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT, nemo.core.classes.exportable.Exportable

GAN-based model to add details to blurry spectrograms from TTS models like Tacotron or FastPitch. Based on StyleGAN 2 [1].

[1] Karras et al. - Analyzing and Improving the Image Quality of StyleGAN (https://arxiv.org/abs/1912.04958)

forward(*, input_spectrograms: torch.Tensor, lengths: torch.Tensor, mixing: bool = False, normalize: bool = True)

Generator forward pass. Noise inputs will be generated.

input_spectrograms: batch of spectrograms, typically synthetic
lengths: length for every spectrogram in the batch
mixing: style mixing, usually True during training
normalize: normalize spectrogram range to ~[0, 1], True for normal use

returns: batch of enhanced spectrograms

For an explanation of style mixing, refer to [1].

[1] Karras et al. - A Style-Based Generator Architecture for Generative Adversarial Networks, 2018 (https://arxiv.org/abs/1812.04948)

forward_with_custom_noise(input_spectrograms: torch.Tensor, lengths: torch.Tensor, zs: Optional[List[torch.Tensor]] = None, ws: Optional[List[torch.Tensor]] = None, noise: Optional[torch.Tensor] = None, mixing: bool = False, normalize: bool = True)

Generator forward pass. Noise inputs will be generated if None.

input_spectrograms: batch of spectrograms, typically synthetic
lengths: length for every spectrogram in the batch
zs: latent noise inputs on the unit sphere (either this or ws or neither)
ws: latent noise inputs in the style space (either this or zs or neither)
noise: per-pixel independent Gaussian noise
mixing: style mixing, usually True during training
normalize: normalize spectrogram range to ~[0, 1], True for normal use

returns: batch of enhanced spectrograms

For an explanation of style mixing, refer to [1]. For definitions of z and w, refer to [2].

[1] Karras et al. - A Style-Based Generator Architecture for Generative Adversarial Networks, 2018 (https://arxiv.org/abs/1812.04948)
[2] Karras et al. - Analyzing and Improving the Image Quality of StyleGAN, 2019 (https://arxiv.org/abs/1912.04958)
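To make "latent noise inputs on the unit sphere" concrete, the sketch below draws a Gaussian vector and rescales it to unit L2 norm. The `sample_unit_sphere` helper, the dimension of 8, and the Gaussian sampling are assumptions chosen for illustration; the model's own latent size and sampling procedure may differ.

```python
import math
import random
from typing import List


def sample_unit_sphere(dim: int, rng: random.Random) -> List[float]:
    """Draw a Gaussian vector and rescale it to unit L2 norm (illustrative only)."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z = sample_unit_sphere(8, rng)
z_norm = math.sqrt(sum(v * v for v in z))
```

A vector like `z` would correspond to one entry of `zs`; the style-space `ws` inputs are the result of mapping such vectors through the generator's mapping network, as described in [2].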

classmethod list_available_models()

Should list all pre-trained models available via NVIDIA NGC cloud. Note: There is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in the returned list will be instantiated.

Returns

A list of PretrainedModelInfo entries
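The collision behavior noted for this method amounts to a first-match lookup over the returned list. The sketch below shows that rule in isolation; `ToyModelInfo` and `resolve` are hypothetical stand-ins, not the PretrainedModelInfo class or the NeMo lookup code, and the locations are made-up strings.

```python
from typing import List, NamedTuple, Optional


class ToyModelInfo(NamedTuple):
    """Hypothetical stand-in for an entry in the returned list."""
    name: str
    location: str


def resolve(name: str, available: List[ToyModelInfo]) -> Optional[ToyModelInfo]:
    """Return the first entry whose name matches, mirroring the collision rule."""
    for info in available:
        if info.name == name:
            return info
    return None


available = [
    ToyModelInfo("enhancer", "location-a"),
    ToyModelInfo("enhancer", "location-b"),  # duplicate name: never selected
]
chosen = resolve("enhancer", available)
```

Because nothing enforces uniqueness, the order of entries decides which model a duplicated name resolves to.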

Speech-to-Text Aligner Models

class nemo.collections.tts.models.AlignerModel(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT

Speech-to-text alignment model (https://arxiv.org/pdf/2108.10447.pdf) that is used to learn alignments between mel spectrograms and text.

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

Two-Stage Models

class nemo.collections.tts.models.TwoStagesModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.Vocoder

Two-stage model that converts mel spectrograms to linear spectrograms and then to audio.

convert_spectrogram_to_audio(spec: torch.Tensor, **kwargs) torch.Tensor

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio
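The two-stage idea (mel to linear, then linear to audio) is essentially function composition. The following sketch chains two placeholder stages behind a single `convert_spectrogram_to_audio` call. Everything here is an assumption made for illustration: `ToyTwoStages`, `duplicate_bins`, and `mean_over_bins` are hypothetical, nested lists stand in for tensors, and the arithmetic has no acoustic meaning.

```python
from typing import Callable, List

Spectrogram = List[List[List[float]]]  # [B][n_freqs][T]
Audio = List[List[float]]              # [B][samples]


class ToyTwoStages:
    """Chains a mel-to-linear stage and a linear-to-audio stage (illustrative only)."""

    def __init__(
        self,
        mel_to_linear: Callable[[Spectrogram], Spectrogram],
        linear_to_audio: Callable[[Spectrogram], Audio],
    ) -> None:
        self.mel_to_linear = mel_to_linear
        self.linear_to_audio = linear_to_audio

    def convert_spectrogram_to_audio(self, spec: Spectrogram, **kwargs) -> Audio:
        linear = self.mel_to_linear(spec)    # stage 1: mel -> linear
        return self.linear_to_audio(linear)  # stage 2: linear -> audio


def duplicate_bins(mel: Spectrogram) -> Spectrogram:
    # Placeholder stage 1: repeat each mel bin twice to mimic a larger frequency axis.
    return [[row for row in item for _ in range(2)] for item in mel]


def mean_over_bins(linear: Spectrogram) -> Audio:
    # Placeholder stage 2: one output sample per frame, averaged over frequency bins.
    return [[sum(frame) / len(frame) for frame in zip(*item)] for item in linear]


model = ToyTwoStages(duplicate_bins, mean_over_bins)
audio = model.convert_spectrogram_to_audio([[[1.0, 2.0], [3.0, 4.0]]])
```

The caller only sees a vocoder-style interface; the intermediate linear spectrogram stays internal to the wrapper.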

cuda(*args, **kwargs)

PTL is overriding this method and changing the PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113

Here we are overriding this to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

property input_types

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

Vocoders

class nemo.collections.tts.models.GriffinLimModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.Vocoder

convert_spectrogram_to_audio(spec, Ts=None)

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)

PTL is overriding this method and changing the PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113

Here we are overriding this to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

class nemo.collections.tts.models.HifiGanModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable

HiFi-GAN model (https://arxiv.org/abs/2010.05646) that is used to generate audio from mel spectrograms.

convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio
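A vocoder is typically the last step after a spectrogram generator, so the three documented calls (parse, generate_spectrogram, convert_spectrogram_to_audio) chain into a text-to-audio helper. The sketch below shows that flow using stub objects so it runs without NeMo installed; `synthesize`, `StubGenerator`, and `StubVocoder` are hypothetical names, and nested lists stand in for tensors.

```python
from typing import Any


def synthesize(text: str, spec_generator: Any, vocoder: Any) -> Any:
    """Text -> tokens -> spectrogram -> audio, using only the documented method names."""
    tokens = spec_generator.parse(text)
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)


class StubGenerator:
    """Stand-in for a spectrogram generator (not a NeMo class)."""

    def parse(self, text, normalize=True):
        return [[ord(c) for c in text]]  # shape [1, T]

    def generate_spectrogram(self, tokens):
        # One frame per token and two fake frequency bins: shape [B, 2, T].
        return [[[float(t) for t in seq], [0.0] * len(seq)] for seq in tokens]


class StubVocoder:
    """Stand-in for a vocoder (not a NeMo class)."""

    def convert_spectrogram_to_audio(self, spec):
        # Use the first frequency bin as the "waveform": shape [B, T].
        return [list(item[0]) for item in spec]


audio = synthesize("ab", StubGenerator(), StubVocoder())
```

With NeMo installed, the stubs could presumably be swapped for real models, for example ones loaded with `from_pretrained` using a name reported by `list_available_models()`; the keyword form `tokens=` is used because some generators declare that argument as keyword-only.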

forward(*, spec)

Runs the generator, for inputs and outputs see input_types, and output_types

forward_for_export(spec)

Runs the generator, for inputs and outputs see input_types, and output_types

input_example(max_batch=1, max_dim=256)

Generates input examples for tracing, etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

class nemo.collections.tts.models.UnivNetModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable

UnivNet model (https://arxiv.org/abs/2106.07889) that is used to generate audio from mel spectrograms.

convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, spec)

Runs the generator, for inputs and outputs see input_types, and output_types

forward_for_export(spec)

Runs the generator, for inputs and outputs see input_types, and output_types

input_example(max_batch=1, max_dim=256)

Generates input examples for tracing, etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

class nemo.collections.tts.models.WaveGlowModel(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.models.base.GlowVocoder, nemo.core.classes.exportable.Exportable

WaveGlow model (https://arxiv.org/abs/1811.00002) that is used to generate audio from mel spectrograms.

convert_spectrogram_to_audio(spec: torch.Tensor, sigma: float = 1.0, denoise: bool = True, denoiser_strength: float = 0.01) torch.Tensor

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

property input_types

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

property output_types

Define these to enable output neural type checks

Codecs

class nemo.collections.tts.models.AudioCodecModel(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT

decode()
decode_audio()
dequantize()
encode()
encode_audio()
forward()
list_available_models()
pad_audio()
quantize()
should_update_disc()

Base Classes

The classes below are the base of the TTS pipeline. To read more about them, see the Base Classes section of the intro page.

class nemo.collections.tts.models.base.MelToSpec(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

A base class for models that convert mel spectrograms to linear (magnitude) spectrograms

abstract convert_mel_spectrogram_to_linear(mel: torch.tensor, **kwargs) torch.tensor

Accepts a batch of spectrograms and returns a batch of linear spectrograms

Parameters

mel – A torch tensor representing the mel spectrograms [‘B’, ‘mel_freqs’, ‘T’]

Returns

A torch tensor representing the linear spectrograms [‘B’, ‘n_freqs’, ‘T’]

Return type

spec

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) torch.tensor

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

abstract parse(str_input: str, **kwargs) torch.tensor

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which applies the normalizer if it is available.

class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

A base class for models that convert spectrograms to audio. Note that this class takes as input either linear or mel spectrograms.

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) torch.tensor

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio
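Because convert_spectrogram_to_audio is abstract, a concrete vocoder has to implement it before the class can be instantiated. The sketch below reproduces that contract with Python's abc module alone; `ToyVocoder` and `SumVocoder` are hypothetical classes, not the NeMo base class, and the "audio" is just a per-frame sum over frequency bins.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyVocoder(ABC):
    """Minimal stand-in for an abstract vocoder interface (not the NeMo class)."""

    @abstractmethod
    def convert_spectrogram_to_audio(self, spec, **kwargs):
        """Accept [B][n_freqs][T] spectrograms and return [B][samples] audio."""


class SumVocoder(ToyVocoder):
    def convert_spectrogram_to_audio(self, spec, **kwargs) -> List[List[float]]:
        # Placeholder synthesis: one sample per frame, summing over frequency bins.
        return [[sum(frame) for frame in zip(*item)] for item in spec]


audio = SumVocoder().convert_spectrogram_to_audio([[[1.0, 2.0], [3.0, 4.0]]])

# Instantiating the abstract base directly fails, which is how the contract is enforced.
try:
    ToyVocoder()
    base_instantiable = True
except TypeError:
    base_instantiable = False
```

Extra keyword arguments pass through `**kwargs`, which is how subclasses in this page add options such as sigma or denoise without changing the base signature.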

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

Dataset Processing Classes

class nemo.collections.tts.data.dataset.MixerTTSXDataset(*args: Any, **kwargs: Any)

Bases: nemo.collections.tts.data.dataset.TTSDataset

class nemo.collections.tts.data.dataset.TTSDataset(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.dataset.Dataset

class nemo.collections.tts.data.dataset.VocoderDataset(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.dataset.Dataset