NeMo TTS Collection API#

Model Classes#

Mel-Spectrogram Generators#

class nemo.collections.tts.models.FastPitchModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

FastPitch model (https://arxiv.org/abs/2006.06873) that is used to generate mel spectrograms from text.

property disabled_deployment_input_names#

Implement this method to return a set of input names disabled for export

forward(*, text, durs=None, pitch=None, energy=None, speaker=None, pace=1.0, spec=None, attn_prior=None, mel_lens=None, input_lens=None)[source]#
forward_for_export(text, pitch, pace, volume=None, batch_lengths=None, speaker=None)[source]#
generate_spectrogram(tokens: torch.tensor, speaker: Optional[int] = None, pace: float = 1.0) torch.tensor[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

input_example(max_batch=1, max_dim=44)[source]#

Generates input examples for tracing, etc.

Returns

A tuple of input examples

property input_types#

Define these to enable input neural type checks

interpolate_speaker(original_speaker_1, original_speaker_2, weight_speaker_1, weight_speaker_2, new_speaker_id)[source]#

This method performs speaker interpolation between two original speakers the model was trained on.

Inputs:

original_speaker_1 – Integer speaker ID of the first existing speaker in the model

original_speaker_2 – Integer speaker ID of the second existing speaker in the model

weight_speaker_1 – Floating-point weight applied to the first speaker during the weighted combination

weight_speaker_2 – Floating-point weight applied to the second speaker during the weighted combination

new_speaker_id – Integer speaker ID assigned to the new interpolated speaker in the model
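Conceptually, the interpolated speaker's embedding is a weighted combination of the two existing speakers' embeddings. A toy, self-contained sketch of that combination (plain Python, outside NeMo; the real embeddings live in the model's speaker-embedding table):

```python
# Toy illustration of speaker interpolation: the new speaker's embedding
# is the weighted sum w1 * emb(s1) + w2 * emb(s2). Embedding values here
# are made up for demonstration.
speaker_emb = {
    0: [1.0, 0.0, 0.0],  # embedding of original_speaker_1
    1: [0.0, 1.0, 0.0],  # embedding of original_speaker_2
}

def interpolate(emb_table, s1, s2, w1, w2, new_id):
    # Element-wise weighted combination, stored under the new speaker ID.
    emb_table[new_id] = [w1 * a + w2 * b
                         for a, b in zip(emb_table[s1], emb_table[s2])]
    return emb_table[new_id]

new_emb = interpolate(speaker_emb, 0, 1, 0.5, 0.5, 2)
print(new_emb)  # [0.5, 0.5, 0.0]
```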

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

parse(str_input: str, normalize=True) torch.tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

property parser#
property tb_logger#
class nemo.collections.tts.models.MixerTTSModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

Mixer-TTS and Mixer-TTS-X models (https://arxiv.org/abs/2110.03584) that are used to generate mel spectrograms from text.

forward(text, text_len, pitch=None, spect=None, spect_len=None, attn_prior=None, lm_tokens=None)[source]#
forward_for_export(text, lm_tokens=None)[source]#
generate_spectrogram(tokens: Optional[torch.Tensor] = None, tokens_len: Optional[torch.Tensor] = None, lm_tokens: Optional[torch.Tensor] = None, raw_texts: Optional[List[str]] = None, norm_text_for_lm_model: bool = True, lm_model: str = 'albert')[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

infer(text, text_len=None, text_mask=None, spect=None, spect_len=None, attn_prior=None, use_gt_durs=False, lm_tokens=None, pitch=None)[source]#
input_example(max_text_len=10, max_lm_tokens_len=10)[source]#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

run_aligner(text, text_len, text_mask, spect, spect_len, attn_prior)#
class nemo.collections.tts.models.RadTTSModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

batch_dict(batch_data)[source]#
configure_optimizers()[source]#
forward_for_export(text, lens, speaker_id, speaker_id_text, speaker_id_attributes)[source]#
generate_spectrogram(tokens: torch.tensor, speaker: int = 0, sigma: float = 1.0) torch.tensor[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

property input_module#
property output_module#
parse(text: str, normalize=False) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

property parser#
property tb_logger#
class nemo.collections.tts.models.Tacotron2Model(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator

Tacotron 2 Model that is used to generate mel spectrograms from text

forward(*, tokens, token_len, audio=None, audio_len=None)[source]#
generate_spectrogram(*, tokens)[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

property parser#

Speech-to-Text Aligner Models#

class nemo.collections.tts.models.AlignerModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT

Speech-to-text alignment model (https://arxiv.org/pdf/2108.10447.pdf) that is used to learn alignments between mel spectrograms and text.

forward(*, spec, spec_len, text, text_len, attn_prior=None)[source]#
classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Two-Stage Models#

class nemo.collections.tts.models.TwoStagesModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder

Two-stage model used to convert mel spectrograms to linear spectrograms, and then to audio

convert_spectrogram_to_audio(spec: torch.Tensor, **kwargs) torch.Tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)[source]#
PTL overrides this method and changes the PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113

Here we override this to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

forward(*, mel)[source]#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

set_linear_vocoder(linvocoder: nemo.collections.tts.models.base.Vocoder)[source]#
set_mel_to_spec_model(mel2spec: nemo.collections.tts.models.base.MelToSpec)[source]#

Vocoders#

class nemo.collections.tts.models.GriffinLimModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder

convert_spectrogram_to_audio(spec, Ts=None)[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)[source]#
PTL overrides this method and changes the PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113

Here we override this to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

class nemo.collections.tts.models.HifiGanModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable

HiFi-GAN model (https://arxiv.org/abs/2010.05646) that is used to generate audio from mel spectrograms.

configure_optimizers()[source]#
convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

forward_for_export(spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

static get_warmup_steps(max_steps, warmup_steps, warmup_ratio)[source]#
input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing, etc.

Returns

A tuple of input examples

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

load_state_dict(state_dict, strict=True)[source]#
property output_types#

Define these to enable output neural type checks

class nemo.collections.tts.models.UnivNetModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable

UnivNet model (https://arxiv.org/abs/2106.07889) that is used to generate audio from mel spectrograms.

configure_optimizers()[source]#
convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

forward_for_export(spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing, etc.

Returns

A tuple of input examples

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

class nemo.collections.tts.models.WaveGlowModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.GlowVocoder, nemo.core.classes.exportable.Exportable

WaveGlow model (https://arxiv.org/abs/1811.00002) that is used to generate audio from mel spectrograms.

convert_spectrogram_to_audio(spec: torch.Tensor, sigma: float = 1.0, denoise: bool = True, denoiser_strength: float = 0.01) torch.Tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, audio, audio_len, run_inverse=True)[source]#
forward_for_export(spec, z=None)[source]#
property input_module#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

load_state_dict(state_dict, strict=True)[source]#
property mode#
property output_module#
property output_types#

Define these to enable output neural type checks

Base Classes#

The classes below are the base of the TTS pipeline. To read more about them, see the Base Classes section of the intro page.

class nemo.collections.tts.models.base.MelToSpec(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

A base class for models that convert mel spectrograms to linear (magnitude) spectrograms

abstract convert_mel_spectrogram_to_linear(mel: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of mel spectrograms and returns a batch of linear spectrograms

Parameters

mel – A torch tensor representing the mel spectrograms [‘B’, ‘mel_freqs’, ‘T’]

Returns

A torch tensor representing the linear spectrograms [‘B’, ‘n_freqs’, ‘T’]

Return type

spec

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

abstract parse(str_input: str, **kwargs) torch.tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

A base class for models that convert spectrograms to audio. Note that this class takes as input either linear or mel spectrograms.

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Dataset Processing Classes#

class nemo.collections.tts.torch.data.MixerTTSXDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.torch.data.TTSDataset

add_lm_tokens(**kwargs)[source]#
class nemo.collections.tts.torch.data.TTSDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset

add_align_prior_matrix(**kwargs)[source]#
add_durations(**kwargs)[source]#
add_energy(**kwargs)[source]#
add_log_mel(**kwargs)[source]#
add_p_voiced(**kwargs)[source]#
add_pitch(**kwargs)[source]#
add_speaker_id(**kwargs)[source]#
add_voiced_mask(**kwargs)[source]#
static filter_files(data, ignore_file, min_duration, max_duration, total_duration)[source]#
general_collate_fn(batch)[source]#
get_log_mel(audio)[source]#
get_spec(audio)[source]#
join_data(data_dict)[source]#
class nemo.collections.tts.torch.data.VocoderDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset