NeMo TTS Collection API#

Model Classes#

Mel-Spectrogram Generators#

class nemo.collections.tts.models.FastPitchModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

FastPitch model (https://arxiv.org/abs/2006.06873) that is used to generate mel spectrograms from text.

property disabled_deployment_input_names#

Implement this method to return a set of input names disabled for export

forward(*, text, durs=None, pitch=None, energy=None, speaker=None, pace=1.0, spec=None, attn_prior=None, mel_lens=None, input_lens=None)[source]#
forward_for_export(text, pitch, pace, volume=None, batch_lengths=None, speaker=None)[source]#
generate_spectrogram(tokens: torch.tensor, speaker: Optional[int] = None, pace: float = 1.0) torch.tensor[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

input_example(max_batch=1, max_dim=44)[source]#

Generates input examples for tracing, etc.

Returns

A tuple of input examples

property input_types#

Define these to enable input neural type checks

interpolate_speaker(original_speaker_1, original_speaker_2, weight_speaker_1, weight_speaker_2, new_speaker_id)[source]#

This method performs speaker interpolation between two original speakers the model was trained on.

Inputs:

original_speaker_1 – Integer speaker ID of the first existing speaker in the model

original_speaker_2 – Integer speaker ID of the second existing speaker in the model

weight_speaker_1 – Floating-point weight applied to the first speaker during the weighted combination

weight_speaker_2 – Floating-point weight applied to the second speaker during the weighted combination

new_speaker_id – Integer speaker ID assigned to the new interpolated speaker in the model
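Conceptually, the interpolated speaker's embedding is a weighted combination of the two existing speakers' embeddings. A toy, self-contained sketch of that combination (plain Python, outside NeMo; the real embeddings live in the model's speaker-embedding table):

```python
# Toy illustration of speaker interpolation: the new speaker's embedding
# is the weighted sum w1 * emb(s1) + w2 * emb(s2). Embedding values here
# are made up for demonstration.
speaker_emb = {
    0: [1.0, 0.0, 0.0],  # embedding of original_speaker_1
    1: [0.0, 1.0, 0.0],  # embedding of original_speaker_2
}

def interpolate(emb_table, s1, s2, w1, w2, new_id):
    # Element-wise weighted combination, stored under the new speaker ID.
    emb_table[new_id] = [w1 * a + w2 * b
                         for a, b in zip(emb_table[s1], emb_table[s2])]
    return emb_table[new_id]

new_emb = interpolate(speaker_emb, 0, 1, 0.5, 0.5, 2)
print(new_emb)  # [0.5, 0.5, 0.0]
```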

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

parse(str_input: str, normalize=True) torch.tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

property parser#
property tb_logger#
class nemo.collections.tts.models.MixerTTSModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

Mixer-TTS and Mixer-TTS-X models (https://arxiv.org/abs/2110.03584) that are used to generate mel spectrograms from text.

forward(text, text_len, pitch=None, spect=None, spect_len=None, attn_prior=None, lm_tokens=None)[source]#
forward_for_export(text, lm_tokens=None)[source]#
generate_spectrogram(tokens: Optional[torch.Tensor] = None, tokens_len: Optional[torch.Tensor] = None, lm_tokens: Optional[torch.Tensor] = None, raw_texts: Optional[List[str]] = None, norm_text_for_lm_model: bool = True, lm_model: str = 'albert')[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

infer(text, text_len=None, text_mask=None, spect=None, spect_len=None, attn_prior=None, use_gt_durs=False, lm_tokens=None, pitch=None)[source]#
input_example(max_text_len=10, max_lm_tokens_len=10)[source]#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

run_aligner(text, text_len, text_mask, spect, spect_len, attn_prior)#
class nemo.collections.tts.models.RadTTSModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator, nemo.core.classes.exportable.Exportable

batch_dict(batch_data)[source]#
configure_optimizers()[source]#
forward_for_export(text, lens, speaker_id, speaker_id_text, speaker_id_attributes)[source]#
generate_spectrogram(tokens: torch.tensor, speaker: int = 0, sigma: float = 1.0) torch.tensor[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

property input_module#
property output_module#
parse(text: str, normalize=False) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

property parser#
property tb_logger#
class nemo.collections.tts.models.Tacotron2Model(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.SpectrogramGenerator

Tacotron 2 Model that is used to generate mel spectrograms from text

forward(*, tokens, token_len, audio=None, audio_len=None)[source]#
generate_spectrogram(*, tokens)[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

property parser#

Speech-to-Text Aligner Models#

class nemo.collections.tts.models.AlignerModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT

Speech-to-text alignment model (https://arxiv.org/pdf/2108.10447.pdf) that is used to learn alignments between mel spectrograms and text.

forward(*, spec, spec_len, text, text_len, attn_prior=None)[source]#
classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Two-Stage Models#

class nemo.collections.tts.models.TwoStagesModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder

Two-stage model used to convert mel spectrograms to linear spectrograms, and then to audio

convert_spectrogram_to_audio(spec: torch.Tensor, **kwargs) torch.Tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)[source]#
PTL overrides this method and changes the PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113

Here we override this to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

forward(*, mel)[source]#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

set_linear_vocoder(linvocoder: nemo.collections.tts.models.base.Vocoder)[source]#
set_mel_to_spec_model(mel2spec: nemo.collections.tts.models.base.MelToSpec)[source]#

Vocoders#

class nemo.collections.tts.models.GriffinLimModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder

convert_spectrogram_to_audio(spec, Ts=None)[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)[source]#
PTL overrides this method and changes the PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/core/mixins/device_dtype_mixin.py#L113

Here we override this to maintain the default PyTorch nn.Module behavior: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L728

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects, so it should be called before constructing the optimizer if the module will live on the GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

class nemo.collections.tts.models.HifiGanModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable

HiFi-GAN model (https://arxiv.org/abs/2010.05646) that is used to generate audio from mel spectrograms.

configure_optimizers()[source]#
convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

forward_for_export(spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

static get_warmup_steps(max_steps, warmup_steps, warmup_ratio)[source]#
input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing, etc.

Returns

A tuple of input examples

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

load_state_dict(state_dict, strict=True)[source]#
property output_types#

Define these to enable output neural type checks

class nemo.collections.tts.models.UnivNetModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.Vocoder, nemo.core.classes.exportable.Exportable

UnivNet model (https://arxiv.org/abs/2106.07889) that is used to generate audio from mel spectrograms.

configure_optimizers()[source]#
convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

forward_for_export(spec)[source]#

Runs the generator. For inputs and outputs, see input_types and output_types.

input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing, etc.

Returns

A tuple of input examples

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

property output_types#

Define these to enable output neural type checks

class nemo.collections.tts.models.WaveGlowModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.models.base.GlowVocoder, nemo.core.classes.exportable.Exportable

WaveGlow model (https://arxiv.org/abs/1811.00002) that is used to generate audio from mel spectrograms.

convert_spectrogram_to_audio(spec: torch.Tensor, sigma: float = 1.0, denoise: bool = True, denoiser_strength: float = 0.01) torch.Tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, audio, audio_len, run_inverse=True)[source]#
forward_for_export(spec, z=None)[source]#
property input_module#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

load_state_dict(state_dict, strict=True)[source]#
property mode#
property output_module#
property output_types#

Define these to enable output neural type checks

Base Classes#

The classes below are the base of the TTS pipeline. To read more about them, see the Base Classes section of the intro page.

class nemo.collections.tts.models.base.MelToSpec(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

A base class for models that convert mel spectrograms to linear (magnitude) spectrograms

abstract convert_mel_spectrogram_to_linear(mel: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of mel spectrograms and returns a batch of linear spectrograms

Parameters

mel – A torch tensor representing the mel spectrograms [‘B’, ‘mel_freqs’, ‘T’]

Returns

A torch tensor representing the linear spectrograms [‘B’, ‘n_freqs’, ‘T’]

Return type

spec

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of text tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor of tokenized text to synthesize

Returns

spectrograms

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

abstract parse(str_input: str, **kwargs) torch.tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function, which will apply the normalizer if it is available.

class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

A base class for models that convert spectrograms to audio. Note that this class takes as input either linear or mel spectrograms.

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], a torch tensor representing the spectrograms to be vocoded.

Returns

audio

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Dataset Processing Classes#

class nemo.collections.tts.torch.data.MixerTTSXDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.tts.torch.data.TTSDataset

add_lm_tokens(**kwargs)[source]#
class nemo.collections.tts.torch.data.TTSDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset

add_align_prior_matrix(**kwargs)[source]#
add_durations(**kwargs)[source]#
add_energy(**kwargs)[source]#
add_log_mel(**kwargs)[source]#
add_p_voiced(**kwargs)[source]#
add_pitch(**kwargs)[source]#
add_speaker_id(**kwargs)[source]#
add_voiced_mask(**kwargs)[source]#
static filter_files(data, ignore_file, min_duration, max_duration, total_duration)[source]#
general_collate_fn(batch)[source]#
get_log_mel(audio)[source]#
get_spec(audio)[source]#
join_data(data_dict)[source]#
class nemo.collections.tts.torch.data.VocoderDataset(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.dataset.Dataset