NeMo TTS Collection API#

Model Classes#

Mel-Spectrogram Generators#

class nemo.collections.tts.models.FastPitchModel(*args: Any, **kwargs: Any)[source]#

Bases: SpectrogramGenerator, Exportable, FastPitchAdapterModelMixin

FastPitch model (https://arxiv.org/abs/2006.06873) that is used to generate mel spectrograms from text.

configure_callbacks()[source]#
property disabled_deployment_input_names#

Implement this method to return a set of input names disabled for export

forward(*, text, durs=None, pitch=None, energy=None, speaker=None, pace=1.0, spec=None, attn_prior=None, mel_lens=None, input_lens=None, reference_spec=None, reference_spec_lens=None)[source]#
forward_for_export(text, pitch, pace, volume=None, batch_lengths=None, speaker=None)[source]#
generate_spectrogram(tokens: torch.tensor, speaker: Optional[int] = None, pace: float = 1.0, reference_spec: Optional[torch.tensor] = None, reference_spec_lens: Optional[torch.tensor] = None) torch.tensor[source]#

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

input_example(max_batch=1, max_dim=44)[source]#

Generates input examples for tracing etc. :returns: A tuple of input examples.

property input_types#

Define these to enable input neural type checks

interpolate_speaker(original_speaker_1, original_speaker_2, weight_speaker_1, weight_speaker_2, new_speaker_id)[source]#

This method performs speaker interpolation between two original speakers the model is trained on.

Inputs:

  • original_speaker_1 – Integer speaker ID of the first existing speaker in the model

  • original_speaker_2 – Integer speaker ID of the second existing speaker in the model

  • weight_speaker_1 – Floating point weight applied to the first speaker in the weighted combination

  • weight_speaker_2 – Floating point weight applied to the second speaker in the weighted combination

  • new_speaker_id – Integer speaker ID assigned to the new interpolated speaker in the model
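A minimal sketch of what such an interpolation presumably computes, namely a weighted combination of two learned speaker embedding vectors. The function name and list-based embeddings here are illustrative only, not the NeMo internals:

```python
# Illustrative sketch: interpolate two speaker embedding vectors.
# The actual NeMo implementation operates on the model's speaker
# embedding table; this only shows the weighted combination.

def interpolate_embeddings(emb_1, emb_2, weight_1, weight_2):
    """Return the element-wise weighted combination of two embeddings."""
    return [weight_1 * a + weight_2 * b for a, b in zip(emb_1, emb_2)]

# A 50/50 blend of two toy 3-dimensional speaker embeddings:
blended = interpolate_embeddings([1.0, 0.0, 2.0], [3.0, 4.0, 0.0], 0.5, 0.5)
# blended == [2.0, 2.0, 1.0]
```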

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

property output_types#

Define these to enable output neural type checks

parse(str_input: str, normalize=True) torch.tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function which will apply the normalizer if it is available.
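The shape contract described above (a batch dimension of size 1, then a time dimension) can be illustrated with a toy character tokenizer. This is not the NeMo parser, just a demonstration of the expected output shape:

```python
# Toy illustration of the parse() output contract: a 2-D structure of
# shape [1, T] holding token IDs. Not the actual NeMo tokenizer.

def toy_parse(text, vocab):
    """Map characters to IDs and wrap the sequence in a batch of size 1."""
    tokens = [vocab[ch] for ch in text if ch in vocab]
    return [tokens]  # shape [1, T]

vocab = {"h": 0, "i": 1, " ": 2}
batch = toy_parse("hi", vocab)
# batch == [[0, 1]]; len(batch) == 1 (batch dim), len(batch[0]) == 2 (time)
```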

property parser#
property tb_logger#
class nemo.collections.tts.models.MixerTTSModel(*args: Any, **kwargs: Any)[source]#

Bases: SpectrogramGenerator, Exportable

Mixer-TTS and Mixer-TTS-X models (https://arxiv.org/abs/2110.03584) that are used to generate mel spectrograms from text.

forward(text, text_len, pitch=None, spect=None, spect_len=None, attn_prior=None, lm_tokens=None)[source]#
forward_for_export(text, lm_tokens=None)[source]#
generate_spectrogram(tokens: Optional[torch.Tensor] = None, tokens_len: Optional[torch.Tensor] = None, lm_tokens: Optional[torch.Tensor] = None, raw_texts: Optional[List[str]] = None, norm_text_for_lm_model: bool = True, lm_model: str = 'albert')[source]#

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

infer(text, text_len=None, text_mask=None, spect=None, spect_len=None, attn_prior=None, use_gt_durs=False, lm_tokens=None, pitch=None)[source]#
input_example(max_text_len=10, max_lm_tokens_len=10)[source]#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function which will apply the normalizer if it is available.

run_aligner(text, text_len, text_mask, spect, spect_len, attn_prior)#
class nemo.collections.tts.models.RadTTSModel(cfg: omegaconf.DictConfig, trainer: pytorch_lightning.Trainer = None)[source]#

Bases: SpectrogramGenerator, Exportable

batch_dict(batch_data)[source]#
configure_optimizers()[source]#
forward_for_export(text, batch_lengths, speaker_id, speaker_id_text, speaker_id_attributes, pitch, pace, volume)[source]#
generate_spectrogram(tokens: torch.tensor, speaker: int = 0, sigma: float = 1.0) torch.tensor[source]#

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

input_example(max_batch=1, max_dim=400)[source]#
property input_types#

Define these to enable input neural type checks

load_state_dict(state_dict, strict=True)[source]#
property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=False) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function which will apply the normalizer if it is available.

property parser#
property tb_logger#
class nemo.collections.tts.models.Tacotron2Model(*args: Any, **kwargs: Any)[source]#

Bases: SpectrogramGenerator

Tacotron 2 Model that is used to generate mel spectrograms from text

forward(*, tokens, token_len, audio=None, audio_len=None)[source]#
generate_spectrogram(*, tokens)[source]#

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

property output_types#

Define these to enable output neural type checks

parse(text: str, normalize=True) torch.Tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function which will apply the normalizer if it is available.

property parser#
class nemo.collections.tts.models.SpectrogramEnhancerModel(*args: Any, **kwargs: Any)[source]#

Bases: ModelPT, Exportable

GAN-based model to add details to blurry spectrograms from TTS models like Tacotron or FastPitch. Based on StyleGAN 2 [1].

[1] Karras et al. - Analyzing and Improving the Image Quality of StyleGAN (https://arxiv.org/abs/1912.04958)

configure_optimizers()[source]#
forward(*, input_spectrograms: torch.Tensor, lengths: torch.Tensor, mixing: bool = False, normalize: bool = True)[source]#

Generator forward pass. Noise inputs will be generated.

input_spectrograms – batch of spectrograms, typically synthetic
lengths – length for every spectrogram in the batch
mixing – style mixing, usually True during training
normalize – normalize spectrogram range to ~[0, 1], True for normal use

returns: batch of enhanced spectrograms

For an explanation of style mixing, refer to [1].

[1] Karras et al. - A Style-Based Generator Architecture for Generative Adversarial Networks, 2018 (https://arxiv.org/abs/1812.04948)

forward_with_custom_noise(input_spectrograms: torch.Tensor, lengths: torch.Tensor, zs: Optional[List[torch.Tensor]] = None, ws: Optional[List[torch.Tensor]] = None, noise: Optional[torch.Tensor] = None, mixing: bool = False, normalize: bool = True)[source]#

Generator forward pass. Noise inputs will be generated if None.

input_spectrograms – batch of spectrograms, typically synthetic
lengths – length for every spectrogram in the batch
zs – latent noise inputs on the unit sphere (either this or ws or neither)
ws – latent noise inputs in the style space (either this or zs or neither)
noise – per-pixel independent Gaussian noise
mixing – style mixing, usually True during training
normalize – normalize spectrogram range to ~[0, 1], True for normal use

returns: batch of enhanced spectrograms

For an explanation of style mixing, refer to [1]. For definitions of z and w, see [2].

[1] Karras et al. - A Style-Based Generator Architecture for Generative Adversarial Networks, 2018 (https://arxiv.org/abs/1812.04948)
[2] Karras et al. - Analyzing and Improving the Image Quality of StyleGAN, 2019 (https://arxiv.org/abs/1912.04958)

generate_noise(batch_size: int = 1) torch.Tensor[source]#
generate_zs(batch_size: int = 1, mixing: bool = False)[source]#
classmethod list_available_models()[source]#

Should list all pre-trained models available via NVIDIA NGC cloud. Note: There is no check that requires model names and aliases to be unique. In the case of a collision, whatever model (or alias) is listed first in the returned list will be instantiated.

Returns

A list of PretrainedModelInfo entries

log_illustration(target_spectrograms, input_spectrograms, enhanced_spectrograms, lengths)[source]#
move_to_correct_device(e)[source]#
normalize_spectrograms(spectrogram: torch.Tensor, lengths: torch.Tensor) torch.Tensor[source]#
pad_spectrograms(spectrograms)[source]#
unnormalize_spectrograms(spectrogram: torch.Tensor, lengths: torch.Tensor) torch.Tensor[source]#

Speech-to-Text Aligner Models#

class nemo.collections.tts.models.AlignerModel(*args: Any, **kwargs: Any)[source]#

Bases: ModelPT

Speech-to-text alignment model (https://arxiv.org/pdf/2108.10447.pdf) that is used to learn alignments between mel spectrogram and text.

forward(*, spec, spec_len, text, text_len, attn_prior=None)[source]#
classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

Two-Stage Models#

class nemo.collections.tts.models.TwoStagesModel(*args: Any, **kwargs: Any)[source]#

Bases: Vocoder

Two-stage model used to convert mel spectrograms to linear spectrograms, and then to audio.

convert_spectrogram_to_audio(spec: torch.Tensor, **kwargs) torch.Tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)[source]#
PTL overrides this method and changes the default PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: Lightning-AI/lightning

Here we are overriding this to maintain the default PyTorch nn.Module behavior: pytorch/pytorch

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

forward(*, mel)[source]#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

property output_types#

Define these to enable output neural type checks

set_linear_vocoder(linvocoder: Vocoder)[source]#
set_mel_to_spec_model(mel2spec: MelToSpec)[source]#

Vocoders#

class nemo.collections.tts.models.GriffinLimModel(*args: Any, **kwargs: Any)[source]#

Bases: Vocoder

convert_spectrogram_to_audio(spec, Ts=None)[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

cuda(*args, **kwargs)[source]#
PTL overrides this method and changes the default PyTorch behavior of a module.

The PTL LightningModule override will move the module to device 0 if device is None. See the PTL method here: Lightning-AI/lightning

Here we are overriding this to maintain the default PyTorch nn.Module behavior: pytorch/pytorch

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing the optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Parameters

device (int, optional) – if specified, all parameters will be copied to that device

Returns

self

Return type

Module

class nemo.collections.tts.models.HifiGanModel(*args: Any, **kwargs: Any)[source]#

Bases: Vocoder, Exportable

HiFi-GAN model (https://arxiv.org/abs/2010.05646) that is used to generate audio from mel spectrogram.

configure_callbacks()[source]#
configure_optimizers()[source]#
convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio
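Putting a spectrogram generator and this vocoder together gives the typical two-stage inference pipeline. The following is a sketch; the checkpoint names are examples of pretrained models on NGC and may differ by NeMo release (use list_available_models() to see what your version offers), and running it requires the NeMo toolkit plus a model download:

```python
# Sketch of a typical two-stage TTS inference pipeline: FastPitch produces
# a mel spectrogram, HiFi-GAN vocodes it into a waveform.
# Checkpoint names are examples; check list_available_models() for the
# names available in your NeMo release.
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")

with torch.no_grad():
    tokens = spec_generator.parse("Hello world")          # shape [1, T_text]
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
```

Note that both from_pretrained calls download checkpoints from NGC on first use.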

forward(*, spec)[source]#

Runs the generator; for inputs and outputs, see input_types and output_types.

forward_for_export(spec)[source]#

Runs the generator; for inputs and outputs, see input_types and output_types.

static get_warmup_steps(max_steps, warmup_steps, warmup_ratio)[source]#
input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing etc. :returns: A tuple of input examples.

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

load_state_dict(state_dict, strict=True)[source]#
property max_steps#
on_train_epoch_end() None[source]#
property output_types#

Define these to enable output neural type checks

update_lr(interval='step')[source]#
class nemo.collections.tts.models.UnivNetModel(*args: Any, **kwargs: Any)[source]#

Bases: Vocoder, Exportable

UnivNet model (https://arxiv.org/abs/2106.07889) that is used to generate audio from mel spectrogram.

configure_optimizers()[source]#
convert_spectrogram_to_audio(spec: torch.tensor) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, spec)[source]#

Runs the generator; for inputs and outputs, see input_types and output_types.

forward_for_export(spec)[source]#

Runs the generator; for inputs and outputs, see input_types and output_types.

static get_warmup_steps(max_steps, warmup_steps, warmup_ratio)[source]#
input_example(max_batch=1, max_dim=256)[source]#

Generates input examples for tracing etc. :returns: A tuple of input examples.

property input_types#

Define these to enable input neural type checks

classmethod list_available_models() Optional[Dict[str, str]][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

property output_types#

Define these to enable output neural type checks

class nemo.collections.tts.models.WaveGlowModel(*args: Any, **kwargs: Any)[source]#

Bases: GlowVocoder, Exportable

WaveGlow model (https://arxiv.org/abs/1811.00002) that is used to generate audio from mel spectrogram.

convert_spectrogram_to_audio(spec: torch.Tensor, sigma: float = 1.0, denoise: bool = True, denoiser_strength: float = 0.01) torch.Tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

forward(*, audio, audio_len, run_inverse=True)[source]#
forward_for_export(spec, z=None)[source]#
property input_module#
property input_types#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

load_state_dict(state_dict, strict=True)[source]#
property mode#
property output_module#
property output_types#

Define these to enable output neural type checks

Codecs#

class nemo.collections.tts.models.AudioCodecModel(cfg: omegaconf.DictConfig, trainer: pytorch_lightning.Trainer = None)[source]#

Bases: ModelPT

configure_callbacks()[source]#
configure_optimizers()[source]#
decode(tokens: torch.Tensor, tokens_len: torch.Tensor) Tuple[torch.Tensor, torch.Tensor][source]#

Convert discrete tokens into a continuous time-domain signal.

Parameters
  • tokens – discrete tokens for each codebook for each time frame, shape (batch, number of codebooks, number of frames)

  • tokens_len – valid lengths, shape (batch,)

Returns

Decoded output audio in the time domain and its length in number of samples audio_len. Note that audio_len will be a multiple of self.samples_per_frame.

decode_audio(inputs: torch.Tensor, input_len: torch.Tensor) Tuple[torch.Tensor, torch.Tensor][source]#

Apply the decoder to the input. Note that the input is a non-quantized encoder output or a dequantized representation.

Parameters
  • inputs – encoded signal

  • input_len – valid length for each example in the batch

Returns

Decoded output audio in the time domain and its length in number of samples audio_len. Note that audio_len will be a multiple of self.samples_per_frame.

dequantize(tokens: torch.Tensor, tokens_len: torch.Tensor) torch.Tensor[source]#

Convert the discrete tokens into a continuous encoded representation.

Parameters
  • tokens – discrete tokens for each codebook for each time frame

  • tokens_len – valid length of each example in the batch

Returns

Continuous encoded representation of the discrete input representation.

property disc_update_prob: float#

Probability of updating the discriminator.

encode(audio: torch.Tensor, audio_len: torch.Tensor) Tuple[torch.Tensor, torch.Tensor][source]#

Convert input time-domain audio signal into a discrete representation (tokens).

Parameters
  • audio – input time-domain signal, shape (batch, number of samples)

  • audio_len – valid length for each example in the batch, shape (batch size,)

Returns

Tokens for each codebook for each frame, shape (batch, number of codebooks, number of frames), and the corresponding valid lengths, shape (batch,)

encode_audio(audio: torch.Tensor, audio_len: torch.Tensor) Tuple[torch.Tensor, torch.Tensor][source]#

Apply the encoder to the input audio signal. The input will be padded with zeros so the last frame has a full self.samples_per_frame samples.

Parameters
  • audio – input time-domain signal

  • audio_len – valid length for each example in the batch

Returns

Encoder output encoded and its length in number of frames encoded_len

forward(audio: torch.Tensor, audio_len: torch.Tensor) Tuple[torch.Tensor, torch.Tensor][source]#

Apply the encoder, quantizer, and decoder to the input time-domain signal.

Parameters
  • audio – input time-domain signal

  • audio_len – valid length for each example in the batch

Returns

Reconstructed time-domain signal output_audio and its length in number of samples output_audio_len.
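The encode/decode round trip documented above can be sketched as follows. The checkpoint name here is a placeholder, not a real model name (use list_available_models() to find available checkpoints), and running this requires the NeMo toolkit and a model download:

```python
# Sketch of an encode/decode round trip with a neural audio codec.
# "audio_codec_checkpoint" is a placeholder, not a real checkpoint name.
import torch
from nemo.collections.tts.models import AudioCodecModel

codec = AudioCodecModel.from_pretrained("audio_codec_checkpoint")

audio = torch.randn(1, 16000)            # (batch, number of samples)
audio_len = torch.tensor([16000])        # valid lengths, shape (batch,)

with torch.no_grad():
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)
    # tokens: (batch, number of codebooks, number of frames)
    reconstructed, reconstructed_len = codec.decode(
        tokens=tokens, tokens_len=tokens_len
    )
    # reconstructed_len is a multiple of codec.samples_per_frame
```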

get_dataset(cfg)[source]#
classmethod list_available_models() List[PretrainedModelInfo][source]#

Should list all pre-trained models available via NVIDIA NGC cloud. Note: There is no check that requires model names and aliases to be unique. In the case of a collision, whatever model (or alias) is listed first in the returned list will be instantiated.

Returns

A list of PretrainedModelInfo entries

property max_steps#
on_train_epoch_end()[source]#
pad_audio(audio, audio_len)[source]#

Zero pad the end of the audio so that we do not have a partial end frame. The output will be zero-padded to have an integer number of frames of length self.samples_per_frame.

Parameters
  • audio – input time-domain signal

  • audio_len – valid length for each example in the batch

Returns

Padded time-domain signal padded_audio and its length padded_len.
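The padding arithmetic described above amounts to rounding the length up to the next multiple of samples_per_frame. A pure-Python sketch of that calculation (the NeMo implementation operates on batched tensors):

```python
import math

def padded_length(audio_len, samples_per_frame):
    """Round a length in samples up to a whole number of frames."""
    num_frames = math.ceil(audio_len / samples_per_frame)
    return num_frames * samples_per_frame

# With 160 samples per frame, 1000 samples pad out to 7 full frames:
assert padded_length(1000, 160) == 1120   # 7 * 160
assert padded_length(1120, 160) == 1120   # already frame-aligned, no padding
```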

quantize(encoded: torch.Tensor, encoded_len: torch.Tensor) torch.Tensor[source]#

Quantize the continuous encoded representation into a discrete representation for each frame.

Parameters
  • encoded – encoded signal representation

  • encoded_len – valid length of the encoded representation in frames

Returns

A tensor of tokens for each codebook for each frame.

should_update_disc(batch_idx) bool[source]#

Decide whether to update the discriminator based on the batch index and the configured discriminator update period.
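A plausible sketch of this decision, assuming the model is configured with an integer update period; the actual implementation reads the period from the model config and may also factor in the disc_update_prob property documented above:

```python
# Illustrative sketch: update the discriminator once every `period` batches.
# Not the exact NeMo logic, which takes the period from the model config.

def should_update_disc(batch_idx, period):
    return batch_idx % period == 0

# With a period of 2, the discriminator updates on even batch indices:
assert should_update_disc(0, 2) is True
assert should_update_disc(3, 2) is False
```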

update_lr(interval='step')[source]#

Base Classes#

The classes below are the base of the TTS pipeline. To read more about them, see the Base Classes section of the intro page.

class nemo.collections.tts.models.base.MelToSpec(*args: Any, **kwargs: Any)[source]#

Bases: ModelPT, ABC

A base class for models that convert mel spectrograms to linear (magnitude) spectrograms

abstract convert_mel_spectrogram_to_linear(mel: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of mel spectrograms and returns a batch of linear spectrograms

Parameters

mel – A torch tensor representing the mel spectrograms [‘B’, ‘mel_freqs’, ‘T’]

Returns

A torch tensor representing the linear spectrograms [‘B’, ‘n_freqs’, ‘T’]

Return type

spec

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)[source]#

Bases: ModelPT, ABC

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be generated

Returns

spectrograms

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

abstract parse(str_input: str, **kwargs) torch.tensor[source]#

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have 2 dimensions. The first is the batch, which should be of size 1. The second should represent time. The tensor should represent either tokenized or embedded text, depending on the model.

Note that some models have a normalize parameter in this function which will apply the normalizer if it is available.

set_export_config(args)[source]#
class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)[source]#

Bases: ModelPT, ABC

A base class for models that convert spectrograms to audio. Note that this class takes as input either linear or mel spectrograms.

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) torch.tensor[source]#

Accepts a batch of spectrograms and returns a batch of audio.

Parameters

spec – [‘B’, ‘n_freqs’, ‘T’], A torch tensor representing the spectrograms to be vocoded.

Returns

audio

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud. :returns: List of available pre-trained models.

Dataset Processing Classes#

class nemo.collections.tts.data.dataset.MixerTTSXDataset(*args: Any, **kwargs: Any)[source]#

Bases: TTSDataset

add_lm_tokens(**kwargs)[source]#
class nemo.collections.tts.data.dataset.TTSDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

add_align_prior_matrix(**kwargs)[source]#
add_durations(**kwargs)[source]#
add_energy(**kwargs)[source]#
add_log_mel(**kwargs)[source]#
add_p_voiced(**kwargs)[source]#
add_pitch(**kwargs)[source]#
add_reference_audio(**kwargs)[source]#
add_speaker_id(**kwargs)[source]#
add_voiced_mask(**kwargs)[source]#
static filter_files(data, ignore_file, min_duration, max_duration, total_duration)[source]#
general_collate_fn(batch)[source]#
get_log_mel(audio)[source]#
get_spec(audio)[source]#
join_data(data_dict)[source]#
pitch_shift(audio, sr, rel_audio_path_as_text_id)[source]#
class nemo.collections.tts.data.dataset.VocoderDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset