NeMo TTS Collection API

TTS Base Classes

The classes below are the base of the TTS pipeline. To read more about them, see the Base Classes section of the intro page.

class nemo.collections.tts.models.base.SpectrogramGenerator(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

Base class for all TTS models that turn text into a spectrogram

abstract generate_spectrogram(tokens: torch.tensor, **kwargs) torch.tensor[source]

Accepts a batch of text or text_tokens and returns a batch of spectrograms

Parameters

tokens – A torch tensor representing the text to be converted into a spectrogram

Returns

spectrograms

classmethod list_available_models() List[PretrainedModelInfo][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

abstract parse(str_input: str, **kwargs) torch.tensor[source]

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have two dimensions: the first is the batch, which should be of size 1; the second represents time. The tensor should contain either tokenized or embedded text, depending on the model.
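The parse contract above can be illustrated with a hypothetical character-level tokenizer (a torch-only sketch; the vocabulary and tokenization scheme are assumptions, not NeMo's actual front-end):

```python
import torch

# Hypothetical character-level vocabulary -- real models use their own
# tokenizers or embedding front-ends.
VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz '.,?!")}

def parse(str_input: str) -> torch.Tensor:
    """Turn a raw string into a [1, time] tensor of token IDs."""
    ids = [VOCAB[ch] for ch in str_input.lower() if ch in VOCAB]
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)  # batch dim of size 1

tokens = parse("Hello world!")  # tensor of shape [1, 12]
```

The resulting tensor can then be passed as the tokens argument of generate_spectrogram.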

class nemo.collections.tts.models.base.Vocoder(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

Base class for all TTS models that generate audio conditioned on a spectrogram

abstract convert_spectrogram_to_audio(spec: torch.tensor, **kwargs) torch.tensor[source]

Accepts a batch of spectrograms and returns a batch of audio

Parameters

spec – A torch tensor representing the spectrograms to be vocoded

Returns

audio
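A real vocoder such as HiFi-GAN learns the spectrogram-to-audio mapping; the toy function below only illustrates the shape contract between spectrogram frames and audio samples (the hop length and mel count are assumptions):

```python
import torch

HOP_LENGTH = 256  # assumed STFT hop: each spectrogram frame covers 256 samples

def convert_spectrogram_to_audio(spec: torch.Tensor) -> torch.Tensor:
    # spec: [batch, n_mels, n_frames]. Collapse the mel axis and upsample
    # each frame to HOP_LENGTH samples -- purely to show how frame counts
    # relate to sample counts, not a working vocoder.
    frames = spec.mean(dim=1)                            # [batch, n_frames]
    return frames.repeat_interleave(HOP_LENGTH, dim=1)   # [batch, n_frames * 256]

spec = torch.randn(2, 80, 100)  # batch of 2, 80 mel bins, 100 frames
audio = convert_spectrogram_to_audio(spec)  # shape [2, 25600]
```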

classmethod list_available_models() List[PretrainedModelInfo][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

class nemo.collections.tts.models.base.TextToWaveform(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.modelPT.ModelPT, abc.ABC

Base class for all end-to-end TTS models that generate a waveform from text

abstract convert_text_to_waveform(*, tokens: torch.tensor, **kwargs) List[torch.tensor][source]

Accepts a batch of text and returns a list containing a batch of audio

Parameters

tokens – A torch tensor representing the text to be converted to speech

Returns

A list of length batch_size containing torch tensors representing the waveform output

Return type

audio
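The output is a list rather than a stacked tensor because, once padding is removed, each utterance in a batch can have a different number of audio samples. A minimal sketch of that contract (the length computation is made up for illustration):

```python
import torch

def convert_text_to_waveform(tokens: torch.Tensor) -> list:
    # tokens: [batch, time]. Return one 1-D waveform per batch item; the
    # per-item lengths below are arbitrary, standing in for the variable
    # durations a real model would produce.
    batch_size, n_tokens = tokens.shape
    return [torch.zeros(n_tokens * 200 + i * 17) for i in range(batch_size)]

tokens = torch.randint(0, 30, (3, 12))
audio = convert_text_to_waveform(tokens)  # list of 3 waveforms of differing length
```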

classmethod list_available_models() List[PretrainedModelInfo][source]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

abstract parse(str_input: str, **kwargs) torch.tensor[source]

A helper function that accepts raw Python strings and turns them into a tensor. The tensor should have two dimensions: the first is the batch, which should be of size 1; the second represents time. The tensor should contain either tokenized or embedded text, depending on the model.

TTS Datasets

class nemo.collections.tts.data.datalayers.AudioDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.

class nemo.collections.tts.data.datalayers.MelAudioDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.

class nemo.collections.tts.data.datalayers.SplicedAudioDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.

class nemo.collections.tts.data.datalayers.NoisySpecsDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

static decollate_padded(batch: Dict[str, Any], idx: int) Dict[str, Any]

Select the idx-th item from the batch, remove the padded zeros, and return it. Important data such as x and y are converted to ndarray.

Parameters

batch – the collated, padded batch

idx – index of the item to extract

Returns

DataDict – values can be str or ndarray.
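A minimal sketch of that behaviour, assuming the collated batch dict carries a per-item "length" entry (the key name and dict layout are assumptions, not the actual implementation):

```python
import torch

def decollate_padded(batch, idx):
    # Pick the idx-th item out of a collated batch dict, trim its padding
    # using the per-item length, and convert padded tensors to ndarray.
    n = int(batch["length"][idx])
    item = {}
    for key, value in batch.items():
        if torch.is_tensor(value) and value.dim() >= 2:
            item[key] = value[idx, :n].numpy()   # drop the padded zeros
        else:
            item[key] = value[idx]
    return item

batch = {
    "x": torch.tensor([[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]]),  # padded to length 3
    "length": torch.tensor([2, 3]),
}
first = decollate_padded(batch, 0)  # first["x"] is the ndarray [1.0, 2.0]
```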

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.

tar_dir

A modified dataset for training Deep Griffin-Lim Iteration (DeGLI). Contains the magnitude STFT (mag), the STFT (y), and a noisy STFT which is used for the initial phase. By using different levels of noise, the DeGLI model can learn to improve any phase estimate, and thus it can be applied iteratively.

Parameters
  • destination (str, Path) – Path to a directory containing the main dataset folder, similar to the directory provided to the preprocessor script which generates this dataset.

  • subdir (str) – Either ‘train’ or ‘valid’, when using the standard script for generation.

  • n_fft (int) – STFT parameter. Also determines the STFT filter length.

  • hop_length (int) – STFT parameter.

  • num_snr (int) – Number of noisy samples per clean audio in the original dataset.
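The iterative phase refinement this dataset targets can be illustrated with the classic Griffin-Lim loop, which DeGLI replaces with a learned update (a torch-only sketch; the n_fft, hop_length, and iteration-count values are arbitrary):

```python
import torch

def griffin_lim(mag: torch.Tensor, n_fft: int = 512, hop_length: int = 128,
                n_iter: int = 8) -> torch.Tensor:
    """Recover a waveform from a magnitude STFT by iterative phase refinement."""
    window = torch.hann_window(n_fft)
    phase = torch.rand_like(mag) * 2 * torch.pi  # the "noisy" initial phase
    for _ in range(n_iter):
        spec = mag * torch.exp(1j * phase)       # combine magnitude with current phase
        audio = torch.istft(spec, n_fft, hop_length, window=window)
        phase = torch.stft(audio, n_fft, hop_length, window=window,
                           return_complex=True).angle()
    return torch.istft(mag * torch.exp(1j * phase), n_fft, hop_length, window=window)

mag = torch.stft(torch.randn(4096), 512, 128, window=torch.hann_window(512),
                 return_complex=True).abs()
audio = griffin_lim(mag, n_iter=4)  # 1-D waveform, 4096 samples
```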

class nemo.collections.asr.data.audio_to_text.FastPitchDataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.collections.asr.data.audio_to_text._AudioTextDataset

Dataset used for FastPitch that has both duration and pitch information per input char. See https://github.com/NVIDIA/NeMo/pull/1799 for information on how to extract duration and pitch information.

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.

class nemo.collections.tts.data.datalayers.FastSpeech2Dataset(*args: Any, **kwargs: Any)[source]

Bases: nemo.core.classes.dataset.Dataset

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Returns definitions of module output ports.