Spectrogram Generator

Spectrogram Generator models take text input and generate a Mel spectrogram. There are several spectrogram generator architectures; TAO Toolkit supports the FastPitch architecture.

The FastPitch model generates Mel spectrograms and predicts a pitch contour from raw input text. It allows additional control over synthesized utterances through the following options:

  • Modify the pitch contour to control the prosody.

  • Increase or decrease the fundamental frequency in a natural way, which preserves the perceived identity of the speaker.

  • Alter the rate of speech.

  • Specify input as graphemes or phonemes.

  • Switch speakers (if the model has been trained with data from multiple speakers).

The following tasks have been implemented for FastPitch in the TAO Toolkit:

  • download_specs

  • dataset_convert

  • train

  • infer

  • export

Downloading Sample Spec Files

Example specification files for all the tasks associated with the spectrogram generator component of TTS can be downloaded using the following command:

tao spectro_gen download_specs   \
                -o <target_path> \
                -r <results_path>

Required Arguments

  • -o: The target path where the spec files will be stored

  • -r: The results and output log directory

Preparing the Dataset

The spectrogram generator for TAO Toolkit implements the dataset_convert task to convert and prepare datasets that follow the LJSpeech dataset format.

The dataset_convert task generates manifest files and .txt files with normalized transcripts.

The dataset for TTS consists of a set of utterances in individual audio files (.wav) and a manifest that describes the dataset, with information about one utterance per line (.json).

Each line of the manifest should be in the following format:

{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}

The audio_filepath field should provide an absolute path to the .wav file corresponding to the utterance. The text field should contain the full transcript for the utterance, and the duration field should reflect the duration of the utterance in seconds.

Each entry in the manifest (describing one audio file) should be bordered by { and } and must be contained on one line. The fields that describe the file should be separated by commas and have the form "field_name": value, as shown above.

Since the manifest specifies the path for each utterance, the audio files do not have to be located in the same directory as the manifest, or even in any specific directory structure.
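
If you assemble a manifest by hand rather than with dataset_convert, the entries can be generated with a short script. The following is a minimal sketch (not part of TAO Toolkit; the utterance list is a hypothetical placeholder) that computes each duration with librosa and writes one JSON object per line:

# Minimal sketch: write manifest lines in the format described above.
import json

import librosa

# Hypothetical (audio path, transcript) pairs.
utterances = [
    ("/path/to/audio.wav", "the transcription of the utterance"),
]

with open("manifest.json", "w") as manifest:
    for audio_filepath, text in utterances:
        # Duration of the utterance in seconds
        # (newer librosa versions take path= instead of filename=).
        duration = librosa.get_duration(filename=audio_filepath)
        entry = {
            "audio_filepath": audio_filepath,
            "text": text,
            "duration": duration,
        }
        manifest.write(json.dumps(entry) + "\n")  # one entry per line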

Creating an Experiment Spec File

The spec file for TTS using FastPitch includes the trainer, model, train_dataset, validation_dataset, and prior_folder components.

The following is a shortened example of a spec file for training on the LJSpeech dataset.

sample_rate: 22050
train_dataset: ???
validation_dataset: ???
prior_folder: ???

model:
  learn_alignment: true
  n_speakers: 1
  symbols_embedding_dim: 384
  max_token_duration: 75
  n_mel_channels: 80
  pitch_embedding_kernel_size: 3
  n_window_size: 1024
  n_window_stride: 256
  pitch_fmin: 80
  pitch_fmax: 640
  pitch_avg: 211.27540199742586
  pitch_std: 52.1851002822779

  train_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset"
      manifest_filepath: ${train_dataset}
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      trim: false
      sup_data_path: ${prior_folder}
      n_window_stride: ${model.n_window_stride}
      n_window_size: ${model.n_window_size}
      pitch_fmin: ${model.pitch_fmin}
      pitch_fmax: ${model.pitch_fmax}
      pitch_avg: ${model.pitch_avg}
      pitch_std: ${model.pitch_std}
      vocab:
        notation: phonemes
        punct: true
        spaces: true
        stresses: true
        add_blank_at: None
        pad_with_space: True
        chars: true
        improved_version_g2p: true
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 32
      num_workers: 12

  validation_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset"
      manifest_filepath: ${validation_dataset}
      max_duration: null
      min_duration: null
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      trim: false
      sup_data_path: ${prior_folder}
      n_window_stride: ${model.n_window_stride}
      n_window_size: ${model.n_window_size}
      pitch_fmin: ${model.pitch_fmin}
      pitch_fmax: ${model.pitch_fmax}
      pitch_avg: ${model.pitch_avg}
      pitch_std: ${model.pitch_std}
      vocab:
        notation: phonemes
        punct: true
        spaces: true
        stresses: true
        add_blank_at: None
        pad_with_space: True
        chars: true
        improved_version_g2p: true
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 32
      num_workers: 8

  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    dither: 0.0
    features: ${model.n_mel_channels}
    frame_splicing: 1
    highfreq: 8000
    log: true
    log_zero_guard_type: add
    log_zero_guard_value: 1e-05
    lowfreq: 0
    mag_power: 1.0
    n_fft: ${model.n_window_size}
    n_window_size: ${model.n_window_size}
    n_window_stride:  ${model.n_window_stride}
    normalize: null
    pad_to: 1
    pad_value: 0
    preemph: null
    sample_rate: ${sample_rate}
    window: hann
    window_size: null
    window_stride: null

  input_fft: #n_embed and padding_idx are added by the model
    _target_: nemo.collections.tts.modules.transformer.FFTransformerEncoder
    n_layer: 6
    n_head: 1
    d_model: ${model.symbols_embedding_dim}
    d_head: 64
    d_inner: 1536
    kernel_size: 3
    dropout: 0.1
    dropatt: 0.1
    dropemb: 0.0
    d_embed: ${model.symbols_embedding_dim}

  output_fft:
    _target_: nemo.collections.tts.modules.transformer.FFTransformerDecoder
    n_layer: 6
    n_head: 1
    d_model: ${model.symbols_embedding_dim}
    d_head: 64
    d_inner: 1536
    kernel_size: 3
    dropout: 0.1
    dropatt: 0.1
    dropemb: 0.0

  alignment_module:
    _target_: nemo.collections.tts.modules.aligner.AlignmentEncoder
    n_text_channels: ${model.symbols_embedding_dim}

  duration_predictor:
    _target_: nemo.collections.tts.modules.fastpitch.TemporalPredictor
    input_size: ${model.symbols_embedding_dim}
    kernel_size: 3
    filter_size: 256
    dropout: 0.1
    n_layers: 2

  pitch_predictor:
    _target_: nemo.collections.tts.modules.fastpitch.TemporalPredictor
    input_size: ${model.symbols_embedding_dim}
    kernel_size: 3
    filter_size: 256
    dropout: 0.1
    n_layers: 2

  optim:
    name: lamb
    lr: 1e-1
    betas: [0.9, 0.98]
    weight_decay: 1e-6

    sched:
      name: NoamAnnealing
      warmup_steps: 1000
      last_epoch: -1
      d_model: 1  # Disable scaling based on model dim

trainer:
  max_epochs: 100

The specification can be roughly grouped into three categories:

  • Parameters to configure the trainer

  • Parameters that describe the model

  • Parameters to configure the experiment

This specification can be used with the tao spectro_gen train command.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, if you would like to change the validation batch size, you can add model.validation_ds.dataloader_params.batch_size=1 to your command, which would override the batch size of 32 in the configuration shown above. An example of this is shown in the training instructions below.

Configuring the Trainer

The following parameter is used to configure the trainer element of the Spectrogram Generator.

  • max_epochs (int): The maximum number of epochs to train the model. Supported values: >0

Configuring the Model

The parameters to help configure the FastPitch model are included in the model element. This includes parameters for configuring the following elements:

  1. dataset_config

  2. preprocessor

  3. input_fft

  4. output_fft

  5. alignment_module

  6. duration_predictor

  7. pitch_predictor

  8. optimizer

There are also some global parameters:

  • learn_alignment (bool): Whether to enable learning the alignment. Supported values: True/False

  • n_speakers (int): The number of speakers in the dataset

  • symbols_embedding_dim (int): The dimension of the symbols embedding

  • max_token_duration (int): The maximum duration to clamp the tokens to

  • pitch_embedding_kernel_size (int): The kernel size of the Conv1d layer generating pitch embeddings

  • pitch_fmin (float): The fmin input to librosa.pyin. The default value is librosa.note_to_hz('C2').

  • pitch_fmax (float): The fmax input to librosa.pyin. The default value is librosa.note_to_hz('C7').

  • pitch_avg (float): The average used to normalize the pitch

  • pitch_std (float): The standard deviation used to normalize the pitch

  • n_window_stride (int): The stride of the window for FFT, in samples

  • n_window_size (int): The size of the window for FFT, in samples

  • n_mel_channels (int): The number of Mel channels to output

Dataset Configs

The datasets that you use should be specified by <xyz>_ds parameters, depending on the use case:

  • For training using tao spectro_gen train, you should have train_ds to describe your training dataset and validation_ds to describe your validation dataset.

Each <xyz>_ds config contains two main groups of configuration:

  • dataset: The configuration component describing the dataset

  • dataloader_params: The configuration component describing the dataloader

The configurable fields for the dataset component are described below:

  • manifest_filepath (string): The filepath to the manifest (.json file) that describes the audio data. Supported values: valid filepaths

  • min_duration (float): All files with a duration less than the given value (in seconds) will be dropped. The default value is 0.1.

  • max_duration (float): All files with a duration greater than the given value (in seconds) will be dropped.

  • sample_rate (int): The target sample rate to load the audio, in Hz

  • trim (bool): Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). The default value is False. Supported values: True/False

  • int_values (bool): If true, loads samples as 32-bit integers. The default value is False. Supported values: True/False

  • n_window_stride (int): The stride of the window for FFT, in samples

  • n_window_size (int): The size of the window for FFT, in samples

  • normalize (bool): Whether to normalize the transcript text. Supported values: True/False

  • pitch_fmin (float): The fmin input to librosa.pyin. The default value is librosa.note_to_hz('C2').

  • pitch_fmax (float): The fmax input to librosa.pyin. The default value is librosa.note_to_hz('C7').

  • pitch_avg (float): The average used to normalize the pitch

  • pitch_std (float): The standard deviation used to normalize the pitch

Note

The pitch_avg and pitch_std parameters provided by default are calculated for the LJSpeech dataset. These values must be re-calculated per speaker.

Similarly, the pitch_fmin and pitch_fmax values need to be adjusted based on the dataset. The default values may result in poor behavior.
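
One way to re-compute these statistics for a new speaker is to run librosa.pyin over the training audio and aggregate the voiced pitch estimates. The following is a minimal sketch (not part of TAO Toolkit; the dataset path is a hypothetical placeholder):

# Minimal sketch: estimate per-speaker pitch statistics with librosa.pyin
# so that pitch_avg and pitch_std can be set for a new dataset.
import glob

import librosa
import numpy as np

f0_values = []
for wav in glob.glob("/path/to/wavs/*.wav"):  # hypothetical dataset location
    audio, sr = librosa.load(wav, sr=22050)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        audio,
        fmin=librosa.note_to_hz("C2"),  # default pitch_fmin
        fmax=librosa.note_to_hz("C7"),  # default pitch_fmax
        sr=sr,
    )
    f0_values.append(f0[~np.isnan(f0)])  # keep voiced frames only

f0_all = np.concatenate(f0_values)
print("pitch_avg:", f0_all.mean())
print("pitch_std:", f0_all.std())

Inspecting the minimum and maximum of f0_all can also suggest tighter pitch_fmin and pitch_fmax bounds for the speaker.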

Vocabulary

This subsection under the dataset component of the <xyz>_ds config defines the configurable fields to generate a vocabulary.

  • notation (str): Either 'chars' or 'phonemes' as the general notation. Supported values: chars/phonemes

  • punct (bool): Whether to reserve graphemes for basic punctuation. Supported values: True/False

  • spaces (bool): Whether to prepend spaces to every punctuation symbol. Supported values: True/False

  • chars (bool): Whether to additionally use chars together with phonemes. Supported values: True/False

  • add_blank_at (str): Adds a blank to the labels in the specified order; if this string is empty, there will be no blank in the labels. Supported values: last/last_but_one/None

  • pad_with_space (bool): Whether to pad the text with spaces at the beginning and at the end. Supported values: True/False

  • improved_version_g2p (bool): Whether to use the new version of g2p. Supported values: True/False

Dataloader

  • num_workers (integer): The number of worker threads for loading the dataset

  • shuffle (bool): Whether to shuffle the data. We recommend True for training data and False for validation data. Supported values: True/False

  • batch_size (integer): The training data batch size
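
In NeMo-based models such as this one, the dataloader_params group is passed through to the underlying PyTorch DataLoader. The following is a rough sketch of that mapping (the DummyUtterances dataset is a hypothetical stand-in for the configured dataset):

# Rough sketch: how dataloader_params map onto torch.utils.data.DataLoader.
from torch.utils.data import DataLoader, Dataset

class DummyUtterances(Dataset):
    """Hypothetical stand-in for the dataset configured under `dataset:`."""
    def __len__(self):
        return 100
    def __getitem__(self, idx):
        return idx

dataloader_params = {
    "drop_last": False,
    "shuffle": True,
    "batch_size": 32,
    "num_workers": 12,
}
loader = DataLoader(DummyUtterances(), **dataloader_params)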

Preprocessor Config

  • dither (float): The amount of white-noise dithering. Supported values: >= 0

  • features (int): The number of Mel spectrogram frequency bins to output. Derived from model.n_mel_channels

  • frame_splicing (int): The number of spectrogram frames per model step

  • highfreq (int): The upper bound on the mel basis, in Hz

  • log (bool): Whether to take the log of the spectrogram

  • log_zero_guard_type (str): How to avoid taking the log of zero. There are two options: "add" or "clamp".

  • log_zero_guard_value (float or str): The value to add with or clamp to, as required by log_zero_guard_type. Can be a float, or "tiny" or "eps", in which case torch.finfo is used.

  • lowfreq (int): The lower bound on the mel basis, in Hz

  • mag_power (int): The power that the linear spectrogram is raised to prior to multiplication with the mel basis

  • n_fft (int): The size of the window for FFT, in samples. Derived from model.n_window_size

  • n_window_size (int): The size of the window for FFT, in samples. Use one of window_size or n_window_size. Derived from model.n_window_size

  • n_window_stride (int): The stride of the window for FFT, in samples. Use one of window_stride or n_window_stride. Derived from model.n_window_stride

  • normalize (str): How to normalize the output. 'all_features' normalizes the entire spectrogram, while 'per_feature' normalizes per channel/frequency instead; other options disable feature normalization.

  • pad_to (int): Pads the output so that the size of the time dimension is a multiple of pad_to

  • pad_value (float): The value that shorter mels are padded with

  • preemph (float): The amount of pre-emphasis to add to the audio. Can be disabled by passing None.

  • sample_rate (int): The target sample rate to load the audio, in Hz. Derived from sample_rate

  • window (string): The windowing function for FFT. Supported values: one of ['hann', 'hamming', 'blackman', 'bartlett']

  • window_size (int): The size of the window for FFT, in seconds

  • window_stride (int): The stride of the window for FFT, in seconds
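
For intuition, the preprocessor settings above correspond roughly to the following librosa computation. This is an illustrative approximation only, not how NeMo's AudioToMelSpectrogramPreprocessor is implemented, and details such as padding and normalization will differ:

# Rough librosa equivalent of the mel spectrogram settings above.
import librosa
import numpy as np

audio, sr = librosa.load("sample.wav", sr=22050)  # sample_rate
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=1024,        # n_fft / n_window_size
    hop_length=256,    # n_window_stride
    window="hann",     # window
    power=1.0,         # mag_power
    n_mels=80,         # features / n_mel_channels
    fmin=0,            # lowfreq
    fmax=8000,         # highfreq
)
log_mel = np.log(mel + 1e-5)  # log=true, log_zero_guard_type=add, value=1e-05
print(log_mel.shape)          # (80, n_frames)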

Input / Output FFT

  • n_layer (int): The number of Transformer layers

  • n_head (int): The number of heads in the MultiHeadAttn module

  • d_model (int): The hidden size of the input and output. Derived from model.symbols_embedding_dim

  • d_head (int): The hidden size of the attention module

  • d_inner (int): The hidden size of the conv layer

  • kernel_size (int): The kernel size of the conv layer

  • dropout (float): The dropout parameter

  • dropatt (float): The dropout parameter for attention

  • d_embed (int): The hidden size of the embeddings (input FFT only). Derived from model.symbols_embedding_dim
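
Each FFT here is a feed-forward Transformer block: multi-head self-attention followed by a 1-D convolutional feed-forward network. The following PyTorch sketch shows how the parameters above fit together. It is a simplification, not NeMo's FFTransformer implementation; in particular, d_head is folded into d_model / n_head, and dropemb and positional embeddings are omitted:

# Simplified sketch of one feed-forward Transformer (FFT) layer:
# self-attention plus a convolutional feed-forward block, each with a
# residual connection and layer normalization.
import torch
import torch.nn as nn

class FFTLayer(nn.Module):
    def __init__(self, d_model=384, n_head=1, d_inner=1536,
                 kernel_size=3, dropout=0.1, dropatt=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropatt,
                                          batch_first=True)
        self.conv_ff = nn.Sequential(
            nn.Conv1d(d_model, d_inner, kernel_size,
                      padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_inner, d_model, kernel_size,
                      padding=kernel_size // 2),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        ff = self.conv_ff(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + self.dropout(ff))

layer = FFTLayer()
out = layer(torch.randn(2, 50, 384))  # output shape matches input: (2, 50, 384)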

Alignment Module

  • n_text_channels (int): Should match d_model

Duration Predictor / Pitch Predictor

A simple stack of conv, relu, layernorm, and dropout layers; a minimal sketch of this stack appears after the parameter list below.

  • input_size (int): Should match d_model. Derived from model.symbols_embedding_dim

  • kernel_size (int): The kernel size for the conv layers

  • filter_size (int): The filter size for the conv layers

  • dropout (float): The dropout parameter

  • n_layers (int): The number of layers
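
The following PyTorch sketch illustrates the stack described above. It is a simplified stand-in for NeMo's TemporalPredictor, not the actual implementation:

# Simplified sketch of a temporal predictor: n_layers of
# (Conv1d -> ReLU -> LayerNorm -> Dropout), then a Linear projection
# to one scalar (duration or pitch) per time step.
import torch
import torch.nn as nn

class ConvReLUNorm(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dropout):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.norm = nn.LayerNorm(out_ch)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channels, time)
        y = torch.relu(self.conv(x))
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)  # norm over channels
        return self.dropout(y)

class SimpleTemporalPredictor(nn.Module):
    def __init__(self, input_size=384, filter_size=256, kernel_size=3,
                 dropout=0.1, n_layers=2):
        super().__init__()
        channels = [input_size] + [filter_size] * n_layers
        self.layers = nn.Sequential(*[
            ConvReLUNorm(channels[i], channels[i + 1], kernel_size, dropout)
            for i in range(n_layers)
        ])
        self.fc = nn.Linear(filter_size, 1)

    def forward(self, x):  # x: (batch, time, input_size)
        y = self.layers(x.transpose(1, 2))  # (batch, filter_size, time)
        return self.fc(y.transpose(1, 2)).squeeze(-1)  # (batch, time)

predictor = SimpleTemporalPredictor()
out = predictor(torch.randn(2, 50, 384))  # one prediction per input token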

Training the Model

To train a model from scratch, use the following command:

tao spectro_gen train \
                -e <experiment_spec> \
                -g <num_gpus> \
                -r /path/to/the/results/directory \
                -k <encryption_key>

As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command line arguments at runtime.

For example, the following command can be used to override the training manifest and validation manifest, the number of epochs to train, and the place to save the model checkpoint:

tao spectro_gen train \
    -e $SPECS_DIR/spectro_gen/train.yaml \
    -g 1 \
    -k $KEY \
    -r $RESULTS_DIR/spectro_gen/train \
    train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
    validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \
    prior_folder=$RESULTS_DIR/spectro_gen/train/prior_folder \
    trainer.max_epochs=5

Required Arguments

  • -e: The experiment specification file to set up training, as in the example given above

  • -r: The path to the results and log directory. Log files, checkpoints, etc., will be stored here

  • -k: The key to encrypt the model

  • Other arguments to override fields in the specification file.

Optional Arguments

  • -g: The number of GPUs to be used in the training in a multi-GPU scenario. The default value is 1.

Training Procedure

At the start of each training experiment, TAO Toolkit will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, how many hours are in each loaded dataset, and how much of each dataset has been filtered.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
[NeMo W 2021-10-29 21:29:06 exp_manager:414] Exp_manager is logging to /results/spectro_gen/train, but it already exists.
[NeMo W 2021-10-29 21:29:06 exp_manager:332] There was no checkpoint folder at checkpoint_dir :/results/spectro_gen/train/checkpoints. Training from scratch.
[NeMo I 2021-10-29 21:29:06 exp_manager:220] Experiments will be logged at /results/spectro_gen/train
[NeMo I 2021-10-29 21:29:06 exp_manager:569] TensorboardLogger has been set up
[NeMo W 2021-10-29 21:29:06 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:240: LightningDeprecationWarning: `ModelCheckpoint(every_n_val_epochs)` is deprecated in v1.4 and will be removed in v1.6. Please use `every_n_epochs` instead.
      rank_zero_deprecation(

[NeMo I 2021-10-29 21:29:12 collections:173] Dataset loaded with 12500 files totalling 22.84 hours
[NeMo I 2021-10-29 21:29:12 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-10-29 21:29:40 collections:173] Dataset loaded with 100 files totalling 0.18 hours
[NeMo I 2021-10-29 21:29:40 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-10-29 21:29:42 features:252] PADDING: 1
[NeMo I 2021-10-29 21:29:42 features:269] STFT using torch
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
-------------------------------------------------------------------------------------------------

You should next see a full printout of the number of parameters in each module and submodule, as well as the total number of trainable and non-trainable parameters in the model.

In the following table, the fastpitch module contains 45.8 million parameters, and its submodule fastpitch.encoder contains 21.9 million parameters. The ReLU, PositionalEmbedding, and Dropout modules are listed with no parameters.

    | Name                                           | Type                              | Params
-------------------------------------------------------------------------------------------------------
0   | mel_loss                                       | MelLoss                           | 0
1   | pitch_loss                                     | PitchLoss                         | 0
2   | duration_loss                                  | DurationLoss                      | 0
3   | aligner                                        | AlignmentEncoder                  | 1.0 M
4   | aligner.softmax                                | Softmax                           | 0
5   | aligner.log_softmax                            | LogSoftmax                        | 0
6   | aligner.key_proj                               | Sequential                        | 947 K
7   | aligner.key_proj.0                             | ConvNorm                          | 885 K
8   | aligner.key_proj.0.conv                        | Conv1d                            | 885 K
9   | aligner.key_proj.1                             | ReLU                              | 0
10  | aligner.key_proj.2                             | ConvNorm                          | 61.5 K
11  | aligner.key_proj.2.conv                        | Conv1d                            | 61.5 K
12  | aligner.query_proj                             | Sequential                        | 57.9 K
13  | aligner.query_proj.0                           | ConvNorm                          | 38.6 K
14  | aligner.query_proj.0.conv                      | Conv1d                            | 38.6 K
15  | aligner.query_proj.1                           | ReLU                              | 0
16  | aligner.query_proj.2                           | ConvNorm                          | 12.9 K
17  | aligner.query_proj.2.conv                      | Conv1d                            | 12.9 K
18  | aligner.query_proj.3                           | ReLU                              | 0
19  | aligner.query_proj.4                           | ConvNorm                          | 6.5 K
20  | aligner.query_proj.4.conv                      | Conv1d                            | 6.5 K
21  | forward_sum_loss                               | ForwardSumLoss                    | 0
22  | forward_sum_loss.log_softmax                   | LogSoftmax                        | 0
23  | forward_sum_loss.ctc_loss                      | CTCLoss                           | 0
24  | bin_loss                                       | BinLoss                           | 0
25  | preprocessor                                   | AudioToMelSpectrogramPreprocessor | 0
26  | preprocessor.featurizer                        | FilterbankFeatures                | 0
27  | fastpitch                                      | FastPitchModule                   | 45.8 M
28  | fastpitch.encoder                              | FFTransformerEncoder              | 21.9 M
29  | fastpitch.encoder.pos_emb                      | PositionalEmbedding               | 0
30  | fastpitch.encoder.drop                         | Dropout                           | 0
31  | fastpitch.encoder.layers                       | ModuleList                        | 21.8 M
32  | fastpitch.encoder.layers.0                     | TransformerLayer                  | 3.6 M
33  | fastpitch.encoder.layers.0.dec_attn            | MultiHeadAttn                     | 99.3 K
34  | fastpitch.encoder.layers.0.dec_attn.qkv_net    | Linear                            | 73.9 K
35  | fastpitch.encoder.layers.0.dec_attn.drop       | Dropout                           | 0
36  | fastpitch.encoder.layers.0.dec_attn.dropatt    | Dropout                           | 0
37  | fastpitch.encoder.layers.0.dec_attn.o_net      | Linear                            | 24.6 K
..
..
213 | fastpitch.duration_predictor.layers.1.norm     | LayerNorm                         | 512
214 | fastpitch.duration_predictor.layers.1.dropout  | Dropout                           | 0
215 | fastpitch.duration_predictor.fc                | Linear                            | 257
216 | fastpitch.pitch_predictor                      | TemporalPredictor                 | 493 K
217 | fastpitch.pitch_predictor.layers               | Sequential                        | 493 K
218 | fastpitch.pitch_predictor.layers.0             | ConvReLUNorm                      | 295 K
219 | fastpitch.pitch_predictor.layers.0.conv        | Conv1d                            | 295 K
220 | fastpitch.pitch_predictor.layers.0.norm        | LayerNorm                         | 512
221 | fastpitch.pitch_predictor.layers.0.dropout     | Dropout                           | 0
222 | fastpitch.pitch_predictor.layers.1             | ConvReLUNorm                      | 197 K
223 | fastpitch.pitch_predictor.layers.1.conv        | Conv1d                            | 196 K
224 | fastpitch.pitch_predictor.layers.1.norm        | LayerNorm                         | 512
225 | fastpitch.pitch_predictor.layers.1.dropout     | Dropout                           | 0
226 | fastpitch.pitch_predictor.fc                   | Linear                            | 257
227 | fastpitch.pitch_emb                            | Conv1d                            | 1.5 K
228 | fastpitch.proj                                 | Linear                            | 30.8 K
-------------------------------------------------------------------------------------------------------
45.8 M    Trainable params
0         Non-trainable params
45.8 M    Total params
183.035   Total estimated model params size (MB)

As the model starts training, you should see a progress bar per epoch.

Epoch 0:   0%|                                | 0/395 [00:00<00:00, 5504.34it/s][W reducer.cpp:1151] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0:   6%|▊             | 23/395 [05:06<1:19:07, 12.76s/it, loss=38, v_num=]
...

At the end of training, TAO Toolkit will save the last checkpoint at the path specified by the experiment spec file before finishing.

[NeMo I 2021-01-20 22:38:48 train:120] Experiment logs saved to '$RESULTS_DIR/spectro_gen/train'
[NeMo I 2021-01-20 22:38:48 train:123] Trained model saved to '$RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt'
INFO: Internal process exited

Current Limitations

  • Currently, only .wav audio files are supported.

  • Training only supports single-speaker datasets.

  • The spectrogram generator can only be trained from scratch.

Running Inference on a Model

To perform inference on individual text lines, use the following command:

tao spectro_gen infer -e <experiment_spec> \
                      -m <model_checkpoint> \
                      -g <num_gpus> \
                      -k $KEY \
                      -r </path/to/results/directory/for/logs> \
                      output_path=</path/to/result/directory/for/spectrogram>

Required Arguments

  • -e: The experiment specification file to set up inference. This spec file only needs the list of individual text lines on which to run inference.

  • -m: The path to the model checkpoint, which should be a .tlt file.

  • -k: The key to encrypt the model

Optional Arguments

  • -g: The number of GPUs to use for inference in a multi-GPU scenario. The default value is 1.

  • -r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.

  • Other arguments to override fields in the specification file.

Inference Procedure

At the start of inference, TAO Toolkit will print out the experiment specification, including the input text lines on which inference will be performed.

When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. This will show the vocabulary that the model was trained on.

[NeMo W 2021-10-29 23:08:27 exp_manager:26] Exp_manager is logging to `/results/spectro_gen/infer``, but it already exists.
[NeMo W 2021-10-29 23:08:33 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      manifest_filepath: /data/ljspeech/ljspeech_train.json
      ...
      ...
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 32
      num_workers: 12

[NeMo W 2021-10-29 23:08:33 modelPT:137] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      ...
      ...
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 32
      num_workers: 8

[NeMo I 2021-10-29 23:08:43 features:252] PADDING: 1
[NeMo I 2021-10-29 23:08:43 features:269] STFT using torch
Results for by the... saved to /results/spectro_gen/infer/spectro/0.npy
Results for direct... saved to /results/spectro_gen/infer/spectro/1.npy
Results for uneasy... saved to /results/spectro_gen/infer/spectro/2.npy
[NeMo I 2021-10-29 23:08:51 infer:79] Experiment logs saved to '/results/spectro_gen/infer'

The paths to the Mel spectrograms generated by the infer task are shown in the last lines of the log.
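
The saved arrays can be loaded directly for inspection, for example (a minimal sketch; the path is taken from the log above):

# Minimal sketch: load one generated spectrogram and check its shape.
import numpy as np

spec = np.load("/results/spectro_gen/infer/spectro/0.npy")
# One axis should hold the n_mel_channels (80 in the spec above) mel bins;
# the other holds the spectrogram frames.
print(spec.shape)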

Current Limitations

  • Currently, only .wav audio files are supported.

Model Export

You can export a trained FastPitch model to Riva format, which contains all the model artifacts necessary for deployment to Riva Services. For more details about Riva, refer to the Riva documentation.

To export a FastPitch model to the Riva format, use the following command:

tao spectro_gen export -e <experiment_spec> \
                       -m <model_checkpoint> \
                       -r <results_dir> \
                       -k <encryption_key> \
                       export_format=RIVA \
                       export_to=<filename.riva>

Required Arguments

  • -e: The experiment specification file for export. See the Export Spec File section below for more details.

  • -m: The path to the model checkpoint to be exported, which should be a .tlt file

  • -k: The encryption key

Optional Arguments

  • -r: The path to the directory where results will be stored.

Export Spec File

The following is an example spec file for model export:

# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format: RIVA
export_format: RIVA

# Output EFF archive containing model checkpoint and artifacts required for Riva Services
export_to: exported-model.riva

  • restore_from (string): The path to the pre-trained model to be exported. Default: trained-model.tlt

  • export_format (string): The export format. Default: N/A

  • export_to (string): The target path for the exported model. Default: exported-model.riva

A successful run of the model export generates the following log:

[NeMo W 2021-10-29 23:14:22 exp_manager:26] Exp_manager is logging to `/results/spectro_gen/export``, but it already exists.
[NeMo W 2021-10-29 23:14:28 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      manifest_filepath: /data/ljspeech/ljspeech_train.json
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: 22050
      ...
      ...
      ...
[NeMo I 2021-10-29 23:14:35 export:57] Model restored from '/results/spectro_gen/train/checkpoints/trained-model.tlt'
[NeMo W 2021-10-29 23:14:38 export_utils:198] Swapped 0 modules
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
[NeMo I 2021-10-29 23:14:58 export:72] Experiment logs saved to '/results/spectro_gen/export'
[NeMo I 2021-10-29 23:14:58 export:73] Exported model to '/results/spectro_gen/export/spectro_gen.riva'
[NeMo I 2021-10-29 23:15:03 export:80] Exported model is compliant with Riva