# Spectrogram Generator

Spectrogram Generator models take in text input and generate a Mel spectrogram. There are several types of Spectrogram Generator architecture; TAO Toolkit supports the FastPitch architecture.

The FastPitch model generates Mel spectrograms and predicts a pitch contour from raw input text. It allows additional control over synthesized utterances through the following options:

• Modify the pitch contour to control the prosody.

• Increase or decrease the fundamental frequency in a natural way, which preserves the perceived identity of the speaker.

• Alter the rate of speech.

• Specify input as graphemes or phonemes.

• Switch speakers (if the model has been trained with data from multiple speakers).

The following tasks have been implemented for FastPitch in the TAO Toolkit:

• dataset_convert

• train

• infer

• export

• finetune

• pitch_stats

Example specification files for all the tasks associated with the spectrogram generator component of TTS can be downloaded using the following command:

tao spectro_gen download_specs   \
-o <target_path> \
-r <results_path>


### Required Arguments

• -o: The target path where the spec files will be stored

• -r: The results and output log directory

## Preparing the Dataset

The spectrogram generator for TAO Toolkit implements the dataset_convert task to convert and prepare datasets that follow the LJSpeech dataset format.

The dataset_convert task generates manifest files and .txt files with normalized transcripts.

The dataset for TTS consists of a set of utterances in individual audio files (.wav) and a manifest that describes the dataset, with information about one utterance per line (.json).

Each line of the manifest should be in the following format:

{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}


The audio_filepath field should provide an absolute path to the .wav file corresponding to the utterance. The text field should contain the full transcript for the utterance, and the duration field should reflect the duration of the utterance in seconds.

Each entry in the manifest (describing one audio file) should be bordered by { and } and must be contained on one line. The fields that describe the file should be separated by commas and have the form "field_name": value, as shown above.

Since the manifest specifies the path for each utterance, the audio files do not have to be located in the same directory as the manifest, or even in any specific directory structure.
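As a rough illustration of the format above, the following Python sketch builds and checks one manifest entry. The helper names `manifest_line` and `validate_manifest_line` are hypothetical and not part of TAO Toolkit, and the absolute-path check assumes POSIX-style paths.

```python
import json
import posixpath


def manifest_line(audio_filepath: str, text: str, duration: float) -> str:
    """Serialize one utterance as a single-line JSON manifest entry."""
    entry = {
        "audio_filepath": audio_filepath,
        "text": text,
        "duration": round(duration, 3),
    }
    # json.dumps emits no newlines by default, so the entry stays on one line.
    return json.dumps(entry)


def validate_manifest_line(line: str) -> dict:
    """Parse one manifest line and check the three fields described above."""
    entry = json.loads(line)
    missing = {"audio_filepath", "text", "duration"} - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Assumption: paths are POSIX-style, as inside a Linux container.
    if not posixpath.isabs(entry["audio_filepath"]):
        raise ValueError("audio_filepath should be an absolute path")
    if entry["duration"] <= 0:
        raise ValueError("duration should be a positive number of seconds")
    return entry


line = manifest_line(
    "/path/to/audio.wav", "the transcription of the utterance", 23.147
)
print(line)
```

Writing one such line per utterance to a `.json` file yields a manifest in the layout this section describes.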

## Creating an Experiment Spec File

The spec file for TTS using FastPitch includes the trainer and model elements, along with the top-level train_dataset, validation_dataset, and prior_folder fields.

The following is a shortened example of a spec file for training on the LJSpeech dataset.

sample_rate: 22050
train_dataset: ???
validation_dataset: ???
prior_folder: ???

model:
  learn_alignment: true
  n_speakers: 1
  symbols_embedding_dim: 384
  max_token_duration: 75
  n_mel_channels: 80
  pitch_embedding_kernel_size: 3
  n_window_size: 1024
  n_window_stride: 256
  pitch_fmin: 80
  pitch_fmax: 640
  pitch_avg: 211.27540199742586
  pitch_std: 52.1851002822779

  train_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset"
      manifest_filepath: ${train_dataset}
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      trim: false
      sup_data_path: ${prior_folder}
      n_window_stride: ${model.n_window_stride}
      n_window_size: ${model.n_window_size}
      pitch_fmin: ${model.pitch_fmin}
      pitch_fmax: ${model.pitch_fmax}
      pitch_avg: ${model.pitch_avg}
      pitch_std: ${model.pitch_std}
      vocab:
        notation: phonemes
        punct: true
        spaces: true
        stresses: true
        add_blank_at: None
        pad_with_space: True
        chars: true
        improved_version_g2p: true
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 32
      num_workers: 12

  validation_ds:
    dataset:
      _target_: "nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset"
      manifest_filepath: ${validation_dataset}
      max_duration: null
      min_duration: null
      int_values: false
      normalize: true
      sample_rate: ${sample_rate}
      trim: false
      sup_data_path: ${prior_folder}
      n_window_stride: ${model.n_window_stride}
      n_window_size: ${model.n_window_size}
      pitch_fmin: ${model.pitch_fmin}
      pitch_fmax: ${model.pitch_fmax}
      pitch_avg: ${model.pitch_avg}
      pitch_std: ${model.pitch_std}
      vocab:
        notation: phonemes
        punct: true
        spaces: true
        stresses: true
        chars: true
        improved_version_g2p: true
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 32
      num_workers: 8

  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    dither: 0.0
    features: ${model.n_mel_channels}
    frame_splicing: 1
    highfreq: 8000
    log: true
    log_zero_guard_type: add
    log_zero_guard_value: 1e-05
    lowfreq: 0
    mag_power: 1.0
    n_fft: ${model.n_window_size}
    n_window_size: ${model.n_window_size}
    n_window_stride: ${model.n_window_stride}
    normalize: null
    preemph: null
    sample_rate: ${sample_rate}
    window: hann
    window_size: null
    window_stride: null

  input_fft: #n_embed and padding_idx are added by the model
    _target_: nemo.collections.tts.modules.transformer.FFTransformerEncoder
    n_layer: 6
    n_head: 1
    d_model: ${model.symbols_embedding_dim}
    d_inner: 1536
    kernel_size: 3
    dropout: 0.1
    dropatt: 0.1
    dropemb: 0.0
    d_embed: ${model.symbols_embedding_dim}

  output_fft:
    _target_: nemo.collections.tts.modules.transformer.FFTransformerDecoder
    n_layer: 6
    n_head: 1
    d_model: ${model.symbols_embedding_dim}
    d_inner: 1536
    kernel_size: 3
    dropout: 0.1
    dropatt: 0.1
    dropemb: 0.0

  alignment_module:
    _target_: nemo.collections.tts.modules.aligner.AlignmentEncoder
    n_text_channels: ${model.symbols_embedding_dim}

  duration_predictor:
    _target_: nemo.collections.tts.modules.fastpitch.TemporalPredictor
    input_size: ${model.symbols_embedding_dim}
    kernel_size: 3
    filter_size: 256
    dropout: 0.1
    n_layers: 2

  pitch_predictor:
    _target_: nemo.collections.tts.modules.fastpitch.TemporalPredictor
    input_size: ${model.symbols_embedding_dim}
    kernel_size: 3
    filter_size: 256
    dropout: 0.1
    n_layers: 2

  optim:
    name: lamb
    lr: 1e-1
    betas: [0.9, 0.98]
    weight_decay: 1e-6

    sched:
      name: NoamAnnealing
      warmup_steps: 1000
      last_epoch: -1
      d_model: 1  # Disable scaling based on model dim

trainer:
  max_epochs: 100

The specification can be roughly grouped into three categories:

• Parameters to configure the trainer

• Parameters that describe the model

• Parameters to configure the experiment

This specification can be used with the tao spectro_gen train command.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, if you would like to change the validation batch size, you can add model.validation_ds.dataloader_params.batch_size=1 to your command, which would override the batch size of 32 in the configuration shown above. An example of this is shown in the training instructions below.

### Configuring the Trainer

The following parameter is used to configure the trainer element of the Spectrogram Generator.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| max_epochs | int | Specifies the maximum number of epochs to train the model. A field for the trainer parameter. | >0 |

### Configuring the Model

The parameters to help configure the FastPitch model are included in the model element. This includes parameters for configuring the following elements:

1. dataset_config

2. preprocessor

3. input_fft

4. output_fft

5. alignment_module

6. duration_predictor

7. pitch_predictor

8. optimizer

There are also some global parameters:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| learn_alignment | bool | Enable learning alignment | True/False |
| n_speakers | int | The number of speakers in the dataset | |
| symbols_embedding_dim | int | The dimension of the symbols embedding | |
| max_token_duration | int | The maximum duration to clamp the tokens to | |
| pitch_embedding_kernel_size | int | The kernel size of the Conv1d layer generating pitch embeddings | |
| pitch_fmin | float | The fmin input to librosa.pyin. The default value is librosa.note_to_hz('C2') | |
| pitch_fmax | float | The fmax input to librosa.pyin. The default value is librosa.note_to_hz('C7') | |
| pitch_avg | float | The average used to normalize the pitch | |
| pitch_std | float | The standard deviation used to normalize the pitch | |
| n_window_stride | int | The stride of the window for fft in samples. | |
| n_window_size | int | The size of the window for fft in samples. | |
| n_mel_channels | int | The number of Mel channels to output | |

#### Dataset Configs

The datasets that you use should be specified by <xyz>_ds parameters, depending on the use case:

• For training using tao spectro_gen train, you should have train_ds to describe your training dataset, and validation_ds to describe your validation dataset.

Each <xyz>_ds config contains two main groups of configuration:

• dataset: The configuration component describing the dataset

• dataloader_params: The configuration component describing the dataloader

The configurable fields for the dataset field are described in the following table:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| manifest_filepath | string | The filepath to the manifest (.json file) that describes the audio data | Valid filepaths |
| min_duration | float | All files with a duration less than the given value (in seconds) will be dropped. The default value is 0.1. | |
| max_duration | float | All files with a duration greater than the given value (in seconds) will be dropped. | |
| sample_rate | int | The target sample rate to load the audio, in Hz. | |
| trim | bool | Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). The default value is False. | True/False |
| int_values | bool | If true, load samples as 32-bit integers. The default value is False. | True/False |
| n_window_stride | int | The stride of the window for fft in samples. | |
| n_window_size | int | The size of the window for fft in samples. | |
| normalize | bool | The flag to determine whether to normalize the transcript text | True/False |
| pitch_fmin | float | The fmin input to librosa.pyin. The default value is librosa.note_to_hz('C2') | |
| pitch_fmax | float | The fmax input to librosa.pyin. The default value is librosa.note_to_hz('C7') | |
| pitch_avg | float | The average used to normalize the pitch | |
| pitch_std | float | The standard deviation used to normalize the pitch | |

Note

The pitch_avg and pitch_std parameters provided by default are calculated for the LJSpeech dataset. These values must be recalculated per speaker. Similarly, pitch_fmin and pitch_fmax need to be adjusted based on the dataset. The default values may result in poor behavior.

##### Vocabulary

This subsection under the dataset component of the <xyz>_ds config defines the configurable fields to generate a vocabulary.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| notation | str | Either chars or phonemes as the general notation | phonemes |
| punct | bool | Whether to reserve graphemes for basic punctuation | True/False |
| spaces | bool | Whether to prepend spaces to every punctuation symbol | True/False |
| chars | bool | Whether to additionally use chars together with phonemes | True/False |
| add_blank_at | str | Add blank to labels in the specified order. If this string is empty, then there will be no blank in the labels. | last/last_but_one/None |
| pad_with_space | bool | Whether to pad text with spaces at the beginning and at the end | True/False |
| improved_version_g2p | bool | Whether to use the new version of g2p | True/False |

##### Dataloader

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| num_workers | integer | The number of worker threads for loading the dataset | 2 |
| shuffle | bool | Whether to shuffle the data. We recommend True for training data and False for validation. | True/False |
| batch_size | integer | The training data batch size | |

#### Preprocessor Config

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| dither | float | Amount of white-noise dithering. | >= 0 |
| features | int | Number of mel spectrogram freq bins to output. | Derived from model.n_mel_channels |
| frame_splicing | int | Number of spectrogram frames per model step | |
| highfreq | int | Upper bound on mel basis in Hz. | |
| log | bool | Whether to take the log of the spectrogram | |
| log_zero_guard_type | str | Needed to avoid taking the log of zero. There are two options: "add" or "clamp". | |
| log_zero_guard_value | float, str | Add or clamp requires the number to add with or clamp to. log_zero_guard_value can either be a float or "tiny" or "eps". torch.finfo is used if "tiny" or "eps" is passed. | |
| lowfreq | int | Lower bound on mel basis in Hz. | |
| mag_power | int | The power that the linear spectrogram is raised to prior to multiplication with mel basis. | |
| n_fft | int | The size of the window for fft in samples. | Derived from model.n_window_size |
| n_window_size | int | The size of the window for fft in samples. Use one of window_size or n_window_size. | Derived from model.n_window_size |
| n_window_stride | int | The stride of the window for fft in samples. Use one of window_stride or n_window_stride. | Derived from model.n_window_stride |
| normalize | str | Can be one of 'per_feature' or 'all_features'; all other options disable feature normalization. 'all_features' normalizes the entire spectrogram; 'per_feature' normalizes per channel / freq instead. | |
| pad_to | int | Ensures that the output size of the time dimension is a multiple of pad_to. | |
| pad_value | float | The value that shorter mels are padded with. | |
| preemph | float | Amount of pre-emphasis to add to audio. Can be disabled by passing None. | |
| sample_rate | int | The target sample rate to load the audio, in Hz. | Derived from sample_rate |
| window | string | The windowing function for fft. Can be one of ['hann', 'hamming', 'blackman', 'bartlett']. | |
| window_size | int | Size of window for fft in seconds | |
| window_stride | int | Stride of window for fft in seconds | |

#### Input / Output FFT

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| n_layer | int | Number of Transformer layers | |
| n_head | int | Number of heads in the MultiHeadAttn module | |
| d_model | int | Hidden size of input and output | Derived from model.symbols_embedding_dim |
| d_head | int | Hidden size of attention module | |
| d_inner | int | Hidden size of conv layer | |
| kernel_size | int | Kernel size of conv layer | |
| dropout | float | Dropout parameter | |
| dropatt | float | Dropout parameter for attention | |
| d_embed | int | Hidden size of embeddings (input fft only) | Derived from model.symbols_embedding_dim |

#### Alignment Module

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| n_text_channels | int | Should match d_model | |

#### Duration Predictor / Pitch Predictor

A simple stack of conv, relu, layernorm, and dropout layers.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| input_size | int | Should match d_model | Derived from model.symbols_embedding_dim |
| kernel_size | int | Kernel size for conv layers | |
| filter_size | int | Filter size for conv layers | |
| dropout | float | Dropout parameter | |
| n_layers | int | Number of layers | |

## Training the Model

To train a model from scratch, use the following command:

tao spectro_gen train \
-e <experiment_spec> \
-g <num_gpus> \
-r /path/to/the/results/directory \
-k <encryption_key>

As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command line arguments at runtime.

For example, the following command can be used to override the training manifest and validation manifest, the number of epochs to train, and the place to save the model checkpoint:

tao spectro_gen train \
-e $SPECS_DIR/spectro_gen/train.yaml \
-g 1 \
-k $KEY \
-r $RESULTS_DIR/spectro_gen/train \
train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \

[NeMo I 2021-01-20 22:38:48 train:123] Trained model saved to '$RESULTS_DIR/spectro_gen/train/checkpoints/trained-model.tlt'
INFO: Internal process exited

### Current Limitations

• Currently, only .wav audio files are supported.

• Training only supports single-speaker datasets.

• The spectrogram generator can only be trained from scratch.

## Running Inference on a Model

To perform inference on individual text lines, use the following command:

tao spectro_gen infer -e <experiment_spec> \
-m <model_checkpoint> \
-g <num_gpus> \
-k $KEY \
-r </path/to/results/directory/for/logs> \
output_path=</path/to/result/directory/for/spectrogram>


### Required Arguments

• -e: The experiment specification file to set up inference. This spec file only needs a file_paths parameter that contains a list of individual file paths.

• -m: The path to the model checkpoint, which should be a .tlt file.

• -k: The key to encrypt the model

### Optional Arguments

• -g: The number of GPUs to use for inference in a multi-GPU scenario. The default value is 1.

• -r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.

• Other arguments to override fields in the specification file.
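The override arguments in the last bullet take the form `dotted.key=value`. The following Python sketch illustrates how such an argument maps onto a nested spec; it is only an illustration of the idea, not the parser that TAO Toolkit uses, and `apply_override` is a hypothetical helper.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'dotted.key=value' override to a nested dict in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the nested spec, creating intermediate levels if needed.
        node = node.setdefault(name, {})
    # Interpret the value as int, then float; otherwise keep the raw string.
    # (A real override parser would also handle booleans, lists, and null.)
    for cast in (int, float):
        try:
            node[leaf] = cast(raw_value)
            return
        except ValueError:
            continue
    node[leaf] = raw_value


spec = {
    "train_dataset": "???",
    "model": {"train_ds": {"dataloader_params": {"batch_size": 32}}},
    "trainer": {"max_epochs": 100},
}
apply_override(spec, "model.train_ds.dataloader_params.batch_size=16")
apply_override(spec, "train_dataset=/data/ljspeech/ljspeech_train.json")
print(spec)
```

Each dot descends one level in the spec file, so the key path on the command line mirrors the YAML nesting.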

### Inference Procedure

At the start of inference, TAO Toolkit will print out the experiment specification, including the audio filepaths on which inference will be performed.

When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. This will show the vocabulary that the model was trained on.

[NeMo W 2021-10-29 23:08:27 exp_manager:26] Exp_manager is logging to /results/spectro_gen/infer, but it already exists.
[NeMo W 2021-10-29 23:08:33 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
dataset:
_target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
manifest_filepath: /data/ljspeech/ljspeech_train.json
...
...
drop_last: false
shuffle: true
batch_size: 32
num_workers: 12

[NeMo W 2021-10-29 23:08:33 modelPT:137] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
dataset:
_target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
...
...
drop_last: false
shuffle: false
batch_size: 32
num_workers: 8

[NeMo I 2021-10-29 23:08:43 features:252] PADDING: 1
[NeMo I 2021-10-29 23:08:43 features:269] STFT using torch
Results for by the... saved to /results/spectro_gen/infer/spectro/0.npy
Results for direct... saved to /results/spectro_gen/infer/spectro/1.npy
Results for uneasy... saved to /results/spectro_gen/infer/spectro/2.npy
[NeMo I 2021-10-29 23:08:51 infer:79] Experiment logs saved to '/results/spectro_gen/infer'


The paths to the Mel spectrograms generated by the infer task are shown in the last lines of the log.

### Current Limitations

• Currently, only .wav audio files are supported.

## Fine-Tuning the Model

To fine-tune a model from a checkpoint, use the following command:

tao spectro_gen finetune -e <experiment_spec> \
-m <model_checkpoint> \
-g <num_gpus> \
train_dataset=<train.json> \
validation_dataset=<val.json> \
prior_folder=<prior_dir, could be an empty dir> \
n_speakers=2 \
pitch_fmin=<pitch statistic, see pitch section> \
pitch_fmax=<pitch statistic, see pitch section> \
pitch_avg=<pitch statistic, see pitch section> \
pitch_std=<pitch statistic, see pitch section> \
trainer.max_steps=<num_steps>


### Required Arguments

• -e: The experiment specification file to set up fine-tuning.

• -m: The path to the model checkpoint from which to fine-tune. The model checkpoint should be a .tlt file.

• train_dataset: The path to the training manifest, which should be created using dataset_convert dataset_name=merge. See the section below for more details.

• validation_dataset: The path to the validation manifest.

• prior_folder: A folder used to store dataset files. If the folder is empty, these files will be computed on the first run and saved to this directory. Future runs will load these files from the directory if they exist.

• n_speakers: This value should be 2: One for the original speaker, one for the new finetuning speaker.

• pitch_fmin: The Fmin to be used for pitch extraction. See the section below on how to set this value.

• pitch_fmax: The Fmax to be used for pitch extraction. See the section below on how to set this value.

• pitch_avg: The pitch average to be used for pitch extraction. See the section below on how to set this value.

• pitch_std: The pitch standard deviation to be used for pitch extraction. See the section below on how to set this value.

• trainer.max_steps: The number of steps used to finetune the model. We recommend adding 1000 for each minute in the finetuning data.
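The max_steps recommendation above can be estimated from the durations listed in the fine-tuning manifest. The following is a minimal sketch, assuming the manifest uses the one-JSON-object-per-line format described in Preparing the Dataset; `recommended_max_steps` is a hypothetical helper, not a TAO Toolkit function.

```python
import json


def recommended_max_steps(manifest_lines, steps_per_minute=1000):
    """Estimate trainer.max_steps as 1000 steps per minute of fine-tuning audio."""
    total_seconds = sum(
        json.loads(line)["duration"]
        for line in manifest_lines
        if line.strip()  # skip blank lines
    )
    minutes = total_seconds / 60.0
    # Always train for at least one step, even on a very short dataset.
    return max(1, round(minutes * steps_per_minute))


example_manifest = [
    '{"audio_filepath": "/data/a.wav", "text": "first", "duration": 30.0}',
    '{"audio_filepath": "/data/b.wav", "text": "second", "duration": 90.0}',
]
print(recommended_max_steps(example_manifest))
```

To apply it to a real dataset, pass the lines of the fine-tuning manifest (the new speaker's data, not the merged manifest), for example `recommended_max_steps(open(path))`.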

### Optional Arguments

• -g: The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).

• -r: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.

• Other arguments to override fields in the specification file.

Warning

To prevent unauthorized use of someone’s voice, TAO will only run finetuning if the text transcripts used in the finetuning data come from the NVIDIA Custom Voice Recorder tool. Users do not have to use the tool to record their own voice, but the transcripts used must be the same.

Warning

The data from the NVIDIA Custom Voice Recorder tool cannot be used to train a FastPitch model from scratch. Instead, use the data with the finetune endpoint of TAO Text-To-Speech with a pretrained FastPitch model.

### Pitch Statistics

To fine-tune FastPitch, you need to find and set four pitch hyperparameters:

• Fmin

• Fmax

• Mean

• Std

TAO features the pitch_stats task to help with this process. You must set Fmin and Fmax first. You can then iterate over the finetuning dataset to extract the pitch mean and standard deviation.

#### Obtaining the fmin and fmax

To get the fmin and fmax values, you will need to start with some defaults and iterate through random samples of the dataset to ensure that the pyin function from librosa extracts the pitch correctly. Then, look at the plotted spectrograms, as well as the predicted f0 (the cyan line), which should match the lowest energy band in the spectrogram. Here is an example of a good match between the predicted f0 and the spectrogram.

The following is an example of a bad match between the f0 and the spectrogram. The fmin was likely set too high. The f0 algorithm misses the first two vocalizations and only matches the last half of the speech correctly. To fix this, set the fmin value lower.

The following is an example of samples that have low frequency noise. To eliminate the effects of noise, set the fmin value above the noise frequency. Unfortunately, this will result in degraded TTS quality. It would be best to re-record the data in an environment with less noise.

To generate these plots, run the pitch_stats entrypoint with the following options:

tao spectro_gen pitch_stats num_files=10 \
pitch_fmin=64 \
pitch_fmax=512 \
output_path=/results/spectro_gen/pitch_stats \
compute_stats=false \
render_plots=true \
manifest_filepath=$DATA_DIR/6097_5_mins/6097_manifest_train.json \
--results_dir $RESULTS_DIR/spectro_gen/pitch_stats


### Required Arguments:

• pitch_fmin: The minimum frequency value set by the user as input to extract the pitch

• pitch_fmax: The maximum frequency value set by the user as input to extract the pitch

• output_path: The path to the directory where the pitch plots are generated

• compute_stats: A boolean flag that specifies whether to compute the pitch_mean and pitch_std

• render_plots: A boolean flag that specifies whether to generate the pitch plots at the output_path

• manifest_filepath: The path to the dataset

• num_files: Number of files in the input dataset to visualize the f0 plot.

• results_dir: The path to the directory where the logs are generated

Note

We recommend setting the compute_stats option to false so you don’t spend time iterating over the entire dataset to compute pitch_mean and pitch_std until you are satisfied with the fmin and fmax values.

#### Computing the pitch_mean and pitch_std

After you set the pitch_fmin and pitch_fmax, you need to extract the pitch over all training files. After filtering out all 0.0 and nan values from the pitch, you will compute the mean and standard deviation. You can then use these values to fine tune FastPitch. To generate the mean and standard deviation, run the pitch_stats task with the following options:

tao spectro_gen pitch_stats num_files=10 \
pitch_fmin=64 \
pitch_fmax=512 \
output_path=/results/spectro_gen/pitch_stats \
compute_stats=true \
render_plots=false \
manifest_filepath=$DATA_DIR/6097_5_mins/6097_manifest_train.json \
--results_dir $RESULTS_DIR/spectro_gen/pitch_stats


Note

In the above example, the compute_stats option is set to true and the render_plots option is set to false, so the spectrogram and predicted f0 plots are not rendered again, but the mean and standard deviation values are computed.
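The filtering and statistics described in this section can be sketched in a few lines of Python. This only illustrates the computation on an already-extracted pitch track and is not the pitch_stats implementation; `pitch_mean_std` is a hypothetical helper, and the use of the population standard deviation is an assumption.

```python
import math


def pitch_mean_std(f0_values):
    """Return (mean, std) of a pitch track after dropping 0.0 and NaN frames."""
    # Unvoiced frames are assumed to appear as 0.0 or NaN in the pitch track.
    voiced = [f for f in f0_values if f != 0.0 and not math.isnan(f)]
    if not voiced:
        raise ValueError("no voiced frames found")
    mean = sum(voiced) / len(voiced)
    # Assumption: population standard deviation (divide by N, not N - 1).
    variance = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return mean, math.sqrt(variance)


# Hypothetical pitch values in Hz; 0.0 and NaN mark unvoiced frames.
example_f0 = [0.0, float("nan"), 100.0, 200.0]
print(pitch_mean_std(example_f0))
```

The resulting pair corresponds to the pitch_avg and pitch_std values passed to the finetune command.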

### Manifest Creation

For best results, you should fine-tune FastPitch on the original data together with the data from the new speaker. To create a training manifest file that combines the data, you can use spectro_gen dataset_convert dataset_name=merge with the following parameters:

tao spectro_gen dataset_convert dataset_name=merge \
original_json=<original_data.json> \
finetune_json=<finetuning_data.json> \
save_path=<path_to_save_new_json> \
-r <results_dir> \
-e <experiment_spec>


The important arguments are as follows:

• original_json: The .json file that contains the original data

• finetune_json: The .json file that contains the finetuning data

A merged .json file will be saved at save_path.

Note

The above command assumes that both the original and the fine-tuning datasets have gone through dataset_convert to generate the manifest .json files, as described in the Preparing the Dataset section.

Warning

When merging manifest files, ensure that the audio clips from the original data and the new speaker data share the same sampling rate. If the sampling rates don’t match, you can either resample the data using the command line (method 1) or as part of the code (method 2):

1. Use the sox CLI tool.

sox input.wav output.wav rate $RATE

Where $RATE is the target sample frequency in Hz.

2. Use the librosa load function.

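import librosa

# Load the audio, resampling it to the target sampling rate.
audio, sampling_rate = librosa.load(
    "/path/to/audio.wav",
    sr=<target_sampling_rate>
)
# Write the resampled audio. librosa.output was removed in librosa 0.8.0;
# on newer versions, use soundfile.write(path, audio, sampling_rate) instead.
librosa.output.write_wav(
    "/path/to/target/audio.wav",
    audio,
    sr=sampling_rate
)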
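Before merging, it can help to confirm which files actually need resampling. The following sketch uses Python's standard `wave` module to read the sample rate from each file header; it assumes uncompressed PCM .wav files, and the helper names are hypothetical.

```python
import wave


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a PCM .wav file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def mismatched_files(paths, expected_rate):
    """Return the files whose sample rate differs from expected_rate."""
    return [p for p in paths if wav_sample_rate(p) != expected_rate]
```

For example, `mismatched_files(paths, 22050)` lists the clips that would need resampling to match the 22050 Hz sample_rate used in the example spec file.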


## Model Export

You can export a trained FastPitch model to Riva format, which contains all the model artifacts necessary for deployment to Riva Services. For more details about Riva, see this page.

To export a FastPitch model to the Riva format, use the following command:

tao spectro_gen export -e <experiment_spec> \
-m <model_checkpoint> \
-r <results_dir> \
-k <encryption_key> \
export_format=RIVA \
export_to=<filename.riva>


### Required Arguments

• -e: The experiment specification file for export. See the Export Spec File section below for more details.

• -m: The path to the model checkpoint to be exported, which should be a .tlt file

• -k: The encryption key

### Optional Arguments

• -r: The path to the directory where results will be stored.

### Export Spec File

The following is an example spec file for model export:

# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format: RIVA
export_format: RIVA

# Output EFF archive containing model checkpoint and artifacts required for Riva Services
export_to: exported-model.riva


| Parameter | Datatype | Description | Default |
|---|---|---|---|
| restore_from | string | The path to the pre-trained model to be exported | trained_model.tlt |
| export_format | string | The export format | N/A |
| export_to | string | The target path for the exported model | exported-model.riva |

A successful run of the model export generates the following log:

[NeMo W 2021-10-29 23:14:22 exp_manager:26] Exp_manager is logging to /results/spectro_gen/export, but it already exists.
[NeMo W 2021-10-29 23:14:28 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
dataset:
_target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
manifest_filepath: /data/ljspeech/ljspeech_train.json
max_duration: null
min_duration: 0.1
int_values: false
normalize: true
sample_rate: 22050
...
...
...
[NeMo I 2021-10-29 23:14:35 export:57] Model restored from '/results/spectro_gen/train/checkpoints/trained-model.tlt'
[NeMo W 2021-10-29 23:14:38 export_utils:198] Swapped 0 modules
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
[NeMo I 2021-10-29 23:14:58 export:72] Experiment logs saved to '/results/spectro_gen/export'
[NeMo I 2021-10-29 23:14:58 export:73] Exported model to '/results/spectro_gen/export/spectro_gen.riva'
[NeMo I 2021-10-29 23:15:03 export:80] Exported model is compliant with Riva