# Spectrogram Generator

Spectrogram generator models take in text input and generate a Mel spectrogram. There are several types of spectrogram generator architectures; TAO Toolkit supports the FastPitch architecture.

The FastPitch model generates Mel spectrograms and predicts a pitch contour from raw input text. It allows additional control over synthesized utterances through the following options:

• Modify the pitch contour to control the prosody.

• Increase or decrease the fundamental frequency in a natural way, which preserves the perceived identity of the speaker.

• Alter the rate of speech.

• Specify input as graphemes or phonemes.

• Switch speakers (if the model has been trained with data from multiple speakers).

The following tasks have been implemented for FastPitch in the TAO Toolkit:

• dataset_convert

• train

• infer

• export

• finetune

• pitch_stats

Example specification files for all the tasks associated with the spectrogram generator component of TTS can be downloaded using the following command:

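A sketch of the invocation, assuming the `download_specs` entrypoint follows the same pattern as the other `spectro_gen` tasks (both paths are placeholders):

```shell
tao spectro_gen download_specs \
    -o <target_path_for_spec_files> \
    -r <results_dir>
```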

### Required Arguments

• -o: The target path where the spec files will be stored

• -r: The results and output log directory

## Preparing the Dataset

The spectrogram generator for TAO Toolkit implements the dataset_convert task to convert and prepare datasets that follow the LJSpeech dataset format.

The dataset_convert task generates manifest files and .txt files with normalized transcripts.

The dataset for TTS consists of a set of utterances in individual audio files (.wav) and a manifest that describes the dataset, with information about one utterance per line (.json).

Each line of the manifest should be in the following format:

```json
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}
```

The audio_filepath field should provide an absolute path to the .wav file corresponding to the utterance. The text field should contain the full transcript for the utterance, and the duration field should reflect the duration of the utterance in seconds.

Each entry in the manifest (describing one audio file) should be bordered by { and } and must be contained on one line. The fields that describe the file should be separated by commas and have the form "field_name": value, as shown above.

Since the manifest specifies the path for each utterance, the audio files do not have to be located in the same directory as the manifest, or even in any specific directory structure.
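Following the format above, a small Python helper can emit one manifest line per utterance, reading the duration from the .wav header with the standard-library wave module (the helper name is illustrative, not part of TAO):

```python
import json
import wave

def manifest_entry(audio_filepath: str, text: str) -> str:
    """Build one manifest line (a JSON object) for a single utterance."""
    with wave.open(audio_filepath, "rb") as wav:
        # Duration in seconds = number of frames / sample rate.
        duration = wav.getnframes() / wav.getframerate()
    entry = {
        "audio_filepath": audio_filepath,
        "text": text,
        "duration": round(duration, 3),
    }
    return json.dumps(entry)
```

Writing one such line per utterance to a .json file produces a manifest in the format shown above.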

## Creating an Experiment Spec File

The spec file for TTS using FastPitch includes the trainer, model, training_dataset, validation_dataset, and prior_folder.

The following is a shortened example of a spec file for training on the LJSpeech dataset.

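A minimal sketch of what such a spec file looks like, assuming the YAML layout used by TAO spec files; every value here is an illustrative placeholder rather than a verified default (the downloadable spec files above contain the real versions):

```yaml
# Illustrative sketch only; all values are placeholders.
training_dataset: /data/ljspeech/ljspeech_train.json
validation_dataset: /data/ljspeech/ljspeech_val.json
prior_folder: /data/ljspeech/priors

trainer:
  max_steps: 100000

model:
  sample_rate: 22050
```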

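## Running Inference

The arguments below configure the infer task. A sketch of its invocation, assuming the same CLI pattern as the other spectro_gen tasks shown on this page:

```shell
tao spectro_gen infer -e <experiment_spec> \
    -m <model_checkpoint> \
    -k <encryption_key> \
    -g <num_gpus> \
    -r <results_dir>
```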
### Required Arguments

• -e: The experiment specification file to set up inference. This spec file only needs a file_paths parameter that contains a list of individual file paths.

• -m: The path to the model checkpoint, which should be a .tlt file.

• -k: The key to encrypt the model

### Optional Arguments

• -g: The number of GPUs to use for inference in a multi-GPU scenario. The default value is 1.

• -r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.

• Other arguments to override fields in the specification file.

### Inference Procedure

At the start of inference, TAO Toolkit will print out the experiment specification, including the audio filepaths on which inference will be performed.

When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. This will show the vocabulary that the model was trained on.

```
[NeMo W 2021-10-29 23:08:27 exp_manager:26] Exp_manager is logging to /results/spectro_gen/infer, but it already exists.
[NeMo W 2021-10-29 23:08:33 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      manifest_filepath: /data/ljspeech/ljspeech_train.json
      ...
      ...
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 32
      num_workers: 12
[NeMo W 2021-10-29 23:08:33 modelPT:137] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      ...
      ...
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 32
      num_workers: 8
[NeMo I 2021-10-29 23:08:43 features:252] PADDING: 1
[NeMo I 2021-10-29 23:08:43 features:269] STFT using torch
Results for by the... saved to /results/spectro_gen/infer/spectro/0.npy
Results for direct... saved to /results/spectro_gen/infer/spectro/1.npy
Results for uneasy... saved to /results/spectro_gen/infer/spectro/2.npy
[NeMo I 2021-10-29 23:08:51 infer:79] Experiment logs saved to '/results/spectro_gen/infer'
```

The paths to the Mel spectrograms generated by the infer task are shown in the last lines of the log.

### Current Limitations

• Currently, only .wav audio files are supported.

## Fine-Tuning the Model

To fine-tune a model from a checkpoint, use the following command:

```shell
!tao spectro_gen finetune -e <experiment_spec> \
    -m <model_checkpoint> \
    -g <num_gpus> \
    train_dataset=<train.json> \
    validation_dataset=<val.json> \
    prior_folder=<prior_dir, could be an empty dir> \
    n_speakers=2 \
    pitch_fmin=<pitch statistic, see pitch section> \
    pitch_fmax=<pitch statistic, see pitch section> \
    pitch_avg=<pitch statistic, see pitch section> \
    pitch_std=<pitch statistic, see pitch section> \
    trainer.max_steps=<num_steps>
```

### Required Arguments

• -e: The experiment specification file to set up fine-tuning.

• -m: The path to the model checkpoint from which to fine-tune. The model checkpoint should be a .tlt file.

• train_dataset: The path to the training manifest, which should be created using dataset_convert dataset_name=merge. See the section below for more details.

• validation_dataset: The path to the validation manifest.

• prior_folder: A folder used to store dataset files. If the folder is empty, these files will be computed on the first run and saved to this directory. Future runs will load these files from the directory if they exist.

• n_speakers: This value should be 2: one for the original speaker and one for the new finetuning speaker.

• pitch_fmin: The Fmin to be used for pitch extraction. See the section below on how to set this value.

• pitch_fmax: The Fmax to be used for pitch extraction. See the section below on how to set this value.

• pitch_avg: The pitch average to be used for pitch extraction. See the section below on how to set this value.

• pitch_std: The pitch standard deviation to be used for pitch extraction. See the section below on how to set this value.

• trainer.max_steps: The number of steps used to finetune the model. We recommend adding 1000 steps for each minute of audio in the finetuning data.

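As a quick sanity check, the step-count rule of thumb above can be written as a small helper (the function is illustrative, not part of the TAO CLI):

```python
def recommended_extra_steps(finetune_minutes: float) -> int:
    """Rule of thumb: add 1000 fine-tuning steps per minute of audio."""
    return int(1000 * finetune_minutes)
```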
### Optional Arguments

• -g: The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).

• -r: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.

• Other arguments to override fields in the specification file.

Warning

To prevent unauthorized use of someone’s voice, TAO will only run finetuning if the text transcripts used in the finetuning data come from the NVIDIA Custom Voice Recorder tool. Users do not have to use the tool to record their own voice, but the transcripts used must be the same.

Warning

The data from the NVIDIA Custom Voice Recorder tool cannot be used to train a FastPitch model from scratch. Instead, use the data with the finetune endpoint of TAO Text-To-Speech with a pretrained FastPitch model.

### Pitch Statistics

To fine-tune FastPitch, you need to find and set four pitch hyperparameters:

• Fmin

• Fmax

• Mean

• Std

TAO features the pitch_stats task to help with this process. You must set Fmin and Fmax first. You can then iterate over the finetuning dataset to extract the pitch mean and standard deviation.

#### Obtaining the fmin and fmax

To get the fmin and fmax values, you will need to start with some defaults and iterate through random samples of the dataset to ensure that the pyin function from librosa extracts the pitch correctly. Then, look at the plotted spectrograms, as well as the predicted f0 (the cyan line), which should match the lowest energy band in the spectrogram. Here is an example of a good match between the predicted f0 and the spectrogram.

The following is an example of a bad match between the f0 and the spectrogram. The fmin was likely set too high: the f0 algorithm misses the first two vocalizations and only matches the last half of the speech correctly. To fix this, set the fmin value lower.

The following is an example of samples that have low frequency noise. To eliminate the effects of noise, set the fmin value above the noise frequency. Unfortunately, this will result in degraded TTS quality. It would be best to re-record the data in an environment with less noise.

To generate these plots, run the pitch_stats entrypoint with the following options:

```shell
tao spectro_gen pitch_stats num_files=10 \
    pitch_fmin=64 \
    pitch_fmax=512 \
    output_path=/results/spectro_gen/pitch_stats \
    compute_stats=false \
    render_plots=true \
    manifest_filepath=$DATA_DIR/6097_5_mins/6097_manifest_train.json \
    --results_dir $RESULTS_DIR/spectro_gen/pitch_stats
```

### Required Arguments

• pitch_fmin: The minimum frequency value set by the user as input to extract the pitch

• pitch_fmax: The maximum frequency value set by the user as input to extract the pitch

• output_path: The path to the directory where the pitch plots are generated

• compute_stats: A boolean flag that specifies whether to compute the pitch_mean and pitch_std

• render_plots: A boolean flag that specifies whether to generate the pitch plots at the output_path

• manifest_filepath: The path to the dataset

• num_files: The number of files from the input dataset for which to visualize the f0 plot

• results_dir: The path to the directory where the logs are generated

Note

We recommend setting the compute_stats option to false so you don’t spend time iterating over the entire dataset to compute pitch_mean and pitch_std until you are satisfied with the fmin and fmax values.

#### Computing the pitch_mean and pitch_std

After you set the pitch_fmin and pitch_fmax, you need to extract the pitch over all training files. After filtering out all 0.0 and nan values from the pitch, you will compute the mean and standard deviation. You can then use these values to fine tune FastPitch. To generate the mean and standard deviation, run the pitch_stats task with the following options:
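The filter-then-aggregate step can be sketched in plain Python; the values below are illustrative, and in practice the f0 array comes from pitch extraction over all training files:

```python
import math
import statistics

def pitch_mean_std(f0_values):
    """Drop unvoiced frames (0.0) and failed estimates (NaN), then
    return the (mean, standard deviation) of the remaining pitch values in Hz."""
    voiced = [v for v in f0_values if v and not math.isnan(v)]
    # Population standard deviation over the voiced frames.
    return statistics.mean(voiced), statistics.pstdev(voiced)
```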

```shell
tao spectro_gen pitch_stats num_files=10 \
    pitch_fmin=64 \
    pitch_fmax=512 \
    output_path=/results/spectro_gen/pitch_stats \
    compute_stats=true \
    render_plots=false \
    manifest_filepath=$DATA_DIR/6097_5_mins/6097_manifest_train.json \
    --results_dir $RESULTS_DIR/spectro_gen/pitch_stats
```

Note

In the above example, the compute_stats option is set to true and the render_plots option is set to false, so the spectrograms and predicted f0 are not rendered again, but the pitch mean and standard deviation are computed.

### Manifest Creation

For best results, you should fine-tune FastPitch on the original data as well as data from the new speaker. To create a training manifest file that combines the two, use spectro_gen dataset_convert dataset_name=merge with the following parameters:

```shell
!tao spectro_gen dataset_convert dataset_name=merge \
    original_json=<original_data.json> \
    finetune_json=<finetuning_data.json> \
    save_path=<path_to_save_new_json> \
    -r <results_dir> \
    -e <experiment_spec>
```

The important arguments are as follows:

• original_json: The .json file that contains the original data

• finetune_json: The .json file that contains the finetuning data

A merged .json file will be saved at save_path.
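Conceptually, the merge concatenates the two manifests line by line. A Python sketch of that behavior follows; the real dataset_convert task may perform additional bookkeeping (for example, tagging speaker IDs) that is not shown here:

```python
import json

def merge_manifests(original_json, finetune_json, save_path):
    """Concatenate two JSON-lines manifests into one training manifest."""
    with open(save_path, "w") as out:
        for path in (original_json, finetune_json):
            with open(path) as src:
                for line in src:
                    if line.strip():
                        # Round-trip through json to validate each entry.
                        out.write(json.dumps(json.loads(line)) + "\n")
```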

Note

The above command assumes that both the original and the finetuning dataset have gone through dataset_convert to generate the manifest .json files, as described in the Preparing the Dataset section.

Warning

When merging manifest files, ensure that the audio clips from the original data and the new speaker data share the same sampling rate. If the sampling rates don’t match, you can either resample the data using the command line (method 1) or as part of the code (method 2):

1. Use the sox package CLI tool:

```shell
sox input.wav output.wav rate $RATE
```

where $RATE is the target sample frequency in Hz.

2. Use the librosa load function to resample, then write the result with the soundfile package (librosa.output.write_wav was removed in librosa 0.8):

```python
import librosa
import soundfile

# Load the audio, resampling to the target rate in one step.
audio, sampling_rate = librosa.load(
    "/path/to/audio.wav", sr=<target_sampling_rate>
)
# Write the resampled audio back out.
soundfile.write("/path/to/target/audio.wav", audio, sampling_rate)
```

## Model Export

You can export a trained FastPitch model to Riva format, which contains all the model artifacts necessary for deployment to Riva Services. For more details, see the Riva documentation.

To export a FastPitch model to the Riva format, use the following command:

```shell
tao spectro_gen export -e <experiment_spec> \
    -m <model_checkpoint> \
    -r <results_dir> \
    -k <encryption_key> \
    export_format=RIVA \
    export_to=<filename.riva>
```

### Required Arguments

• -e: The experiment specification file for export. See the Export Spec File section below for more details.

• -m: The path to the model checkpoint to be exported, which should be a .tlt file

• -k: The encryption key

### Optional Arguments

• -r: The path to the directory where results will be stored.

### Export Spec File

The following is an example spec file for model export:

```yaml
# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format: RIVA
export_format: RIVA

# Output EFF archive containing model checkpoint and artifacts required for Riva Services
export_to: exported-model.riva
```

| Parameter | Datatype | Description | Default |
| --- | --- | --- | --- |
| restore_from | string | The path to the pre-trained model to be exported | trained_model.tlt |
| export_format | string | The export format | N/A |
| export_to | string | The target path for the exported model | exported-model.riva |

A successful run of the model export generates the following log:

```
[NeMo W 2021-10-29 23:14:22 exp_manager:26] Exp_manager is logging to /results/spectro_gen/export, but it already exists.
[NeMo W 2021-10-29 23:14:28 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      manifest_filepath: /data/ljspeech/ljspeech_train.json
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: 22050
      ...
      ...
      ...
[NeMo I 2021-10-29 23:14:35 export:57] Model restored from '/results/spectro_gen/train/checkpoints/trained-model.tlt'
[NeMo W 2021-10-29 23:14:38 export_utils:198] Swapped 0 modules
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
[NeMo I 2021-10-29 23:14:58 export:72] Experiment logs saved to '/results/spectro_gen/export'
[NeMo I 2021-10-29 23:14:58 export:73] Exported model to '/results/spectro_gen/export/spectro_gen.riva'
[NeMo I 2021-10-29 23:15:03 export:80] Exported model is compliant with Riva
```