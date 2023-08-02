To fine-tune a model from a checkpoint, use the following command:

    !tao spectro_gen finetune -e <experiment_spec> \
        -m <model_checkpoint> \
        -g <num_gpus> \
        train_dataset=<train.json> \
        validation_dataset=<val.json> \
        prior_folder=<prior_dir, could be an empty dir> \
        n_speakers=2 \
        pitch_fmin=<pitch statistic, see pitch section> \
        pitch_fmax=<pitch statistic, see pitch section> \
        pitch_avg=<pitch statistic, see pitch section> \
        pitch_std=<pitch statistic, see pitch section> \
        trainer.max_steps=<num_steps>

-e : The experiment specification file to set up fine-tuning.

-m : The path to the model checkpoint from which to fine-tune. The model checkpoint should be a .tlt file.

train_dataset : The path to the training manifest, which should be created using dataset_convert dataset_name=merge . See the section below for more details.

validation_dataset : The path to the validation manifest.

prior_folder : A folder used to store dataset files. If the folder is empty, these files will be computed on the first run and saved to this directory. Future runs will load these files from the directory if they exist.

n_speakers : This value should be 2: One for the original speaker, one for the new finetuning speaker.

pitch_fmin : The Fmin to be used for pitch extraction. See the section below on how to set this value.

pitch_fmax : The Fmax to be used for pitch extraction. See the section below on how to set this value.

pitch_avg : The pitch average to be used for pitch extraction. See the section below on how to set this value.

pitch_std : The pitch standard deviation to be used for pitch extraction. See the section below on how to set this value.

trainer.max_steps : The number of steps used to finetune the model. We recommend adding 1000 steps for each minute of audio in the finetuning data.

-g : The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).

-r : The path to the results and log directory. Log files, checkpoints, etc., will be stored here.

Other arguments to override fields in the specification file.
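As a rough illustration of the trainer.max_steps recommendation above, the following Python sketch sums clip durations from a manifest and converts the total to a step count. It assumes the manifest is a JSON-lines file in which each entry has a "duration" field in seconds, and it reads the recommendation as 1000 steps per minute of finetuning audio. The function name recommended_max_steps is hypothetical and is not part of TAO.

```python
import json
import math


def recommended_max_steps(manifest_lines, steps_per_minute=1000):
    """Return roughly `steps_per_minute` steps per minute of audio.

    Assumes each non-empty line is a JSON object with a "duration"
    field given in seconds (an assumption about the manifest format).
    """
    total_seconds = 0.0
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        total_seconds += float(entry["duration"])
    return math.ceil(total_seconds / 60.0 * steps_per_minute)


# Two hypothetical manifest entries totalling five minutes of audio.
example_manifest = [
    '{"audio_filepath": "a.wav", "duration": 90.0, "text": "hello"}',
    '{"audio_filepath": "b.wav", "duration": 210.0, "text": "world"}',
]
print(recommended_max_steps(example_manifest))  # 5000
```

To use this with a real manifest, pass an open file object instead of the example list, and supply the result as trainer.max_steps.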

Warning In order to prevent unauthorized use of someone’s voice, TAO will only run finetuning if the text transcripts used in the finetuning data come from the NVIDIA Custom Voice Recorder tool. Users do not have to use the tool to record their own voice, but the transcripts used must be the same.

Warning The data from the NVIDIA Custom Voice Recorder tool cannot be used to train a FastPitch model from scratch. Instead, use the data with the finetune endpoint of TAO Text-To-Speech with a pretrained FastPitch model.





To fine tune FastPitch, you need to find and set 4 pitch hyperparameters:

Fmin

Fmax

Mean

Std

TAO features the pitch_stats task to help with this process. You must set Fmin and Fmax first. You can then iterate over the finetuning dataset to extract the pitch mean and standard deviation.

Obtaining the fmin and fmax

To get the fmin and fmax values, start with some defaults and iterate through random samples of the dataset to check that the pyin function from librosa extracts the pitch correctly. Look at the plotted spectrograms and the predicted f0 (the cyan line), which should match the lowest-frequency energy band in the spectrogram. Here is an example of a good match between the predicted f0 and the spectrogram.





The following is an example of a bad match between the f0 and the spectrogram. The fmin was likely set too high. The f0 algorithm misses the first two vocalizations and correctly matches only the last half of the speech. To fix this, set the fmin value lower.





The following is an example of samples that have low frequency noise. To eliminate the effects of noise, set the fmin value above the noise frequency. Unfortunately, this will result in degraded TTS quality. It would be best to re-record the data in an environment with less noise.





To generate these plots, run the pitch_stats entrypoint with the following options:

    tao spectro_gen pitch_stats num_files=10 \
        pitch_fmin=64 \
        pitch_fmax=512 \
        output_path=/results/spectro_gen/pitch_stats \
        compute_stats=false \
        render_plots=true \
        manifest_filepath=$DATA_DIR/6097_5_mins/6097_manifest_train.json \
        --results_dir $RESULTS_DIR/spectro_gen/pitch_stats

pitch_fmin : The minimum frequency value set by the user as input to extract the pitch

pitch_fmax : The maximum frequency value set by the user as input to extract the pitch

output_path : The path to the directory where the pitch plots are generated

compute_stats : A boolean flag that specifies whether to compute the pitch_mean and pitch_std

render_plots : A boolean flag that specifies whether to generate the pitch plots at the output_path

manifest_filepath : The path to the dataset

num_files : Number of files in the input dataset to visualize the f0 plot.

results_dir : The path to the directory where the logs are generated

Note We recommend setting the compute_stats option to false so you don’t spend time iterating over the entire dataset to compute pitch_mean and pitch_std until you are satisfied with the fmin and fmax values.

Computing the pitch_mean and pitch_std

After you set the pitch_fmin and pitch_fmax , you need to extract the pitch over all training files. After filtering out all 0.0 and nan values from the pitch, you will compute the mean and standard deviation. You can then use these values to fine tune FastPitch. To generate the mean and standard deviation, run the pitch_stats task with the following options:

    tao spectro_gen pitch_stats num_files=10 \
        pitch_fmin=64 \
        pitch_fmax=512 \
        output_path=/results/spectro_gen/pitch_stats \
        compute_stats=true \
        render_plots=false \
        manifest_filepath=$DATA_DIR/6097_5_mins/6097_manifest_train.json \
        --results_dir $RESULTS_DIR/spectro_gen/pitch_stats

Note In the above example, the compute_stats option is set to true and the render_plots option is set to false, so the spectrograms and predicted f0 are not rendered again, but the mean and standard deviation values are computed.
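The filtering step described in this section (dropping 0.0 and nan values before computing the statistics) can be sketched in plain Python as follows. This is only an illustration of the idea, not the pitch_stats implementation: the function name pitch_mean_std is hypothetical, the input is assumed to be a flat list of per-frame f0 values in Hz, and the population standard deviation is used as an assumption, since the text does not say which variant the task computes.

```python
import math
import statistics


def pitch_mean_std(f0_values):
    """Return (mean, std) of f0 over voiced frames only.

    Frames with a value of 0.0 or NaN are treated as unvoiced and
    dropped before the statistics are computed.
    """
    voiced = [f for f in f0_values if f != 0.0 and not math.isnan(f)]
    if not voiced:
        raise ValueError("no voiced frames found")
    # Population standard deviation (an assumption, see the text above).
    return statistics.fmean(voiced), statistics.pstdev(voiced)


# Hypothetical f0 track: three voiced frames, three unvoiced frames.
f0 = [0.0, float("nan"), 100.0, 200.0, 0.0, 300.0]
mean, std = pitch_mean_std(f0)
print(mean, std)
```

In practice the f0 values from all training files would be concatenated into one list before calling the function, so that the statistics describe the whole dataset rather than a single clip.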

For best results, you should fine tune FastPitch by adding the original data as well as data from the new speaker. To create a training manifest file that combines the data, you can use spectro_gen dataset_convert dataset_name=merge with the following parameters:

    !tao spectro_gen dataset_convert dataset_name=merge \
        original_json=<original_data.json> \
        finetune_json=<finetuning_data.json> \
        save_path=<path_to_save_new_json> \
        -r <results_dir> \
        -e <experiment_spec>

The important arguments are as follows:

original_json : The .json file that contains the original data

finetune_json : The .json file that contains the finetuning data

A merged .json file will be saved at save_path .

Note The above code assumes that the original and finetuning datasets have each gone through dataset_convert to generate the manifest.json files, as mentioned in the preparing the dataset section.

Warning When merging manifest files, ensure that the audio clips from the original data and the new speaker data share the same sampling rate. If the sampling rates don’t match, you can resample the data either from the command line (method 1) or in code (method 2):

Method 1: Use the sox package CLI tool.

    sox input.wav output.wav rate $RATE

Where $RATE is the target sample frequency in Hz.

Method 2: Use the librosa load function, then write the result with the soundfile package (librosa.output.write_wav was removed in librosa 0.8.0).

    import librosa
    import soundfile as sf

    # librosa.load resamples to the rate given by sr while loading.
    audio, sampling_rate = librosa.load(
        "/path/to/audio.wav", sr=<target_sampling_rate>
    )
    # Write the resampled audio at the target sampling rate.
    sf.write("/path/to/target/audio.wav", audio, sampling_rate)
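Before resampling, it can help to find which clips actually differ from the target rate. The following is a minimal sketch using only the Python standard library. It assumes the clips are uncompressed PCM WAV files, which is the only kind the wave module reads, and the helper names wav_sample_rate and find_rate_mismatches are hypothetical.

```python
import wave


def wav_sample_rate(path):
    """Read the sampling rate (Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def find_rate_mismatches(paths, expected_rate):
    """Return (path, rate) pairs whose rate differs from expected_rate."""
    mismatches = []
    for path in paths:
        rate = wav_sample_rate(path)
        if rate != expected_rate:
            mismatches.append((path, rate))
    return mismatches
```

Calling find_rate_mismatches with the audio paths from both manifests and the sampling rate of the original data returns the clips that need resampling. An empty list means the two datasets already agree.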



