Speaker Adapter for Custom Voice
Speaker adapter is a TTS technique that lets us fine-tune a model pretrained on a larger dataset to a new voice using roughly 30 minutes of audio, with satisfactory results. Using adapters also reduces FastPitch training time significantly.
Download the Models#
Download the pretrained models that we will be fine-tuning.
## Download the FastPitch pretrained model.
ngc registry model download-version "nvidia/riva/tts_fastpitch_speaker_adapter_ipa:trainable_v1.0"
## Download the HiFi-GAN pretrained model.
ngc registry model download-version "nvidia/riva/tts_en_hifigan_adapter:trainable_v1.0"
Fine-Tune the Existing Model#
After downloading the models, we can fine-tune them to a new voice. Use the tutorial for fine-tuning speaker adapter models. The tutorial already includes the recipe and recommended hyperparameters for fine-tuning; however, we need to update the following parameters in the tutorial:
pretrained_fastpitch_checkpoint="<Path to pretrained FastPitch.nemo ckpt, downloaded from ngc in previous section.>"
finetuned_hifigan_on_multispeaker_checkpoint="<Path to pretrained HifiGan.nemo ckpt, downloaded from ngc in previous section.>"
use_ipa=True  # Set this to True, since we fine-tune from an IPA pretrained model.
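Before launching fine-tuning, it can help to confirm that both checkpoint paths point at real .nemo files, since a wrong path otherwise tends to surface only after the tutorial has started loading models. The following is a minimal sketch; `check_nemo_checkpoint` is a hypothetical helper written for illustration and is not part of NeMo or the tutorial.

```python
from pathlib import Path


def check_nemo_checkpoint(path_str: str) -> Path:
    """Return the resolved path if it is an existing file with a .nemo suffix.

    Hypothetical helper for illustration; not part of NeMo or the tutorial.
    """
    path = Path(path_str).expanduser()
    if path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo checkpoint, got: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    return path.resolve()
```

The returned path can then be passed to the tutorial parameters, for example `pretrained_fastpitch_checkpoint = str(check_nemo_checkpoint(my_fastpitch_path))`, where `my_fastpitch_path` stands for wherever the NGC download placed the file.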
Generate the Riva Checkpoint#
After fine-tuning finishes, we will have .nemo checkpoints. We use the .nemo checkpoints to generate .riva checkpoints. The process of converting .nemo to .riva is documented in creating Riva files. We generate .riva checkpoints for both HiFi-GAN and FastPitch. Sample commands to generate them:
nemo2riva --key tlt_encode --out FastPitch.riva FastPitch.nemo
nemo2riva --key tlt_encode --out HifiGan.riva HifiGan.nemo
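When both conversions are scripted together, the two commands above can be generated from the checkpoint names. The sketch below is one way to do that under stated assumptions: it uses only the `--key` and `--out` options shown in the sample commands, the function names are hypothetical, and it defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess
from pathlib import Path


def nemo2riva_command(nemo_path: str, key: str = "tlt_encode") -> list[str]:
    """Build the nemo2riva argument list for one .nemo checkpoint.

    The output name swaps the .nemo suffix for .riva, matching the
    sample commands. Hypothetical helper for illustration.
    """
    out_path = Path(nemo_path).with_suffix(".riva")
    return ["nemo2riva", "--key", key, "--out", str(out_path), nemo_path]


def convert_all(nemo_paths: list[str], dry_run: bool = True) -> None:
    """Print each command; run it only when dry_run is False."""
    for nemo_path in nemo_paths:
        cmd = nemo2riva_command(nemo_path)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if nemo2riva fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    convert_all(["FastPitch.nemo", "HifiGan.nemo"])
```

Passing `dry_run=False` executes the commands, which requires `nemo2riva` to be installed and on the `PATH`.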
We should now have a .riva checkpoint for both HiFi-GAN and FastPitch. Next, we use riva-build from the riva_servicemaker container to create a deployable RMIR checkpoint. Refer to the Riva documentation for more details. We can also use the sample command below with the appropriate values to generate the RMIR:
# Download the IPA phoneme dictionary. Use -O (output file) rather than -o (log file),
# and the raw file URL rather than the GitHub HTML page.
wget -O ipa.txt https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --num_speakers=1 \
    --phone_set=ipa \
    --phone_dictionary_file=ipa.txt \
    --sample_rate 44100 \
    --voice_name <voice_name> \
    --upper_case_chars=True
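To make the placeholders in the sample command explicit, the sketch below assembles the same `riva-build speech_synthesis` invocation from named parameters. It is an illustration only: the function name is hypothetical, it assumes a single key is shared by the RMIR and both .riva files, and it uses no flags beyond those in the sample command.

```python
import shlex


def riva_build_tts_command(
    rmir_file: str,
    fastpitch_riva: str,
    hifigan_riva: str,
    key: str,
    voice_name: str,
    phone_dict: str = "ipa.txt",
    sample_rate: int = 44100,
) -> list[str]:
    """Build the riva-build argument list for a single-speaker IPA voice.

    Hypothetical helper; assumes the same key for all three files.
    """
    return [
        "riva-build", "speech_synthesis",
        f"{rmir_file}:{key}",
        f"{fastpitch_riva}:{key}",
        f"{hifigan_riva}:{key}",
        "--num_speakers=1",
        "--phone_set=ipa",
        f"--phone_dictionary_file={phone_dict}",
        "--sample_rate", str(sample_rate),
        "--voice_name", voice_name,
        "--upper_case_chars=True",
    ]


if __name__ == "__main__":
    # File names, key, and voice name here are placeholders for illustration.
    cmd = riva_build_tts_command(
        "custom_voice.rmir", "FastPitch.riva", "HifiGan.riva",
        key="tlt_encode", voice_name="my_voice",
    )
    print(shlex.join(cmd))
```

The printed command can be reviewed and then run inside the riva_servicemaker container.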