Speaker Adapter for Custom Voice
Speaker adapter is a TTS technique that lets us fine-tune a model pretrained on a larger dataset using a new-speaker dataset of roughly 30 minutes of audio, with satisfactory results. Using adapters significantly reduces the FastPitch training time.
Download the Models#
Download the pretrained models that we will be fine-tuning.
## Download FastPitch pretrained model.
ngc registry model download-version "nvidia/riva/tts_fastpitch_speaker_adapter_ipa:trainable_v1.0"
## Download HiFi-GAN pretrained model.
ngc registry model download-version "nvidia/riva/tts_en_hifigan_adapter:trainable_v1.0"
Fine-Tune the Existing Model#
After downloading the models, we can fine-tune them to a new voice. Use the tutorial for fine-tuning speaker adapter models. The tutorial already contains the recipe and recommended hyperparameters for fine-tuning; however, we need to update the following parameters in the section Set finetuning params:
pretrained_fastpitch_checkpoint="<Path to pretrained FastPitch.nemo ckpt, downloaded from ngc in previous section.>"
finetuned_hifigan_on_multispeaker_checkpoint="<Path to pretrained HifiGan.nemo ckpt, downloaded from ngc in previous section.>"
## To do finetuning based on an IPA pretrained model
use_ipa=True
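Before editing the tutorial parameters, it can help to confirm that both checkpoint paths point at real files. The following is a minimal sketch: the `check_checkpoint` helper and the default paths `./FastPitch.nemo` and `./HifiGan.nemo` are hypothetical, so substitute the locations where the NGC downloads actually landed.

```shell
# Sketch: confirm both pretrained checkpoints exist before pointing the tutorial at them.
# check_checkpoint and the default paths below are hypothetical, not part of Riva or NeMo.
check_checkpoint() {
    if [ -f "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Assumed locations; override by exporting FASTPITCH_CKPT / HIFIGAN_CKPT.
FASTPITCH_CKPT="${FASTPITCH_CKPT:-./FastPitch.nemo}"
HIFIGAN_CKPT="${HIFIGAN_CKPT:-./HifiGan.nemo}"

check_checkpoint "$FASTPITCH_CKPT" || echo "fix pretrained_fastpitch_checkpoint before running the tutorial"
check_checkpoint "$HIFIGAN_CKPT" || echo "fix finetuned_hifigan_on_multispeaker_checkpoint before running the tutorial"
```

The paths that pass this check are the values to paste into `pretrained_fastpitch_checkpoint` and `finetuned_hifigan_on_multispeaker_checkpoint`.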
Generate the Riva Checkpoint#
After fine-tuning finishes, we will have .nemo checkpoints. We use the .nemo checkpoints to generate .riva checkpoints. The process of converting .nemo to .riva is documented in creating Riva files. We generate .riva checkpoints for both HiFi-GAN and FastPitch. Sample commands to generate the .riva files are shown below.
# FastPitch
nemo2riva --key tlt_encode --out FastPitch.riva FastPitch.nemo
# HiFi-GAN
nemo2riva --key tlt_encode --out HifiGan.riva HifiGan.nemo
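The two conversions above follow the same pattern, so they can be wrapped in a small helper. This is only a sketch: the `convert_to_riva` function and the `DRY_RUN` switch are illustrative additions, and the file names and key are the placeholder values from the sample commands. With `DRY_RUN=1` (the default here) the script only prints the commands, so they can be reviewed before running them where `nemo2riva` is installed.

```shell
# Sketch: derive each .riva name from its .nemo checkpoint and convert it.
# convert_to_riva and DRY_RUN are hypothetical helpers, not part of nemo2riva.
KEY="${KEY:-tlt_encode}"      # key from the sample commands above
DRY_RUN="${DRY_RUN:-1}"       # 1 = print the command only, 0 = run nemo2riva

convert_to_riva() {
    nemo_file="$1"
    riva_file="${nemo_file%.nemo}.riva"   # FastPitch.nemo -> FastPitch.riva
    if [ "$DRY_RUN" = "1" ]; then
        echo "nemo2riva --key $KEY --out $riva_file $nemo_file"
    else
        nemo2riva --key "$KEY" --out "$riva_file" "$nemo_file"
    fi
}

for ckpt in FastPitch.nemo HifiGan.nemo; do
    convert_to_riva "$ckpt"
done
```

Setting `DRY_RUN=0` runs the same commands for real.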
Generate RMIR#
We should now have a .riva checkpoint for both HiFi-GAN and FastPitch. Next, we use riva-build from the riva_servicemaker container to create a deployable RMIR checkpoint. Refer to the Riva documentation for more details. We can also use the sample command below, with the appropriate values, to generate the RMIR:
wget -O ipa.txt https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--num_speakers=1 \
--phone_set=ipa \
--phone_dictionary_file=ipa.txt \
--sample_rate 44100 \
--voice_name <voice_name> \
--upper_case_chars=True
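As an illustration of filling in the placeholders, the sketch below assembles the same command from variables and prints it for review. Every value is an assumption chosen for the example: `custom_voice.rmir`, the voice name `custom_voice`, the reuse of the `tlt_encode` key for all three files, and the `build_rmir_cmd` helper are hypothetical. Only the flags come from the sample command.

```shell
# Sketch: fill the riva-build placeholders in one place and print the resulting command.
# All values are illustrative assumptions; only the flags come from the sample command.
KEY="${KEY:-tlt_encode}"                          # assumed: same key used with nemo2riva
RMIR_FILE="${RMIR_FILE:-custom_voice.rmir}"       # hypothetical output file name
FASTPITCH_RIVA="${FASTPITCH_RIVA:-FastPitch.riva}"
HIFIGAN_RIVA="${HIFIGAN_RIVA:-HifiGan.riva}"
VOICE_NAME="${VOICE_NAME:-custom_voice}"          # hypothetical voice name

build_rmir_cmd() {
    # printf reuses the '%s ' format for every argument, giving one space-separated line.
    printf '%s ' \
        riva-build speech_synthesis \
        "$RMIR_FILE:$KEY" \
        "$FASTPITCH_RIVA:$KEY" \
        "$HIFIGAN_RIVA:$KEY" \
        --num_speakers=1 \
        --phone_set=ipa \
        --phone_dictionary_file=ipa.txt \
        --sample_rate 44100 \
        --voice_name "$VOICE_NAME" \
        --upper_case_chars=True
    printf '\n'
}

# Print the command; run it inside the riva_servicemaker container once it looks right.
build_rmir_cmd
```

Keeping the values in variables makes it harder for the file names and key to drift apart between the `nemo2riva` and `riva-build` steps.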
Conclusion#
We now have everything we need to deploy our model. We can use tts_deploy from Riva to deploy it.