Speaker Adapter for Custom Voice

Speaker adapters are a TTS technique for fine-tuning a model that was pretrained on a larger dataset to a new voice, using a dataset of only about 30 minutes of audio, with satisfactory results. Using adapters also reduces FastPitch training time significantly.

Download the Models

Download the pretrained models that we will be fine-tuning.

## Download FastPitch pretrained model.
ngc registry model download-version "nvidia/riva/tts_fastpitch_speaker_adapter_ipa:trainable_v1.0"
## Download HiFi-GAN pretrained model.
ngc registry model download-version "nvidia/riva/tts_en_hifigan_adapter:trainable_v1.0"

Fine-Tune the Existing Model

After downloading the pretrained models, we can fine-tune them to a new voice. Use the tutorial for fine-tuning speaker adapter models. The tutorial already contains the recipe and recommended hyperparameters for fine-tuning; however, we need to update the following parameters in the section Set finetuning params:

pretrained_fastpitch_checkpoint="<Path to pretrained FastPitch.nemo ckpt, downloaded from ngc in previous section.>"

finetuned_hifigan_on_multispeaker_checkpoint="<Path to pretrained HifiGan.nemo ckpt, downloaded from ngc in previous section.>"

## To do finetuning based on an IPA pretrained model
use_ipa=True
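
Before starting a fine-tuning run, it can help to confirm that both checkpoint paths point at real .nemo files. The following is a minimal sketch under that assumption; the helper name check_nemo_checkpoint is hypothetical and is not part of NeMo, Riva, or the tutorial.

```python
from pathlib import Path


def check_nemo_checkpoint(path_str: str) -> Path:
    """Return the path if it names an existing .nemo file, else raise.

    Hypothetical helper for illustration; it is not part of NeMo or Riva.
    """
    path = Path(path_str).expanduser()
    # Reject anything that is not a .nemo archive before touching the disk.
    if path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo checkpoint, got: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    return path
```

In the tutorial notebook, this could be called once on pretrained_fastpitch_checkpoint and once on finetuned_hifigan_on_multispeaker_checkpoint, so that a mistyped path fails immediately instead of partway through training.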

Generate the Riva Checkpoint

After fine-tuning finishes, we will have .nemo checkpoints, which we use to generate .riva checkpoints. The process of converting .nemo to .riva is documented in creating Riva files. We generate a .riva checkpoint for both HiFi-GAN and FastPitch. Sample commands are shown below.

# FastPitch
nemo2riva --key tlt_encode --out FastPitch.riva FastPitch.nemo

# HiFi-GAN
nemo2riva --key tlt_encode --out HifiGan.riva HifiGan.nemo
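
The two conversions differ only in the file name, so they can be generated from one template. The sketch below only builds the argument lists, using the flags from the sample commands above; the function name nemo2riva_command is hypothetical, and the rule of deriving the output name by swapping the suffix is an assumption made for illustration.

```python
from pathlib import Path
from typing import List


def nemo2riva_command(nemo_file: str, key: str = "tlt_encode") -> List[str]:
    """Build the nemo2riva argument list for one .nemo checkpoint.

    Assumption: the output file is the input name with a .riva suffix.
    """
    nemo_path = Path(nemo_file)
    riva_path = nemo_path.with_suffix(".riva")
    return ["nemo2riva", "--key", key, "--out", str(riva_path), str(nemo_path)]


# One command per model, matching the sample commands above.
commands = [nemo2riva_command(name) for name in ("FastPitch.nemo", "HifiGan.nemo")]
for command in commands:
    print(" ".join(command))
```

Each list can then be passed to subprocess.run(command, check=True) in an environment where nemo2riva is installed.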

Generate RMIR

We should now have a .riva checkpoint for both HiFi-GAN and FastPitch. We next use riva-build from the Riva container to create a deployable RMIR checkpoint. Refer to the Riva documentation for more details. We can also use the sample command below with the appropriate values to generate the RMIR:

wget -O ipa.txt https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt
riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --num_speakers=1 \
    --phone_set=ipa \
    --phone_dictionary_file=ipa.txt \
    --sample_rate 44100 \
    --voice_name <voice_name> \
    --upper_case_chars=True
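
A download that saves a web page instead of the raw dictionary leaves ipa.txt containing HTML, which riva-build cannot use as a phone dictionary. The sketch below is a simple heuristic check to run before riva-build; the function name looks_like_html is hypothetical, and the check only inspects the first bytes of the file.

```python
from pathlib import Path


def looks_like_html(path: str, probe_bytes: int = 512) -> bool:
    """Heuristically detect an HTML page saved in place of a text file.

    Hypothetical helper for illustration; it only looks at the file's start.
    """
    head = Path(path).read_bytes()[:probe_bytes].lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))
```

For example, a script could stop with an error when looks_like_html("ipa.txt") returns True, and proceed to riva-build otherwise.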

Conclusion

We now have everything we need to deploy our model, which we can do with tts_deploy from Riva.
