Custom Models#
Model Deployment#
Like all Riva models, Riva TTS requires the following steps:
1. Create .riva files for each model from a .nemo file as outlined in the NeMo section.
2. Create .rmir files for each Riva Speech AI Skill (for example, ASR, NLP, and TTS) using riva-build.
3. Create model directories using riva_deploy.
4. Deploy the model directory using riva_server.
The following sections provide examples for steps 1 and 2 as outlined above. For steps 3 and 4, refer to Using riva-deploy and Riva Speech Container (Advanced).
Creating Riva Files#
Riva files can be created from .nemo files. As mentioned before in the NeMo
section, the generation of Riva files from .nemo files must be done on a Linux x86_64
workstation only.
The following is an example of how a HiFi-GAN model can be converted to a .riva file from a .nemo file.
1. Download the .nemo file from NGC onto the host system.
2. Run the NeMo container and share the .nemo file with the container by including the -v option.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
-v $(pwd):/NeMo \
--shm-size=8g \
-p 8888:8888 \
-p 6006:6006 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/snd \
nvcr.io/nvidia/nemo:22.08
After the container has launched, use nemo2riva to convert .nemo to .riva.
pip3 install nvidia-pyindex
ngc registry resource download-version "nvidia/riva/riva_quickstart:2.18.0"
pip3 install "riva_quickstart_v2.18.0/nemo2riva-2.18.0-py3-none-any.whl"
nemo2riva --key encryption_key --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo
Repeat this process for each .nemo model to generate .riva files. It is suggested that
you do so for FastPitch before continuing to the next step. Ensure that you are using the latest
tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest
nemo2riva wheel (nemo2riva-2.18.0-py3-none-any.whl in the command above) when performing the above step.
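Because the same conversion is repeated for every model, a small loop can help. The following is a minimal dry-run sketch that only prints one nemo2riva command per .nemo file instead of running it. The helper name print_nemo2riva_cmds and the file name tts_fastpitch.nemo are assumptions made for illustration; the --key and --out options are the ones used in the command above.

```shell
# Hypothetical helper: print (do not run) one nemo2riva command per .nemo file.
# Usage: print_nemo2riva_cmds <encryption_key> <file.nemo>...
print_nemo2riva_cmds() {
  key="$1"
  shift
  for nemo in "$@"; do
    # Derive the output name by swapping the .nemo suffix for .riva.
    riva="${nemo%.nemo}.riva"
    printf 'nemo2riva --key %s --out %s %s\n' "$key" "$riva" "$nemo"
  done
}

# tts_fastpitch.nemo is an assumed file name used only for illustration.
print_nemo2riva_cmds encryption_key /NeMo/tts_hifigan.nemo /NeMo/tts_fastpitch.nemo
```

After reviewing the printed commands, they can be run one by one inside the NeMo container.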
Customization#
After creating the .riva file and prior to running riva-build, there are a few customization
options that can be adjusted. These are optional; if you are only interested in the instructions
for building the default Riva pipeline, skip ahead to Riva-build Pipeline Instructions.
Custom Pronunciations#
Speech synthesis models deployed in Riva are configured with a language-specific pronunciation
dictionary mapping a large vocabulary of words from their written form, graphemes, to a sequence
of perceptually distinct sounds, phonemes. In cases where pronunciation is ambiguous, for example
with heteronyms like bass (the fish) and bass (the musical instrument), the dictionary is
ignored and the synthesis model uses context clues from the sentence to predict an appropriate
pronunciation.
Modern speech synthesis algorithms are surprisingly capable of accurately predicting pronunciations of novel words. Sometimes, however, it is desirable or necessary to provide extra context to the model.
While custom pronunciations can be supplied at request time using SSML, request-time overrides are best suited for one-off adjustments. For domain-specific terms with fixed pronunciations, configure Riva with these pronunciations when deploying the server.
The following key parameters, which can be configured through riva-build or in the
preprocessor configuration, affect the phoneme path:
--phone_dictionary_file is the path to the pronunciation dictionary. To start with, leave this parameter empty. If the .riva file was created from a .nemo model that contained a dictionary artifact, and this argument is not set, Riva uses the NeMo dictionary file that the model was trained with. To add custom entries and modify pronunciations, modify the NeMo dictionary artifact, save it to another file, and pass that file path to riva-build with this argument.
--preprocessor.g2p_ignore_ambiguous: if True, words that have more than one phonetic representation in the pronunciation dictionary, such as “read”, are not converted to phonemes. Defaults to True.
--upper_case_chars should be set to True if ipa is used. This affects grapheme inputs because the ipa phone set includes lower-cased English characters.
--phone_set can be used to specify whether the model was trained with arpabet or ipa. If this flag is not used, Riva attempts to auto-detect the correct phone set.
Note
--arpabet_file is deprecated as of Riva 2.8.0 and replaced by --phone_dictionary_file.
Note
Riva supports both arpabet and ipa, depending on what the acoustic model was trained on.
For more information on ARPABET, refer to the ARPABET Wikipedia page. For more information
on IPA, refer to the TTS Phoneme Support page.
To determine the appropriate phoneme sequence, use the SSML API to experiment with phone sequences and evaluate the quality. Once the mapping sounds correct, add the discovered mapping to a new line in the dictionary.
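Adding a mapping to a copy of the dictionary can be sketched as follows. This is an illustration under stated assumptions, not the definitive procedure: the helper name add_pronunciation, the file names, and the sample entries are hypothetical, and the line format (word, two spaces, space-separated phonemes) is assumed to follow the CMUdict convention, so check it against the dictionary artifact your model actually uses.

```shell
# Hypothetical helper: copy a dictionary and append one custom entry.
# Assumes a CMUdict-style line: WORD, two spaces, then space-separated phonemes.
# Usage: add_pronunciation <src_dict> <dst_dict> <WORD> <phoneme>...
add_pronunciation() {
  src="$1"; dst="$2"; word="$3"
  shift 3
  cp "$src" "$dst"
  # Make sure the copy ends with a newline before appending.
  if [ -n "$(tail -c 1 "$dst")" ]; then
    printf '\n' >> "$dst"
  fi
  printf '%s  %s\n' "$word" "$*" >> "$dst"
}

# Demo on a throwaway file; the entries and phonemes are illustrative only.
workdir="$(mktemp -d)"
printf 'HELLO  HH AH0 L OW1\n' > "$workdir/base_dict.txt"
add_pronunciation "$workdir/base_dict.txt" "$workdir/custom_dict.txt" RIVA R IY1 V AH0
cat "$workdir/custom_dict.txt"
```

The resulting file is what would be passed to riva-build with --phone_dictionary_file. Working on a copy keeps the original NeMo artifact unchanged.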
Multi-Speaker Models#
Riva supports models with multiple speakers.
To enable this feature, specify the following parameters before building the model.
--voice_name is the name of the model. Defaults to English-US.Female-1.
--subvoices is a comma-separated list of names for each subvoice, with the length equal to the number of subvoices as specified in the FastPitch model. For example, for a model with a “male” subvoice in the 0th speaker embedding and “female” subvoice in the first embedding, include the option --subvoices=Male:0,Female:1. If not provided, the desired embedding can be requested by integer index.
The voice name and subvoices are maintained in the generated .rmir file and carried into the generated Triton
repositories. During inference, modify the voice name of the request by appending a period followed by
a valid subvoice to voice_name. For example, <voice_name>.<subvoice>.
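To make the naming rule concrete, the sketch below expands a --subvoices value into the request-time voice names it implies. The helper name list_voices is hypothetical; the inputs reuse the Male:0,Female:1 example from above.

```shell
# Hypothetical helper: print <voice_name>.<subvoice> for each name:index pair.
# Usage: list_voices <voice_name> <subvoices>
list_voices() {
  printf '%s\n' "$2" | tr ',' '\n' | while IFS=: read -r name index; do
    # Only the subvoice name is used in the request; the index is the embedding slot.
    printf '%s.%s\n' "$1" "$name"
  done
}

list_voices English-US "Male:0,Female:1"
```

Counting the printed lines also gives the number of subvoices, which should match the number of speakers in the model.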
Custom Voice#
Riva is voice agnostic and can be run with any English-US TTS voice. To train a custom voice model, data must first be collected. We recommend at least 30 minutes of high-quality data. For collecting the data, refer to the Riva custom voice recorder. After the data has been collected, the FastPitch and HiFi-GAN models need to be fine-tuned on this dataset. Refer to the Riva fine-tuning tutorial for how to train these models. A Riva pipeline using these models can be built according to the instructions on this page.
Custom Text Normalization#
Riva supports custom text normalization rules built from NeMo’s WFST text normalization (TN) tool.
For details on customizing TN, refer to the NeMo WFST tutorial.
After the WFST has been customized, use NeMo to deploy it using its export_grammar script. Refer to the
documentation for more information.
This produces two files: tokenize_and_classify.far and verbalize.far. These are passed to the
riva-build step using the --wfst_tokenizer_model and --wfst_verbalizer_model arguments.
riva-build also supports the --wfst_pre_process_model and --wfst_post_process_model arguments to pass the pre- and post-processing FAR files for text normalization.
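Before running riva-build, it can be useful to confirm that both exported grammar files are present. The following is a minimal sketch: the helper name wfst_flags and the directory used in the demo are assumptions, while the two file names and the two flags are the ones named above.

```shell
# Hypothetical helper: verify both FAR files exist, then print the matching flags.
# Usage: wfst_flags <directory_with_far_files>
wfst_flags() {
  dir="$1"
  for f in tokenize_and_classify.far verbalize.far; do
    [ -f "$dir/$f" ] || { echo "missing $dir/$f" >&2; return 1; }
  done
  printf '%s\n' \
    "--wfst_tokenizer_model=$dir/tokenize_and_classify.far" \
    "--wfst_verbalizer_model=$dir/verbalize.far"
}

# Demo with empty placeholder files in a throwaway directory.
far_dir="$(mktemp -d)"
: > "$far_dir/tokenize_and_classify.far"
: > "$far_dir/verbalize.far"
wfst_flags "$far_dir"
```

The printed flags can then be appended to the riva-build command line.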
Riva-build Pipeline Instructions#
FastPitch and HiFi-GAN#
Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the Riva container:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
/servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--phone_dictionary_file=/servicemaker-dev/<dictionary_file> \
--wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
--wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
--sample_rate=<sample_rate> \
--subvoices=<subvoices>
Where:
<rmir_filename> is the Riva rmir file that is generated
<encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode, unless specified under a specific model in the list of pretrained quick start pipelines.
<pipeline_name> is an optional user-defined name for the components in the model repository
<fastpitch_riva_filename> is the name of the riva file for FastPitch
<hifigan_riva_filename> is the name of the riva file for HiFi-GAN
<abbr_file> is the name of the file containing abbreviations and their corresponding expansions
<dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
<voice_name> is the name of the model
<subvoices> is a comma-separated list of names for each subvoice. Defaults to naming by integer index. This is needed and only used for multi-speaker models.
<wfst_tokenizer_model> is the location of the tokenize_and_classify.far file that is generated by running NeMo Text Processing’s export_grammar.sh script
<wfst_verbalizer_model> is the location of the verbalize.far file that is generated by running NeMo Text Processing’s export_grammar.sh script
<sample_rate> is the sample rate of audio that the models were trained on
Upon successful completion of this command, a file named <rmir_filename> is created in the
/servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include
:<encryption_key> at the end of the RMIR and riva filenames; otherwise, this is
unnecessary.
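The optional key suffix can be handled with a small helper, sketched below. The helper name with_key and the file name tts.rmir are hypothetical; tlt_encode is the key mentioned above for the pre-trained models on NGC.

```shell
# Hypothetical helper: append ":<key>" to a file name only when a key is given.
# Usage: with_key <filename> <encryption_key_or_empty>
with_key() {
  if [ -n "$2" ]; then
    printf '%s:%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

with_key /servicemaker-dev/tts.rmir tlt_encode   # encrypted archive
with_key /servicemaker-dev/tts.rmir ""           # unencrypted archive
```

The same helper applies to each .riva argument, so one variable holding the key (or an empty string) controls the whole command line.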
For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the Riva-build Optional Parameters section and set the various min_batch_size, max_batch_size, and opt_batch_size parameters to 1 while executing the riva-build command.
Pretrained Quick Start Pipelines#
Pipeline | riva-build command
|---|---|
FastPitch + HiFi-GAN IPA (en-US Multi-Speaker) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--language_code=en-US \
--num_speakers=12 \
--phone_set=ipa \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name English-US \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--upper_case_chars=True \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']' \
--abbreviations_file=<txt_abbreviations_file>
|
FastPitch + HiFi-GAN IPA (zh-CN Multi-Speaker) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--language_code=zh-CN \
--num_speakers=10 \
--phone_set=ipa \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name Mandarin-CN \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Male-Sad:15 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--wfst_post_process_model=<far_post_process_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']'
|
FastPitch + HiFi-GAN IPA (es-ES Female) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:BSzv7YAjcH4nJS \
<riva_hifigan_file>:BSzv7YAjcH4nJS \
--language_code=es-ES \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Spanish-ES-Female-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
|
FastPitch + HiFi-GAN IPA (es-ES Male) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:PPihyG3Moru5in \
<riva_hifigan_file>:PPihyG3Moru5in \
--language_code=es-ES \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Spanish-ES-Male-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
|
FastPitch + HiFi-GAN IPA (es-US Multi-Speaker) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--language_code=es-US \
--num_speakers=12 \
--phone_set=ipa \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name Spanish-US \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Female-Sad:14,Male-Sad:15 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']'
|
FastPitch + HiFi-GAN IPA (it-IT Female) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:R62srgxeXBgVxg \
<riva_hifigan_file>:R62srgxeXBgVxg \
--language_code=it-IT \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Italian-IT-Female-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
|
FastPitch + HiFi-GAN IPA (it-IT Male) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:dVRvg47ZqCdQrR \
<riva_hifigan_file>:dVRvg47ZqCdQrR \
--language_code=it-IT \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name Italian-IT-Male-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
|
FastPitch + HiFi-GAN IPA (de-DE Male) |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:ZzZjce65zzGZ9o \
<riva_hifigan_file>:ZzZjce65zzGZ9o \
--language_code=de-DE \
--phone_dictionary_file=<dict_file> \
--sample_rate 22050 \
--voice_name German-DE-Male-1 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
|
T5TTS + AudioCodec IPA |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_t5tts_file>:<key> \
<riva_audiocodec_file>:<key> \
<riva_neuralg2p_file>:<key> \
--num_speakers=11 \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 22050 \
--voice_name English-US-T5TTS \
--subvoices Female-1:0,Male-1:1,Male-Calm:8,Female-Calm:9,Female-Fearful:11,Male-Neutral:12,Male-Angry:14,Female-Angry:16,Female-Neutral:17,Male-Fearful:20,Female-Happy:21 \
--phone_set=ipa \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.g2p_ignore_ambiguous=False \
--abbreviations_file=<txt_abbreviations_file>
|
RadTTS + HiFi-GAN IPA |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_radtts_file>:<key> \
<riva_hifigan_file>:<key> \
--num_speakers=12 \
--phone_dictionary_file=<txt_phone_dictionary_file> \
--sample_rate 44100 \
--voice_name English-US-RadTTS \
--subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
--phone_set=ipa \
--upper_case_chars=True \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']' \
--abbreviations_file=<txt_abbreviations_file>
|
FastPitch + HiFi-GAN ARPABET |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--arpabet_file=cmudict-0.7b_nv22.08 \
--sample_rate 44100 \
--voice_name English-US \
--subvoices Male-1:0,Female-1:1 \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--preprocessor.enable_emphasis_tag=True \
--preprocessor.start_of_emphasis_token='[' \
--preprocessor.end_of_emphasis_token=']' \
--abbreviations_file=<txt_file>
|
FastPitch + HiFi-GAN LJSpeech |
riva-build speech_synthesis \
<rmir_filename>:<key> \
<riva_fastpitch_file>:<key> \
<riva_hifigan_file>:<key> \
--arpabet_file=..cmudict-0.7b_nv22.08 \
--voice_name ljspeech \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--abbreviations_file=<txt_file>
|
All text normalization .far files are in NGC on the Riva TTS English Normalization Grammar page. All other auxiliary files that are not .riva files (such as pronunciation dictionaries) are in NGC on the Riva TTS English US Auxiliary Files page.
Riva-build Optional Parameters#
For details about the parameters passed to riva-build to customize the TTS pipeline, issue:
riva-build speech_synthesis -h
The following list includes descriptions for all optional parameters currently recognized by riva-build: