Custom Models#

Model Deployment#

Like all Riva models, Riva TTS requires the following steps:

1. Create .riva files for each model from a .nemo file as outlined in the NeMo section.

2. Create .rmir files for each Riva Speech AI Skill (for example, ASR, NLP, and TTS) using riva-build.

3. Create model directories using riva-deploy.

4. Deploy the model directory using riva_server.

The following sections provide examples for steps 1 and 2 as outlined above. For steps 3 and 4, refer to Using riva-deploy and Riva Speech Container (Advanced).

Creating Riva Files#

Riva files can be created from .nemo files. As mentioned before in the NeMo section, the generation of Riva files from .nemo files must be done on a Linux x86_64 workstation only.

The following is an example of how a HiFi-GAN model can be converted to a .riva file from a .nemo file.

Download the .nemo file from NGC onto the host system.
Run the NeMo container, sharing the .nemo file with the container through the -v option.

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
    -v $(pwd):/NeMo \
    --shm-size=8g \
    -p 8888:8888 \
    -p 6006:6006 \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --device=/dev/snd \
    nvcr.io/nvidia/nemo:22.08

After the container has launched, use nemo2riva to convert .nemo to .riva.

pip3 install nvidia-pyindex
ngc registry resource download-version "nvidia/riva/riva_quickstart:2.18.0"
pip3 install "riva_quickstart_v2.18.0/nemo2riva-2.18.0-py3-none-any.whl"
nemo2riva --key encryption_key --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo

Repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. When performing the above step, ensure that you are using the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva .whl version.
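The repetition described above can be sketched as a small loop. This is only an illustration: the convert_all function and the NEMO2RIVA override variable are hypothetical names, and each output file simply reuses the base name of its .nemo input, which differs from the hifigan.riva name chosen in the example above.

```shell
#!/bin/sh
# Convert every .nemo file in a directory to a .riva file with the same base name.
# NEMO2RIVA (hypothetical override) can be set to "echo" for a dry run.
convert_all() {
    dir=$1
    key=$2
    for nemo in "$dir"/*.nemo; do
        # Skip the unexpanded pattern when the directory holds no .nemo files.
        [ -e "$nemo" ] || continue
        riva="${nemo%.nemo}.riva"
        "${NEMO2RIVA:-nemo2riva}" --key "$key" --out "$riva" "$nemo" || return 1
    done
}

# Example (inside the NeMo container):
#   convert_all /NeMo encryption_key
```

Setting NEMO2RIVA=echo before calling convert_all prints the commands without running them, which is a cheap way to check the output names first.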

Customization#

After creating the .riva file and prior to running riva-build, there are a few customization options that can be adjusted. These are optional. If you are interested only in the instructions for building the default Riva pipeline, skip ahead to Riva-build Pipeline Instructions.

Custom Pronunciations#

Speech synthesis models deployed in Riva are configured with a language-specific pronunciation dictionary mapping a large vocabulary of words from their written form, graphemes, to a sequence of perceptually distinct sounds, phonemes. In cases where pronunciation is ambiguous, for example with heteronyms like bass (the fish) and bass (the musical instrument), the dictionary is ignored and the synthesis model uses context clues from the sentence to predict an appropriate pronunciation.

Modern speech synthesis algorithms are surprisingly capable of accurately predicting pronunciations of novel words. Sometimes, however, it is desirable or necessary to provide extra context to the model.

While custom pronunciations can be supplied at request time using SSML, request-time overrides are best suited for one-off adjustments. For domain-specific terms with fixed pronunciations, configure Riva with these pronunciations when deploying the server.
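A request-time override might look like the following sketch. It assumes the SSML phoneme tag with alphabet and ph attributes; the ssml_phoneme helper is a hypothetical name, and the IPA transcription is illustrative rather than taken from any Riva dictionary.

```shell
#!/bin/sh
# Wrap one word in an SSML phoneme tag so that its pronunciation is overridden
# for a single request. Arguments: word, phoneme string, alphabet.
ssml_phoneme() {
    printf '<phoneme alphabet="%s" ph="%s">%s</phoneme>' "$3" "$2" "$1"
}

# Illustrative IPA transcription; use the phone set your model was trained on.
request="<speak>You say $(ssml_phoneme tomato 'təˈmeɪˌtoʊ' ipa).</speak>"
printf '%s\n' "$request"
```

The resulting string is what a client would send as the request text. Once the same override is needed in many requests, moving it into the deployed dictionary avoids repeating the tag everywhere.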

Several key parameters that affect the phoneme path can be configured through riva-build or in the preprocessor configuration:

--phone_dictionary_file is the path to the pronunciation dictionary. To start with, leave this parameter empty. If the .riva file was created from a .nemo model that contained a dictionary artifact, and this argument is not set, Riva will use the NeMo dictionary file that the model was trained with. To add custom entries and modify pronunciation, modify the NeMo dictionary artifact, save it to another file, and pass that file path to riva-build with this argument.
--preprocessor.g2p_ignore_ambiguous If True, words that have more than one phonetic representation in the pronunciation dictionary such as “read” are not converted to phonemes. Defaults to True.
--upper_case_chars should be set to True if ipa is used. This affects grapheme inputs as the ipa phone set includes lower-cased English characters.
--phone_set can be used to specify whether the model was trained with arpabet or ipa. If this flag is not used, Riva attempts to auto-detect the correct phone set.

Note

--arpabet_file is deprecated as of Riva 2.8.0 and replaced by --phone_dictionary_file.

Note

Riva supports both arpabet and ipa, depending on what the acoustic model was trained on. For more information on ARPABET, refer to the ARPABET Wikipedia page. For more information on IPA, refer to the TTS Phoneme Support page.

To determine the appropriate phoneme sequence, use the SSML API to experiment with phone sequences and evaluate the quality. Once the mapping sounds correct, add the discovered mapping to a new line in the dictionary.
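Adding a mapping to a copy of the dictionary can be sketched as follows. The line layout (word, two spaces, phoneme sequence) is an assumption; mirror whatever layout the existing entries in your dictionary use. The add_pronunciation function and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Append a custom entry to a copy of the pronunciation dictionary, leaving the
# original artifact untouched. Arguments: source dict, output dict, word, phones.
add_pronunciation() {
    src=$1
    dst=$2
    word=$3
    phones=$4
    # A second entry would make the word ambiguous, so stop if one exists.
    # The comparison against the first field is case-sensitive.
    if awk -v w="$word" '$1 == w { found = 1 } END { exit !found }' "$src"; then
        echo "entry for $word already exists in $src" >&2
        return 1
    fi
    cp "$src" "$dst" || return 1
    printf '%s  %s\n' "$word" "$phones" >> "$dst"
}

# Example (hypothetical file names and phoneme sequence):
#   add_pronunciation nemo_dict.txt custom_dict.txt RIVA 'R IY1 V AH0'
```

The copy is then passed to riva-build with --phone_dictionary_file. The duplicate check matters because, with --preprocessor.g2p_ignore_ambiguous left at its default, words with more than one dictionary entry are not converted to phonemes.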

Multi-Speaker Models#

Riva supports models with multiple speakers.

To enable this feature, specify the following parameters before building the model.

--voice_name is the name of the model. Defaults to English-US.Female-1.
--subvoices is a comma-separated list of names for each subvoice, with the length equal to the number of subvoices as specified in the FastPitch model. For example, for a model with a “male” subvoice in the 0th speaker embedding and “female” subvoice in the first embedding, include the option --subvoices=Male:0,Female:1. If not provided, the desired embedding can be requested by integer index.

The voice name and subvoices are maintained in the generated .rmir file and carried into the generated Triton repositories. During inference, modify the voice name of the request by appending a period followed by a valid subvoice to voice_name. For example, <voice_name>.<subvoice>.
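As a rough illustration of the naming rule, the sketch below derives the request-time voice names from the values given to --voice_name and --subvoices. The request_voice_names function is a hypothetical helper, not part of Riva.

```shell
#!/bin/sh
# Print the voice name a client would request for each subvoice, given the
# values passed to --voice_name and --subvoices at build time.
request_voice_names() {
    voice=$1
    subvoices=$2
    # Split "Male:0,Female:1" on commas, then drop the ":<index>" suffix.
    printf '%s\n' "$subvoices" | tr ',' '\n' | while IFS=: read -r name index; do
        if [ -n "$name" ]; then
            printf '%s.%s\n' "$voice" "$name"
        fi
    done
}

request_voice_names English-US Male:0,Female:1
```

For the example options in this section, the names produced are English-US.Male and English-US.Female.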

Custom Voice#

Riva is voice agnostic and can be run with any English-US TTS voice. In order to train a custom voice model, data must first be collected. We recommend at least 30 minutes of high-quality data. For collecting the data, refer to the Riva custom voice recorder. After the data has been collected, the FastPitch and HiFi-GAN models need to be fine-tuned on this dataset. Refer to the Riva fine-tuning tutorial for how to train these models. A Riva pipeline using these models can be built according to the instructions on this page.

Custom Text Normalization#

Riva supports custom text normalization rules built from NeMo’s WFST text normalization (TN) tool. For details on customizing TN, refer to the NeMo WFST tutorial. After the WFST has been customized, use NeMo to deploy it using its export_grammar script. Refer to the documentation for more information. This produces two files: tokenize_and_classify.far and verbalize.far. These are passed to the riva-build step using the --wfst_tokenizer_model and --wfst_verbalizer_model arguments. Additionally, riva-build also supports --wfst_pre_process_model and --wfst_post_process_model arguments to pass the pre and post processing FAR files for text normalization.
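Before running riva-build, it can help to confirm that both exported grammars are present. The sketch below prints the two flags for a directory of FAR files; the tn_flags function is a hypothetical helper, and it assumes the files keep the names produced by the export step.

```shell
#!/bin/sh
# Print the riva-build flags for a directory of exported grammars, failing
# early if either FAR file is missing.
tn_flags() {
    far_dir=$1
    for far in tokenize_and_classify.far verbalize.far; do
        if [ ! -f "$far_dir/$far" ]; then
            echo "missing $far_dir/$far" >&2
            return 1
        fi
    done
    printf '%s\n' \
        "--wfst_tokenizer_model=$far_dir/tokenize_and_classify.far" \
        "--wfst_verbalizer_model=$far_dir/verbalize.far"
}

# Example: tn_flags /servicemaker-dev
```

The printed flags can be pasted into the riva-build command, so a missing grammar is caught before a long build starts.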

Riva-build Pipeline Instructions#

FastPitch and HiFi-GAN#

Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the Riva container:

riva-build speech_synthesis \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
    /servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
    --voice_name=<pipeline_name> \
    --abbreviations_file=/servicemaker-dev/<abbr_file> \
    --phone_dictionary_file=/servicemaker-dev/<dictionary_file> \
    --wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
    --wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
    --sample_rate=<sample_rate> \
    --subvoices=<subvoices>

Where:

<rmir_filename> is the Riva rmir file that is generated
<encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode, unless specified under a specific model in the list of pretrained quick start pipelines.
<pipeline_name> is an optional user-defined name for the components in the model repository
<fastpitch_riva_filename> is the name of the riva file for FastPitch
<hifigan_riva_filename> is the name of the riva file for HiFi-GAN
<abbr_file> is the name of the file containing abbreviations and their corresponding expansions
<dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
<voice_name> is the name of the model
<subvoices> is a comma-separated list of names for each subvoice. Defaults to naming by integer index. This is needed and only used for multi-speaker models.
<tokenizer_far_file> is the location of the tokenize_and_classify.far file that is generated by running the export_grammar.sh script from NeMo Text Processing
<verbalizer_far_file> is the location of the verbalize.far file that is generated by running the export_grammar.sh script from NeMo Text Processing
<sample_rate> is the sample rate of audio that the models were trained on

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, append :<encryption_key> to the RMIR and .riva filenames. If they are not encrypted, omit the suffix.
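The key suffix rule can be captured in a small helper so that the same script works for encrypted and unencrypted archives. This is a sketch: the with_key function and the tts.rmir file name are hypothetical.

```shell
#!/bin/sh
# Append ":<key>" to a file path only when an encryption key is supplied.
with_key() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

with_key /servicemaker-dev/tts.rmir tlt_encode
with_key /servicemaker-dev/tts.rmir ""
```

Inside a build script, each positional argument to riva-build would then be written as "$(with_key <path> "$KEY")", with KEY left empty for unencrypted files.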

For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the Riva-build Optional Parameters section and set the various min_batch_size, max_batch_size, and opt_batch_size parameters to 1 while executing the riva-build command.
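One way to find the batch size parameters is to filter the riva-build help text. The sketch below reads help text on standard input and prints each distinct flag whose name ends in batch_size. The list_batch_size_flags function is a hypothetical helper, and the exact flag names depend on your Riva version.

```shell
#!/bin/sh
# Read riva-build help text on stdin and print each distinct batch-size flag.
list_batch_size_flags() {
    grep -oE -- '--[A-Za-z0-9_.]+batch_size' | sort -u
}

# Example (inside the Riva container):
#   riva-build speech_synthesis -h 2>&1 | list_batch_size_flags
```

Each flag in the output can then be appended to the riva-build command with a value of 1.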

Pretrained Quick Start Pipelines#

Pipeline	riva-build command
FastPitch + HiFi-GAN IPA (en-US Multi-Speaker)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:<key> \ <riva_hifigan_file>:<key> \ --language_code=en-US \ --num_speakers=12 \ --phone_set=ipa \ --phone_dictionary_file=<txt_phone_dictionary_file> \ --sample_rate 44100 \ --voice_name English-US \ --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --upper_case_chars=True \ --preprocessor.enable_emphasis_tag=True \ --preprocessor.start_of_emphasis_token='[' \ --preprocessor.end_of_emphasis_token=']' \ --abbreviations_file=<txt_abbreviations_file>
FastPitch + HiFi-GAN IPA (zh-CN Multi-Speaker)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:<key> \ <riva_hifigan_file>:<key> \ --language_code=zh-CN \ --num_speakers=10 \ --phone_set=ipa \ --phone_dictionary_file=<txt_phone_dictionary_file> \ --sample_rate 44100 \ --voice_name Mandarin-CN \ --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Male-Sad:15 \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --wfst_post_process_model=<far_post_process_file> \ --preprocessor.enable_emphasis_tag=True \ --preprocessor.start_of_emphasis_token='[' \ --preprocessor.end_of_emphasis_token=']'
FastPitch + HiFi-GAN IPA (es-ES Female)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:BSzv7YAjcH4nJS \ <riva_hifigan_file>:BSzv7YAjcH4nJS \ --language_code=es-ES \ --phone_dictionary_file=<dict_file> \ --sample_rate 22050 \ --voice_name Spanish-ES-Female-1 \ --phone_set=ipa \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (es-ES Male)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:PPihyG3Moru5in \ <riva_hifigan_file>:PPihyG3Moru5in \ --language_code=es-ES \ --phone_dictionary_file=<dict_file> \ --sample_rate 22050 \ --voice_name Spanish-ES-Male-1 \ --phone_set=ipa \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (es-US Multi-Speaker)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:<key> \ <riva_hifigan_file>:<key> \ --language_code=es-US \ --num_speakers=12 \ --phone_set=ipa \ --phone_dictionary_file=<txt_phone_dictionary_file> \ --sample_rate 44100 \ --voice_name Spanish-US \ --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Female-Sad:14,Male-Sad:15 \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --preprocessor.enable_emphasis_tag=True \ --preprocessor.start_of_emphasis_token='[' \ --preprocessor.end_of_emphasis_token=']'
FastPitch + HiFi-GAN IPA (it-IT Female)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:R62srgxeXBgVxg \ <riva_hifigan_file>:R62srgxeXBgVxg \ --language_code=it-IT \ --phone_dictionary_file=<dict_file> \ --sample_rate 22050 \ --voice_name Italian-IT-Female-1 \ --phone_set=ipa \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (it-IT Male)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:dVRvg47ZqCdQrR \ <riva_hifigan_file>:dVRvg47ZqCdQrR \ --language_code=it-IT \ --phone_dictionary_file=<dict_file> \ --sample_rate 22050 \ --voice_name Italian-IT-Male-1 \ --phone_set=ipa \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --abbreviations_file=<txt_file>
FastPitch + HiFi-GAN IPA (de-DE Male)	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:ZzZjce65zzGZ9o \ <riva_hifigan_file>:ZzZjce65zzGZ9o \ --language_code=de-DE \ --phone_dictionary_file=<dict_file> \ --sample_rate 22050 \ --voice_name German-DE-Male-1 \ --phone_set=ipa \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --abbreviations_file=<txt_file>
T5TTS + AudioCodec IPA	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_t5tts_file>:<key> \ <riva_audiocodec_file>:<key> \ <riva_neuralg2p_file>:<key> \ --num_speakers=11 \ --phone_dictionary_file=<txt_phone_dictionary_file> \ --sample_rate 22050 \ --voice_name English-US-T5TTS \ --subvoices Female-1:0,Male-1:1,Male-Calm:8,Female-Calm:9,Female-Fearful:11,Male-Neutral:12,Male-Angry:14,Female-Angry:16,Female-Neutral:17,Male-Fearful:20,Female-Happy:21 \ --phone_set=ipa \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --preprocessor.g2p_ignore_ambiguous=False \ --abbreviations_file=<txt_abbreviations_file>
RadTTS + HiFi-GAN IPA	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_radtts_file>:<key> \ <riva_hifigan_file>:<key> \ --num_speakers=12 \ --phone_dictionary_file=<txt_phone_dictionary_file> \ --sample_rate 44100 \ --voice_name English-US-RadTTS \ --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \ --phone_set=ipa \ --upper_case_chars=True \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --preprocessor.enable_emphasis_tag=True \ --preprocessor.start_of_emphasis_token='[' \ --preprocessor.end_of_emphasis_token=']' \ --abbreviations_file=<txt_abbreviations_file>
FastPitch + HiFi-GAN ARPABET	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:<key> \ <riva_hifigan_file>:<key> \ --arpabet_file=cmudict-0.7b_nv22.08 \ --sample_rate 44100 \ --voice_name English-US \ --subvoices Male-1:0,Female-1:1 \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --preprocessor.enable_emphasis_tag=True \ --preprocessor.start_of_emphasis_token='[' \ --preprocessor.end_of_emphasis_token=']' \ --abbreviations_file=<txt_file>
FastPitch + HiFi-GAN LJSpeech	riva-build speech_synthesis \ <rmir_filename>:<key> \ <riva_fastpitch_file>:<key> \ <riva_hifigan_file>:<key> \ --arpabet_file=..cmudict-0.7b_nv22.08 \ --voice_name ljspeech \ --wfst_tokenizer_model=<far_tokenizer_file> \ --wfst_verbalizer_model=<far_verbalizer_file> \ --abbreviations_file=<txt_file>

All text normalization .far files are in NGC on the Riva TTS English Normalization Grammar page. All other auxiliary files that are not .riva files (such as pronunciation dictionaries) are in NGC on the Riva TTS English US Auxiliary Files page.
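In the multi-speaker rows of the table above, the value of --num_speakers equals the number of entries in --subvoices, even where the indices are not contiguous. Whether riva-build enforces this is not stated here, so the following is only a sanity check to run before building. The subvoices_match_speakers function is a hypothetical helper.

```shell
#!/bin/sh
# Succeed when the number of comma-separated --subvoices entries equals the
# value given to --num_speakers.
subvoices_match_speakers() {
    num_speakers=$1
    subvoices=$2
    count=$(printf '%s\n' "$subvoices" | tr ',' '\n' | grep -c ':')
    [ "$count" -eq "$num_speakers" ]
}

# Example:
#   subvoices_match_speakers 2 Male-1:0,Female-1:1 && echo "counts agree"
```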

Riva-build Optional Parameters#

For details about the parameters passed to riva-build to customize the TTS pipeline, issue:

riva-build speech_synthesis -h

The following list includes descriptions for all optional parameters currently recognized by riva-build:
