Custom Models#

Model Deployment#

Like all Riva models, Riva TTS requires the following steps:

  1. Create .riva files for each model from a .nemo file as outlined in the NeMo section.

  2. Create .rmir files for each Riva Speech AI Skill (for example, ASR, NLP, and TTS) using riva-build.

  3. Create model directories using riva-deploy.

  4. Deploy the model directory using riva_server.

The following sections provide examples for steps 1 and 2 as outlined above. For steps 3 and 4, refer to Using riva-deploy and Riva Speech Container (Advanced).

Creating Riva Files#

Riva files can be created from .nemo files. As mentioned before in the NeMo section, the generation of Riva files from .nemo files must be done on a Linux x86_64 workstation only.

The following is an example of how a HiFi-GAN model can be converted to a .riva file from a .nemo file.

  1. Download the .nemo file from NGC onto the host system.

  2. Run the NeMo container and share the .nemo file with the container using the -v option.

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
    -v $(pwd):/NeMo \
    --shm-size=8g \
    -p 8888:8888 \
    -p 6006:6006 \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --device=/dev/snd \
    nvcr.io/nvidia/nemo:22.08

  3. After the container has launched, use nemo2riva to convert .nemo to .riva.

pip3 install nvidia-pyindex
ngc registry resource download-version "nvidia/riva/riva_quickstart:2.14.0"
pip3 install "riva_quickstart_v2.14.0/nemo2riva-2.14.0-py3-none-any.whl"
nemo2riva --key encryption_key --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo

Repeat this process for each .nemo model to generate .riva files. We suggest doing so for FastPitch before continuing to the next step. When performing the above steps, ensure that you are using the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva wheel (nemo2riva-2.14.0-py3-none-any.whl in the example above).
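Because the conversion is repeated for every model, it can help to generate the nemo2riva command lines in one place. The following Python sketch is illustrative only: the helper name nemo2riva_commands and the FastPitch file name are hypothetical, and it relies only on the --key and --out options shown in the command above.

```python
import shlex
from pathlib import PurePosixPath


def nemo2riva_commands(nemo_files, key, out_dir="/NeMo"):
    """Build one nemo2riva command line per .nemo file (hypothetical helper)."""
    commands = []
    for nemo_file in nemo_files:
        path = PurePosixPath(nemo_file)
        # Name the output after the input, swapping .nemo for .riva.
        out_path = PurePosixPath(out_dir) / (path.stem + ".riva")
        argv = ["nemo2riva", "--key", key, "--out", str(out_path), str(path)]
        commands.append(shlex.join(argv))
    return commands


# tts_fastpitch.nemo is an assumed file name, used for illustration only.
for line in nemo2riva_commands(
    ["/NeMo/tts_hifigan.nemo", "/NeMo/tts_fastpitch.nemo"], "encryption_key"
):
    print(line)
```

The printed lines can be pasted into the NeMo container shell. Output names here follow the input names, which differs slightly from the hifigan.riva name used in the example above.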

Customization#

After creating the .riva file and prior to running riva-build, there are a few customization options that can be adjusted. These are optional. If you are only interested in the instructions for building the default Riva pipeline, skip ahead to Riva-build Pipeline Instructions.

Custom Pronunciations#

Speech synthesis models deployed in Riva are configured with a language-specific pronunciation dictionary mapping a large vocabulary of words from their written form, graphemes, to a sequence of perceptually distinct sounds, phonemes. In cases where pronunciation is ambiguous, for example with heteronyms like bass (the fish) and bass (the musical instrument), the dictionary is ignored and the synthesis model uses context clues from the sentence to predict an appropriate pronunciation.

Modern speech synthesis algorithms are surprisingly capable of accurately predicting pronunciations of novel words. Sometimes, however, it is desirable or necessary to provide extra context to the model.

While custom pronunciations can be supplied at request time using SSML, request-time overrides are best suited for one-off adjustments. For domain-specific terms with fixed pronunciations, configure Riva with these pronunciations when deploying the server.

The following parameters, which can be configured through riva-build or in the preprocessor configuration, affect the phoneme path:

  • --phone_dictionary_file is the path to the pronunciation dictionary. To start with, leave this parameter empty. If the .riva file was created from a .nemo model that contained a dictionary artifact, and this argument is not set, Riva uses the NeMo dictionary file that the model was trained with. To add custom entries and modify pronunciations, modify the NeMo dictionary artifact, save it to another file, and pass that file path to riva-build with this argument.

  • --preprocessor.g2p_ignore_ambiguous If True, words that have more than one phonetic representation in the pronunciation dictionary such as “read” are not converted to phonemes. Defaults to True.

  • --upper_case_chars should be set to True if ipa is used. This affects grapheme inputs as the ipa phone set includes lower-cased English characters.

  • --phone_set can be used to specify whether the model was trained with arpabet or ipa. If this flag is not used, Riva attempts to auto-detect the correct phone set.
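As a rough illustration of preparing a modified dictionary for --phone_dictionary_file, here is a minimal Python sketch. It assumes the dictionary is a plain-text file with one entry per line in a CMUdict-like layout (word, whitespace, then phonemes); this page does not specify the format, so check your model's dictionary artifact first. The function name and the sample entry are hypothetical, and the phoneme string is not a verified pronunciation.

```python
def merge_pronunciations(dictionary_text, custom_entries, separator="  "):
    """Return dictionary text with custom entries added or replaced.

    Assumes one entry per line: the word, whitespace, then the phonemes.
    """
    lines = []
    for line in dictionary_text.splitlines():
        word = line.split(None, 1)[0] if line.strip() else ""
        # Drop existing lines for words that are being overridden.
        if word in custom_entries:
            continue
        lines.append(line)
    # Append each custom mapping on its own new line.
    for word, phonemes in custom_entries.items():
        lines.append(f"{word}{separator}{phonemes}")
    return "\n".join(lines) + "\n"


original = "HELLO  HH AH0 L OW1\n"
updated = merge_pronunciations(original, {"NVIDIA": "EH0 N V IH1 D IY0 AH0"})
print(updated, end="")
```

Write the result to a new file rather than overwriting the NeMo artifact, then pass that file's path to riva-build.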

Note

--arpabet_file is deprecated as of Riva 2.8.0 and replaced by --phone_dictionary_file.

Note

Riva supports both arpabet and ipa, depending on what the acoustic model was trained on. For more information on ARPABET, refer to the ARPABET Wikipedia page. For more information on IPA, refer to the TTS Phoneme Support page.

To determine the appropriate phoneme sequence, use the SSML API to experiment with phone sequences and evaluate the quality. Once the mapping sounds correct, add the discovered mapping to a new line in the dictionary.
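When experimenting with phone sequences, a small helper for generating SSML strings can save typing. The Python sketch below builds a standard SSML phoneme element. The function name is hypothetical, and which alphabet values a given Riva deployment accepts is an assumption to verify against the Riva SSML documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_ssml(word, phones, alphabet="ipa"):
    """Wrap a word in a standard SSML <phoneme> element (hypothetical helper)."""
    return (
        "<speak>"
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(phones)}>"
        f"{escape(word)}</phoneme>"
        "</speak>"
    )


# Try several candidate phone sequences for the same word and listen to each.
for candidate in ("təˈmeɪtoʊ", "təˈmɑːtoʊ"):
    print(phoneme_ssml("tomato", candidate))
```

Each printed string can be sent as the request text; once a candidate sounds right, record it in the dictionary.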

Multi-Speaker Models#

Riva supports models with multiple speakers.

To enable this feature, specify the following parameters before building the model.

  • --voice_name is the name of the model. Defaults to English-US.Female-1.

  • --subvoices is a comma-separated list of names for each subvoice, with the length equal to the number of subvoices as specified in the FastPitch model. For example, for a model with a “male” subvoice in the 0th speaker embedding and “female” subvoice in the first embedding, include the option --subvoices=Male:0,Female:1. If not provided, the desired embedding can be requested by integer index.

The voice name and subvoices are maintained in the generated .rmir file and carried into the generated Triton repositories. During inference, select a subvoice by appending a period and a valid subvoice name to voice_name in the request, for example, <voice_name>.<subvoice>.
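The naming scheme above can be sketched in a few lines of Python. This is an illustration, not Riva's own parsing code: both function names are hypothetical, and the sketch assumes every entry uses the Name:index form shown in the examples on this page.

```python
def parse_subvoices(spec):
    """Parse a --subvoices value such as "Male:0,Female:1" into {name: index}."""
    mapping = {}
    for item in spec.split(","):
        name, _, index = item.partition(":")
        if not name or not index.isdigit():
            raise ValueError(f"malformed subvoice entry: {item!r}")
        mapping[name] = int(index)
    return mapping


def request_voice_name(voice_name, subvoice, subvoices):
    """Compose <voice_name>.<subvoice> after checking that the subvoice exists."""
    if subvoice not in subvoices:
        raise KeyError(f"unknown subvoice: {subvoice}")
    return f"{voice_name}.{subvoice}"


subvoices = parse_subvoices("Male:0,Female:1")
print(request_voice_name("English-US", "Female", subvoices))
```

A check like this can also confirm that the number of parsed entries matches the value passed to --num_speakers before running riva-build.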

Custom Voice#

Riva is voice agnostic and can be run with any English-US TTS voice. To train a custom voice model, data must first be collected. We recommend at least 30 minutes of high-quality data. For collecting the data, refer to the Riva custom voice recorder. After the data has been collected, the FastPitch and HiFi-GAN models need to be fine-tuned on this dataset. Refer to the Riva fine-tuning tutorial for how to train these models. A Riva pipeline using these models can be built according to the instructions on this page.

Custom Text Normalization#

Riva supports custom text normalization rules built from NeMo’s WFST text normalization (TN) tool. For details on customizing TN, refer to the NeMo WFST tutorial. After the WFST has been customized, use NeMo to deploy it using its export_grammar script. Refer to the documentation for more information. This produces two files: tokenize_and_classify.far and verbalize.far. These are passed to the riva-build step using the --wfst_tokenizer_model and --wfst_verbalizer_model arguments.

Riva-build Pipeline Instructions#

FastPitch and HiFi-GAN#

Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:

riva-build speech_synthesis \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
    /servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
    --voice_name=<pipeline_name> \
    --abbreviations_file=/servicemaker-dev/<abbr_file> \
    --phone_dictionary_file=/servicemaker-dev/<dictionary_file> \
    --wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
    --wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
    --sample_rate=<sample_rate> \
    --subvoices=<subvoices>

Where:

  • <rmir_filename> is the Riva rmir file that is generated

  • <encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.

  • <pipeline_name> is an optional user-defined name for the components in the model repository

  • <fastpitch_riva_filename> is the name of the riva file for FastPitch

  • <hifigan_riva_filename> is the name of the riva file for HiFi-GAN

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions

  • <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET

  • <voice_name> is the name of the model

  • <subvoices> is a comma-separated list of names for each subvoice. Defaults to naming by integer index. This is needed and only used for multi-speaker models.

  • <tokenizer_far_file> is the location of the tokenize_and_classify.far file that is generated by running NeMo Text Processing’s export_grammar.sh script

  • <verbalizer_far_file> is the location of the verbalize.far file that is generated by running NeMo Text Processing’s export_grammar.sh script

  • <sample_rate> is the sample rate of audio that the models were trained on

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, include :<encryption_key> at the end of the RMIR and .riva filenames; otherwise, omit it.
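The rule about appending the encryption key only when it is needed can be captured in a small command builder. The following Python sketch is a hypothetical helper, not part of Riva. The file names are placeholders chosen for illustration, and option names passed to it must be real riva-build flags such as those shown above.

```python
def with_key(path, key=None):
    """Append :<encryption_key> only when a key is supplied."""
    return f"{path}:{key}" if key else path


def riva_build_tts_argv(rmir, fastpitch, hifigan, key=None, **options):
    """Assemble a riva-build speech_synthesis argument list (hypothetical helper)."""
    argv = ["riva-build", "speech_synthesis"]
    # Output RMIR first, then the FastPitch and HiFi-GAN .riva sources.
    argv += [with_key(path, key) for path in (rmir, fastpitch, hifigan)]
    # Options are emitted as --name=value in the order they were given.
    argv += [f"--{name}={value}" for name, value in options.items()]
    return argv


argv = riva_build_tts_argv(
    "/servicemaker-dev/tts.rmir",        # placeholder file names
    "/servicemaker-dev/fastpitch.riva",
    "/servicemaker-dev/hifigan.riva",
    key="tlt_encode",
    voice_name="English-US",
    sample_rate=44100,
)
print(" \\\n    ".join(argv))
```

Returning a list rather than a single string keeps the result usable with subprocess.run without shell quoting concerns.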

For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the Riva-build Optional Parameters section and set the various min_batch_size, max_batch_size, and opt_batch_size parameters to 1 while executing the riva-build command.
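Setting every batch size parameter by hand is repetitive, so the flags can be generated instead. The Python sketch below is illustrative only: the component names are taken from the riva-build usage listing in the Riva-build Optional Parameters section, and setting all of them is an assumption about what a FastPitch and HiFi-GAN pipeline needs. The helper name is hypothetical.

```python
# Component prefixes from the riva-build speech_synthesis usage listing.
COMPONENTS = (
    "preprocessor",
    "encoderFastPitch",
    "chunkerFastPitch",
    "hifigan",
    "postprocessor",
)
BATCH_PARAMS = ("min_batch_size", "max_batch_size", "opt_batch_size")


def batch_size_flags(size=1, components=COMPONENTS):
    """Return riva-build flags that pin every batch size to the same value."""
    flags = [f"--max_batch_size={size}"]  # top-level default
    for component in components:
        for param in BATCH_PARAMS:
            flags.append(f"--{component}.{param}={size}")
    return flags


print(" ".join(batch_size_flags(1)))
```

For a RadTTS pipeline, the encoderRadTTS prefix from the same usage listing would presumably replace the FastPitch components.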

Pretrained Quick Start Pipelines#

Pipeline

riva-build command

FastPitch + HiFi-GAN IPA (en-US Multi-Speaker)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=en-US \
    --num_speakers=12 \
    --phone_set=ipa \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name English-US \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --upper_case_chars=True \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']' \
    --abbreviations_file=<txt_abbreviations_file>

FastPitch + HiFi-GAN IPA (zh-CN Multi-Speaker)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=zh-CN \
    --num_speakers=10 \
    --phone_set=ipa \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name Mandarin-CN \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Male-Sad:15 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']'

FastPitch + HiFi-GAN IPA (es-ES Female)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=es-ES \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Spanish-ES-Female-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (es-ES Male)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=es-ES \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Spanish-ES-Male-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (es-US Multi-Speaker)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=es-US \
    --num_speakers=12 \
    --phone_set=ipa \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name Spanish-US \
    --subvoices Female-Calm:0,Male-1:1,Male-Happy:2,Female-Narrator:6,Male-Calm:7,Female-Angry:8,Male-Neutral:9,Male-Narrator:11,Female-Sad:12,Female-1:14,Male-Angry:15,Female-Neutral:16 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']'

FastPitch + HiFi-GAN IPA (it-IT Female)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=it-IT \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Italian-IT-Female-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (it-IT Male)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=it-IT \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Italian-IT-Male-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

RadTTS + HiFi-GAN IPA

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_radtts_file>:<key> \
    <riva_hifigan_file>:<key> \
    --num_speakers=12 \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name English-US-RadTTS \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
    --phone_set=ipa \
    --upper_case_chars=True \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']' \
    --abbreviations_file=<txt_abbreviations_file>

FastPitch + HiFi-GAN ARPABET

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --arpabet_file=cmudict-0.7b_nv22.08 \
    --sample_rate 44100 \
    --voice_name English-US \
    --subvoices Male-1:0,Female-1:1 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']' \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN LJSpeech

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --arpabet_file=..cmudict-0.7b_nv22.08 \
    --voice_name ljspeech \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

All text normalization .far files are in NGC on the Riva TTS English Normalization Grammar page. All other auxiliary files that are not .riva files (such as pronunciation dictionaries) are in NGC on the Riva TTS English US Auxiliary Files page.

Riva-build Optional Parameters#

For details about the parameters passed to riva-build to customize the TTS pipeline, issue:

riva-build speech_synthesis -h

The following list includes descriptions for all optional parameters currently recognized by riva-build:

usage: riva-build speech_synthesis [-h] [-f] [-v]
                                   [--language_code LANGUAGE_CODE]
                                   [--max_batch_size MAX_BATCH_SIZE]
                                   [--voice_name VOICE_NAME]
                                   [--num_speakers NUM_SPEAKERS]
                                   [--subvoices SUBVOICES]
                                   [--sample_rate SAMPLE_RATE]
                                   [--chunk_length CHUNK_LENGTH]
                                   [--overlap_length OVERLAP_LENGTH]
                                   [--num_mels NUM_MELS]
                                   [--num_samples_per_frame NUM_SAMPLES_PER_FRAME]
                                   [--abbreviations_file ABBREVIATIONS_FILE]
                                   [--has_mapping_file HAS_MAPPING_FILE]
                                   [--mapping_file MAPPING_FILE]
                                   [--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
                                   [--wfst_verbalizer_model WFST_VERBALIZER_MODEL]
                                   [--arpabet_file ARPABET_FILE]
                                   [--phone_dictionary_file PHONE_DICTIONARY_FILE]
                                   [--phone_set PHONE_SET]
                                   [--upper_case_chars UPPER_CASE_CHARS]
                                   [--upper_case_g2p UPPER_CASE_G2P]
                                   [--postprocessor.max_sequence_idle_microseconds POSTPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--postprocessor.max_batch_size POSTPROCESSOR.MAX_BATCH_SIZE]
                                   [--postprocessor.min_batch_size POSTPROCESSOR.MIN_BATCH_SIZE]
                                   [--postprocessor.opt_batch_size POSTPROCESSOR.OPT_BATCH_SIZE]
                                   [--postprocessor.preferred_batch_size POSTPROCESSOR.PREFERRED_BATCH_SIZE]
                                   [--postprocessor.batching_type POSTPROCESSOR.BATCHING_TYPE]
                                   [--postprocessor.preserve_ordering POSTPROCESSOR.PRESERVE_ORDERING]
                                   [--postprocessor.instance_group_count POSTPROCESSOR.INSTANCE_GROUP_COUNT]
                                   [--postprocessor.max_queue_delay_microseconds POSTPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--postprocessor.optimization_graph_level POSTPROCESSOR.OPTIMIZATION_GRAPH_LEVEL]
                                   [--postprocessor.fade_length POSTPROCESSOR.FADE_LENGTH]
                                   [--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE]
                                   [--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE]
                                   [--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE]
                                   [--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE]
                                   [--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE]
                                   [--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING]
                                   [--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT]
                                   [--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--preprocessor.optimization_graph_level PREPROCESSOR.OPTIMIZATION_GRAPH_LEVEL]
                                   [--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH]
                                   [--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS]
                                   [--preprocessor.language PREPROCESSOR.LANGUAGE]
                                   [--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH]
                                   [--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH]
                                   [--preprocessor.mapping PREPROCESSOR.MAPPING]
                                   [--preprocessor.tolower PREPROCESSOR.TOLOWER]
                                   [--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE]
                                   [--preprocessor.enable_emphasis_tag PREPROCESSOR.ENABLE_EMPHASIS_TAG]
                                   [--preprocessor.start_of_emphasis_token PREPROCESSOR.START_OF_EMPHASIS_TOKEN]
                                   [--preprocessor.end_of_emphasis_token PREPROCESSOR.END_OF_EMPHASIS_TOKEN]
                                   [--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE]
                                   [--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE]
                                   [--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE]
                                   [--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE]
                                   [--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE]
                                   [--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING]
                                   [--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT]
                                   [--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--encoderFastPitch.optimization_graph_level ENCODERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL]
                                   [--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE]
                                   [--encoderFastPitch.use_onnx_runtime]
                                   [--encoderFastPitch.use_torchscript]
                                   [--encoderFastPitch.use_trt_fp32]
                                   [--encoderFastPitch.fp16_needs_obey_precision_pass]
                                   [--encoderRadTTS.max_sequence_idle_microseconds ENCODERRADTTS.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--encoderRadTTS.max_batch_size ENCODERRADTTS.MAX_BATCH_SIZE]
                                   [--encoderRadTTS.min_batch_size ENCODERRADTTS.MIN_BATCH_SIZE]
                                   [--encoderRadTTS.opt_batch_size ENCODERRADTTS.OPT_BATCH_SIZE]
                                   [--encoderRadTTS.preferred_batch_size ENCODERRADTTS.PREFERRED_BATCH_SIZE]
                                   [--encoderRadTTS.batching_type ENCODERRADTTS.BATCHING_TYPE]
                                   [--encoderRadTTS.preserve_ordering ENCODERRADTTS.PRESERVE_ORDERING]
                                   [--encoderRadTTS.instance_group_count ENCODERRADTTS.INSTANCE_GROUP_COUNT]
                                   [--encoderRadTTS.max_queue_delay_microseconds ENCODERRADTTS.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--encoderRadTTS.optimization_graph_level ENCODERRADTTS.OPTIMIZATION_GRAPH_LEVEL]
                                   [--encoderRadTTS.trt_max_workspace_size ENCODERRADTTS.TRT_MAX_WORKSPACE_SIZE]
                                   [--encoderRadTTS.use_onnx_runtime]
                                   [--encoderRadTTS.use_torchscript]
                                   [--encoderRadTTS.use_trt_fp32]
                                   [--encoderRadTTS.fp16_needs_obey_precision_pass]
                                   [--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE]
                                   [--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE]
                                   [--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE]
                                   [--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE]
                                   [--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE]
                                   [--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING]
                                   [--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT]
                                   [--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--chunkerFastPitch.optimization_graph_level CHUNKERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL]
                                   [--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE]
                                   [--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE]
                                   [--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE]
                                   [--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE]
                                   [--hifigan.batching_type HIFIGAN.BATCHING_TYPE]
                                   [--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING]
                                   [--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT]
                                   [--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--hifigan.optimization_graph_level HIFIGAN.OPTIMIZATION_GRAPH_LEVEL]
                                   [--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE]
                                   [--hifigan.use_onnx_runtime]
                                   [--hifigan.use_torchscript]
                                   [--hifigan.use_trt_fp32]
                                   [--hifigan.fp16_needs_obey_precision_pass]
                                   output_path source_path [source_path ...]

Generate a Riva Model from a speech_synthesis model trained with NVIDIA NeMo.

positional arguments:
  output_path           Location to write compiled Riva pipeline
  source_path           Source file(s)

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Overwrite existing artifacts if they exist
  -v, --verbose         Verbose log outputs
  --language_code LANGUAGE_CODE
                        Language of the model
  --max_batch_size MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --voice_name VOICE_NAME
                        Set the voice name for speech synthesis
  --num_speakers NUM_SPEAKERS
                        Number of unqiue speakers.
  --subvoices SUBVOICES
                        Comma-seprated list of subvoices (no whitespace).
  --sample_rate SAMPLE_RATE
                        Sample rate of the output signal
  --chunk_length CHUNK_LENGTH
                        Chunk length in mel frames to synthesize at one time
  --overlap_length OVERLAP_LENGTH
                        Chunk length in mel frames to overlap neighboring
                        chunks
  --num_mels NUM_MELS   number of mels
  --num_samples_per_frame NUM_SAMPLES_PER_FRAME
                        number of samples per frame
  --abbreviations_file ABBREVIATIONS_FILE
                        Path to file with list of abbreviations and
                        corresponding expansions
  --has_mapping_file HAS_MAPPING_FILE
  --mapping_file MAPPING_FILE
                        Path to phoneme mapping file
  --wfst_tokenizer_model WFST_TOKENIZER_MODEL
                        Sparrowhawk model to use for tokenization and
                        classification, must be in .far format
  --wfst_verbalizer_model WFST_VERBALIZER_MODEL
                        Sparrowhawk model to use for verbalizer, must be in
                        .far format.
  --arpabet_file ARPABET_FILE
                        Path to pronunciation dictionary (deprecated)
  --phone_dictionary_file PHONE_DICTIONARY_FILE
                        Path to pronunciation dictionary
  --phone_set PHONE_SET
                        Phonetic set that the model was trained on. An unset
                        value will attempt to auto-detect the phone set used
                        during training. Supports either "arpabet", "ipa",
                        "none".
  --upper_case_chars UPPER_CASE_CHARS
                        Whether character representations for this model are
                        upper case or lower case.
  --upper_case_g2p UPPER_CASE_G2P
                        Whether character representations for this model are
                        upper case or lower case.

postprocessor:
  --postprocessor.max_sequence_idle_microseconds POSTPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --postprocessor.max_batch_size POSTPROCESSOR.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --postprocessor.min_batch_size POSTPROCESSOR.MIN_BATCH_SIZE
  --postprocessor.opt_batch_size POSTPROCESSOR.OPT_BATCH_SIZE
  --postprocessor.preferred_batch_size POSTPROCESSOR.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --postprocessor.batching_type POSTPROCESSOR.BATCHING_TYPE
  --postprocessor.preserve_ordering POSTPROCESSOR.PRESERVE_ORDERING
                        Preserve ordering
  --postprocessor.instance_group_count POSTPROCESSOR.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --postprocessor.max_queue_delay_microseconds POSTPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
                        max queue delta in microseconds
  --postprocessor.optimization_graph_level POSTPROCESSOR.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --postprocessor.fade_length POSTPROCESSOR.FADE_LENGTH
                        Cross fade length in samples used in between audio
                        chunks

preprocessor:
  --preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE
  --preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE
  --preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE
  --preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING
                        Preserve ordering
  --preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
                        max queue delta in microseconds
  --preprocessor.optimization_graph_level PREPROCESSOR.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH
  --preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS
  --preprocessor.language PREPROCESSOR.LANGUAGE
  --preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH
                        maximum length of every emitted sequence
  --preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH
                        maximum length of input string
  --preprocessor.mapping PREPROCESSOR.MAPPING
  --preprocessor.tolower PREPROCESSOR.TOLOWER
  --preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE
  --preprocessor.enable_emphasis_tag PREPROCESSOR.ENABLE_EMPHASIS_TAG
                        Boolean flag that controls if the emphasis tag should
                        be parsed or not during pre-processing
  --preprocessor.start_of_emphasis_token PREPROCESSOR.START_OF_EMPHASIS_TOKEN
                        field to indicate start of emphasis in the given text
  --preprocessor.end_of_emphasis_token PREPROCESSOR.END_OF_EMPHASIS_TOKEN
                        field to indicate end of emphasis in the given text

encoderFastPitch:
  --encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE
  --encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE
  --encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE
  --encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING
                        Preserve ordering
  --encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --encoderFastPitch.optimization_graph_level ENCODERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in Mb) to use for model export
                        to TensorRT
  --encoderFastPitch.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --encoderFastPitch.use_torchscript
                        Use TorchScript instead of TensorRT
  --encoderFastPitch.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --encoderFastPitch.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network

encoderRadTTS:
  --encoderRadTTS.max_sequence_idle_microseconds ENCODERRADTTS.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --encoderRadTTS.max_batch_size ENCODERRADTTS.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --encoderRadTTS.min_batch_size ENCODERRADTTS.MIN_BATCH_SIZE
  --encoderRadTTS.opt_batch_size ENCODERRADTTS.OPT_BATCH_SIZE
  --encoderRadTTS.preferred_batch_size ENCODERRADTTS.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --encoderRadTTS.batching_type ENCODERRADTTS.BATCHING_TYPE
  --encoderRadTTS.preserve_ordering ENCODERRADTTS.PRESERVE_ORDERING
                        Preserve ordering
  --encoderRadTTS.instance_group_count ENCODERRADTTS.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --encoderRadTTS.max_queue_delay_microseconds ENCODERRADTTS.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --encoderRadTTS.optimization_graph_level ENCODERRADTTS.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --encoderRadTTS.trt_max_workspace_size ENCODERRADTTS.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in Mb) to use for model export
                        to TensorRT
  --encoderRadTTS.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --encoderRadTTS.use_torchscript
                        Use TorchScript instead of TensorRT
  --encoderRadTTS.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --encoderRadTTS.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network

chunkerFastPitch:
  --chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE
  --chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE
  --chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE
  --chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING
                        Preserve ordering
  --chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --chunkerFastPitch.optimization_graph_level CHUNKERFASTPITCH.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration

hifigan:
  --hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE
  --hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE
  --hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --hifigan.batching_type HIFIGAN.BATCHING_TYPE
  --hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING
                        Preserve ordering
  --hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --hifigan.optimization_graph_level HIFIGAN.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in Mb) to use for model export
                        to TensorRT
  --hifigan.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --hifigan.use_torchscript
                        Use TorchScript instead of TensorRT
  --hifigan.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --hifigan.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network