Custom Models#

Model Deployment#

Like all Riva models, deploying Riva TTS models requires the following steps:

  1. Create .riva files for each model from a .nemo file as outlined in the NeMo section.

  2. Create .rmir files for each Riva Speech AI Skill (for example, ASR, NLP, and TTS) using riva-build.

  3. Create model directories using riva-deploy.

  4. Deploy the model directory using riva_server.

The following sections provide examples for steps 1 and 2 as outlined above. For steps 3 and 4, refer to Using riva-deploy and Riva Speech Container (Advanced).
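Steps 3 and 4 can be sketched roughly as follows; the paths, key, and model repository location here are placeholders, and the actual invocation details are covered in the pages referenced above.

```shell
# Hypothetical paths and key; adjust to your environment.
RMIR=/servicemaker-dev/tts.rmir
KEY=tlt_encode
MODEL_REPO=/data/models

# Step 3: expand the .rmir archive into a Triton model repository.
riva-deploy "$RMIR:$KEY" "$MODEL_REPO"

# Step 4: start the Riva server against that repository
# (typically done via the quick start riva_start.sh script).
bash riva_start.sh
```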

Creating Riva Files#

Riva files can be created from .nemo files. As mentioned in the NeMo section, the generation of .riva files from .nemo files must be done on a Linux x86_64 workstation.

The following is an example of how a HiFi-GAN model can be converted to a .riva file from a .nemo file.

  1. Download the .nemo file from NGC onto the host system.

  2. Run the NeMo container, sharing the .nemo file with the container by using the -v option.

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
    -v $(pwd):/NeMo \
    --shm-size=8g \
    -p 8888:8888 \
    -p 6006:6006 \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --device=/dev/snd \
    nvcr.io/nvidia/nemo:22.08
  3. After the container has launched, use nemo2riva to convert the .nemo file to a .riva file.

pip3 install nvidia-pyindex
ngc registry resource download-version "nvidia/riva/riva_quickstart:2.24.0"
pip3 install "riva_quickstart_v2.24.0/nemo2riva-2.24.0-py3-none-any.whl"
nemo2riva --key encryption_key --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo

Repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. When performing the steps above, ensure that you use the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva wheel version.
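The conversion can be repeated for both models in one pass. A minimal sketch, assuming the FastPitch checkpoint has been downloaded as tts_fastpitch.nemo alongside tts_hifigan.nemo and both are mounted at /NeMo:

```shell
# Convert each checkpoint (the filenames here are assumptions);
# ${NEMO#tts_} strips the "tts_" prefix for the output name.
for NEMO in tts_fastpitch tts_hifigan; do
    nemo2riva --key encryption_key \
        --out "/NeMo/${NEMO#tts_}.riva" "/NeMo/${NEMO}.nemo"
done
```

This produces fastpitch.riva and hifigan.riva in the shared /NeMo directory.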

Customization#

After creating the .riva file and prior to running riva-build, there are a few customization options that can be adjusted. These are optional; if you are only interested in building the default Riva pipeline, skip ahead to Riva-build Pipeline Instructions.

Custom Pronunciations#

Speech synthesis models deployed in Riva are configured with a language-specific pronunciation dictionary mapping a large vocabulary of words from their written form, graphemes, to a sequence of perceptually distinct sounds, phonemes. In cases where pronunciation is ambiguous, for example with heteronyms like bass (the fish) and bass (the musical instrument), the dictionary is ignored and the synthesis model uses context clues from the sentence to predict an appropriate pronunciation.

Modern speech synthesis algorithms are surprisingly capable of accurately predicting pronunciations of new and novel words. Sometimes, however, it is desirable or necessary to provide extra context to the model.

While custom pronunciations can be supplied at request time using SSML, request-time overrides are best suited for one-off adjustments. For domain-specific terms with fixed pronunciations, configure Riva with these pronunciations when deploying the server.

There are several key parameters that can be configured through riva-build, or in the preprocessor configuration, that affect the phoneme path:

  • --phone_dictionary_file is the path to the pronunciation dictionary. To start with, leave this parameter empty. If the .riva file was created from a .nemo model that contained a dictionary artifact, and this argument is not set, Riva uses the NeMo dictionary file that the model was trained with. To add custom entries or modify pronunciations, modify the NeMo dictionary artifact, save it to another file, and pass that file path to riva-build with this argument.

  • --preprocessor.g2p_ignore_ambiguous If True, words that have more than one phonetic representation in the pronunciation dictionary, such as “read”, are not converted to phonemes. Defaults to True.

  • --upper_case_chars should be set to True if ipa is used. This affects grapheme inputs, as the ipa phone set includes lower-cased English characters.

  • --phone_set can be used to specify whether the model was trained with arpabet or ipa. If this flag is not set, Riva attempts to auto-detect the correct phone set.

Note

--arpabet_file is deprecated as of Riva 2.8.0 and replaced by --phone_dictionary_file.

Note

Riva supports both arpabet and ipa, depending on which phone set the acoustic model was trained with. For more information on ARPABET, refer to the ARPABET Wikipedia page; for IPA, refer to the TTS Phoneme Support page.

To determine the appropriate phoneme sequence, use the SSML API to experiment with phone sequences and evaluate the quality. Once the mapping sounds correct, add the discovered mapping to a new line in the dictionary.
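ARPABET dictionaries follow the CMUdict convention: an upper-case word, two spaces, then space-separated phones. A sketch of adding an entry (the word, phones, and file name are illustrative):

```shell
# Start from a copy of the NeMo dictionary artifact saved as custom_dict.txt,
# then append a custom entry: upper-case word, two spaces, ARPABET phones.
DICT=custom_dict.txt
printf 'GPU  JH IY P IY Y UW\n' >> "$DICT"

# Verify the entry landed in the dictionary.
grep '^GPU' "$DICT"
```

The resulting file is then passed to riva-build via --phone_dictionary_file.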

Multi-Speaker Models#

Riva supports models with multiple speakers.

To enable this feature, specify the following parameters before building the model.

  • --voice_name is the name of the model. Defaults to English-US.Female-1.

  • --subvoices is a comma-separated list of names for each subvoice, with length equal to the number of subvoices specified in the FastPitch model. For example, for a model with a “male” subvoice in the 0th speaker embedding and a “female” subvoice in the 1st, include the option --subvoices=Male:0,Female:1. If not provided, the desired embedding can be requested by integer index.

The voice name and subvoices are stored in the generated .rmir file and carried into the generated Triton repositories. During inference, select a subvoice by appending a period and a valid subvoice name to the voice name in the request, for example, <voice_name>.<subvoice>.
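The request-time voice name is simply the base voice name and subvoice joined with a period; a sketch assuming the English-US example above:

```shell
# Compose the request-time voice name from base voice and subvoice.
VOICE_NAME="English-US"
SUBVOICE="Female-1"
REQUEST_VOICE="${VOICE_NAME}.${SUBVOICE}"
echo "$REQUEST_VOICE"
```

A synthesis request would then pass English-US.Female-1 as its voice name.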

Custom Voice#

Riva is voice agnostic and can run with any English-US TTS voice. To train a custom voice model, data must first be collected. We recommend at least 30 minutes of high-quality data. For collecting the data, refer to the Riva custom voice recorder. After the data has been collected, the FastPitch and HiFi-GAN models need to be fine-tuned on this dataset. Refer to the Riva fine-tuning tutorial for how to train these models. A Riva pipeline using these models can then be built according to the instructions on this page.

Custom Text Normalization#

Riva supports custom text normalization rules built with NeMo’s WFST text normalization (TN) tool. For details on customizing TN, refer to the NeMo WFST tutorial. After the WFST has been customized, deploy it with NeMo’s export_grammar script; refer to the documentation for more information. This produces two files, tokenize_and_classify.far and verbalize.far, which are passed to the riva-build step using the --wfst_tokenizer_model and --wfst_verbalizer_model arguments. Additionally, riva-build supports the --wfst_pre_process_model and --wfst_post_process_model arguments to pass pre- and post-processing FAR files for text normalization.

Riva-build Pipeline Instructions#

FastPitch and HiFi-GAN#

Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the Riva container:

riva-build speech_synthesis \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
    /servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
    --voice_name=<pipeline_name> \
    --abbreviations_file=/servicemaker-dev/<abbr_file> \
    --arpabet_file=/servicemaker-dev/<dictionary_file> \
    --wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
    --wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
    --sample_rate=<sample_rate> \
    --subvoices=<subvoices>

Where:

  • <rmir_filename> is the name of the RMIR file to be generated

  • <encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models uploaded to NGC is tlt_encode, unless otherwise specified for a specific model in the list of pretrained quick start pipelines.

  • <pipeline_name> is an optional user-defined name for the components in the model repository

  • <fastpitch_riva_filename> is the name of the riva file for FastPitch

  • <hifigan_riva_filename> is the name of the riva file for HiFi-GAN

  • <abbr_file> is the name of the file containing abbreviations and their corresponding expansions

  • <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET

  • <voice_name> is the name of the model, set with the --voice_name argument

  • <subvoices> is a comma-separated list of names for each subvoice. Defaults to naming by integer index. This is needed for, and only used by, multi-speaker models.

  • <wfst_tokenizer_model> is the location of the tokenize_and_classify.far file generated by running the NeMo Text Processing export_grammar.sh script

  • <wfst_verbalizer_model> is the location of the verbalize.far file generated by running the NeMo Text Processing export_grammar.sh script

  • <sample_rate> is the sample rate of audio that the models were trained on

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR and .riva filenames; otherwise, it can be omitted.

For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the Riva-build Optional Parameters section and set the various min_batch_size, max_batch_size, and opt_batch_size parameters to 1 while executing the riva-build command.
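For example, the batch-size parameters might be pinned to 1 like this; the preprocessor-prefixed flags are taken from the riva-build usage statement, and other components accept the same pattern.

```shell
riva-build speech_synthesis \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
    /servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
    --max_batch_size=1 \
    --preprocessor.max_batch_size=1 \
    --preprocessor.min_batch_size=1 \
    --preprocessor.opt_batch_size=1
```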

Pretrained Quick Start Pipelines#

Pipeline

riva-build command

FastPitch + HiFi-GAN IPA (en-US Multi-Speaker)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=en-US \
    --num_speakers=12 \
    --phone_set=ipa \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name English-US \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --upper_case_chars=True \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']' \
    --abbreviations_file=<txt_abbreviations_file>

FastPitch + HiFi-GAN IPA (zh-CN Multi-Speaker)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=zh-CN \
    --num_speakers=10 \
    --phone_set=ipa \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name Mandarin-CN \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Male-Sad:15 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --wfst_post_process_model=<far_post_process_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']'

FastPitch + HiFi-GAN IPA (es-ES Female)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:BSzv7YAjcH4nJS \
    <riva_hifigan_file>:BSzv7YAjcH4nJS \
    --language_code=es-ES \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Spanish-ES-Female-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (es-ES Male)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:PPihyG3Moru5in \
    <riva_hifigan_file>:PPihyG3Moru5in \
    --language_code=es-ES \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Spanish-ES-Male-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (es-US Multi-Speaker)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --language_code=es-US \
    --num_speakers=12 \
    --phone_set=ipa \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name Spanish-US \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Male-Fearful:11,Male-Happy:13,Female-Sad:14,Male-Sad:15 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']'

FastPitch + HiFi-GAN IPA (it-IT Female)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:R62srgxeXBgVxg \
    <riva_hifigan_file>:R62srgxeXBgVxg \
    --language_code=it-IT \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Italian-IT-Female-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (it-IT Male)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:dVRvg47ZqCdQrR \
    <riva_hifigan_file>:dVRvg47ZqCdQrR \
    --language_code=it-IT \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name Italian-IT-Male-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN IPA (de-DE Male)

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:ZzZjce65zzGZ9o \
    <riva_hifigan_file>:ZzZjce65zzGZ9o \
    --language_code=de-DE \
    --phone_dictionary_file=<dict_file> \
    --sample_rate 22050 \
    --voice_name German-DE-Male-1 \
    --phone_set=ipa \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

RadTTS + HiFi-GAN IPA

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_radtts_file>:<key> \
    <riva_hifigan_file>:<key> \
    --num_speakers=12 \
    --phone_dictionary_file=<txt_phone_dictionary_file> \
    --sample_rate 44100 \
    --voice_name English-US-RadTTS \
    --subvoices Female-1:0,Male-1:1,Female-Neutral:2,Male-Neutral:3,Female-Angry:4,Male-Angry:5,Female-Calm:6,Male-Calm:7,Female-Fearful:10,Female-Happy:12,Male-Happy:13,Female-Sad:14 \
    --phone_set=ipa \
    --upper_case_chars=True \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']' \
    --abbreviations_file=<txt_abbreviations_file>

FastPitch + HiFi-GAN ARPABET

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --arpabet_file=cmudict-0.7b_nv22.08 \
    --sample_rate 44100 \
    --voice_name English-US \
    --subvoices Male-1:0,Female-1:1 \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --preprocessor.enable_emphasis_tag=True \
    --preprocessor.start_of_emphasis_token='[' \
    --preprocessor.end_of_emphasis_token=']' \
    --abbreviations_file=<txt_file>

FastPitch + HiFi-GAN LJSpeech

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <riva_fastpitch_file>:<key> \
    <riva_hifigan_file>:<key> \
    --arpabet_file=cmudict-0.7b_nv22.08 \
    --voice_name ljspeech \
    --wfst_tokenizer_model=<far_tokenizer_file> \
    --wfst_verbalizer_model=<far_verbalizer_file> \
    --abbreviations_file=<txt_file>

All text normalization .far files are in NGC on the Riva TTS English Normalization Grammar page. All other auxiliary files that are not .riva files (such as pronunciation dictionaries) are in NGC on the Riva TTS English US Auxiliary Files page.

Riva-build Optional Parameters#

For details about the parameters passed to riva-build to customize the TTS pipeline, issue:

riva-build speech_synthesis -h

The following usage statement lists all optional parameters currently recognized by riva-build:

usage: riva-build speech_synthesis [-h] [-f] [-v]
                                   [--language_code LANGUAGE_CODE]
                                   [--instance_group_count INSTANCE_GROUP_COUNT]
                                   [--kind KIND]
                                   [--max_batch_size MAX_BATCH_SIZE]
                                   [--max_queue_delay_microseconds MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--batching_type BATCHING_TYPE]
                                   [--voice_name VOICE_NAME]
                                   [--num_speakers NUM_SPEAKERS]
                                   [--subvoices SUBVOICES]
                                   [--sample_rate SAMPLE_RATE]
                                   [--chunk_length CHUNK_LENGTH]
                                   [--chunk_ms CHUNK_MS]
                                   [--overlap_length OVERLAP_LENGTH]
                                   [--num_mels NUM_MELS]
                                   [--num_samples_per_frame NUM_SAMPLES_PER_FRAME]
                                   [--abbreviations_file ABBREVIATIONS_FILE]
                                   [--has_mapping_file HAS_MAPPING_FILE]
                                   [--mapping_file MAPPING_FILE]
                                   [--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
                                   [--wfst_verbalizer_model WFST_VERBALIZER_MODEL]
                                   [--sparrowhawk_proto_files SPARROWHAWK_PROTO_FILES]
                                   [--verbalizer_proto_files VERBALIZER_PROTO_FILES]
                                   [--tokenizer_proto_files TOKENIZER_PROTO_FILES]
                                   [--postprocessor_proto_files POSTPROCESSOR_PROTO_FILES]
                                   [--wfst_pre_process_model WFST_PRE_PROCESS_MODEL]
                                   [--wfst_post_process_model WFST_POST_PROCESS_MODEL]
                                   [--transcript_decoder_layers TRANSCRIPT_DECODER_LAYERS]
                                   [--context_decoder_layers CONTEXT_DECODER_LAYERS]
                                   [--arpabet_file ARPABET_FILE]
                                   [--phone_dictionary_file PHONE_DICTIONARY_FILE]
                                   [--phone_set PHONE_SET]
                                   [--upper_case_chars UPPER_CASE_CHARS]
                                   [--upper_case_g2p UPPER_CASE_G2P]
                                   [--mel_basis_file_path MEL_BASIS_FILE_PATH]
                                   [--voice_map_file VOICE_MAP_FILE]
                                   [--history_future HISTORY_FUTURE]
                                   [--multilingual MULTILINGUAL]
                                   [--language_archive_path LANGUAGE_ARCHIVE_PATH]
                                   [--multi_char_tokenizer_offset MULTI_CHAR_TOKENIZER_OFFSET]
                                   [--context_embedding_path CONTEXT_EMBEDDING_PATH]
                                   [--n_timesteps N_TIMESTEPS]
                                   [--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE]
                                   [--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE]
                                   [--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE]
                                   [--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE]
                                   [--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE]
                                   [--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING]
                                   [--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT]
                                   [--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--preprocessor.optimization_graph_level PREPROCESSOR.OPTIMIZATION_GRAPH_LEVEL]
                                   [--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH]
                                   [--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS]
                                   [--preprocessor.language PREPROCESSOR.LANGUAGE]
                                   [--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH]
                                   [--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH]
                                   [--preprocessor.mapping PREPROCESSOR.MAPPING]
                                   [--preprocessor.tolower PREPROCESSOR.TOLOWER]
                                   [--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE]
                                   [--preprocessor.enable_emphasis_tag PREPROCESSOR.ENABLE_EMPHASIS_TAG]
                                   [--preprocessor.start_of_emphasis_token PREPROCESSOR.START_OF_EMPHASIS_TOKEN]
                                   [--preprocessor.end_of_emphasis_token PREPROCESSOR.END_OF_EMPHASIS_TOKEN]
                                   [--preprocessor.bos_token PREPROCESSOR.BOS_TOKEN]
                                   [--preprocessor.eos_token PREPROCESSOR.EOS_TOKEN]
                                   [--magpie_tts.max_sequence_idle_microseconds MAGPIE_TTS.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--magpie_tts.max_batch_size MAGPIE_TTS.MAX_BATCH_SIZE]
                                   [--magpie_tts.min_batch_size MAGPIE_TTS.MIN_BATCH_SIZE]
                                   [--magpie_tts.opt_batch_size MAGPIE_TTS.OPT_BATCH_SIZE]
                                   [--magpie_tts.preferred_batch_size MAGPIE_TTS.PREFERRED_BATCH_SIZE]
                                   [--magpie_tts.batching_type MAGPIE_TTS.BATCHING_TYPE]
                                   [--magpie_tts.preserve_ordering MAGPIE_TTS.PRESERVE_ORDERING]
                                   [--magpie_tts.instance_group_count MAGPIE_TTS.INSTANCE_GROUP_COUNT]
                                   [--magpie_tts.max_queue_delay_microseconds MAGPIE_TTS.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--magpie_tts.optimization_graph_level MAGPIE_TTS.OPTIMIZATION_GRAPH_LEVEL]
                                   [--magpie_tts.chunk_ms MAGPIE_TTS.CHUNK_MS]
                                   [--magpie_tts.history_future MAGPIE_TTS.HISTORY_FUTURE]
                                   [--magpie_tts.future_len MAGPIE_TTS.FUTURE_LEN]
                                   [--magpie_tts.history_len MAGPIE_TTS.HISTORY_LEN]
                                   [--magpie_tts.fade_ms MAGPIE_TTS.FADE_MS]
                                   [--magpie_tts.max_decoder_steps MAGPIE_TTS.MAX_DECODER_STEPS]
                                   [--magpie_tts.bos_id MAGPIE_TTS.BOS_ID]
                                   [--magpie_tts.eos_id MAGPIE_TTS.EOS_ID]
                                   [--magpie_tts.audio_bos_id MAGPIE_TTS.AUDIO_BOS_ID]
                                   [--magpie_tts.audio_eos_id MAGPIE_TTS.AUDIO_EOS_ID]
                                   [--magpie_tts.context_audio_bos_id MAGPIE_TTS.CONTEXT_AUDIO_BOS_ID]
                                   [--magpie_tts.context_audio_eos_id MAGPIE_TTS.CONTEXT_AUDIO_EOS_ID]
                                   [--magpie_tts.codec_downsampling_rate MAGPIE_TTS.CODEC_DOWNSAMPLING_RATE]
                                   [--magpie_tts.temperature MAGPIE_TTS.TEMPERATURE]
                                   [--magpie_tts.cfg_scale MAGPIE_TTS.CFG_SCALE]
                                   [--magpie_tts.context_decoder_layers MAGPIE_TTS.CONTEXT_DECODER_LAYERS]
                                   [--magpie_tts.transcript_decoder_layers MAGPIE_TTS.TRANSCRIPT_DECODER_LAYERS]
                                   [--magpie_tts.apply_attention_prior MAGPIE_TTS.APPLY_ATTENTION_PRIOR]
                                   [--magpie_tts.attention_prior_epsilon MAGPIE_TTS.ATTENTION_PRIOR_EPSILON]
                                   [--magpie_tts.attention_prior_lookahead_window MAGPIE_TTS.ATTENTION_PRIOR_LOOKAHEAD_WINDOW]
                                   [--magpie_tts.estimate_alignment_from_layers MAGPIE_TTS.ESTIMATE_ALIGNMENT_FROM_LAYERS]
                                   [--magpie_tts.apply_prior_to_layers MAGPIE_TTS.APPLY_PRIOR_TO_LAYERS]
                                   [--magpie_tts.attention_prior_window_length_right MAGPIE_TTS.ATTENTION_PRIOR_WINDOW_LENGTH_RIGHT]
                                   [--magpie_tts.attention_prior_window_length_left MAGPIE_TTS.ATTENTION_PRIOR_WINDOW_LENGTH_LEFT]
                                   [--magpie_tts.start_prior_after_n_audio_steps MAGPIE_TTS.START_PRIOR_AFTER_N_AUDIO_STEPS]
                                   [--magpie_tts.top_k MAGPIE_TTS.TOP_K]
                                   [--magpie_tts.max_context_len MAGPIE_TTS.MAX_CONTEXT_LEN]
                                   [--magpie_tts.sample_rate MAGPIE_TTS.SAMPLE_RATE]
                                   [--audio_codec.max_sequence_idle_microseconds AUDIO_CODEC.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--audio_codec.max_batch_size AUDIO_CODEC.MAX_BATCH_SIZE]
                                   [--audio_codec.min_batch_size AUDIO_CODEC.MIN_BATCH_SIZE]
                                   [--audio_codec.opt_batch_size AUDIO_CODEC.OPT_BATCH_SIZE]
                                   [--audio_codec.preferred_batch_size AUDIO_CODEC.PREFERRED_BATCH_SIZE]
                                   [--audio_codec.batching_type AUDIO_CODEC.BATCHING_TYPE]
                                   [--audio_codec.preserve_ordering AUDIO_CODEC.PRESERVE_ORDERING]
                                   [--audio_codec.instance_group_count AUDIO_CODEC.INSTANCE_GROUP_COUNT]
                                   [--audio_codec.max_queue_delay_microseconds AUDIO_CODEC.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--audio_codec.optimization_graph_level AUDIO_CODEC.OPTIMIZATION_GRAPH_LEVEL]
                                   [--audio_codec.chunk_ms AUDIO_CODEC.CHUNK_MS]
                                   [--audio_codec.history_future AUDIO_CODEC.HISTORY_FUTURE]
                                   [--audio_codec.future_len AUDIO_CODEC.FUTURE_LEN]
                                   [--audio_codec.history_len AUDIO_CODEC.HISTORY_LEN]
                                   [--audio_codec.fade_ms AUDIO_CODEC.FADE_MS]
                                   [--audio_codec.audio_bos_id AUDIO_CODEC.AUDIO_BOS_ID]
                                   [--audio_codec.audio_eos_id AUDIO_CODEC.AUDIO_EOS_ID]
                                   [--audio_codec.codec_downsampling_rate AUDIO_CODEC.CODEC_DOWNSAMPLING_RATE]
                                   [--audio_codec.sample_rate AUDIO_CODEC.SAMPLE_RATE]
                                   [--audio_codec_encoder.max_sequence_idle_microseconds AUDIO_CODEC_ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--audio_codec_encoder.max_batch_size AUDIO_CODEC_ENCODER.MAX_BATCH_SIZE]
                                   [--audio_codec_encoder.min_batch_size AUDIO_CODEC_ENCODER.MIN_BATCH_SIZE]
                                   [--audio_codec_encoder.opt_batch_size AUDIO_CODEC_ENCODER.OPT_BATCH_SIZE]
                                   [--audio_codec_encoder.preferred_batch_size AUDIO_CODEC_ENCODER.PREFERRED_BATCH_SIZE]
                                   [--audio_codec_encoder.batching_type AUDIO_CODEC_ENCODER.BATCHING_TYPE]
                                   [--audio_codec_encoder.preserve_ordering AUDIO_CODEC_ENCODER.PRESERVE_ORDERING]
                                   [--audio_codec_encoder.instance_group_count AUDIO_CODEC_ENCODER.INSTANCE_GROUP_COUNT]
                                   [--audio_codec_encoder.max_queue_delay_microseconds AUDIO_CODEC_ENCODER.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--audio_codec_encoder.optimization_graph_level AUDIO_CODEC_ENCODER.OPTIMIZATION_GRAPH_LEVEL]
                                   [--audio_codec_encoder.trt_max_workspace_size AUDIO_CODEC_ENCODER.TRT_MAX_WORKSPACE_SIZE]
                                   [--audio_codec_encoder.use_onnx_runtime AUDIO_CODEC_ENCODER.USE_ONNX_RUNTIME]
                                   [--audio_codec_encoder.use_torchscript]
                                   [--audio_codec_encoder.use_trt_fp32]
                                   [--audio_codec_encoder.use_trt_fp8]
                                   [--audio_codec_encoder.fp16_needs_obey_precision_pass]
                                   [--neuralg2p.max_sequence_idle_microseconds NEURALG2P.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--neuralg2p.max_batch_size NEURALG2P.MAX_BATCH_SIZE]
                                   [--neuralg2p.min_batch_size NEURALG2P.MIN_BATCH_SIZE]
                                   [--neuralg2p.opt_batch_size NEURALG2P.OPT_BATCH_SIZE]
                                   [--neuralg2p.preferred_batch_size NEURALG2P.PREFERRED_BATCH_SIZE]
                                   [--neuralg2p.batching_type NEURALG2P.BATCHING_TYPE]
                                   [--neuralg2p.preserve_ordering NEURALG2P.PRESERVE_ORDERING]
                                   [--neuralg2p.instance_group_count NEURALG2P.INSTANCE_GROUP_COUNT]
                                   [--neuralg2p.max_queue_delay_microseconds NEURALG2P.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--neuralg2p.optimization_graph_level NEURALG2P.OPTIMIZATION_GRAPH_LEVEL]
                                   [--neuralg2p.trt_max_workspace_size NEURALG2P.TRT_MAX_WORKSPACE_SIZE]
                                   [--neuralg2p.use_onnx_runtime]
                                   [--neuralg2p.use_torchscript]
                                   [--neuralg2p.use_trt_fp32]
                                   [--neuralg2p.use_trt_fp8]
                                   [--neuralg2p.fp16_needs_obey_precision_pass]
                                   [--tts_generator.max_sequence_idle_microseconds TTS_GENERATOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--tts_generator.max_batch_size TTS_GENERATOR.MAX_BATCH_SIZE]
                                   [--tts_generator.min_batch_size TTS_GENERATOR.MIN_BATCH_SIZE]
                                   [--tts_generator.opt_batch_size TTS_GENERATOR.OPT_BATCH_SIZE]
                                   [--tts_generator.preferred_batch_size TTS_GENERATOR.PREFERRED_BATCH_SIZE]
                                   [--tts_generator.batching_type TTS_GENERATOR.BATCHING_TYPE]
                                   [--tts_generator.preserve_ordering TTS_GENERATOR.PRESERVE_ORDERING]
                                   [--tts_generator.instance_group_count TTS_GENERATOR.INSTANCE_GROUP_COUNT]
                                   [--tts_generator.max_queue_delay_microseconds TTS_GENERATOR.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--tts_generator.optimization_graph_level TTS_GENERATOR.OPTIMIZATION_GRAPH_LEVEL]
                                   [--tts_generator.trt_max_workspace_size TTS_GENERATOR.TRT_MAX_WORKSPACE_SIZE]
                                   [--tts_generator.use_onnx_runtime]
                                   [--tts_generator.use_torchscript]
                                   [--tts_generator.use_trt_fp32]
                                   [--tts_generator.use_trt_fp8]
                                   [--tts_generator.fp16_needs_obey_precision_pass]
                                   [--tts_magpie_flow_rate_limiter.max_sequence_idle_microseconds TTS_MAGPIE_FLOW_RATE_LIMITER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--tts_magpie_flow_rate_limiter.max_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.MAX_BATCH_SIZE]
                                   [--tts_magpie_flow_rate_limiter.min_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.MIN_BATCH_SIZE]
                                   [--tts_magpie_flow_rate_limiter.opt_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.OPT_BATCH_SIZE]
                                   [--tts_magpie_flow_rate_limiter.preferred_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.PREFERRED_BATCH_SIZE]
                                   [--tts_magpie_flow_rate_limiter.batching_type TTS_MAGPIE_FLOW_RATE_LIMITER.BATCHING_TYPE]
                                   [--tts_magpie_flow_rate_limiter.preserve_ordering TTS_MAGPIE_FLOW_RATE_LIMITER.PRESERVE_ORDERING]
                                   [--tts_magpie_flow_rate_limiter.instance_group_count TTS_MAGPIE_FLOW_RATE_LIMITER.INSTANCE_GROUP_COUNT]
                                   [--tts_magpie_flow_rate_limiter.max_queue_delay_microseconds TTS_MAGPIE_FLOW_RATE_LIMITER.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--tts_magpie_flow_rate_limiter.optimization_graph_level TTS_MAGPIE_FLOW_RATE_LIMITER.OPTIMIZATION_GRAPH_LEVEL]
                                   [--bigvgan.max_sequence_idle_microseconds BIGVGAN.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                   [--bigvgan.max_batch_size BIGVGAN.MAX_BATCH_SIZE]
                                   [--bigvgan.min_batch_size BIGVGAN.MIN_BATCH_SIZE]
                                   [--bigvgan.opt_batch_size BIGVGAN.OPT_BATCH_SIZE]
                                   [--bigvgan.preferred_batch_size BIGVGAN.PREFERRED_BATCH_SIZE]
                                   [--bigvgan.batching_type BIGVGAN.BATCHING_TYPE]
                                   [--bigvgan.preserve_ordering BIGVGAN.PRESERVE_ORDERING]
                                   [--bigvgan.instance_group_count BIGVGAN.INSTANCE_GROUP_COUNT]
                                   [--bigvgan.max_queue_delay_microseconds BIGVGAN.MAX_QUEUE_DELAY_MICROSECONDS]
                                   [--bigvgan.optimization_graph_level BIGVGAN.OPTIMIZATION_GRAPH_LEVEL]
                                   [--bigvgan.trt_max_workspace_size BIGVGAN.TRT_MAX_WORKSPACE_SIZE]
                                   [--bigvgan.use_onnx_runtime]
                                   [--bigvgan.use_torchscript]
                                   [--bigvgan.use_trt_fp32]
                                   [--bigvgan.use_trt_fp8]
                                   [--bigvgan.fp16_needs_obey_precision_pass]
                                   output_path source_path [source_path ...]

Generate a Riva Model from a speech_synthesis model trained with NVIDIA NeMo.

positional arguments:
  output_path           Location to write compiled Riva pipeline
  source_path           Source file(s)

options:
  -h, --help            show this help message and exit
  -f, --force           Overwrite existing artifacts if they exist
  -v, --verbose         Verbose log outputs
  --language_code LANGUAGE_CODE
                        Language code for the model (for multilingual models,
                        please provide comma-separated list of language codes)
  --instance_group_count INSTANCE_GROUP_COUNT
                        How many instances in a group
  --kind KIND           Backend runs on CPU or GPU
  --max_batch_size MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --max_queue_delay_microseconds MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --batching_type BATCHING_TYPE
  --voice_name VOICE_NAME
                        Set the voice name for speech synthesis
  --num_speakers NUM_SPEAKERS
                        Number of unique speakers.
  --subvoices SUBVOICES
                        Comma-separated list of subvoices (no whitespace).
  --sample_rate SAMPLE_RATE
                        Sample rate of the output signal
  --chunk_length CHUNK_LENGTH
                        Chunk length in mel frames to synthesize at one time
  --chunk_ms CHUNK_MS   For Magpie TTS only, chunk length (in ms) to
                        synthesize at a time.
  --overlap_length OVERLAP_LENGTH
                        Overlap length in mel frames to overlap neighboring
                        chunks
  --num_mels NUM_MELS   Number of mels
  --num_samples_per_frame NUM_SAMPLES_PER_FRAME
                        Number of samples per frame
  --abbreviations_file ABBREVIATIONS_FILE
                        Path to file with list of abbreviations and
                        corresponding expansions
  --has_mapping_file HAS_MAPPING_FILE
  --mapping_file MAPPING_FILE
                        Path to phoneme mapping file
  --wfst_tokenizer_model WFST_TOKENIZER_MODEL
                        Sparrowhawk model to use for tokenization and
                        classification, must be in .far format
  --wfst_verbalizer_model WFST_VERBALIZER_MODEL
                        Sparrowhawk model to use for verbalizer, must be in
                        .far format.
  --sparrowhawk_proto_files SPARROWHAWK_PROTO_FILES
                        Sparrowhawk proto files, must be in .ascii_proto
                        format.
  --verbalizer_proto_files VERBALIZER_PROTO_FILES
                        Verbalizer proto files, must be in .ascii_proto
                        format.
  --tokenizer_proto_files TOKENIZER_PROTO_FILES
                        Tokenizer proto files, must be in .ascii_proto format.
  --postprocessor_proto_files POSTPROCESSOR_PROTO_FILES
                        Postprocessor proto files, must be in .ascii_proto
                        format.
  --wfst_pre_process_model WFST_PRE_PROCESS_MODEL
                        Sparrowhawk model to use for pre-processing, must be
                        in .far format.
  --wfst_post_process_model WFST_POST_PROCESS_MODEL
                        Sparrowhawk model to use for post-processing, must be
                        in .far format.
  --transcript_decoder_layers TRANSCRIPT_DECODER_LAYERS
                        Decoder layers corresponding to text.
  --context_decoder_layers CONTEXT_DECODER_LAYERS
                        Decoder layers corresponding to context.
  --arpabet_file ARPABET_FILE
                        Path to pronunciation dictionary (deprecated)
  --phone_dictionary_file PHONE_DICTIONARY_FILE
                        Path to pronunciation dictionary
  --phone_set PHONE_SET
                        Phonetic set that the model was trained on. An unset
                        value will attempt to auto-detect the phone set used
                        during training. Supports "arpabet", "ipa", or "none".
  --upper_case_chars UPPER_CASE_CHARS
                        Whether character representations for this model are
                        upper case or lower case.
  --upper_case_g2p UPPER_CASE_G2P
                        Whether character representations for this model are
                        upper case or lower case.
  --mel_basis_file_path MEL_BASIS_FILE_PATH
                        Pre-calculated mel basis file for audio-to-mel
                        conversion
  --voice_map_file VOICE_MAP_FILE
                        Default voice name to filepath map
  --history_future HISTORY_FUTURE
                        Number of Codec Future/History frames
  --multilingual MULTILINGUAL
                        Whether the model is multilingual
  --language_archive_path LANGUAGE_ARCHIVE_PATH
                        Path to the language archive file
  --multi_char_tokenizer_offset MULTI_CHAR_TOKENIZER_OFFSET
                        Helper to offset the multilingual character mapping
                        ignored for other models
  --context_embedding_path CONTEXT_EMBEDDING_PATH
                        Path to the context embedding file
  --n_timesteps N_TIMESTEPS
                        Number of times the Magpie flow generator model should
                        run.

preprocessor:
  --preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE
  --preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE
  --preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE
  --preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING
                        Preserve ordering
  --preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum queue delay, in microseconds
  --preprocessor.optimization_graph_level PREPROCESSOR.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH
  --preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS
  --preprocessor.language PREPROCESSOR.LANGUAGE
  --preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH
                        Maximum length of each emitted sequence
  --preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH
                        Maximum length of the input string
  --preprocessor.mapping PREPROCESSOR.MAPPING
  --preprocessor.tolower PREPROCESSOR.TOLOWER
  --preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE
  --preprocessor.enable_emphasis_tag PREPROCESSOR.ENABLE_EMPHASIS_TAG
                        Boolean flag that controls whether the emphasis tag is
                        parsed during pre-processing
  --preprocessor.start_of_emphasis_token PREPROCESSOR.START_OF_EMPHASIS_TOKEN
                        Token that indicates the start of emphasis in the
                        given text
  --preprocessor.end_of_emphasis_token PREPROCESSOR.END_OF_EMPHASIS_TOKEN
                        Token that indicates the end of emphasis in the given
                        text
  --preprocessor.bos_token PREPROCESSOR.BOS_TOKEN
                        Beginning of sentence token for Magpie Flow and Magpie
                        Zero-shot
  --preprocessor.eos_token PREPROCESSOR.EOS_TOKEN
                        End of sentence token for Magpie Flow and Magpie Zero-
                        shot

magpie_tts:
  --magpie_tts.max_sequence_idle_microseconds MAGPIE_TTS.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --magpie_tts.max_batch_size MAGPIE_TTS.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --magpie_tts.min_batch_size MAGPIE_TTS.MIN_BATCH_SIZE
  --magpie_tts.opt_batch_size MAGPIE_TTS.OPT_BATCH_SIZE
  --magpie_tts.preferred_batch_size MAGPIE_TTS.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --magpie_tts.batching_type MAGPIE_TTS.BATCHING_TYPE
  --magpie_tts.preserve_ordering MAGPIE_TTS.PRESERVE_ORDERING
                        Preserve ordering
  --magpie_tts.instance_group_count MAGPIE_TTS.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --magpie_tts.max_queue_delay_microseconds MAGPIE_TTS.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --magpie_tts.optimization_graph_level MAGPIE_TTS.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --magpie_tts.chunk_ms MAGPIE_TTS.CHUNK_MS
                        Chunk size in ms
  --magpie_tts.history_future MAGPIE_TTS.HISTORY_FUTURE
                        Number of codec frames to use as history/future
  --magpie_tts.future_len MAGPIE_TTS.FUTURE_LEN
                        Number of codec frames to use as future
  --magpie_tts.history_len MAGPIE_TTS.HISTORY_LEN
                        Number of codec frames to use as history
  --magpie_tts.fade_ms MAGPIE_TTS.FADE_MS
                        Fade-in/Fade-out for the chunk in ms
  --magpie_tts.max_decoder_steps MAGPIE_TTS.MAX_DECODER_STEPS
                        Max decoder steps
  --magpie_tts.bos_id MAGPIE_TTS.BOS_ID
                        Text bos id.
  --magpie_tts.eos_id MAGPIE_TTS.EOS_ID
                        Text eos id.
  --magpie_tts.audio_bos_id MAGPIE_TTS.AUDIO_BOS_ID
                        Audio bos id.
  --magpie_tts.audio_eos_id MAGPIE_TTS.AUDIO_EOS_ID
                        Audio eos id.
  --magpie_tts.context_audio_bos_id MAGPIE_TTS.CONTEXT_AUDIO_BOS_ID
                        Context audio bos id.
  --magpie_tts.context_audio_eos_id MAGPIE_TTS.CONTEXT_AUDIO_EOS_ID
                        Context audio eos id.
  --magpie_tts.codec_downsampling_rate MAGPIE_TTS.CODEC_DOWNSAMPLING_RATE
                        Audio codec downsampling rate.
  --magpie_tts.temperature MAGPIE_TTS.TEMPERATURE
                        Temperature for decoding.
  --magpie_tts.cfg_scale MAGPIE_TTS.CFG_SCALE
                        CFG scale for CFG decoding.
  --magpie_tts.context_decoder_layers MAGPIE_TTS.CONTEXT_DECODER_LAYERS
                        Decoder layers corresponding to context.
  --magpie_tts.transcript_decoder_layers MAGPIE_TTS.TRANSCRIPT_DECODER_LAYERS
                        Decoder layers corresponding to text.
  --magpie_tts.apply_attention_prior MAGPIE_TTS.APPLY_ATTENTION_PRIOR
                        Apply attention priors.
  --magpie_tts.attention_prior_epsilon MAGPIE_TTS.ATTENTION_PRIOR_EPSILON
                        Attention prior epsilon value.
  --magpie_tts.attention_prior_lookahead_window MAGPIE_TTS.ATTENTION_PRIOR_LOOKAHEAD_WINDOW
                        Attention prior lookahead window
  --magpie_tts.estimate_alignment_from_layers MAGPIE_TTS.ESTIMATE_ALIGNMENT_FROM_LAYERS
                        Layers to estimate attention priors.
  --magpie_tts.apply_prior_to_layers MAGPIE_TTS.APPLY_PRIOR_TO_LAYERS
                        Apply attention priors to layers.
  --magpie_tts.attention_prior_window_length_right MAGPIE_TTS.ATTENTION_PRIOR_WINDOW_LENGTH_RIGHT
                        Attention prior window length right
  --magpie_tts.attention_prior_window_length_left MAGPIE_TTS.ATTENTION_PRIOR_WINDOW_LENGTH_LEFT
                        Attention prior window length left
  --magpie_tts.start_prior_after_n_audio_steps MAGPIE_TTS.START_PRIOR_AFTER_N_AUDIO_STEPS
                        Start priors application after audio_steps
  --magpie_tts.top_k MAGPIE_TTS.TOP_K
                        Top_k streams for decoding.
  --magpie_tts.max_context_len MAGPIE_TTS.MAX_CONTEXT_LEN
                        Max context length
  --magpie_tts.sample_rate MAGPIE_TTS.SAMPLE_RATE
                        Sampling rate

audio_codec:
  --audio_codec.max_sequence_idle_microseconds AUDIO_CODEC.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --audio_codec.max_batch_size AUDIO_CODEC.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --audio_codec.min_batch_size AUDIO_CODEC.MIN_BATCH_SIZE
  --audio_codec.opt_batch_size AUDIO_CODEC.OPT_BATCH_SIZE
  --audio_codec.preferred_batch_size AUDIO_CODEC.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --audio_codec.batching_type AUDIO_CODEC.BATCHING_TYPE
  --audio_codec.preserve_ordering AUDIO_CODEC.PRESERVE_ORDERING
                        Preserve ordering
  --audio_codec.instance_group_count AUDIO_CODEC.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --audio_codec.max_queue_delay_microseconds AUDIO_CODEC.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --audio_codec.optimization_graph_level AUDIO_CODEC.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --audio_codec.chunk_ms AUDIO_CODEC.CHUNK_MS
                        Chunk size in ms
  --audio_codec.history_future AUDIO_CODEC.HISTORY_FUTURE
                        Number of codec frames to use as history/future
  --audio_codec.future_len AUDIO_CODEC.FUTURE_LEN
                        Number of codec frames to use as future
  --audio_codec.history_len AUDIO_CODEC.HISTORY_LEN
                        Number of codec frames to use as history
  --audio_codec.fade_ms AUDIO_CODEC.FADE_MS
                        Fade-in/Fade-out for the chunk in ms
  --audio_codec.audio_bos_id AUDIO_CODEC.AUDIO_BOS_ID
                        Audio bos id.
  --audio_codec.audio_eos_id AUDIO_CODEC.AUDIO_EOS_ID
                        Audio eos id.
  --audio_codec.codec_downsampling_rate AUDIO_CODEC.CODEC_DOWNSAMPLING_RATE
                        Audio codec downsampling rate.
  --audio_codec.sample_rate AUDIO_CODEC.SAMPLE_RATE
                        Sampling rate

audio_codec_encoder:
  --audio_codec_encoder.max_sequence_idle_microseconds AUDIO_CODEC_ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --audio_codec_encoder.max_batch_size AUDIO_CODEC_ENCODER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --audio_codec_encoder.min_batch_size AUDIO_CODEC_ENCODER.MIN_BATCH_SIZE
  --audio_codec_encoder.opt_batch_size AUDIO_CODEC_ENCODER.OPT_BATCH_SIZE
  --audio_codec_encoder.preferred_batch_size AUDIO_CODEC_ENCODER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --audio_codec_encoder.batching_type AUDIO_CODEC_ENCODER.BATCHING_TYPE
  --audio_codec_encoder.preserve_ordering AUDIO_CODEC_ENCODER.PRESERVE_ORDERING
                        Preserve ordering
  --audio_codec_encoder.instance_group_count AUDIO_CODEC_ENCODER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --audio_codec_encoder.max_queue_delay_microseconds AUDIO_CODEC_ENCODER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --audio_codec_encoder.optimization_graph_level AUDIO_CODEC_ENCODER.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --audio_codec_encoder.trt_max_workspace_size AUDIO_CODEC_ENCODER.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in MB) to use for model export
                        to TensorRT
  --audio_codec_encoder.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --audio_codec_encoder.use_torchscript
                        Use TorchScript instead of TensorRT
  --audio_codec_encoder.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --audio_codec_encoder.use_trt_fp8
                        Use TensorRT engine with FP8 instead of FP16,
                        available on Ada and later (compute capability >= 8.9)
  --audio_codec_encoder.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network

neuralg2p:
  --neuralg2p.max_sequence_idle_microseconds NEURALG2P.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --neuralg2p.max_batch_size NEURALG2P.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --neuralg2p.min_batch_size NEURALG2P.MIN_BATCH_SIZE
  --neuralg2p.opt_batch_size NEURALG2P.OPT_BATCH_SIZE
  --neuralg2p.preferred_batch_size NEURALG2P.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --neuralg2p.batching_type NEURALG2P.BATCHING_TYPE
  --neuralg2p.preserve_ordering NEURALG2P.PRESERVE_ORDERING
                        Preserve ordering
  --neuralg2p.instance_group_count NEURALG2P.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --neuralg2p.max_queue_delay_microseconds NEURALG2P.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --neuralg2p.optimization_graph_level NEURALG2P.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --neuralg2p.trt_max_workspace_size NEURALG2P.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in MB) to use for model export
                        to TensorRT
  --neuralg2p.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --neuralg2p.use_torchscript
                        Use TorchScript instead of TensorRT
  --neuralg2p.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --neuralg2p.use_trt_fp8
                        Use TensorRT engine with FP8 instead of FP16,
                        available on Ada and later (compute capability >= 8.9)
  --neuralg2p.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network

tts_generator:
  --tts_generator.max_sequence_idle_microseconds TTS_GENERATOR.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --tts_generator.max_batch_size TTS_GENERATOR.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --tts_generator.min_batch_size TTS_GENERATOR.MIN_BATCH_SIZE
  --tts_generator.opt_batch_size TTS_GENERATOR.OPT_BATCH_SIZE
  --tts_generator.preferred_batch_size TTS_GENERATOR.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --tts_generator.batching_type TTS_GENERATOR.BATCHING_TYPE
  --tts_generator.preserve_ordering TTS_GENERATOR.PRESERVE_ORDERING
                        Preserve ordering
  --tts_generator.instance_group_count TTS_GENERATOR.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --tts_generator.max_queue_delay_microseconds TTS_GENERATOR.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --tts_generator.optimization_graph_level TTS_GENERATOR.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --tts_generator.trt_max_workspace_size TTS_GENERATOR.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in MB) to use for model export
                        to TensorRT
  --tts_generator.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --tts_generator.use_torchscript
                        Use TorchScript instead of TensorRT
  --tts_generator.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --tts_generator.use_trt_fp8
                        Use TensorRT engine with FP8 instead of FP16,
                        available on Ada and later (compute capability >= 8.9)
  --tts_generator.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network

tts_magpie_flow_rate_limiter:
  --tts_magpie_flow_rate_limiter.max_sequence_idle_microseconds TTS_MAGPIE_FLOW_RATE_LIMITER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --tts_magpie_flow_rate_limiter.max_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --tts_magpie_flow_rate_limiter.min_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.MIN_BATCH_SIZE
  --tts_magpie_flow_rate_limiter.opt_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.OPT_BATCH_SIZE
  --tts_magpie_flow_rate_limiter.preferred_batch_size TTS_MAGPIE_FLOW_RATE_LIMITER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --tts_magpie_flow_rate_limiter.batching_type TTS_MAGPIE_FLOW_RATE_LIMITER.BATCHING_TYPE
  --tts_magpie_flow_rate_limiter.preserve_ordering TTS_MAGPIE_FLOW_RATE_LIMITER.PRESERVE_ORDERING
                        Preserve ordering
  --tts_magpie_flow_rate_limiter.instance_group_count TTS_MAGPIE_FLOW_RATE_LIMITER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --tts_magpie_flow_rate_limiter.max_queue_delay_microseconds TTS_MAGPIE_FLOW_RATE_LIMITER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --tts_magpie_flow_rate_limiter.optimization_graph_level TTS_MAGPIE_FLOW_RATE_LIMITER.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration

bigvgan:
  --bigvgan.max_sequence_idle_microseconds BIGVGAN.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in microseconds
  --bigvgan.max_batch_size BIGVGAN.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --bigvgan.min_batch_size BIGVGAN.MIN_BATCH_SIZE
  --bigvgan.opt_batch_size BIGVGAN.OPT_BATCH_SIZE
  --bigvgan.preferred_batch_size BIGVGAN.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --bigvgan.batching_type BIGVGAN.BATCHING_TYPE
  --bigvgan.preserve_ordering BIGVGAN.PRESERVE_ORDERING
                        Preserve ordering
  --bigvgan.instance_group_count BIGVGAN.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --bigvgan.max_queue_delay_microseconds BIGVGAN.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --bigvgan.optimization_graph_level BIGVGAN.OPTIMIZATION_GRAPH_LEVEL
                        The Graph optimization level to use in Triton model
                        configuration
  --bigvgan.trt_max_workspace_size BIGVGAN.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in MB) to use for model export
                        to TensorRT
  --bigvgan.use_onnx_runtime
                        Use ONNX runtime instead of TensorRT
  --bigvgan.use_torchscript
                        Use TorchScript instead of TensorRT
  --bigvgan.use_trt_fp32
                        Use TensorRT engine with FP32 instead of FP16
  --bigvgan.use_trt_fp8
                        Use TensorRT engine with FP8 instead of FP16,
                        available on Ada and later (compute capability >= 8.9)
  --bigvgan.fp16_needs_obey_precision_pass
                        Flag to explicitly mark layers as float when parsing
                        the ONNX network