Custom Models#
Model Deployment#
Like all Riva models, deploying Riva TTS requires the following steps:

1. Create a .riva file for each model from either a .tao or a .nemo file, as outlined in the respective Model Development with TAO Toolkit and NeMo sections.
2. Create a .rmir file for each Riva skill (for example, ASR, NLP, and TTS) using riva-build.
3. Create model directories using riva-deploy.
4. Deploy the model directory using riva_server.

The following sections provide examples for specific steps as outlined above.
Creating Riva Files#
Riva files can be created from .nemo or .tao files. As mentioned in the respective TAO and NeMo sections, the generation of Riva files from .nemo or .tao files must be done on a Linux x86_64 workstation.
The following is an example of how a HiFi-GAN model can be converted from a .nemo file to a .riva file.

1. Download the .nemo file from NGC onto the host system.
2. Run the NeMo container, sharing the .nemo file with the container using the -v option.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
-v $(pwd):/NeMo \
--shm-size=8g \
-p 8888:8888 \
-p 6006:6006 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/snd \
nvcr.io/nvidia/nemo:1.4.0
After the container has launched, use nemo2riva to convert the .nemo file to a .riva file:
pip3 install nvidia-pyindex
VersionNum=2.4.0
ngc registry resource download-version "nvidia/riva/riva_quickstart:$VersionNum"
pip3 install "riva_quickstart_v$VersionNum/nemo2riva-$VersionNum-py3-none-any.whl"
nemo2riva --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo
Repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. Be sure that you are using the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva-{version}_beta-py3-none-any.whl version when performing the above steps.
Note
Tacotron2 is kept as a .nemo file and is not supported by the nemo2riva tool.
Note
WaveGlow models built with newer NeMo versions do not work with the nemo2riva tool or the riva-build tool.
Refer to the Known Issues section of the release notes (release-notes.html#riva-speech-skills-1-10-0-beta) for more information.
Custom Pronunciations#
Speech synthesis models deployed in Riva are configured with a language-specific pronunciation dictionary mapping a large vocabulary of words from their written form (graphemes) to a sequence of perceptually distinct sounds (phonemes). In cases where pronunciation is ambiguous, for example with heteronyms like bass (the fish) and bass (the musical instrument), the dictionary is ignored and the synthesis model uses context clues from the sentence to predict an appropriate pronunciation.
Modern speech synthesis models are remarkably good at predicting pronunciations of novel words. Sometimes, however, it is desirable or necessary to provide extra context to the model.
While custom pronunciations can be supplied at request time using SSML, request-time overrides are best suited for one-off adjustments. For domain-specific terms with fixed pronunciations, configure Riva with these pronunciations when deploying the server.
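As an illustration, a request-time SSML override might be composed as in the following sketch. Both the alphabet value "x-arpabet" and the {@PHONE} phone notation are assumptions here, not confirmed syntax; check the Riva SSML documentation for the exact form your release supports.

```python
# Build an SSML string that overrides the pronunciation of one word at
# request time. The alphabet value "x-arpabet" and the {@PHONE} notation
# are assumptions for illustration only.
word = "tomato"
phones = "{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}"
ssml = (
    "<speak>I say "
    f'<phoneme alphabet="x-arpabet" ph="{phones}">{word}</phoneme>'
    "</speak>"
)
print(ssml)
```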
There are two key parameters that can be configured through riva-build or in the preprocessor configuration that affect the phoneme path:

--arpabet_file is the path to the pronunciation dictionary. English language models use phonemes defined in ARPABET, and CMUdict is the default pronunciation dictionary; a starting dictionary is available on NGC. Modify this dictionary with custom entries as needed.
--preprocessor.g2p_ignore_ambiguous, if True, means that words with more than one phonetic representation in the pronunciation dictionary, such as "read", are not converted to phonemes. Defaults to True.
To determine the appropriate phoneme sequence, use the SSML API to experiment with phone sequences and evaluate the quality. Once the mapping sounds correct, add the discovered mapping to a new line in the dictionary.
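For example, a discovered mapping can be appended as a new dictionary line. The word and ARPABET phone sequence below are made up for illustration; the filename matches the cmudict-0.7b-nv0.01 dictionary referenced elsewhere on this page.

```shell
# Append a hypothetical custom entry (word, then space-separated ARPABET
# phones with stress markers) to a local copy of the pronunciation dictionary.
echo "RIVA  R IY1 V AH0" >> cmudict-0.7b-nv0.01
```

Rebuilding with --arpabet_file pointing at the modified file picks up the new entry.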
Multispeaker Models#
Riva supports models with multiple speakers. Currently, this feature is limited to FastPitch and HiFi-GAN models.
To enable this feature, specify the following parameters before building the model.

--voice_name is the name of the model. Defaults to English-US-Female-1.
--subvoices is a comma-separated list of names for each subvoice, with the length equal to the number of subvoices in the FastPitch model. For example, for a model with a "male" subvoice in the zeroth speaker embedding and a "female" subvoice in the first embedding, include the option --subvoices=Male:0,Female:1. If not provided, the desired embedding can be requested by integer index.
The voice name and subvoices are maintained in the generated .rmir file and carried into the generated Triton repositories. During inference, select a subvoice by appending a period and a valid subvoice name to the voice name of the request, in the form <voice_name>.<subvoice>.
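The <voice_name>.<subvoice> form can be composed programmatically. A minimal sketch, assuming a pipeline built with the hypothetical values --voice_name=English-US-Multispeaker and --subvoices=Male:0,Female:1:

```python
# Compose the voice name for a multispeaker synthesis request.
# "English-US-Multispeaker" and the subvoice name are hypothetical values
# chosen at riva-build time via --voice_name and --subvoices.
voice_name = "English-US-Multispeaker"
subvoice = "Male"
request_voice = f"{voice_name}.{subvoice}"
print(request_voice)  # English-US-Multispeaker.Male
```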
Custom Voice#
Riva is voice agnostic and can be run with any English-US TTS voice. In order to train a custom voice model, data must first be collected; we recommend at least 30 minutes of high-quality data. For collecting the data, refer to the Riva custom voice recorder. After the data has been collected, the FastPitch and HiFi-GAN models need to be fine-tuned on this dataset. Refer to the NeMo fine-tuning notebook or the TAO fine-tuning notebook for how to train these models. A Riva pipeline using these models can be built according to the instructions on this page.
Pretrained Models#
Task | Architecture | Language | Dataset | Link
---|---|---|---|---
Mel Spectrogram Generation | FastPitch | English | English-US-Female-1 |
Vocoder | HiFi-GAN | English | English-US-Female-1 |
Mel Spectrogram Generation | FastPitch | English | English-US-Male-1 |
Vocoder | HiFi-GAN | English | English-US-Male-1 |
Mel Spectrogram Generation | FastPitch | English | LJSpeech |
Mel Spectrogram Generation | Tacotron2 | English | LJSpeech |
Vocoder | HiFi-GAN | English | LJSpeech |
Vocoder | WaveGlow | English | LJSpeech |
Pipeline Configuration#
FastPitch and HiFi-GAN#
Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
/servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file> \
--wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
--wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
where:

<rmir_filename> is the Riva RMIR file that is generated.
<encryption_key> is the key used to encrypt the files. The encryption key for the pretrained Riva models uploaded to NGC is tlt_encode.
<pipeline_name> is an optional user-defined name for the components in the model repository.
<fastpitch_riva_filename> is the name of the .riva file for FastPitch.
<hifigan_riva_filename> is the name of the .riva file for HiFi-GAN.
<abbr_file> is the name of the file containing abbreviations and their corresponding expansions.
<dictionary_file> is the name of the file containing the pronunciation dictionary mapping words to their phonetic representation in ARPABET.
<subvoices> is a comma-separated list of names for each subvoice. Defaults to naming by integer index. This is needed for, and only used by, multi-speaker models.
<wfst_tokenizer_model> is the location of the tokenize_and_classify.far file generated by running NeMo Text Processing's export_grammar.sh script.
<wfst_verbalizer_model> is the location of the verbalize.far file generated by running NeMo Text Processing's export_grammar.sh script.
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR and riva filenames; otherwise, it is unnecessary.
For embedded platforms, a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the Riva-build Optional Parameters section and set the various max_batch_size parameters to 1 when executing the riva-build command.
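As a sketch, the relevant flags (all listed under Riva-build Optional Parameters) can be collected once and appended to the riva-build command; which per-component prefixes you need depends on the components in your pipeline.

```shell
# Collect batch-size-1 flags for an embedded FastPitch + HiFi-GAN build.
# The component prefixes used here are the FastPitch/HiFi-GAN ones from
# the option listing on this page; adjust for your pipeline.
BATCH1_FLAGS="--max_batch_size=1 \
  --denoiser.max_batch_size=1 \
  --preprocessor.max_batch_size=1 \
  --encoderFastPitch.max_batch_size=1 \
  --chunkerFastPitch.max_batch_size=1 \
  --hifigan.max_batch_size=1"
echo "$BATCH1_FLAGS"
# Append $BATCH1_FLAGS to the riva-build speech_synthesis command.
```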
Tacotron2 and WaveGlow#
Warning
Tacotron2 and WaveGlow are deprecated and no longer recommended.
In the simplest use case, you can deploy a Tacotron2 or WaveGlow TTS model as follows:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<tacotron_nemo_filename> \
/servicemaker-dev/<waveglow_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file> \
--wfst_tokenizer_model=/servicemaker-dev/<tokenizer_far_file> \
--wfst_verbalizer_model=/servicemaker-dev/<verbalizer_far_file> \
where:

<rmir_filename> is the Riva RMIR file that is generated.
<encryption_key> is the encryption key used during the export of the .riva file.
<pipeline_name> is an optional user-defined name for the components in the model repository.
<tacotron_nemo_filename> is the name of the .nemo checkpoint file for Tacotron 2.
<waveglow_riva_filename> is the name of the .riva file for the universal WaveGlow model.
<abbr_file> is the name of the file containing abbreviations and their corresponding expansions.
<dictionary_file> is the name of the file containing the pronunciation dictionary mapping words to their phonetic representation in ARPABET.
<wfst_tokenizer_model> is the location of the tokenize_and_classify.far file generated by running NeMo Text Processing's export_grammar.sh script.
<wfst_verbalizer_model> is the location of the verbalize.far file generated by running NeMo Text Processing's export_grammar.sh script.
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR and riva filenames; otherwise, it is unnecessary.
Pretrained Quick Start Pipelines#
FastPitch + HiFi-GAN Female 1:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <fastpitch_riva_filename>:<key> \
    <hifigan_riva_filename>:<key> \
    --sample_rate 44100 \
    --voice_name English-US-Female-1 \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt \
    --wfst_tokenizer_model=tokenize_and_classify.far \
    --wfst_verbalizer_model=verbalize.far

FastPitch + HiFi-GAN Male 1:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <fastpitch_riva_filename>:<key> \
    <hifigan_riva_filename>:<key> \
    --sample_rate 44100 \
    --voice_name English-US-Male-1 \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt \
    --wfst_tokenizer_model=tokenize_and_classify.far \
    --wfst_verbalizer_model=verbalize.far

FastPitch + HiFi-GAN LJSpeech:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <fastpitch_riva_filename>:<key> \
    <hifigan_riva_filename>:<key> \
    --voice_name ljspeech \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt \
    --wfst_tokenizer_model=tokenize_and_classify.far \
    --wfst_verbalizer_model=verbalize.far

Tacotron2 + WaveGlow:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <tacotron_nemo_filename>:<key> \
    <waveglow_riva_filename>:<key> \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt \
    --wfst_tokenizer_model=tokenize_and_classify.far \
    --wfst_verbalizer_model=verbalize.far
All text normalization .far
files are in NGC on the Riva TTS English Normalization Grammar page. All other auxiliary files that are not .riva
files (such as pronunciation dictionaries) are in NGC on the Riva TTS English US Auxiliary Files page.
Riva-build Optional Parameters#
For details about the parameters passed to riva-build
to customize the TTS pipeline, issue:
riva-build speech_synthesis -h
The following list includes descriptions for all optional parameters currently recognized by riva-build
:
usage: riva-build speech_synthesis [-h] [-f] [-v]
[--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--voice_name VOICE_NAME]
[--num_speakers NUM_SPEAKERS]
[--subvoices SUBVOICES]
[--sample_rate SAMPLE_RATE]
[--chunk_length CHUNK_LENGTH]
[--overlap_length OVERLAP_LENGTH]
[--num_mels NUM_MELS]
[--num_samples_per_frame NUM_SAMPLES_PER_FRAME]
[--abbreviations_file ABBREVIATIONS_FILE]
[--has_mapping_file HAS_MAPPING_FILE]
[--arpabet_file ARPABET_FILE]
[--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
[--wfst_verbalizer_model WFST_VERBALIZER_MODEL]
[--denoiser.max_sequence_idle_microseconds DENOISER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--denoiser.max_batch_size DENOISER.MAX_BATCH_SIZE]
[--denoiser.min_batch_size DENOISER.MIN_BATCH_SIZE]
[--denoiser.opt_batch_size DENOISER.OPT_BATCH_SIZE]
[--denoiser.preferred_batch_size DENOISER.PREFERRED_BATCH_SIZE]
[--denoiser.batching_type DENOISER.BATCHING_TYPE]
[--denoiser.preserve_ordering DENOISER.PRESERVE_ORDERING]
[--denoiser.instance_group_count DENOISER.INSTANCE_GROUP_COUNT]
[--denoiser.max_queue_delay_microseconds DENOISER.MAX_QUEUE_DELAY_MICROSECONDS]
[--denoiser.fade_length DENOISER.FADE_LENGTH]
[--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE]
[--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE]
[--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE]
[--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE]
[--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE]
[--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING]
[--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT]
[--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
[--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH]
[--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS]
[--preprocessor.language PREPROCESSOR.LANGUAGE]
[--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH]
[--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH]
[--preprocessor.mapping PREPROCESSOR.MAPPING]
[--preprocessor.tolower PREPROCESSOR.TOLOWER]
[--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE]
[--encoder.max_sequence_idle_microseconds ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoder.max_batch_size ENCODER.MAX_BATCH_SIZE]
[--encoder.min_batch_size ENCODER.MIN_BATCH_SIZE]
[--encoder.opt_batch_size ENCODER.OPT_BATCH_SIZE]
[--encoder.preferred_batch_size ENCODER.PREFERRED_BATCH_SIZE]
[--encoder.batching_type ENCODER.BATCHING_TYPE]
[--encoder.preserve_ordering ENCODER.PRESERVE_ORDERING]
[--encoder.instance_group_count ENCODER.INSTANCE_GROUP_COUNT]
[--encoder.max_queue_delay_microseconds ENCODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoder.trt_max_workspace_size ENCODER.TRT_MAX_WORKSPACE_SIZE]
[--encoder.use_onnx_runtime]
[--encoder.use_trt_fp32]
[--encoder.fp16_needs_obey_precision_pass]
[--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE]
[--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE]
[--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE]
[--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE]
[--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE]
[--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING]
[--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT]
[--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE]
[--encoderFastPitch.use_onnx_runtime]
[--encoderFastPitch.use_trt_fp32]
[--encoderFastPitch.fp16_needs_obey_precision_pass]
[--decoder.max_sequence_idle_microseconds DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--decoder.max_batch_size DECODER.MAX_BATCH_SIZE]
[--decoder.min_batch_size DECODER.MIN_BATCH_SIZE]
[--decoder.opt_batch_size DECODER.OPT_BATCH_SIZE]
[--decoder.preferred_batch_size DECODER.PREFERRED_BATCH_SIZE]
[--decoder.batching_type DECODER.BATCHING_TYPE]
[--decoder.preserve_ordering DECODER.PRESERVE_ORDERING]
[--decoder.instance_group_count DECODER.INSTANCE_GROUP_COUNT]
[--decoder.max_queue_delay_microseconds DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE]
[--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE]
[--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE]
[--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE]
[--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE]
[--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING]
[--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT]
[--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--waveglow.max_sequence_idle_microseconds WAVEGLOW.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--waveglow.max_batch_size WAVEGLOW.MAX_BATCH_SIZE]
[--waveglow.min_batch_size WAVEGLOW.MIN_BATCH_SIZE]
[--waveglow.opt_batch_size WAVEGLOW.OPT_BATCH_SIZE]
[--waveglow.preferred_batch_size WAVEGLOW.PREFERRED_BATCH_SIZE]
[--waveglow.batching_type WAVEGLOW.BATCHING_TYPE]
[--waveglow.preserve_ordering WAVEGLOW.PRESERVE_ORDERING]
[--waveglow.instance_group_count WAVEGLOW.INSTANCE_GROUP_COUNT]
[--waveglow.max_queue_delay_microseconds WAVEGLOW.MAX_QUEUE_DELAY_MICROSECONDS]
[--waveglow.trt_max_workspace_size WAVEGLOW.TRT_MAX_WORKSPACE_SIZE]
[--waveglow.use_onnx_runtime]
[--waveglow.use_trt_fp32]
[--waveglow.fp16_needs_obey_precision_pass]
[--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE]
[--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE]
[--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE]
[--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE]
[--hifigan.batching_type HIFIGAN.BATCHING_TYPE]
[--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING]
[--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT]
[--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS]
[--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE]
[--hifigan.use_onnx_runtime]
[--hifigan.use_trt_fp32]
[--hifigan.fp16_needs_obey_precision_pass]
output_path source_path [source_path ...]
Generate a Riva Model from a speech_synthesis model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
-v, --verbose Verbose log outputs
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--voice_name VOICE_NAME
Set the voice name for speech synthesis
--num_speakers NUM_SPEAKERS
Number of unique speakers.
--subvoices SUBVOICES
Comma-separated list of subvoices (no whitespace).
--sample_rate SAMPLE_RATE
Sample rate of the output signal
--chunk_length CHUNK_LENGTH
Chunk length in mel frames to synthesize at one time
--overlap_length OVERLAP_LENGTH
Chunk length in mel frames to overlap neighboring
chunks
--num_mels NUM_MELS number of mels
--num_samples_per_frame NUM_SAMPLES_PER_FRAME
number of samples per frame
--abbreviations_file ABBREVIATIONS_FILE
Path to file with list of abbreviations and
corresponding expansions
--has_mapping_file HAS_MAPPING_FILE
--arpabet_file ARPABET_FILE
Path to pronunciation dictionary
--wfst_tokenizer_model WFST_TOKENIZER_MODEL
Sparrowhawk model to use for tokenization and
classification, must be in .far format
--wfst_verbalizer_model WFST_VERBALIZER_MODEL
Sparrowhawk model to use for verbalizer, must be in
.far format.
denoiser:
--denoiser.max_sequence_idle_microseconds DENOISER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--denoiser.max_batch_size DENOISER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--denoiser.min_batch_size DENOISER.MIN_BATCH_SIZE
--denoiser.opt_batch_size DENOISER.OPT_BATCH_SIZE
--denoiser.preferred_batch_size DENOISER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--denoiser.batching_type DENOISER.BATCHING_TYPE
--denoiser.preserve_ordering DENOISER.PRESERVE_ORDERING
Preserve ordering
--denoiser.instance_group_count DENOISER.INSTANCE_GROUP_COUNT
How many instances in a group
--denoiser.max_queue_delay_microseconds DENOISER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--denoiser.fade_length DENOISER.FADE_LENGTH
fade length
preprocessor:
--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE
--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE
--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE
--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING
Preserve ordering
--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT
How many instances in a group
--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH
--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS
--preprocessor.language PREPROCESSOR.LANGUAGE
--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH
maximum length of every emitted sequence
--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH
maximum length of input string
--preprocessor.mapping PREPROCESSOR.MAPPING
--preprocessor.tolower PREPROCESSOR.TOLOWER
--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE
encoder:
--encoder.max_sequence_idle_microseconds ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoder.max_batch_size ENCODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoder.min_batch_size ENCODER.MIN_BATCH_SIZE
--encoder.opt_batch_size ENCODER.OPT_BATCH_SIZE
--encoder.preferred_batch_size ENCODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoder.batching_type ENCODER.BATCHING_TYPE
--encoder.preserve_ordering ENCODER.PRESERVE_ORDERING
Preserve ordering
--encoder.instance_group_count ENCODER.INSTANCE_GROUP_COUNT
How many instances in a group
--encoder.max_queue_delay_microseconds ENCODER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--encoder.trt_max_workspace_size ENCODER.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--encoder.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--encoder.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--encoder.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
encoderFastPitch:
--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE
--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE
--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE
--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--encoderFastPitch.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--encoderFastPitch.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--encoderFastPitch.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
decoder:
--decoder.max_sequence_idle_microseconds DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--decoder.max_batch_size DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--decoder.min_batch_size DECODER.MIN_BATCH_SIZE
--decoder.opt_batch_size DECODER.OPT_BATCH_SIZE
--decoder.preferred_batch_size DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--decoder.batching_type DECODER.BATCHING_TYPE
--decoder.preserve_ordering DECODER.PRESERVE_ORDERING
Preserve ordering
--decoder.instance_group_count DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--decoder.max_queue_delay_microseconds DECODER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
chunkerFastPitch:
--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE
--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE
--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE
--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
waveglow:
--waveglow.max_sequence_idle_microseconds WAVEGLOW.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--waveglow.max_batch_size WAVEGLOW.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--waveglow.min_batch_size WAVEGLOW.MIN_BATCH_SIZE
--waveglow.opt_batch_size WAVEGLOW.OPT_BATCH_SIZE
--waveglow.preferred_batch_size WAVEGLOW.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--waveglow.batching_type WAVEGLOW.BATCHING_TYPE
--waveglow.preserve_ordering WAVEGLOW.PRESERVE_ORDERING
Preserve ordering
--waveglow.instance_group_count WAVEGLOW.INSTANCE_GROUP_COUNT
How many instances in a group
--waveglow.max_queue_delay_microseconds WAVEGLOW.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--waveglow.trt_max_workspace_size WAVEGLOW.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--waveglow.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--waveglow.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--waveglow.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network
hifigan:
--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE
--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE
--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--hifigan.batching_type HIFIGAN.BATCHING_TYPE
--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING
Preserve ordering
--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT
How many instances in a group
--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
--hifigan.use_onnx_runtime
Use ONNX runtime instead of TensorRT
--hifigan.use_trt_fp32
Use TensorRT engine with FP32 instead of FP16
--hifigan.fp16_needs_obey_precision_pass
Flag to explicitly mark layers as float when parsing
the ONNX network