Speech Synthesis¶
The text-to-speech (TTS) pipeline implemented for the Riva TTS service is a two-stage pipeline: Riva first generates a mel spectrogram using the first model, then generates speech using the second model. This pipeline forms a text-to-speech system that enables you to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.
For new users, it is recommended to start with the FastPitch + HiFi-GAN models.
Model Architectures - Mel Spectrogram Generators¶
FastPitch: A non-autoregressive, transformer-based spectrogram generator that predicts duration and pitch, from the FastPitch: Parallel Text-to-speech with Pitch Prediction paper. FastPitch is the recommended mel spectrogram generator; it is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and the generated speech can be further controlled with the predicted contours. FastPitch can therefore change the perceived emotional state of the speaker or put emphasis on certain lexical units.
Tacotron 2: A modified Tacotron 2 model for mel spectrogram generation, from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper. Tacotron 2 is a sequence-to-sequence model that generates mel spectrograms from text and was originally designed to be used either with a mel spectrogram inversion algorithm such as the Griffin-Lim algorithm or a neural decoder such as WaveNet.
Model Architectures - Vocoders¶
HiFi-GAN: A GAN-based vocoder from the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis paper. HiFi-GAN is the recommended model architecture and achieves both efficient and high-fidelity speech synthesis.
WaveGlow: A flow-based vocoder from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper. Riva uses WaveGlow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. Unlike other neural vocoders, WaveGlow is not auto-regressive, which makes it more performant when running on GPUs.
Services¶
Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. When making a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests.
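As a rough illustration of the two modes, the following Python sketch issues one batch request and one streaming request over gRPC. It assumes the riva_api Python wheel from the quickstart; the exact module and field names may differ between Riva releases, so refer to the Riva client examples for the authoritative imports.

import grpc
import riva_api.riva_audio_pb2 as ra        # assumed module names from the riva_api wheel
import riva_api.riva_tts_pb2 as rtts
import riva_api.riva_tts_pb2_grpc as rtts_srv

channel = grpc.insecure_channel("localhost:50051")
tts = rtts_srv.RivaSpeechSynthesisStub(channel)

req = rtts.SynthesizeSpeechRequest(
    text="Hello, this is a test of Riva text to speech.",
    language_code="en-US",
    voice_name="ljspeech",                   # must match the --voice_name used at build time
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=22050,
)

# Batch mode: a single response containing the full audio for the request.
resp = tts.Synthesize(req)
audio = resp.audio

# Streaming mode: audio chunks are returned as soon as they are generated,
# which reduces time to first audio for long requests.
for chunk in tts.SynthesizeOnline(req):
    audio_chunk = chunk.audio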
Model Deployment¶
Like all Riva models, Riva TTS models must first be exported to .riva format, built into a Riva model intermediate representation (RMIR) file with riva-build, and then deployed to the model repository with riva-deploy. The following sections describe examples for specific steps in this workflow.
Creating Riva files¶
Riva files can be created from .nemo or .tao files. The following is an example of how a HiFi-GAN model can be converted from a .nemo file to a .riva file. First, download the .nemo file from NGC onto the host system. Then, run the NeMo container and share the .nemo file with the container using the -v option.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
-v $(pwd):/NeMo \
--shm-size=8g \
-p 8888:8888 \
-p 6006:6006 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/snd \
nvcr.io/nvidia/nemo:1.4.0
After the container has launched, run:
pip3 install nvidia-pyindex
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/riva/riva_quickstart/versions/1.7.0-beta/files/nemo2riva-1.7.0_beta-py3-none-any.whl
pip3 install nemo2riva-1.7.0_beta-py3-none-any.whl
nemo2riva --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo
You can repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. Be sure that you are getting the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva-{version}_beta-py3-none-any.whl version when doing the above step.
Note
Tacotron2 is kept as a .nemo file and is not supported with the nemo2riva tool.
Note
WaveGlow checkpoints built with newer NeMo versions do not work with the nemo2riva tool or the riva-build tool. Refer to the Riva 1.10.0 Known Issues section in the Release Notes.
Pipeline Configuration¶
FastPitch and HiFi-GAN¶
Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
/servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
- <rmir_filename> is the Riva rmir file that is generated
- <encryption_key> is the encryption key used during the export of the .riva file
- <pipeline_name> is an optional user-defined name for the components in the model repository
- <fastpitch_riva_filename> is the name of the riva file for FastPitch
- <hifigan_riva_filename> is the name of the riva file for HiFi-GAN
- <abbr_file> is the name of the file containing abbreviations and their corresponding expansions
- <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
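The RMIR file produced by riva-build is then deployed into the model repository with riva-deploy, also from within the ServiceMaker container. A minimal sketch, assuming /data/models as the target model repository path:

riva-deploy /servicemaker-dev/<rmir_filename>:<encryption_key> /data/models

If the RMIR file is not encrypted, omit the :<encryption_key> suffix.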
Tacotron2 and WaveGlow¶
In the simplest use case, you can deploy a Tacotron 2 and WaveGlow TTS pipeline as follows:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<tacotron_nemo_filename> \
/servicemaker-dev/<waveglow_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
- <rmir_filename> is the Riva rmir file that is generated
- <encryption_key> is the encryption key used during the export of the .riva file
- <pipeline_name> is an optional user-defined name for the components in the model repository
- <tacotron_nemo_filename> is the name of the nemo checkpoint file for Tacotron 2
- <waveglow_riva_filename> is the name of the riva file for the universal WaveGlow model
- <abbr_file> is the name of the file containing abbreviations and their corresponding expansions
- <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
Riva-build Optional Parameters¶
For details about the parameters passed to riva-build to customize the TTS pipeline, issue:
riva-build speech_synthesis -h
The following list includes descriptions for all optional parameters currently recognized by riva-build:
usage: riva-build speech_synthesis [-h] [-f] [--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--voice_name VOICE_NAME]
[--num_speakers NUM_SPEAKERS]
[--subvoices SUBVOICES]
[--sample_rate SAMPLE_RATE]
[--chunk_length CHUNK_LENGTH]
[--overlap_length OVERLAP_LENGTH]
[--num_mels NUM_MELS]
[--num_samples_per_frame NUM_SAMPLES_PER_FRAME]
[--abbreviations_file ABBREVIATIONS_FILE]
[--has_mapping_file HAS_MAPPING_FILE]
[--arpabet_file ARPABET_FILE]
[--denoiser.max_sequence_idle_microseconds DENOISER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--denoiser.max_batch_size DENOISER.MAX_BATCH_SIZE]
[--denoiser.min_batch_size DENOISER.MIN_BATCH_SIZE]
[--denoiser.opt_batch_size DENOISER.OPT_BATCH_SIZE]
[--denoiser.preferred_batch_size DENOISER.PREFERRED_BATCH_SIZE]
[--denoiser.batching_type DENOISER.BATCHING_TYPE]
[--denoiser.preserve_ordering DENOISER.PRESERVE_ORDERING]
[--denoiser.instance_group_count DENOISER.INSTANCE_GROUP_COUNT]
[--denoiser.max_queue_delay_microseconds DENOISER.MAX_QUEUE_DELAY_MICROSECONDS]
[--denoiser.fade_length DENOISER.FADE_LENGTH]
[--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE]
[--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE]
[--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE]
[--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE]
[--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE]
[--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING]
[--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT]
[--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
[--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH]
[--preprocessor.abbreviations_path PREPROCESSOR.ABBREVIATIONS_PATH]
[--preprocessor.dictionary_path PREPROCESSOR.DICTIONARY_PATH]
[--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS]
[--preprocessor.language PREPROCESSOR.LANGUAGE]
[--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH]
[--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH]
[--preprocessor.mapping PREPROCESSOR.MAPPING]
[--preprocessor.tolower PREPROCESSOR.TOLOWER]
[--preprocessor.generate_pron_chars PREPROCESSOR.GENERATE_PRON_CHARS]
[--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE]
[--encoder.max_sequence_idle_microseconds ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoder.max_batch_size ENCODER.MAX_BATCH_SIZE]
[--encoder.min_batch_size ENCODER.MIN_BATCH_SIZE]
[--encoder.opt_batch_size ENCODER.OPT_BATCH_SIZE]
[--encoder.preferred_batch_size ENCODER.PREFERRED_BATCH_SIZE]
[--encoder.batching_type ENCODER.BATCHING_TYPE]
[--encoder.preserve_ordering ENCODER.PRESERVE_ORDERING]
[--encoder.instance_group_count ENCODER.INSTANCE_GROUP_COUNT]
[--encoder.max_queue_delay_microseconds ENCODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoder.trt_max_workspace_size ENCODER.TRT_MAX_WORKSPACE_SIZE]
[--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE]
[--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE]
[--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE]
[--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE]
[--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE]
[--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING]
[--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT]
[--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE]
[--decoder.max_sequence_idle_microseconds DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--decoder.max_batch_size DECODER.MAX_BATCH_SIZE]
[--decoder.min_batch_size DECODER.MIN_BATCH_SIZE]
[--decoder.opt_batch_size DECODER.OPT_BATCH_SIZE]
[--decoder.preferred_batch_size DECODER.PREFERRED_BATCH_SIZE]
[--decoder.batching_type DECODER.BATCHING_TYPE]
[--decoder.preserve_ordering DECODER.PRESERVE_ORDERING]
[--decoder.instance_group_count DECODER.INSTANCE_GROUP_COUNT]
[--decoder.max_queue_delay_microseconds DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE]
[--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE]
[--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE]
[--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE]
[--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE]
[--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING]
[--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT]
[--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--waveglow.max_sequence_idle_microseconds WAVEGLOW.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--waveglow.max_batch_size WAVEGLOW.MAX_BATCH_SIZE]
[--waveglow.min_batch_size WAVEGLOW.MIN_BATCH_SIZE]
[--waveglow.opt_batch_size WAVEGLOW.OPT_BATCH_SIZE]
[--waveglow.preferred_batch_size WAVEGLOW.PREFERRED_BATCH_SIZE]
[--waveglow.batching_type WAVEGLOW.BATCHING_TYPE]
[--waveglow.preserve_ordering WAVEGLOW.PRESERVE_ORDERING]
[--waveglow.instance_group_count WAVEGLOW.INSTANCE_GROUP_COUNT]
[--waveglow.max_queue_delay_microseconds WAVEGLOW.MAX_QUEUE_DELAY_MICROSECONDS]
[--waveglow.trt_max_workspace_size WAVEGLOW.TRT_MAX_WORKSPACE_SIZE]
[--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE]
[--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE]
[--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE]
[--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE]
[--hifigan.batching_type HIFIGAN.BATCHING_TYPE]
[--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING]
[--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT]
[--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS]
[--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE]
output_path source_path [source_path ...]
Generate a Riva Model from a speech_synthesis model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--voice_name VOICE_NAME
Set the voice name for speech synthesis
--num_speakers NUM_SPEAKERS
Number of unique speakers.
--subvoices SUBVOICES
Comma-separated list of subvoices (no whitespace).
--sample_rate SAMPLE_RATE
Sample rate of the output signal
--chunk_length CHUNK_LENGTH
Chunk length in mel frames to synthesize at one time
--overlap_length OVERLAP_LENGTH
Chunk length in mel frames to overlap neighboring
chunks
--num_mels NUM_MELS number of mels
--num_samples_per_frame NUM_SAMPLES_PER_FRAME
number of samples per frame
--abbreviations_file ABBREVIATIONS_FILE
Path to file with list of abbreviations and
corresponding expansions
--has_mapping_file HAS_MAPPING_FILE
--arpabet_file ARPABET_FILE
Path to pronunciation dictionary
denoiser:
--denoiser.max_sequence_idle_microseconds DENOISER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--denoiser.max_batch_size DENOISER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--denoiser.min_batch_size DENOISER.MIN_BATCH_SIZE
--denoiser.opt_batch_size DENOISER.OPT_BATCH_SIZE
--denoiser.preferred_batch_size DENOISER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--denoiser.batching_type DENOISER.BATCHING_TYPE
--denoiser.preserve_ordering DENOISER.PRESERVE_ORDERING
Preserve ordering
--denoiser.instance_group_count DENOISER.INSTANCE_GROUP_COUNT
How many instances in a group
--denoiser.max_queue_delay_microseconds DENOISER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--denoiser.fade_length DENOISER.FADE_LENGTH
fade length
preprocessor:
--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE
--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE
--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE
--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING
Preserve ordering
--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT
How many instances in a group
--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH
--preprocessor.abbreviations_path PREPROCESSOR.ABBREVIATIONS_PATH
--preprocessor.dictionary_path PREPROCESSOR.DICTIONARY_PATH
--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS
--preprocessor.language PREPROCESSOR.LANGUAGE
--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH
maximum length of every emitted sequence
--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH
maximum length of input string
--preprocessor.mapping PREPROCESSOR.MAPPING
--preprocessor.tolower PREPROCESSOR.TOLOWER
--preprocessor.generate_pron_chars PREPROCESSOR.GENERATE_PRON_CHARS
--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE
encoder:
--encoder.max_sequence_idle_microseconds ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoder.max_batch_size ENCODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoder.min_batch_size ENCODER.MIN_BATCH_SIZE
--encoder.opt_batch_size ENCODER.OPT_BATCH_SIZE
--encoder.preferred_batch_size ENCODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoder.batching_type ENCODER.BATCHING_TYPE
--encoder.preserve_ordering ENCODER.PRESERVE_ORDERING
Preserve ordering
--encoder.instance_group_count ENCODER.INSTANCE_GROUP_COUNT
How many instances in a group
--encoder.max_queue_delay_microseconds ENCODER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--encoder.trt_max_workspace_size ENCODER.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
encoderFastPitch:
--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE
--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE
--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE
--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
decoder:
--decoder.max_sequence_idle_microseconds DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--decoder.max_batch_size DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--decoder.min_batch_size DECODER.MIN_BATCH_SIZE
--decoder.opt_batch_size DECODER.OPT_BATCH_SIZE
--decoder.preferred_batch_size DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--decoder.batching_type DECODER.BATCHING_TYPE
--decoder.preserve_ordering DECODER.PRESERVE_ORDERING
Preserve ordering
--decoder.instance_group_count DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--decoder.max_queue_delay_microseconds DECODER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
chunkerFastPitch:
--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE
--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE
--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE
--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
waveglow:
--waveglow.max_sequence_idle_microseconds WAVEGLOW.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--waveglow.max_batch_size WAVEGLOW.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--waveglow.min_batch_size WAVEGLOW.MIN_BATCH_SIZE
--waveglow.opt_batch_size WAVEGLOW.OPT_BATCH_SIZE
--waveglow.preferred_batch_size WAVEGLOW.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--waveglow.batching_type WAVEGLOW.BATCHING_TYPE
--waveglow.preserve_ordering WAVEGLOW.PRESERVE_ORDERING
Preserve ordering
--waveglow.instance_group_count WAVEGLOW.INSTANCE_GROUP_COUNT
How many instances in a group
--waveglow.max_queue_delay_microseconds WAVEGLOW.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--waveglow.trt_max_workspace_size WAVEGLOW.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
hifigan:
--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE
--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE
--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--hifigan.batching_type HIFIGAN.BATCHING_TYPE
--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING
Preserve ordering
--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT
How many instances in a group
--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
Speech Synthesis Markup Language (SSML)¶
Riva 1.8.0 adds preliminary support for SSML. Only the FastPitch model is supported at this time. There are no plans to add this functionality to Tacotron2. The FastPitch model must be exported using NeMo 1.5.1 and the nemo2riva 1.8.0 tool. All SSML inputs must be a valid XML document and use the <speak> root tag. All non-valid XML, and all valid XML with a different root tag, is treated as raw input text. Riva currently supports the following in a limited capacity:
- prosody tag
  - pitch attribute
  - rate attribute
Pitch Attribute¶
Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio returned. This value corresponds to a pitch shift equal to the attribute value multiplied by the speaker's pitch standard deviation from when the FastPitch model was trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation was 52.185. For example, a pitch value of 1.25 results in a pitch shift up of 1.25 * 52.185 ≈ 65.23 Hz.
Riva also supports the prosody tags as per the SSML specs. Prosody tags x-low, low, medium, high, x-high, and default are supported.
The pitch attribute is expressed in the following formats:
pitch="1"
pitch="+1.8"
pitch="-0.65"
pitch="high"
pitch="default"
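For example, a minimal SSML input that raises the pitch for part of a sentence might look as follows (the <speak> root tag is required; the sentence and the 1.25 value are only illustrative):

<speak>
    The quarterly results <prosody pitch="1.25">exceeded expectations</prosody> this year.
</speak>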
Rate Attribute¶
Riva supports a % relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio returned. It also supports the prosody tags as per the SSML specs. Prosody tags x-low, low, medium, high, x-high, and default are supported.
The rate attribute is expressed in the following formats:
rate="35%"
rate="+200%"
rate="low"
rate="default"
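Similarly, a minimal SSML input that slows down part of a sentence might look as follows (the sentence and the 75% value are only illustrative):

<speak>
    Please note: <prosody rate="75%">the following section is spoken more slowly</prosody>.
</speak>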
Warning
The pitch attribute currently does not support Hz, st, and % changes. Support is planned for a future Riva release.
For SSML examples with sample audio, refer to the Riva_speech_API_demo notebook section.
Pretrained Models¶
Task | Architecture | Language | Dataset | Link
---|---|---|---|---
Mel Spectrogram Generation | FastPitch | English | LJSpeech |
Mel Spectrogram Generation | Tacotron2 | English | LJSpeech |
Vocoder | HiFi-GAN | English | LJSpeech |
Vocoder | WaveGlow | English | LJSpeech |
Pretrained Quickstart Pipelines¶
FastPitch + HiFiGAN:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <fastpitch_riva_filename>:<key> \
    <hifigan_riva_filename>:<key> \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt

Tacotron2 + WaveGlow:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <tacotron_nemo_filename>:<key> \
    <waveglow_riva_filename>:<key> \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt