Speech Synthesis¶
The text-to-speech (TTS) pipeline implemented for the Riva TTS service is a two-stage pipeline: Riva first generates a mel spectrogram using the first model, then generates speech using the second model. This pipeline forms a text-to-speech system that enables you to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.
For new users, it is recommended to start with the FastPitch + HiFi-GAN models.
Model Architectures - Mel Spectrogram Generators¶
FastPitch: A non-autoregressive, transformer-based spectrogram generator that predicts duration and pitch, from the FastPitch: Parallel Text-to-speech with Pitch Prediction paper. FastPitch is the recommended mel spectrogram generator; it is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and the generated speech can be further controlled with the predicted contours. FastPitch can therefore change the perceived emotional state of the speaker or put emphasis on certain lexical units.
Tacotron 2: A modified Tacotron 2 model for mel spectrogram generation, from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper. Tacotron 2 is a sequence-to-sequence model that generates mel spectrograms from text and was originally designed to be used either with a mel spectrogram inversion algorithm such as the Griffin-Lim algorithm or a neural decoder such as WaveNet.
Model Architectures - Vocoders¶
HiFi-GAN: A GAN-based vocoder from the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis paper. HiFi-GAN is the recommended model architecture and achieves both efficient and high-fidelity speech synthesis.
WaveGlow: A flow-based vocoder from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper. Riva uses WaveGlow as the neural vocoder, which is responsible for converting frame-level acoustic features into a waveform at audio rates. Unlike other neural vocoders, WaveGlow is not auto-regressive, which makes it more performant when running on GPUs.
Services¶
Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. When making a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests.
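As a rough illustration of the two modes, the following Python sketch issues one batch request and one streaming request over gRPC. It assumes the riva_api Python wheel from the quickstart; the exact module and field names may differ between Riva releases, so refer to the Riva client examples for the authoritative imports.

import grpc
import riva_api.riva_audio_pb2 as ra        # assumed module names from the riva_api wheel
import riva_api.riva_tts_pb2 as rtts
import riva_api.riva_tts_pb2_grpc as rtts_srv

channel = grpc.insecure_channel("localhost:50051")
tts = rtts_srv.RivaSpeechSynthesisStub(channel)

req = rtts.SynthesizeSpeechRequest(
    text="Hello, this is a test of Riva text to speech.",
    language_code="en-US",
    voice_name="ljspeech",                   # must match the --voice_name used at build time
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=22050,
)

# Batch mode: a single response containing the full audio for the request.
resp = tts.Synthesize(req)
audio = resp.audio

# Streaming mode: audio chunks are returned as soon as they are generated,
# which reduces time to first audio for long requests.
for chunk in tts.SynthesizeOnline(req):
    audio_chunk = chunk.audio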
Model Deployment¶
Like all Riva models, Riva TTS models must first be exported to .riva format, built into a Riva model intermediate representation (RMIR) file with riva-build, and then deployed to the model repository with riva-deploy. The following sections describe examples for specific steps in this workflow.
Creating Riva files¶
Riva files can be created from .nemo or .tao files. The following is an example of how a HiFi-GAN model can be converted from a .nemo file to a .riva file. First, download the .nemo file from NGC onto the host system. Then, run the NeMo container and share the .nemo file with the container using the -v option.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip
unzip tts_hifigan_1.0.0rc1.zip
docker run --gpus all -it --rm \
-v $(pwd):/NeMo \
--shm-size=8g \
-p 8888:8888 \
-p 6006:6006 \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/snd \
nvcr.io/nvidia/nemo:1.4.0
After the container has launched, run:
pip3 install nvidia-pyindex
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/riva/riva_quickstart/versions/1.7.0-beta/files/nemo2riva-1.7.0_beta-py3-none-any.whl
pip3 install nemo2riva-1.7.0_beta-py3-none-any.whl
nemo2riva --out /NeMo/hifigan.riva /NeMo/tts_hifigan.nemo
You can repeat this process for each .nemo model to generate .riva files. It is suggested that you do so for FastPitch before continuing to the next step. Be sure that you are getting the latest tts_hifigan.nemo checkpoint, the latest nvcr.io/nvidia/nemo container version, and the latest nemo2riva-{version}_beta-py3-none-any.whl version when doing the above step.
Note
Tacotron2 is kept as a .nemo file and is not supported with the nemo2riva tool.
Note
WaveGlow checkpoints built with newer NeMo versions do not work with the nemo2riva tool or the riva-build tool. Refer to the Riva 1.10.0 Known Issues section in the Release Notes.
Pipeline Configuration¶
FastPitch and HiFi-GAN¶
Deploy a FastPitch and HiFi-GAN TTS pipeline as follows from within the ServiceMaker container:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<fastpitch_riva_filename>:<encryption_key> \
/servicemaker-dev/<hifigan_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
- <rmir_filename> is the Riva rmir file that is generated
- <encryption_key> is the encryption key used during the export of the .riva file
- <pipeline_name> is an optional user-defined name for the components in the model repository
- <fastpitch_riva_filename> is the name of the riva file for FastPitch
- <hifigan_riva_filename> is the name of the riva file for HiFi-GAN
- <abbr_file> is the name of the file containing abbreviations and their corresponding expansions
- <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
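The RMIR file produced by riva-build is then deployed into the model repository with riva-deploy, also from within the ServiceMaker container. A minimal sketch, assuming /data/models as the target model repository path:

riva-deploy /servicemaker-dev/<rmir_filename>:<encryption_key> /data/models

If the RMIR file is not encrypted, omit the :<encryption_key> suffix.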
Tacotron2 and WaveGlow¶
In the simplest use case, you can deploy a Tacotron 2 and WaveGlow TTS pipeline as follows:
riva-build speech_synthesis \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<tacotron_nemo_filename> \
/servicemaker-dev/<waveglow_riva_filename>:<encryption_key> \
--voice_name=<pipeline_name> \
--abbreviations_file=/servicemaker-dev/<abbr_file> \
--arpabet_file=/servicemaker-dev/<dictionary_file>
where:
- <rmir_filename> is the Riva rmir file that is generated
- <encryption_key> is the encryption key used during the export of the .riva file
- <pipeline_name> is an optional user-defined name for the components in the model repository
- <tacotron_nemo_filename> is the name of the nemo checkpoint file for Tacotron 2
- <waveglow_riva_filename> is the name of the riva file for the universal WaveGlow model
- <abbr_file> is the name of the file containing abbreviations and their corresponding expansions
- <dictionary_file> is the name of the file containing the pronunciation dictionary mapping from words to their phonetic representation in ARPABET
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
Riva-build Optional Parameters¶
For details about the parameters passed to riva-build to customize the TTS pipeline, issue:
riva-build speech_synthesis -h
The following list includes descriptions for all optional parameters currently recognized by riva-build:
usage: riva-build speech_synthesis [-h] [-f] [--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--voice_name VOICE_NAME]
[--num_speakers NUM_SPEAKERS]
[--subvoices SUBVOICES]
[--sample_rate SAMPLE_RATE]
[--chunk_length CHUNK_LENGTH]
[--overlap_length OVERLAP_LENGTH]
[--num_mels NUM_MELS]
[--num_samples_per_frame NUM_SAMPLES_PER_FRAME]
[--abbreviations_file ABBREVIATIONS_FILE]
[--has_mapping_file HAS_MAPPING_FILE]
[--arpabet_file ARPABET_FILE]
[--denoiser.max_sequence_idle_microseconds DENOISER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--denoiser.max_batch_size DENOISER.MAX_BATCH_SIZE]
[--denoiser.min_batch_size DENOISER.MIN_BATCH_SIZE]
[--denoiser.opt_batch_size DENOISER.OPT_BATCH_SIZE]
[--denoiser.preferred_batch_size DENOISER.PREFERRED_BATCH_SIZE]
[--denoiser.batching_type DENOISER.BATCHING_TYPE]
[--denoiser.preserve_ordering DENOISER.PRESERVE_ORDERING]
[--denoiser.instance_group_count DENOISER.INSTANCE_GROUP_COUNT]
[--denoiser.max_queue_delay_microseconds DENOISER.MAX_QUEUE_DELAY_MICROSECONDS]
[--denoiser.fade_length DENOISER.FADE_LENGTH]
[--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE]
[--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE]
[--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE]
[--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE]
[--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE]
[--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING]
[--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT]
[--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
[--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH]
[--preprocessor.abbreviations_path PREPROCESSOR.ABBREVIATIONS_PATH]
[--preprocessor.dictionary_path PREPROCESSOR.DICTIONARY_PATH]
[--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS]
[--preprocessor.language PREPROCESSOR.LANGUAGE]
[--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH]
[--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH]
[--preprocessor.mapping PREPROCESSOR.MAPPING]
[--preprocessor.tolower PREPROCESSOR.TOLOWER]
[--preprocessor.generate_pron_chars PREPROCESSOR.GENERATE_PRON_CHARS]
[--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE]
[--encoder.max_sequence_idle_microseconds ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoder.max_batch_size ENCODER.MAX_BATCH_SIZE]
[--encoder.min_batch_size ENCODER.MIN_BATCH_SIZE]
[--encoder.opt_batch_size ENCODER.OPT_BATCH_SIZE]
[--encoder.preferred_batch_size ENCODER.PREFERRED_BATCH_SIZE]
[--encoder.batching_type ENCODER.BATCHING_TYPE]
[--encoder.preserve_ordering ENCODER.PRESERVE_ORDERING]
[--encoder.instance_group_count ENCODER.INSTANCE_GROUP_COUNT]
[--encoder.max_queue_delay_microseconds ENCODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoder.trt_max_workspace_size ENCODER.TRT_MAX_WORKSPACE_SIZE]
[--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE]
[--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE]
[--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE]
[--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE]
[--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE]
[--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING]
[--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT]
[--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE]
[--decoder.max_sequence_idle_microseconds DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--decoder.max_batch_size DECODER.MAX_BATCH_SIZE]
[--decoder.min_batch_size DECODER.MIN_BATCH_SIZE]
[--decoder.opt_batch_size DECODER.OPT_BATCH_SIZE]
[--decoder.preferred_batch_size DECODER.PREFERRED_BATCH_SIZE]
[--decoder.batching_type DECODER.BATCHING_TYPE]
[--decoder.preserve_ordering DECODER.PRESERVE_ORDERING]
[--decoder.instance_group_count DECODER.INSTANCE_GROUP_COUNT]
[--decoder.max_queue_delay_microseconds DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE]
[--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE]
[--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE]
[--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE]
[--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE]
[--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING]
[--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT]
[--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS]
[--waveglow.max_sequence_idle_microseconds WAVEGLOW.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--waveglow.max_batch_size WAVEGLOW.MAX_BATCH_SIZE]
[--waveglow.min_batch_size WAVEGLOW.MIN_BATCH_SIZE]
[--waveglow.opt_batch_size WAVEGLOW.OPT_BATCH_SIZE]
[--waveglow.preferred_batch_size WAVEGLOW.PREFERRED_BATCH_SIZE]
[--waveglow.batching_type WAVEGLOW.BATCHING_TYPE]
[--waveglow.preserve_ordering WAVEGLOW.PRESERVE_ORDERING]
[--waveglow.instance_group_count WAVEGLOW.INSTANCE_GROUP_COUNT]
[--waveglow.max_queue_delay_microseconds WAVEGLOW.MAX_QUEUE_DELAY_MICROSECONDS]
[--waveglow.trt_max_workspace_size WAVEGLOW.TRT_MAX_WORKSPACE_SIZE]
[--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE]
[--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE]
[--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE]
[--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE]
[--hifigan.batching_type HIFIGAN.BATCHING_TYPE]
[--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING]
[--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT]
[--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS]
[--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE]
output_path source_path [source_path ...]
Generate a Riva Model from a speech_synthesis model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--voice_name VOICE_NAME
Set the voice name for speech synthesis
--num_speakers NUM_SPEAKERS
Number of unique speakers.
--subvoices SUBVOICES
Comma-separated list of subvoices (no whitespace).
--sample_rate SAMPLE_RATE
Sample rate of the output signal
--chunk_length CHUNK_LENGTH
Chunk length in mel frames to synthesize at one time
--overlap_length OVERLAP_LENGTH
Chunk length in mel frames to overlap neighboring
chunks
--num_mels NUM_MELS number of mels
--num_samples_per_frame NUM_SAMPLES_PER_FRAME
number of samples per frame
--abbreviations_file ABBREVIATIONS_FILE
Path to file with list of abbreviations and
corresponding expansions
--has_mapping_file HAS_MAPPING_FILE
--arpabet_file ARPABET_FILE
Path to pronunciation dictionary
denoiser:
--denoiser.max_sequence_idle_microseconds DENOISER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--denoiser.max_batch_size DENOISER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--denoiser.min_batch_size DENOISER.MIN_BATCH_SIZE
--denoiser.opt_batch_size DENOISER.OPT_BATCH_SIZE
--denoiser.preferred_batch_size DENOISER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--denoiser.batching_type DENOISER.BATCHING_TYPE
--denoiser.preserve_ordering DENOISER.PRESERVE_ORDERING
Preserve ordering
--denoiser.instance_group_count DENOISER.INSTANCE_GROUP_COUNT
How many instances in a group
--denoiser.max_queue_delay_microseconds DENOISER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--denoiser.fade_length DENOISER.FADE_LENGTH
fade length
preprocessor:
--preprocessor.max_sequence_idle_microseconds PREPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--preprocessor.max_batch_size PREPROCESSOR.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--preprocessor.min_batch_size PREPROCESSOR.MIN_BATCH_SIZE
--preprocessor.opt_batch_size PREPROCESSOR.OPT_BATCH_SIZE
--preprocessor.preferred_batch_size PREPROCESSOR.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--preprocessor.batching_type PREPROCESSOR.BATCHING_TYPE
--preprocessor.preserve_ordering PREPROCESSOR.PRESERVE_ORDERING
Preserve ordering
--preprocessor.instance_group_count PREPROCESSOR.INSTANCE_GROUP_COUNT
How many instances in a group
--preprocessor.max_queue_delay_microseconds PREPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--preprocessor.mapping_path PREPROCESSOR.MAPPING_PATH
--preprocessor.abbreviations_path PREPROCESSOR.ABBREVIATIONS_PATH
--preprocessor.dictionary_path PREPROCESSOR.DICTIONARY_PATH
--preprocessor.g2p_ignore_ambiguous PREPROCESSOR.G2P_IGNORE_AMBIGUOUS
--preprocessor.language PREPROCESSOR.LANGUAGE
--preprocessor.max_sequence_length PREPROCESSOR.MAX_SEQUENCE_LENGTH
maximum length of every emitted sequence
--preprocessor.max_input_length PREPROCESSOR.MAX_INPUT_LENGTH
maximum length of input string
--preprocessor.mapping PREPROCESSOR.MAPPING
--preprocessor.tolower PREPROCESSOR.TOLOWER
--preprocessor.generate_pron_chars PREPROCESSOR.GENERATE_PRON_CHARS
--preprocessor.pad_with_space PREPROCESSOR.PAD_WITH_SPACE
encoder:
--encoder.max_sequence_idle_microseconds ENCODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoder.max_batch_size ENCODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoder.min_batch_size ENCODER.MIN_BATCH_SIZE
--encoder.opt_batch_size ENCODER.OPT_BATCH_SIZE
--encoder.preferred_batch_size ENCODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoder.batching_type ENCODER.BATCHING_TYPE
--encoder.preserve_ordering ENCODER.PRESERVE_ORDERING
Preserve ordering
--encoder.instance_group_count ENCODER.INSTANCE_GROUP_COUNT
How many instances in a group
--encoder.max_queue_delay_microseconds ENCODER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--encoder.trt_max_workspace_size ENCODER.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
encoderFastPitch:
--encoderFastPitch.max_sequence_idle_microseconds ENCODERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--encoderFastPitch.max_batch_size ENCODERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--encoderFastPitch.min_batch_size ENCODERFASTPITCH.MIN_BATCH_SIZE
--encoderFastPitch.opt_batch_size ENCODERFASTPITCH.OPT_BATCH_SIZE
--encoderFastPitch.preferred_batch_size ENCODERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--encoderFastPitch.batching_type ENCODERFASTPITCH.BATCHING_TYPE
--encoderFastPitch.preserve_ordering ENCODERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--encoderFastPitch.instance_group_count ENCODERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--encoderFastPitch.max_queue_delay_microseconds ENCODERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--encoderFastPitch.trt_max_workspace_size ENCODERFASTPITCH.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
decoder:
--decoder.max_sequence_idle_microseconds DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--decoder.max_batch_size DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--decoder.min_batch_size DECODER.MIN_BATCH_SIZE
--decoder.opt_batch_size DECODER.OPT_BATCH_SIZE
--decoder.preferred_batch_size DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--decoder.batching_type DECODER.BATCHING_TYPE
--decoder.preserve_ordering DECODER.PRESERVE_ORDERING
Preserve ordering
--decoder.instance_group_count DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--decoder.max_queue_delay_microseconds DECODER.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
chunkerFastPitch:
--chunkerFastPitch.max_sequence_idle_microseconds CHUNKERFASTPITCH.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--chunkerFastPitch.max_batch_size CHUNKERFASTPITCH.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--chunkerFastPitch.min_batch_size CHUNKERFASTPITCH.MIN_BATCH_SIZE
--chunkerFastPitch.opt_batch_size CHUNKERFASTPITCH.OPT_BATCH_SIZE
--chunkerFastPitch.preferred_batch_size CHUNKERFASTPITCH.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--chunkerFastPitch.batching_type CHUNKERFASTPITCH.BATCHING_TYPE
--chunkerFastPitch.preserve_ordering CHUNKERFASTPITCH.PRESERVE_ORDERING
Preserve ordering
--chunkerFastPitch.instance_group_count CHUNKERFASTPITCH.INSTANCE_GROUP_COUNT
How many instances in a group
--chunkerFastPitch.max_queue_delay_microseconds CHUNKERFASTPITCH.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
waveglow:
--waveglow.max_sequence_idle_microseconds WAVEGLOW.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--waveglow.max_batch_size WAVEGLOW.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--waveglow.min_batch_size WAVEGLOW.MIN_BATCH_SIZE
--waveglow.opt_batch_size WAVEGLOW.OPT_BATCH_SIZE
--waveglow.preferred_batch_size WAVEGLOW.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--waveglow.batching_type WAVEGLOW.BATCHING_TYPE
--waveglow.preserve_ordering WAVEGLOW.PRESERVE_ORDERING
Preserve ordering
--waveglow.instance_group_count WAVEGLOW.INSTANCE_GROUP_COUNT
How many instances in a group
--waveglow.max_queue_delay_microseconds WAVEGLOW.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--waveglow.trt_max_workspace_size WAVEGLOW.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
hifigan:
--hifigan.max_sequence_idle_microseconds HIFIGAN.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--hifigan.max_batch_size HIFIGAN.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--hifigan.min_batch_size HIFIGAN.MIN_BATCH_SIZE
--hifigan.opt_batch_size HIFIGAN.OPT_BATCH_SIZE
--hifigan.preferred_batch_size HIFIGAN.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--hifigan.batching_type HIFIGAN.BATCHING_TYPE
--hifigan.preserve_ordering HIFIGAN.PRESERVE_ORDERING
Preserve ordering
--hifigan.instance_group_count HIFIGAN.INSTANCE_GROUP_COUNT
How many instances in a group
--hifigan.max_queue_delay_microseconds HIFIGAN.MAX_QUEUE_DELAY_MICROSECONDS
max queue delta in microseconds
--hifigan.trt_max_workspace_size HIFIGAN.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
Speech Synthesis Markup Language (SSML)¶
Riva 1.8.0 adds preliminary support for SSML. Only the FastPitch model is supported at this time. There are no plans to add this functionality to Tacotron2. The FastPitch model must be exported using NeMo 1.5.1 and the nemo2riva 1.8.0 tool. All SSML inputs must be a valid XML document and use the <speak> root tag. All non-valid XML, and all valid XML with a different root tag, is treated as raw input text. Riva currently supports the following in a limited capacity:
- prosody tag
  - pitch attribute
  - rate attribute
Pitch Attribute¶
Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio returned. This value corresponds to a pitch shift equal to the attribute value multiplied by the speaker's pitch standard deviation from when the FastPitch model was trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation was 52.185. For example, a pitch value of 1.25 results in a pitch shift up of 1.25 * 52.185 ≈ 65.23 Hz.
Riva also supports the prosody tags as per the SSML specs. Prosody tags x-low, low, medium, high, x-high, and default are supported.
The pitch attribute is expressed in the following formats:
pitch="1"
pitch="+1.8"
pitch="-0.65"
pitch="high"
pitch="default"
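For example, a minimal SSML input that raises the pitch for part of a sentence might look as follows (the <speak> root tag is required; the sentence and the 1.25 value are only illustrative):

<speak>
    The quarterly results <prosody pitch="1.25">exceeded expectations</prosody> this year.
</speak>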
Rate Attribute¶
Riva supports a % relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio returned. It also supports the prosody tags as per the SSML specs. Prosody tags x-low, low, medium, high, x-high, and default are supported.
The rate attribute is expressed in the following formats:
rate="35%"
rate="+200%"
rate="low"
rate="default"
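Similarly, a minimal SSML input that slows down part of a sentence might look as follows (the sentence and the 75% value are only illustrative):

<speak>
    Please note: <prosody rate="75%">the following section is spoken more slowly</prosody>.
</speak>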
Warning
The pitch attribute currently does not support Hz, st, and % changes. Support is planned for a future Riva release.
For SSML examples with sample audio, refer to the Riva_speech_API_demo notebook section.
Pretrained Models¶
Task | Architecture | Language | Dataset | Link
---|---|---|---|---
Mel Spectrogram Generation | FastPitch | English | LJSpeech |
Mel Spectrogram Generation | Tacotron2 | English | LJSpeech |
Vocoder | HiFi-GAN | English | LJSpeech |
Vocoder | WaveGlow | English | LJSpeech |
Pretrained Quickstart Pipelines¶
FastPitch + HiFiGAN:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <fastpitch_riva_filename>:<key> \
    <hifigan_riva_filename>:<key> \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt

Tacotron2 + WaveGlow:

riva-build speech_synthesis \
    <rmir_filename>:<key> \
    <tacotron_nemo_filename>:<key> \
    <waveglow_riva_filename>:<key> \
    --arpabet_file=cmudict-0.7b-nv0.01 \
    --abbreviations_file=abbr.txt