Release Notes#

Important

The Riva SDK release only supports embedded (L4T) platforms. For x86 data center deployments, refer to Riva ASR NIM, Riva TTS NIM, and Riva NMT NIM documentation.

Important

If you are upgrading from a previous Riva version, refer to the Upgrading section.

All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Riva Release 2.26.0#

Key Features and Enhancements#

Added support for the Nemotron-ASR English (en-US) streaming ASR model on Jetson Thor, with the Sortformer streaming speaker diarizer bundled into the same artifact.
Added Neural Machine Translation (NMT) support on Jetson platforms using the Megatron 1B Any-to-Any model.
Added WebSocket realtime API (default port 9000) and HTTP REST API (default port 9001) for ASR and TTS on Jetson platforms. The gRPC API continues to be served on port 50051.
Updated the Magpie-TTS multilingual model to add support for Hindi and Japanese and improve quality and stability on Jetson Thor.

Model Updates#

Added the Nemotron-ASR English (en-US) cache-aware Parakeet-RNNT 0.6B streaming model on Jetson Thor. The Sortformer diarizer is bundled as part of the same Tegra artifact and does not require a separate accessory download.
Added the NMT Megatron 1B Any-to-Any model on Jetson Thor for multilingual translation.
Updated the Magpie-TTS multilingual model with support for new languages, Hindi and Japanese.
Refer to the Support Matrix for more information.

Fixed Issues#

Fixed unsupported HTTP API requests on Jetson Thor.
Fixed WebSocket client requests outside the container.

Known Issues#

Long input text provided to Magpie TTS might produce truncated output. To mitigate this, break the input text into smaller sentences, ensuring each normalized sentence is 400 characters or fewer, with sentences separated by periods.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. To receive an Opus stream, use the Python clients.
Riva punctuation models add a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the neural-based voice activity detector in Riva ASR has a non-negligible impact on latency and throughput. For Silero VAD, a degradation in those metrics of the order of 10% to 15% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after executing riva_start.sh has slightly longer latency. Subsequent calls exhibit normal latency.
The Riva build does not support providing a 1-gram language model in .arpa format because of a limitation in the KenLM utility for building language model binaries.
Clients should not send empty strings to the Riva Translation API, as they may be mistranslated into short sentences.
Riva ASR client supports only 16 kHz one-channel format when using FLAC encoding.

Riva Release 2.24.0#

Key Features and Enhancements#

Riva now uses Triton 2.60.0 and TensorRT 10.13 with support for Blackwell GPU architecture.
Added support for Silero VAD and Sortformer diarizer in streaming mode with parakeet-1.1b ASR model for Jetson platforms.
Added support for Magpie-TTS model to Jetson platforms, capable of multilingual speech synthesis.
Added support for parakeet-rnnt-1.1b Unified Multilingual Code-Switch ASR model to Jetson platforms.
Added FP8 quantization for parakeet-CTC English ASR model.
Added an option to apply ITN and PnC on partial transcripts for Parakeet models.

Model Updates#

Added Magpie-TTS for Jetson platforms, a multilingual TTS model supporting English, Spanish, French, Italian, German, Vietnamese, and Mandarin, with high-quality Male and Female emotion subvoices.
Updated the parakeet-ctc-1.1b English ASR model with improved accuracy and channel robustness.
Refer to the Support Matrix for more information.

Fixed Issues#

Fixed memory leak in ASR decoder when checking max decoder pending tasks.
Fixed slow memory leak in flashlight lexicon decoder of ASR models.

Breaking Changes#

Deprecated x86 data center deployments. For x86 deployments, refer to Riva ASR NIM, Riva TTS NIM, and Riva NMT NIM documentation.
Deprecated the NVIDIA Jetson Orin platform. Supported Jetson platform is now NVIDIA Jetson Thor.
Deprecated support for NVIDIA Fleet Command.

Known Issues#

NMT functionality is not supported on embedded platforms in this release.
Long input text provided to Magpie TTS might produce truncated output. To mitigate this, break the input text into smaller sentences, ensuring each normalized sentence is 400 characters or fewer, with sentences separated by periods.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. To receive an Opus stream, use the Python clients.
Riva punctuation models add a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the neural-based voice activity detector in Riva ASR has a non-negligible impact on latency and throughput. For Silero VAD, a degradation in those metrics of the order of 10%-15% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after executing riva_start.sh results in slightly longer latency. Subsequent calls will exhibit normal latency.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility for building language model binaries.
Clients should not send empty strings to the Riva Translation API, as they may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.19.0#

Key Features and Enhancements#

Riva now uses Triton 2.54.0 and TensorRT 10.8 with support for Blackwell GPU architecture.
Added support for speaker diarization in streaming mode using the sortformer model, together with ASR inference.
Added support for multilingual speech synthesis in the TTS pipeline, using the Magpie-TTS model architecture.
Improved ASR end-of-utterance detection by adding a provision to use Voice Activity Detector (VAD) model for endpointing.
Updated NMT pipeline with TensorRT inference on Ampere and future GPU architectures to improve translation throughput.
Added support for using remotely deployed ASR or TTS servers with NMT speech-to-text (S2T) or speech-to-speech (S2S) deployment.
Simplified the quickstart configuration script to easily configure models and languages for deployment.

Model Updates#

Added Magpie-TTS, a multilingual TTS model supporting English, Spanish, and French, with high-quality Male and Female emotion subvoices.
Added the following ASR models: parakeet-rnnt-1.1b multilingual, parakeet-ctc-1.1b English (channel-robust), and parakeet-ctc-0.6b Mandarin (with Mandarin-English code switch capability).
Added Whisper-large-turbo (v3) multilingual and Kotoba-Whisper (v2.2) Japanese ASR models.
Refer to the Support Matrix for more information.

Fixed Issues#

Fixed an issue that caused a segmentation fault during offline ASR inference when invalid input parameters were provided.
Fixed an issue that could cause a segmentation fault during streaming ASR inference if no data was sent by clients after the request was configured.
Fixed an issue in TTS inference that caused audio clicks in synthesized speech when the requested sample rate differed from the model’s actual sample rate.
Fixed an issue in the TTS pipeline that caused a failure when a long query containing multiple sentences was provided.
Fixed an issue where duplicate language codes appeared in the any-to-any translation model configuration requested by clients.
Added a configuration parameter that enables mutual authentication via TLS between the Riva server and clients.

Breaking Changes#

Deprecated the NVIDIA Volta V100 and prior GPU architectures.

Known Issues#

With the Magpie TTS model enabled, the Riva server can have a high load time of around 3 minutes upon executing riva_start.sh.
The Magpie TTS model has a high latency when running multiple concurrent requests.
Long input text provided to Magpie TTS might produce truncated output. To mitigate this, break the input text into smaller sentences, ensuring each normalized sentence is 400 characters or fewer, with sentences separated by periods.
Streaming speaker diarization with ASR is currently a beta release and supports up to a maximum of 8 concurrent requests.
The Riva server on Jetson ARM platforms does not support streaming speaker diarization, Silero VAD, or Magpie TTS models.
Mandarin TTS using Fastpitch-HifiGAN TTS model is not supported in this release.
Canary ASR models use PyTorch for inference and offer a low throughput compared to other similar models.
Canary, Whisper, and Distil-Whisper ASR models do not support word boosting or the .ogg and .opus encoding formats for audio input.
The parakeet-ctc-1.1b-unified-ml-cs universal, concat, and EMEA ASR models are of beta quality and do not support inverse text normalization (ITN).
The parakeet-rnnt-1.1b multilingual ASR model does not support word boosting or inverse text normalization (ITN).
The parakeet-ctc-0.6b Mandarin ASR model has high memory requirements with language model (LM). It is recommended to use this model in greedy mode.
The parakeet-ctc-0.6b-unified English ASR model can show negative confidence scores and partial ITN functionality. Additionally, punctuation symbols may sometimes return in a separate final transcript.
Zero-shot TTS model may have a high load time of around 3 minutes upon executing riva_start.sh.
Dutch (nl-NL and nl-BE) conformer ASR models are of beta quality and are recommended to be used with ITN enabled by passing --verbatim_transcripts=false from the client. The nl-BE model is recommended to be used with Neural VAD enabled for better accuracy.
RADTTS++ and zero-shot TTS models are of beta quality and do not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
German Conformer unified ASR model may have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model produces transcripts with character timestamps only, not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model produces transcripts with punctuation only for the Japanese text.
Arabic ITN currently does not denormalize time, date, currency, or decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. To receive an Opus stream, use the Python clients.
Running nemo2riva with a FastPitch model and ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), so the accuracy for the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency compared to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client container.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Offline speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
The Portuguese punctuation model has poor accuracy with commas.
Riva punctuation models add a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the neural-based voice activity detector in Riva ASR has a non-negligible impact on latency and throughput. For Silero VAD, a degradation in those metrics of the order of 10%-15% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after executing riva_start.sh results in slightly longer latency. Subsequent calls will exhibit normal latency.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. For example, words containing “oe”, “ae”, or “ell”.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility for building language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
Clients should not send empty strings to the Riva Translation API, as they may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.18.0#

Key Features and Enhancements#

Riva now uses Triton 2.50.0 and TensorRT 10.4.
Moved to a unified container for Riva, supporting both server and service maker functionalities. A separate service maker image is now removed.
Added support for using Silero voice activity detector model along with ASR.
Added support for Canary ASR model architecture in offline mode, along with AST support.
Added support for T5 TTS model architecture which offers an improved voice quality.
Extended Riva server interface to support HTTP/REST APIs for offline ASR models.
Added support in NMT pipeline for inferencing models trained with BCP-47 language codes.
Added a tutorial that demonstrates the finetuning of parakeet architecture based ASR models.
Added support for clients to provide a custom translation dictionary for NMT text translation pipeline.

Model Updates#

Added two multilingual unified ASR models with parakeet-1.1b CTC architecture and EMEA multilingual ASR model with language model (LM) support.
Added the following ASR models: Canary-1b, Canary-0.6b-turbo, and Spanish-English parakeet-0.6b-unified.
Added T5 TTS model with high quality male and female emotion subvoices.
Updated the English-GB conformer ASR model with an improved language model (LM).
Improved the Mandarin text normalization (TN) model with additional postprocessing.
Updated the German TTS model with improved G2P pronunciation dictionary.
Added a Spanish-English inverse text normalization (ITN) model.
Updated NMT Megatron any-to-any model to support four additional languages.

Fixed Issues#

Fixed a gradual memory leak on repeated inferences with offline ASR models, particularly visible with longer audio inputs.
Fixed an issue that caused max_batch_size set from riva-build command not to be propagated through TTS pipeline.
Fixed an issue that could cause intermittent crash in some environments when running the punctuation pipeline.
Fixed the reporting of supported language codes by the Whisper ASR model, allowing clients to specify it as input while performing inference.
Fixed an issue that erroneously required Python TTS clients to set custom_dictionary optional argument during inference.

Known Issues#

Canary ASR and T5 TTS models use PyTorch for inference and offer a low throughput compared to other similar models.
Canary, Whisper and Distil-Whisper ASR models do not support word boosting and .ogg, .opus encoding formats for audio input.
The T5 TTS model is currently limited to support a max_batch_size of 1.
The parakeet-ctc-1.1b-unified-ml-cs universal, concat and EMEA ASR models are of beta quality and do not support inverse text normalization (ITN).
The parakeet-ctc-0.6b-unified English ASR model can show negative confidence scores and partial ITN functionality. Also, sometimes the punctuation symbols might get returned in a separate final transcript.
The T5 TTS and zero shot TTS model can have a high load time of around 3 minutes upon executing riva_start.sh.
The Neural G2P model packaged with TTS models does not support full context of the sentence and is only invoked at word level for out-of-vocabulary words.
Dutch (nl-NL and nl-BE) conformer ASR models are of beta quality and recommended to be used with ITN enabled by passing --verbatim_transcripts=false from the client. The nl-BE model is recommended to be used with Neural VAD enabled for better accuracy.
The T5 TTS, RADTTS++ and zero shot TTS models are of beta quality and do not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
Mandarin TTS output has inaccurate pronunciation for some polyphone characters.
German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Offline speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Portuguese punctuation model has poor accuracy with commas.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency. Subsequent calls will exhibit normal latency.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.17.0#

Key Features and Enhancements#

Riva now uses Triton 2.47.0 and TensorRT 10.1.
Added support for Parakeet-RNNT, Whisper, and Distil-Whisper ASR model architectures in offline mode.
Added support for AST (Automatic Speech Translation) using offline Whisper ASR models.
Added support for applying inverse text normalization (ITN) to intermediate transcripts for two-pass end-of-utterance detection.
Added support for configuring the maximum number of speakers for offline speaker diarization with each gRPC request from clients.
Updated TTS pipeline to add support for multiple sentences in SSML input and removed the 400-character limit for sentence length.
Added a parameter in TTS that allows users to provide a custom G2P dictionary to override the default pronunciation of specific words.
Added “Do not Translate” support in NMT using <dnt> and </dnt> tags to exclude specified parts of the input text from translation.
Added support for applying pitch, rate, and volume attributes to S2S output, and “Do not Translate” capability in S2S and S2T pipelines.

Model Updates#

Added ASR parakeet-1.1b RNNT English, whisper-large (v3), and distil-whisper-large (v3) models.
Updated the ASR Japanese conformer model to support transcription for Japanese and English through code-switching.
Updated TTS English-US models to prevent mispronunciations when inputs contain website names, special characters, or abbreviations.
Updated TTS Spanish-US model to improve the G2P dictionary with better pronunciations.

Fixed Issues#

Fixed an issue in gRPC server that could cause failures when using multiple Triton servers for inference.
Fixed a memory leak in ASR that occurred when handling multiple inference requests.
Fixed a segmentation fault in ASR that happened during offline speaker diarization for some long duration audio inputs.
Fixed an issue in TTS where the /l/ phoneme was not correctly processed when provided in SSML tags.
Fixed multiple issues in NMT that affected the translation of digits or numbers in the input text.

Known Issues#

Whisper and Distil-Whisper ASR models do not support word boosting and .ogg, .opus encoding formats for audio input.
The parakeet-ctc-1.1b-unified-ml-cs EMEA ASR model is of beta quality and does not support inverse text normalization (ITN).
The parakeet-ctc-0.6b-unified English ASR model can show negative confidence scores and partial ITN functionality. Also, sometimes the punctuation symbols might get returned in a separate final transcript.
The zero shot TTS model can have a high load time of around 3 minutes upon executing riva_start.sh. In rare cases, the first inference call to this model can throw a timeout error, but subsequent inference calls will proceed normally.
The Neural G2P model packaged with zero shot TTS model does not support full context of the sentence and is only invoked at word level for out-of-vocabulary words.
Dutch (nl-NL and nl-BE) conformer ASR models are of beta quality and recommended to be used with ITN enabled by passing --verbatim_transcripts=false from the client. The nl-BE model is recommended to be used with Neural VAD enabled for better accuracy.
The RADTTS++ and zero shot TTS models are of beta quality and do not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
Mandarin TTS output has inaccurate pronunciation for some polyphone characters.
German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Offline speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Portuguese punctuation model has poor accuracy with commas.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency. Subsequent calls will exhibit normal latency.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.16.0#

Key Features and Enhancements#

Added support for two-pass end-of-utterance detection in the ASR pipeline to improve latency for receiving the final ASR result.
Added the capability to configure endpointing parameters in the ASR pipeline with each gRPC request from clients.
Added an option in the TTS Python client to list the available voice names with the current model deployed on the server.

Model Updates#

Added ASR parakeet-ctc-0.6b-unified English flashlight model with unified language model (LM) support.
Updated zero shot TTS model to enhance the quality of synthesized speech and improve the neural G2P model.
Updated Spanish-US TTS model to add two new emotions: Male-Fearful and Male-Sad.

Fixed Issues#

Fixed inference timeout errors seen on some GPUs with TTS fastpitch models used for synthesizing speech.
Fixed an issue in ASR that could cause segmentation faults when word boosting or speech hints are enabled in client requests.
Fixed a memory leak issue in ASR that caused a continuous increase in host memory usage over time during an active gRPC session.
Fixed translation backend issues to allow concurrent execution of batched input requests.
Fixed inference failure for zero shot TTS on the Jetson Orin platform when using prebuilt voices to synthesize speech.

Known Issues#

The parakeet-ctc-1.1b-unified-ml-cs EMEA ASR model is of beta quality and does not support inverse text normalization (ITN).
The parakeet-ctc-0.6b-unified English ASR model can show negative confidence scores and partial ITN functionality. Also, sometimes the punctuation symbols might get returned in a separate final transcript.
The zero shot TTS model can have a high load time of around 3 minutes upon executing riva_start.sh. In rare cases, the first inference call to this model can throw a timeout error, but subsequent inference calls will proceed normally.
The Neural G2P model packaged with zero shot TTS model does not support full context of the sentence and is only invoked at word level for out-of-vocabulary words.
Dutch (nl-NL and nl-BE) conformer ASR models are of beta quality and recommended to be used with ITN enabled by passing --verbatim_transcripts=false from the client. The nl-BE model is recommended to be used with Neural VAD enabled for better accuracy.
The RADTTS++ and zero shot TTS models are of beta quality and do not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
Mandarin TTS output has inaccurate pronunciation for some polyphone characters.
German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Offline speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long input text provided to TTS client might fail with a failed during inference error. By default, the maximum allowed length for each sentence (separated by period) within the input text is limited to 400 characters.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency. Subsequent calls will exhibit normal latency.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.15.1#

Fixed Issues#

Fixed host path usage in helm charts due to security context being applied.
Fixed offline speaker diarization on the Jetson Orin platform.
Fixed an issue that could cause breaks in audio synthesized using RADTTS++ (beta) emotion mixing model.
Updated nemo2riva to use the same onnx versions as supported in Riva server to fix deployment issues.

Model Updates#

Added ASR parakeet-ctc-0.6b-unified English, parakeet-ctc-1.1b English and parakeet-ctc-1.1b-unified-ml-cs EMEA models.
Added NMT megatron 1B any to any translation model.
Updated pronunciation dictionary and normalizer for english TTS models.

Known Issues#

The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error on some GPUs. Subsequent calls will exhibit normal latency.
The parakeet-ctc-1.1b-unified-ml-cs EMEA ASR model is of beta quality and does not support inverse text normalization (ITN).
The parakeet-ctc-0.6b-unified English ASR model can show negative confidence scores and partial ITN functionality. Also, sometimes the punctuation symbols might get returned in a separate final transcript.
The Neural G2P model packaged with zero shot TTS model does not support full context of the sentence and is only invoked at word level for out-of-vocabulary words.
Dutch (nl-NL and nl-BE) conformer ASR models are of beta quality and recommended to be used with ITN enabled by passing --verbatim_transcripts=false from the client. The nl-BE model is recommended to be used with Neural VAD enabled for better accuracy.
The RADTTS++ model is a beta model for mixing emotions and does not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
Mandarin TTS output has inaccurate pronunciation for some polyphone characters.
German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Offline speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long input text provided to TTS client might fail with a failed during inference error. By default, the maximum allowed length for each sentence (separated by period) within the input text is limited to 400 characters.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.15.0#

Note

Users upgrading to 2.15.0 from previous versions must run riva_clean.sh followed by riva_init.sh using the Quick Start scripts. If you are using a .riva file (either prebuilt or custom), you must rerun riva-build for the existing models using latest model versions present on NGC.

Key Features and Enhancements#

Riva now uses Triton 2.40.0 and TensorRT 8.6 with CUDA 12 support.
Updated ASR pipeline to use Triton BLS backend architecture, that is able to deliver ~40% better latency and throughput.
Added support for Zero Shot speech synthesis using audio prompt in TTS pipeline, along with Neural G2P inference for out-of-vocabulary words. The Zero Shot Riva TTS model is currently under limited early access.
Added provision to keep a running transcript buffer to improve punctuation accuracy by ~10% during streaming ASR.
Updated Helm chart to run Riva and Triton servers in separate pods, allowing scaling and deployment across multiple GPUs.
Added tutorials for NMT synthetic data generation and fine-tuning multilingual NMT models with Nvidia NeMo.

Model Updates#

Added a new ASR model architecture (Parakeet) and included parakeet-ctc-riva-0-6b-en-us ASR model. It brings a relative improvement of ~11% over conformer-ctc-L-en-us ASR model and ~24% over parakeet-ctc-0.6b (Nemo version) ASR model on Hugging Face.
Added Dutch nl-NL and nl-BE (Beta) Conformer ASR, BERT-base punctuation and Inverse text normalization (ITN) models.
Updated English-US BERT punctuation model with ~7% relative accuracy improvement. Added English-US BERT-large punctuation model that delivers an additional ~1.5% relative accuracy improvement over BERT-base.
Updated Mandarin (zh-CN) Conformer ASR model to support Mandarin-English code-switch and removed Mandarin-English (zh-en-CN) Conformer multilingual code-switch ASR model.
Added a Zero Shot TTS model (beta) for speech synthesis using audio prompt, along with Neural G2P model (beta). The Zero Shot Riva TTS model is currently under limited early access.
Improved Mandarin TTS model to handle pauses better and updated Spanish-US TTS model to remove narrator speaker.
Added Megatron 1B en to any NMT model.

Fixed Issues#

Fixed truncation of translated text for NMT en to any models by supporting max_gen_delta parameter as a riva-build argument.
Fixed an issue which can cause unwanted spaces around punctuation characters in NMT translation output.
Fixed issues in NMT text translation binary client to handle num_iterations and batch_size parameters better.
Corrected NGC CLI binary used on linux ARM platforms to fix riva_init failure.
Fixed an issue that resulted in an error when setting edge values for SSML volume attribute in TTS clients.
Resolved an issue causing word boosting failure when both word boosting and speech hints are specified.
Updated dependencies in nemo2riva required for converting models from .nemo to .riva in python3.10 environment.

Breaking Changes#

Deprecated the Jetson Xavier AGX and NX platforms, as Jetson platforms are updated to use Jetpack 6.0 DP image which does not support Xavier. Supported Jetson platform now is Jetson Orin.
Deprecated Jasper, Quartznet, Citrinet ASR model architectures. Supported ASR architectures now are Conformer and Parakeet.
Deprecated all NLP models and APIs, except for punctuation and capitalization.
All Conformer ASR and BERT punctuation .riva models are published on NGC with onnx-opset=18 with an updated version name. Users must use the latest versions with Riva 2.15.0 and further, previous versions are not compatible anymore.

Known Issues#

The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error on some GPUs. Subsequent calls will exhibit normal latency.
The Neural G2P model packaged with zero shot TTS model does not support full context of the sentence and is only invoked at word level for out-of-vocabulary words.
Dutch (nl-NL and nl-BE) conformer ASR models are of beta quality and recommended to be used with ITN enabled by passing --verbatim_transcripts=false from the client. The nl-BE model is recommended to be used with Neural VAD enabled for better accuracy.
The RADTTS++ model is a beta model for mixing emotions and does not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
When generating .riva models from .nemo using nemo2riva, the nemo:23.08 image is not compatible with Riva due to updated Torch version. To avoid any Riva deployment issues, the recommendation is to continue using the last working NeMo image.
Mandarin TTS output has inaccurate pronunciation for some polyphone characters.
German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Offline speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Offline speaker diarization currently does not work on the Jetson Orin platform.
Long input text provided to TTS client might fail with a failed during inference error. By default, the maximum allowed length for each sentence (separated by period) within the input text is limited to 400 characters.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.14.0#

Key Features and Enhancements#

Added support in TTS for mixing multiple emotions through SSML input, using RADTTS++ (beta) model.

Model Updates#

Added Mandarin-English Conformer multilingual code-switch ASR model.
Added Spanish-US multi-speaker and RADTTS++ (beta) emotion mixing TTS models.
Added Megatron 1B any to en NMT model.

Fixed Issues#

Fixed empty transcripts from NMT when batch size > 8 is tried from the client.

Known Issues#

Mandarin-English Conformer multilingual code-switch ASR model does not support punctuations.
The RADTTS++ model is a beta model for mixing emotions and does not fully support all functionality like pitch, rate, volume SSML attributes, and so on.
When generating .riva models from .nemo using nemo2riva, the nemo:23.08 image is not compatible with Riva due to updated Torch version. To avoid any Riva deployment issues, the recommendation is to continue using the last working NeMo image.
Mandarin TTS output has inaccurate pronunciation for some polyphone characters. Also, the audio might sound less natural due to pauses within sentences.
German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Multilingual Spanish-English code-switching ASR model uses Spanish punctuation by default and does not punctuate the English text.
When using a single NVIDIA Triton server in a Riva Helm chart, all ASR models must be deployed on the same GPU due to a limitation from the feature extractor.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.13.1#

For detailed Release Notes, refer to Riva Release 2.13.0.

Fixed Issues#

Fixed ASR word confidence score to have values in [0-1] range.
Fixed Helm charts to allow custom model deployment from any NGC org/team.

Model Updates#

Added Mandarin TTS model with male and female emotion subvoices.

Known Issues#

Mandarin TTS output has inaccurate pronunciation for some polyphone characters. Also, the audio might sound less natural due to pauses within sentences.

Riva Release 2.13.0#

Key Features and Enhancements#

Added support in TTS for synthesizing speech in non-English languages.
Added TTS multi-speaker adapter IPA pretrained .nemo checkpoint and a tutorial on how to finetune it for smaller datasets.
Added support for tagging gRPC request and response with unique identifier.

Model Updates#

Added German Conformer unified and updated Spanish-English Conformer multilingual code-switch ASR models.
Added Japanese-English Conformer unified multilingual code-switch and updated English ASR models.
Added Spanish and Italian TTS models with male and female voices, and German TTS model with male voice.

Fixed Issues#

Simplified translation documentation to ease deployment and used consistent naming for translation clients.
Fixed speech translation clients to support microphone input and logging of performance metrics.
Fixed an issue in ASR that caused intermittent transcript inaccuracy on multiple runs in some cases.
Corrected timestamps in ASR result for character based languages, Japanese and Mandarin.
Fixed profane words filtering in ASR result transcripts in case when greedy decoder is used.

Breaking Changes#

The denoiser arguments used in riva-build when building TTS models has been renamed to postprocessor to better reflect what occurs in that step. The postprocessor is currently used to cross-fade audio chunks and is not used for denoising.

Known Issues#

German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.
Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.
Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuations only for the Japanese text.
Multilingual Spanish-English code-switching ASR model uses Spanish punctuation by default and does not punctuate the English text.
When using a single NVIDIA Triton server in a Riva Helm chart, all ASR models must be deployed on the same GPU due to a limitation from the feature extractor.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.12.1#

For detailed Release Notes, refer to Riva Release 2.12.0.

Fixed Issues#

Updated Helm charts to fix an issue that can cause deployment failure in some environments.

Riva Release 2.12.0#

Key Features and Enhancements#

Updated Helm charts to support model deployment on multiple NVIDIA Triton servers.
Updated Helm charts to assign models to specific GPUs when using single NVIDIA Triton server with multiple GPUs.

Model Updates#

Added Mandarin Conformer unified and Spanish-English Multilingual code-switching ASR models.
Updated Italian Conformer and Japanese Conformer unified ASR models.
Added emotion sub-voices for FastPitch and RAD-TTS models.

Fixed Issues#

S2S output in OPUS encoded format would sometimes have intermittent glitches. This issue has been fixed.
The Conformer unified ASR model always returned punctuated output irrespective of the --automatic_punctuation flag. This issue has been fixed.
S2S service is updated to return the appropriate gRPC status for different error scenarios.

Known Issues#

Multilingual Spanish-English code-switching ASR model uses Spanish punctuation by default and does not punctuate the English text.
When using a single NVIDIA Triton server in a Riva Helm chart, all ASR models must be deployed on the same GPU due to a limitation from the feature extractor.
Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.11.0#

Key Features and Enhancements#

Added a new service called Speech-to-Speech Translation (S2S). Riva S2S translates audio between language pairs, that is, from one language to another.
Added a new service called Speech-to-Text Translation (S2T). Riva S2T transcribes audio between language pairs, that is, from one language to another.
Added two new Riva S2S and S2T APIs, StreamingTranslateSpeechToSpeech and StreamingTranslateSpeechToText.

Model Updates#

Added the Conformer unified Japanese ASR model, which is an acoustic model trained with punctuation symbols as part of its vocabulary. This helps in getting more accurate punctuations within transcriptions.

Fixed Issues#

The --phone_dictionary_file and --mapping_file arguments for riva-build of the TTS pipeline now accepts relative paths.

Breaking Changes#

Triton backend config of CTC decoder has backward incompatible changes. So the model repositories generated by earlier Riva release are not compatible. Please generate new model repository by running riva_init.sh as mentioned in Quick Start steps.

Known Issues#

S2S output in OPUS encoded format can have intermittent glitches. This issue is not observed with PCM output from S2S.
The Conformer unified ASR model always returns punctuated output irrespective of the --automatic_punctuation flag.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.10.0#

Key Features and Enhancements#

Added RadTTS support for speech synthesis. In the default configuration, use English-US-RadTTS as the voice_name to use the RadTTS model. English-US defers to the FastPitch model.
Upgraded the following software versions on embedded platforms:

Model Updates#

Added new Punctuation and Capitalization models for Japanese (jp-JP) and Russian (ru-RU) languages.
Updated Conformer L ASR models for Arabic (ar-AR), Spanish (es-US), Portuguese (pt-BR), and Mandarin (zh-CN) languages.
Added RadTTS and HiFi-GAN RadTTS TTS models with IPA alphabet for English (en-US)
Updated language model for Arabic (ar-AR)

Fixed Issues#

The pitch SSML attribute supports ‘Hz’

Known Issues#

The --phone_dictionary_file and --mapping_file arguments for riva-build of the TTS pipeline does not work for relative paths.
Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.
Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.
On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.9.0#

Key Features and Enhancements#

Riva now supports Opus encoding (in the TTS service) and decoding (in the ASR service). In ASR, you can submit .ogg and .opus audio files to transcode. In TTS, you can choose an option to receive a serialized opus-encoded stream. A deserializer for that stream is also provided. For more information, refer to sample clients.
Added a new service called Riva translation. Riva translation translates text between language pairs, that is, from one language to another.
Added two new Riva translation APIs, TranslateText and ListSupportedLanguagePairs.
Lexicon free decoding with a character based LM. See Flashlight Decoder Lexicon Free for details.

Model Updates#

Added four multilingual models and 10 bilingual models for NMT. Refer to NMT Customizing for more information.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Deprecated and Removed Features#

TAO Toolkit support for Riva is now deprecated. We recommend you use NVIDIA NeMo to fine-tune pretrained models on a custom data set.

Known Issues#

Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency with regard to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency with regard to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note
  
  Changing the default value may lead to lower performance and quality.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.
On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.
Clients should not send empty strings to Riva Translation API, these may be mistranslated into short sentences.
Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.

Riva Release 2.8.1#

For detailed Release Notes, refer to Riva Release 2.8.0.

Fixed Issues#

SSML prosody tags with the new FastPitch IPA model now apply prosody in the correct locations.

Riva Release 2.8.0#

Important

We recommend using the Riva 2.8.1 (22.11.1) release instead of version 2.8.0.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Added a punctuation and capitalization model for the ASR EMEA Spanish (es-ES), Japanese (ja-JP), Korean(ko-KR), Brazilian Portuguese (pt-BR) and Italian (it-IT) model.
Added Conformer-L models for the ASR EMEA Spanish (es-ES), Japanese (ja-JP), Italian (it-IT) and Arabic (ar-AR) models.
Added Citrinet-1024 model for ASR EMEA Spanish (es-ES)
Updated Citrinet-1024 models for ASR Russian (ru-RU) and French (fr-FR)
Deployed model configs can be requested via a gRPC command to Riva
The speech synthesis pretrained model uses the International Phonetic Alphabet (IPA) for inference and training instead of ARPAbet. Refer to the Known Issues section regarding the SSML prosody tag.
Added support for Non-Overlapping Speaker Diarization in case of Offline Recognition. This is a alpha release of this feature, so it is not enabled by default. User needs to uncomment the rmir_diarizer_offline model in the Quick Start config.sh before running riva_init.sh to enable the feature.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

en ITN models now handle bank cards with 14 and 15 digits
The Conformer ASR model recipes have been updated with --endpointing.residue_blanks_at_start=-2 to better match NeMo WER.
The Spanish punctuation models used in the ASR model recipes now preserve accents.
The riva-build command for NLP models have been updated such that --nlp_pipeline_backend.to_lower and --nlp_pipeline_backend.tokenizer_to_lower have been removed. Use --to_lower and --tokenizer_to_lower.

Deprecated and Removed Features#

The following features have been deprecated.

Speech synthesis with ARPABET for inference and training

Known Issues#

Arabic ASR acoustic model is targeted for Modern Standard Arabic (MSA), therefore, the accuracy of the Lebanese accent might be poor.
Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency relative to other languages.
Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency relative to other languages.
Orientation of output (word timestamps) is disrupted with Arabic while using riva\_streaming\_asr\_client and riva\_asr\_client in the client Docker.
Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.
Japanese punctuation does not work well with numbers and English characters.
Speaker diarization is an alpha release and will increase ASR latency if enabled.
Long SSML input that worked with previous TTS ARPAbet models might fail with a “failed during inference” error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.
- To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:
  - Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)
  - In the Docker workspace cd /data/models/tts_preprocessor-English-US
  - In config.pbtxt, edit the value of key max_sequence_length value to 500. Save and exit Docker.
  - Continue with the rest of the Quick Start steps: riva_start.sh
  Note: Changing the default value may lead to lower performance/quality.
SSML prosody tags with the new FastPitch IPA model will lead to the prosody being applied in later parts of the text and not where the user tags them. If the prosody tags are needed, use the older FastPitch ARPAbet model released with Riva 2.7.0 and older.
Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.
Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
Conformer acoustic models fine-tuned with TAO Toolkit and deployed in Riva with the recommended riva-build parameters from Building and Deploying ASR Pipelines can lead to empty transcripts at inference time. To workaround this problem, pass the --nn.use_trt_fp32 parameter to riva-build. This will be fixed in a future version of TAO Toolkit.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.
On Jetson NX Xavier, the German ASR model does not fit into the available 8 GB RAM.

Riva Release 2.7.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Added ITN support for fr-FR
Updated ITN models for en-US, es-US:
- en-US 2.0
  - Support for credit cards
  - Indian numbering (lakhs, crores, and so on)
  - Numeric sequences (phone numbers, credit cards, SSN, and so on)
  - Support for double, triple in numeric sequences in the above numeric sequences (“double five triple eight nine six four seven two” -> 558-889-6472)
  - Alphanumeric sequences (H1N1),
  - Currencies of various countries and cryptocurrencies
- es-US 2.0
  - Currencies
  - Fractions
  - Measurements
  - Math
  - Telephone (country codes and extensions)
Added a punctuation and capitalization model for the ASR United Kingdom English (en-GB) model.
Added Citrinet-1024 and Conformer-L models for the ASR Portuguese Brazilian (pt-BR) and Korean (kr-KR) models.
The ASR Mandarin language model is now pruned.
Deployed model configs can be requested via a gRPC command to Riva

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

en ITN models now handle bank cards with 14 and 15 digits
The Conformer ASR model recipes have been updated with --endpointing.residue_blanks_at_start=-2 to better match NeMo WER.
The Spanish punctuation models used in the ASR model recipes now preserve accents.
The riva-build command for NLP models have been updated such that --nlp_pipeline_backend.to_lower and --nlp_pipeline_backend.tokenizer_to_lower have been removed. Use --to_lower and --tokenizer_to_lower.

Deprecated and Removed Features#

The following features have been removed.

Tacotron 2 and WaveGlow model support

Known Issues#

Riva punctuation models adds a period if the input text is empty.
Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.
Korean and Brazilian Portuguese Citrinet models have low throughput in offline mode.
The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
Conformer acoustic models fine-tuned with TAO Toolkit and deployed in Riva with the recommended riva-build parameters from Building and Deploying ASR Pipelines can lead to empty transcripts at inference time. To workaround this problem, pass the --nn.use_trt_fp32 parameter to riva-build. This will be fixed in a future version of TAO Toolkit.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin Conformer acoustic model leads to high latency. We recommend using a greedy decoder with the Mandarin Conformer acoustic model.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.
On Jetson NX Xavier, the German ASR model does not fit into the available 8 GB RAM.

Riva Skills Release 2.6.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

ASR word level timestamps and confidences for all alternatives. This is an experimental feature. The accuracy of these confidences is not guaranteed.
Utterance level confidences for all alternatives. This is an experimental feature. The accuracy of these confidences is not guaranteed.
Option to use a neural-based voice activity detector in ASR to filter out noise from the audio and potentially reduce spurious words from appearing in ASR transcripts.
Added support for the SSML emphasis tag in Riva TTS.
Model updates:
- Version 3.0 of the Conformer Hindi ASR model is now available.
- Version 2.1 of the Conformer French ASR model is now available.
- New pruned ASR language models are available for German, English, Hindi, and Russian.
- New ITN models are available for French, English, and Spanish.
- New BERT-based punctuation models are available for English and French.
- Riva TTS English-US model supports emphasis outputs

Breaking Changes#

The riva-build parameters starting with --vad.<parameter_name> must be changed to --endpointing.<parameter_name>.
The riva-build parameters --vad.vad_start_history and --vad.vad_stop_history are now --endpointing.start_history and --endpointing.stop_history respectively.
The riva-build option --vad_type now has two possible values none and neural, and is used to select the pre-acoustic model voice activity detection algorithm used in Riva ASR (refer to Neural-Based Voice Activity Detection for more information).
The riva-build option --endpointing_type now has two possible values none and greedy_ctc, and is used to select the post-acoustic model end-pointing algorithm used in Riva to detect beginning/end of utterances (refer to Beginning/End of Utterance Detection for more information).

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

An option was added to the Riva Helm chart to optionally remove all models before deployment. This is to address an issue where models from a previous version of Riva could get reused, causing an error when creating pods.
Fixed an issue with our punctuator model that caused the riva-build parameter pad_chars_with_space to be ignored.

Deprecated and Removed Features#

Tacotron 2 and WaveGlow will be removed in Riva 2.7.0.

Limitations#

The emphasis tag has a few limitations:
- Feature support is dependent on training data and will only work on models trained with data containing emphasis samples.
- Use the tag around individual words; not around multiple words. "<emphasis>Hello</emphasis> <emphasis>World</emphasis>!" is valid. "<emphasis>Hello World!</emphasis>" is not.
- No other SSML tags can be nested inside of the emphasis tag.
- The tag does not support the level attribute.
Currently, the profanity filter feature does not support symbolic languages (for example, Japanese, Chinese, and so on).

Known Issues#

The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.
Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.
Conformer acoustic models fine-tuned with TAO Toolkit and deployed in Riva with the recommended riva-build parameters from Building and Deploying ASR Pipelines can lead to empty transcripts at inference time. To workaround this problem, pass the --nn.use_trt_fp32 parameter to riva-build. This will be fixed in a future version of TAO Toolkit.
Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.
The use of the OpenSeq2Seq decoder with the Mandarin Conformer acoustic model leads to high latency. We recommend using a greedy decoder with the Mandarin Conformer acoustic model.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.
The Riva server does not return timestamps for every Mandarin character in the transcript.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.
On Jetson NX Xavier, the German ASR model does not fit into the available 8 GB RAM.

Riva Release 2.5.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

FastPitch models now support ragged batching for improved throughput. Starting in Riva 2.5.0, all newly exported FastPitch models will enable the ragged batching feature. Note that older FastPitch checkpoints must be exported again to enable the ragged batching feature.
Riva ASR now supports profanity filtering. Refer to the Profanity Filter section for more details.
Model updates:
- The new single TTS English-US multi-speaker model replaces the previous two single-speaker models setup.
- Version 3.0 of the Conformer Mandarin ASR model is now available.
- Version 2.0 of the Conformer Russian ASR model is now available.
Upgraded the following embedded hardware and software versions:

Breaking Changes#

The default value of the asr_model_delay parameter used by ASR decoders has been changed from 12 to 0.
The Riva client and server Docker images are now combined into a single Docker image.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Changed the default value of the asr_model_delay parameter from 12 to 0, which should help prevent word timestamps with negative values.
Changed the output processed_text from the TTS pipeline to match the behavior in the preprocessor. When a character is passed to TTS and does not exist in the mapping file, the preprocessor removes this character prior to tokenization. Likewise, these characters will be removed from the processed_text output.
Fixed a bug in the punctuator model that prevented proper punctuation of transcripts that included square brackets.
Fixed a bug to properly cancel ongoing RPCs when the Riva server is shut down.

Limitations#

Currently, the profanity filter feature does not support symbolic languages (for example, Japanese, Chinese, and so on).

Known Issues#

Loading a FastPitch model with ragged batching support results in Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.
On Jetson platforms, the first run of riva_tts_client after riva_start.sh in offline mode can throw a timeout error. This will be fixed in a future release of Riva.

Riva Release 2.4.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Added support for SSML sub-tags for speech synthesis.
Added support for ARM-based deployments.
Model updates:
- Updated the Conformer en-US speech recognition model.
- Added a Conformer fr-FR, en-GB, and zh-CN speech recognition models.
- Added new punctuation models for fr-FR and hi-IN.

Breaking Changes#

The riva-build parameter --vad.vad_type, which is used to select the type of VAD heuristic to use in the ASR pipeline, has been replaced by --vad_type.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed an issue that caused ASR word timestamps to have extremely large values.

Deprecated and Removed Features#

The following features have been deprecated.

The Tacotron 2 and WaveGlow TTS pipeline are now deprecated and will be removed in a future version of Riva. Consider switching to the FastPitch and HiFi-GAN pipeline, which is faster, more robust, and has similar quality as the Tacotron 2 and WaveGlow TTS pipeline.

Known Issues#

The French punctuation model sometimes omits punctuation marks. An improved punctuation model will be provided in the next release.
Word timestamps in ASR transcripts can be inaccurate for some audio files and ASR models.
The use of the OpenSeq2Seq decoder with the Mandarin Conformer acoustic model leads to high latency. This will be fixed in a future version of Riva. Until then, we recommend to use a greedy decoder when using the Mandarin Conformer acoustic model.
On Jetson Xavier NX, the pre-configured Hindi ASR pipelines from the Quick Start scripts do not fit into the available 8 GB RAM due to the large language model being used. This will be fixed in a future release of Riva.
The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example. This will be fixed in future release of Riva.

Riva Release 2.3.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Support has been added for the volume attribute of the <prosody> SSML tag to control the volume of synthesized speech. In order to use this tag, the FastPitch .riva file must be rebuilt from a .nemo or .tao file.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Deprecated and Removed Features#

The following features have been deprecated.

The Tacotron 2 and WaveGlow TTS pipeline will be deprecated in a future version of Riva. Consider switching to the FastPitch and HiFi-GAN pipeline, which is faster, more robust, and has similar quality as the Tacotron 2 and WaveGlow TTS pipeline.

Riva Release 2.2.1#

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed a throughput performance regression in the speech synthesis service.
Return properly-punctuated words in WordInfo objects in offline speech recognition mode.
When word boosting in speech recognition, a warning instead of an error is returned when requested words cannot be boosted.

Riva Release 2.2.0#

Important

We recommend using the Riva 2.2.1 (22.05.1) release instead of v2.2.0.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Riva supports the NVIDIA Jetson Orin platform.
Punctuation models support arbitrary sequence length, and no longer truncate inputs.
Added the option to share the feature extractor between multiple ASR pipelines.
Model updates:
- Added new Hindi speech recognition model (Conformer).
- Improved the Mandarin language model.
- Added Mandarin punctuation support.

Breaking Changes#

In the intent_slot pipeline, the --contextual command-line option is removed. The contextual mode behavior is still supported by the Riva client API and ServiceMaker using the contextual model config attribute. The default is false.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed an issue in TTS where the pitch and rate attributes were not applied where specified.
Fixed an issue reading non-standard wav headers that could cause marginally increased latency returning first result.
Fixed improperly required channel_count in speech recognition request configuration.
Fixed a potential crash when deploying TTS for a novel language with text normalization disabled.

Known Issues#

The Mandarin punctuation model clips the output when there are English words present in the input text.
The Mandarin punctuation model accuracy is low compared to other languages. It will be improved in a future version of Riva.
The Riva server currently does not return timestamps for every Mandarin character in the transcript. This will be fixed in a future version of Riva.
On the Jetson Xavier NX, the German ASR model doesn’t fit into the available 8 GB RAM.

Riva Release 2.1.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Added text normalization options as part of the riva-build process. Refer to the TTS Pipeline Configuration section for more information.
Added multiple tutorials.

Breaking Changes#

Removed the following environment variables related to text normalization in TTS: NORM_PROTO_CONFIG and NORM_PROTO_PATH.
In previous versions, TTS used text normalization by default if none is specified. Now, text normalization will not be performed if none is specified.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Riva Release 2.0.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Riva supports Linux ARM64 platforms, that is, NVIDIA Jetson AGX Xavier™ and NVIDIA Jetson NX Xavier, referred to as embedded throughout the documentation.
Riva provides two new pretrained TTS voices that are easily deployable via the Quick Start scripts.
Phoneme SSML tags support manually overriding pronunciations.
SSL/TLS connections to the Riva server are supported.
There is a new option for generating additional tokenization’s for words in the lexicon (this is an experimental feature, which may boost recognition accuracy).
Inverse text normalization grammars must be provided during the riva-build stage to allow customizations for inverse text normalization.
Ability to add opt-in API key for sending telemetry back to NVIDIA.

Breaking Changes#

All legacy Jarvis APIs have been removed and are no longer supported.
The returned type of audio waveform from the Riva TTS service is now int16 to be compatible with the linear PCM wave format currently supported by Riva.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed an issue in ServiceMaker that caused punctuation and capitalization models generated with recent NeMo versions to lead to inaccurate results.
Fixed an issue that could lead to a crash when using word boosting.

Known Issues#

Deployment of Citrinet models for offline recognition can fail during the riva-deploy phase if large chunk sizes are used. To workaround this issue, we recommend passing parameter max-dim=100000 to nemo2riva when converting the .nemo model to .riva. This will enable using a chunk size up to 900 seconds during the riva-deploy phase.
On embedded platforms, the ASR examples in asr-python-basics and asr-python-boosting Jupyter notebooks do not work by default, since they invoke offline recognition API and embedded platforms do not have an offline ASR model enabled by default. To get these examples working, you need to either deploy an offline ASR model or modify the examples to use streaming recognition API.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Release 1.10.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.10.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Riva 1.10.0 beta now uses Triton 2.19.0 and TensorRT 8.2
The default behavior of Riva TTS’s G2P pipeline has changed. Words that have multiple phonetic representations now default to use graphemes. This was done to match the default NeMo training behavior. To revert to the old behavior, please add --preprocessor.g2p_ignore_ambiguous=False to riva-build.
ASR word boosting at request time is supported in Riva. This feature allows you to provide a list of words that should be given a higher score when decoding the output of the acoustic model. Refer to the gRPC ASR protobuf file (protobuf-docs-asr) for more information on how to include boosted words with the ASR request.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed an issue that can cause acoustic models exported from NeMo 1.5+ to incorrectly include spaces in transcript.
Fixed an issue in nemo2riva preventing conversion of models from NeMo version less than 1.3.0.
Fixed an issue that could lead to irregular rhythm of speech when a TTS model was trained with mixed representation input.
Fixed an issue that can cause incorrect transcripts when the server is under a heavy load.

Known Issues#

The Riva Speech Samples image nvcr.io/nvidia/riva/riva-speech-client:1.10.0-beta-samples does not exist. Use nvcr.io/nvidia/riva/riva-speech-client:1.8.0-beta-samples instead.
The ASR word boosting feature in Riva currently does not support boosting of phrases or combination of words. This will be supported in a future version of Riva.
nemo2riva and riva-build is currently broken for newer WaveGlow NeMo TTS checkpoints. As a workaround, use this WaveGlow.riva file instead: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechsynthesis_waveglow/files.

Riva Release 1.9.0 Beta#

Note

Users upgrading to 1.9.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Improved customization for Automatic Speech Recognition (ASR) Spanish (es-US) and German (de-DE) language models.
The rate SSML attribute supports x-low, low, medium, high, x-high, and default.
The pitch SSML attribute supports x-low, low, medium, high, x-high, and default.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Known Issues#

The pretrained model used to add punctuation and capitalization to ASR transcripts supports a maximum input length of 128 tokens. Currently, if an ASR transcript containing more than 128 tokens is passed to the punctuation and capitalization model, it will be truncated to 128 tokens. This will be addressed in a future release of Riva.
The pitch SSML attribute is not currently in compliance with the SSML specs, and does not support Hz, st, % changes.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Release 1.8.0 Beta#

Note

Users upgrading to 1.8.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Released new pretrained models for German (de-DE), Russian (ru-RU), and Spanish (es-US) speech recognition.
Increased recognition accuracy of English (en-US) speech recognition models.
Introduced partial support for Speech Synthesis Markup Language (SSML) within the TTS API. Support has been added for pitch and rate attributes of the <prosody> tag to control pitch and duration of synthesized speech. Additional SSML support is planned for future releases.
Added word boosting support to the Speech Recognition API to bias ASR engine to recognize particular words of interest at request time. This release is limited to boosting of in-vocabulary words; out-of-vocabulary word boosting will be available in an upcoming release.
Minor ASR inference speed improvements in online mode.
Improved offline ASR recognition accuracy.
Added support for the Automatic Speech Recognition (ASR) Conformer-CTC model. The Conformer-CTC model is a nonauto-regressive variant of the Conformer model for ASR, which uses CTC loss/decoding instead of Transducer.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed an issue in TTS pipeline that can sometimes cause an audible ‘pop’ at the end of an utterance.

Known Issues#

The pretrained model used to add punctuation and capitalization to ASR transcripts supports a maximum input length of 128 tokens. Currently, if an ASR transcript containing more than 128 tokens is passed to the punctuation and capitalization model, it will be truncated to 128 tokens. This will be addressed in a future release of Riva.
The rate SSML attribute does not support x-low, low, medium, high, x-high, or default.
The pitch SSML attribute is not currently in compliance with the SSML specs, and does not support Hz, st, % changes, nor does it support x-low, low, medium, high, x-high, or default.
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Release 1.7.0 Beta#

Note

Users upgrading to 1.7.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Added support for models trained by NVIDIA TAO Toolkit 21.11.
Riva Streaming TTS now supports resampling, if necessary, to match the requested audio sample rate.
Default Riva English ASR model updated with higher accuracy.
Minor improvements in English text normalization and inverse text normalization models.
Increased maximum message size to support larger audio inputs in offline ASR.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed minor issues that could cause the synthesized audio generated by the TTS service to be prematurely truncated.
Fixed issue related to custom pronunciations being mishandled by text normalization for TTS.

Known Issues#

When running the nemo2riva package with EFF version 0.5.2, an ignored exception warning is printed. This should not affect functionality of the generated .riva models. This will be addressed in a future release of EFF.
During ASR pipeline execution, inverse text normalization will not convert digits into numerals (one->1) unless there are 10 digits. This limitation will be addressed in a future version of Riva.
The punctuation pipeline does not support Unicode character input. This will be fixed in the next release.

Riva Release 1.6.0 Beta#

Note

Users upgrading to 1.6.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

The Riva TTS service is no longer limited to 400 characters long input strings.
Updated the performance page of the documentation to include performance of Citrinet, FastPitch, and HiFi-GAN models

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed minor issues that could cause the synthesized audio generated by the TTS service to be prematurely truncated.

Known Issues#

Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.
NLP Question Answering functionality may cause a segmentation fault when using TensorRT files generated from the NeMo > Riva > RMIR > TensorRT path. This will be addressed in a future release.

Riva Release 1.5.0 Beta#

Note

Users upgrading to 1.5.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

Support for training n-gram language models for ASR has been added to TAO Toolkit. These language models are fully supported in Riva.
FastPitch now leverages Tensor Cores for improved inference performance.
nemo2riva now provides a warning when attempting to convert unsupported models.
Minor enhancements were made to cover additional cases in text normalization/inverse text normalization for English.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Fixed failure in Quick Start for some versions of the NGC client.
Fixed minor issues that could cause occasional artifacts or reduced quality in TTS generated audio.
Eliminated misleading error messages during riva-build process.

Announcements#

NVIDIA Transfer Learning Toolkit (TLT) has been renamed to NVIDIA TAO Toolkit starting in the 1.5.0-beta release.

Known Issues#

NLP Question Answering functionality may cause a segmentation fault when using TensorRT files generated from the NeMo > Riva > RMIR > TensorRT path. This will be addressed in a future release.

Riva Release 1.4.0 Beta#

Note

Users upgrading to 1.4.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

Minor stability improvements were made to the ASR and TTS services.
Exposed the model_name parameter in the nlp_classify_tokens sample client.
Fixed an issue with the ASR language model hyperparameter tuning tool.

Announcements#

The Jarvis framework has been renamed to Riva starting in the 1.4.0-beta release. Jarvis Speech Skills has been renamed to Riva. Documentation, scripts, and commands have been updated accordingly.
- The Jarvis API is supported but deprecated beginning with this release. It will be removed in a future release. Old Jarvis clients are expected to work as-is with this version of Riva, however, users will need to migrate to the Riva API after the Jarvis API is removed.
- The Riva API modifies the following service names:
  - JarvisASR -> RivaSpeechRecognition
  - JarvisNLP -> RivaLanguageUnderstanding
  - JarvisCoreNLP -> RivaLanguageUnderstanding
  - JarvisTTS -> RivaSpeechSynthesis
- jarvis-build and jarvis-deploy commands have been replaced with the equivalent riva-build and riva-deploy commands.
The riva-build command parameters for ASR pipelines have changed.
- The --lm_decoder_cpu parameter is deprecated. Replace --lm_decoder_cpu.decoder_type=<decoder_type> with --decoder_type=<decoder_type> and replace --lm_decoder_cpu.<param_name>=<param_value> with --<decoder_type>_decoder.<param_name>=<param_value>. For example, instead of using --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1, use --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
- The type of decoder to use must be explicitly set by using --decoder_type=<decoder_type> where <decoder_type> must be one of greedy, os2s, flashlight, or kaldi.
Refer to ASR Pipeline Configuration for example riva-build commands to use with different acoustic models.

Riva Release 1.3.0 Beta#

Note

Users upgrading to 1.3.0 Beta from previous versions must rerun jarvis-build for existing models. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.

NVIDIA Riva

Release Notes

Contents

Release Notes#

Riva Release 2.26.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.24.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Breaking Changes#

Known Issues#

Riva Release 2.19.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Breaking Changes#

Known Issues#

Riva Release 2.18.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.17.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.16.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.15.1#

Fixed Issues#

Model Updates#

Known Issues#

Riva Release 2.15.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Breaking Changes#

Known Issues#

Riva Release 2.14.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.13.1#

Fixed Issues#

Model Updates#

Known Issues#

Riva Release 2.13.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Breaking Changes#

Known Issues#

Riva Release 2.12.1#

Fixed Issues#

Riva Release 2.12.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.11.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Breaking Changes#

Known Issues#

Riva Release 2.10.0#

Key Features and Enhancements#

Model Updates#

Fixed Issues#

Known Issues#

Riva Release 2.9.0#