Release Notes#

Important

If you are upgrading from a previous Riva version, refer to the Upgrading section.

All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Riva Release 2.14.0#

Key Features and Enhancements#

  • Added support in TTS for mixing multiple emotions through SSML input, using the RADTTS++ (beta) model.

Model Updates#

  • Added Mandarin-English Conformer multilingual code-switch ASR model.

  • Added Spanish-US multi-speaker and RADTTS++ (beta) emotion mixing TTS models.

  • Added Megatron 1B any-to-en NMT model.

Fixed Issues#

  • Fixed empty transcripts from NMT when the client uses a batch size greater than 8.

Known Issues#

  • Mandarin-English Conformer multilingual code-switch ASR model does not support punctuation.

  • The RADTTS++ model is a beta model for mixing emotions and does not fully support all functionality, such as the pitch, rate, and volume SSML attributes.

  • When generating .riva models from .nemo using nemo2riva, the nemo:23.08 image is not compatible with Riva due to an updated torch version. To avoid Riva deployment issues, continue using the last working nemo image.

  • Mandarin TTS output has inaccurate pronunciation for some polyphone characters. Also, the audio might sound less natural due to pauses within sentences.

  • German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.

  • Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.

  • Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuation only for the Japanese text.

  • Multilingual Spanish-English code-switching ASR model uses Spanish punctuation by default and does not punctuate the English text.

  • When using a single NVIDIA Triton server in a Riva Helm chart, all ASR models must be deployed on the same GPU due to a limitation from the feature extractor.

  • Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.

  • Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.

  • Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency compared to the models for other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency compared to the models for other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note

      Changing the default value may lead to lower performance and quality.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed, for example, in words containing “oe”, “ae”, or “ell”.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.

  • On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.

  • Clients should not send empty strings to the Riva Translation API; these may be mistranslated into short sentences.

  • Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.
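
The max_sequence_length workaround described in the list above can also be scripted. The following is a minimal sketch, not an official Riva tool. It assumes that config.pbtxt stores the setting as a Triton-style parameter entry (key: "max_sequence_length" followed by value: { string_value: "..." }); that layout, the function name set_max_sequence_length, and the starting value of 400 in the sample are assumptions made for illustration, so check your generated file before relying on it.

```python
import re


def set_max_sequence_length(config_text: str, new_value: int = 500) -> str:
    """Return config_text with the max_sequence_length parameter set to new_value."""
    # Assumed layout (not confirmed by the release notes):
    #   key: "max_sequence_length"
    #   value: { string_value: "<number>" }
    pattern = re.compile(
        r'(key:\s*"max_sequence_length"\s*value:\s*\{\s*string_value:\s*")\d+(")'
    )
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{new_value}{m.group(2)}", config_text
    )
    if count == 0:
        # Fail loudly instead of silently writing an unchanged file.
        raise ValueError("max_sequence_length not found in the expected layout")
    return updated


# Hypothetical excerpt of a config.pbtxt, used only to demonstrate the function.
sample = '''parameters: {
  key: "max_sequence_length"
  value: {
    string_value: "400"
  }
}
'''

print(set_max_sequence_length(sample))
```

To apply it to a real file, read /data/models/tts_preprocessor-English-US/config.pbtxt inside the temporary container, pass its contents through the function, and write the result back before running riva_start.sh.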

Riva Release 2.13.1#

For detailed Release Notes, refer to Riva Release 2.13.0.

Fixed Issues#

  • Fixed the ASR word confidence score so that its values fall in the [0, 1] range.

  • Fixed Helm charts to allow custom model deployment from any NGC org/team.

Model Updates#

  • Added Mandarin TTS model with male and female emotion subvoices.

Known Issues#

  • Mandarin TTS output has inaccurate pronunciation for some polyphone characters. Also, the audio might sound less natural due to pauses within sentences.

Riva Release 2.13.0#

Key Features and Enhancements#

  • Added support in TTS for synthesizing speech in non-English languages.

  • Added TTS multi-speaker adapter IPA pretrained .nemo checkpoint and a tutorial on how to finetune it for smaller datasets.

  • Added support for tagging gRPC requests and responses with a unique identifier.

Model Updates#

  • Added German Conformer unified and updated Spanish-English Conformer multilingual code-switch ASR models.

  • Added Japanese-English Conformer unified multilingual code-switch and updated English ASR models.

  • Added Spanish and Italian TTS models with male and female voices, and German TTS model with male voice.

Fixed Issues#

  • Simplified translation documentation to ease deployment and used consistent naming for translation clients.

  • Fixed speech translation clients to support microphone input and logging of performance metrics.

  • Fixed an issue in ASR that caused intermittent transcript inaccuracy on multiple runs in some cases.

  • Corrected timestamps in ASR results for the character-based languages Japanese and Mandarin.

  • Fixed profane word filtering in ASR result transcripts when the greedy decoder is used.

Breaking Changes#

  • The denoiser arguments used in riva-build when building TTS models have been renamed to postprocessor to better reflect what occurs in that step. The postprocessor is currently used to cross-fade audio chunks and is not used for denoising.

Known Issues#

  • German Conformer unified ASR model can have low accuracy in some cases, particularly for Inverse Text Normalization when the transcript contains capitalized words.

  • Japanese-English Conformer unified multilingual code-switch ASR model results only contain character timestamps and not word timestamps.

  • Japanese-English Conformer unified multilingual code-switch ASR model result transcripts contain punctuation only for the Japanese text.

  • Multilingual Spanish-English code-switching ASR model uses Spanish punctuation by default and does not punctuate the English text.

  • When using a single NVIDIA Triton server in a Riva Helm chart, all ASR models must be deployed on the same GPU due to a limitation from the feature extractor.

  • Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.

  • Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.

  • Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency compared to the models for other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency compared to the models for other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note

      Changing the default value may lead to lower performance and quality.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed, for example, in words containing “oe”, “ae”, or “ell”.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.

  • On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.

  • Clients should not send empty strings to the Riva Translation API; these may be mistranslated into short sentences.

  • Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.
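
One way to follow the empty-string guidance for the Translation API in the list above is to filter blank inputs on the client side. The sketch below is illustrative only: translate_non_empty and fake_translate are hypothetical names, and translate_batch stands in for whatever client call your application uses to send a batch of texts to Riva.

```python
from typing import Callable, List


def translate_non_empty(
    texts: List[str],
    translate_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Translate only non-blank inputs; blank inputs map to empty outputs."""
    # Remember the positions of inputs that contain real text.
    kept = [(i, t) for i, t in enumerate(texts) if t.strip()]
    results = [""] * len(texts)
    if kept:
        translated = translate_batch([t for _, t in kept])
        # Put each translation back at the position of its source text.
        for (i, _), out in zip(kept, translated):
            results[i] = out
    return results


# Stand-in for a real client call, used only to demonstrate the wrapper.
def fake_translate(batch: List[str]) -> List[str]:
    return [t.upper() for t in batch]


print(translate_non_empty(["hello", "", "  ", "world"], fake_translate))
# prints ['HELLO', '', '', 'WORLD']
```

The wrapper keeps the output aligned with the input, so callers do not need to track which entries were skipped.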

Riva Release 2.12.1#

For detailed Release Notes, refer to Riva Release 2.12.0.

Fixed Issues#

  • Updated Helm charts to fix an issue that can cause deployment failure in some environments.

Riva Release 2.12.0#

Key Features and Enhancements#

  • Updated Helm charts to support model deployment on multiple NVIDIA Triton servers.

  • Updated Helm charts to assign models to specific GPUs when using single NVIDIA Triton server with multiple GPUs.

Model Updates#

  • Added Mandarin Conformer unified and Spanish-English Multilingual code-switching ASR models.

  • Updated Italian Conformer and Japanese Conformer unified ASR models.

  • Added emotion sub-voices for FastPitch and RAD-TTS models.

Fixed Issues#

  • S2S output in OPUS encoded format would sometimes have intermittent glitches. This issue has been fixed.

  • The Conformer unified ASR model always returned punctuated output irrespective of the --automatic_punctuation flag. This issue has been fixed.

  • S2S service is updated to return the appropriate gRPC status for different error scenarios.

Known Issues#

  • Multilingual Spanish-English code-switching ASR model uses Spanish punctuation by default and does not punctuate the English text.

  • When using a single NVIDIA Triton server in a Riva Helm chart, all ASR models must be deployed on the same GPU due to a limitation from the feature extractor.

  • Arabic ITN currently does not de-normalize time, date, currency, and decimal numbers.

  • Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.

  • Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency compared to the models for other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency compared to the models for other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note

      Changing the default value may lead to lower performance and quality.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed, for example, in words containing “oe”, “ae”, or “ell”.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.

  • On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.

  • Clients should not send empty strings to the Riva Translation API; these may be mistranslated into short sentences.

  • Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.
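
Because the punctuation models in the list above assume unpunctuated input, a client can strip existing punctuation before sending text. This is a minimal sketch under stated assumptions: strip_punctuation is a hypothetical helper, it handles only ASCII punctuation from Python's string.punctuation, and keeping apostrophes and hyphens inside words is a design choice made here, not a documented Riva requirement.

```python
import string


def strip_punctuation(text: str, keep: str = "'-") -> str:
    """Remove ASCII punctuation (except characters in keep) and tidy whitespace."""
    # Build the set of characters to delete, sparing the ones we want to keep.
    drop = "".join(c for c in string.punctuation if c not in keep)
    cleaned = text.translate(str.maketrans("", "", drop))
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join(cleaned.split())


print(strip_punctuation("Hello, world! How are you?"))
# prints Hello world How are you
```

Language-specific marks such as “¿” or full-width Japanese punctuation are not covered and would need to be added to the deletion set.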

Riva Release 2.11.0#

Key Features and Enhancements#

  • Added a new service called Speech-to-Speech Translation (S2S). Riva S2S translates audio between language pairs, that is, from one language to another.

  • Added a new service called Speech-to-Text Translation (S2T). Riva S2T transcribes audio between language pairs, that is, from one language to another.

  • Added two new Riva S2S and S2T APIs, StreamingTranslateSpeechToSpeech and StreamingTranslateSpeechToText.

Model Updates#

  • Added the Conformer unified Japanese ASR model, which is an acoustic model trained with punctuation symbols as part of its vocabulary. This helps in getting more accurate punctuations within transcriptions.

Fixed Issues#

  • The --phone_dictionary_file and --mapping_file arguments for riva-build of the TTS pipeline now accept relative paths.

Breaking Changes#

  • The Triton backend config of the CTC decoder has backward-incompatible changes, so model repositories generated by earlier Riva releases are not compatible. Generate a new model repository by running riva_init.sh as described in the Quick Start steps.

Known Issues#

  • S2S output in OPUS encoded format can have intermittent glitches. This issue is not observed with PCM output from S2S.

  • The Conformer unified ASR model always returns punctuated output irrespective of the --automatic_punctuation flag.

  • Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.

  • Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency compared to the models for other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency compared to the models for other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note

      Changing the default value may lead to lower performance and quality.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed, for example, in words containing “oe”, “ae”, or “ell”.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.

  • On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.

  • Clients should not send empty strings to the Riva Translation API; these may be mistranslated into short sentences.

  • Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.
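
The first-call TTS timeout noted in the list above can be absorbed on the client side with a small retry wrapper. The following is a generic sketch, not part of any Riva client: call_with_retry is a hypothetical helper, and request stands in for a zero-argument function that performs your TTS call. In real code, narrow the caught exception to the timeout error your client actually raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(request: Callable[[], T], attempts: int = 3, delay_s: float = 1.0) -> T:
    """Call request(), retrying on failure up to attempts times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except Exception as err:  # assumption: narrow this to the client's timeout error
            last_error = err
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    raise last_error
```

An alternative is to send one short warm-up request right after riva_start.sh completes, so that user-facing calls see normal latency.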

Riva Release 2.10.0#

Key Features and Enhancements#

  • Added RadTTS support for speech synthesis. In the default configuration, use English-US-RadTTS as the voice_name to use the RadTTS model. English-US defers to the FastPitch model.

  • Upgraded the following software versions on embedded platforms:

Model Updates#

  • Added new Punctuation and Capitalization models for Japanese (jp-JP) and Russian (ru-RU) languages.

  • Updated Conformer L ASR models for Arabic (ar-AR), Spanish (es-US), Portuguese (pt-BR), and Mandarin (zh-CN) languages.

  • Added RadTTS and HiFi-GAN RadTTS TTS models with IPA alphabet for English (en-US).

  • Updated language model for Arabic (ar-AR).

Fixed Issues#

  • The pitch SSML attribute now supports Hz.

Known Issues#

  • The --phone_dictionary_file and --mapping_file arguments for riva-build of the TTS pipeline do not work with relative paths.

  • Riva TTS cpp-clients automatically convert Opus to 16-bit pulse-code modulation (PCM) before writing the output audio to disk. Use the Python clients to receive an Opus stream.

  • Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency compared to the models for other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency compared to the models for other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a failed during inference error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note

      Changing the default value may lead to lower performance and quality.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed, for example, in words containing “oe”, “ae”, or “ell”.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support st and % changes.

  • On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.

  • Clients should not send empty strings to the Riva Translation API; these may be mistranslated into short sentences.

  • Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.
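
One way to avoid the doubled-punctuation issue noted above is to strip punctuation on the client before sending text to a punctuation model. The following is a minimal Python sketch, not part of the Riva API: the helper name strip_punctuation and the set of characters treated as punctuation are assumptions made for illustration.

```python
import re

# Assumption for this sketch: these characters count as sentence punctuation.
# Only punctuation followed by whitespace or end of text is removed, so
# decimals such as "3.5" are left intact.
_TRAILING_PUNCT = re.compile(r"[.,!?;:]+(?=\s|$)")


def strip_punctuation(text: str) -> str:
    """Remove sentence punctuation and collapse runs of whitespace."""
    without_punct = _TRAILING_PUNCT.sub("", text)
    return " ".join(without_punct.split())
```

A client would call strip_punctuation on each string before submitting it for punctuation. Whether the text should also be lowercased depends on the model and is not addressed here.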

Riva Release 2.9.0#

Key Features and Enhancements#

  • Riva now supports Opus encoding (in the TTS service) and decoding (in the ASR service). In ASR, you can submit .ogg and .opus audio files to transcode. In TTS, you can choose an option to receive a serialized opus-encoded stream. A deserializer for that stream is also provided. For more information, refer to sample clients.

  • Added a new service called Riva translation. Riva translation translates text between language pairs, that is, from one language to another.

  • Added two new Riva translation APIs, TranslateText and ListSupportedLanguagePairs.

  • Added lexicon-free decoding with a character-based LM. Refer to Flashlight Decoder Lexicon Free for details.

Model Updates#

  • Added four multilingual models and 10 bilingual models for NMT. Refer to NMT Customizing for more information.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Deprecated and Removed Features#

TAO Toolkit support for Riva is now deprecated. We recommend you use NVIDIA NeMo to fine-tune pretrained models on a custom data set.

Known Issues#

  • Running nemo2riva on a FastPitch model with ragged batching support results in warnings about ONNXRuntimeError INVALID_GRAPH. These can be safely ignored.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • The Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency relative to other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency relative to other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency once the feature is enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a “failed during inference” error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note

      Changing the default value may lead to lower performance and quality.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • The Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.

  • On Jetson NX Xavier, the German and Korean ASR, Translation and Speaker Diarization models do not fit into the available 8 GB RAM.

  • Clients should not send empty strings to the Riva Translation API; these may be mistranslated into short sentences.

  • Riva ASR client supports only 16kHz 1-channel format when using FLAC encoding.
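
The empty-string issue with the Translation API noted above can be handled on the client by filtering inputs before the request and restoring their positions afterwards. The following is a minimal Python sketch. The helper names are hypothetical and no Riva client call is shown. Treating whitespace-only strings as empty is an assumption beyond what the note states.

```python
from typing import List, Tuple


def split_translatable(texts: List[str]) -> Tuple[List[int], List[str]]:
    """Return the positions and contents of inputs that are safe to translate.

    Assumption: whitespace-only strings are skipped along with empty ones.
    """
    indices = [i for i, text in enumerate(texts) if text.strip()]
    return indices, [texts[i] for i in indices]


def merge_translations(total: int, indices: List[int], translated: List[str]) -> List[str]:
    """Put translations back at their original positions; skipped inputs stay empty."""
    merged = [""] * total
    for index, text in zip(indices, translated):
        merged[index] = text
    return merged
```

Only the list returned by split_translatable would be sent to the translation service. merge_translations then rebuilds a result list aligned with the original inputs.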

Riva Release 2.8.1#

For detailed Release Notes, refer to Riva Release 2.8.0.

Fixed Issues#

  • SSML prosody tags with the new FastPitch IPA model now apply prosody in the correct locations.

Riva Release 2.8.0#

Important

We recommend using the Riva 2.8.1 (22.11.1) release instead of version 2.8.0.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Added punctuation and capitalization models for the ASR EMEA Spanish (es-ES), Japanese (ja-JP), Korean (ko-KR), Brazilian Portuguese (pt-BR), and Italian (it-IT) models.

  • Added Conformer-L models for the ASR EMEA Spanish (es-ES), Japanese (ja-JP), Italian (it-IT) and Arabic (ar-AR) models.

  • Added Citrinet-1024 model for ASR EMEA Spanish (es-ES)

  • Updated Citrinet-1024 models for ASR Russian (ru-RU) and French (fr-FR)

  • Deployed model configs can be requested via a gRPC command to Riva

  • The speech synthesis pretrained model uses the International Phonetic Alphabet (IPA) for inference and training instead of ARPAbet. Refer to the Known Issues section regarding the SSML prosody tag.

  • Added support for Non-Overlapping Speaker Diarization for Offline Recognition. This is an alpha release of the feature, so it is not enabled by default. To enable it, uncomment the rmir_diarizer_offline model in the Quick Start config.sh before running riva_init.sh.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • en ITN models now handle bank cards with 14 and 15 digits

  • The Conformer ASR model recipes have been updated with --endpointing.residue_blanks_at_start=-2 to better match NeMo WER.

  • The Spanish punctuation models used in the ASR model recipes now preserve accents.

  • The riva-build command for NLP models has been updated: --nlp_pipeline_backend.to_lower and --nlp_pipeline_backend.tokenizer_to_lower have been removed. Use --to_lower and --tokenizer_to_lower instead.

Deprecated and Removed Features#

The following features have been deprecated.

  • Speech synthesis with ARPAbet for inference and training

Known Issues#

  • The Arabic ASR acoustic model targets Modern Standard Arabic (MSA); therefore, accuracy on Lebanese-accented speech might be poor.

  • Spanish (es-ES) and Italian Conformer-CTC-L acoustic models have low throughput and high latency relative to other languages.

  • Korean, Brazilian Portuguese, Spanish (es-ES), French, and Russian Citrinet models have low throughput and high latency relative to other languages.

  • Orientation of output (word timestamps) is disrupted with Arabic while using riva_streaming_asr_client and riva_asr_client in the client Docker.

  • Arabic Conformer-CTC model has poor silence robustness. For better results, use Neural VAD.

  • Japanese punctuation does not work well with numbers and English characters.

  • Speaker diarization is an alpha release and will increase ASR latency if enabled.

  • Long SSML input that worked with previous TTS ARPAbet models might fail with a “failed during inference” error due to the IPA model’s internal representation being slightly longer than the ARPAbet model.

    • To update the length in the Quick Start steps, after riva_init.sh and before riva_start.sh:

      • Access the location where the model repository ($riva-model-repo) is generated (you can mount the volume with a temp Docker: docker run -it -v riva-model-repo:/data ubuntu)

      • In the Docker workspace cd /data/models/tts_preprocessor-English-US

      • In config.pbtxt, set the value of the max_sequence_length key to 500. Save and exit Docker.

      • Continue with the rest of the Quick Start steps: riva_start.sh

      Note: Changing the default value may lead to lower performance/quality.

  • SSML prosody tags with the new FastPitch IPA model will lead to the prosody being applied in later parts of the text and not where the user tags them. If the prosody tags are needed, use the older FastPitch ARPAbet model released with Riva 2.7.0 and older.

  • Portuguese punctuation model has poor accuracy with commas. This will be fixed in an upcoming release.

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • Conformer acoustic models fine-tuned with TAO Toolkit and deployed in Riva with the recommended riva-build parameters from Pipeline Configuration can lead to empty transcripts at inference time. To work around this problem, pass the --nn.use_trt_fp32 parameter to riva-build. This will be fixed in a future version of TAO Toolkit.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin and Japanese Conformer acoustic models leads to high latency. We recommend using a greedy decoder with these acoustic models.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.

  • The Riva server does not return timestamps for every Mandarin and Japanese character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.

  • On Jetson NX Xavier, the German ASR model does not fit into the available 8 GB RAM.

Riva Release 2.7.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Added ITN support for fr-FR

  • Updated ITN models for en-US, es-US:

    • en-US 2.0

      • Support for credit cards

      • Indian numbering (lakhs, crores, and so on)

      • Numeric sequences (phone numbers, credit cards, SSN, and so on)

      • Support for “double” and “triple” in the above numeric sequences (“double five triple eight nine six four seven two” -> 558-889-6472)

      • Alphanumeric sequences (H1N1)

      • Currencies of various countries and cryptocurrencies

    • es-US 2.0

      • Currencies

      • Fractions

      • Measurements

      • Math

      • Telephone (country codes and extensions)

  • Added a punctuation and capitalization model for the ASR United Kingdom English (en-GB) model.

  • Added Citrinet-1024 and Conformer-L models for the ASR Brazilian Portuguese (pt-BR) and Korean (ko-KR) models.

  • The ASR Mandarin language model is now pruned.

  • Deployed model configs can be requested via a gRPC command to Riva
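
The “double” and “triple” handling listed under en-US 2.0 above can be illustrated with a toy Python sketch. This is not the grammar-based ITN that Riva ships. The function names, the word table, and the 3-3-4 grouping rule are assumptions chosen only to reproduce the example in the list.

```python
# Toy illustration only; the vocabulary and grouping rule are assumptions.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
REPEATS = {"double": 2, "triple": 3}


def spoken_digits_to_string(spoken: str) -> str:
    """Expand spoken digits, honoring a preceding 'double' or 'triple'."""
    out = []
    repeat = 1
    for word in spoken.lower().split():
        if word in REPEATS:
            repeat = REPEATS[word]
        elif word in DIGITS:
            out.append(DIGITS[word] * repeat)
            repeat = 1
        else:
            raise ValueError(f"unsupported token: {word!r}")
    return "".join(out)


def format_ten_digits(digits: str) -> str:
    """Group exactly ten digits as 3-3-4; return anything else unchanged."""
    if len(digits) != 10:
        return digits
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(format_ten_digits(spoken_digits_to_string(
    "double five triple eight nine six four seven two")))  # prints 558-889-6472
```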

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • en ITN models now handle bank cards with 14 and 15 digits

  • The Conformer ASR model recipes have been updated with --endpointing.residue_blanks_at_start=-2 to better match NeMo WER.

  • The Spanish punctuation models used in the ASR model recipes now preserve accents.

  • The riva-build command for NLP models has been updated: --nlp_pipeline_backend.to_lower and --nlp_pipeline_backend.tokenizer_to_lower have been removed. Use --to_lower and --tokenizer_to_lower instead.

Deprecated and Removed Features#

The following features have been removed.

  • Tacotron 2 and WaveGlow model support

Known Issues#

  • Riva punctuation models add a period if the input text is empty.

  • Riva punctuation models assume that the incoming text is unpunctuated. If the incoming text is already punctuated, the punctuation model might double the existing punctuation.

  • Korean and Brazilian Portuguese Citrinet models have low throughput in offline mode.

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • Conformer acoustic models fine-tuned with TAO Toolkit and deployed in Riva with the recommended riva-build parameters from Pipeline Configuration can lead to empty transcripts at inference time. To work around this problem, pass the --nn.use_trt_fp32 parameter to riva-build. This will be fixed in a future version of TAO Toolkit.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin Conformer acoustic model leads to high latency. We recommend using a greedy decoder with the Mandarin Conformer acoustic model.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.

  • The Riva server does not return timestamps for every Mandarin character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.

  • On Jetson NX Xavier, the German ASR model does not fit into the available 8 GB RAM.
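
Because a 1-gram language model in .arpa format is not supported (see the list above), it can help to check the order of an ARPA file before passing it to the build. The following Python sketch reads the n-gram counts from the standard ARPA header. The function name is hypothetical, and the check is a client-side convenience rather than a Riva or KenLM feature.

```python
import re
from typing import Iterable

# Matches header lines such as "ngram 2=12345" in the data section of an ARPA file.
_NGRAM_COUNT = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def arpa_max_order(lines: Iterable[str]) -> int:
    """Return the highest n-gram order declared in the ARPA header, or 0 if none."""
    in_header = False
    max_order = 0
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if in_header:
            match = _NGRAM_COUNT.match(line)
            if match:
                max_order = max(max_order, int(match.group(1)))
            elif line.startswith("\\"):
                break  # a section marker such as \1-grams: ends the header
    return max_order
```

For a file on disk, the check could be written as `with open(path) as f: order = arpa_max_order(f)`, rejecting the model when order is less than 2.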

Riva Skills Release 2.6.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • ASR word level timestamps and confidences for all alternatives. This is an experimental feature. The accuracy of these confidences is not guaranteed.

  • Utterance level confidences for all alternatives. This is an experimental feature. The accuracy of these confidences is not guaranteed.

  • Option to use a neural-based voice activity detector in ASR to filter out noise from the audio and potentially reduce spurious words from appearing in ASR transcripts.

  • Added support for the SSML emphasis tag in Riva TTS.

  • Model updates:

    • Version 3.0 of the Conformer Hindi ASR model is now available.

    • Version 2.1 of the Conformer French ASR model is now available.

    • New pruned ASR language models are available for German, English, Hindi, and Russian.

    • New ITN models are available for French, English, and Spanish.

    • New BERT-based punctuation models are available for English and French.

    • Riva TTS English-US model supports emphasis outputs

Breaking Changes#

  • The riva-build parameters starting with --vad.<parameter_name> must be changed to --endpointing.<parameter_name>.

  • The riva-build parameters --vad.vad_start_history and --vad.vad_stop_history are now --endpointing.start_history and --endpointing.stop_history respectively.

  • The riva-build option --vad_type now has two possible values none and neural, and is used to select the pre-acoustic model voice activity detection algorithm used in Riva ASR (refer to Neural-Based Voice Activity Detection for more information).

  • The riva-build option --endpointing_type now has two possible values none and greedy_ctc, and is used to select the post-acoustic model end-pointing algorithm used in Riva to detect beginning/end of utterances (refer to Beginning/End of Utterance Detection for more information).
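
The flag renames above can be applied mechanically to an existing riva-build argument list. The following Python sketch is a hypothetical migration helper, not a tool shipped with Riva. It covers only the renames stated in these notes, including the earlier --vad.vad_type to --vad_type change from Riva 2.4.0. It does not validate option values, so the value passed to --vad_type still needs a manual review against the new none and neural choices.

```python
from typing import List

# Explicit renames taken from the release notes; all other --vad.<name>
# flags fall through to the generic --endpointing.<name> rule.
_EXPLICIT_RENAMES = {
    "--vad.vad_start_history": "--endpointing.start_history",
    "--vad.vad_stop_history": "--endpointing.stop_history",
    "--vad.vad_type": "--vad_type",  # renamed in Riva 2.4.0
}


def migrate_vad_flags(args: List[str]) -> List[str]:
    """Rewrite pre-2.6.0 --vad.* flags in '--name=value' form to their new names."""
    migrated = []
    for arg in args:
        name, sep, value = arg.partition("=")
        if name in _EXPLICIT_RENAMES:
            name = _EXPLICIT_RENAMES[name]
        elif name.startswith("--vad."):
            name = "--endpointing." + name[len("--vad."):]
        migrated.append(name + sep + value)
    return migrated
```

Flags that do not start with --vad. pass through unchanged, so the helper can be run over a complete argument list.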

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • An option was added to the Riva Helm chart to optionally remove all models before deployment. This is to address an issue where models from a previous version of Riva could get reused, causing an error when creating pods.

  • Fixed an issue with our punctuator model that caused the riva-build parameter pad_chars_with_space to be ignored.

Deprecated and Removed Features#

Tacotron 2 and WaveGlow will be removed in Riva 2.7.0.

Limitations#

  • The emphasis tag has a few limitations:

    • Feature support is dependent on training data and will only work on models trained with data containing emphasis samples.

    • Use the tag around individual words; not around multiple words. "<emphasis>Hello</emphasis> <emphasis>World</emphasis>!" is valid. "<emphasis>Hello World!</emphasis>" is not.

    • No other SSML tags can be nested inside of the emphasis tag.

    • The tag does not support the level attribute.

  • Currently, the profanity filter feature does not support symbolic languages (for example, Japanese, Chinese, and so on).
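
Given the per-word restriction on the emphasis tag described above, a client can generate the markup word by word. The following Python sketch uses a hypothetical helper name and keeps leading and trailing punctuation outside the tag, mirroring the valid example in the list. It produces only the emphasized fragment, not a complete SSML document.

```python
import re
from xml.sax.saxutils import escape

# Split a token into leading punctuation, the word itself, and trailing punctuation.
_TOKEN = re.compile(r"^(\W*)(.*?)(\W*)$")


def emphasize_words(text: str) -> str:
    """Wrap each whitespace-separated word in its own <emphasis> tag."""
    pieces = []
    for token in text.split():
        lead, core, trail = _TOKEN.match(token).groups()
        if core:
            pieces.append(f"{escape(lead)}<emphasis>{escape(core)}</emphasis>{escape(trail)}")
        else:
            pieces.append(escape(token))  # punctuation-only token, left unwrapped
    return " ".join(pieces)


print(emphasize_words("Hello World!"))
# prints <emphasis>Hello</emphasis> <emphasis>World</emphasis>!
```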

Known Issues#

  • The use of the new neural-based voice activity detector in the Riva ASR has a non-negligible impact on latency and throughput. In local tests, a degradation in those metrics on the order of 25%-50% has been observed.

  • Because Riva uses CTC-based acoustic models, which do not learn alignment during training, word timestamps in ASR transcripts can be inaccurate. Timestamps are estimated from the final weights of the specific acoustic model being used. The accuracy of those timestamps can vary depending on several variables including audio duration, audio quality, and accuracy of the model.

  • Conformer acoustic models fine-tuned with TAO Toolkit and deployed in Riva with the recommended riva-build parameters from Pipeline Configuration can lead to empty transcripts at inference time. To work around this problem, pass the --nn.use_trt_fp32 parameter to riva-build. This will be fixed in a future version of TAO Toolkit.

  • Loading a FastPitch model with ragged batching support results in the Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • The first Riva TTS call after riva_start.sh results in longer latency, and can throw a timeout error. Subsequent calls will exhibit normal latency.

  • The use of the OpenSeq2Seq decoder with the Mandarin Conformer acoustic model leads to high latency. We recommend using a greedy decoder with the Mandarin Conformer acoustic model.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example.

  • The Riva server does not return timestamps for every Mandarin character in the transcript.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

  • Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • The pitch SSML attribute is not in compliance with the SSML specs, and does not support Hz, st, % changes.

  • On Jetson NX Xavier, the German ASR model does not fit into the available 8 GB RAM.

Riva Release 2.5.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • FastPitch models now support ragged batching for improved throughput. Starting in Riva 2.5.0, all newly exported FastPitch models will enable the ragged batching feature. Note that older FastPitch checkpoints must be exported again to enable the ragged batching feature.

  • Riva ASR now supports profanity filtering. Refer to the Profanity Filter section for more details.

  • Model updates:

    • The new single TTS English-US multi-speaker model replaces the previous setup of two single-speaker models.

    • Version 3.0 of the Conformer Mandarin ASR model is now available.

    • Version 2.0 of the Conformer Russian ASR model is now available.

  • Upgraded the following embedded hardware and software versions:

Breaking Changes#

  • The default value of the asr_model_delay parameter used by ASR decoders has been changed from 12 to 0.

  • The Riva client and server Docker images are now combined into a single Docker image.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Changed the default value of the asr_model_delay parameter from 12 to 0, which should help prevent word timestamps with negative values.

  • Changed the output processed_text from the TTS pipeline to match the behavior in the preprocessor. When a character is passed to TTS and does not exist in the mapping file, the preprocessor removes this character prior to tokenization. Likewise, these characters will be removed from the processed_text output.

  • Fixed a bug in the punctuator model that prevented proper punctuation of transcripts that included square brackets.

  • Fixed a bug to properly cancel ongoing RPCs when the Riva server is shut down.
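
The processed_text behavior described above, where characters missing from the mapping file are dropped, can be pictured with a short Python sketch. The function name and the use of a plain set of characters in place of the real mapping file are assumptions for illustration. The actual preprocessor and file format are not shown in these notes.

```python
from typing import Set


def drop_unmapped(text: str, mapped_chars: Set[str]) -> str:
    """Keep only the characters present in the (stand-in) mapping."""
    return "".join(ch for ch in text if ch in mapped_chars)
```

With a mapping that contains only lowercase ASCII letters and the space character, an input containing an accented letter or an emoji would come back with those characters removed.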

Limitations#

  • Currently, the profanity filter feature does not support symbolic languages (for example, Japanese, Chinese, and so on).

Known Issues#

  • Loading a FastPitch model with ragged batching support results in Triton server logging warnings about CleanUnusedInitializersAndNodeArgs. These can be safely ignored.

  • On Jetson platforms, the first run of riva_tts_client after riva_start.sh in offline mode can throw a timeout error. This will be fixed in a future release of Riva.

Riva Release 2.4.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Added support for SSML sub-tags for speech synthesis.

  • Added support for ARM-based deployments.

  • Model updates:

    • Updated the Conformer en-US speech recognition model.

    • Added Conformer fr-FR, en-GB, and zh-CN speech recognition models.

    • Added new punctuation models for fr-FR and hi-IN.

Breaking Changes#

  • The riva-build parameter --vad.vad_type, which is used to select the type of VAD heuristic to use in the ASR pipeline, has been replaced by --vad_type.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed an issue that caused ASR word timestamps to have extremely large values.

Deprecated and Removed Features#

The following features have been deprecated.

  • The Tacotron 2 and WaveGlow TTS pipeline is now deprecated and will be removed in a future version of Riva. Consider switching to the FastPitch and HiFi-GAN pipeline, which is faster, more robust, and similar in quality to the Tacotron 2 and WaveGlow TTS pipeline.

Known Issues#

  • The French punctuation model sometimes omits punctuation marks. An improved punctuation model will be provided in the next release.

  • Word timestamps in ASR transcripts can be inaccurate for some audio files and ASR models.

  • The use of the OpenSeq2Seq decoder with the Mandarin Conformer acoustic model leads to high latency. This will be fixed in a future version of Riva. Until then, we recommend using a greedy decoder with the Mandarin Conformer acoustic model.

  • On Jetson Xavier NX, the pre-configured Hindi ASR pipelines from the Quick Start scripts do not fit into the available 8 GB RAM due to the large language model being used. This will be fixed in a future release of Riva.

  • The pre-configured English Great Britain (en-GB) ASR pipeline transcribes some en-GB specific words with the en-US spelling. This is observed for words having “oe”, “ae”, “ell” for example. This will be fixed in a future release of Riva.

Riva Release 2.3.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Support has been added for the volume attribute of the <prosody> SSML tag to control the volume of synthesized speech. In order to use this tag, the FastPitch .riva file must be rebuilt from a .nemo or .tao file.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Deprecated and Removed Features#

The following features have been deprecated.

  • The Tacotron 2 and WaveGlow TTS pipeline will be deprecated in a future version of Riva. Consider switching to the FastPitch and HiFi-GAN pipeline, which is faster, more robust, and similar in quality to the Tacotron 2 and WaveGlow TTS pipeline.

Riva Release 2.2.1#

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed a throughput performance regression in the speech synthesis service.

  • Return properly-punctuated words in WordInfo objects in offline speech recognition mode.

  • When word boosting in speech recognition, a warning instead of an error is returned when requested words cannot be boosted.

Riva Release 2.2.0#

Important

We recommend using the Riva 2.2.1 (22.05.1) release instead of v2.2.0.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Riva supports the NVIDIA Jetson Orin platform.

  • Punctuation models support arbitrary sequence length, and no longer truncate inputs.

  • Added the option to share the feature extractor between multiple ASR pipelines.

  • Model updates:

    • Added new Hindi speech recognition model (Conformer).

    • Improved the Mandarin language model.

    • Added Mandarin punctuation support.

Breaking Changes#

  • In the intent_slot pipeline, the --contextual command-line option is removed. The contextual mode behavior is still supported by the Riva client API and ServiceMaker using the contextual model config attribute. The default is false.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed an issue in TTS where the pitch and rate attributes were not applied where specified.

  • Fixed an issue with reading non-standard WAV headers that could marginally increase latency when returning the first result.

  • Fixed improperly required channel_count in speech recognition request configuration.

  • Fixed a potential crash when deploying TTS for a novel language with text normalization disabled.

Known Issues#

  • The Mandarin punctuation model clips the output when there are English words present in the input text.

  • The Mandarin punctuation model accuracy is low compared to other languages. It will be improved in a future version of Riva.

  • The Riva server currently does not return timestamps for every Mandarin character in the transcript. This will be fixed in a future version of Riva.

  • On the Jetson Xavier NX, the German ASR model doesn’t fit into the available 8 GB RAM.

Riva Release 2.1.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Added text normalization options as part of the riva-build process. Refer to the TTS Pipeline Configuration section for more information.

  • Added multiple tutorials.

Breaking Changes#

  • Removed the following environment variables related to text normalization in TTS: NORM_PROTO_CONFIG and NORM_PROTO_PATH.

  • In previous versions, TTS used text normalization by default if none is specified. Now, text normalization will not be performed if none is specified.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Riva Release 2.0.0#

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Riva supports Linux ARM64 platforms, that is, NVIDIA Jetson AGX Xavier™ and NVIDIA Jetson NX Xavier, referred to as embedded throughout the documentation.

  • Riva provides two new pretrained TTS voices that are easily deployable via the Quick Start scripts.

  • Phoneme SSML tags support manually overriding pronunciations.

  • SSL/TLS connections to the Riva server are supported.

  • There is a new option for generating additional tokenizations for words in the lexicon (this is an experimental feature, which may boost recognition accuracy).

  • Inverse text normalization grammars must be provided during the riva-build stage to allow customizations for inverse text normalization.

  • Added the ability to supply an opt-in API key for sending telemetry back to NVIDIA.
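
As an illustration of the phoneme override described in the list above, the following Python sketch builds an SSML string that pins the pronunciation of one word. The phoneme element and its alphabet and ph attributes come from the W3C SSML specification. The helper name phoneme_ssml, the choice of the ipa alphabet, and the sample pronunciation are assumptions made for this example; check the Riva TTS documentation for the alphabets that a given model accepts.

```python
import xml.etree.ElementTree as ET


def phoneme_ssml(before: str, word: str, ph: str, after: str = "",
                 alphabet: str = "ipa") -> str:
    """Return SSML in which `word` is spoken with the pronunciation `ph`.

    Hypothetical helper: it only assembles the markup and does not
    contact a Riva server.
    """
    speak = ET.Element("speak")
    speak.text = before  # text preceding the overridden word
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": alphabet, "ph": ph})
    phoneme.text = word   # written form of the word
    phoneme.tail = after  # text following the overridden word
    # ElementTree escapes any special characters in text and attributes.
    return ET.tostring(speak, encoding="unicode")


ssml = phoneme_ssml("You say ", "tomato", "təˈmeɪtoʊ", ".")
print(ssml)
```

The resulting string would be passed as the text of a synthesis request in place of plain text.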

Breaking Changes#

  • All legacy Jarvis APIs have been removed and are no longer supported.

  • The returned type of audio waveform from the Riva TTS service is now int16 to be compatible with the linear PCM wave format currently supported by Riva.
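
Client code that previously assumed a different sample type needs to read the returned buffer as 16-bit integers. The following Python sketch shows one way to do that with the standard library only. It assumes that the audio arrives as a raw bytes buffer of little-endian samples and that the sample rate is known from the request; the function names and the 22050 Hz rate are hypothetical, and obtaining the bytes from the Riva response is not shown.

```python
import array
import io
import sys
import wave


def pcm16_samples(pcm: bytes) -> array.array:
    """Decode raw bytes into signed 16-bit samples (little-endian assumed)."""
    if len(pcm) % 2 != 0:
        raise ValueError("int16 PCM data must contain an even number of bytes")
    samples = array.array("h")  # 'h' is a signed 16-bit integer
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # convert from little-endian on big-endian hosts
    return samples


def pcm16_to_wav_bytes(pcm: bytes, sample_rate_hz: int, channels: int = 1) -> bytes:
    """Wrap raw int16 PCM in a WAV container and return the file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = int16
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buf.getvalue()


# Three little-endian int16 samples: 0, 1000, -1000.
pcm = bytes([0x00, 0x00, 0xE8, 0x03, 0x18, 0xFC])
print(pcm16_samples(pcm).tolist())
wav_bytes = pcm16_to_wav_bytes(pcm, sample_rate_hz=22050)
```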

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed an issue in ServiceMaker that caused punctuation and capitalization models generated with recent NeMo versions to produce inaccurate results.

  • Fixed an issue that could lead to a crash when using word boosting.

Known Issues#

  • Deployment of Citrinet models for offline recognition can fail during the riva-deploy phase if large chunk sizes are used. To work around this issue, pass the parameter max-dim=100000 to nemo2riva when converting the .nemo model to .riva. This enables a chunk size of up to 900 seconds during the riva-deploy phase.

  • On embedded platforms, the ASR examples in the asr-python-basics and asr-python-boosting Jupyter notebooks do not work by default, because they invoke the offline recognition API and embedded platforms do not have an offline ASR model enabled by default. To get these examples working, either deploy an offline ASR model or modify the examples to use the streaming recognition API.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Release 1.10.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.10.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Riva 1.10.0 beta now uses Triton 2.19.0 and TensorRT 8.2.

  • The default behavior of Riva TTS’s G2P pipeline has changed. Words that have multiple phonetic representations now default to use graphemes. This was done to match the default NeMo training behavior. To revert to the old behavior, please add --preprocessor.g2p_ignore_ambiguous=False to riva-build.

  • ASR word boosting at request time is supported in Riva. This feature allows you to provide a list of words that should be given a higher score when decoding the output of the acoustic model. Refer to the gRPC ASR protobuf file (riva/proto/riva_asr.proto) for more information on how to include boosted words with the ASR request.
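
To make the idea of giving boosted words a higher score concrete, here is a toy Python sketch that rescores candidate transcripts. It is not Riva's decoder and does not use the Riva API: the function name, the per-word additive bonus, and the sample scores are all assumptions chosen for illustration, and a real decoder would apply the boost during the search rather than afterwards.

```python
from typing import Iterable, List, Tuple


def rescore_with_boost(hypotheses: Iterable[Tuple[str, float]],
                       boosted_words: Iterable[str],
                       boost: float) -> List[Tuple[str, float]]:
    """Add `boost` to a hypothesis score for every boosted word it contains.

    Toy model only: scores are treated as log-likelihoods (higher is better)
    and hypotheses are split into words on whitespace.
    """
    boosted = {w.lower() for w in boosted_words}
    rescored = []
    for text, score in hypotheses:
        hits = sum(1 for token in text.lower().split() if token in boosted)
        rescored.append((text, score + boost * hits))
    # Best hypothesis first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


candidates = [("anti berta is a protein", -3.2),
              ("antiberta is a protein", -4.0)]
best_text, best_score = rescore_with_boost(candidates, ["AntiBERTa"], boost=2.0)[0]
print(best_text, best_score)
```

Without the boost the first candidate wins; with it, the hypothesis containing the boosted word is ranked first. The sketch matches single words only.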

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed an issue that can cause acoustic models exported from NeMo 1.5+ to incorrectly include spaces in transcripts.

  • Fixed an issue in nemo2riva preventing conversion of models from NeMo versions earlier than 1.3.0.

  • Fixed an issue that could lead to irregular rhythm of speech when a TTS model was trained with mixed representation input.

  • Fixed an issue that can cause incorrect transcripts when the server is under a heavy load.

Known Issues#

  • The Riva Speech Samples image nvcr.io/nvidia/riva/riva-speech-client:1.10.0-beta-samples does not exist. Use nvcr.io/nvidia/riva/riva-speech-client:1.8.0-beta-samples instead.

  • The ASR word boosting feature in Riva currently does not support boosting of phrases or combinations of words. This will be supported in a future version of Riva.

  • nemo2riva and riva-build are currently broken for newer WaveGlow NeMo TTS checkpoints. As a workaround, use this WaveGlow.riva file instead: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechsynthesis_waveglow/files.

Riva Release 1.9.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.9.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Improved customization for Automatic Speech Recognition (ASR) Spanish (es-US) and German (de-DE) language models.

  • The rate SSML attribute supports x-low, low, medium, high, x-high, and default.

  • The pitch SSML attribute supports x-low, low, medium, high, x-high, and default.
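
The keyword values listed above can be checked on the client before a request is sent. The following Python sketch builds a prosody element wrapped in a speak root and rejects keywords outside that list. The helper name prosody_ssml is hypothetical, and the check covers only the keyword values named in these notes; other value forms are outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET

# Keyword values listed in the release notes for both rate and pitch.
PROSODY_LEVELS = ("x-low", "low", "medium", "high", "x-high", "default")


def prosody_ssml(text: str, rate: str = "default", pitch: str = "default") -> str:
    """Return SSML that applies keyword rate and pitch settings to `text`."""
    for name, value in (("rate", rate), ("pitch", pitch)):
        if value not in PROSODY_LEVELS:
            raise ValueError(
                f"{name}={value!r} is not one of {', '.join(PROSODY_LEVELS)}")
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text  # ElementTree escapes special characters on output
    return ET.tostring(speak, encoding="unicode")


print(prosody_ssml("Hello from Riva.", rate="low", pitch="x-high"))
```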

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Known Issues#

  • The pretrained model used to add punctuation and capitalization to ASR transcripts supports a maximum input length of 128 tokens. Currently, if an ASR transcript containing more than 128 tokens is passed to the punctuation and capitalization model, it will be truncated to 128 tokens. This will be addressed in a future release of Riva.

  • The pitch SSML attribute is not currently in compliance with the SSML specs, and does not support Hz, st, % changes.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Release 1.8.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.8.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Released new pretrained models for German (de-DE), Russian (ru-RU), and Spanish (es-US) speech recognition.

  • Increased recognition accuracy of English (en-US) speech recognition models.

  • Introduced partial support for Speech Synthesis Markup Language (SSML) within the TTS API. Support has been added for pitch and rate attributes of the <prosody> tag to control pitch and duration of synthesized speech. Additional SSML support is planned for future releases.

  • Added word boosting support to the Speech Recognition API to bias the ASR engine toward recognizing particular words of interest at request time. This release is limited to boosting of in-vocabulary words; out-of-vocabulary word boosting will be available in an upcoming release.

  • Minor ASR inference speed improvements in online mode.

  • Improved offline ASR recognition accuracy.

  • Added support for the Automatic Speech Recognition (ASR) Conformer-CTC model. The Conformer-CTC model is a non-autoregressive variant of the Conformer model for ASR, which uses CTC loss/decoding instead of Transducer.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed an issue in the TTS pipeline that can sometimes cause an audible ‘pop’ at the end of an utterance.

Known Issues#

  • The pretrained model used to add punctuation and capitalization to ASR transcripts supports a maximum input length of 128 tokens. Currently, if an ASR transcript containing more than 128 tokens is passed to the punctuation and capitalization model, it will be truncated to 128 tokens. This will be addressed in a future release of Riva.

  • The rate SSML attribute does not support x-low, low, medium, high, x-high, or default.

  • The pitch SSML attribute is not currently in compliance with the SSML specs, and does not support Hz, st, % changes, nor does it support x-low, low, medium, high, x-high, or default.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Release 1.7.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.7.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Added support for models trained by NVIDIA TAO Toolkit 21.11.

  • Riva Streaming TTS now supports resampling, if necessary, to match the requested audio sample rate.

  • Default Riva English ASR model updated with higher accuracy.

  • Minor improvements in English text normalization and inverse text normalization models.

  • Increased maximum message size to support larger audio inputs in offline ASR.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed minor issues that could cause the synthesized audio generated by the TTS service to be prematurely truncated.

  • Fixed an issue related to custom pronunciations being mishandled by text normalization for TTS.

Known Issues#

  • When running the nemo2riva package with EFF version 0.5.2, an ignored exception warning is printed. This should not affect functionality of the generated .riva models. This will be addressed in a future release of EFF.

  • During ASR pipeline execution, inverse text normalization will not convert digits into numerals (one->1) unless there are 10 digits. This limitation will be addressed in a future version of Riva.

  • The punctuation pipeline does not support Unicode character input. This will be fixed in the next release.

Riva Release 1.6.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.6.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • The Riva TTS service no longer limits input strings to 400 characters.

  • Updated the performance page of the documentation to include performance of Citrinet, FastPitch, and HiFi-GAN models.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed minor issues that could cause the synthesized audio generated by the TTS service to be prematurely truncated.

Known Issues#

  • Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • NLP Question Answering functionality may cause a segmentation fault when using TensorRT files generated from the NeMo > Riva > RMIR > TensorRT path. This will be addressed in a future release.
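
Because a 1-gram .arpa language model is not supported, it can be useful to check the order of a model before building. The following Python sketch reads the header of an ARPA file, where each "ngram N=count" line declares one n-gram order, and reports the highest order found. The function name is hypothetical, and the sketch assumes a standard ARPA header that begins with a \data\ line.

```python
import re
from typing import Iterable

# Header lines look like "ngram 2=1234": order 2 with 1234 entries.
_NGRAM_COUNT = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def arpa_max_order(lines: Iterable[str]) -> int:
    """Return the highest n-gram order declared in an ARPA header."""
    in_header = False
    max_order = 0
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = _NGRAM_COUNT.match(line)
        if match:
            max_order = max(max_order, int(match.group(1)))
        elif line.startswith("\\"):
            break  # reached the first n-gram section, so the header is over
    if max_order == 0:
        raise ValueError("no ARPA header with 'ngram N=count' lines found")
    return max_order


unigram_model = ["\\data\\", "ngram 1=3", "", "\\1-grams:",
                 "-0.5\t<s>", "-0.5\thello", "-0.5\t</s>", "", "\\end\\"]
if arpa_max_order(unigram_model) < 2:
    print("1-gram model: not supported in .arpa format")
```

For a file on disk, pass an open text file object, since iterating over it yields lines.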

Riva Release 1.5.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.5.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Key Features and Enhancements#

This Riva release includes the following key features and enhancements.

  • Support for training n-gram language models for ASR has been added to TAO Toolkit. These language models are fully supported in Riva.

  • FastPitch now leverages Tensor Cores for improved inference performance.

  • nemo2riva now provides a warning when attempting to convert unsupported models.

  • Minor enhancements were made to cover additional cases in text normalization/inverse text normalization for English.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Fixed failure in Quick Start for some versions of the NGC client.

  • Fixed minor issues that could cause occasional artifacts or reduced quality in TTS generated audio.

  • Eliminated misleading error messages during riva-build process.

Announcements#

  • NVIDIA Transfer Learning Toolkit (TLT) has been renamed to NVIDIA TAO Toolkit starting in the 1.5.0-beta release.

Known Issues#

  • NLP Question Answering functionality may cause a segmentation fault when using TensorRT files generated from the NeMo > Riva > RMIR > TensorRT path. This will be addressed in a future release.

Riva Release 1.4.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.4.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.

Compatibility#

For the latest data center and embedded compatibility software and hardware versions, refer to the Support Matrix.

Fixed Issues#

  • Minor stability improvements were made to the ASR and TTS services.

  • Exposed the model_name parameter in the nlp_classify_tokens sample client.

  • Fixed an issue with the ASR language model hyperparameter tuning tool.

Announcements#

  • The Jarvis framework has been renamed to Riva starting in the 1.4.0-beta release. Jarvis Speech Skills has been renamed to Riva. Documentation, scripts, and commands have been updated accordingly.

    • The Jarvis API is supported but deprecated beginning with this release. It will be removed in a future release. Old Jarvis clients are expected to work as-is with this version of Riva; however, users will need to migrate to the Riva API after the Jarvis API is removed.

    • The Riva API modifies the following service names:

      • JarvisASR -> RivaSpeechRecognition

      • JarvisNLP -> RivaLanguageUnderstanding

      • JarvisCoreNLP -> RivaLanguageUnderstanding

      • JarvisTTS -> RivaSpeechSynthesis

    • jarvis-build and jarvis-deploy commands have been replaced with the equivalent riva-build and riva-deploy commands.

  • The riva-build command parameters for ASR pipelines have changed.

    • The --lm_decoder_cpu parameter is deprecated. Replace --lm_decoder_cpu.decoder_type=<decoder_type> with --decoder_type=<decoder_type> and replace --lm_decoder_cpu.<param_name>=<param_value> with --<decoder_type>_decoder.<param_name>=<param_value>. For example, instead of using --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1, use --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

    • The type of decoder to use must be explicitly set by using --decoder_type=<decoder_type> where <decoder_type> must be one of greedy, os2s, flashlight, or kaldi.

    Refer to ASR Pipeline Configuration for example riva-build commands to use with different acoustic models.
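
The parameter renaming described above is mechanical, so existing build scripts can be migrated with a small helper. The following Python sketch rewrites a list of old-style arguments according to the two replacement rules. The function name is hypothetical, and the sketch assumes that every argument uses the --name=value form; it handles only the --lm_decoder_cpu parameters and leaves all other arguments untouched.

```python
from typing import List

OLD_PREFIX = "--lm_decoder_cpu."
DECODER_TYPES = ("greedy", "os2s", "flashlight", "kaldi")


def migrate_decoder_flags(args: List[str]) -> List[str]:
    """Rewrite deprecated --lm_decoder_cpu.* arguments to the new form."""
    if not any(arg.startswith(OLD_PREFIX) for arg in args):
        return list(args)  # nothing to migrate

    # The decoder type determines the prefix of every other new parameter.
    decoder_type = None
    for arg in args:
        if arg.startswith((OLD_PREFIX + "decoder_type=", "--decoder_type=")):
            decoder_type = arg.split("=", 1)[1]
    if decoder_type not in DECODER_TYPES:
        raise ValueError(
            f"decoder type must be one of {', '.join(DECODER_TYPES)}")

    migrated = []
    for arg in args:
        if not arg.startswith(OLD_PREFIX):
            migrated.append(arg)
        elif arg.startswith(OLD_PREFIX + "decoder_type="):
            migrated.append(f"--decoder_type={decoder_type}")
        else:
            # --lm_decoder_cpu.<param>=<value> -> --<type>_decoder.<param>=<value>
            migrated.append(f"--{decoder_type}_decoder.{arg[len(OLD_PREFIX):]}")
    return migrated


old = ["--lm_decoder_cpu.decoder_type=greedy", "--lm_decoder_cpu.asr_model_delay=-1"]
print(" ".join(migrate_decoder_flags(old)))
```

Applied to the example from the text, the helper produces --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.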

Riva Release 1.3.0 Beta#

This is a beta release. All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Note

Users upgrading to 1.3.0 Beta from previous versions must rerun jarvis-build for existing models. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.