Release Notes¶
Riva Speech Skills 1.4.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.4.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.
Announcements¶
The Jarvis framework has been renamed to Riva starting in the 1.4.0-beta release. Jarvis Speech Skills has been renamed to Riva Speech Skills. Documentation, scripts, and commands have been updated accordingly.
The Jarvis API is supported but deprecated beginning with this release. It will be removed in a future release. Old Jarvis clients are expected to work as-is with this version of Riva Speech Skills, however, users will need to migrate to the Riva API after the Jarvis API is removed.
The Riva API modifies the following service names:
JarvisASR -> RivaSpeechRecognition
JarvisNLP -> RivaLanguageUnderstanding
JarvisCoreNLP -> RivaLanguageUnderstanding
JarvisTTS -> RivaSpeechSynthesis
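For client code that constructs service names as strings, the rename table above can be captured in a simple lookup. This is only an illustrative sketch; the actual gRPC proto package paths are not shown here.

```python
# Old Jarvis service names mapped to their Riva replacements,
# as listed in the 1.4.0-beta release notes.
SERVICE_RENAMES = {
    "JarvisASR": "RivaSpeechRecognition",
    "JarvisNLP": "RivaLanguageUnderstanding",
    "JarvisCoreNLP": "RivaLanguageUnderstanding",
    "JarvisTTS": "RivaSpeechSynthesis",
}

def riva_service_name(name):
    """Return the Riva service name for a deprecated Jarvis one;
    names that are not deprecated pass through unchanged."""
    return SERVICE_RENAMES.get(name, name)
```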
The jarvis-build and jarvis-deploy commands have been replaced with the equivalent riva-build and riva-deploy commands.
The riva-build command parameters for ASR pipelines have changed. The --lm_decoder_cpu parameter is deprecated. Replace --lm_decoder_cpu.decoder_type=<decoder_type> with --decoder_type=<decoder_type>, and replace --lm_decoder_cpu.<param_name>=<param_value> with --<decoder_type>_decoder.<param_name>=<param_value>. For example, instead of using --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1, use --decoder_type=greedy --greedy_decoder.asr_model_delay=-1. The type of decoder to use must be explicitly set using --decoder_type=<decoder_type>, where <decoder_type> must be one of greedy, os2s, flashlight, or kaldi.
Refer to ASR Pipeline Configuration for example riva-build commands to use with different acoustic models.
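As a rough illustration of the flag mapping described above, the following hypothetical helper rewrites a list of deprecated --lm_decoder_cpu.* flags into the 1.4.0 form. It is plain string rewriting and does not validate anything against riva-build itself.

```python
def migrate_decoder_flags(old_flags):
    """Rewrite deprecated --lm_decoder_cpu.* flags into the new
    --decoder_type / --<decoder_type>_decoder.* style."""
    PREFIX = "--lm_decoder_cpu."
    # First pass: find the decoder type, which names the new flag group.
    decoder_type = None
    for flag in old_flags:
        if flag.startswith(PREFIX + "decoder_type="):
            decoder_type = flag.split("=", 1)[1]
    if decoder_type not in ("greedy", "os2s", "flashlight", "kaldi"):
        raise ValueError("decoder_type must be one of: greedy, os2s, flashlight, kaldi")
    # Second pass: rewrite each remaining deprecated flag in place.
    new_flags = [f"--decoder_type={decoder_type}"]
    for flag in old_flags:
        if flag.startswith(PREFIX + "decoder_type="):
            continue  # already emitted as --decoder_type=...
        if flag.startswith(PREFIX):
            new_flags.append(f"--{decoder_type}_decoder." + flag[len(PREFIX):])
        else:
            new_flags.append(flag)  # unrelated flags pass through
    return new_flags
```

For the example from the notes, this turns --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1 into --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.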
Bug Fixes¶
Minor stability improvements were made to the ASR and TTS services.
Exposed the model_name parameter in the nlp_classify_tokens sample client.
Fixed an issue with the ASR language model hyperparameter tuning tool.
Jarvis Speech Skills 1.3.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.3.0 Beta from previous versions must rerun jarvis-build for existing models. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Added support for FastPitch and HiFi-GAN TTS models, improving both quality and inference speed over previous versions. This model architecture is now the default for Jarvis.
Added improved text normalization capabilities for text-to-speech service.
Introduced the new nemo2jarvis tool to enable easier deployment of models trained with NVIDIA NeMo.
Added a new Virtual Assistant (with Google Dialogflow) sample.
Bug fixes¶
Fixed issue in Python ASR sample clients that could result in truncated intermediate transcripts.
Miscellaneous stability improvements for ServiceMaker.
Known Issues¶
In NLP question answering, sequence lengths of up to 512 tokens are supported (the default models use a sequence length of 384). If the context is larger than this limit, only the last part of the context is used to find the answer. Future versions of Jarvis will address this issue.
The nemo2jarvis tool does not yet support all NLP models supported by Transfer Learning Toolkit + Jarvis. Currently supported models are ASR models, FastPitch, HiFi-GAN, and BERT-based Question Answering only.
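The question-answering context limit above can be sketched as a trailing truncation. Whitespace tokenization is a rough illustration only; the real models use a subword tokenizer, so actual token counts differ.

```python
def truncate_context(context, max_tokens=384):
    """Keep only the trailing max_tokens tokens of a QA context,
    mirroring the documented 'only the last part is used' behavior."""
    tokens = context.split()
    if len(tokens) <= max_tokens:
        return context  # short contexts pass through unchanged
    return " ".join(tokens[-max_tokens:])
```

In practice this means an answer located near the start of an overlong passage may be lost, so clients should split long documents before querying.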
Jarvis Speech Skills 1.2.1 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.2.x Beta from previous versions (1.1.x or older) must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
Bug fixes¶
Fixed an issue related to pulling models from NGC during Quickstart and Helm initialization.
Jarvis Speech Skills 1.2.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.2.0 Beta from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Added support for new CitriNet ASR acoustic models. This model architecture is now the default for Jarvis. New pretrained models are available with additional data compared to previous versions.
Added inverse text normalization for English by default to speech recognition output. This feature can be disabled at request time by setting verbatim_transcripts = True in RecognitionConfig. Future releases will support customization of the normalization and support for additional languages.
Improved the speed of the jarvis-build model deployment optimization step.
Added support for Megatron NLP models trained with TLT.
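The request-time toggle behaves roughly as sketched below. This is a toy stand-in: the real RecognitionConfig is a protobuf message rather than a keyword argument, and Jarvis's inverse text normalizer is far richer than the two-entry table here.

```python
# Hypothetical spoken-form -> written-form table standing in for the
# real English inverse text normalizer.
TOY_ITN = {"twenty": "20", "five": "5"}

def apply_itn(transcript, verbatim_transcripts=False):
    """Return the transcript as-is when verbatim output is requested,
    otherwise apply (toy) inverse text normalization word by word."""
    if verbatim_transcripts:
        return transcript  # raw ASR output, exactly as spoken
    return " ".join(TOY_ITN.get(w, w) for w in transcript.split())
```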
Bug fixes¶
Reduced host memory consumption during the jarvis-build process for most model architectures.
Fixed a compilation issue that could cause crashes on some older x86 CPUs.
Fixed a potential crash for some NLP and TTS input sequences.
Known Issues¶
Host memory required to optimize the WaveGlow network during jarvis-build is higher than in previous versions, and the build may fail on systems with limited system memory. Future versions of Jarvis will address this.
In NLP question answering, sequence lengths of up to 512 tokens are supported (the default models use a sequence length of 384). If the context is larger than this limit, only the last part of the context is used to find the answer. Future versions of Jarvis will address this issue.
Limitations¶
See Limitations.
Jarvis Speech Skills 1.1.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.1.0 Beta from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Added spans in Named Entity Recognition (NER) results to indicate the start/end characters of the entity in the original passage.
Intermediate Automatic Speech Recognition (ASR) transcripts now return multiple partial transcripts. API users can choose to concatenate multiple partial transcripts to get the lowest-latency results, or to filter based on the stability score to display only the portions of the transcript that are least likely to change.
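The two consumption strategies can be sketched as follows, assuming each streaming result carries a (transcript, stability) pair as described; the 0.9 threshold is an arbitrary example value, not an API default.

```python
def stable_transcript(partials, threshold=0.9):
    """Concatenate only the partial transcripts whose stability score
    meets the threshold: higher latency, but unlikely to be revised."""
    return "".join(text for text, stability in partials if stability >= threshold)

def eager_transcript(partials):
    """Concatenate every partial transcript for the lowest-latency view,
    accepting that the unstable tail may still change."""
    return "".join(text for text, _stability in partials)
```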
Bug fixes¶
For APIs that support specifying a model by name, an error message is now returned when an invalid model is requested.
Fixed a failure in jarvis-build for some text classification models trained with Transfer Learning Toolkit (TLT).
Fixed a crash in batched Natural Language Processing (NLP) APIs.
Limitations¶
ASR
To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.
Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.
NLP
Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.
The Punctuation model must be named jarvis_punctuation and only supports English text.
The NER model must be named jarvis_ner.
The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.
TTS
Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.
Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.
Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 with their own data, if available, using NeMo 1.0.0b4 or later.
Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.
We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4.
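One common client-side workaround for the 400-character request limit is to split long text at sentence boundaries before issuing TTS requests. The sketch below is not part of the Jarvis API, just one way a client might do the splitting.

```python
import re

def chunk_for_tts(text, limit=399):
    """Split text into chunks shorter than 400 characters, preferring
    sentence boundaries and hard-splitting only overlong sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for s in sentences:
        if len(s) > limit:
            # A single sentence over the limit: flush, then hard-split it.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(s[i:i + limit] for i in range(0, len(s), limit))
        elif current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}" if current else s
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own synthesis request.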
Scaling
Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.
Kubernetes Deployment
A single helm deployment assumes a single GPU type is used for the deployment.
Virtual Assistant sample
The provided samples are not complete chatbots, but are intended as simple examples of how to build basic task-oriented chatbots with Jarvis. Consequently, the intent classifier and slot filling models have been trained with small amounts of data and are not expected to be highly accurate.
The Jarvis NLP sample supports intents for weather, temperature, rain, humidity, sunny, cloudy and snowfall checks. It does not support general conversational queries or other domains.
Both the Jarvis NLP and Rasa NLU samples support only 1 slot for city. Neither takes into account the day associated with the query.
These samples support up to four concurrent users. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user cannot sustain more than four concurrent socket connections.
The chatbot application is not optimized for low latency in the case of multiple concurrent users.
Some erratic issues have been observed with the chatbot samples on the Firefox browser. The most common issue is the TTS output being taken in as input by ASR for certain microphone gain values.
Virtual Assistant (with Rasa)
The provided samples are not complete virtual assistants, but are intended as simple examples of how to build basic task-oriented chatbots with Jarvis. Consequently, the intent classifier and slot filling models have been trained with small amounts of data and are not expected to be highly accurate.
The Rasa virtual assistant sample supports intents for weather, temperature, rain, humidity, sunny, cloudy and snowfall checks. It does not support general conversational queries or other domains.
Both the Jarvis NLP and Rasa NLU samples support only 1 slot for city. Neither takes into account the day associated with the query.
Although the Rasa servers and the chatbot servers can be hosted on different machines, the provided code does not support independent scaling of the servers.
These samples support up to four concurrent users. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user cannot sustain more than four concurrent socket connections.
The Rasa virtual assistant is not optimized for low latency in case of multiple concurrent users.
Some erratic issues have been observed with the Rasa sample on the Firefox browser. The most common issue is the TTS output being taken in as input by ASR for certain microphone gain values.
Jarvis Speech Skills 1.0.0-b.3 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.0.0-b.3 from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Intent/Slot NLP models have an optional "contextual" mode that can be enabled or disabled by the jarvis-build model configuration tool. See Joint Intent and Slots.
Updated to TensorRT version 7.2.2.3.
Bugfixes¶
Improved logging in ServiceMaker when an invalid or missing encryption key is used.
Fixed an issue with jarvis-build when specifying language models with a relative path.
Fixed an issue with the Question Answering service that could return infinite confidence.
Removed an erroneous "GPUs unavailable" warning during model download in the Quick Start jarvis_init.sh.
Jarvis Speech Skills 1.0.0-b.2 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Key Features and Enhancements¶
Ease of use: Jarvis addresses the needs of multiple types of users, from data scientists and deep learning researchers to software developers and system administrators. It consists of a collection of modular, predefined functions, packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and different aspects of natural language understanding.
Flexibility: Jarvis is built to be modular. Depending on the desired application, a user can easily deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA's scalable microservices framework, Triton Inference Server.
Customizable: Jarvis makes it easy to create new models or fine-tune the provided models with end-user-owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models using the provided starting points.
Performance: Jarvis delivers cutting-edge latency, throughput, and accuracy across the NLP, ASR, and TTS services, enabling users to create high-performance AI services. Jarvis leverages NVIDIA's optimized inference toolkit, TensorRT, to ensure the highest possible performance for all models.
Bugfixes¶
Resolved a memory leak under high load.
Fixed occasional mishandling of currencies in TTS.
Changed clock handling to prevent sample clients from reporting negative latencies on certain systems.
Improved TTS throughput.
Compatibility¶
For the latest hardware and software compatibility support, refer to the Support Matrix.
Limitations¶
ASR
To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.
Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.
NLP
Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.
The Punctuation model must be named jarvis_punctuation and only supports English text.
The NER model must be named jarvis_ner.
The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.
Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.
TTS
Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.
Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.
Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 with their own data, if available, using NeMo 1.0.0b4 or later.
Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.
We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4.
Scaling
Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.
Kubernetes Deployment
A single helm deployment assumes a single GPU type is used for the deployment.
Jarvis Samples - Jarvis Virtual Assistant
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
The Chatbot application is not optimized for low latency with multiple concurrent users.
There are some erratic issues with the Chatbot on the Firefox browser, the most common of which is the TTS output being taken in as input by ASR at certain microphone gain values.
The Jarvis NLU pipeline and Rasa DM are not optimized for best accuracy.
Jarvis Samples - Jarvis Virtual Assistant (Rasa)
Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.
Both Jarvis NLP and Rasa NLU currently support only 1 slot for city. The day associated with the query is not taken into account when processing.
Although the Rasa servers and the Chatbot client servers can be hosted on different machines, they do not currently support independent scaling of the servers.
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
Neither the Rasa application nor the Chatbot application is optimized for low latency with multiple concurrent users.
The Rasa NLU pipeline and Rasa DM are not optimized for best accuracy; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
The Rasa NLU pipeline and Rasa DM are not optimized for best inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
Erratic issues have been observed with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.
Jarvis Samples - SpeechSquad
The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Hence, when executing the SpeechSquad client, the tracing.server_latency measurements will not be reported.
Jarvis Speech Skills 1.0.0-b.1 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Key Features and Enhancements¶
Ease of use: Jarvis addresses the needs of multiple types of users, from data scientists and deep learning researchers to software developers and system administrators. It consists of a collection of modular, predefined functions, packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and different aspects of natural language understanding.
Flexibility: Jarvis is built to be modular. Depending on the desired application, a user can easily deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA's scalable microservices framework, Triton Inference Server.
Customizable: Jarvis makes it easy to create new models or fine-tune the provided models with end-user-owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models using the provided starting points.
Performance: Jarvis delivers cutting-edge latency, throughput, and accuracy across the NLP, ASR, and TTS services, enabling users to create high-performance AI services. Jarvis leverages NVIDIA's optimized inference toolkit, TensorRT, to ensure the highest possible performance for all models.
Compatibility¶
For the latest hardware and software compatibility support, refer to the Support Matrix.
Known Issues¶
Observed a memory increase in the Jarvis ASR service over time when running at a continuously high load. To work around this issue, restart the service to free memory, or run within Kubernetes with a failover mechanism. We plan to resolve the issue in the next Jarvis release. For more details on the underlying gRPC issue, see triton-inference-server/server#2517.
Limitations¶
ASR
To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.
Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.
NLP
Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.
The Punctuation model must be named jarvis_punctuation and only supports English text.
The NER model must be named jarvis_ner.
The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.
Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.
TTS
Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.
Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.
Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 with their own data, if available, using NeMo 1.0.0b4 or later.
Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.
We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4.
Scaling
Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.
Kubernetes Deployment
A single helm deployment assumes a single GPU type is used for the deployment.
Jarvis Samples - Jarvis Virtual Assistant
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
The Chatbot application is not optimized for low latency with multiple concurrent users.
There are some erratic issues with the Chatbot on the Firefox browser, the most common of which is the TTS output being taken in as input by ASR at certain microphone gain values.
The Jarvis NLU pipeline and Rasa DM are not optimized for best accuracy.
Jarvis Samples - Jarvis Virtual Assistant (Rasa)
Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.
Both Jarvis NLP and Rasa NLU currently support only 1 slot for city. The day associated with the query is not taken into account when processing.
Although the Rasa servers and the Chatbot client servers can be hosted on different machines, they do not currently support independent scaling of the servers.
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
Neither the Rasa application nor the Chatbot application is optimized for low latency with multiple concurrent users.
The Rasa NLU pipeline and Rasa DM are not optimized for best accuracy; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
The Rasa NLU pipeline and Rasa DM are not optimized for best inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
Erratic issues have been observed with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.
Jarvis Samples - SpeechSquad
The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Hence, when executing the SpeechSquad client, the tracing.server_latency measurements will not be reported.