Release Notes¶
Riva Speech Skills 1.4.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.4.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run riva_clean.sh followed by riva_init.sh.
Announcements¶
The Jarvis framework has been renamed to Riva starting in the 1.4.0-beta release. Jarvis Speech Skills has been renamed to Riva Speech Skills. Documentation, scripts, and commands have been updated accordingly.
The Jarvis API is supported but deprecated beginning with this release. It will be removed in a future release. Old Jarvis clients are expected to work as-is with this version of Riva Speech Skills, however, users will need to migrate to the Riva API after the Jarvis API is removed.
The Riva API modifies the following service names:
JarvisASR -> RivaSpeechRecognition
JarvisNLP -> RivaLanguageUnderstanding
JarvisCoreNLP -> RivaLanguageUnderstanding
JarvisTTS -> RivaSpeechSynthesis
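For client code that constructs service names as strings, the rename table above can be captured in a simple lookup. This is only an illustrative sketch; the actual gRPC proto package paths are not shown here.

```python
# Old Jarvis service names mapped to their Riva replacements,
# as listed in the 1.4.0-beta release notes.
SERVICE_RENAMES = {
    "JarvisASR": "RivaSpeechRecognition",
    "JarvisNLP": "RivaLanguageUnderstanding",
    "JarvisCoreNLP": "RivaLanguageUnderstanding",
    "JarvisTTS": "RivaSpeechSynthesis",
}

def riva_service_name(name):
    """Return the Riva service name for a deprecated Jarvis one;
    names that are not deprecated pass through unchanged."""
    return SERVICE_RENAMES.get(name, name)
```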
The jarvis-build and jarvis-deploy commands have been replaced with the equivalent riva-build and riva-deploy commands.
The riva-build command parameters for ASR pipelines have changed. The --lm_decoder_cpu parameter is deprecated. Replace --lm_decoder_cpu.decoder_type=<decoder_type> with --decoder_type=<decoder_type>, and replace --lm_decoder_cpu.<param_name>=<param_value> with --<decoder_type>_decoder.<param_name>=<param_value>. For example, instead of using --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1, use --decoder_type=greedy --greedy_decoder.asr_model_delay=-1. The type of decoder to use must be explicitly set using --decoder_type=<decoder_type>, where <decoder_type> must be one of greedy, os2s, flashlight, or kaldi.
Refer to ASR Pipeline Configuration for example riva-build commands to use with different acoustic models.
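As a rough illustration of the flag mapping described above, the following hypothetical helper rewrites a list of deprecated --lm_decoder_cpu.* flags into the 1.4.0 form. It is plain string rewriting and does not validate anything against riva-build itself.

```python
def migrate_decoder_flags(old_flags):
    """Rewrite deprecated --lm_decoder_cpu.* flags into the new
    --decoder_type / --<decoder_type>_decoder.* style."""
    PREFIX = "--lm_decoder_cpu."
    # First pass: find the decoder type, which names the new flag group.
    decoder_type = None
    for flag in old_flags:
        if flag.startswith(PREFIX + "decoder_type="):
            decoder_type = flag.split("=", 1)[1]
    if decoder_type not in ("greedy", "os2s", "flashlight", "kaldi"):
        raise ValueError("decoder_type must be one of: greedy, os2s, flashlight, kaldi")
    # Second pass: rewrite each remaining deprecated flag in place.
    new_flags = [f"--decoder_type={decoder_type}"]
    for flag in old_flags:
        if flag.startswith(PREFIX + "decoder_type="):
            continue  # already emitted as --decoder_type=...
        if flag.startswith(PREFIX):
            new_flags.append(f"--{decoder_type}_decoder." + flag[len(PREFIX):])
        else:
            new_flags.append(flag)  # unrelated flags pass through
    return new_flags
```

For the example from the notes, this turns --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1 into --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.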
Bug Fixes¶
Minor stability improvements were made to the ASR and TTS services.
Exposed the model_name parameter in the nlp_classify_tokens sample client.
Fixed an issue with the ASR language model hyperparameter tuning tool.
Jarvis Speech Skills 1.3.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.3.0 Beta from previous versions must rerun jarvis-build for existing models. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Added support for FastPitch and HiFi-GAN TTS models, improving both quality and inference speed over previous versions. This model architecture is now the default for Jarvis.
Added improved text normalization capabilities for text-to-speech service.
Introduced the new nemo2jarvis tool to enable easier deployment of models trained with NVIDIA NeMo.
Added a new Virtual Assistant (with Google Dialogflow) sample.
Bug fixes¶
Fixed issue in Python ASR sample clients that could result in truncated intermediate transcripts.
Miscellaneous stability improvements for ServiceMaker.
Known Issues¶
In NLP question answering, sequence lengths of up to 512 tokens are supported (the default models use a sequence length of 384). If the context is larger than this limit, only the last part of the context is used to find the answer. Future versions of Jarvis will address this issue.
The nemo2jarvis tool does not yet support all NLP models supported by Transfer Learning Toolkit + Jarvis. Currently supported models are ASR models, FastPitch, HiFi-GAN, and BERT-based Question Answering only.
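The question-answering context limit above can be sketched as a trailing truncation. Whitespace tokenization is a rough illustration only; the real models use a subword tokenizer, so actual token counts differ.

```python
def truncate_context(context, max_tokens=384):
    """Keep only the trailing max_tokens tokens of a QA context,
    mirroring the documented 'only the last part is used' behavior."""
    tokens = context.split()
    if len(tokens) <= max_tokens:
        return context  # short contexts pass through unchanged
    return " ".join(tokens[-max_tokens:])
```

In practice this means an answer located near the start of an overlong passage may be lost, so clients should split long documents before querying.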
Jarvis Speech Skills 1.2.1 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.2.x Beta from previous versions (1.1.x or older) must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
Bug fixes¶
Fixed an issue related to pulling models from NGC during Quickstart and Helm initialization.
Jarvis Speech Skills 1.2.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.2.0 Beta from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Added support for new CitriNet ASR acoustic models. This model architecture is now the default for Jarvis. New pretrained models are available with additional data compared to previous versions.
Added inverse text normalization for English by default to speech recognition output. This feature can be disabled at request time by setting verbatim_transcripts = True in RecognitionConfig. Future releases will support customization of the normalization and support for additional languages.
Improved the speed of the jarvis-build model deployment optimization step.
Added support for Megatron NLP models trained with TLT.
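The request-time toggle behaves roughly as sketched below. This is a toy stand-in: the real RecognitionConfig is a protobuf message rather than a keyword argument, and Jarvis's inverse text normalizer is far richer than the two-entry table here.

```python
# Hypothetical spoken-form -> written-form table standing in for the
# real English inverse text normalizer.
TOY_ITN = {"twenty": "20", "five": "5"}

def apply_itn(transcript, verbatim_transcripts=False):
    """Return the transcript as-is when verbatim output is requested,
    otherwise apply (toy) inverse text normalization word by word."""
    if verbatim_transcripts:
        return transcript  # raw ASR output, exactly as spoken
    return " ".join(TOY_ITN.get(w, w) for w in transcript.split())
```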
Bug fixes¶
Reduced host memory consumption during the jarvis-build process for most model architectures.
Fixed a compilation issue that could cause crashes on some older x86 CPUs.
Fixed a potential crash for some NLP and TTS input sequences.
Known Issues¶
Host memory required to optimize the WaveGlow network during jarvis-build is higher than in previous versions, and the build may fail on systems with limited system memory. Future versions of Jarvis will address this.
In NLP question answering, sequence lengths of up to 512 tokens are supported (the default models use a sequence length of 384). If the context is larger than this limit, only the last part of the context is used to find the answer. Future versions of Jarvis will address this issue.
Limitations¶
See Limitations.
Jarvis Speech Skills 1.1.0 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.1.0 Beta from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Added spans in Named Entity Recognition (NER) results to indicate the start/end characters of the entity in the original passage.
Intermediate Automatic Speech Recognition (ASR) transcripts now return multiple partial transcripts. API users can choose to concatenate multiple partial transcripts to get the lowest-latency results, or to filter based on the stability score to display only the portions of the transcript that are least likely to change.
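The two consumption strategies can be sketched as follows, assuming each streaming result carries a (transcript, stability) pair as described; the 0.9 threshold is an arbitrary example value, not an API default.

```python
def stable_transcript(partials, threshold=0.9):
    """Concatenate only the partial transcripts whose stability score
    meets the threshold: higher latency, but unlikely to be revised."""
    return "".join(text for text, stability in partials if stability >= threshold)

def eager_transcript(partials):
    """Concatenate every partial transcript for the lowest-latency view,
    accepting that the unstable tail may still change."""
    return "".join(text for text, _stability in partials)
```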
Bug fixes¶
For APIs that support specifying a model by name, an error message is now returned when an invalid model is requested.
Fixed a failure in jarvis-build for some text classification models trained with Transfer Learning Toolkit (TLT).
Fixed a crash in batched Natural Language Processing (NLP) APIs.
Limitations¶
ASR
To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.
Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.
NLP
Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.
The Punctuation model must be named jarvis_punctuation and only supports English text.
The NER model must be named jarvis_ner.
The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.
TTS
Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.
Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.
Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 with their own data, if available, using NeMo 1.0.0b4 or later.
Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.
We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4.
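One common client-side workaround for the 400-character request limit is to split long text at sentence boundaries before issuing TTS requests. The sketch below is not part of the Jarvis API, just one way a client might do the splitting.

```python
import re

def chunk_for_tts(text, limit=399):
    """Split text into chunks shorter than 400 characters, preferring
    sentence boundaries and hard-splitting only overlong sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for s in sentences:
        if len(s) > limit:
            # A single sentence over the limit: flush, then hard-split it.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(s[i:i + limit] for i in range(0, len(s), limit))
        elif current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}" if current else s
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own synthesis request.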
Scaling
Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.
Kubernetes Deployment
A single helm deployment assumes a single GPU type is used for the deployment.
Virtual Assistant sample
The provided samples are not complete chatbots, but are intended as simple examples of how to build basic task-oriented chatbots with Jarvis. Consequently, the intent classifier and slot filling models have been trained with small amounts of data and are not expected to be highly accurate.
The Jarvis NLP sample supports intents for weather, temperature, rain, humidity, sunny, cloudy and snowfall checks. It does not support general conversational queries or other domains.
Both the Jarvis NLP and Rasa NLU samples support only 1 slot for city. Neither takes into account the day associated with the query.
These samples support up to four concurrent users. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user cannot sustain more than four concurrent socket connections.
The chatbot application is not optimized for low latency in the case of multiple concurrent users.
Some erratic issues have been observed with the chatbot samples on the Firefox browser. The most common issue is the TTS output being taken in as input by ASR for certain microphone gain values.
Virtual Assistant (with Rasa)
The provided samples are not complete virtual assistants, but are intended as simple examples of how to build basic task-oriented chatbots with Jarvis. Consequently, the intent classifier and slot filling models have been trained with small amounts of data and are not expected to be highly accurate.
The Rasa virtual assistant sample supports intents for weather, temperature, rain, humidity, sunny, cloudy and snowfall checks. It does not support general conversational queries or other domains.
Both the Jarvis NLP and Rasa NLU samples support only 1 slot for city. Neither takes into account the day associated with the query.
Although the Rasa servers and the chatbot servers can be hosted on different machines, the provided code does not support independent scaling of the servers.
These samples support up to four concurrent users. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user cannot sustain more than four concurrent socket connections.
The Rasa virtual assistant is not optimized for low latency in case of multiple concurrent users.
Some erratic issues have been observed with the Rasa sample on the Firefox browser. The most common issue is the TTS output being taken in as input by ASR for certain microphone gain values.
Jarvis Speech Skills 1.0.0-b.3 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Note
Users upgrading to 1.0.0-b.3 from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run jarvis_clean.sh followed by jarvis_init.sh.
New Features / Enhancements¶
Intent/Slot NLP models have an optional "contextual" mode that can be enabled or disabled by the jarvis-build model configuration tool. See Joint Intent and Slots.
Updated to TensorRT version 7.2.2.3.
Bugfixes¶
Improved logging in ServiceMaker when an invalid or missing encryption key is used.
Fixed an issue with jarvis-build when specifying language models with a relative path.
Fixed an issue with the Question Answering service that could return infinite confidence.
Removed an erroneous "GPUs unavailable" warning during model download in the Quick Start jarvis_init.sh.
Jarvis Speech Skills 1.0.0-b.2 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Key Features and Enhancements¶
Ease of use: Jarvis addresses the needs of multiple types of users, from data scientists and deep learning researchers to software developers and system administrators. It consists of a collection of modular, predefined functions, packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and different aspects of natural language understanding.
Flexibility: Jarvis is built to be modular. Depending on the desired application, a user can easily deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA's scalable microservices framework, Triton Inference Server.
Customizable: Jarvis makes it easy to create new models or fine-tune the provided models with end-user-owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models using the provided starting points.
Performance: Jarvis delivers cutting-edge latency, throughput, and accuracy across the NLP, ASR, and TTS services, enabling users to create high-performance AI services. Jarvis leverages NVIDIA's optimized inference toolkit, TensorRT, to ensure the highest possible performance for all models.
Bugfixes¶
Resolved a memory leak under high load.
Fixed occasional mishandling of currencies in TTS.
Changed clock handling to prevent sample clients from reporting negative latencies on certain systems.
Improved TTS throughput.
Compatibility¶
For the latest hardware and software compatibility support, refer to the Support Matrix.
Limitations¶
ASR
To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.
Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.
NLP
Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.
The Punctuation model must be named jarvis_punctuation and only supports English text.
The NER model must be named jarvis_ner.
The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.
Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.
TTS
Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.
Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.
Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 with their own data, if available, using NeMo 1.0.0b4 or later.
Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.
We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4.
Scaling
Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.
Kubernetes Deployment
A single helm deployment assumes a single GPU type is used for the deployment.
Jarvis Samples - Jarvis Virtual Assistant
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
The Chatbot application is not optimized for low latency with multiple concurrent users.
There are some erratic issues with the Chatbot on the Firefox browser, the most common of which is the TTS output being taken in as input by ASR at certain microphone gain values.
The Jarvis NLU pipeline and Rasa DM are not optimized for best accuracy.
Jarvis Samples - Jarvis Virtual Assistant (Rasa)
Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.
Both Jarvis NLP and Rasa NLU currently support only 1 slot for city. The day associated with the query is not taken into account when processing.
Although the Rasa servers and the Chatbot client servers can be hosted on different machines, they do not currently support independent scaling of the servers.
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
Neither the Rasa application nor the Chatbot application is optimized for low latency with multiple concurrent users.
The Rasa NLU pipeline and Rasa DM are not optimized for best accuracy; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
The Rasa NLU pipeline and Rasa DM are not optimized for best inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
Erratic issues have been observed with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.
Jarvis Samples - SpeechSquad
The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Hence, when executing the SpeechSquad client, the tracing.server_latency measurements will not be reported.
Jarvis Speech Skills 1.0.0-b.1 Beta¶
This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our developer forum.
Key Features and Enhancements¶
Ease of use: Jarvis addresses the needs of multiple types of users, from data scientists and deep learning researchers to software developers and system administrators. It consists of a collection of modular, predefined functions, packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and different aspects of natural language understanding.
Flexibility: Jarvis is built to be modular. Depending on the desired application, a user can easily deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA's scalable microservices framework, Triton Inference Server.
Customizable: Jarvis makes it easy to create new models or fine-tune the provided models with end-user-owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models using the provided starting points.
Performance: Jarvis delivers cutting-edge latency, throughput, and accuracy across the NLP, ASR, and TTS services, enabling users to create high-performance AI services. Jarvis leverages NVIDIA's optimized inference toolkit, TensorRT, to ensure the highest possible performance for all models.
Compatibility¶
For the latest hardware and software compatibility support, refer to the Support Matrix.
Known Issues¶
Observed a memory increase in the Jarvis ASR service over time when running at a continuously high load. To work around this issue, restart the service to free memory, or run within Kubernetes with a failover mechanism. We plan to resolve the issue in the next Jarvis release. For more details on the underlying gRPC issue, see triton-inference-server/server#2517.
Limitations¶
ASR
To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.
Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.
NLP
Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.
The Punctuation model must be named jarvis_punctuation and only supports English text.
The NER model must be named jarvis_ner.
The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.
Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.
TTS
Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.
Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.
Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 with their own data, if available, using NeMo 1.0.0b4 or later.
Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.
We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4.
Scaling
Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.
Kubernetes Deployment
A single helm deployment assumes a single GPU type is used for the deployment.
Jarvis Samples - Jarvis Virtual Assistant
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
The Chatbot application is not optimized for low latency with multiple concurrent users.
There are some erratic issues with the Chatbot on the Firefox browser, the most common of which is the TTS output being taken in as input by ASR at certain microphone gain values.
The Jarvis NLU pipeline and Rasa DM are not optimized for best accuracy.
Jarvis Samples - Jarvis Virtual Assistant (Rasa)
Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.
Both Jarvis NLP and Rasa NLU currently support only 1 slot for city. The day associated with the query is not taken into account when processing.
Although the Rasa servers and the Chatbot client servers can be hosted on different machines, they do not currently support independent scaling of the servers.
Up to 4 concurrent users are supported. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) being used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.
Neither the Rasa application nor the Chatbot application is optimized for low latency with multiple concurrent users.
The Rasa NLU pipeline and Rasa DM are not optimized for best accuracy; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
The Rasa NLU pipeline and Rasa DM are not optimized for best inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.
Erratic issues have been observed with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.
Jarvis Samples - SpeechSquad
The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Hence, when executing the SpeechSquad client, the tracing.server_latency measurements will not be reported.