Release Notes

Riva Speech Skills 1.9.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.9.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Improved customization for Automatic Speech Recognition (ASR) Spanish (es-US) and German (de-DE) language models.

  • The rate SSML attribute supports x-low, low, medium, high, x-high, and default.

  • The pitch SSML attribute supports x-low, low, medium, high, x-high, and default.
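As an illustration of the named rate and pitch values above, the following is a minimal Python sketch that builds an SSML string for a TTS request. The helper name build_prosody_ssml is hypothetical, and wrapping the text in a <speak> root element is an assumption taken from the SSML specification rather than from these notes.

```python
from xml.sax.saxutils import escape

# Named values listed in the 1.9.0 release notes for both attributes.
NAMED_LEVELS = {"x-low", "low", "medium", "high", "x-high", "default"}


def build_prosody_ssml(text, rate="default", pitch="default"):
    """Wrap text in a <prosody> element using named rate and pitch levels."""
    for name, value in (("rate", rate), ("pitch", pitch)):
        if value not in NAMED_LEVELS:
            raise ValueError(f"unsupported {name} value: {value!r}")
    # Escape &, <, and > so the text cannot break the markup.
    body = escape(text)
    # Assumption: the request text is wrapped in a <speak> root element.
    return f'<speak><prosody rate="{rate}" pitch="{pitch}">{body}</prosody></speak>'


print(build_prosody_ssml("Hello from Riva.", rate="low", pitch="high"))
```

The resulting string would be passed as the text of a synthesis request; how the request itself is sent depends on the client in use.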

Known Issues

  • The pre-trained model used to add punctuation and capitalization to ASR transcripts supports a maximum input length of 128 tokens. Currently, if an ASR transcript containing more than 128 tokens is passed to the punctuation and capitalization model, it will be truncated to 128 tokens. This will be addressed in a future release of Riva.

  • The pitch SSML attribute does not yet comply with the SSML specification and does not support Hz, st, or % changes.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Speech Skills 1.8.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.8.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Released new pretrained models for German (de-DE), Russian (ru-RU), and Spanish (es-US) speech recognition.

  • Increased recognition accuracy of English (en-US) speech recognition models.

  • Introduced partial support for Speech Synthesis Markup Language (SSML) within the TTS API. Support has been added for pitch and rate attributes of the <prosody> tag to control pitch and duration of synthesized speech. Additional SSML support is planned for future releases.

  • Added word boosting support to the Speech Recognition API to bias the ASR engine toward recognizing particular words of interest at request time. This release is limited to boosting of in-vocabulary words; out-of-vocabulary word boosting will be available in an upcoming release.

  • Minor ASR inference speed improvements in online mode.

  • Improved offline ASR recognition accuracy.

  • Added support for the Automatic Speech Recognition (ASR) Conformer-CTC model. The Conformer-CTC model is a non-autoregressive variant of the Conformer model for ASR which uses CTC loss/decoding instead of Transducer.

Bug fixes

  • Fixed an issue in the TTS pipeline that could sometimes cause an audible ‘pop’ at the end of an utterance.

Known Issues

  • The pre-trained model used to add punctuation and capitalization to ASR transcripts supports a maximum input length of 128 tokens. Currently, if an ASR transcript containing more than 128 tokens is passed to the punctuation and capitalization model, it will be truncated to 128 tokens. This will be addressed in a future release of Riva.

  • The rate SSML attribute does not support x-low, low, medium, high, x-high, or default.

  • The pitch SSML attribute does not yet comply with the SSML specification: it does not support Hz, st, or % changes, nor does it support x-low, low, medium, high, x-high, or default.

  • When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Riva Speech Skills 1.7.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.7.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Added support for models trained by NVIDIA TAO Toolkit 21.11.

  • Riva Streaming TTS now supports resampling, if necessary, to match the requested audio sample rate.

  • Default Riva English ASR model updated with higher accuracy.

  • Minor improvements in English text normalization and inverse text normalization models.

  • Increased maximum message size to support larger audio inputs in offline ASR.

Bug fixes

  • Fixed minor issues that could cause the synthesized audio generated by the TTS service to be prematurely truncated.

  • Fixed issue related to custom pronunciations being mishandled by text normalization for TTS.

Known Issues

  • When running the nemo2riva package with EFF version 0.5.2, an ignored exception warning is printed. This should not affect functionality of the generated .riva models. This will be addressed in a future release of EFF.

  • During ASR pipeline execution, inverse text normalization will not convert digits into numerals (one->1) unless there are ten digits. This limitation will be addressed in a future version of Riva.

  • The punctuation pipeline does not support unicode character input. This will be fixed in the next release.

Riva Speech Skills 1.6.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.6.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • The Riva TTS service no longer limits input strings to 400 characters.

  • Updated the performance page of the documentation to include performance of Citrinet and FastPitch+HiFi-GAN models.

Bug fixes

  • Fixed minor issues that could cause the synthesized audio generated by the TTS service to be prematurely truncated.

Known Issues

  • Riva build does not support providing a 1-gram language model in .arpa format. This is due to a limitation in the KenLM utility to build language model binaries.

  • NLP Question Answering functionality may cause a segmentation fault when using TensorRT files generated from the NeMo -> Riva -> RMIR -> TensorRT path. This will be addressed in a future release.

Riva Speech Skills 1.5.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.5.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run followed by


  • NVIDIA Transfer Learning Toolkit (TLT) has been renamed to NVIDIA TAO Toolkit starting in the 1.5.0-beta release.

New Features / Enhancements

  • Support for training n-gram language models for ASR has been added to TAO Toolkit. These language models are fully supported in Riva.

  • FastPitch now leverages Tensor Cores for improved inference performance.

  • nemo2riva now provides a warning when attempting to convert unsupported models.

  • Minor enhancements were made to cover additional cases in text normalization/inverse text normalization for English.

Bug fixes

  • Fixed failure in Quickstart for some versions of the NGC client.

  • Fixed minor issues that could cause occasional artifacts or reduced quality in TTS generated audio.

  • Eliminated misleading error messages during riva-build process.

Known Issues

  • NLP Question Answering functionality may cause a segmentation fault when using TensorRT files generated from the NeMo -> Riva -> RMIR -> TensorRT path. This will be addressed in a future release.

Riva Speech Skills 1.4.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.4.0 Beta from previous versions must rerun riva-build for existing models. Those using the Quick Start tool should run followed by


  • The Jarvis framework has been renamed to Riva starting in the 1.4.0-beta release. Jarvis Speech Skills has been renamed to Riva Speech Skills. Documentation, scripts, and commands have been updated accordingly.

    • The Jarvis API is supported but deprecated beginning with this release. It will be removed in a future release. Old Jarvis clients are expected to work as-is with this version of Riva Speech Skills; however, users will need to migrate to the Riva API before the Jarvis API is removed.

    • The Riva API modifies the following service names:

      • JarvisASR -> RivaSpeechRecognition

      • JarvisNLP -> RivaLanguageUnderstanding

      • JarvisCoreNLP -> RivaLanguageUnderstanding

      • JarvisTTS -> RivaSpeechSynthesis

    • jarvis-build and jarvis-deploy commands have been replaced with the equivalent riva-build and riva-deploy commands.

  • The riva-build command parameters for ASR pipelines have changed.

    • The --lm_decoder_cpu parameter is deprecated. Replace --lm_decoder_cpu.decoder_type=<decoder_type> with --decoder_type=<decoder_type> and replace --lm_decoder_cpu.<param_name>=<param_value> with --<decoder_type>_decoder.<param_name>=<param_value>. For example, instead of using --lm_decoder_cpu.decoder_type=greedy --lm_decoder_cpu.asr_model_delay=-1, use --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

    • The type of decoder to use must be explicitly set by using --decoder_type=<decoder_type> where <decoder_type> must be one of greedy, os2s, flashlight, or kaldi.

    Refer to ASR Pipeline Configuration for example riva-build commands to use with different acoustic models.
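The parameter renaming described above can be sketched as a small Python helper that rewrites an old argument list into the new form. This is only an illustration of the rule stated in these notes; the function name migrate_asr_build_args is hypothetical, and the helper does not invoke riva-build or know which parameters a given decoder accepts.

```python
VALID_DECODER_TYPES = {"greedy", "os2s", "flashlight", "kaldi"}
OLD_PREFIX = "--lm_decoder_cpu."
OLD_TYPE_FLAG = OLD_PREFIX + "decoder_type="
NEW_TYPE_FLAG = "--decoder_type="


def migrate_asr_build_args(args):
    """Rewrite deprecated --lm_decoder_cpu.* flags into the 1.4.0 form."""
    # First pass: find the decoder type, which must be set explicitly.
    decoder_type = None
    for arg in args:
        if arg.startswith(OLD_TYPE_FLAG):
            decoder_type = arg[len(OLD_TYPE_FLAG):]
        elif arg.startswith(NEW_TYPE_FLAG):
            decoder_type = arg[len(NEW_TYPE_FLAG):]
    if decoder_type not in VALID_DECODER_TYPES:
        raise ValueError(f"decoder type must be one of {sorted(VALID_DECODER_TYPES)}")

    # Second pass: rename each deprecated flag, leaving other flags untouched.
    migrated = []
    for arg in args:
        if arg.startswith(OLD_TYPE_FLAG):
            migrated.append(NEW_TYPE_FLAG + decoder_type)
        elif arg.startswith(OLD_PREFIX):
            migrated.append(f"--{decoder_type}_decoder." + arg[len(OLD_PREFIX):])
        else:
            migrated.append(arg)
    return migrated


old = ["--lm_decoder_cpu.decoder_type=greedy", "--lm_decoder_cpu.asr_model_delay=-1"]
print(" ".join(migrate_asr_build_args(old)))
```

Running the helper on the example from the text yields the replacement flags given there.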

Bug fixes

  • Minor stability improvements were made to the ASR and TTS services.

  • Exposed the model_name parameter in the nlp_classify_tokens sample client.

  • Fixed an issue with the ASR language model hyperparameter tuning tool.

Jarvis Speech Skills 1.3.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.3.0 Beta from previous versions must rerun jarvis-build for existing models. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Added support for FastPitch and HifiGan TTS models, improving both quality and inference speed over previous versions. This model architecture is now the default for Jarvis.

  • Added improved text normalization capabilities for text-to-speech service.

  • Introduced new nemo2jarvis tool to enable easier deployment of models trained with NVIDIA NeMo.

  • Added a new Virtual Assistant (with Google Dialogflow) sample.

Bug fixes

  • Fixed issue in Python ASR sample clients that could result in truncated intermediate transcripts.

  • Miscellaneous stability improvements for ServiceMaker.

Known Issues

  • In NLP question answering, sequence lengths of up to 512 tokens are supported (the default models use a sequence length of 384). If the context is longer than the limit, only the last part of the context is used to find the answer. Future versions of Jarvis will address this issue.

  • nemo2jarvis tool does not yet support all NLP models supported by Transfer Learning Toolkit + Jarvis. Currently supported models include: ASR models, FastPitch, HiFi-GAN, and BERT-based Question Answering only.

Jarvis Speech Skills 1.2.1 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.2.x Beta from previous versions (1.1.x or older) must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run followed by

Bug fixes

  • Fixed an issue related to pulling models from NGC during Quickstart and Helm initialization.

Jarvis Speech Skills 1.2.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.2.0 Beta from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Added support for new CitriNet ASR acoustic models. This model architecture is now the default for Jarvis. New pretrained models are available with additional data compared to previous versions.

  • Added inverse text normalization for English by default to speech recognition output. This feature can be disabled at request-time by setting verbatim_transcripts = True in RecognitionConfig. Future releases will support customization of the normalization and support for additional languages.

  • Improved the speed of the jarvis-build model optimization step.

  • Added support for Megatron NLP models trained with TLT.

Bug fixes

  • Reduced host memory consumption during jarvis-build process for most model architectures.

  • Fixed a compilation issue that could cause crashes on some older x86 CPUs.

  • Fixed a potential crash for some NLP and TTS input sequences.

Known Issues

  • The host memory required to optimize the WaveGlow network during jarvis-build is higher than in previous versions, and the build may fail on systems with limited system memory. Future versions of Jarvis will address this.

  • In NLP question answering, sequence lengths of up to 512 tokens are supported (the default models use a sequence length of 384). If the context is longer than the limit, only the last part of the context is used to find the answer. Future versions of Jarvis will address this issue.


See Limitations.

Jarvis Speech Skills 1.1.0 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.1.0 Beta from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Added spans in Named Entity Recognition (NER) results to indicate the start/end characters of the entity in the original passage.

  • Intermediate Automatic Speech Recognition (ASR) transcripts now return multiple partial transcripts. API users can choose to concatenate multiple partial transcripts to get lowest-latency results, or to filter based on the stability score to display only the portions of the transcript that are least likely to change.
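The two ways of consuming partial transcripts described above can be sketched in Python as follows. The PartialTranscript class, its field names, the 0.0 to 1.0 stability range, the 0.8 threshold, and joining with a single space are all assumptions made for illustration; they are not the Jarvis API types.

```python
from dataclasses import dataclass


@dataclass
class PartialTranscript:
    # Hypothetical stand-in for one partial result in a streaming response.
    text: str
    stability: float  # assumed range: 0.0 (likely to change) to 1.0 (stable)


def lowest_latency_text(partials):
    """Concatenate every partial transcript, stable or not."""
    return " ".join(p.text.strip() for p in partials if p.text.strip())


def stable_text(partials, min_stability=0.8):
    """Keep only the partial transcripts that are unlikely to change."""
    return lowest_latency_text(p for p in partials if p.stability >= min_stability)


partials = [
    PartialTranscript("what is the weather", 0.9),
    PartialTranscript("in san", 0.1),
]
print(lowest_latency_text(partials))
print(stable_text(partials))
```

Displaying the concatenated text gives the fastest feedback, while the filtered text avoids showing words that may be revised by a later response.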

Bug fixes

  • For APIs that support specifying a model by name, an error message is now returned when an invalid model is requested.

  • Fixed a failure in jarvis-build for some text classification models trained with Transfer Learning Toolkit (TLT).

  • Fixed a crash in batched Natural Language Processing (NLP) APIs.




  • To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.

  • Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.


  • Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.

  • The Punctuation model must be named jarvis_punctuation and only supports English text.

  • The NER model must be named jarvis_ner.

  • The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.


  • Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.

  • Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.

  • Only a single voice and the English language are supported. However, for further customization, the user can train Tacotron 2 with their own data, if available, with NeMo 1.0.0b4 or later.

  • Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.

  • We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as it can affect performance on NVIDIA T4.
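For releases that enforce the 400-character limit on TTS requests, one possible client-side workaround is to split long text into smaller requests. The Python sketch below is an assumption about how a client might do this, not part of Jarvis; the function name split_for_tts and the sentence-boundary heuristic are hypothetical.

```python
import re
import textwrap

MAX_TTS_CHARS = 399  # requests must be less than 400 characters


def split_for_tts(text, max_chars=MAX_TTS_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single over-long sentence is wrapped on word boundaries instead.
        if len(sentence) > max_chars:
            pieces = textwrap.wrap(sentence, width=max_chars)
        else:
            pieces = [sentence]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


for chunk in split_for_tts("This is a sentence. " * 50):
    print(len(chunk))
```

Each chunk would be sent as its own synthesis request, and the returned audio concatenated in order by the client.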


  • Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.

Kubernetes Deployment

  • A single helm deployment assumes a single GPU type is used for the deployment.

Virtual Assistant sample

  • The provided samples are not complete chatbots, but are intended as simple examples of how to build basic task-oriented chatbots with Jarvis. Consequently, the intent classifier and slot filling models have been trained with small amounts of data and are not expected to be highly accurate.

  • The Jarvis NLP sample supports intents for weather, temperature, rain, humidity, sunny, cloudy and snowfall checks. It does not support general conversational queries or other domains.

  • Both the Jarvis NLP and Rasa NLU samples support only 1 slot for city. Neither takes into account the day associated with the query.

  • These samples support up to four concurrent users. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) that is being used. The socket connection used to stream audio to (TTS) and from (ASR) the user cannot sustain more than four concurrent socket connections.

  • The chatbot application is not optimized for low latency in the case of multiple concurrent users.

  • Some erratic issues have been observed with the chatbot samples on the Firefox browser. The most common issue is the TTS output being taken in as input by ASR for certain microphone gain values.

Virtual Assistant (with Rasa)

  • The provided samples are not complete virtual assistants, but are intended as simple examples of how to build basic task-oriented chatbots with Jarvis. Consequently, the intent classifier and slot filling models have been trained with small amounts of data and are not expected to be highly accurate.

  • The Rasa virtual assistant sample supports intents for weather, temperature, rain, humidity, sunny, cloudy and snowfall checks. It does not support general conversational queries or other domains.

  • Both the Jarvis NLP and Rasa NLU samples support only 1 slot for city. Neither takes into account the day associated with the query.

  • Although the Rasa servers and the chatbot servers can be hosted on different machines, the provided code does not support independent scaling of the servers.

  • These samples support up to four concurrent users. This restriction is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) that is being used. The socket connection used to stream audio to (TTS) and from (ASR) the user cannot sustain more than four concurrent socket connections.

  • The Rasa virtual assistant is not optimized for low latency in case of multiple concurrent users.

  • Some erratic issues have been observed with the Rasa sample on the Firefox browser. The most common issue is the TTS output being taken in as input by ASR for certain microphone gain values.

Jarvis Speech Skills 1.0.0-b.3 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.


Users upgrading to 1.0.0-b.3 from previous versions must rerun jarvis-build for existing models due to an updated version of TensorRT. Those using the Quick Start tool should run followed by

New Features / Enhancements

  • Intent/Slot NLP models have an optional “contextual” mode that can be enabled or disabled by the jarvis-build model configuration tool. See Joint Intent and Slots.

  • Updated to TensorRT version


  • Improved logging in ServiceMaker when an invalid or missing encryption key is used.

  • Fixed issue with jarvis-build when specifying language models with a relative path.

  • Fixed issue with Question Answering service that could return infinite confidence.

  • Removed erroneous “GPUs unavailable” warning during model download in Quickstart.

Jarvis Speech Skills 1.0.0-b.2 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Key Features and Enhancements

Ease of use: Jarvis addresses the needs of multiple types of users, from data scientists, deep learning researchers, and software developers to system administrators. It consists of a collection of modular, predefined functions, packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and different aspects of natural language understanding.

Flexibility: Jarvis is built to be modular. Depending on the desired application, a user can easily deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA’s scalable microservices framework, Triton Inference Server.

Customizable: Jarvis is built to make it easy to create new models or fine-tune the provided models with end-user owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models using the provided starting points.

Performance: Jarvis delivers low latency, high throughput, and high accuracy across the NLP, ASR, and TTS services so that users can build high-performance AI services. Jarvis leverages TensorRT, NVIDIA’s optimized inference toolkit, to ensure the highest possible performance of all models.


  • Memory leak under high load resolved.

  • Issue with occasional mishandling of currencies in TTS fixed.

  • Changed the clock used by sample clients to prevent negative latencies from being reported on certain systems.

  • Improved TTS throughput.


For the latest hardware and software compatibility support, refer to the Support Matrix.



  • To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.

  • Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.


  • Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.

  • The Punctuation model must be named jarvis_punctuation and only supports English text.

  • The NER model must be named jarvis_ner.

  • The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.

  • Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.


  • Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.

  • Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.

  • Only a single voice and the English language are supported. However, for further customization, the user can train Tacotron 2 with their own data, if available, with NeMo 1.0.0b4 or later.

  • Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.

  • We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as it can affect performance on NVIDIA T4.


  • Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.

Kubernetes Deployment

  • A single helm deployment assumes a single GPU type is used for the deployment.

Jarvis Samples - Jarvis Virtual Assistant

  • Up to 4 concurrent users are supported. This restriction of 4 concurrent users is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) that is used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.

  • The Chatbot application is not optimized for best latency in case of multiple concurrent users.

  • There are some erratic issues with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.

  • The Jarvis NLU pipeline and Rasa DM are not optimized for best accuracy.

Jarvis Samples - Jarvis Virtual Assistant (Rasa)

  • Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.

  • Both Jarvis NLP and Rasa NLU currently support only 1 slot for city. The day associated with the query is not taken into account when processing.

  • Although the Rasa servers and the Chatbot client servers can be hosted on different machines, they do not currently support independent scaling of the servers.

  • Up to 4 concurrent users are supported. This restriction of 4 concurrent users is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) that is used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.

  • Neither Rasa nor the Chatbot application has been optimized for best latency in case of multiple concurrent users.

  • The Rasa NLU pipeline and Rasa DM are not optimized for best accuracy; the primary objective of this demo is to showcase the integration of Jarvis with Rasa and not to build a production-ready pipeline.

  • The Rasa NLU pipeline and Rasa DM are not optimized for best inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa and not to build a production-ready pipeline.

  • Erratic issues have been observed with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.

Jarvis Samples - SpeechSquad

  • The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Hence, when executing the SpeechSquad client, the tracing.server_latency measurements will not be reported.

Jarvis Speech Skills 1.0.0-b.1 Beta

This is a beta release. All published functionality in the release notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our NVIDIA Riva Developer Forum.

Key Features and Enhancements

Ease of use: Jarvis addresses the needs of multiple types of users, from data scientists, deep learning researchers, and software developers to system administrators. It consists of a collection of modular, predefined functions, packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and different aspects of natural language understanding.

Flexibility: Jarvis is built to be modular. Depending on the desired application, a user can easily deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA’s scalable microservices framework, Triton Inference Server.

Customizable: Jarvis is built to make it easy to create new models or fine-tune the provided models with end-user owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models using the provided starting points.

Performance: Jarvis delivers low latency, high throughput, and high accuracy across the NLP, ASR, and TTS services so that users can build high-performance AI services. Jarvis leverages TensorRT, NVIDIA’s optimized inference toolkit, to ensure the highest possible performance of all models.


For the latest hardware and software compatibility support, refer to the Support Matrix.

Known Issues

  • A memory increase has been observed in the Jarvis ASR service over time when running at a continuously high load. To work around this issue, restart the service to free memory or run within Kubernetes with a failover mechanism. We plan to resolve the issue in the next Jarvis release. For more details on the issue in gRPC, see: triton-inference-server/server#2517.



  • To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and Jarvis Speech AI Server per GPU. Using one instance of Triton Inference Server and Jarvis Speech AI server with multiple GPUs is currently not supported. This will be fixed in a future release.

  • Jarvis ASR pipelines can produce different transcripts even if they are using the same acoustic model. For example, a Jarvis ASR pipeline that uses a large chunk size for offline recognition can produce different transcripts than a pipeline which uses a smaller chunk size for streaming recognition, even if they use the same acoustic model.


  • Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.

  • The Punctuation model must be named jarvis_punctuation and only supports English text.

  • The NER model must be named jarvis_ner.

  • The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.

  • Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.


  • Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.

  • Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.

  • Only a single voice and the English language are supported. However, for further customization, the user can train Tacotron 2 with their own data, if available, with NeMo 1.0.0b4 or later.

  • Only pulse-code modulation (PCM) encoding is supported; however, this will be selectable by the user in future releases.

  • We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as it can affect performance on NVIDIA T4.

Scaling

  • Passing the Jarvis Speech container to more than 1 GPU may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.

Kubernetes Deployment

  • A single helm deployment assumes a single GPU type is used for the deployment.

Jarvis Samples - Jarvis Virtual Assistant

  • Up to 4 concurrent users are supported. This restriction of 4 concurrent users is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) that is used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.

  • The Chatbot application is not optimized for best latency in case of multiple concurrent users.

  • There are some erratic issues with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.

  • The Jarvis NLU pipeline and Rasa DM are not optimized for best accuracy.

Jarvis Samples - Jarvis Virtual Assistant (Rasa)

  • Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.

  • Both Jarvis NLP and Rasa NLU currently support only 1 slot for city. The day associated with the query is not taken into account when processing.

  • Although the Rasa servers and the Chatbot client servers can be hosted on different machines, they do not currently support independent scaling of the servers.

  • Up to 4 concurrent users are supported. This restriction of 4 concurrent users is not because of Jarvis, but because of the web framework (Flask and Flask-SocketIO) that is used. The socket connection used to stream audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent socket connections.

  • Neither Rasa nor the Chatbot application has been optimized for best latency in case of multiple concurrent users.

  • The Rasa NLU pipeline and Rasa DM are not optimized for best accuracy; the primary objective of this demo is to showcase the integration of Jarvis with Rasa and not to build a production-ready pipeline.

  • The Rasa NLU pipeline and Rasa DM are not optimized for best inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa and not to build a production-ready pipeline.

  • Erratic issues have been observed with the Chatbot on the Firefox browser, most common of which is the TTS output being taken in as input by ASR for certain microphone gain values.

Jarvis Samples - SpeechSquad

  • The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Hence, when executing the SpeechSquad client, the tracing.server_latency measurements will not be reported.