Release Notes

Jarvis Speech Skills 1.0.0-b.2 Beta

This is a beta release. All functionality described in these release notes has been tested and verified, and known limitations are documented. To share feedback about this release, visit our developer forum.

Key Features and Enhancements

Ease of use: Jarvis addresses the needs of many types of users, from data scientists and deep learning researchers to software developers and system administrators. It consists of a collection of modular, predefined functions packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and several aspects of natural language understanding.

Flexibility: Jarvis is built to be modular. Depending on the desired application, users can deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA’s scalable microservices framework, Triton Inference Server.

Customizable: Jarvis makes it easy to create new models or fine-tune the provided models with end-user-owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models from the provided starting points.

Performance: Jarvis delivers cutting-edge latency, throughput, and accuracy across the NLP, ASR, and TTS services, enabling users to create high-performance AI services. Jarvis leverages TensorRT, NVIDIA’s optimized inference toolkit, to ensure the highest possible performance for all models.

Bugfixes

  • Resolved a memory leak under high load.

  • Fixed occasional mishandling of currency values in TTS.

  • Changed the clock source to prevent sample clients from reporting negative latencies on certain systems.

  • Improved TTS throughput.

Compatibility

For the latest hardware and software compatibility support, refer to the Support Matrix.

Limitations

ASR

  • To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and the Jarvis Speech AI server per GPU. Using a single instance of Triton Inference Server and the Jarvis Speech AI server with multiple GPUs is not currently supported; this will be fixed in a future release.

  • Jarvis ASR pipelines can produce different transcripts even when they use the same acoustic model. For example, a pipeline that uses a large chunk size for offline recognition can produce a different transcript than one that uses a smaller chunk size for streaming recognition.

NLP

  • Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.

  • The Punctuation model must be named jarvis_punctuation and only supports English text.

  • The NER model must be named jarvis_ner.

  • The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.

  • Non-Core API methods are limited to batch=1 requests. For batch>1, use the Core NLP methods (see the sketch following this list).
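
As a rough illustration of these batch limits, the sketch below routes batched input through a single Core NLP-style call while looping over texts one at a time for a non-Core method. The core_classify and punctuate callables are placeholders standing in for the actual Jarvis client calls; they are assumptions for illustration, not part of the documented API.

    from typing import Callable, List, Sequence


    def process_texts(texts: Sequence[str],
                      core_classify: Callable[[Sequence[str]], List[str]],
                      punctuate: Callable[[str], str]):
        """Respect the batch limits: send batch>1 only through Core NLP methods."""
        # Core NLP methods accept batch > 1, so the whole list goes in one request.
        labels = core_classify(list(texts))

        # Non-Core methods are limited to batch=1, so issue one request per text.
        punctuated = [punctuate(text) for text in texts]

        return list(zip(punctuated, labels))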

TTS

  • Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.

  • Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.

  • Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 on their own data, if available, with NeMo 1.0.0b4 or later.

  • Only pulse-code modulation (PCM) encoding is supported; however, the encoding will be selectable in future releases.

  • We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4 GPUs (see the sketch following this list).
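
A minimal client-side guard reflecting the limits above, assuming a synchronous client call: keep requests under 400 characters, request 22050 Hz PCM audio, and cap in-flight requests at 8. The synthesize function is a placeholder standing in for the actual Jarvis TTS client call, not the documented API.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    MAX_CHARS = 400          # TTS requests must be under 400 characters
    SAMPLE_RATE_HZ = 22050   # streaming TTS only supports 22050 Hz
    MAX_CONCURRENT = 8       # stay within the recommended 8-10 simultaneous requests

    _slots = threading.Semaphore(MAX_CONCURRENT)


    def synthesize(text: str, sample_rate_hz: int) -> bytes:
        """Placeholder for the real Jarvis TTS request; returns raw PCM bytes."""
        raise NotImplementedError("replace with the actual Jarvis TTS client call")


    def safe_synthesize(text: str) -> bytes:
        if len(text) >= MAX_CHARS:
            raise ValueError(f"TTS requests must be under {MAX_CHARS} characters")
        with _slots:  # never allow more than MAX_CONCURRENT in-flight requests
            return synthesize(text, SAMPLE_RATE_HZ)


    def synthesize_many(texts):
        # The worker pool plus the semaphore keeps at most 8 requests in flight.
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
            return list(pool.map(safe_synthesize, texts))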

Scaling - Exposing more than one GPU to the Jarvis Speech container may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU, as in the sketch below.
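
A minimal sketch of this recommendation, assuming a Docker-based deployment: build one docker run command per GPU and pin each Jarvis instance to a single device with Docker's --gpus flag. The image name, container names, and port scheme below are placeholders, not the documented quick-start values.

    # One Jarvis Speech instance per GPU; each container sees exactly one device.
    IMAGE = "jarvis-speech:latest"   # placeholder image name
    BASE_PORT = 50051                # gRPC port for GPU 0; GPU i gets BASE_PORT + i


    def commands_for(num_gpus: int):
        """Return one `docker run` command (as an argument list) per GPU."""
        commands = []
        for gpu in range(num_gpus):
            commands.append([
                "docker", "run", "-d",
                "--gpus", f"device={gpu}",          # pin the container to one GPU
                "-p", f"{BASE_PORT + gpu}:50051",   # expose a distinct host port
                "--name", f"jarvis-speech-{gpu}",
                IMAGE,
            ])
        return commands


    if __name__ == "__main__":
        for command in commands_for(2):
            print(" ".join(command))  # review, or pass to subprocess.run() to launch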

Kubernetes Deployment

  • A single Helm deployment assumes that a single GPU type is used throughout the deployment.

Jarvis Samples - Jarvis Virtual Assistant

  • Up to 4 concurrent users are supported. This restriction is not caused by Jarvis but by the web framework used (Flask and Flask-SocketIO). The socket connection that streams audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent connections.

  • The Chatbot application is not optimized for latency when serving multiple concurrent users.

  • Erratic behavior has been observed with the Chatbot in the Firefox browser; most commonly, the TTS output is picked up as ASR input at certain microphone gain values.

  • The Jarvis NLU pipeline and Rasa DM are not optimized for accuracy.

Jarvis Samples - Jarvis Virtual Assistant (Rasa)

  • Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.

  • Both Jarvis NLP and Rasa NLU currently support only a single slot (city). The day referenced in the query is not taken into account during processing.

  • Although the Rasa servers and the Chatbot client servers can be hosted on different machines, independent scaling of these servers is not currently supported.

  • Up to 4 concurrent users are supported. This restriction is not caused by Jarvis but by the web framework used (Flask and Flask-SocketIO). The socket connection that streams audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent connections.

  • Neither the Rasa application nor the Chatbot application is optimized for latency when serving multiple concurrent users.

  • The Rasa NLU pipeline and Rasa DM are not optimized for accuracy or inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.

  • Erratic behavior has been observed with the Chatbot in the Firefox browser; most commonly, the TTS output is picked up as ASR input at certain microphone gain values.

Jarvis Samples - SpeechSquad

  • The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Consequently, when the SpeechSquad client is run, the tracing.server_latency measurements are not reported.

Jarvis Speech Skills 1.0.0-b.1 Beta

This is a beta release. All functionality described in these release notes has been tested and verified, and known limitations are documented. To share feedback about this release, visit our developer forum.

Key Features and Enhancements

Ease of use: Jarvis addresses the needs of many types of users, from data scientists and deep learning researchers to software developers and system administrators. It consists of a collection of modular, predefined functions packaged as individual microservices. These AI services cover speech recognition, speech synthesis, and several aspects of natural language understanding.

Flexibility: Jarvis is built to be modular. Depending on the desired application, users can deploy the AI services as separate modules or chain them together into complex pipelines. Jarvis also addresses the challenges of inference and deployment by leveraging NVIDIA’s scalable microservices framework, Triton Inference Server.

Customizable: Jarvis makes it easy to create new models or fine-tune the provided models with end-user-owned data via the NeMo toolkit or TLT. NeMo provides pre-trained models for ASR, NLP, and TTS, and also allows users to train custom models from the provided starting points.

Performance: Jarvis delivers cutting-edge latency, throughput, and accuracy across the NLP, ASR, and TTS services, enabling users to create high-performance AI services. Jarvis leverages TensorRT, NVIDIA’s optimized inference toolkit, to ensure the highest possible performance for all models.

Compatibility

For the latest hardware and software compatibility support, refer to the Support Matrix.

Known Issues

A memory increase has been observed in the Jarvis ASR service over time when running under continuously high load. Suggested workaround: restart the service to free memory, or run within Kubernetes with a failover mechanism. The issue is planned to be resolved in the next Jarvis release. For more details on the underlying gRPC issue, refer to triton-inference-server/server#2517.
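
A minimal watchdog sketch of the restart workaround, assuming the service runs in a Docker container (the container name and memory threshold below are placeholders) and that the third-party psutil package is available: poll the server's resident memory and restart the container once it crosses the threshold. In Kubernetes, a liveness probe or restart policy serves the same purpose.

    import subprocess
    import time

    import psutil  # third-party: pip install psutil

    CONTAINER = "jarvis-speech"      # placeholder container name
    LIMIT_BYTES = 24 * 1024 ** 3     # example threshold: 24 GiB
    POLL_SECONDS = 60


    def watch(server_pid: int) -> None:
        """Restart the container once the server's resident memory exceeds the limit."""
        while True:
            rss = psutil.Process(server_pid).memory_info().rss
            if rss > LIMIT_BYTES:
                subprocess.run(["docker", "restart", CONTAINER], check=True)
                return  # the PID changes after a restart; re-resolve it before watching again
            time.sleep(POLL_SECONDS)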

Limitations

ASR

  • To utilize multiple GPUs for scale-out inference, run one instance of Triton Inference Server and the Jarvis Speech AI server per GPU. Using a single instance of Triton Inference Server and the Jarvis Speech AI server with multiple GPUs is not currently supported; this will be fixed in a future release.

  • Jarvis ASR pipelines can produce different transcripts even when they use the same acoustic model. For example, a pipeline that uses a large chunk size for offline recognition can produce a different transcript than one that uses a smaller chunk size for streaming recognition.

NLP

  • Only fixed model names for Punctuation, Named Entity Recognition (NER), and Intent models are supported. Future releases will leverage the model registration subsystem and support multiple versions/variants of each model.

  • The Punctuation model must be named jarvis_punctuation and only supports English text.

  • The NER model must be named jarvis_ner.

  • The Intent/Intent Domain models must be named jarvis_intent_<intent_domain> and jarvis_seqclass_domain, respectively.

  • Non-Core API methods are limited to batch=1 requests. For batch>1, use Core NLP methods.

TTS

  • Requests to the TTS service must be less than 400 characters in length. This limitation will be addressed in a future release.

  • Resampling of streaming TTS is currently unsupported. Requests to the streaming TTS service must be for 22050 Hz audio. This limitation will be addressed in a future release.

  • Only a single voice and the English language are supported. However, for further customization, users can train Tacotron 2 on their own data, if available, with NeMo 1.0.0b4 or later.

  • Only pulse-code modulation (PCM) encoding is supported; however, the encoding will be selectable in future releases.

  • We do not recommend making more than 8-10 simultaneous requests with the models provided in this release, as doing so can affect performance on NVIDIA T4 GPUs.

Scaling - Exposing more than one GPU to the Jarvis Speech container may result in undefined behavior. We currently recommend scaling by running one instance of Jarvis per GPU.

Kubernetes Deployment

  • A single Helm deployment assumes that a single GPU type is used throughout the deployment.

Jarvis Samples - Jarvis Virtual Assistant

  • Up to 4 concurrent users are supported. This restriction is not caused by Jarvis but by the web framework used (Flask and Flask-SocketIO). The socket connection that streams audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent connections.

  • The Chatbot application is not optimized for latency when serving multiple concurrent users.

  • Erratic behavior has been observed with the Chatbot in the Firefox browser; most commonly, the TTS output is picked up as ASR input at certain microphone gain values.

  • The Jarvis NLU pipeline and Rasa DM are not optimized for accuracy.

Jarvis Samples - Jarvis Virtual Assistant (Rasa)

  • Rasa NLU currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy, wind speed and snowfall check. Jarvis NLP currently supports intents for weather, temperature, rain check, humidity, sunny, cloudy and snowfall check.

  • Both Jarvis NLP and Rasa NLU currently support only a single slot (city). The day referenced in the query is not taken into account during processing.

  • Although the Rasa servers and the Chatbot client servers can be hosted on different machines, independent scaling of these servers is not currently supported.

  • Up to 4 concurrent users are supported. This restriction is not caused by Jarvis but by the web framework used (Flask and Flask-SocketIO). The socket connection that streams audio to (TTS) and from (ASR) the user does not sustain more than 4 concurrent connections.

  • Neither the Rasa application nor the Chatbot application is optimized for latency when serving multiple concurrent users.

  • The Rasa NLU pipeline and Rasa DM are not optimized for accuracy or inference time; the primary objective of this demo is to showcase the integration of Jarvis with Rasa, not to build a production-ready pipeline.

  • Erratic behavior has been observed with the Chatbot in the Firefox browser; most commonly, the TTS output is picked up as ASR input at certain microphone gain values.

Jarvis Samples - SpeechSquad

  • The current version of the Jarvis server does not report server latency measurements to the SpeechSquad server. Consequently, when the SpeechSquad client is run, the tracing.server_latency measurements are not reported.