Add Speech Capabilities to a Conversational AI Application

Riva Virtual Assistant

The video below shows an example of a voice-enabled Riva Virtual Assistant (VA) application built with Riva Speech AI Services. The VA listens to the user's speech, transcribes it with an Automatic Speech Recognition (ASR) model, analyzes the transcript with an Intent Recognition and Slot Filling model, then computes a response and speaks it back to the user in a natural-sounding voice using a Text-to-Speech (TTS) model. The application uses Riva's gRPC Python API bindings, which are available as a Python wheel on NGC.
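The overall control flow can be sketched as three stages chained per conversational turn. The function names below (transcribe, chatbot_respond, synthesize) are hypothetical stand-ins for the Riva ASR, NLU/dialog, and TTS calls, not the Riva client API itself:

```python
# Minimal sketch of the VA pipeline's per-turn control flow.
# Each stage function is a hypothetical placeholder for the
# corresponding Riva service call described in the steps below.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for the Riva ASR streaming call."""
    return "what is the weather in Seattle"  # placeholder transcript

def chatbot_respond(transcript: str) -> str:
    """Stand-in for the NLU + dialog-manager stage."""
    return f"You asked: {transcript}"

def synthesize(text: str) -> bytes:
    """Stand-in for the Riva TTS call; returns raw audio bytes."""
    return text.encode("utf-8")  # placeholder "audio"

def handle_turn(audio_chunk: bytes) -> bytes:
    """One conversational turn: speech in, synthesized speech out."""
    transcript = transcribe(audio_chunk)
    response = chatbot_respond(transcript)
    return synthesize(response)
```

In the real application each stage is a network call to a Riva service; the point here is only the shape of the loop.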

[Figure: riva-intermediate-002.png — Riva Virtual Assistant pipeline]

Steps in the pipeline include:

ASR Service

The user speaks into their microphone, and transcripts are generated with the Riva ASR streaming API. The transcripts are then sent to the text-based chatbot, which determines an appropriate response.
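A streaming ASR API consumes audio incrementally rather than as one file, so the client typically feeds it a generator of fixed-size chunks. A minimal sketch of that chunking pattern (the chunk size of 4096 bytes is an illustrative choice, not a Riva requirement):

```python
def audio_chunks(audio: bytes, chunk_size: int = 4096):
    """Yield fixed-size chunks of raw audio — the shape of input a
    streaming ASR API typically consumes."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]

# In the Riva Python client, a generator like this would be passed to
# the ASR service's streaming call along with a streaming recognition
# config (encoding, sample rate, language code); see the Riva client
# documentation for the exact API.
```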

Text-Based Chatbot

This component is composed of a simple state-machine dialog manager and an NLU component. The dialog manager keeps track of what was said and forms a response by following a set of rules that lead to different dialog paths. It first sends the user utterance from the transcript to the Riva NLU service to determine its intent and associated entities. This information determines the next dialog state, in which the dialog manager either forms a response directly or queries a fulfillment service (such as a weather API) to gather relevant information before responding.
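The intent-to-response logic can be sketched as a small rule table. Here intent classification is faked with keyword matching as a stand-in for the Riva NLU service, and the weather lookup is a hardcoded stub standing in for a real fulfillment API:

```python
# Minimal sketch of the rule-based dialog manager described above.
# classify() and fetch_weather() are illustrative stubs, not Riva APIs.

def classify(utterance: str):
    """Stand-in for Riva NLU: return (intent, entities)."""
    text = utterance.lower()
    if "weather" in text:
        # crude entity extraction: treat the last word as the location
        return "weather_query", {"location": utterance.rstrip("?").split()[-1]}
    if any(word in text for word in ("hi", "hello")):
        return "greeting", {}
    return "unknown", {}

def fetch_weather(location: str) -> str:
    """Stub fulfillment service standing in for a real weather API."""
    return f"sunny in {location}"

def respond(utterance: str) -> str:
    """Dialog manager: route each intent to a response rule."""
    intent, entities = classify(utterance)
    if intent == "greeting":
        return "Hello! Ask me about the weather."
    if intent == "weather_query":
        return f"It is {fetch_weather(entities['location'])}."
    return "Sorry, I didn't understand that."
```

A production dialog manager would also carry state across turns (e.g. remembering the location after a follow-up question), which this single-turn sketch omits.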

TTS Service

The response from the state machine is then passed to the Riva TTS service, which converts it to audio that is spoken back to the user.
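A TTS service typically returns raw PCM samples, which the client then plays back or wraps in a WAV container. A sketch of that last step using the Python standard library, with one second of silence standing in for the synthesized audio (16 kHz mono 16-bit is a common default, assumed here, not a Riva-mandated format):

```python
import io
import wave

def pcm_to_wav(samples: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM — the kind of buffer a TTS service
    returns — in a WAV container so it can be saved or played back."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(samples)
    return buf.getvalue()

# Stand-in for the synthesized response: one second of silence.
fake_tts_output = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav(fake_tts_output)
```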

© Copyright 2022-2023, NVIDIA. Last updated on Feb 6, 2023.