Speech AI

NVIDIA ACE Agent supports Speech AI using the Chat Controller module. The Chat Controller provides a speech interface to ACE Agent by exposing gRPC APIs, and it uses the NVIDIA Riva ASR and TTS services for audio-to-text transcription and text-to-speech synthesis.

Several pre-configured pipelines are available for Speech AI. The functionality of the speech AI pipeline varies depending on which pipeline is selected. You can set your pipeline in the deploy/docker/docker_init.sh file.
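
For example, recent releases select the pipeline through a PIPELINE environment variable that is read when the script is sourced, such as export PIPELINE=speech_umim for a speech bot; the exact variable name and the list of supported values can differ between releases, so check the docker_init.sh shipped with your version.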

Detailed information regarding the gRPC APIs exposed by the Chat Controller can be found in the gRPC Interface API documentation.

The Chat Controller has six main components:

  • ASR module

  • TTS module

  • Speech Pipeline Manager

  • Chat Engine module

  • Logger module

  • UMIM module

ASR Module

The ASR module is responsible for calling the Riva ASR gRPC APIs to transcribe the audio received by the Chat Controller via its gRPC APIs. It supports several features provided by Riva ASR, such as word boosting for customization and profanity filtering.

Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time by giving them a higher score when decoding the output of the acoustic model.
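
Word boosting configured in the Chat Controller ultimately becomes a boosted-words list on the underlying Riva recognition request. As a minimal sketch of what such a request looks like when made directly with the nvidia-riva-client Python package (the server address, audio file, and boosted words below are illustrative placeholders):

   import riva.client

   # Connect to a Riva speech server (placeholder address).
   auth = riva.client.Auth(uri="localhost:50051")
   asr_service = riva.client.ASRService(auth)

   config = riva.client.RecognitionConfig(
       language_code="en-US",
       max_alternatives=1,
       enable_automatic_punctuation=True,
   )

   # Bias decoding toward domain-specific terms by assigning them a higher score.
   riva.client.add_word_boosting_to_config(
       config,
       boosted_lm_words=["Violet", "ACE"],
       boosted_lm_score=20.0,
   )

   # Transcribe a WAV file offline using the boosted configuration.
   with open("query.wav", "rb") as fh:
       response = asr_service.offline_recognize(fh.read(), config)
   print(response.results[0].alternatives[0].transcript)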

Refer to the Riva ASR section for more information about word boosting and how to enable it in the Chat Controller.

TTS Module

The TTS module is responsible for calling the Riva TTS gRPC APIs to synthesize audio for Chat Engine responses and for other TTS transcripts received by the Chat Controller via its gRPC APIs.

The TTS module supports IPA customization for improving pronunciation.

Refer to the Riva TTS section for more information about IPA customization and how to apply it in the Chat Controller.
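
At the Riva level, IPA-based pronunciation control can also be expressed inline with the SSML phoneme tag. The following sketch sends such a request directly with the nvidia-riva-client Python package; the server address and voice name are placeholders, and the IPA transcriptions are illustrative:

   import wave

   import riva.client

   auth = riva.client.Auth(uri="localhost:50051")  # placeholder address
   tts_service = riva.client.SpeechSynthesisService(auth)

   # Override the pronunciation of "tomato" with explicit IPA transcriptions.
   ssml = (
       "<speak>You say "
       '<phoneme alphabet="ipa" ph="təˈmeɪ.ɾoʊ">tomato</phoneme>, '
       'I say <phoneme alphabet="ipa" ph="təˈmɑː.təʊ">tomato</phoneme>.'
       "</speak>"
   )

   response = tts_service.synthesize(
       ssml,
       voice_name="English-US.Female-1",  # placeholder voice
       language_code="en-US",
       sample_rate_hz=44100,
   )

   # The response carries raw 16-bit PCM; wrap it in a WAV container to play it.
   with wave.open("tomato.wav", "wb") as out:
       out.setnchannels(1)
       out.setsampwidth(2)
       out.setframerate(44100)
       out.writeframes(response.audio)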

NVIDIA Riva TTS and the Chat Controller support SSML tags for customizing the TTS pronunciation. Some of the supported SSML tags are:

  • prosody tag - supports attributes such as rate, pitch, emotion, and volume, through which you can control different parameters of the generated audio.

  • voice name tag - can be used to dynamically change voices in the Chat Controller.

  • emotion tag - can be set in the TTS transcript; the Chat Controller also sends these tags to Audio2Face.

Note

For more information about supported SSML tags with SSML texts, refer to the Riva TTS SSML documentation.

You can embed these tags in the textual response templates for the corresponding questions. For example:

"<speak><prosody emotion='fearful:1.0' pitch='1' volume='loud'>Hi,  I am Violet, a food ordering assistant bot. How can I help you?</prosody></speak>"