Add Speech Capabilities to a Conversational AI Application (Latest Version)

Step #1: Hands On Lab

Within this lab, you will dive deeper into the weather VA sample, specifically the sections of the code pertaining to Riva Streaming ASR and Riva TTS calls. A few lines of code are missing where the actual Riva ASR/TTS service calls are made, and as an exercise you are required to fill in those missing pieces to complete the application. The solutions to the exercises are provided towards the end of this guide.

Note

The Riva lab will use two important links from the left-hand navigation pane throughout the course of the lab.

Please use Chrome or Firefox when trying the weather VA sample in this lab. The Web Audio APIs used to handle audio in the application work best with these browsers.

For this lab, the Riva server has been set up for you.

  1. Open the VM Console by selecting VM Console on the left-hand navigation pane.

  2. Go to the “lab-2” directory, which contains the code for the virtual assistant. We will also activate the conda environment that contains all the dependencies required for this application.

    cd riva-launchpad/lab-2
    source lab2/bin/activate
    cd virtual-assistant

  3. Once inside the directory, you will find the exercise code in the current working directory. Edit the configuration file config.py and set the weatherstack API key. The VA uses weatherstack for weather fulfillment; that is, when a weather intent is recognized, real-time weather information is fetched from weatherstack (an illustrative sketch of such a lookup is shown after this step).

    • Open the configuration file

    vim config.py

    • Sign up for the free tier of weatherstack, and get your API access key. Copy your access key into config.py. The code snippet will look like the example below. Save and close the file once done.

    riva_config = {
        ...
        "WEATHERSTACK_ACCESS_KEY": "<API_ACCESS_KEY>",  # Get your access key at - https://weatherstack.com/
        ...
    }
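
For context, the snippet below is an illustrative sketch (not the VA's actual fulfillment code) of how real-time weather can be fetched from the weatherstack REST API with the same access key; the city name and printed fields are only examples.

import requests

# Illustrative only: the VA's fulfillment logic performs a similar lookup
# internally once a weather intent is recognized.
params = {
    "access_key": "<API_ACCESS_KEY>",  # the key configured in config.py
    "query": "San Francisco",
}
resp = requests.get("http://api.weatherstack.com/current", params=params)
data = resp.json()
print(data["current"]["temperature"], data["current"]["weather_descriptions"])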

The VA transcribes user utterances using Riva’s streaming recognition API. The proto that defines the ASR services and messages can be found in the Riva documentation.

Note

An example of using the Riva streaming ASR API can also be found in the Riva python clients repository.

Let’s go over some of the salient bits in using Riva’s Streaming ASR service.

The first argument to the streaming_response_generator method, audio_chunks, accepts an iterable of audio chunks that yields the byte sequences of audio content to be sent to the Riva speech server. The second argument, streaming_config, contains the configuration that specifies how to process the request. The subsequent messages sent in the stream must contain only the raw bytes of the audio data to be recognized.

# Boilerplate
import riva.client

# Channel to Riva Server
auth = riva.client.Auth(uri='localhost:50051')
riva_asr = riva.client.ASRService(auth)

# Configuration
config = riva.client.RecognitionConfig()
config.sample_rate_hertz = 16000
config.language_code = "en-US"
config.max_alternatives = 1
config.enable_automatic_punctuation = True
config.enable_word_time_offsets = True
config.verbatim_transcripts = False

# Provides information to the recognizer that specifies how to process the request
streaming_config = riva.client.StreamingRecognitionConfig(config=config, interim_results=True)

# read data

For a given stream, sequential chunks of audio data are sent in sequential requests. We can leverage a Python generator function that yields the chunks of audio data to compose the input for streaming_response_generator.

"""Generates byte-sequences of audio chunks from the audio buffer""" def build_request_generator(self): while not self.closed: # Use a blocking get() to ensure there's at least one chunk of # data, and stop iteration if the chunk is None, indicating the # end of the audio stream. chunk = self._buff.get() if chunk is None: return data = [chunk] # Now consume whatever other data's still buffered. while True: try: chunk = self._buff.get(block=False) if chunk is None: return data.append(chunk) except queue.Empty: break yield b''.join(data) # Next, we call the Riva streaming_response_generator method to generate transcripts responses = riva_asr.streaming_response_generator(self.build_request_generator(), streaming_config)

This returns a stream of responses; each response contains StreamingRecognitionResult objects, which include the different alternatives (recognition hypotheses) as well as an is_final boolean that indicates whether the result is interim or final.
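
As a minimal sketch (assuming the responses iterable returned by the call above), the transcripts can be consumed as follows, printing interim hypotheses as they arrive and the top alternative once a result is final:

for response in responses:
    for result in response.results:
        transcript = result.alternatives[0].transcript
        if result.is_final:
            print("Final:", transcript)
        else:
            print("Interim:", transcript)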

The VA implements ASR support through its ASRPipe class in virtual-assistant/riva_local/asr/asr.py.

ASRPipe contains methods to communicate with the Riva ASR service using the StreamingRecognize API, as well as the interface to operate buffers for the audio input and transcription output. In line with the API shown above, the ASRPipe.main_asr() method sends requests with a generator for the stream of inputs in the audio buffer, queries the Riva ASR service, and calls another method, ASRPipe.listen_print_loop(), to iterate over the output stream of responses.

Exercise 1 - Take a look at the ASRPipe.main_asr() method, and fill in the missing line of code that calls the Riva ASR service.

After any dialog state transitions, the response text generated by the virtual assistant is synthesized into audio using Riva’s SynthesizeOnline API. The input SynthesizeSpeechRequest contains the desired text and the configuration, and this API returns the stream of audio bytes in the requested format as it becomes available.

# Boilerplate
import riva.client

# Channel to Riva Server
auth = riva.client.Auth(uri='localhost:50051')
riva_tts = riva.client.SpeechSynthesisService(auth)

# Query the Riva TTS service with request related arguments
responses = riva_tts.synthesize_online(
    text="Hello",
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=22050,
    voice_name="ljspeech"
)

# This returns an iterable (`responses`) object that contains the stream of results.
for resp in responses:
    audio_bytes = resp.audio
    # .. parse audio

The VA implements speech synthesis through its TTSPipe class in virtual-assistant/riva_local/tts/tts_stream.py. TTSPipe contains methods to communicate with the Riva TTS service using the SynthesizeOnline API, as well as the interface to access buffers for the desired text and the speech output.

The TTSPipe.get_speech() method composes a request, queries the Riva TTS service, and then loops over the response iterable to process and yield segments of audio bytes.
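
As an illustration of that last step (this is not the VA's actual code, and the segment size is an arbitrary example), the audio returned by synthesize_online could be split into fixed-size chunks before being yielded to the output buffer:

CHUNK_SIZE = 4096  # illustrative segment size in bytes

def yield_audio_segments(responses):
    for resp in responses:
        audio = resp.audio  # raw LINEAR_PCM bytes for this part of the stream
        for start in range(0, len(audio), CHUNK_SIZE):
            yield audio[start:start + CHUNK_SIZE]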

Exercise 2 - Take a look at the TTSPipe.get_speech() method, and fill in the missing line of code that calls the Riva TTS service.

Note

If you’re curious about the offline Speech Synthesis API, you can find that in the virtual-assistant/riva_local/tts/tts.py file. The imports in virtual-assistant/riva_local/chatbot/chatbot.py determine which mode gets used.
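
For reference, below is a minimal sketch of an offline (batch) request, assuming the same riva.client setup as in the streaming example above; unlike synthesize_online, the synthesize call returns a single response containing the complete audio.

import riva.client

auth = riva.client.Auth(uri='localhost:50051')
riva_tts = riva.client.SpeechSynthesisService(auth)

resp = riva_tts.synthesize(
    text="Hello",
    voice_name="ljspeech",
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=22050
)
audio_bytes = resp.audio  # complete synthesized audio in one response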

Once the exercises are completed, start the VA by running the command below.

python3 main.py

Open the Client Application by clicking the link on the left-hand navigation pane. This will start another tab in your browser with the VA application.

Below are a few example questions to ask the VA.

  • What is the weather in San Francisco?

  • What is the temperature in Chicago on Friday?

  • How humid is it right now?

Additional sample questions can be found in the Riva documentation.

The missing code snippets for the exercises in the previous section are shown below:

Exercise 1

responses = self.riva_asr.streaming_response_generator(audio_chunks=self.request_generator, streaming_config=streaming_config)

Exercise 2

responses = self.riva_tts.synthesize_online(
    text=text,
    language_code=self.language_code,
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=self.sample_rate,
    voice_name=self.voice_name
)

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.