Bias ASR at request time in Python

This notebook walks through some of the advanced features of the Riva Speech Skills ASR service.

Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this notebook, we will focus on the more advanced features of the Automated speech recognition (ASR) APIs.
To understand the basics of the Riva ASR APIs, refer to Getting started with Riva ASR in Python.

For more detailed information on Riva, please refer to the Riva developer documentation.

Requirements and setup

To execute this notebook, please follow the setup steps in README.

Import Riva client libraries

We first import some required libraries, including the Riva client libraries.

import io
import librosa
import IPython.display as ipd
import grpc

import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import riva_api.riva_audio_pb2 as ra

Create Riva clients and connect to Riva Speech API server

The URI below assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host, or via a Helm chart on Kubernetes, use the appropriate URI.

channel = grpc.insecure_channel('localhost:50051')

riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)
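Rather than hard-coding localhost, the server target can be made configurable, which is convenient when pointing the same notebook at a remote or Kubernetes deployment. A minimal sketch, assuming a hypothetical RIVA_SPEECH_API_URI environment variable (not part of the Riva client libraries):

```python
import os

# Hypothetical environment variable; falls back to a local default deployment
riva_uri = os.environ.get("RIVA_SPEECH_API_URI", "localhost:50051")
print(riva_uri)
```

The channel would then be created with grpc.insecure_channel(riva_uri) in place of the literal address above.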

Word Boosting

Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time, by giving them a higher score when decoding the output of the acoustic model.

ASR inference without Word Boosting

First, let us run ASR on our sample audio clip without word boosting.

# This example uses a .wav file with LINEAR_PCM encoding.
# read in an audio file from local disk
path = "../_static/data/asr/en-US_wordboosting_sample.wav"
audio, sr = librosa.core.load(path, sr=None)
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)
# Creating RecognitionConfig
config = rasr.RecognitionConfig(
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=sr,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    audio_channel_count=1
)

# Creating RecognizeRequest
req = rasr.RecognizeRequest(audio=content, config=config)

# ASR Inference call with Recognize 
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript without Word Boosting:", asr_best_transcript)
ASR Transcript without Word Boosting: Anti, Berta and Aber, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 
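As an aside, if librosa is not available, the sample rate and channel count that RecognitionConfig needs can also be read from a LINEAR_PCM .wav file with the standard-library wave module. A self-contained sketch (it writes a short silent clip to a temporary file rather than assuming the sample file above exists):

```python
import wave
import tempfile

# Write a short silent mono 16 kHz clip so the example is self-contained
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    path = tmp.name
with wave.open(path, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # 16 kHz
    w.writeframes(b"\x00\x00" * 16000)  # one second of silence

# Read back the header fields used by RecognitionConfig
with wave.open(path, "rb") as w:
    sr = w.getframerate()
    channels = w.getnchannels()

print(sr, channels)  # → 16000 1
```

These values would feed sample_rate_hertz and audio_channel_count, respectively.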

As you can see in the above transcription, ASR has a hard time recognizing domain-specific terms like AntiBERTa and ABlooper.
Now let us use word boosting to try to improve ASR accuracy for these domain-specific terms.

ASR inference with Word Boosting

Let us look at how to add the boosted words to RecognitionConfig using SpeechContext. (For more information about SpeechContext, refer to the docs here.)

# Creating SpeechContext for Word Boosting
boosted_lm_words = ["AntiBERTa", "ABlooper"]
boosted_lm_score = 10.0
speech_context = rasr.SpeechContext()
speech_context.phrases.extend(boosted_lm_words)
speech_context.boost = boosted_lm_score

# Update RecognitionConfig with SpeechContext
config.speech_contexts.append(speech_context)

# Creating RecognizeRequest
req = rasr.RecognizeRequest(audio=content, config=config)

# ASR Inference call with Recognize 
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript with Word Boosting:", asr_best_transcript)
ASR Transcript with Word Boosting: AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 

As you can see in the above transcription, with word boosting, ASR is able to correctly transcribe the domain-specific terms AntiBERTa and ABlooper.

By default, no words are boosted on the server side. Only words passed by the client are boosted.

Boosting different words at different levels

With Riva ASR, we can also have different boost values for different words. For example, here AntiBERTa is boosted by 10 and ABlooper is boosted by 20:

# Creating RecognitionConfig
config = rasr.RecognitionConfig(
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=sr,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    audio_channel_count=1
)

# Creating SpeechContext for Word Boosting AntiBERTa
speech_context1 = rasr.SpeechContext()
speech_context1.phrases.append("AntiBERTa")
speech_context1.boost = 10.

# Creating SpeechContext for Word Boosting ABlooper
speech_context2 = rasr.SpeechContext()
speech_context2.phrases.append("ABlooper")
speech_context2.boost = 20.

# Update RecognitionConfig with both SpeechContexts
config.speech_contexts.append(speech_context1)
config.speech_contexts.append(speech_context2)

# Creating RecognizeRequest
req = rasr.RecognizeRequest(audio=content, config=config)

# ASR Inference call with Recognize 
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript with Word Boosting:", asr_best_transcript)
ASR Transcript with Word Boosting: AntiBERTa and ABlooper, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigens. 
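Since each SpeechContext carries a single boost value for its entire phrases list, a convenient client-side pattern is to keep the boost list in a plain dict and group words that share a score before building the messages. A hypothetical helper (pure grouping logic only; the protobuf construction itself follows the pattern shown above):

```python
from collections import defaultdict

def group_boosts(word_boosts):
    """Group words by boost score: {word: boost} -> [(boost, [words]), ...]."""
    by_score = defaultdict(list)
    for word, boost in word_boosts.items():
        by_score[boost].append(word)
    return sorted(by_score.items())

groups = group_boosts({"AntiBERTa": 10.0, "ABlooper": 10.0, "antigens": -100.0})
print(groups)  # → [(-100.0, ['antigens']), (10.0, ['AntiBERTa', 'ABlooper'])]
```

Each (boost, words) pair would then become one SpeechContext, via speech_context.phrases.extend(words) and speech_context.boost = boost.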

Negative word boosting for undesired words

We can even use word boosting to discourage the prediction of certain words, by assigning them negative boost scores.

# Creating RecognitionConfig
config = rasr.RecognitionConfig(
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=sr,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
    audio_channel_count=1
)

# Creating SpeechContext for Negative Word Boosting
negative_boosted_lm_word = "antigens"
negative_boosted_lm_score = -100.0
speech_context = rasr.SpeechContext()
speech_context.phrases.append(negative_boosted_lm_word)
speech_context.boost = negative_boosted_lm_score

# Update RecognitionConfig with SpeechContext
config.speech_contexts.append(speech_context)

# Creating RecognizeRequest
req = rasr.RecognizeRequest(audio=content, config=config)

# ASR Inference call with Recognize 
response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript with Negative Word Boosting:", asr_best_transcript)
ASR Transcript with Negative Word Boosting: Anti, Berta and Aber, both transformer based language models are examples of the emerging work in using graph networks to design protein sequences for particular target antigen. 

By providing a negative boost score for antigens, we made Riva ASR transcribe antigen instead of antigens.

Please Note:

  • There is no limit on the number of words that can be boosted. You should see no impact on latency, even with ~100 boosted words, except for the first request, where a small latency impact is expected.

  • Boosting phrases or combinations of words is not yet fully supported (although they do work). We will revisit finalizing this support in an upcoming release.

More information about word boosting can be found in the documentation here.

Go deeper into Riva capabilities

Now that you have a basic introduction to the Riva APIs, you may want to try out:

Sample apps

Riva comes with various sample apps that demonstrate how to use the APIs to build interesting applications, such as a chatbot, a domain-specific speech recognition or keyword (entity) recognition system, or SpeechSquad, which shows how Riva scales out to handle massive amounts of simultaneous requests. Have a look at the Sample Application section in the Riva developer documentation for all the sample apps.

Finetune a domain-specific speech model and deploy it in Riva

Train the latest state-of-the-art speech and natural language processing models on your own data using the Transfer Learning ToolKit or NeMo, and deploy them in Riva using the Riva ServiceMaker tool.

Further resources

Explore the details of each of the APIs and their functionalities in the docs.