Getting started with ASR in Python

This notebook walks through the basics of the Riva Speech Skills ASR services.

Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this notebook, we focus on interacting with the automated speech recognition (ASR) APIs.

For more detailed information on Riva, please refer to the Riva developer documentation.

Requirements and setup

To execute this notebook, please follow the setup steps in the README.

Import Riva client libraries

We first import some required libraries, including the Riva client libraries.

import io
import librosa
import IPython.display as ipd
import grpc

import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv
import riva_api.riva_audio_pb2 as ra

Create Riva clients and connect to Riva Speech API server

The URI below assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host, or via a Helm chart on Kubernetes, use the appropriate URI instead.

channel = grpc.insecure_channel('localhost:50051')

riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)

Offline recognition

Riva ASR can be used either in streaming mode or offline mode. In streaming mode, a continuous stream of audio is captured and recognized, producing a stream of transcribed text. In offline mode, an audio clip of set length is transcribed to text.
Let us look at an example showing Offline ASR API usage:
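The core idea behind streaming mode is that audio is sent to the server in small pieces rather than all at once. The helper below is a minimal, illustrative sketch of that chunking step only; it is a hypothetical function, not part of the Riva client libraries, and a real streaming client would wrap each chunk in a streaming request message.

```python
# Illustrative only: streaming ASR sends audio in small chunks.
# This generator splits raw audio bytes into fixed-size pieces.
# (Hypothetical helper, not part of riva_api.)

def audio_chunks(audio_bytes, chunk_size=4096):
    """Yield successive chunk_size-byte slices of audio_bytes."""
    for start in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[start:start + chunk_size]

# Example: 10000 bytes split into 4096-byte chunks
chunks = list(audio_chunks(b"\x00" * 10000))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```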

Make a gRPC request to the Riva Speech API server

In this release, the Riva ASR API supports single-channel audio in .wav (PCM), .alaw, .mulaw, and .flac formats.
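Since only single-channel audio is accepted in this release, it can be worth checking a .wav file's channel count and sample rate before sending it. Here is a sketch using only the standard-library `wave` module; to keep it self-contained it writes a short mono clip to an in-memory buffer instead of reading a file from disk.

```python
import io
import wave

# Create a short mono 16-bit PCM clip in memory so the example is
# self-contained; in practice you would open your own .wav file path.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)                    # mono, as required by this release
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(16000)                # 16 kHz sample rate
    w.writeframes(b"\x00\x00" * 160)     # 10 ms of silence (160 frames)

# Re-open the buffer for reading and inspect the properties Riva cares about.
buf.seek(0)
with wave.open(buf, "rb") as w:
    print(w.getnchannels(), w.getframerate())  # 1 16000
```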

Now let us make a gRPC request to the Riva Speech server for ASR with a sample .wav file in offline mode. Start by loading the audio.

# This example uses a .wav file with LINEAR_PCM encoding.
# read in an audio file from local disk
path = "../_static/data/asr/en-US_sample.wav"
audio, sr = librosa.core.load(path, sr=None)
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

Next, create an audio RecognizeRequest object, setting the configuration parameters as required.

# Set up an offline/batch recognition request
req = rasr.RecognizeRequest()
req.audio = content                                   # raw bytes
req.config.encoding = ra.AudioEncoding.LINEAR_PCM     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
req.config.sample_rate_hertz = sr                     # Audio will be resampled if necessary
req.config.language_code = "en-US"                    # Ignored, will route to correct model in future release
req.config.max_alternatives = 1                       # How many top-N hypotheses to return
req.config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
req.config.audio_channel_count = 1                    # Mono channel

Finally, submit the request to the server.

response = riva_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)
ASR Transcript: What is natural language processing? 


Full Response Message:
results {
  alternatives {
    transcript: "What is natural language processing? "
    confidence: 1.0
  }
  channel_tag: 1
  audio_processed: 4.1519999504089355
}
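The traversal used above (`response.results[0].alternatives[0].transcript`) generalizes: each result carries one or more alternatives, ordered best-first, up to `max_alternatives`. The sketch below shows that traversal over stand-in objects that merely mimic the shape of the response; a real response comes back from `riva_asr.Recognize(req)`.

```python
from types import SimpleNamespace

# Stand-in objects mimicking the shape of the response printed above;
# a real RecognizeResponse comes back from riva_asr.Recognize(req).
response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="What is natural language processing? ",
                        confidence=1.0),
    ]),
])

# Alternatives are ordered best-first, so index 0 is the top hypothesis.
best = [r.alternatives[0].transcript for r in response.results]
print(" ".join(best).strip())  # What is natural language processing?
```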

Understanding ASR API parameters

Riva ASR supports a number of options while making a transcription request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:

  • enable_automatic_punctuation - Adds punctuation to the transcript when the end of a voice activity detection (VAD) segment is detected.

  • encoding - Type of audio encoding to use (LINEAR_PCM, FLAC, MULAW or ALAW).

  • language_code - Language of the audio. “en-US” represents English (US).

  • audio_channel_count - Number of audio channels. Typical microphones have 1 audio channel.
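In practice, the file extension often determines which encoding value to pass. The lookup below is a hypothetical convenience helper; the encoding names correspond to the AudioEncoding enum members used earlier (e.g. ra.AudioEncoding.LINEAR_PCM).

```python
import os

# Hypothetical extension-to-encoding lookup; the names match the
# AudioEncoding enum members referenced earlier in this notebook.
EXTENSION_TO_ENCODING = {
    ".wav": "LINEAR_PCM",
    ".flac": "FLAC",
    ".mulaw": "MULAW",
    ".alaw": "ALAW",
}

def encoding_for(path):
    """Return the encoding name for a file path, or raise for unknown types."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return EXTENSION_TO_ENCODING[ext]
    except KeyError:
        raise ValueError(f"Unsupported audio format: {ext!r}")

print(encoding_for("../_static/data/asr/en-US_sample.wav"))  # LINEAR_PCM
```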

Go deeper into Riva capabilities

Now that you have a basic introduction to the Riva APIs, you may like to try out:

Advanced ASR notebook

Check out this notebook to understand how to use some of the advanced features of Riva ASR.

Sample apps

Riva comes with various sample apps demonstrating how to use the APIs to build interesting applications, such as a chatbot, a domain-specific speech recognition or keyword (entity) recognition system, or SpeechSquad, which shows how Riva scales out to handle massive numbers of requests at the same time. Have a look at the Sample Application section in the Riva developer documentation for all the sample apps.

Fine-tune a domain-specific speech model

Train the latest state-of-the-art speech and natural language processing models on your own data using the Transfer Learning Toolkit or NeMo, and deploy them on Riva using the Riva ServiceMaker tool.

Further resources

Explore the details of each of the APIs and their functionalities in the docs.