<img src="https://developer.download.nvidia.com/notebooks/dlsw-notebooks/rivaasrasr-basics/nvidia_logo.png" style="width: 90px; float: right;">

# Riva ASR NIM Tutorial

This tutorial walks you through the features of Riva ASR NIM and shows how to use its APIs in a Python application. Riva ASR NIM exposes a gRPC API that serves both offline and online (streaming) use cases.

## Prerequisites

1. Deploy Riva ASR NIM with the Parakeet 1.1b en-US model by following the [Riva ASR NIM](https://docs.nvidia.com/nim/riva/asr/latest/overview.html) documentation.
2. Install the Riva Python Client library:
    ```bash
    sudo apt-get install python3-pip
    pip install -U nvidia-riva-client
    ```
3. Clone the Git repository at https://github.com/nvidia-riva/tutorials for audio samples. The repository is assumed to be cloned in the `$HOME` directory.

## Offline Recognition

In offline transcription, the entire input speech is submitted to the service, and the final transcript is received in one response.

Try generating the transcripts using the Riva ASR APIs for some sample audio clips in English.

### Import Riva Client Libraries

Begin by importing some of the necessary libraries, including the Riva Client libraries.

In [None]:
# Import required libraries
import io
from pathlib import Path
import grpc
import riva.client
import IPython.display as ipd


The following URI assumes that the Riva ASR NIM server is deployed locally and listens on the default port. If the server runs on a different host, or is deployed through a Helm chart on Kubernetes, use the appropriate URI.

In [None]:
# Create a Riva client and connect to the Riva ASR NIM
auth = riva.client.Auth(uri='0.0.0.0:50051')
asr_service = riva.client.ASRService(auth)
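
When the server is not local, one option is to read the URI from the environment instead of hard-coding it. The following is a minimal sketch under stated assumptions: the variable name `RIVA_ASR_URI` and the helper `resolve_riva_uri` are hypothetical and not part of the Riva client library.

```python
import os


def resolve_riva_uri(default="0.0.0.0:50051"):
    """Return a host:port string from RIVA_ASR_URI (hypothetical name) or a default."""
    uri = os.environ.get("RIVA_ASR_URI", "").strip() or default
    # Basic sanity check: require a non-empty host and a numeric port.
    host, sep, port = uri.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {uri!r}")
    return uri
```

The returned string can then be passed to the client in place of the literal, for example `riva.client.Auth(uri=resolve_riva_uri())`.
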

In [None]:
# Get list of available ASR models for offline use case
print("Available ASR models for offline use case")
config_response = asr_service.stub.GetRivaSpeechRecognitionConfig(riva.client.proto.riva_asr_pb2.RivaSpeechRecognitionConfigRequest())
for model_config in config_response.model_config:
    if model_config.parameters["type"] == "offline":
        print(f"{model_config.parameters['language_code']} : {model_config.model_name}")

Riva ASR NIM supports 16-bit, single-channel audio with `LPCM`, `alaw`, or `ulaw` encoding in `.raw` (headerless) and `.wav` formats. It also supports the `.opus` and `.flac` formats. The file format is auto-detected from the provided input audio.
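
Before submitting a `.wav` file, it can help to confirm locally that it is 16-bit and single-channel. The sketch below uses only Python's standard `wave` module; the helper names `describe_wav` and `is_16bit_mono` are illustrative and not part of the Riva client. Note that `wave` reads PCM WAV files only, so an `alaw` or `ulaw` file is expected to raise `wave.Error` here.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_16bit_mono(info):
    """True when the header describes 16-bit, single-channel audio."""
    return info["channels"] == 1 and info["sample_width_bytes"] == 2
```

For the sample used in this tutorial, `describe_wav(path)` could be called right after the file is read, and the reported sample rate is the value that would go into `config.sample_rate_hertz` if the same audio were sent as `.raw`.
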

In [None]:
# This example uses a .wav file with LINEAR_PCM encoding.
# read in an audio file from local disk
path = Path("~/tutorials/audio_samples/en-US_sample.wav").expanduser()
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

Set up the configuration parameters.

In [None]:
# Set up a recognition config
config = riva.client.RecognitionConfig()
config.language_code = "en-US"                    # Language code of the audio clip
config.max_alternatives = 1                       # How many top-N hypotheses to return. Only value of 1 is supported.
config.enable_automatic_punctuation = True        # Enable punctuation and capitalization
config.audio_channel_count = 1                    # Mono - Default
config.verbatim_transcripts = False               # Set to True to return verbatim transcripts
config.profanity_filter = False                   # Set to True to filter and replace profane words with first letter followed by asterisks (e.g. "f***")

# In cases where audio samples are submitted in `.raw` format, you need to set the following parameters appropriately:
# config.encoding = riva.client.AudioEncoding.LINEAR_PCM
# config.sample_rate_hertz = 16000


Submit the request to the server and print the response.

In [None]:
# Make a gRPC request and invoke the ASR service
request = riva.client.proto.riva_asr_pb2.RecognizeRequest(config=config, audio=content)
response = asr_service.stub.Recognize(request)

# Full response shows additional information like word time offsets and
# word confidence along with the transcript.
print(f"Full Response: {response}")

# Print the final transcript by combining transcripts from all results
final_transcript = ""
for res in response.results:
    final_transcript += res.alternatives[0].transcript
print("Final transcript:", final_transcript)
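
The transcript-combining loop above can also be written as a small reusable function. This is a sketch, not the library's own helper: `join_transcripts` is a hypothetical name, and the `SimpleNamespace` objects only stand in for `response.results` so the example runs without a server. Unlike the loop above, it skips results that have no alternatives and joins the pieces with single spaces.

```python
from types import SimpleNamespace


def join_transcripts(results):
    """Join the top alternative of each result, skipping empty results."""
    parts = []
    for result in results:
        if not result.alternatives:
            continue
        parts.append(result.alternatives[0].transcript.strip())
    return " ".join(part for part in parts if part)


# Hypothetical stand-in for `response.results`; real objects come from the server.
fake_results = [
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="Hello there. ")]),
    SimpleNamespace(alternatives=[]),
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="How are you?")]),
]
print(join_transcripts(fake_results))  # prints: Hello there. How are you?
```

With a real response, the call would be `join_transcripts(response.results)`.
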

### Offline recognition for non-English languages

We can run Riva ASR for non-English languages in the same manner by setting `config.language_code` to the appropriate language code. Riva ASR NIM must be deployed with the model for the required language. For a list of available models, refer to [Supported Models](https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#supported-models) in the Riva ASR NIM documentation.

## Streaming Recognition

In streaming transcription, the input speech is submitted to the service in chunks, and the transcript is received incrementally as the input speech is processed.

Try generating the transcripts using the Riva ASR APIs for some sample audio clips in English.

Begin by importing some of the necessary libraries, including the Riva Client libraries.

In [None]:
# Import required libraries
import io
from pathlib import Path
import grpc
import riva.client
import IPython.display as ipd

The following URI assumes that the Riva ASR NIM server is deployed locally and listens on the default port. If the server runs on a different host, or is deployed through a Helm chart on Kubernetes, use the appropriate URI.

In [None]:
# Create a Riva client and connect to the Riva ASR NIM
auth = riva.client.Auth(uri='0.0.0.0:50051')
asr_service = riva.client.ASRService(auth)

In [None]:
# Get list of available ASR models for streaming/online use case
print("Available ASR models for streaming/online use case")
config_response = asr_service.stub.GetRivaSpeechRecognitionConfig(riva.client.proto.riva_asr_pb2.RivaSpeechRecognitionConfigRequest())
for model_config in config_response.model_config:
    if model_config.parameters["type"] == "online":
        print(f"{model_config.parameters['language_code']} : {model_config.model_name}")

In [None]:
# This example uses a .wav file with LINEAR_PCM encoding.
input_speech_file = "~/tutorials/audio_samples/en-US_sample.wav"
path = Path(input_speech_file).expanduser()
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

Send the streaming requests to the server: the first request carries the config, and the following requests carry the audio chunks. Receive the responses and print them to the console.
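
The request ordering described above, and the arithmetic behind fixed-duration chunks, can be illustrated without a server. The sketch below assumes raw 16-bit mono PCM; `chunk_size_bytes` and `iter_requests` are hypothetical helpers, and the `("config", ...)` and `("audio", ...)` tuples merely stand in for the actual gRPC request messages.

```python
def chunk_size_bytes(sample_rate_hz, chunk_ms, sample_width_bytes=2, channels=1):
    """Number of bytes of raw PCM that cover `chunk_ms` milliseconds of audio."""
    frames = sample_rate_hz * chunk_ms // 1000
    return frames * sample_width_bytes * channels


def iter_requests(pcm, sample_rate_hz, chunk_ms=100):
    """Yield one config message first, then one audio message per PCM slice."""
    step = chunk_size_bytes(sample_rate_hz, chunk_ms)
    if step <= 0:
        raise ValueError("chunk duration is too short for this sample rate")
    yield ("config", {"sample_rate_hertz": sample_rate_hz})
    for start in range(0, len(pcm), step):
        yield ("audio", pcm[start:start + step])


# One second of 16 kHz, 16-bit mono silence as stand-in audio.
pcm = bytes(2 * 16000)
messages = list(iter_requests(pcm, 16000))
print(messages[0][0], len(messages) - 1, len(messages[1][1]))  # prints: config 10 3200
```

At 16 kHz, a 100 ms chunk holds 1600 frames, or 3200 bytes of 16-bit mono audio, so one second of audio yields ten audio messages after the single config message.
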

In [None]:
def read_responses(responses):
    try:
        final_transcript = ""
        for response in responses:
            if not response.results:
                continue
            for result in response.results:
                if not result.alternatives:
                    continue
                if result.is_final:
                    final_transcript += result.alternatives[0].transcript
                    print(f"FINAL: {result.audio_processed:.2f} : {result.alternatives[0].transcript}")
                else:
                    print(f"PARTIAL: {result.audio_processed:.2f} : {result.alternatives[0].transcript}")

        print("Transcript:", final_transcript)

    except grpc.RpcError as error:
        print(error.code(), error.details())
        return


def generate_requests(input_speech_file: str):
    print(f"Transcribing File: {input_speech_file}")

    # Set up a recognition config
    streaming_config = riva.client.StreamingRecognitionConfig(
        config = riva.client.RecognitionConfig(
            language_code = "en-US",                    # Language code of the audio clip
            max_alternatives = 1,                       # How many top-N hypotheses to return. Only value of 1 is supported.
            enable_automatic_punctuation = True,        # Enable punctuation and capitalization
            audio_channel_count = 1,                    # Mono - Default
            verbatim_transcripts = False,               # Set to True to return verbatim transcripts
            profanity_filter = False,                   # Set to True to filter and replace profane words with first letter followed by asterisks (e.g. "f***")
        ),
        interim_results = True
    )

    # In cases where audio samples are submitted in `.raw` format, you need to set the following parameters appropriately:
    # streaming_config.config.encoding = riva.client.AudioEncoding.LINEAR_PCM
    # streaming_config.config.sample_rate_hertz = 16000

    # First send the config
    yield riva.client.proto.riva_asr_pb2.StreamingRecognizeRequest(streaming_config=streaming_config)

    # Followed by audio chunks
    try:
        # stream audio in chunks of 100ms
        chunk_size_ms = 100
        for audio_chunk in riva.client.AudioChunkFileIterator(input_speech_file, chunk_size_ms):
            yield riva.client.proto.riva_asr_pb2.StreamingRecognizeRequest(audio_content=audio_chunk)
    except Exception as e:
        print(e)
        return

# Get response stream to read transcripts
read_responses(asr_service.stub.StreamingRecognize(generate_requests(input_speech_file)))