Automatic Speech Recognition¶
Abstract: This NVIDIA Jarvis Automatic Speech Recognition (ASR) 0.2 Early Access (EA) User Guide provides step-by-step instructions for training and deploying your model, and describes how to use the ASR service with Jarvis. ASR is a speech-to-text application based on the Jasper model. Specifically, ASR takes an audio stream or audio buffer as input and returns the English language text transcript, along with additional optional metadata. ASR represents a full speech recognition pipeline that is GPU accelerated with optimized performance and accuracy.
Introduction¶
Automatic Speech Recognition (ASR) is a speech-to-text application based on the Jasper model. Specifically, ASR takes as input an audio stream or audio buffer and returns the English language text transcript, along with additional optional metadata. ASR represents a full speech recognition pipeline that is GPU accelerated with optimized performance and accuracy. ASR supports synchronous and streaming recognition modes.
Specifically, in synchronous mode, the full audio signal is first read from a file or captured from a microphone. Following the capture of the entire signal, the client makes a request to the NVIDIA Triton Inference Server to transcribe it. The client then waits for the response from the server.
Note: This method might have long latency since the processing of the audio signal only starts once the full audio signal has been captured or read from the file.
In streaming recognition mode, as soon as an audio segment of a specified length is captured or read, a request is made to the server to process that segment. On the server side, a response is returned as soon as an intermediate transcript is available.
Note: The length of the audio segments can be selected by the user based on speed and memory requirements.
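For illustration only, the following Python sketch (not part of the Jarvis API; the 100 ms segment length and file path are placeholder assumptions) shows how a client might split a raw PCM audio buffer into fixed-length segments, sending one streaming request per segment. Complete streaming client examples are described later in this guide.
def audio_chunks(content, chunk_duration_s=0.1, sample_rate_hz=16000, bytes_per_sample=2):
    # Number of bytes in one segment of 16-bit mono PCM audio.
    chunk_size = int(chunk_duration_s * sample_rate_hz * bytes_per_sample)
    for offset in range(0, len(content), chunk_size):
        yield content[offset:offset + chunk_size]

# Example usage: iterate over 100 ms segments of a raw PCM payload.
with open("audio.raw", "rb") as fh:
    data = fh.read()
for segment in audio_chunks(data):
    pass  # each segment would be wrapped in a streaming request here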
Jarvis ASR features include:
Support for offline and streaming use cases
A streaming mode that returns intermediate transcripts with low latency
GPU-accelerated feature extraction
Jasper and QuartzNet support with TensorRT 7.x.x
Beam decoder based on n-gram language model
Voice activity detection algorithms (CTC decoder based)
Basic punctuation based on voice activity detector (VAD) algorithms
Helm chart to easily deploy to a Kubernetes cluster
Ability to return top-N transcripts from beam decoder
Word-level time stamps
Benefits Of Jarvis ASR¶
The Triton Inference Server implementation of the speech-to-text pipeline based on the Jasper model provides the following benefits:
Ease of use: Obtaining transcripts of audio files is as simple as providing the content of the audio files and receiving the most likely transcripts for those files.
Fast: Because Jarvis leverages NVIDIA GPUs, the Triton Inference Server speech-to-text pipeline achieves state-of-the-art performance.
Accurate: Jarvis ASR provides state-of-the-art accuracy as measured by the word error rate (WER) on typical datasets such as LibriSpeech.
Modular: Even though this implementation of speech-to-text uses the Jasper neural network, it is modular such that you can easily replace one or more components of this pipeline.
Jasper¶
The Jasper model is an end-to-end neural acoustic model for ASR that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment.
The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.
The original paper takes the output of the Jasper acoustic model and shows results for three different decoding variations: greedy decoding, beam search with a 6-gram language model, and beam search with further rescoring of the best-ranked hypotheses using Transformer-XL, a neural language model. Beam search and neural language model rescoring run on the CPU and yield better word error rates than greedy decoding.
Details on the model architecture can be found in the paper Jasper: An End-to-End Convolutional Neural Acoustic Model.
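To make the greedy decoding baseline concrete, the following sketch (illustrative only; the character vocabulary and blank index are assumptions, not the exact Jarvis configuration) collapses repeated symbols and removes CTC blanks from the per-frame argmax of the acoustic model output.
import numpy as np

# Assumed character vocabulary for illustration; the last index is the CTC blank.
VOCAB = list("abcdefghijklmnopqrstuvwxyz '")
BLANK = len(VOCAB)

def greedy_decode(log_probs):
    # log_probs has shape [num_frames, num_symbols] (acoustic model output).
    best_path = np.argmax(log_probs, axis=1)
    transcript = []
    previous = BLANK
    for symbol in best_path:
        # Collapse repeats and drop blanks, per the CTC decoding rule.
        if symbol != previous and symbol != BLANK:
            transcript.append(VOCAB[symbol])
        previous = symbol
    return "".join(transcript)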
Modules¶
The speech recognition framework is based on the Jasper model and can be separated into the following speech-to-text components:
Feature Extractor: Converts the raw audio signal to a set of audio features. The audio features are typically a mel-scale log spectrogram or filter banks of the audio signal.
The audio feature extractor and preprocessor is responsible for extracting audio features such as filter banks or a spectrogram from the raw audio signal. This involves signal processing operations such as Fast Fourier Transforms (FFTs). This component also uses a voice activity detection algorithm to detect pauses in the audio signal; this information is given as input to the decoder to reset its state between sentences. (A minimal feature-extraction sketch appears after the list of modules below.)
Acoustic Model: A deep time-delay neural network (TDNN) consisting of blocks of 1D-convolutional layers. The Jasper neural network supports both TensorFlow and PyTorch frameworks.
The Jasper neural network computes a probability distribution over the characters of a vocabulary at every frame in the sequence. For more information about the Jasper neural network, refer to the Jasper: An End-to-End Convolutional Neural Acoustic Model paper and OpenSeq2Seq Jasper Model documentation.
Voice Activity Detector: The voice activity detector uses the output of the Jasper neural network to identify the beginning and end of sentences. The decoder uses that information to reset its state between sentences.
Beam Search Decoder: Uses an n-gram language model to generate candidate transcripts, along with their scores (probabilities).
For more information about the beam-search decoder, refer to the First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs paper and the CTC Networks and Language Models: Prefix Beam Search Explained article.
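As a minimal illustration of the features computed by the feature extractor described above (illustrative only; the window, hop, and mel-band parameters below are assumptions, and Jarvis performs the equivalent computation on the GPU), a log-mel spectrogram can be produced with librosa:
import numpy as np
import librosa

# Illustrative parameters only; not the exact values used by the Jarvis feature extractor.
audio, sr = librosa.load("audio.wav", sr=16000)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=512, win_length=320, hop_length=160, n_mels=64)
log_mel = np.log(mel + 1e-9)  # shape: [n_mels, num_frames]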
Training A Model With Your Data¶
Jarvis-compatible ASR models, for example, Jasper, can be trained using Neural Modules (NeMo). To train your own acoustic model using NeMo, follow the step-by-step instructions here: https://nvidia.github.io/NeMo/asr/tutorial.html
After you’ve trained your Jasper model in NeMo and have the Jasper model’s checkpoints for the encoder and the decoder, you can export those checkpoints to ONNX for deployment with Jarvis by using the provided NeMo script, scripts/export_jasper_to_onnx.py.
After this script is executed, two files are created: nn_encoder.onnx and nn_decoder.onnx.
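Optionally, before generating the model repository, you can verify that the exported files are well-formed ONNX graphs. The following sketch uses the onnx Python package and is not a Jarvis tool:
import onnx

# Sanity-check the exported encoder and decoder graphs.
for model_file in ("nn_encoder.onnx", "nn_decoder.onnx"):
    model = onnx.load(model_file)
    onnx.checker.check_model(model)
    print(model_file, "inputs:", [inp.name for inp in model.graph.input])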
Using the nn_encoder.onnx and nn_decoder.onnx files, generate the model with the Jarvis ASR service as described in Generating The Triton Inference Server Model Repository.
Generating The Triton Inference Server Model Repository¶
There are two ways to generate a Triton Inference Server model repository:
You can use one of our pre-trained models in NGC, or
You can use a custom model fine-tuned with Neural Modules (NeMo).
Creating A Model Repository Using A Pre-Trained NGC Model¶
To deploy the ASR models and generate a model repository configured for the ASR service, first download the Quick Start scripts from NGC. Edit the configuration file config.sh to customize the deployment. To disable services other than ASR, simply set:
service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=false
service_enabled_vision=false
Users should also set the ID of the GPU to use, and the type of GPU being used (t4 for compute capability 7.5 GPUs and v100 for compute capability 7.0 GPUs). For use cases where low latency of intermediate transcripts is more important than maximum throughput, the following ASR model configuration files should be used:
models_asr=(
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_streaming.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_vad_streaming.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_streaming_offline.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_vad_streaming_offline.yaml:ea2"
)
For use cases where maximum throughput is more important, the following ASR configuration files should be used:
models_asr=(
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_streaming_throughput.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_vad_streaming_throughput.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_streaming_offline.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_base:config_jasper_asr_trt_ensemble_vad_streaming_offline.yaml:ea2"
)
After the quickstart/config.sh file is properly configured, generate the Triton model repository by running:
quickstart/jarvis_init.sh
For more information, refer to Local Deployment Using Quick Start Scripts in the Jarvis AI Services Quick Start Guide.
Creating A Model Repository Using A Fine-Tuned Model From NeMo¶
To generate a model repository using the models exported from NeMo, move the nn_encoder.onnx and nn_decoder.onnx files to a subdirectory called nemo at the path where you want to generate the model repository.
Note that when using a fine-tuned NeMo model, the variable jarvis_model_loc in the quickstart/config.sh file must be a local folder, such as /tmp/jarvis. For example, if you are generating the model repository at /tmp/jarvis, run the following commands:
NEMO_MODEL_DIR=/tmp/jarvis/nemo/jarvis_asr_jasper_english_nemo/1/
mkdir -p $NEMO_MODEL_DIR
cp <path to nn_encoder.onnx> $NEMO_MODEL_DIR/nn_encoder.onnx
cp <path to nn_decoder.onnx> $NEMO_MODEL_DIR/nn_decoder.onnx
Then, modify quickstart/config.sh by replacing jarvis_asr_jasper_english_base with jarvis_asr_jasper_english_nemo. For example:
models_asr=(
"ea-2-jarvis::jarvis_asr_jasper_english_nemo:config_jasper_asr_trt_ensemble_streaming.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_nemo:config_jasper_asr_trt_ensemble_vad_streaming.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_nemo:config_jasper_asr_trt_ensemble_streaming_offline.yaml:ea2"
"ea-2-jarvis::jarvis_asr_jasper_english_nemo:config_jasper_asr_trt_ensemble_vad_streaming_offline.yaml:ea2"
)
Finally, generate the model repository by running:
quickstart/jarvis_init.sh
Modifying The Model Repository To Use Custom Language Model¶
Advanced users have the ability to modify the ASR language model used by Jarvis. Scripts to facilitate the creation and tuning of language models are provided on NGC in the asr_lm_tools folder.
To create a new language model binary from a text file or an .arpa file, users must first modify the file config_LM_creation.sh to specify the data to use for the language model, the parameters for creating the language model, and the name of the language model binary file. Then, users should run the following command:
create_LM.sh config.sh config_LM_creation.sh
where config.sh is the absolute path to the Jarvis config file and config_LM_creation.sh is the absolute path to the configuration file for the language model creation. Upon successful completion, the language model binary file should be found in the output directory specified by the user.
Users can update the ASR model repository (previously generated by running jarvis_init.sh) to use the new language model by running:
update_jarvis_LM.sh config.sh new_lm.binary
where config.sh is the absolute path to the Jarvis config file and new_lm.binary is the absolute path to the new language model binary.
After the Jarvis ASR model repository has been updated to use the new language model, users can also tune the language model parameters (beam_search_width, alpha and beta) by running:
tune_LM.sh config.sh config_LM_tuning.sh
where config.sh is the absolute path to the Jarvis config file and config_LM_tuning.sh is the absolute path to the configuration file for the language model tuning.
In the config_LM_tuning.sh file, users must specify ranges of values for beam_search_width, alpha, and beta, as well as the audio files to use for the tuning. Upon completion, the script will print optimal values for beam_search_width, alpha, and beta. Those values can then be used to update the ASR model repository once again by running:
update_jarvis_LM.sh config.sh new_lm.binary beam_search_width alpha beta
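For intuition about what these parameters control, CTC beam search decoders commonly rank each candidate transcript by combining the acoustic score with a weighted language model score and a word-count term, while beam_search_width controls how many partial hypotheses are retained during the search. The sketch below is illustrative only and is not the Jarvis implementation; the scores and parameter values are placeholder assumptions.
# Illustrative only: how alpha and beta typically enter the beam search score.
# acoustic_log_prob comes from the acoustic model; lm_log_prob from the n-gram LM.
def combined_score(acoustic_log_prob, lm_log_prob, transcript, alpha, beta):
    word_count = len(transcript.split())
    # alpha weights the language model; beta rewards (or penalizes) longer transcripts.
    return acoustic_log_prob + alpha * lm_log_prob + beta * word_count

# Example with placeholder scores and parameters.
print(combined_score(-12.3, -8.1, "hello world", alpha=0.5, beta=1.0))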
Deploying Your Model¶
Regardless of whether you generated a Triton Inference Server model repository using a model that was pre-trained from NGC or whether you used a fine-tuned model that was trained in NeMo, the deployment process is the same.
To deploy your model, you can choose from the following:
You can use the provided jarvis_start.sh script to launch Triton and the Jarvis Speech server.
You can use a Helm chart to deploy on a Kubernetes cluster.
Deploying A Model Using A Docker Container¶
After your local model repository is properly configured (see the previous section), deployment of the model involves starting the Triton Inference Server and Jarvis API Server. Using the Quick Start scripts provided in NGC, this can be done by running:
quickstart/jarvis_start.sh
To verify that the Triton server has been launched properly, run:
docker logs jarvis-triton
The log will look similar to the following:
Starting endpoints, 'inference:0' listening on
I0428 00:48:19.701836 1 grpc_server.cc:1973] Started GRPCService at 0.0.0.0:8001
I0428 00:48:19.701868 1 http_server.cc:1443] Starting HTTPService at 0.0.0.0:8000
I0428 00:48:19.744082 1 http_server.cc:1458] Starting Metrics Service at 0.0.0.0:8002
Similarly, to verify that the Jarvis Speech server is running properly, run:
docker logs jarvis-speech
The log will look similar to the following:
I0428 00:48:25.747217 1 model_registry.cc:89] Registered 'jasper-asr-trt-ensemble-streaming' model
I0428 00:48:25.747478 1 model_registry.cc:89] Registered 'jasper-asr-trt-ensemble-vad-streaming-offline' model
I0428 00:48:25.747623 1 model_registry.cc:89] Registered 'jasper-asr-trt-ensemble-vad-streaming' model
I0428 00:48:25.747802 1 model_registry.cc:89] Registered 'jasper-asr-trt-ensemble-streaming-offline' model
I0428 00:48:25.747862 1 model_registry.cc:94] Total models available for category 0 on server is 4
I0428 00:48:25.748091 1 grpc_jarvis_asr.cc:195] Seeding RNG used for correlation id with time: 1588034905
I0428 00:48:25.748339 1 jarvis_server.cc:68] ASR Server connected to TensorRT Inference Server at jarvis-triton:8001
I0428 00:48:25.748348 1 jarvis_server.cc:71] Jarvis Conversational AI Server listening on 0.0.0.0:50051
Deploying A Model Using A Helm Chart¶
The Helm chart provided for Jarvis is responsible for downloading model artifacts (if necessary), setting up a model repository, and launching the required services. The Using Helm To Deploy Jarvis AI Services on Kubernetes section in the Jarvis Services Quick Start Guide describes in detail how to retrieve the Helm chart from NGC and how to install it.
When deploying to Kubernetes via Helm, it is possible to disable components that are not required. If Jarvis services other than ASR are not required, modify the values.yaml file before installing the Helm chart.
If NLP and/or TTS is not required, set jarvis.speechServices.[nlp|tts] = false in values.yaml. Optionally, you may remove any subset of NLP-related and/or TTS-related models from modelRepoGenerator.ngcModelConfigs while keeping the services enabled. Models that are not needed (due to a service being disabled) will not be downloaded and installed.
If deploying fine-tuned models, configure modelTemplateVolume to map to a persistent storage device. This volume will be made available to the trtis-model-repo container in /templates.
When building your custom model deployments, use absolute paths including /templates to link to model artifacts stored in this persistent volume. Concretely, the YAML file used for the model generator should be stored at /templates/<name of your model>/config.yaml, along with any other model artifacts. These config paths are then specified in values.yaml in the localModelConfigs array.
Using The Jarvis ASR Service¶
Client applications interact with the Jarvis ASR Service using the gRPC protocol, which supports multiple programming languages. For more information on the API, refer to the ASR API document.
We provide protobuf files so you can generate bindings for your language of choice. These files are included with the Jarvis Quick Start scripts on NGC. In addition, a pip wheel is included for easy installation of the client bindings in Python. For more information, refer to the gRPC documentation for the respective programming language.
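For example, if you prefer to generate the Python bindings yourself instead of installing the provided wheel, the grpcio-tools package can compile the .proto files. The directory and file names below are placeholders; point them at the protobuf files shipped with the Quick Start scripts.
from grpc_tools import protoc  # requires the grpcio-tools pip package

# Placeholder paths; use the protobuf files from the Quick Start scripts.
protoc.main([
    "grpc_tools.protoc",
    "-I", "jarvis_proto",
    "--python_out=.",
    "--grpc_python_out=.",
    "jarvis_proto/jarvis_asr.proto",
    "jarvis_proto/audio.proto",
])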
The container nvcr.io/ea-2-jarvis/jarvis-api-client:ea2 also contains pre-generated bindings for Python in the same folder as a wheel package, and C++ client binaries: /usr/local/bin/jarvis_asr_client and /usr/local/bin/jarvis_streaming_asr_client.
Interacting With The Jarvis ASR Service Using Pre-Generated Python Bindings¶
To interact with the Jarvis ASR Service using Python, use the provided Python bindings at /work/src/jarvis_proto in the jarvis-api-client container.
The following sample code shows how to interact with the Jarvis ASR Service using its gRPC interface. The code below is taken from the Jupyter notebook also provided in the jarvis-api-client container at /work/notebook/Jarvis_speech_API_demo.ipynb.
import io
import numpy as np
import librosa
import grpc
import jarvis_api.jarvis_asr_pb2 as jasr
import jarvis_api.jarvis_asr_pb2_grpc as jasr_srv
import jarvis_api.audio_pb2 as ja
# Establish connection to Jarvis API server and ASR service
jarvis_api_uri = 'localhost:50051'
# Replace with the proper URI if the server
# is not running on the same host as this example.
channel = grpc.insecure_channel(jarvis_api_uri)
jarvis_asr = jasr_srv.JarvisASRStub(channel)
path = "wav/test/1272-135031-0000.wav"
audio, sr = librosa.core.load(path)
with io.open(path, 'rb') as fh:
    content = fh.read()
# Set up an offline/batch recognition request
req = jasr.RecognizeRequest()
req.audio = content # raw bytes
# Only PCM is supported in this release
req.config.encoding = ja.AudioEncoding.LINEAR_PCM
# Audio will be resampled if necessary
req.config.sample_rate_hertz = 16000
# Route to correct model for language if it is deployed
req.config.language_code = "en-US"
# How many top-N hypotheses to return
req.config.max_alternatives = 1
# Add punctuation when end of VAD detected
req.config.enable_automatic_punctuation = True
response = jarvis_asr.Recognize(req)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)
print("\nFull Response Message:")
print(response)
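The same request object can also be used to ask for word-level time stamps and multiple hypotheses. The snippet below extends the example above and is a sketch only; the field names used here (enable_word_time_offsets, words, start_time, end_time) are assumptions based on common speech API protobufs, so check the ASR API document and the shipped protobuf definitions for the exact names in your release.
# Sketch only: field names are assumptions; consult the ASR API document.
req.config.max_alternatives = 3             # request the top-3 hypotheses
req.config.enable_word_time_offsets = True  # request word-level time stamps
response = jarvis_asr.Recognize(req)

for rank, alternative in enumerate(response.results[0].alternatives):
    print(rank, alternative.transcript)
    for word_info in alternative.words:
        print("  ", word_info.word, word_info.start_time, word_info.end_time)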
Within the nvcr.io/ea-2-jarvis/jarvis-api-client:ea2 container, refer to the Jupyter notebook located at /notebooks/Jarvis_AI_services_demo.ipynb for an example of how to integrate the ASR service with Jarvis. To run it, launch the nvcr.io/ea-2-jarvis/jarvis-api-client:ea2 container with the following command:
./quickstart/jarvis_start_client.sh
From inside the client container, run:
cd /notebooks; jupyter notebook --allow-root
Then, follow the link shown on the screen to access the notebook in your browser.
A Python example doing streaming recognition is also provided in the client container at ./examples/jarvis_streaming_asr_client.py.
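For reference, the core of a streaming client might look like the sketch below, which reuses the channel, stub, and audio content from the offline example above. It assumes a StreamingRecognize RPC whose first request carries the configuration and whose subsequent requests carry audio chunks; the message and field names are assumptions, so treat the provided jarvis_streaming_asr_client.py as the authoritative example.
def request_generator(content, config, chunk_size=4096):
    # First request carries the streaming configuration (assumed message layout).
    streaming_config = jasr.StreamingRecognitionConfig(config=config, interim_results=True)
    yield jasr.StreamingRecognizeRequest(streaming_config=streaming_config)
    # Subsequent requests carry successive chunks of audio.
    for offset in range(0, len(content), chunk_size):
        yield jasr.StreamingRecognizeRequest(audio_content=content[offset:offset + chunk_size])

config = jasr.RecognitionConfig(
    encoding=ja.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,
)
responses = jarvis_asr.StreamingRecognize(request_generator(content, config))
for response in responses:
    for result in response.results:
        # Intermediate transcripts arrive as soon as they are available.
        print(result.alternatives[0].transcript)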
Interacting With The Jarvis ASR Service Using The Provided C++ Clients¶
Two client binaries are also provided to perform offline and streaming recognition. For example, to perform offline recognition on all the files in the folder /work/wav/test/ of the client container, run:
/usr/local/bin/jarvis_asr_client --word_time_offsets=false --audio_file=/work/wav/test/
To perform streaming recognition, run:
/usr/local/bin/jarvis_streaming_asr_client --audio_file=/work/wav/test/1272-135031-0000.wav
The client binaries have more options, which can be viewed by running /usr/local/bin/jarvis_asr_client and /usr/local/bin/jarvis_streaming_asr_client without any arguments.
Troubleshooting And Support¶
FAQs¶
Q: Our pipeline is currently configured to run at 48 kHz; however, ASR currently supports only 16 kHz data. Will there be any degradation in accuracy if we downsample the data to 16 kHz and pass it to QuartzNet?
A: The API is able to automatically resample input data to match the sample rate expected by the model. Jarvis pretrained models are typically trained mostly on 16 kHz data, and occasionally on upsampled 8 kHz data. Generally, a 16 kHz audio bandwidth is enough to represent the most important parts of the spectrum for voice.
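If you prefer to resample on the client side before sending audio, the following sketch shows one way to do it with librosa (the file name is a placeholder, and the keyword-argument form of librosa.resample is assumed to be available in your librosa version):
import librosa

# Client-side downsampling from 48 kHz to 16 kHz before sending audio to the service.
audio_48k, _ = librosa.load("input_48k.wav", sr=48000)
audio_16k = librosa.resample(audio_48k, orig_sr=48000, target_sr=16000)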
Q: Should I provide denoised audio or the original noisy audio as input to the ASR? Our denoising is designed for better human interpretation and might not necessarily improve machine interpretation, depending on how the network was trained.
A: It is generally better to pass in the raw audio. The models are built to be somewhat robust to noise. Denoising algorithms have a different objective (human intelligibility), and sometimes cause a mismatch in training and inference conditions.