ASR Customization Best Practices#

The Riva Quick Start scripts allow you to easily deploy preconfigured ASR pipelines that are very accurate for most applications. The Pipeline Configuration section provides the riva-build commands used to configure the ASR pipelines that are in the Quick Start scripts. You can also easily customize the Riva ASR pipeline in order to meet your specific needs.

This section provides best practices for customizing the Riva ASR pipeline for recognition accuracy. Customization techniques are helpful when out-of-the-box Riva models fall short in challenging scenarios not seen in the training data, such as narrow-domain terminology, new accents, or noisy environments.

The customization steps and components include:

  • Feature extractor - The audio signal first passes through a feature extractor, which segments it into blocks (say, of 80 ms each). Each block is then converted from the temporal domain to the frequency domain, in the form of a spectrogram or mel spectrogram (see the sketch after this list).

  • Acoustic model - The spectrogram data is then fed into an acoustic model, which outputs probabilities over characters (or, more generally, text tokens) at each block, that is, at each time step. Some acoustic models supported by Riva are QuartzNet, Citrinet, Jasper, and Conformer.

  • Decoder and Language model - A decoder converts this matrix of probabilities into a sequence of characters. A language model can give a score indicating the likelihood of this text appearing in its training corpus. An advanced decoder like Flashlight can inspect multiple text sequences (hypotheses) while combining the acoustic model score and the language model score.

  • Punctuation and Capitalization - The text produced by the decoder has no punctuation or capitalization; adding them is the job of the Punctuation and Capitalization model.

  • Inverse Text Normalization - Finally, Inverse Text Normalization (ITN) rules are applied to transform the text in verbal format into a desired written format.
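
As a rough illustration of the feature-extraction step, the following sketch computes an 80-band log-mel spectrogram with the librosa package. The file name, window size, and hop size are illustrative assumptions, not Riva's exact configuration.

import librosa

# Load audio as 16 kHz mono (the file name is a hypothetical example).
y, sr = librosa.load("audiofile.wav", sr=16000, mono=True)

# Convert the waveform from the time domain to a mel spectrogram.
# n_fft, win_length, hop_length, and n_mels are illustrative values only.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, win_length=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)  # log scale, as acoustic models typically expect
print(log_mel.shape)  # (80, number_of_time_steps)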

The following flow diagram shows the Riva speech recognition pipeline along with the possible customizations.

Figure: Riva speech recognition steps and possible customizations

The following customization techniques are listed in increasing order of difficulty and effort:

  1. Word boosting
     Difficulty: Quick and easy
     What it does: Extends the vocabulary while increasing the chance of recognition for a provided list of keywords. This strategy enables you to easily improve recognition of specific words at request time.
     When to use: When certain words or phrases are important in a particular context, for example, attendee names in a meeting.
     How to use: Demo tutorial

  2. Custom vocabulary
     Difficulty: Easy
     What it does: Extends the default vocabulary to cover new words of interest.
     When to use: When the default model vocabulary does not sufficiently cover the domain of interest.
     How to use: Demo tutorial

  3. Custom pronunciation (lexicon mapping)
     Difficulty: Easy
     What it does: Explicitly guides the decoder to map pronunciations (that is, token sequences) to specific words. The lexicon-based decoder only emits words that are present in the decoder lexicon, and the lexicon can be modified to improve recognition.
     When to use: When a word can have one or several possible pronunciations.
     How to use: Demo tutorial

  4. Retrain language model
     Difficulty: Moderate
     What it does: Trains a new language model for the application domain to improve the recognition of domain-specific terms. The Riva ASR pipeline supports n-gram language models. Using a language model that is tailored to your use case can greatly improve the accuracy of transcripts.
     When to use: When domain text data is available.
     How to use: Demo notebook

  5. Create new inverse text normalization rules
     Difficulty: Moderately hard
     What it does: Maps the sequence of transcribed spoken words into a desired written format.
     When to use: When a particular written format is required.
     How to use: Demo tutorial

  6. Fine-tune an existing acoustic model
     Difficulty: Moderately hard
     What it does: Fine-tunes an existing acoustic model using a small amount of domain data to better suit the domain.
     When to use: When transcribed domain audio data (10h-100h) is available and other, easier approaches fall short.
     How to use: Speech to Text Citrinet notebook

  7. Train a new acoustic model
     Difficulty: Hard
     What it does: Trains a brand new acoustic model, either from scratch or with cross-language transfer learning, using thousands of hours of audio data.
     When to use: Only recommended when adapting Riva to a new language or dialect.
     How to use: Demo tutorials

Choosing a Customization Technique#

To decide which techniques to use and when, consult the following guidance. As a general practice, attempt the simpler techniques first, observe the impact, and then move on to more complex techniques.

  • Goal: Improve Riva recognition of specific words on special occasions, such as uncommon people’s names in a meeting recording.
    Solution: Word boosting.

  • Goal: Improve Riva recognition of specific terms in recurring contexts, such as product names in a potential application for a particular business.
    Solutions: Word boosting, custom vocabulary, lexicon mapping, a domain-specific language model, or fine-tuning the acoustic model.

  • Goal: Improve Riva recognition of general words in challenging acoustic environments, such as noise, poor data quality (for example, a phone line), or a new dialect.
    Solution: Fine-tune the acoustic model with the relevant data and a data augmentation strategy, for example, noise-augmented training data to cope with noisy environments.

  • Goal: Riva correctly recognizes the words, but the final text output format does not meet the specification, for example, “e x three 0 five q” instead of “EX305Q”.
    Solution: Custom inverse text normalization.

Model Adaptation#

Model adaptation is a group of techniques that help a pre-trained model adapt to new application scenarios in a quick and easy manner (without fine-tuning or retraining). Some high-level recommendations when using these techniques include:

  1. Word boosting can be used with any word. If the word is not in the lexicon of the decoder, it is automatically tokenized and added to the lexicon for that particular request (other simultaneous requests are independent and will not recognize those boosted words).

  2. Manually providing a pronunciation spelling for words not in the lexicon is a workaround that is not needed in many cases. It cannot be done through the word boosting API and can only be done by manually modifying the deployed lexicon file.

  3. Modifying the lexicon or language model should be the preferred method in cases where there are words known that should always be recognized (for example, brands, product names, and so on).

  4. The word boosting API is intended for dynamic emphasis of words that are only known at request time (for example, a user’s address book, attendees of the meeting being transcribed, and so on).

Word Boosting#

Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time, by giving them a higher score when decoding the output of the acoustic model.

Word boosting provides a quick and temporary adaptation for the model to cope with new scenarios, such as recognizing proper names, products, and new or domain-specific terminology. For OOV (Out Of Vocabulary) words, the word boosting functionality extends the vocabulary at inference time. You have to explicitly specify the list of boosted words at every request. Other adaptation methods, such as custom vocabulary and lexicon mapping, provide a more permanent solution that affects every subsequent request.

Of all the adaptation techniques, word boosting is the easiest and quickest to implement. All you need to do is pass a list of words of importance to the model, along with a weight, as extra context to the API call, as shown in the following example.

# Word Boosting
boosted_lm_words = ["BMW", "Ashgard"]
boosted_lm_score = 20.0
speech_context = rasr.SpeechContext()
speech_context.phrases.extend(boosted_lm_words)
speech_context.boost = boosted_lm_score
config.speech_contexts.append(speech_context)

# Creating StreamingRecognitionConfig instance with config
streaming_config = rasr.StreamingRecognitionConfig(config=config, interim_results=True)

Make note of the following while implementing word boosting:

  • There is no limit to the number of words that can be boosted. You should see minimal impact on latency for all requests, even for tens of boosted words, except for the first request, where some extra latency is expected.

  • By default, no words are boosted on the server side. Only words passed by the client are boosted.

  • Out-of-vocabulary word boosting is supported.

  • Boosting phrases or combinations of words is not yet fully supported (but does work). We will revisit finalizing this support in an upcoming release.

  • Word boosting can improve the chance of recognition of the desired words, but at the same time, it can increase false positives. As such, start with a small positive weight and gradually increase it until you see positive effects. Start with a boosted score of 20 and increase up to 100 if needed.

  • Word boosting is most suitable as a temporary fix. For best output, you can attempt a binary search over the boosted weights while monitoring the accuracy metrics on a test set (a sketch follows this list). The accuracy metrics should include both the word error rate (WER) and possibly a form of term error rate (TER) focusing on the terms of interest.
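
As a rough sketch of that tuning loop, the following code sweeps candidate boost scores and measures WER with the third-party jiwer package. The transcribe_with_boost callable is a hypothetical helper that wraps the Riva client calls shown later in this section; the test set is assumed to be a list of (audio_path, reference_text) pairs.

import jiwer  # third-party package that computes word error rate

def tune_boost_score(test_set, transcribe_with_boost, candidate_scores=(10.0, 20.0, 40.0, 80.0)):
    """Return the boost score with the lowest WER on a held-out test set.

    test_set: list of (audio_path, reference_text) pairs.
    transcribe_with_boost: hypothetical callable (audio_path, boost) -> transcript
        wrapping the Riva client calls shown in this section.
    """
    best_score, best_wer = None, float("inf")
    for boost in candidate_scores:
        hypotheses = [transcribe_with_boost(audio, boost) for audio, _ in test_set]
        references = [ref for _, ref in test_set]
        wer = jiwer.wer(references, hypotheses)
        print(f"boost={boost}: WER={wer:.3f}")
        if wer < best_wer:
            best_score, best_wer = boost, wer
    return best_score, best_wer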

Word boosting examples - Examples demonstrating how to use word boosting can be found in the /work/examples/transcribe_file_offline.py and /work/examples/transcribe_file.py Python scripts in the Riva client image. The following sample commands show how to run these scripts (and the output they generate) from within the Riva client container:

/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav
Final transcript: I had a meeting today with Muhammad Oscar and Katherine Rutherford about the future of Riva at NVIDIA.
/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav --boosted_lm_words "asghar"
Final transcript: I had a meeting today with Muhammad Asghar and Katherine Rutherford about the future of Riva at NVIDIA.

These scripts show how to add the boosted words to RecognitionConfig, with SpeechContext (look for the "# Append boosted words/score" comment). For more information about SpeechContext, refer to the riva/proto/riva_asr.proto description here.

We recommend using boosting score values between 20 and 100. A higher score increases the likelihood that the boosted words appear in the transcript if the words occurred in the audio. However, it can also increase the likelihood that the boosted words appear in the transcript even though they did not occur in the audio. Experiment with the boosting score values until you get accurate transcription results.

The following word boosting code snippets are included in these example scripts:

# Creating GRPC channel and RecognitionConfig instance
channel = grpc.insecure_channel(args.server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
config = rasr.RecognitionConfig(
  encoding=ra.AudioEncoding.LINEAR_PCM,
  sample_rate_hertz=wf.getframerate(),
  language_code=args.language_code,
  max_alternatives=1,
  enable_automatic_punctuation=True,
)

# Word Boosting
boosted_lm_words = ["first", "second", "third"]
boosted_lm_score = 10.0
speech_context = rasr.SpeechContext()
speech_context.phrases.extend(boosted_lm_words)
speech_context.boost = boosted_lm_score
config.speech_contexts.append(speech_context)

# Creating StreamingRecognitionConfig instance with config
streaming_config = rasr.StreamingRecognitionConfig(config=config, interim_results=True)

You can also have different boost values for different words. For example, here first is boosted by 10 and second is boosted by 20:

speech_context1 = rasr.SpeechContext()
speech_context1.phrases.append("first")
speech_context1.boost = 10.
config.speech_contexts.append(speech_context1)

speech_context2 = rasr.SpeechContext()
speech_context2.phrases.append("second")
speech_context2.boost = 20.
config.speech_contexts.append(speech_context2)

Custom Vocabulary#

The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder and only emits words that are present in the provided vocabulary file. This means that domain-specific words that are not present in the vocabulary file have no chance of being generated.

There are two ways to expand the decoder vocabulary:

  • At Riva build time - When building a custom model, pass the extended vocabulary file to the --decoding_vocab=<vocabulary_file> parameter of the build command (see the sketch after this list). Out-of-the-box vocabulary files for Riva languages can be found on NGC; for example, the English vocabulary file named flashlight_decoder_vocab.txt can be found here.

  • After deployment - For a production Riva system, the lexicon file can be modified and extended; the changes take effect after a server restart. Refer to the next section.
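
As a minimal sketch of the build-time option, the following code appends domain terms to a copy of the stock vocabulary file before it is passed to --decoding_vocab. The file names and example terms are illustrative assumptions.

# Append domain-specific terms to a copy of the vocabulary file from NGC.
domain_terms = ["asghar", "citrinet", "riva"]

with open("flashlight_decoder_vocab.txt", "r", encoding="utf-8") as f:
    vocab = {line.strip() for line in f if line.strip()}

# Vocabulary entries are typically lowercase, one word per line.
vocab.update(term.lower() for term in domain_terms)

with open("extended_decoder_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(vocab)) + "\n")

# Then build the pipeline with, for example:
#   riva-build speech_recognition ... --decoding_vocab=extended_decoder_vocab.txt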

Note that the greedy decoder (available during the riva-build process under the flag --decoder_type=greedy) is not vocabulary-based and therefore, can produce any character sequence.

Custom Pronunciation (Lexicon Mapping)#

When using the Flashlight decoder, the lexicon file provides a mapping between vocabulary words and their tokenized forms, for example, SentencePiece tokens for many Riva models.

Modifying the lexicon file serves two purposes:

  • Extends the vocabulary.

  • Provides one or more explicit custom pronunciations for a specific word. For example:

    manu ▁ma n u
    manu ▁man n n ew
    manu ▁man n ew
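
The following sketch shows one way such entries could be generated: it tokenizes new words with a SentencePiece model and appends the resulting lines to the lexicon. The tokenizer and lexicon file names are assumptions; use the files from your deployed Riva model repository, and restart the server for the changes to take effect.

import sentencepiece as spm

# Paths are illustrative; point them at the tokenizer model and lexicon file
# of the deployed ASR pipeline.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

new_words = ["manu", "asghar"]
with open("lexicon.txt", "a", encoding="utf-8") as lexicon:
    for word in new_words:
        tokens = sp.encode(word.lower(), out_type=str)  # e.g. ['▁ma', 'n', 'u']
        # One entry per line: the word followed by its space-separated tokens.
        lexicon.write(f"{word.lower()} {' '.join(tokens)}\n")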
    

Training Language Models#

Introducing a language model to an ASR pipeline is an easy way to improve accuracy for natural language and can be tailored to niche settings. In short, an n-gram language model estimates the probability distribution over groups of n or fewer consecutive words, P(word-1, …, word-n). By altering or biasing the data on which a language model is trained, and thus the distribution it is estimating, it can be made to rank different transcriptions as more likely, and thus alter the prediction without changing the acoustic model. Riva supports n-gram models trained and exported from either NVIDIA TAO Toolkit or KenLM.
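
As a toy illustration of the idea (not how Riva or KenLM estimate models), the following sketch computes unsmoothed maximum-likelihood bigram probabilities from a tiny corpus:

from collections import Counter

# Toy corpus; in practice this would be your domain text.
corpus = [
    "please call the riva service",
    "please call the help desk",
    "call the riva service now",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1), with no smoothing."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "riva"))  # higher when 'riva' often follows 'the' in the corpus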

Custom language models can provide a permanent solution to improve the recognition of domain-specific terms and phrases. Riva currently does not support LM fine-tuning; however, a domain-specific custom LM can be mixed with a general-domain LM through a process called interpolation.

TAO Toolkit Language Model#

The general TAO Toolkit model development pipeline is outlined in the Model Overview page. To train a new language model, run:

!tao n_gram train -e /specs/nlp/lm/n_gra/train.yaml \
                     export_to=PATH_TO_TAO_FILE \
                     training_ds.data_dir=PATH_TO_DATA \
                     model.order=4 \
                     model.pruning=[0,1,1,3]  \
                     -k $KEY

To export a pretrained model, run:

### For export to Riva
!tao n_gram export \
           -e /specs/nlp/intent_slot_classification/export.yaml \
           -m PATH_TO_TAO_FILE \
           export_to=PATH_TO_RIVA_FILE \
           binary_type=probing \
           -k $KEY

For more information, refer to the TAO Toolkit documentation and try running the N-Gram Language Model Jupyter notebook.

KenLM Setup#

KenLM is the recommended tool for building language models. This toolkit supports estimating, filtering, and querying n-gram language models. To begin, first make sure you have Boost and zlib installed. Depending on your requirements, you may require additional dependencies. Double check by referencing the dependencies list.

After all dependencies are met, create a separate directory to build KenLM.

wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2

Estimating#

The next step is to gather and process data. In most cases, KenLM expects data to be natural language (suiting your use case). Common preprocessing steps include replacing numerics and removing umlauts, punctuation, or special characters. However, it is most important that your preprocessing steps are consistent between both your language and acoustic model.
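
As a minimal sketch of such preprocessing for English text (the exact steps must match your acoustic model's training transcripts), the following code lowercases each line, spells out digits, and strips punctuation before the corpus is fed to lmplz. The input file name is a hypothetical example; the output file name matches the lmplz command shown below.

import re

# Hypothetical digit verbalization map; extend it as needed for your language.
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_line(line: str) -> str:
    """Lowercase, spell out digits, and drop punctuation and special characters."""
    line = line.lower()
    line = re.sub(r"\d", lambda m: f" {DIGITS[m.group(0)]} ", line)
    line = re.sub(r"[^a-z' ]+", " ", line)    # keep letters, apostrophes, and spaces
    return re.sub(r"\s+", " ", line).strip()  # collapse repeated whitespace

with open("raw_corpus.txt", encoding="utf-8") as fin, \
     open("text", "w", encoding="utf-8") as fout:
    for raw in fin:
        cleaned = normalize_line(raw)
        if cleaned:
            fout.write(cleaned + "\n")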

Assuming your current working directory is the build subdirectory of KenLM, bin/lmplz performs estimation on the corpus provided through stdin and writes the ARPA (a human-readable form of the language model) to stdout. Running bin/lmplz documents the command-line arguments; here are a few important ones:

  • -o: Required. The order of the language model. Depends on use case, but generally 3-8.

  • -S: Memory to use. Number followed by % for percentage, b for bytes, K for kilobytes, and so on. Default is 80%.

  • -T: Temporary file location

  • --text arg: Read text from a file instead of stdin.

  • --arpa arg: Write ARPA to a file instead of stdout.

  • --prune arg: Prune n-grams with count less than or equal to the given threshold, with one value specified for each order. For example, to prune singleton trigrams, --prune 0 0 1. The sequence of values must be nondecreasing and the last value applies to all remaining orders. Default is not to prune. Unigram pruning is not supported, so the first number must be 0.

  • --limit_vocab_file arg: Read allowed vocabulary separated by whitespace from file in argument and prune all n-grams containing vocabulary items not from the list. Can be combined with pruning.

Pruning and limiting vocabulary help to get rid of typos, uncommon words, and general outliers from the dataset, making the resulting ARPA smaller and generally less overfit, but potentially at the cost of losing some jargon or colloquial language.

With the appropriate options, the language model can be estimated.

bin/lmplz -o 4 < text > text.arpa

Querying and Evaluation#

For faster loading, convert the ARPA file to binary.

bin/build_binary text.arpa text.binary

The binary or ARPA can be queried via the command-line.

bin/query text.binary < data
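
With the KenLM Python bindings installed (for example, pip install kenlm), the binary model can also be scored programmatically; this small sketch is useful for sanity-checking the model on in-domain sentences. The file name and sentence are illustrative.

import kenlm

model = kenlm.Model("text.binary")

# Log10 probability of the full sentence, including begin/end-of-sentence markers.
print(model.score("please call the riva service", bos=True, eos=True))

# Perplexity is often easier to compare across models and test sets.
print(model.perplexity("please call the riva service"))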

Training from Scratch#

When a substantial amount of raw text is available, a custom LM can be trained from scratch. Riva supports n-gram models trained and exported from either NVIDIA TAO Toolkit or KenLM.

For Riva ASR models in production, we aggregate all the transcribed text in the training data for training the language models.

Limit the vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.

When the text belongs to a narrow, niche domain, the overall ASR pipeline's ability to recognize general-domain language might suffer as a trade-off. Therefore, experiment with mixing domain text with general text for a balanced representation.

LM Interpolation#

An alternative approach to mixing data is to mix two or more pretrained n-gram language models in ARPA format. This can be carried out with a third-party tool, such as the SRI ngram tool. For example:

./ngram -order 4 -lm <lm1.arpa> -mix-lm <lm2.arpa> -lambda 0.4 -write-lm <output_lm.arpa>

This command interpolates lm1.arpa and lm2.arpa with weights [0.6, 0.4] while writing to output_lm.arpa.

Deploying a Custom Language Model#

A custom n-gram language model file in binary format can be deployed as part of an ASR pipeline by passing the binary language model file to riva-build using the flag --decoding_language_model_binary=<lm_binary>.

Inverse Text Normalization#

Riva implements inverse text normalization (ITN) for ASR requests. It uses weighted finite-state transducer (WFST) based models to convert the spoken-domain output of an ASR model into written-domain text, improving the readability of the ASR system's output.

Text normalization converts text from written form into its verbalized form. It is used as a preprocessing step before TTS. It could also be used for preprocessing ASR training transcripts.

Inverse text normalization (ITN) is a part of the ASR post-processing pipeline. ITN is the task of converting the raw spoken output of the ASR model into its written form to improve text readability.

Riva implements NeMo ITN, which is based on weighted finite-state transducer (WFST) grammars. The tool uses Pynini to construct WFSTs. The created grammars can be exported and integrated into Sparrowhawk (an open-source version of the Kestrel TTS text normalization system) for production.
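
As a toy illustration of a WFST rewrite rule built with Pynini (not one of the production NeMo grammars), the following sketch maps a few spoken-form numbers to written form, using the application idiom from the NeMo WFST tutorial:

import pynini

# Toy spoken-to-written mappings; real ITN grammars compose many such rules.
cardinal = pynini.string_map([
    ("twenty", "20"),
    ("twenty one", "21"),
    ("thirty", "30"),
])

def apply_itn(text: str) -> str:
    """Rewrite text with the toy grammar, returning the lowest-weight path."""
    lattice = pynini.accep(text) @ cardinal
    return pynini.shortestpath(lattice).string()

print(apply_itn("twenty one"))  # -> 21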

With a functional NeMo installation, the German ITN grammars, for example, can be exported with the pynini_export.py tool as follows:

python3 pynini_export.py --output_dir . --grammars itn_grammars --input_case cased --language de

This exports the tokenize_and_classify and verbalize FSTs as OpenFst finite-state archive (FAR) files, ready to be deployed with Riva.

[NeMo I 2022-04-12 14:43:17 tokenize_and_classify:80] Creating ClassifyFst grammars.
Created ./de/classify/tokenize_and_classify.far
Created ./de/verbalize/verbalize.far

To deploy these ITN rules with Riva, pass the FAR files to the riva-build command under these options:

riva-build speech_recognition
[--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
[--wfst_verbalizer_model WFST_VERBALIZER_MODEL]

To learn more about how to build grammars from the ground up, consult the NeMo Weighted Finite State Transducers (WFST) tutorial.

Details on the model architecture can be found in the paper NeMo Inverse Text Normalization: From Development To Production.

Training or Fine-Tuning an Acoustic Model#

Model fine-tuning is a set of techniques that makes fine adjustments to a pre-existing model using new data, so as to make it adapt to new situations while also retaining its original capabilities.

Model training refers to training a new model either from scratch (that is, starting from random weights), or with weights initialized from an existing model, but with the goal of having the model acquire totally new skills without necessarily retaining the original capabilities (such as in cross-language transfer learning).

Many use cases require training new models or fine-tuning existing ones with new data. In these cases, there are a few best practices to follow. Many of these best practices also apply to inputs at inference time.

  • Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality.

  • Augment training data. Adding background noise to audio training data can initially decrease accuracy, but increases robustness (see the sketch after this list).

  • Limit vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.

  • Use a minimum sampling rate of 16kHz if possible, but do not resample.

  • If using TAO to fine-tune ASR models, refer to the TAO Toolkit documentation on training acoustic models here. Try running the following Jupyter notebooks: the Speech To Text Citrinet notebook and the Speech to Text notebook.

  • If using NeMo to fine-tune ASR models, refer to this tutorial. We recommend fine-tuning ASR models only with sufficient data, approximately on the order of several hundred hours of speech. If such data is not available, it may be more useful to simply adapt the LM on an in-domain text corpus than to train the ASR model.

  • There is no formal guarantee that the ASR model will or will not be streamable after training. We recommend fine-tuning the Conformer acoustic model if doing streaming recognition. In our experience, it provides better streaming WER after fine-tuning compared to Citrinet.
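
As a minimal sketch of noise augmentation (file names and the target SNR are illustrative assumptions, and mono audio is assumed), the following code mixes a noise recording into a clean training utterance at a chosen signal-to-noise ratio:

import numpy as np
import soundfile as sf  # third-party package for reading and writing audio files

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix mono noise into mono speech at the requested SNR (in dB)."""
    # Tile or truncate the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    return np.clip(mixed, -1.0, 1.0)  # keep samples in the valid float range

# File names are hypothetical examples.
speech, sr = sf.read("clean_utterance.wav")
noise, _ = sf.read("background_noise.wav")
sf.write("augmented_utterance.wav", mix_at_snr(speech, noise, snr_db=10.0), sr)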

Training New Models#

Train models from scratch - End-to-end training of ASR models requires large datasets and heavy compute resources. There are more than 5,000 languages around the world, but very few languages have datasets large enough to train high-quality ASR models. For this reason, we only recommend training models from scratch when several thousand hours of transcribed speech data are available.

Cross-language transfer learning - Cross-language transfer learning is especially helpful when training new models for low-resource languages. But even when a substantial amount of data is available, cross-language transfer learning can help boost the performance further.

It is based on the idea that phoneme representations can be shared across different languages. Experiments by the NeMo team showed that with as little as 16 hours of target-language audio data, transfer learning works substantially better than training from scratch. In a GTC 2020 talk, NVIDIA data scientists demonstrate cross-language transfer learning for a low-resource language with less than 30 hours of speech data.

Fine-Tuning Existing Models#

When other, easier approaches have failed to address accuracy issues in challenging situations brought about by significant acoustic factors, such as different accents, noisy environments, or poor audio quality, fine-tuning the acoustic model should be attempted.

We recommend fine-tuning ASR models only when sufficient data is available, approximately on the order of 100 hours of speech or more. The minimum amount of data we used for NeMo transfer learning was ~100 hours, for the CORAAL dataset, as shown in this paper. Our experiments demonstrate that in all three cases of cross-language transfer learning, continual learning, and domain adaptation, transfer learning from a good base model achieves higher accuracy than a model trained from scratch. It is also preferable to fine-tune a large model rather than train a small model from scratch, even if the fine-tuning dataset is small.

Low-resource domain adaptation - In the case of smaller datasets, such as ~10 hours, take appropriate precautions to avoid overfitting to the domain and thereby sacrificing significant accuracy in the general domain, a problem also known as catastrophic forgetting. If fine-tuning is done on such a small dataset, mix it with other, larger (“base”) datasets. For English, for example, NeMo has a list of public datasets that it can be mixed with.

In transfer learning, continual learning is a sub-problem wherein models that are trained with new domain data should still retain good performance on the original source domain.

If using NeMo to fine-tune ASR models, refer to this tutorial.

Data quality and augmentation - Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality. As a regular practice, use a minimum sampling rate of 16kHz.

Augmenting training data with noise can improve the model’s ability to cope with noisy environments. Adding background noise to audio training data can initially decrease accuracy, but increases robustness.

Streaming#

There is no guarantee that the ASR model will or will not be streamable after training. We see that with more training (thousands of hours of speech, 100-200 epochs), models generally obtain better offline scores and online scores do not degrade as severely (but still degrade to some extent due to differences between online and offline evaluation).

Punctuation and Capitalization Model#

ASR systems typically generate text with no punctuation or capitalization. In Riva, the punctuation and capitalization model is responsible for formatting the text with both punctuation and capitalization.

The punctuation and capitalization model should be customized when an out-of-the-box model does not perform well in the application context, such as when applying to a new language variant.

To either train or fine-tune, and then deploy a custom punctuation and capitalization model, refer to the Punctuation Capitalization Notebook.

Deploying a Custom Acoustic Model#

If using NVIDIA TAO Toolkit, refer to the TAO to Riva deployment notebook.

If using NVIDIA NeMo, the model must first be exported from a .nemo format to a .riva format using the nemo2riva tool that is available as part of the Riva distribution. Next, use the Riva ServiceMaker containers and tools (riva-build and riva-deploy) for deployment. For more information, refer to Deploying Your Custom Model into Riva.