How to Customize Riva ASR Vocabulary and Pronunciation with Lexicon Mapping

This notebook walks you through customizing the Riva ASR vocabulary and lexicon in order to improve vocabulary coverage and the recognition of difficult words, such as acronyms.

Overview

The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder and only emits words that are present in the provided lexicon file. This means that uncommon and new words, such as domain-specific terminology, cannot be generated if they are not present in the lexicon file.

On the other hand, the greedy decoder (available as an option during the riva-build process with the flag --decoder_type=greedy) is not lexicon-based and can produce virtually any word or character sequence.
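
For reference, a minimal sketch of a build command that selects the greedy decoder (placeholders mirror the lexicon-based build example shown later in this notebook):

riva-build speech_recognition \
    <rmir_filename>:<key> <riva_filename>:<key> \
    --decoder_type=greedy \
    <other_parameters>...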

Prerequisites

This notebook assumes that you are familiar with manually deploying a Riva ASR pipeline using the Riva ServiceMaker tool (the riva-build and riva-deploy commands). See the Riva documentation.

Terminology

  • Vocabulary file: The vocabulary file is a flat text file containing a list of vocabulary words, each on its own line. For example:

the
i
to
and
a
you
of
that
...

This file is used by the riva-build process to generate the lexicon file.

  • Lexicon file: The lexicon file is a flat text file that maps each vocabulary word to its tokenized form (e.g., SentencePiece tokens), separated by a tab. Below is an example:

with    ▁with
not     ▁not
this    ▁this
just    ▁just
my      ▁my
as      ▁as
don't   ▁don ' t

Note: At run time, the Riva decoder uses only the lexicon file directly, not the vocabulary file.

Riva ServiceMaker automatically tokenizes the words in the vocabulary file to generate the lexicon file. It uses the tokenizer model that is packaged together with the acoustic model in the .riva file. By default, Riva generates one tokenized form for each word in the vocabulary file.
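
As an illustration only, the following sketch (assuming the sentencepiece Python package and a tokenizer model file named tokenizer.model, a placeholder) reproduces the kind of single, deterministic tokenization that this step generates for each vocabulary word:

import sentencepiece as spm

# Load the tokenizer model that ships with the acoustic model (placeholder file name).
s = spm.SentencePieceProcessor(model_file='tokenizer.model')

# Deterministic (non-sampled) encoding: one tokenized form per word, as in the lexicon file.
for word in ["with", "not", "don't"]:
    print(word + '\t' + ' '.join(s.encode(word, out_type=str)))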

What can be customized?

Both the vocabulary and the lexicon files can be customized.

  • Extending the vocabulary enriches Riva's default vocabulary, providing additional coverage for out-of-vocabulary words, domain-specific terminology, and abbreviations.

  • Customizing the lexicon file can further enrich the Riva knowledge base by providing one or more explicit pronunciations, in the form of tokenized sequences.

Extending the vocabulary

Extending the vocabulary must be done at Riva build time.

When building a Riva ASR pipeline, pass the extended vocabulary file to the --decoding_vocab=<vocabulary_file> parameter of the build command. For example, the build command for the Citrinet model:

riva-build speech_recognition \
    <rmir_filename>:<key> <riva_filename>:<key> \
    --name=citrinet-1024-english-asr-streaming \
    --decoding_language_model_binary=<lm_binary> \
    --decoding_vocab=<vocabulary_file> \
    --language_code=en-US \
    <other_parameters>...

Refer to Riva documentation for build commands for supported models.

How to modify the vocabulary file

You can either provide your own vocabulary file, or extend Riva’s default vocabulary file.

  • BYO vocabulary file: Provide a flat text file containing a list of vocabulary words, each on its own line. Note that this file must not contain only a small list of “difficult words”; it must contain all the words that you want the ASR pipeline to be able to generate, including all common words.

  • Modifying an existing one: Out-of-the-box vocabulary files for Riva-supported languages can be found on NGC; for example, for English, the vocabulary file named flashlight_decoder_vocab.txt can be found at this link. Alternatively, it can also be found in a deployed Riva ASR pipeline, for example, under /data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/dict_vocab.txt in the Riva server Docker container. You can make a copy, then extend this default vocabulary file with the words of interest, as sketched below.
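
As a minimal sketch (the file names and added words are placeholders), extending the default file can be as simple as appending your terms, one per line, while avoiding duplicates:

# Placeholder file names: a local copy of the default vocabulary and the extended output file.
default_vocab = "flashlight_decoder_vocab.txt"
extended_vocab = "extended_vocab.txt"

# Example domain-specific words to add.
new_words = ["manu", "BRAF", "WhatsApp"]

with open(default_vocab) as f:
    words = [line.strip() for line in f if line.strip()]

existing = set(words)
with open(extended_vocab, "w") as f:
    for word in words + [w for w in new_words if w not in existing]:
        f.write(word + "\n")

The resulting file is what you pass to riva-build via --decoding_vocab.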

Once modified, rebuild the Riva ASR pipeline with riva-build, passing the flag --decoding_vocab=<modified_vocabulary_file>, and redeploy it.

Customizing pronunciation with lexicon mapping

The lexicon file that is used by the Flashlight decoder can be found in the Riva server docker container, under the Triton model folder, for example /data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/lexicon.txt.

Note: The actual physical location of this file on the host file system, along with other Riva model assets, is specified by the riva_model_loc parameter in the Riva configuration file config.sh under the Riva Quick Start root folder.

How to modify the lexicon file

First, locate and make a copy of the lexicon file. For example:

cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt

Next, modify it to add the sentencepiece tokenizations for the words of interest. For example, one could add:

manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew

which are three different pronunciations/tokenizations of the word manu. If the acoustic model predicts any of those token sequences, it will be decoded as manu.
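
A minimal sketch of appending such entries programmatically (the word and its token sequence are separated by a tab, tokens by spaces; the file name matches the copy made above):

# Append new pronunciations to the copied lexicon file.
entries = [
    ("manu", "▁ma n u"),
    ("manu", "▁man n n ew"),
    ("manu", "▁man n ew"),
]

with open("decoding_lexicon.txt", "a") as f:
    for word, tokens in entries:
        # One entry per line: word, a tab, then the space-separated token sequence.
        f.write(word + "\t" + tokens + "\n")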

Finally, regenerate the model repository with the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.
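
The rebuild command then mirrors the earlier build example, with the lexicon flag in place of the vocabulary flag (a sketch; keep the other parameters from your original build):

riva-build speech_recognition \
    <rmir_filename>:<key> <riva_filename>:<key> \
    --name=citrinet-1024-english-asr-streaming \
    --decoding_language_model_binary=<lm_binary> \
    --decoding_lexicon=decoding_lexicon.txt \
    --language_code=en-US \
    <other_parameters>...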

How to generate the correct tokenized form

When modifying the lexicon file, ensure that:

  • The new lines follow the same whitespace pattern as the rest of the file: the word and its token sequence are separated by a tab, and tokens are separated by spaces.

  • The tokens are valid tokens as determined by the tokenizer model (packaged with the Riva acoustic model).

The latter ensures that you use only tokens that the acoustic model has been trained on. To do this, you'll need the tokenizer model and the sentencepiece Python package (pip install sentencepiece). You can get the tokenizer model for the deployed pipeline from the ctc-decoder directory of the model repository for your model; it will be named <hash>_tokenizer.model.

!pip install sentencepiece

You can then generate new lexicon entries, for example:

TOKEN="BRAF"
PRONUNCIATION="b raf"

import sentencepiece as spm
# Load the tokenizer model packaged with the deployed acoustic model.
s = spm.SentencePieceProcessor(model_file='tokenizer.model')
# Sample 5 candidate tokenizations of the pronunciation (subword regularization).
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
BRAF	▁b ▁ra f
BRAF	▁b ▁ r a f
BRAF	▁b ▁ra f
BRAF	▁b ▁ra f
BRAF	▁ b ▁ra f

Note: TOKEN represents the desired written form of the word, while PRONUNCIATION is what the word should sound like.

Other examples:

TOKEN="WhatsApp"
PRONUNCIATION="what's app"

import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='tokenizer.model')
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
WhatsApp	▁what ' s ▁app
WhatsApp	▁w h at ' s ▁app
WhatsApp	▁w h at ' s ▁app
WhatsApp	▁what ' s ▁a pp
WhatsApp	▁ w ha t ' s ▁app

TOKEN="Cya"
PRONUNCIATION="See ya"

import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='tokenizer.model')
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁ y a
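
To confirm that a candidate entry uses only tokens known to the tokenizer model (and therefore to the acoustic model), a small check along these lines can help (a sketch; the file name and candidate entry are placeholders):

import sentencepiece as spm

# Load the same tokenizer model as above (placeholder file name).
s = spm.SentencePieceProcessor(model_file='tokenizer.model')

# The set of all pieces the tokenizer (and thus the acoustic model) knows.
known_pieces = {s.id_to_piece(i) for i in range(s.get_piece_size())}

# A candidate lexicon entry: word, a tab, then the space-separated token sequence.
candidate = "BRAF\t▁b ▁ra f"
word, tokens = candidate.split("\t")

unknown = [t for t in tokens.split() if t not in known_pieces]
print("All tokens are valid." if not unknown else "Unknown tokens: " + ", ".join(unknown))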

Go deeper into Riva capabilities

Now that you have a basic introduction to the Riva ASR APIs, you can try:

Additional Riva tutorials

Check out more Riva ASR (and TTS) tutorials here to learn how to use some of the advanced features of Riva ASR, including customizing ASR for your specific needs.

Sample applications

Riva comes with various sample applications. They demonstrate how to use the APIs to build applications such as a chatbot, a domain-specific speech recognition or keyword (entity) recognition system, or how Riva scales out to handle massive numbers of requests at the same time. Refer to SpeechSquad for one such application.
Refer to the Sample Applications section in the Riva developer documentation for more information.

Riva Text-To-Speech (TTS)

Riva’s TTS offering comes with two OOTB voices that can be used in streaming or batch inference modes. They can be easily deployed using the Riva Quick Start scripts. Follow this link to understand Riva’s TTS capabilities. Explore how to use Riva TTS APIs with the OOTB voices with this Riva TTS tutorial.

Additional resources

For more information about each of the APIs and their functionalities, refer to the documentation.