How to Customize Riva ASR Vocabulary and Pronunciation with Lexicon Mapping#

This notebook walks you through customizing the Riva ASR vocabulary and lexicon in order to improve vocabulary coverage and the recognition of difficult words, such as acronyms.

Overview#

The Flashlight decoder, deployed by default in Riva, is a lexicon-based decoder and only emits words that are present in the provided lexicon file. This means that uncommon and new words, such as domain-specific terms, cannot be generated if they are not present in the lexicon file.

On the other hand, the greedy decoder (available as an option during the riva-build process with the flag --decoder_type=greedy) is not lexicon-based and can therefore produce virtually any word or character sequence.

Prerequisite#

This notebook assumes that you are familiar with manually deploying a Riva ASR pipeline using the Riva ServiceMaker tool (the riva-build and riva-deploy commands). See the Riva documentation.

Terminology#

  • Vocabulary file: The vocabulary file is a flat text file containing a list of vocabulary words, each on its own line. For example:

the
i
to
and
a
you
of
that
...

This file is used by the riva-build process to generate the lexicon file.

  • Lexicon file: The lexicon file is a flat text file that maps each vocabulary word to its tokenized form (e.g., sentencepiece tokens), separated by a tab. Below is an example:

with    ▁with
not     ▁not
this    ▁this
just    ▁just
my      ▁my
as      ▁as
don't   ▁don ' t

Note: At run time, the Riva decoder directly uses only the lexicon file, not the vocabulary file.

Riva ServiceMaker automatically tokenizes the words in the vocabulary file to generate the lexicon file. It uses the correct tokenizer model, which is packaged together with the acoustic model in the .riva file. By default, Riva generates one tokenized form for each word in the vocabulary file.
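
To illustrate what this tokenization step produces, here is a minimal sketch of generating one lexicon entry per vocabulary word with the sentencepiece package. This is not the actual ServiceMaker code; the file names tokenizer.model, my_vocab.txt, and my_lexicon.txt are placeholders for a locally copied tokenizer model, a vocabulary file, and the generated output.

# Minimal sketch (not ServiceMaker itself): derive one lexicon entry per word.
# "tokenizer.model", "my_vocab.txt", and "my_lexicon.txt" are placeholder names.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

with open("my_vocab.txt", encoding="utf-8") as vocab, \
     open("my_lexicon.txt", "w", encoding="utf-8") as lexicon:
    for line in vocab:
        word = line.strip()
        if not word:
            continue
        # One tokenized form per word: the word, a tab, then space-separated tokens.
        tokens = sp.encode(word, out_type=str)
        lexicon.write(word + "\t" + " ".join(tokens) + "\n")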

What can be customized?#

Both the vocabulary and the lexicon files can be customized.

  • Extending the vocabulary enriches the Riva default vocabulary, providing additional coverage for out-of-vocabulary words, terminologies, and abbreviations.

  • Customizing the lexicon file can further enrich the Riva knowledge base by providing one or more explicit pronunciations, in the form of tokenized sequences.

Extending the vocabulary#

Extending the vocabulary must be done at Riva build time.

When building a Riva ASR pipeline, pass the extended vocabulary file to the --decoding_vocab=<vocabulary_file> parameter of the build command. For example, the build command for the Citrinet model:

    riva-build speech_recognition \
        <rmir_filename>:<key> <riva_filename>:<key> \
        --name=citrinet-1024-english-asr-streaming \
        --decoding_language_model_binary=<lm_binary> \
        --decoding_vocab=<vocabulary_file> \
        --language_code=en-US \
        <other_parameters>...

Refer to the Riva documentation for the build commands for supported models.

How to modify the vocabulary file#

You can either provide your own vocabulary file, or extend Riva’s default vocabulary file.

  • BYO vocabulary file: provide a flat text file containing a list of vocabulary words, each on its own line. Note that this file must not contain only a small list of “difficult words”; it must contain all the words that you want the ASR pipeline to be able to generate, including all common words.

  • Modifying an existing one: This is the recommended approach. Out-of-the-box vocabulary files for Riva supported languages can be found either:

    • On NGC: for example, the English vocabulary file is named flashlight_decoder_vocab.txt.

    • Or in a local Riva deployment: The actual physical location of Riva assets depends on the value of the riva_model_loc variable in the config.sh file under the Riva quickstart folder. The vocabulary file is bundled with the Flashlight decoder.

      • By default, riva_model_loc is set to riva-model-repo, which is a docker volume. You can inspect this docker volume and copy the vocabulary file from within the docker volume to the host file system with commands such as:

      # Inspect the Riva model docker volume
      docker inspect riva-model-repo
      
      # Inspect the content of the Riva model docker volume
      docker run --rm -v riva-model-repo:/riva-model-repo alpine ls /riva-model-repo
      
      # Copy the vocabulary file from the docker volume to the current directory
      docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp  /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt /dest
      
      • If you modify riva_model_loc to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming <RIVA_REPO_DIR> is the directory where Riva assets are stored, then the vocabulary file can similarly be found under, for example, <RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/dict_vocab.txt.

You can make a copy, then extend this default vocabulary file with the words of interest.
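
As a rough sketch, this extension step can be scripted as shown below; dict_vocab.txt, additional_words.txt, and extended_vocab.txt are placeholder file names for the copied default vocabulary, your new words (one per line), and the resulting extended file.

# Minimal sketch: extend a copy of the default vocabulary with new words,
# skipping duplicates. All three file names below are placeholders.
with open("dict_vocab.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f if line.strip()]
seen = set(vocab)

with open("additional_words.txt", encoding="utf-8") as f:
    for word in (line.strip() for line in f):
        if word and word not in seen:
            vocab.append(word)
            seen.add(word)

# Write the extended vocabulary, one word per line.
with open("extended_vocab.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(vocab) + "\n")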

Once modified, you’ll have to redeploy the Riva ASR pipeline with riva-build while passing the flag --decoding_vocab=<modified_vocabulary_file>.

Customizing pronunciation with lexicon mapping#

The lexicon file that is used by the Flashlight decoder can be found in the Riva assets directory, as specified by the value of the riva_model_loc variable in the config.sh file under the Riva quickstart folder (see above).

  • If riva_model_loc points to a docker volume (by default), you can find and copy the lexicon file with:

        # Copy the lexicon file from the docker volume to the current directory
        docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp  /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt /dest
  • If you modify riva_model_loc to an absolute path pointing to a folder, then the specified folder in the local file system will be used to store Riva assets instead. Assuming <RIVA_REPO_DIR> is the directory where Riva assets are stored, the lexicon file can similarly be found under, for example, <RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt.

How to modify the lexicon file#

First, locate and make a copy of the lexicon file. For example:

cp <RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/lexicon.txt modified_lexicon.txt

Next, modify it to add sentencepiece tokenizations for the words of interest. For example, you could add:

manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew

which are three different pronunciations/tokenizations of the word manu. If the acoustic model predicts those tokens, they will be decoded as manu.
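
If you have many such entries, you can append them with a short script instead of editing the file by hand. Below is a minimal sketch, assuming the copy is named modified_lexicon.txt and reusing the illustrative manu entries above.

# Minimal sketch: append custom pronunciations to the copied lexicon file.
# Each line is the word, a tab, then the space-separated sentencepiece tokens.
# The "manu" entries below mirror the illustrative example above.
custom_entries = {
    "manu": [
        "▁ma n u",
        "▁man n n ew",
        "▁man n ew",
    ],
}

with open("modified_lexicon.txt", "a", encoding="utf-8") as lexicon:
    for word, tokenizations in custom_entries.items():
        for tokens in tokenizations:
            lexicon.write(f"{word}\t{tokens}\n")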

Finally, regenerate the model repository using the new decoding lexicon by passing --decoding_lexicon=modified_lexicon.txt to riva-build instead of --decoding_vocab=<vocabulary_file>.

How to generate the correct tokenized form#

When modifying the lexicon file, ensure that:

  • The new lines follow the same layout as the rest of the file: the word, a tab, then the space-separated tokens.

  • The tokens are valid tokens as determined by the tokenizer model (packaged with the Riva acoustic model).

The latter ensures that you use only tokens that the acoustic model has been trained on. To check this, you’ll need the tokenizer model and the sentencepiece Python package (pip install sentencepiece). You can get the tokenizer model for the deployed pipeline from the ctc-decoder-... directory of the model repository for your model. It will be named <hash>_tokenizer.model. For example:

<RIVA_REPO_DIR>/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model

When using a docker volume to store Riva assets (by default), you can copy the tokenizer model to the local directory with a command such as:

        # Copy the tokenizer model file from the docker volume to the current directory
        docker run --rm -v $PWD:/dest -v riva-model-repo:/riva-model-repo alpine cp /riva-model-repo/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-offline/1/498056ba420d4bb3831ad557fba06032_tokenizer.model /dest

!pip install sentencepiece

You can then generate new lexicon entries, for example:

TOKEN="BRAF"
PRONUNCIATION="b raf"

import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='tokenizer.model')
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
BRAF	▁b ▁ra f
BRAF	▁b ▁ r a f
BRAF	▁b ▁ra f
BRAF	▁b ▁ra f
BRAF	▁ b ▁ra f

Note: TOKEN represents the desired written form of the word, while PRONUNCIATION is what the word should sound like.

Other examples:

TOKEN="WhatsApp"
PRONUNCIATION="what's app"

import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='tokenizer.model')
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
WhatsApp	▁what ' s ▁app
WhatsApp	▁w h at ' s ▁app
WhatsApp	▁w h at ' s ▁app
WhatsApp	▁what ' s ▁a pp
WhatsApp	▁ w ha t ' s ▁app

TOKEN="Cya"
PRONUNCIATION="See ya"

import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='tokenizer.model')
for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁y a
Cya	▁ S e e ▁ y a
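
Before rebuilding, it can be useful to double-check that every token in your edited lexicon is known to the tokenizer model, since only tokens that the acoustic model has been trained on can be matched at decode time. Below is a minimal sketch of such a check; tokenizer.model and modified_lexicon.txt are placeholder names for the locally copied tokenizer model and the edited lexicon.

# Minimal sketch: flag lexicon lines whose tokens the tokenizer model does not
# know (unknown pieces map to the unknown-token id).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
unk_id = sp.unk_id()

with open("modified_lexicon.txt", encoding="utf-8") as lexicon:
    for line_number, line in enumerate(lexicon, start=1):
        entry = line.rstrip("\n")
        if not entry:
            continue
        word, _, tokens = entry.partition("\t")
        unknown = [t for t in tokens.split() if sp.piece_to_id(t) == unk_id]
        if unknown:
            print(f"line {line_number}: '{word}' contains unknown tokens: {unknown}")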

Go deeper into Riva capabilities#

Additional Riva Tutorials#

Check out more Riva tutorials here to understand how to use some of the advanced features of Riva ASR, including customizing ASR for your specific needs.

Sample Applications#

Riva comes with various sample applications that demonstrate how to use the APIs to build different applications. Refer to the Riva Sample Apps for more information.

Additional Resources#

For more information about each of the Riva APIs and their functionalities, refer to the documentation.