Custom Recognition

The Riva Quick Start scripts deploy preconfigured ASR pipelines that are accurate for most applications. The Pipeline Configuration section lists the riva-build commands that the Quick Start scripts use to configure these pipelines. You can also customize the Riva ASR pipeline to meet your specific needs, and the following sections describe the available options. To improve speech recognition accuracy, we recommend trying the customization strategies in the following order:

Word boosting. This strategy enables you to improve recognition of specific words at request time. Refer to the Word Boosting section for more information.
Decoder lexicon. Riva uses a lexicon-based decoder which only emits words that are present in the decoder lexicon. It is possible to modify the lexicon used by the decoder to improve recognition. More information can be found here.
Language model. The Riva ASR pipeline supports the use of n-gram language models. A language model that is tailored to your use case can greatly improve the accuracy of transcripts. Refer to the Training a Language Model and Language Models sections for more information about how to train and use a new language model.
Acoustic model training or fine-tuning. If the strategies above do not improve recognition, training or fine-tuning the acoustic model might be required. Refer to the Training or Fine-Tuning an Acoustic Model section for more information.

Training or Fine-Tuning an Acoustic Model

Many use cases require training new models or fine-tuning existing ones with new data. In these cases, there are a few best practices to follow. Many of these best practices also apply to inputs at inference time.

Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality.
Augment training data. Adding background noise to audio training data can initially decrease accuracy, but it increases robustness.
Limit vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.
Use a minimum sampling rate of 16 kHz if possible, but do not resample audio that was recorded at a different rate.
If using TAO to fine-tune ASR models, refer to the TAO Toolkit documentation on training acoustic models. Try running the following Jupyter notebooks: Speech to Text Notebook and Speech to Text Citrinet Notebook.
If using NeMo to fine-tune ASR models, refer to this tutorial. We recommend fine-tuning ASR models only when sufficient data is available, on the order of several hundred hours of speech. If such data is not available, it may be more useful to adapt the language model (LM) on an in-domain text corpus than to train the ASR model.
There is no formal guarantee that the ASR model will or won’t be streamable after training. We see that with more training (thousands of hours of speech, 100-200 epochs), models generally obtain better offline scores. Online scores do not degrade as severely (but still degrade to some extent due to the differences between online and offline evaluation).
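
The sampling-rate recommendation above can be checked before a file is added to a training set or sent for inference. The following is a minimal sketch that uses only the Python standard library. The function name check_wav and the constant MIN_SAMPLE_RATE_HZ are illustrative and are not part of Riva. The sketch only reports the rate and does not resample.

```python
import wave

# Minimum sampling rate recommended in the best practices above.
MIN_SAMPLE_RATE_HZ = 16000


def check_wav(path):
    """Return (sample_rate_hz, meets_minimum) for a PCM WAV file.

    Hypothetical helper for illustration; it reads the WAV header only
    and never modifies or resamples the audio.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
    return rate, rate >= MIN_SAMPLE_RATE_HZ


# Example usage (the file name is a placeholder):
# rate, ok = check_wav("audiofile.wav")
# if not ok:
#     print(f"{rate} Hz is below the recommended {MIN_SAMPLE_RATE_HZ} Hz")
```

Files that fall below the minimum are best flagged for review rather than upsampled, in line with the recommendation not to resample.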

Inverse Text Normalization

Riva implements inverse text normalization (ITN) for ASR requests. It uses weighted finite-state transducer (WFST) based models to convert spoken-domain output from an ASR model into written-domain text, which improves the readability of the ASR system's output.

Details on the model architecture can be found in the paper NeMo Inverse Text Normalization: From Development To Production.
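
To make the spoken-domain to written-domain mapping concrete, the following toy sketch rewrites spelled-out cardinal numbers below 1000 as digits. It is a hand-written, rule-based illustration only. It does not use WFSTs and is not how Riva implements ITN, and all names in it (UNITS, TENS, toy_itn) are hypothetical. It assumes lowercase input and ignores context, so a word such as "one" used as a pronoun is also converted.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}


def _can_extend(value, word):
    """True if `word` can continue a number phrase whose running value is `value`."""
    if word == "hundred":
        return 1 <= value <= 9
    after_hundred = value >= 100 and value % 100 == 0
    if word in TENS:
        return after_hundred
    if word in UNITS:
        after_tens = value % 10 == 0 and value % 100 >= 20
        unit = UNITS[word]
        return unit >= 1 and (after_hundred or (after_tens and unit <= 9))
    return False


def toy_itn(text):
    """Replace spelled-out cardinals (0-999) in lowercase text with digits."""
    out, value = [], None
    for word in text.split():
        if value is not None and _can_extend(value, word):
            value = value * 100 if word == "hundred" else value + UNITS.get(word, TENS.get(word))
            continue
        if value is not None:          # the current number phrase has ended
            out.append(str(value))
            value = None
        if word in UNITS or word in TENS:
            value = UNITS.get(word, TENS.get(word))  # start a new number phrase
        else:
            out.append(word)
    if value is not None:
        out.append(str(value))
    return " ".join(out)


if __name__ == "__main__":
    print(toy_itn("i have two hundred fifty six apples"))  # i have 256 apples
    print(toy_itn("call one two three"))                   # call 1 2 3
```

A production ITN system also covers dates, currencies, times, and ordinals and resolves ambiguous cases from context, which is what the WFST grammars described in the paper above are designed for.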

Word Boosting

Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time, by giving them a higher score when decoding the output of the acoustic model.

Examples demonstrating how to use word boosting can be found in the /work/examples/transcribe_file_offline.py and /work/examples/transcribe_file.py Python scripts in the Riva client image. The following sample command shows how to run these scripts (and the outputs they generate) from within the Riva client container:

/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav
Final transcript: I had a meeting today with Muhammad Oscar and Katherine Rutherford about the future of Riva at NVIDIA.
/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav --boosted_lm_words "asghar"
Final transcript: I had a meeting today with Muhammad Asghar and Katherine Rutherford about the future of Riva at NVIDIA.

These scripts show how to add the boosted words to RecognitionConfig, with SpeechContext (look for the "# Append boosted words/score" comment). For more information about SpeechContext, refer to the riva/proto/riva_asr.proto description here.

We recommend using boosting score values between 20 and 100. A higher score increases the likelihood that the boosted words appear in the transcript if the words occurred in the audio. However, it also increases the likelihood that the boosted words appear in the transcript even though they did not occur in the audio. Experiment with the boosting score until you get accurate transcription results.
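
The trade-off described above can be illustrated with a toy rescoring sketch. It is not Riva's decoder, which applies the boost while decoding the acoustic model output rather than to a finished list of transcripts. The function name rescore and all hypothesis scores below are invented for illustration.

```python
def rescore(hypotheses, boosted_words, boost):
    """Pick the best transcript after adding `boost` per boosted-word occurrence.

    hypotheses: list of (transcript, base_score) pairs, higher score is better.
    This is a simplified stand-in for what a decoder does internally.
    """
    boosted = {w.lower() for w in boosted_words}

    def total(item):
        text, base_score = item
        hits = sum(1 for w in text.lower().split() if w in boosted)
        return base_score + boost * hits

    return max(hypotheses, key=total)[0]


if __name__ == "__main__":
    # Case 1: the speaker said "asghar", but the unboosted model prefers "oscar".
    said_asghar = [("meeting with muhammad oscar", -4.0),
                   ("meeting with muhammad asghar", -9.0)]
    print(rescore(said_asghar, ["asghar"], 0.0))    # unboosted: "oscar" wins
    print(rescore(said_asghar, ["asghar"], 20.0))   # boosted: "asghar" wins

    # Case 2: the speaker really said "oscar"; "asghar" has a very low base score.
    said_oscar = [("meeting with oscar", -2.0),
                  ("meeting with asghar", -30.0)]
    print(rescore(said_oscar, ["asghar"], 20.0))    # still "oscar"
    print(rescore(said_oscar, ["asghar"], 100.0))   # spurious "asghar"
```

In the second case a moderate boost leaves the correct transcript in place, while a very large boost inserts a word that was never spoken, which is why the score should be tuned experimentally.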

The following word boosting code snippets are included in these example scripts:

# Creating GRPC channel and RecognitionConfig instance
channel = grpc.insecure_channel(args.server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
config = rasr.RecognitionConfig(
  encoding=ra.AudioEncoding.LINEAR_PCM,
  sample_rate_hertz=wf.getframerate(),
  language_code=args.language_code,
  max_alternatives=1,
  enable_automatic_punctuation=True,
)

# Word Boosting
boosted_lm_words = ["first", "second", "third"]
boosted_lm_score = 10.0
speech_context = rasr.SpeechContext()
speech_context.phrases.extend(boosted_lm_words)
speech_context.boost = boosted_lm_score
config.speech_contexts.append(speech_context)

# Creating StreamingRecognitionConfig instance with config
streaming_config = rasr.StreamingRecognitionConfig(config=config, interim_results=True)

You can also have different boost values for different words. For example, here first is boosted by 10 and second is boosted by 20:

speech_context1 = rasr.SpeechContext()
speech_context1.phrases.append("first")
speech_context1.boost = 10.
config.speech_contexts.append(speech_context1)

speech_context2 = rasr.SpeechContext()
speech_context2.phrases.append("second")
speech_context2.boost = 20.
config.speech_contexts.append(speech_context2)
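
When many words need different scores, writing one block per word becomes repetitive. The following sketch groups words that share a score so that each distinct score needs only one SpeechContext. The helper group_by_boost is hypothetical and is not part of the Riva client. The commented loop at the end reuses the rasr.SpeechContext calls shown above and assumes that rasr and config are already defined as in the example scripts.

```python
from collections import defaultdict


def group_by_boost(word_boosts):
    """Group a {phrase: boost} mapping into a sorted list of (boost, [phrases])."""
    groups = defaultdict(list)
    for phrase, boost in word_boosts.items():
        groups[float(boost)].append(phrase)
    # Boost values are unique dictionary keys, so sorting orders by boost only.
    return sorted(groups.items())


if __name__ == "__main__":
    grouped = group_by_boost({"first": 10, "second": 20, "third": 10})
    for boost, phrases in grouped:
        print(boost, phrases)

    # With the Riva client modules imported as in the example scripts,
    # each group maps to one SpeechContext:
    #
    # for boost, phrases in grouped:
    #     speech_context = rasr.SpeechContext()
    #     speech_context.phrases.extend(phrases)
    #     speech_context.boost = boost
    #     config.speech_contexts.append(speech_context)
```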

Note:

There is no limit to the number of words that can be boosted. You should see minimal impact on latency for all requests, even for tens of boosted words. The exception is the first request, where some additional latency is expected.
By default, no words are boosted on the server side. Only words passed by the client are boosted.
Out-of-vocabulary word boosting is supported.
Boosting phrases or combinations of words is not yet fully supported (but does work). We will revisit finalizing this support in an upcoming release.

Pipeline Configuration

In the simplest use case, you can deploy an ASR pipeline to be used with the StreamingRecognize API call (refer to riva/proto/riva_asr.proto) without any language model as follows:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --acoustic_model_name=<acoustic_model_name> \
    --wfst_tokenizer_model=<wfst_tokenizer_model> \
    --wfst_verbalizer_model=<wfst_verbalizer_model> \
    --decoder_type=greedy 

where:

<rmir_filename> is the Riva rmir file that is generated
<riva_filename> is the name of the riva file to use as input
<encryption_key> is the encryption key used during the export of the .riva file
<pipeline_name> and <acoustic_model_name> are optional user-defined names for the components in the model repository.

Note

<acoustic_model_name> is global and can conflict across model pipelines. Override this only in cases when you know what other models will be deployed and there will not be any incompatibilities in model weights or input shapes.
<wfst_tokenizer_model> is the name of the WFST tokenizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
<wfst_verbalizer_model> is the name of the WFST verbalizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
decoder_type is the type of decoder to use. Valid values are flashlight, os2s, and greedy. We recommend using flashlight. Refer to Decoder Hyper-Parameters for more details.

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. Since no language model is specified, the Riva greedy decoder is used to predict the transcript based on the output of the acoustic model. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and the Riva filename. Otherwise, this is unnecessary.
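
When the command is generated from a script, the optional encryption key and the optional names are easy to get wrong. The following sketch assembles the argument list using only flags that appear in the command above. The function names and the example file names are hypothetical, and the WFST flags are omitted for brevity. The sketch prints the command and does not run riva-build.

```python
import shlex

# Decoder types listed as valid in this section.
VALID_DECODERS = {"flashlight", "os2s", "greedy"}


def with_key(path, key=None):
    """Append :<encryption_key> only when the .riva archive is encrypted."""
    return f"{path}:{key}" if key else path


def riva_build_args(rmir_path, riva_path, key=None, name=None,
                    acoustic_model_name=None, decoder_type="greedy"):
    """Return the riva-build speech_recognition command as an argument list."""
    if decoder_type not in VALID_DECODERS:
        raise ValueError(f"unsupported decoder_type: {decoder_type}")
    args = ["riva-build", "speech_recognition",
            with_key(rmir_path, key), with_key(riva_path, key)]
    if name:
        args.append(f"--name={name}")
    if acoustic_model_name:
        args.append(f"--acoustic_model_name={acoustic_model_name}")
    args.append(f"--decoder_type={decoder_type}")
    return args


if __name__ == "__main__":
    # Placeholder file names and key, for illustration only.
    args = riva_build_args("/servicemaker-dev/asr.rmir",
                           "/servicemaker-dev/asr.riva",
                           key="my_key", name="my-asr-pipeline")
    print(shlex.join(args))  # pass `args` to subprocess.run to execute it
```

Building a list instead of a single string avoids shell quoting problems if a path or key contains special characters.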

For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the riva-build-optional-parameters section and set the various max_batch_size and max_execution_batch_size parameters to 1 while executing the riva-build command.

The following summary lists the riva-build commands used to generate the RMIR files from the Quick Start scripts for different models, modes, and their limitations:

Limitations: None

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-en-US-asr-streaming \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

Limitations: None

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-en-US-asr-streaming-throughput \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US
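
The two streaming configurations above differ only in their chunk and padding sizes. The following sketch compares them. It assumes that chunk_size and the padding sizes are expressed in seconds and that the audio context for each inference is the left padding plus the chunk plus the right padding. Both assumptions should be confirmed against the riva-build parameter reference, and the helper name summarize is hypothetical.

```python
MS_PER_TIMESTEP = 80  # --ms_per_timestep in both commands above

# Values copied from the two riva-build commands above.
CONFIGS = {
    "citrinet-1024-en-US-asr-streaming": {
        "chunk_size": 0.16, "left_padding_size": 1.92, "right_padding_size": 1.92},
    "citrinet-1024-en-US-asr-streaming-throughput": {
        "chunk_size": 0.8, "left_padding_size": 1.6, "right_padding_size": 1.6},
}


def summarize(cfg):
    """Derive per-chunk figures in milliseconds (assumes inputs are in seconds)."""
    chunk_ms = round(cfg["chunk_size"] * 1000)
    window_ms = round((cfg["left_padding_size"] + cfg["chunk_size"]
                       + cfg["right_padding_size"]) * 1000)
    return {
        "chunk_ms": chunk_ms,
        "window_ms": window_ms,
        "timesteps_per_chunk": chunk_ms // MS_PER_TIMESTEP,
    }


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, summarize(cfg))
```

Under these assumptions both configurations cover the same 4-second window. The low-latency configuration emits results every 160 ms (2 timesteps), whereas the throughput configuration processes 800 ms (10 timesteps) per chunk and therefore needs fewer inference calls for the same audio.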