Custom Recognition¶
The Riva Quick Start scripts let you deploy preconfigured ASR pipelines that are accurate for most applications. The Pipeline Configuration section provides the riva-build commands used to configure the ASR pipelines in the Quick Start scripts. You can also customize the Riva ASR pipeline to meet your specific needs. The following sections describe the ways in which the ASR pipeline can be customized. To improve speech recognition accuracy, we recommend applying the customization strategies in the following order:
Word boosting. This strategy enables you to easily improve recognition of specific words at request time. More information can be found here.
Decoder lexicon. Riva uses a lexicon-based decoder which only emits words that are present in the decoder lexicon. It is possible to modify the lexicon used by the decoder to improve recognition. More information can be found here.
Language model. The Riva ASR pipeline supports the use of n-gram language models. Using a language model that is tailored to your use case can greatly help in improving the accuracy of transcripts. Refer to the Training a Language Model and Language Models for more information about how to train and use a new language model.
Acoustic model training or fine-tuning. If the strategies above do not help to improve recognition, training or fine-tuning the acoustic model might be required. More information can be found here.
Training or Fine-Tuning an Acoustic Model¶
Many use cases require training new models or fine-tuning existing ones with new data. In these cases, there are a few best practices to follow. Many of these best practices also apply to inputs at inference time.
Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality.
Augment training data. Adding background noise to audio training data can initially decrease accuracy, but increase robustness.
Limit vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.
Use a minimum sampling rate of 16 kHz if possible, but do not resample.
If using TAO to fine-tune ASR models, refer to the TAO Toolkit documentation on training acoustic models here. Try running the following Jupyter notebooks: Speech to Text Notebook and Speech to Text Citrinet Notebook.
If using NeMo to fine-tune ASR models, refer to this tutorial. We recommend fine-tuning ASR models only with sufficient data, on the order of several hundred hours of speech. If such data is not available, it may be more useful to adapt the LM on an in-domain text corpus than to train the ASR model.
There is no formal guarantee that the ASR model will or won’t be streamable after training. We see that with more training (thousands of hours of speech, 100-200 epochs), models generally obtain better offline scores. Online scores do not degrade as severely (but still degrade to some extent due to the differences between online and offline evaluation).
Inverse Text Normalization¶
Riva implements inverse text normalization (ITN) for ASR requests. It uses weighted finite-state transducer (WFST) based models to convert spoken-domain output from an ASR model into written-domain text, which improves the readability of the ASR system's output.
Details on the model architecture can be found in the paper NeMo Inverse Text Normalization: From Development To Production.
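To make the spoken-domain to written-domain conversion concrete, here is a toy illustration that rewrites number words from zero to ninety-nine as digits. It is a plain dictionary lookup written for this example, not a WFST and not Riva's implementation; the function name toy_itn is hypothetical.

```python
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
UNIT_VALUES = {word: i for i, word in enumerate(UNITS)}
TENS_VALUES = {word: 20 + 10 * i for i, word in enumerate(TENS)}


def toy_itn(text):
    """Rewrite spoken number words (0-99) as digits; leave every other token unchanged."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in TENS_VALUES:
            value = TENS_VALUES[token]
            # Merge a following unit word (one..nine), as in "twenty three" -> 23.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt in UNIT_VALUES and 1 <= UNIT_VALUES[nxt] <= 9:
                value += UNIT_VALUES[nxt]
                i += 1
            out.append(str(value))
        elif token in UNIT_VALUES:
            out.append(str(UNIT_VALUES[token]))
        else:
            out.append(token)
        i += 1
    return " ".join(out)
```

A lookup like this converts every number word regardless of context, which is one reason production systems use grammar-based models instead.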
Word Boosting¶
Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time, by giving them a higher score when decoding the output of the acoustic model.
Examples demonstrating how to use word boosting can be found in the /work/examples/transcribe_file_offline.py and /work/examples/transcribe_file.py Python scripts in the Riva client image.
The following sample command shows how to run these scripts (and the outputs they generate) from within the Riva client container:
/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav
Final transcript: I had a meeting today with Muhammad Oscar and Katherine Rutherford about the future of Riva at NVIDIA.
/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav --boosted_lm_words "asghar"
Final transcript: I had a meeting today with Muhammad Asghar and Katherine Rutherford about the future of Riva at NVIDIA.
These scripts show how to add the boosted words to RecognitionConfig with SpeechContext (look for the "# Append boosted words/score" comment).
For more information about SpeechContext, refer to the riva/proto/riva_asr.proto description here.
We recommend using boosting score values between 20 and 100. A higher score increases the likelihood that the boosted words appear in the transcript when they occur in the audio. However, it can also increase the likelihood that the boosted words appear in the transcript even though they did not occur in the audio. Experiment with the boosting score until you get accurate transcription results.
The following word boosting code snippets are included in these example scripts:
# Creating GRPC channel and RecognitionConfig instance
channel = grpc.insecure_channel(args.server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
config = rasr.RecognitionConfig(
    encoding=ra.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=wf.getframerate(),
    language_code=args.language_code,
    max_alternatives=1,
    enable_automatic_punctuation=True,
)
# Word Boosting
boosted_lm_words = ["first", "second", "third"]
boosted_lm_score = 10.0
speech_context = rasr.SpeechContext()
speech_context.phrases.extend(boosted_lm_words)
speech_context.boost = boosted_lm_score
config.speech_contexts.append(speech_context)
# Creating StreamingRecognitionConfig instance with config
streaming_config = rasr.StreamingRecognitionConfig(config=config, interim_results=True)
You can also have different boost values for different words. For example, here first is boosted by 10 and second is boosted by 20:
speech_context1 = rasr.SpeechContext()
speech_context1.phrases.append("first")
speech_context1.boost = 10.
config.speech_contexts.append(speech_context1)
speech_context2 = rasr.SpeechContext()
speech_context2.phrases.append("second")
speech_context2.boost = 20.
config.speech_contexts.append(speech_context2)
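When many words need different scores, it can help to group words that share a score so that each group maps to one SpeechContext. The following is a minimal sketch in plain Python; the helper name group_by_boost is an assumption for illustration, and the Riva calls are shown only as comments because they mirror the snippet above.

```python
from collections import defaultdict


def group_by_boost(word_scores):
    """Group words by boost score; returns [(score, [words...]), ...] sorted by score."""
    groups = defaultdict(list)
    for word, score in word_scores.items():
        groups[float(score)].append(word)
    return [(score, sorted(words)) for score, words in sorted(groups.items())]


groups = group_by_boost({"first": 10, "second": 20, "third": 10})

# Each (score, phrases) pair corresponds to one SpeechContext, for example:
#   speech_context = rasr.SpeechContext()
#   speech_context.phrases.extend(phrases)
#   speech_context.boost = score
#   config.speech_contexts.append(speech_context)
```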
Note:
There is no limit to the number of words that can be boosted. You should see minimal impact on latency for all requests, even for tens of boosted words, except for the first request, which is expected.
By default, no words are boosted on the server side. Only words passed by the client are boosted.
Out-of-vocabulary word boosting is supported.
Boosting phrases or combinations of words is not yet fully supported, although it does work. We will revisit finalizing this support in an upcoming release.
Pipeline Configuration¶
In the simplest use case, you can deploy an ASR pipeline to be used with the StreamingRecognize API call (refer to riva/proto/riva_asr.proto) without any language model as follows:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--acoustic_model_name=<acoustic_model_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=greedy
where:
<rmir_filename> is the Riva rmir file that is generated
<riva_filename> is the name of the riva file to use as input
<encryption_key> is the encryption key used during the export of the .riva file
<pipeline_name> and <acoustic_model_name> are optional user-defined names for the components in the model repository.
Note: <acoustic_model_name> is global and can conflict across model pipelines. Override this only in cases when you know what other models will be deployed and there will not be any incompatibilities in model weights or input shapes.
<wfst_tokenizer_model> is the name of the WFST tokenizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
<wfst_verbalizer_model> is the name of the WFST verbalizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
decoder_type is the type of decoder to use. Valid values are flashlight, os2s, and greedy. We recommend using flashlight. Refer to Decoder Hyper-Parameters for more details.
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. Since no language model is specified, the Riva greedy decoder is used to predict the transcript based on the output of the acoustic model. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and Riva filename. Otherwise, this is unnecessary.
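If you generate riva-build commands from a script, the optional :<encryption_key> suffix is easy to handle in one place. The following is a minimal sketch, assuming a hypothetical helper named riva_build_command and hypothetical file names; it only assembles the argument list using flags shown in this section and does not run riva-build.

```python
import shlex


def riva_build_command(rmir_path, riva_path, options, encryption_key=None):
    """Assemble a riva-build speech_recognition command as a list of arguments."""

    def with_key(path):
        # Append ":<encryption_key>" only when the archives are encrypted.
        return f"{path}:{encryption_key}" if encryption_key else path

    command = ["riva-build", "speech_recognition", with_key(rmir_path), with_key(riva_path)]
    command += [f"--{flag}={value}" for flag, value in options.items()]
    return command


command = riva_build_command(
    "/servicemaker-dev/asr.rmir",  # hypothetical output file name
    "/servicemaker-dev/asr.riva",  # hypothetical input file name
    {"name": "my-asr-pipeline", "decoder_type": "greedy"},
)
print(shlex.join(command))
```

Passing the options as a dictionary keeps dotted flag names such as flashlight_decoder.lm_weight usable as keys.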
For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the riva-build-optional-parameters section and set the various max_batch_size and max_execution_batch_size parameters to 1 while executing the riva-build command.
The following summary lists the riva-build commands used to generate the RMIR files from the Quick Start scripts for different models, modes, and their limitations:
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-en-US-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-en-US-asr-streaming-throughput \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Limitations: Maximum audio duration of 15 minutes
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=citrinet-1024-en-US-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-es-US-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=es-US
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-es-US-asr-streaming-throughput \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=es-US
Limitations: Maximum audio duration of 15 minutes
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=citrinet-1024-es-US-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=es-US
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-de-DE-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=de-DE
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-de-DE-asr-streaming-throughput \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=de-DE
Limitations: Maximum audio duration of 15 minutes
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=citrinet-1024-de-DE-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=de-DE
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-ru-RU-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=0.75 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.beam_size=64 \
--flashlight_decoder.beam_size_token=64 \
--language_code=ru-RU
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-ru-RU-asr-streaming-throughput \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=0.75 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.beam_size=64 \
--flashlight_decoder.beam_size_token=64 \
--language_code=ru-RU
Limitations: Maximum audio duration of 15 minutes
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=citrinet-1024-ru-RU-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=0.75 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.beam_size=64 \
--flashlight_decoder.beam_size_token=64 \
--language_code=ru-RU
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-zh-CN-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=os2s \
--os2s_decoder.beam_search_width=32 \
--os2s_decoder.language_model_alpha=0.5 \
--os2s_decoder.language_model_beta=1.0 \
--decoding_language_model_binary=<lm_binary> \
--language_code=zh-CN
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-zh-CN-asr-streaming-throughput \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=os2s \
--os2s_decoder.beam_search_width=32 \
--os2s_decoder.language_model_alpha=0.5 \
--os2s_decoder.language_model_beta=1.0 \
--decoding_language_model_binary=<lm_binary> \
--language_code=zh-CN
Limitations: Maximum audio duration of 15 minutes
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=citrinet-1024-zh-CN-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=os2s \
--os2s_decoder.beam_search_width=32 \
--os2s_decoder.language_model_alpha=0.5 \
--os2s_decoder.language_model_beta=1.0 \
--decoding_language_model_binary=<lm_binary> \
--language_code=zh-CN
Limitations: ONNX Runtime only
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=conformer-en-US-asr-streaming \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Limitations: ONNX Runtime only
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=conformer-en-US-asr-streaming-throughput \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Limitations: ONNX Runtime only
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=conformer-en-US-asr-offline \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Limitations: ONNX Runtime only
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=conformer-de-DE-asr-streaming \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=de-DE
Limitations: ONNX Runtime only
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=conformer-de-DE-asr-streaming-throughput \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=de-DE
Limitations: ONNX Runtime only
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=conformer-de-DE-asr-offline \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=de-DE
Limitations: None
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-256-en-US-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
You can customize your deployment, for example, by specifying your own language model and your own decoder lexicon files.
The .riva model, language model, lexicon vocabulary, and WFST files used to generate the RMIRs in the Quick Start scripts can be found at the following NGC locations:
Language |
|
Language Model and Lexicon Vocabulary |
WFST Tokenizer and Verbalizer Models |
---|---|---|---|
English |
Riva ASR English LM
(files |
||
Spanish |
|||
German |
|||
Russian |
N/A |
||
Mandarin |
N/A |
For details about the parameters passed to riva-build to customize the ASR pipeline, run:
riva-build <pipeline> -h
Note
For information about deploying the now deprecated Jasper or Quartznet models in Riva, refer to the Riva ASR Pipeline Configuration section.
Streaming/Offline Recognition¶
The Riva ASR pipeline can be configured for both streaming and offline recognition use cases. When using the StreamingRecognize API call (refer to riva/proto/riva_asr.proto), we recommend the following riva-build parameters for low-latency streaming recognition:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=greedy \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--greedy_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
For high throughput streaming recognition with the StreamingRecognize API call, chunk_size and padding_size can be set as follows:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=greedy \
--chunk_size=0.8 \
--padding_size=1.6 \
--ms_per_timestep=80 \
--greedy_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
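One way to compare the low-latency and high-throughput settings is to work out how much audio each inference call covers. The sketch below assumes that each call processes the chunk plus padding_size seconds of context on both sides; that reading of the parameters is an assumption made for this illustration, and the function name window_summary is hypothetical.

```python
def window_summary(chunk_size_s, padding_size_s, ms_per_timestep):
    """Summarize the audio window per inference call (assumes padding on both sides)."""
    chunk_ms = round(chunk_size_s * 1000)
    padding_ms = round(padding_size_s * 1000)
    window_ms = chunk_ms + 2 * padding_ms
    return {
        "window_ms": window_ms,
        # Number of acoustic model output timesteps covering the whole window.
        "timesteps_per_window": window_ms // ms_per_timestep,
        # Share of the window that is new audio rather than repeated context.
        "new_audio_fraction": chunk_ms / window_ms,
    }


low_latency = window_summary(0.16, 1.92, 80)
high_throughput = window_summary(0.8, 1.6, 80)
```

Under this assumption both configurations use a 4-second window, but the low-latency setting advances by only 160 ms per call, so it spends more compute per second of audio in exchange for faster results.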
Finally, to configure the ASR pipeline for offline recognition with the Recognize API call (refer to riva/proto/riva_asr.proto), we recommend the following settings:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--offline \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=greedy \
--chunk_size=900. \
--padding_size=0. \
--ms_per_timestep=80 \
--greedy_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
Note
When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.
In offline mode, the riva-build command allows transcribing audio files up to 900 seconds (15 minutes) long. It is possible to increase the chunk_size parameter to larger values; however, this increases the GPU memory usage of the offline ASR pipeline when deployed. Ensure you tune this value based on the number of models deployed, the memory capacity of the GPU used, and the maximum duration of the audio to transcribe in offline mode.
To transcribe audio files that are longer than the maximum allowable duration in offline mode, revert to using the high throughput streaming recognition and the StreamingRecognize API. This offline mode limitation will be addressed in a future version of Riva.
Language Models¶
Riva ASR supports decoding with an n-gram language model. The n-gram language model can be provided in a few different ways.
A .riva file exported from TAO Toolkit.
A .arpa format file.
A KenLM binary format file.
For more information on building language models, refer to the Training Language Models section.
TAO Toolkit n-gram Language Model¶
When using a language model exported from TAO Toolkit, specify the language model by running the following riva-build command:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
/servicemaker-dev/<n_gram_riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_vocab=<decoder_vocab_file>
where decoder_vocab_file is the vocabulary used by the Flashlight lexicon decoder. The vocabulary file must contain one vocabulary word per line. A sample vocabulary file (flashlight_decoder_vocab.txt) can be downloaded from Riva ASR English LM.
ARPA format Language model¶
To configure the Riva ASR pipeline to use an n-gram language model stored in ARPA format, run:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_language_model_arpa=<arpa_filename> \
--decoding_vocab=<decoder_vocab_file>
KenLM binary Language model¶
To generate the Riva RMIR file when using a KenLM binary file to specify the language model, run:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_language_model_binary=<KENLM_binary_filename> \
--decoding_vocab=<decoder_vocab_file>
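The three commands above differ only in how the language model is passed to riva-build. A script that accepts any of the three formats could select the argument from the file extension, as in this minimal sketch. The helper name is hypothetical, and treating every extension other than .riva and .arpa as a KenLM binary is an assumption made for illustration.

```python
import os


def language_model_args(lm_path, encryption_key=None):
    """Return the riva-build argument that passes the given language model file."""
    extension = os.path.splitext(lm_path)[1].lower()
    if extension == ".riva":
        # TAO Toolkit export: extra positional argument after the acoustic model .riva file.
        return [f"{lm_path}:{encryption_key}" if encryption_key else lm_path]
    if extension == ".arpa":
        return [f"--decoding_language_model_arpa={lm_path}"]
    # Assumption: any other extension is a KenLM binary file.
    return [f"--decoding_language_model_binary={lm_path}"]
```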
Decoder Hyper-Parameters¶
The decoder language model hyper-parameters can also be specified from the riva-build command.
You can specify the Flashlight decoder hyper-parameters beam_size, beam_size_token, beam_threshold, lm_weight and word_insertion_score as follows:
riva-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_language_model_binary=<KENLM_binary_filename> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.beam_size=<beam_size> \
--flashlight_decoder.beam_size_token=<beam_size_token> \
--flashlight_decoder.beam_threshold=<beam_threshold> \
--flashlight_decoder.lm_weight=<lm_weight> \
--flashlight_decoder.word_insertion_score=<word_insertion_score>
where:
beam_size is the maximum number of hypotheses the decoder holds at each step
beam_size_token is the maximum number of tokens the decoder considers at each step
beam_threshold is the threshold used to prune hypotheses
lm_weight is the weight of the language model used when scoring hypotheses
word_insertion_score is the word insertion score used when scoring hypotheses
For advanced users, additional decoder hyper-parameters can also be specified. Refer to Riva-build Optional Parameters for a list of those parameters and their description.
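To give an intuition for how these hyper-parameters interact, the sketch below scores and prunes a handful of hypotheses. It assumes a simplified log-linear combination (acoustic score plus lm_weight times the language model score plus word_insertion_score per word) and a pruning rule that drops hypotheses scoring more than beam_threshold below the best one. This is an illustration of the general beam-search idea, not the decoder's actual implementation, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list
    acoustic_score: float  # log-domain score from the acoustic model
    lm_score: float        # log-domain score from the language model


def total_score(hyp, lm_weight, word_insertion_score):
    """Combine the scores; assumed form, see the lead-in above."""
    return (
        hyp.acoustic_score
        + lm_weight * hyp.lm_score
        + word_insertion_score * len(hyp.words)
    )


def prune(hyps, beam_size, beam_threshold, lm_weight, word_insertion_score):
    """Keep at most beam_size hypotheses within beam_threshold of the best score."""
    if not hyps:
        return []
    scored = sorted(
        ((total_score(h, lm_weight, word_insertion_score), h) for h in hyps),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best = scored[0][0]
    kept = [h for score, h in scored if best - score <= beam_threshold]
    return kept[:beam_size]


hyps = [
    Hypothesis(["a", "b"], acoustic_score=-10.0, lm_score=-4.0),
    Hypothesis(["ab"], acoustic_score=-11.0, lm_score=-2.0),
    Hypothesis(["x", "y", "z"], acoustic_score=-30.0, lm_score=-9.0),
]
kept = prune(hyps, beam_size=2, beam_threshold=20.0, lm_weight=1.0, word_insertion_score=0.0)
print([h.words for h in kept])
```

In this toy setup, raising word_insertion_score favors hypotheses with more words, while raising lm_weight gives the language model more influence over the ranking.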
Flashlight Decoder Lexicon¶
The Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the decoder vocabulary file passed to the riva-build command. The decoder vocabulary file used to generate the ASR pipelines in the Quick Start scripts includes words that cover a wide range of domains and should provide accurate transcripts for most applications.
It is also possible to build an ASR pipeline using your own decoder vocabulary file by using the --decoding_vocab parameter of the riva-build command. For example, you could start with the riva-build commands used to generate the ASR pipelines in our Quick Start scripts from section Pipeline Configuration and provide your own lexicon decoder vocabulary file.
You’ll need to ensure that words of interest are in the decoder vocabulary file. The Riva ServiceMaker automatically tokenizes the words in the decoder vocabulary file.
The number of tokenizations for each word in the decoder vocabulary file can be controlled with the --flashlight_decoder.num_tokenization parameter.
(Advanced) Manually Adding Additional Tokenizations of Words in Lexicon¶
It’s also possible to manually add additional tokenizations for the words in the decoder vocabulary by performing the following steps:
The riva-build and riva-deploy commands provided in the previous section store the lexicon in the /data/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-streaming/1/lexicon.txt file of the Triton model repository.
To add additional tokenizations to the lexicon, copy the lexicon file:
cp /data/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt
and add the SentencePiece tokenization for the word of interest. For example, you could add:
manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew
to the decoding_lexicon.txt file so that the word manu is generated in the transcript if the acoustic model predicts those tokens. You’ll need to ensure that the new lines follow the same indentation/spacing pattern as the rest of the file and that the tokens used are part of the tokenizer model. After this is done, regenerate the model repository using the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.
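The manual edit can also be scripted. The sketch below formats new lexicon lines and refuses tokens that are not in the tokenizer vocabulary. The tab between the word and its tokens is an assumption (match whatever separator the existing lexicon.txt uses), and the function names and the small token set are hypothetical.

```python
def lexicon_line(word, tokens, separator="\t"):
    """Format one lexicon entry: the word, a separator, then space-separated tokens."""
    return f"{word}{separator}{' '.join(tokens)}"


def add_tokenizations(existing_lines, word, tokenizations, tokenizer_vocab, separator="\t"):
    """Return the lexicon lines with new tokenizations of `word` appended.

    Raises ValueError if a tokenization uses a token the tokenizer does not know.
    """
    lines = list(existing_lines)
    seen = set(lines)
    for tokens in tokenizations:
        unknown = [t for t in tokens if t not in tokenizer_vocab]
        if unknown:
            raise ValueError(f"tokens not in tokenizer model: {unknown}")
        line = lexicon_line(word, tokens, separator)
        if line not in seen:  # skip duplicates
            lines.append(line)
            seen.add(line)
    return lines


# Hypothetical subset of the SentencePiece vocabulary.
tokenizer_vocab = {"▁ma", "▁man", "n", "u", "ew"}
lexicon = add_tokenizations(
    [],
    "manu",
    [["▁ma", "n", "u"], ["▁man", "n", "ew"]],
    tokenizer_vocab,
)
print("\n".join(lexicon))
```

In practice, the existing lines would be read from the copied decoding_lexicon.txt and the token set from the tokenizer model used by the acoustic model.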
Beginning/End of Utterance Detection¶
Riva ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is used to reset the ASR decoder state and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in a 300 ms window have non-blank characters. The end of an utterance is flagged when 98% of the frames in an 800 ms window are blank characters. You can tune those values for your particular use case by using the following riva-build parameters:
--vad.vad_start_history=300 \
--vad.vad_start_th=0.2 \
--vad.vad_stop_history=800 \
--vad.vad_stop_th=0.98
Additionally, it is possible to disable the beginning/end of utterance detection by passing --vad.vad_type=none to riva-build.
Note that in this case, the decoder state is reset only after the full audio signal has been sent by the client. Similarly, the punctuator model is only called once.
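The window-and-threshold rule can be illustrated with a short sketch. It treats the acoustic model output as a list of booleans (True for a blank frame), assumes 80 ms per timestep as in the riva-build commands on this page, and converts the window length to frames by integer division. The function names, the frame conversion, and the use of "greater than or equal" comparisons are assumptions; the actual Riva implementation may differ.

```python
def window_frames(history_ms, ms_per_timestep):
    """Number of acoustic model timesteps covered by the history window."""
    return max(1, history_ms // ms_per_timestep)


def utterance_started(is_blank, history_ms=300, threshold=0.2, ms_per_timestep=80):
    """True if enough of the most recent frames are non-blank."""
    window = is_blank[-window_frames(history_ms, ms_per_timestep):]
    if not window:
        return False
    non_blank = sum(1 for blank in window if not blank)
    return non_blank / len(window) >= threshold


def utterance_ended(is_blank, history_ms=800, threshold=0.98, ms_per_timestep=80):
    """True if a full window is available and enough of its frames are blank."""
    n = window_frames(history_ms, ms_per_timestep)
    window = is_blank[-n:]
    if len(window) < n:
        return False
    blank = sum(1 for b in window if b)
    return blank / n >= threshold


print(utterance_started([True, True, False]))   # one non-blank frame out of three
print(utterance_ended([True] * 10))             # ten consecutive blank frames
```

With these defaults, the start window spans 3 frames and the stop window spans 10 frames, so a single non-blank frame is enough to flag a start, and the end is flagged only when every frame in the stop window is blank.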
Generating Multiple Transcript Hypotheses¶
By default, the Riva ASR pipeline is configured to only generate the best transcript hypothesis for each utterance. It is possible to generate multiple transcript hypotheses by passing the parameter --max_supported_transcripts=N to the riva-build command, where N is the maximum number of hypotheses to generate. With this change, the client application can retrieve multiple hypotheses by setting the max_alternatives field of RecognitionConfig to a value greater than 1.
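On the client side, each recognition result then carries several alternatives. The sketch below ranks them by confidence using plain Python dataclasses as stand-ins for the response messages; the class names are hypothetical, and only the alternatives, transcript, and confidence field names are meant to mirror the response structure. It does not call the Riva client library.

```python
from dataclasses import dataclass, field


@dataclass
class Alternative:
    transcript: str
    confidence: float


@dataclass
class Result:
    alternatives: list = field(default_factory=list)


def ranked_transcripts(result, limit):
    """Return up to `limit` transcripts, highest confidence first."""
    ranked = sorted(result.alternatives, key=lambda alt: alt.confidence, reverse=True)
    return [alt.transcript for alt in ranked[:limit]]


result = Result(
    alternatives=[
        Alternative("recognize speech", 0.91),
        Alternative("wreck a nice beach", 0.42),
        Alternative("recognise speech", 0.77),
    ]
)
print(ranked_transcripts(result, limit=2))
```

An application might use the additional hypotheses for re-ranking with domain knowledge or for showing suggestions to the user.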
Riva-build Optional Parameters¶
For details about the parameters passed to riva-build to customize the ASR pipeline, issue:
riva-build speech_recognition -h
The following list includes descriptions for all optional parameters currently recognized by riva-build:
usage: riva-build speech_recognition [-h] [-f] [--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--acoustic_model_name ACOUSTIC_MODEL_NAME]
[--name NAME] [--streaming] [--offline]
[--chunk_size CHUNK_SIZE]
[--padding_factor PADDING_FACTOR]
[--left_padding_size LEFT_PADDING_SIZE]
[--right_padding_size RIGHT_PADDING_SIZE]
[--padding_size PADDING_SIZE]
[--max_supported_transcripts MAX_SUPPORTED_TRANSCRIPTS]
[--ms_per_timestep MS_PER_TIMESTEP]
[--lattice_beam LATTICE_BEAM]
[--decoding_language_model_arpa DECODING_LANGUAGE_MODEL_ARPA]
[--decoding_language_model_binary DECODING_LANGUAGE_MODEL_BINARY]
[--decoding_language_model_fst DECODING_LANGUAGE_MODEL_FST]
[--decoding_language_model_words DECODING_LANGUAGE_MODEL_WORDS]
[--rescoring_language_model_arpa RESCORING_LANGUAGE_MODEL_ARPA]
[--decoding_language_model_carpa DECODING_LANGUAGE_MODEL_CARPA]
[--rescoring_language_model_carpa RESCORING_LANGUAGE_MODEL_CARPA]
[--decoding_lexicon DECODING_LEXICON]
[--decoding_vocab DECODING_VOCAB]
[--tokenizer_model TOKENIZER_MODEL]
[--decoder_type DECODER_TYPE]
[--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
[--wfst_verbalizer_model WFST_VERBALIZER_MODEL]
[--featurizer.max_sequence_idle_microseconds FEATURIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--featurizer.max_batch_size FEATURIZER.MAX_BATCH_SIZE]
[--featurizer.min_batch_size FEATURIZER.MIN_BATCH_SIZE]
[--featurizer.opt_batch_size FEATURIZER.OPT_BATCH_SIZE]
[--featurizer.preferred_batch_size FEATURIZER.PREFERRED_BATCH_SIZE]
[--featurizer.batching_type FEATURIZER.BATCHING_TYPE]
[--featurizer.preserve_ordering FEATURIZER.PRESERVE_ORDERING]
[--featurizer.instance_group_count FEATURIZER.INSTANCE_GROUP_COUNT]
[--featurizer.max_queue_delay_microseconds FEATURIZER.MAX_QUEUE_DELAY_MICROSECONDS]
[--featurizer.max_execution_batch_size FEATURIZER.MAX_EXECUTION_BATCH_SIZE]
[--featurizer.gain FEATURIZER.GAIN]
[--featurizer.dither FEATURIZER.DITHER]
[--featurizer.stddev_floor FEATURIZER.STDDEV_FLOOR]
[--featurizer.use_utterance_norm_params FEATURIZER.USE_UTTERANCE_NORM_PARAMS]
[--featurizer.precalc_norm_time_steps FEATURIZER.PRECALC_NORM_TIME_STEPS]
[--featurizer.precalc_norm_params FEATURIZER.PRECALC_NORM_PARAMS]
[--featurizer.norm_per_feature FEATURIZER.NORM_PER_FEATURE]
[--featurizer.mean FEATURIZER.MEAN]
[--featurizer.stddev FEATURIZER.STDDEV]
[--featurizer.transpose FEATURIZER.TRANSPOSE]
[--featurizer.padding_size FEATURIZER.PADDING_SIZE]
[--nn.max_sequence_idle_microseconds NN.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--nn.max_batch_size NN.MAX_BATCH_SIZE]
[--nn.min_batch_size NN.MIN_BATCH_SIZE]
[--nn.opt_batch_size NN.OPT_BATCH_SIZE]
[--nn.preferred_batch_size NN.PREFERRED_BATCH_SIZE]
[--nn.batching_type NN.BATCHING_TYPE]
[--nn.preserve_ordering NN.PRESERVE_ORDERING]
[--nn.instance_group_count NN.INSTANCE_GROUP_COUNT]
[--nn.max_queue_delay_microseconds NN.MAX_QUEUE_DELAY_MICROSECONDS]
[--nn.trt_max_workspace_size NN.TRT_MAX_WORKSPACE_SIZE]
[--nn.use_onnx_runtime]
[--nn.use_trt_fp32]
[--vad.max_sequence_idle_microseconds VAD.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--vad.max_batch_size VAD.MAX_BATCH_SIZE]
[--vad.min_batch_size VAD.MIN_BATCH_SIZE]
[--vad.opt_batch_size VAD.OPT_BATCH_SIZE]
[--vad.preferred_batch_size VAD.PREFERRED_BATCH_SIZE]
[--vad.batching_type VAD.BATCHING_TYPE]
[--vad.preserve_ordering VAD.PRESERVE_ORDERING]
[--vad.instance_group_count VAD.INSTANCE_GROUP_COUNT]
[--vad.max_queue_delay_microseconds VAD.MAX_QUEUE_DELAY_MICROSECONDS]
[--vad.ms_per_timestep VAD.MS_PER_TIMESTEP]
[--vad.vad_start_history VAD.VAD_START_HISTORY]
[--vad.vad_stop_history VAD.VAD_STOP_HISTORY]
[--vad.vad_start_th VAD.VAD_START_TH]
[--vad.vad_stop_th VAD.VAD_STOP_TH]
[--vad.vad_type VAD.VAD_TYPE]
[--vad.residue_blanks_at_start VAD.RESIDUE_BLANKS_AT_START]
[--vad.residue_blanks_at_end VAD.RESIDUE_BLANKS_AT_END]
[--vad.vocab_file VAD.VOCAB_FILE]
[--flashlight_decoder.max_sequence_idle_microseconds FLASHLIGHT_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--flashlight_decoder.max_batch_size FLASHLIGHT_DECODER.MAX_BATCH_SIZE]
[--flashlight_decoder.min_batch_size FLASHLIGHT_DECODER.MIN_BATCH_SIZE]
[--flashlight_decoder.opt_batch_size FLASHLIGHT_DECODER.OPT_BATCH_SIZE]
[--flashlight_decoder.preferred_batch_size FLASHLIGHT_DECODER.PREFERRED_BATCH_SIZE]
[--flashlight_decoder.batching_type FLASHLIGHT_DECODER.BATCHING_TYPE]
[--flashlight_decoder.preserve_ordering FLASHLIGHT_DECODER.PRESERVE_ORDERING]
[--flashlight_decoder.instance_group_count FLASHLIGHT_DECODER.INSTANCE_GROUP_COUNT]
[--flashlight_decoder.max_queue_delay_microseconds FLASHLIGHT_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--flashlight_decoder.max_execution_batch_size FLASHLIGHT_DECODER.MAX_EXECUTION_BATCH_SIZE]
[--flashlight_decoder.decoder_type FLASHLIGHT_DECODER.DECODER_TYPE]
[--flashlight_decoder.padding_size FLASHLIGHT_DECODER.PADDING_SIZE]
[--flashlight_decoder.max_supported_transcripts FLASHLIGHT_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
[--flashlight_decoder.asr_model_delay FLASHLIGHT_DECODER.ASR_MODEL_DELAY]
[--flashlight_decoder.ms_per_timestep FLASHLIGHT_DECODER.MS_PER_TIMESTEP]
[--flashlight_decoder.vocab_file FLASHLIGHT_DECODER.VOCAB_FILE]
[--flashlight_decoder.decoder_num_worker_threads FLASHLIGHT_DECODER.DECODER_NUM_WORKER_THREADS]
[--flashlight_decoder.language_model_file FLASHLIGHT_DECODER.LANGUAGE_MODEL_FILE]
[--flashlight_decoder.lexicon_file FLASHLIGHT_DECODER.LEXICON_FILE]
[--flashlight_decoder.beam_size FLASHLIGHT_DECODER.BEAM_SIZE]
[--flashlight_decoder.beam_size_token FLASHLIGHT_DECODER.BEAM_SIZE_TOKEN]
[--flashlight_decoder.beam_threshold FLASHLIGHT_DECODER.BEAM_THRESHOLD]
[--flashlight_decoder.lm_weight FLASHLIGHT_DECODER.LM_WEIGHT]
[--flashlight_decoder.blank_token FLASHLIGHT_DECODER.BLANK_TOKEN]
[--flashlight_decoder.sil_token FLASHLIGHT_DECODER.SIL_TOKEN]
[--flashlight_decoder.word_insertion_score FLASHLIGHT_DECODER.WORD_INSERTION_SCORE]
[--flashlight_decoder.forerunner_beam_size FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE]
[--flashlight_decoder.forerunner_beam_size_token FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE_TOKEN]
[--flashlight_decoder.forerunner_beam_threshold FLASHLIGHT_DECODER.FORERUNNER_BEAM_THRESHOLD]
[--flashlight_decoder.smearing_mode FLASHLIGHT_DECODER.SMEARING_MODE]
[--flashlight_decoder.forerunner_use_lm FLASHLIGHT_DECODER.FORERUNNER_USE_LM]
[--flashlight_decoder.num_tokenization FLASHLIGHT_DECODER.NUM_TOKENIZATION]
[--greedy_decoder.max_sequence_idle_microseconds GREEDY_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--greedy_decoder.max_batch_size GREEDY_DECODER.MAX_BATCH_SIZE]
[--greedy_decoder.min_batch_size GREEDY_DECODER.MIN_BATCH_SIZE]
[--greedy_decoder.opt_batch_size GREEDY_DECODER.OPT_BATCH_SIZE]
[--greedy_decoder.preferred_batch_size GREEDY_DECODER.PREFERRED_BATCH_SIZE]
[--greedy_decoder.batching_type GREEDY_DECODER.BATCHING_TYPE]
[--greedy_decoder.preserve_ordering GREEDY_DECODER.PRESERVE_ORDERING]
[--greedy_decoder.instance_group_count GREEDY_DECODER.INSTANCE_GROUP_COUNT]
[--greedy_decoder.max_queue_delay_microseconds GREEDY_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--greedy_decoder.max_execution_batch_size GREEDY_DECODER.MAX_EXECUTION_BATCH_SIZE]
[--greedy_decoder.decoder_type GREEDY_DECODER.DECODER_TYPE]
[--greedy_decoder.padding_size GREEDY_DECODER.PADDING_SIZE]
[--greedy_decoder.max_supported_transcripts GREEDY_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
[--greedy_decoder.asr_model_delay GREEDY_DECODER.ASR_MODEL_DELAY]
[--greedy_decoder.ms_per_timestep GREEDY_DECODER.MS_PER_TIMESTEP]
[--greedy_decoder.vocab_file GREEDY_DECODER.VOCAB_FILE]
[--greedy_decoder.decoder_num_worker_threads GREEDY_DECODER.DECODER_NUM_WORKER_THREADS]
[--os2s_decoder.max_sequence_idle_microseconds OS2S_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--os2s_decoder.max_batch_size OS2S_DECODER.MAX_BATCH_SIZE]
[--os2s_decoder.min_batch_size OS2S_DECODER.MIN_BATCH_SIZE]
[--os2s_decoder.opt_batch_size OS2S_DECODER.OPT_BATCH_SIZE]
[--os2s_decoder.preferred_batch_size OS2S_DECODER.PREFERRED_BATCH_SIZE]
[--os2s_decoder.batching_type OS2S_DECODER.BATCHING_TYPE]
[--os2s_decoder.preserve_ordering OS2S_DECODER.PRESERVE_ORDERING]
[--os2s_decoder.instance_group_count OS2S_DECODER.INSTANCE_GROUP_COUNT]
[--os2s_decoder.max_queue_delay_microseconds OS2S_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--os2s_decoder.max_execution_batch_size OS2S_DECODER.MAX_EXECUTION_BATCH_SIZE]
[--os2s_decoder.decoder_type OS2S_DECODER.DECODER_TYPE]
[--os2s_decoder.padding_size OS2S_DECODER.PADDING_SIZE]
[--os2s_decoder.max_supported_transcripts OS2S_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
[--os2s_decoder.asr_model_delay OS2S_DECODER.ASR_MODEL_DELAY]
[--os2s_decoder.ms_per_timestep OS2S_DECODER.MS_PER_TIMESTEP]
[--os2s_decoder.vocab_file OS2S_DECODER.VOCAB_FILE]
[--os2s_decoder.decoder_num_worker_threads OS2S_DECODER.DECODER_NUM_WORKER_THREADS]
[--os2s_decoder.language_model_file OS2S_DECODER.LANGUAGE_MODEL_FILE]
[--os2s_decoder.beam_search_width OS2S_DECODER.BEAM_SEARCH_WIDTH]
[--os2s_decoder.language_model_alpha OS2S_DECODER.LANGUAGE_MODEL_ALPHA]
[--os2s_decoder.language_model_beta OS2S_DECODER.LANGUAGE_MODEL_BETA]
[--kaldi_decoder.max_sequence_idle_microseconds KALDI_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--kaldi_decoder.max_batch_size KALDI_DECODER.MAX_BATCH_SIZE]
[--kaldi_decoder.min_batch_size KALDI_DECODER.MIN_BATCH_SIZE]
[--kaldi_decoder.opt_batch_size KALDI_DECODER.OPT_BATCH_SIZE]
[--kaldi_decoder.preferred_batch_size KALDI_DECODER.PREFERRED_BATCH_SIZE]
[--kaldi_decoder.batching_type KALDI_DECODER.BATCHING_TYPE]
[--kaldi_decoder.preserve_ordering KALDI_DECODER.PRESERVE_ORDERING]
[--kaldi_decoder.instance_group_count KALDI_DECODER.INSTANCE_GROUP_COUNT]
[--kaldi_decoder.max_queue_delay_microseconds KALDI_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
[--kaldi_decoder.max_execution_batch_size KALDI_DECODER.MAX_EXECUTION_BATCH_SIZE]
[--kaldi_decoder.decoder_type KALDI_DECODER.DECODER_TYPE]
[--kaldi_decoder.padding_size KALDI_DECODER.PADDING_SIZE]
[--kaldi_decoder.max_supported_transcripts KALDI_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
[--kaldi_decoder.asr_model_delay KALDI_DECODER.ASR_MODEL_DELAY]
[--kaldi_decoder.ms_per_timestep KALDI_DECODER.MS_PER_TIMESTEP]
[--kaldi_decoder.vocab_file KALDI_DECODER.VOCAB_FILE]
[--kaldi_decoder.decoder_num_worker_threads KALDI_DECODER.DECODER_NUM_WORKER_THREADS]
[--kaldi_decoder.fst_filename KALDI_DECODER.FST_FILENAME]
[--kaldi_decoder.word_syms_filename KALDI_DECODER.WORD_SYMS_FILENAME]
[--kaldi_decoder.default_beam KALDI_DECODER.DEFAULT_BEAM]
[--kaldi_decoder.max_active KALDI_DECODER.MAX_ACTIVE]
[--kaldi_decoder.acoustic_scale KALDI_DECODER.ACOUSTIC_SCALE]
[--kaldi_decoder.decoder_num_copy_threads KALDI_DECODER.DECODER_NUM_COPY_THREADS]
[--kaldi_decoder.determinize_lattice KALDI_DECODER.DETERMINIZE_LATTICE]
[--rescorer.max_sequence_idle_microseconds RESCORER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--rescorer.max_batch_size RESCORER.MAX_BATCH_SIZE]
[--rescorer.min_batch_size RESCORER.MIN_BATCH_SIZE]
[--rescorer.opt_batch_size RESCORER.OPT_BATCH_SIZE]
[--rescorer.preferred_batch_size RESCORER.PREFERRED_BATCH_SIZE]
[--rescorer.batching_type RESCORER.BATCHING_TYPE]
[--rescorer.preserve_ordering RESCORER.PRESERVE_ORDERING]
[--rescorer.instance_group_count RESCORER.INSTANCE_GROUP_COUNT]
[--rescorer.max_queue_delay_microseconds RESCORER.MAX_QUEUE_DELAY_MICROSECONDS]
[--rescorer.max_supported_transcripts RESCORER.MAX_SUPPORTED_TRANSCRIPTS]
[--rescorer.score_lm_carpa_filename RESCORER.SCORE_LM_CARPA_FILENAME]
[--rescorer.decode_lm_carpa_filename RESCORER.DECODE_LM_CARPA_FILENAME]
[--rescorer.word_syms_filename RESCORER.WORD_SYMS_FILENAME]
[--rescorer.word_insertion_penalty RESCORER.WORD_INSERTION_PENALTY]
[--rescorer.num_worker_threads RESCORER.NUM_WORKER_THREADS]
[--rescorer.ms_per_timestep RESCORER.MS_PER_TIMESTEP]
[--rescorer.boundary_character_ids RESCORER.BOUNDARY_CHARACTER_IDS]
[--rescorer.vocab_file RESCORER.VOCAB_FILE]
[--lm_decoder_cpu.beam_search_width LM_DECODER_CPU.BEAM_SEARCH_WIDTH]
[--lm_decoder_cpu.decoder_type LM_DECODER_CPU.DECODER_TYPE]
[--lm_decoder_cpu.padding_size LM_DECODER_CPU.PADDING_SIZE]
[--lm_decoder_cpu.language_model_file LM_DECODER_CPU.LANGUAGE_MODEL_FILE]
[--lm_decoder_cpu.max_supported_transcripts LM_DECODER_CPU.MAX_SUPPORTED_TRANSCRIPTS]
[--lm_decoder_cpu.asr_model_delay LM_DECODER_CPU.ASR_MODEL_DELAY]
[--lm_decoder_cpu.language_model_alpha LM_DECODER_CPU.LANGUAGE_MODEL_ALPHA]
[--lm_decoder_cpu.language_model_beta LM_DECODER_CPU.LANGUAGE_MODEL_BETA]
[--lm_decoder_cpu.ms_per_timestep LM_DECODER_CPU.MS_PER_TIMESTEP]
[--lm_decoder_cpu.vocab_file LM_DECODER_CPU.VOCAB_FILE]
[--lm_decoder_cpu.lexicon_file LM_DECODER_CPU.LEXICON_FILE]
[--lm_decoder_cpu.beam_size LM_DECODER_CPU.BEAM_SIZE]
[--lm_decoder_cpu.beam_size_token LM_DECODER_CPU.BEAM_SIZE_TOKEN]
[--lm_decoder_cpu.beam_threshold LM_DECODER_CPU.BEAM_THRESHOLD]
[--lm_decoder_cpu.lm_weight LM_DECODER_CPU.LM_WEIGHT]
[--lm_decoder_cpu.word_insertion_score LM_DECODER_CPU.WORD_INSERTION_SCORE]
[--lm_decoder_cpu.forerunner_beam_size LM_DECODER_CPU.FORERUNNER_BEAM_SIZE]
[--lm_decoder_cpu.forerunner_beam_size_token LM_DECODER_CPU.FORERUNNER_BEAM_SIZE_TOKEN]
[--lm_decoder_cpu.forerunner_beam_threshold LM_DECODER_CPU.FORERUNNER_BEAM_THRESHOLD]
[--lm_decoder_cpu.smearing_mode LM_DECODER_CPU.SMEARING_MODE]
[--lm_decoder_cpu.forerunner_use_lm LM_DECODER_CPU.FORERUNNER_USE_LM]
output_path source_path [source_path ...]
Generate a Riva Model from a speech_recognition model trained with NVIDIA
NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--acoustic_model_name ACOUSTIC_MODEL_NAME
name of the acoustic model
--name NAME name of the ASR pipeline, used to set the model names
in the Riva model repository
--streaming Execute model in streaming mode
--offline In streaming mode, do not minimize latency
--chunk_size CHUNK_SIZE
Size of audio chunks to use during inference. If not
specified, default will be selected based on
online/offline setting
--padding_factor PADDING_FACTOR
Multiple on the chunk_size. Deprecated and will be
ignored
--left_padding_size LEFT_PADDING_SIZE
The duration in seconds of the backward looking
padding to prepend to the audio chunk. The acoustic
model input corresponds to a duration of
(left_padding_size + chunk_size + right_padding_size)
seconds
--right_padding_size RIGHT_PADDING_SIZE
The duration in seconds of the forward looking padding
to append to the audio chunk. The acoustic model input
corresponds to a duration of (left_padding_size +
chunk_size + right_padding_size) seconds
--padding_size PADDING_SIZE
padding_size
--max_supported_transcripts MAX_SUPPORTED_TRANSCRIPTS
The maximum number of hypothesized transcripts
generated per utterance
--ms_per_timestep MS_PER_TIMESTEP
The duration in milliseconds of one timestep of the
acoustic model output
--lattice_beam LATTICE_BEAM
--decoding_language_model_arpa DECODING_LANGUAGE_MODEL_ARPA
Language model .arpa used during decoding
--decoding_language_model_binary DECODING_LANGUAGE_MODEL_BINARY
Language model .binary used during decoding
--decoding_language_model_fst DECODING_LANGUAGE_MODEL_FST
Language model fst used during decoding
--decoding_language_model_words DECODING_LANGUAGE_MODEL_WORDS
Language model words used during decoding
--rescoring_language_model_arpa RESCORING_LANGUAGE_MODEL_ARPA
Language model .arpa used during lattice rescoring
--decoding_language_model_carpa DECODING_LANGUAGE_MODEL_CARPA
Language model .carpa used during decoding
--rescoring_language_model_carpa RESCORING_LANGUAGE_MODEL_CARPA
Language model .carpa used during lattice rescoring
--decoding_lexicon DECODING_LEXICON
Lexicon to use when decoding
--decoding_vocab DECODING_VOCAB
File of unique words separated by white space. Only
used if decoding_lexicon not provided.
--tokenizer_model TOKENIZER_MODEL
Sentencpiece model to use for encoding. Only include
if generating lexicon from vocab.
--decoder_type DECODER_TYPE
Type of decoder to use. Valid entries are greedy,
os2s, flashlight or kaldi
--wfst_tokenizer_model WFST_TOKENIZER_MODEL
Sparrowhawk model to use for tokenization and
classification, must be in .far format
--wfst_verbalizer_model WFST_VERBALIZER_MODEL
Sparrowhawk model to use for verbalizer, must be in
.far format.
featurizer:
--featurizer.max_sequence_idle_microseconds FEATURIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--featurizer.max_batch_size FEATURIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--featurizer.min_batch_size FEATURIZER.MIN_BATCH_SIZE
--featurizer.opt_batch_size FEATURIZER.OPT_BATCH_SIZE
--featurizer.preferred_batch_size FEATURIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--featurizer.batching_type FEATURIZER.BATCHING_TYPE
--featurizer.preserve_ordering FEATURIZER.PRESERVE_ORDERING
Preserve ordering
--featurizer.instance_group_count FEATURIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--featurizer.max_queue_delay_microseconds FEATURIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--featurizer.max_execution_batch_size FEATURIZER.MAX_EXECUTION_BATCH_SIZE
Maximum Batch Size
--featurizer.gain FEATURIZER.GAIN
Adjust input signal with this gain multiplier prior to
feature extraction
--featurizer.dither FEATURIZER.DITHER
Augment signal with gaussian noise with this gain to
prevent quantization artifacts
--featurizer.stddev_floor FEATURIZER.STDDEV_FLOOR
Add this value to computed features standard
deviation. Higher values help reduce spurious
transcripts with low energy signals.
--featurizer.use_utterance_norm_params FEATURIZER.USE_UTTERANCE_NORM_PARAMS
Apply normalization at utterance level
--featurizer.precalc_norm_time_steps FEATURIZER.PRECALC_NORM_TIME_STEPS
Weight of the precomputed normalization parameters, in
timesteps. Setting to 0 will disable use of
precalculated normalization parameters.
--featurizer.precalc_norm_params FEATURIZER.PRECALC_NORM_PARAMS
Boolean that controls if precalculated Normalization
Parameters should be used
--featurizer.norm_per_feature FEATURIZER.NORM_PER_FEATURE
Normalize Per Feature
--featurizer.mean FEATURIZER.MEAN
Pre-computed mean values
--featurizer.stddev FEATURIZER.STDDEV
Pre-computed Std Dev Values
--featurizer.transpose FEATURIZER.TRANSPOSE
Take transpose of output features
--featurizer.padding_size FEATURIZER.PADDING_SIZE
padding_size
nn:
--nn.max_sequence_idle_microseconds NN.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--nn.max_batch_size NN.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--nn.min_batch_size NN.MIN_BATCH_SIZE
--nn.opt_batch_size NN.OPT_BATCH_SIZE
--nn.preferred_batch_size NN.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--nn.batching_type NN.BATCHING_TYPE
--nn.preserve_ordering NN.PRESERVE_ORDERING
Preserve ordering
--nn.instance_group_count NN.INSTANCE_GROUP_COUNT
How many instances in a group
--nn.max_queue_delay_microseconds NN.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--nn.trt_max_workspace_size NN.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in bytes) to use for model
export to TensorRT
--nn.use_onnx_runtime
Use ONNX runtime instead of TRT
--nn.use_trt_fp32 Use TRT engine with fp32 instead of fp16
vad:
--vad.max_sequence_idle_microseconds VAD.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--vad.max_batch_size VAD.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--vad.min_batch_size VAD.MIN_BATCH_SIZE
--vad.opt_batch_size VAD.OPT_BATCH_SIZE
--vad.preferred_batch_size VAD.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--vad.batching_type VAD.BATCHING_TYPE
--vad.preserve_ordering VAD.PRESERVE_ORDERING
Preserve ordering
--vad.instance_group_count VAD.INSTANCE_GROUP_COUNT
How many instances in a group
--vad.max_queue_delay_microseconds VAD.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--vad.ms_per_timestep VAD.MS_PER_TIMESTEP
--vad.vad_start_history VAD.VAD_START_HISTORY
Size of the window, in milliseconds, to use to detect
start of utterance. If (vad_start_th) of
(vad_start_history) ms of the acoustic model output
have non-blank tokens, start of utterance is detected.
--vad.vad_stop_history VAD.VAD_STOP_HISTORY
Size of the window, in milliseconds, to use to detect
end of utterance. If (vad_stop_th) of
(vad_stop_history) ms of the acoustic model output
have non-blank tokens, end of utterance is detected.
--vad.vad_start_th VAD.VAD_START_TH
Percentage threshold to use to detect start of
utterance. If (vad_start_th) of (vad_start_history) ms
of the acoustic model output have non-blank tokens,
start of utterance is detected.
--vad.vad_stop_th VAD.VAD_STOP_TH
Percentage threshold to use to detect end of
utterance. If (vad_stop_th) of (vad_stop_history) ms
of the acoustic model output have non-blank tokens,
end of utterance is detected.
--vad.vad_type VAD.VAD_TYPE
Type of voice activity detection algorithm to use. Set
to none to disable VAD.
--vad.residue_blanks_at_start VAD.RESIDUE_BLANKS_AT_START
(Advanced) Number of time steps to ignore at the
beginning of the acoustic model output when trying to
detect start/end of speech
--vad.residue_blanks_at_end VAD.RESIDUE_BLANKS_AT_END
(Advanced) Number of time steps to ignore at the end
of the acoustic model output when trying to detect
start/end of speech
--vad.vocab_file VAD.VOCAB_FILE
Vocab file to be used with decoder
flashlight_decoder:
--flashlight_decoder.max_sequence_idle_microseconds FLASHLIGHT_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--flashlight_decoder.max_batch_size FLASHLIGHT_DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--flashlight_decoder.min_batch_size FLASHLIGHT_DECODER.MIN_BATCH_SIZE
--flashlight_decoder.opt_batch_size FLASHLIGHT_DECODER.OPT_BATCH_SIZE
--flashlight_decoder.preferred_batch_size FLASHLIGHT_DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--flashlight_decoder.batching_type FLASHLIGHT_DECODER.BATCHING_TYPE
--flashlight_decoder.preserve_ordering FLASHLIGHT_DECODER.PRESERVE_ORDERING
Preserve ordering
--flashlight_decoder.instance_group_count FLASHLIGHT_DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--flashlight_decoder.max_queue_delay_microseconds FLASHLIGHT_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--flashlight_decoder.max_execution_batch_size FLASHLIGHT_DECODER.MAX_EXECUTION_BATCH_SIZE
--flashlight_decoder.decoder_type FLASHLIGHT_DECODER.DECODER_TYPE
--flashlight_decoder.padding_size FLASHLIGHT_DECODER.PADDING_SIZE
padding_size
--flashlight_decoder.max_supported_transcripts FLASHLIGHT_DECODER.MAX_SUPPORTED_TRANSCRIPTS
--flashlight_decoder.asr_model_delay FLASHLIGHT_DECODER.ASR_MODEL_DELAY
(Advanced) Number of time steps by which the acoustic
model output should be shifted when computing
timestamps. This parameter must be tuned since the CTC
model is not guaranteed to predict correct alignment.
--flashlight_decoder.ms_per_timestep FLASHLIGHT_DECODER.MS_PER_TIMESTEP
--flashlight_decoder.vocab_file FLASHLIGHT_DECODER.VOCAB_FILE
Vocab file to be used with decoder
--flashlight_decoder.decoder_num_worker_threads FLASHLIGHT_DECODER.DECODER_NUM_WORKER_THREADS
Number of threads to use for CPU decoders. If < 1,
maximum hardware concurrency is used.
--flashlight_decoder.language_model_file FLASHLIGHT_DECODER.LANGUAGE_MODEL_FILE
Language model file in binary format to be used by
KenLM
--flashlight_decoder.lexicon_file FLASHLIGHT_DECODER.LEXICON_FILE
Lexicon file to be used with decoder
--flashlight_decoder.beam_size FLASHLIGHT_DECODER.BEAM_SIZE
Maximum number of hypothesis the decoder holds after
each step
--flashlight_decoder.beam_size_token FLASHLIGHT_DECODER.BEAM_SIZE_TOKEN
Maximum number of tokens the decoder considers at each
step
--flashlight_decoder.beam_threshold FLASHLIGHT_DECODER.BEAM_THRESHOLD
Threshold to prune hypothesis
--flashlight_decoder.lm_weight FLASHLIGHT_DECODER.LM_WEIGHT
Weight of language model
--flashlight_decoder.blank_token FLASHLIGHT_DECODER.BLANK_TOKEN
Blank token
--flashlight_decoder.sil_token FLASHLIGHT_DECODER.SIL_TOKEN
Silence token
--flashlight_decoder.word_insertion_score FLASHLIGHT_DECODER.WORD_INSERTION_SCORE
Word insertion score
--flashlight_decoder.forerunner_beam_size FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE
Maximum number of hypothesis the decoder holds after
each step, for forerunner transcript
--flashlight_decoder.forerunner_beam_size_token FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE_TOKEN
Maximum number of tokens the decoder considers at each
step, for forerunner transcript
--flashlight_decoder.forerunner_beam_threshold FLASHLIGHT_DECODER.FORERUNNER_BEAM_THRESHOLD
Threshold to prune hypothesis, for forerunner
transcript
--flashlight_decoder.smearing_mode FLASHLIGHT_DECODER.SMEARING_MODE
Decoder smearing mode. Can be logadd, max or none
--flashlight_decoder.forerunner_use_lm FLASHLIGHT_DECODER.FORERUNNER_USE_LM
Bool that controls if the forerunner decoder should
use a language model
--flashlight_decoder.num_tokenization FLASHLIGHT_DECODER.NUM_TOKENIZATION
Number of tokenizations to generate for each word in
the lexicon
greedy_decoder:
--greedy_decoder.max_sequence_idle_microseconds GREEDY_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--greedy_decoder.max_batch_size GREEDY_DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--greedy_decoder.min_batch_size GREEDY_DECODER.MIN_BATCH_SIZE
--greedy_decoder.opt_batch_size GREEDY_DECODER.OPT_BATCH_SIZE
--greedy_decoder.preferred_batch_size GREEDY_DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--greedy_decoder.batching_type GREEDY_DECODER.BATCHING_TYPE
--greedy_decoder.preserve_ordering GREEDY_DECODER.PRESERVE_ORDERING
Preserve ordering
--greedy_decoder.instance_group_count GREEDY_DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--greedy_decoder.max_queue_delay_microseconds GREEDY_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--greedy_decoder.max_execution_batch_size GREEDY_DECODER.MAX_EXECUTION_BATCH_SIZE
--greedy_decoder.decoder_type GREEDY_DECODER.DECODER_TYPE
--greedy_decoder.padding_size GREEDY_DECODER.PADDING_SIZE
padding_size
--greedy_decoder.max_supported_transcripts GREEDY_DECODER.MAX_SUPPORTED_TRANSCRIPTS
--greedy_decoder.asr_model_delay GREEDY_DECODER.ASR_MODEL_DELAY
(Advanced) Number of time steps by which the acoustic
model output should be shifted when computing
timestamps. This parameter must be tuned since the CTC
model is not guaranteed to predict correct alignment.
--greedy_decoder.ms_per_timestep GREEDY_DECODER.MS_PER_TIMESTEP
--greedy_decoder.vocab_file GREEDY_DECODER.VOCAB_FILE
Vocab file to be used with decoder
--greedy_decoder.decoder_num_worker_threads GREEDY_DECODER.DECODER_NUM_WORKER_THREADS
Number of threads to use for CPU decoders. If < 1,
maximum hardware concurrency is used.
os2s_decoder:
--os2s_decoder.max_sequence_idle_microseconds OS2S_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--os2s_decoder.max_batch_size OS2S_DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--os2s_decoder.min_batch_size OS2S_DECODER.MIN_BATCH_SIZE
--os2s_decoder.opt_batch_size OS2S_DECODER.OPT_BATCH_SIZE
--os2s_decoder.preferred_batch_size OS2S_DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--os2s_decoder.batching_type OS2S_DECODER.BATCHING_TYPE
--os2s_decoder.preserve_ordering OS2S_DECODER.PRESERVE_ORDERING
Preserve ordering
--os2s_decoder.instance_group_count OS2S_DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--os2s_decoder.max_queue_delay_microseconds OS2S_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--os2s_decoder.max_execution_batch_size OS2S_DECODER.MAX_EXECUTION_BATCH_SIZE
--os2s_decoder.decoder_type OS2S_DECODER.DECODER_TYPE
--os2s_decoder.padding_size OS2S_DECODER.PADDING_SIZE
padding_size
--os2s_decoder.max_supported_transcripts OS2S_DECODER.MAX_SUPPORTED_TRANSCRIPTS
--os2s_decoder.asr_model_delay OS2S_DECODER.ASR_MODEL_DELAY
(Advanced) Number of time steps by which the acoustic
model output should be shifted when computing
timestamps. This parameter must be tuned since the CTC
model is not guaranteed to predict correct alignment.
--os2s_decoder.ms_per_timestep OS2S_DECODER.MS_PER_TIMESTEP
--os2s_decoder.vocab_file OS2S_DECODER.VOCAB_FILE
Vocab file to be used with decoder
--os2s_decoder.decoder_num_worker_threads OS2S_DECODER.DECODER_NUM_WORKER_THREADS
Number of threads to use for CPU decoders. If < 1,
maximum hardware concurrency is used.
--os2s_decoder.language_model_file OS2S_DECODER.LANGUAGE_MODEL_FILE
Language model file in binary format to be used by
KenLM
--os2s_decoder.beam_search_width OS2S_DECODER.BEAM_SEARCH_WIDTH
--os2s_decoder.language_model_alpha OS2S_DECODER.LANGUAGE_MODEL_ALPHA
--os2s_decoder.language_model_beta OS2S_DECODER.LANGUAGE_MODEL_BETA
kaldi_decoder:
--kaldi_decoder.max_sequence_idle_microseconds KALDI_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--kaldi_decoder.max_batch_size KALDI_DECODER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--kaldi_decoder.min_batch_size KALDI_DECODER.MIN_BATCH_SIZE
--kaldi_decoder.opt_batch_size KALDI_DECODER.OPT_BATCH_SIZE
--kaldi_decoder.preferred_batch_size KALDI_DECODER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--kaldi_decoder.batching_type KALDI_DECODER.BATCHING_TYPE
--kaldi_decoder.preserve_ordering KALDI_DECODER.PRESERVE_ORDERING
Preserve ordering
--kaldi_decoder.instance_group_count KALDI_DECODER.INSTANCE_GROUP_COUNT
How many instances in a group
--kaldi_decoder.max_queue_delay_microseconds KALDI_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--kaldi_decoder.max_execution_batch_size KALDI_DECODER.MAX_EXECUTION_BATCH_SIZE
--kaldi_decoder.decoder_type KALDI_DECODER.DECODER_TYPE
--kaldi_decoder.padding_size KALDI_DECODER.PADDING_SIZE
padding_size
--kaldi_decoder.max_supported_transcripts KALDI_DECODER.MAX_SUPPORTED_TRANSCRIPTS
--kaldi_decoder.asr_model_delay KALDI_DECODER.ASR_MODEL_DELAY
(Advanced) Number of time steps by which the acoustic
model output should be shifted when computing
timestamps. This parameter must be tuned since the CTC
model is not guaranteed to predict correct alignment.
--kaldi_decoder.ms_per_timestep KALDI_DECODER.MS_PER_TIMESTEP
--kaldi_decoder.vocab_file KALDI_DECODER.VOCAB_FILE
Vocab file to be used with decoder
--kaldi_decoder.decoder_num_worker_threads KALDI_DECODER.DECODER_NUM_WORKER_THREADS
Number of threads to use for CPU decoders. If < 1,
maximum hardware concurrency is used.
--kaldi_decoder.fst_filename KALDI_DECODER.FST_FILENAME
Fst file to use during decoding
--kaldi_decoder.word_syms_filename KALDI_DECODER.WORD_SYMS_FILENAME
--kaldi_decoder.default_beam KALDI_DECODER.DEFAULT_BEAM
--kaldi_decoder.max_active KALDI_DECODER.MAX_ACTIVE
--kaldi_decoder.acoustic_scale KALDI_DECODER.ACOUSTIC_SCALE
--kaldi_decoder.decoder_num_copy_threads KALDI_DECODER.DECODER_NUM_COPY_THREADS
--kaldi_decoder.determinize_lattice KALDI_DECODER.DETERMINIZE_LATTICE
rescorer:
--rescorer.max_sequence_idle_microseconds RESCORER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--rescorer.max_batch_size RESCORER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--rescorer.min_batch_size RESCORER.MIN_BATCH_SIZE
--rescorer.opt_batch_size RESCORER.OPT_BATCH_SIZE
--rescorer.preferred_batch_size RESCORER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--rescorer.batching_type RESCORER.BATCHING_TYPE
--rescorer.preserve_ordering RESCORER.PRESERVE_ORDERING
Preserve ordering
--rescorer.instance_group_count RESCORER.INSTANCE_GROUP_COUNT
How many instances in a group
--rescorer.max_queue_delay_microseconds RESCORER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--rescorer.max_supported_transcripts RESCORER.MAX_SUPPORTED_TRANSCRIPTS
--rescorer.score_lm_carpa_filename RESCORER.SCORE_LM_CARPA_FILENAME
--rescorer.decode_lm_carpa_filename RESCORER.DECODE_LM_CARPA_FILENAME
--rescorer.word_syms_filename RESCORER.WORD_SYMS_FILENAME
--rescorer.word_insertion_penalty RESCORER.WORD_INSERTION_PENALTY
--rescorer.num_worker_threads RESCORER.NUM_WORKER_THREADS
--rescorer.ms_per_timestep RESCORER.MS_PER_TIMESTEP
--rescorer.boundary_character_ids RESCORER.BOUNDARY_CHARACTER_IDS
--rescorer.vocab_file RESCORER.VOCAB_FILE
Vocab file to be used with decoder
lm_decoder_cpu:
--lm_decoder_cpu.beam_search_width LM_DECODER_CPU.BEAM_SEARCH_WIDTH
--lm_decoder_cpu.decoder_type LM_DECODER_CPU.DECODER_TYPE
--lm_decoder_cpu.padding_size LM_DECODER_CPU.PADDING_SIZE
padding_size
--lm_decoder_cpu.language_model_file LM_DECODER_CPU.LANGUAGE_MODEL_FILE
Language model file in binary format to be used by
KenLM
--lm_decoder_cpu.max_supported_transcripts LM_DECODER_CPU.MAX_SUPPORTED_TRANSCRIPTS
--lm_decoder_cpu.asr_model_delay LM_DECODER_CPU.ASR_MODEL_DELAY
(Advanced) Number of time steps by which the acoustic
model output should be shifted when computing
timestamps. This parameter must be tuned since the CTC
model is not guaranteed to predict correct alignment.
--lm_decoder_cpu.language_model_alpha LM_DECODER_CPU.LANGUAGE_MODEL_ALPHA
--lm_decoder_cpu.language_model_beta LM_DECODER_CPU.LANGUAGE_MODEL_BETA
--lm_decoder_cpu.ms_per_timestep LM_DECODER_CPU.MS_PER_TIMESTEP
--lm_decoder_cpu.vocab_file LM_DECODER_CPU.VOCAB_FILE
Vocab file to be used with decoder
--lm_decoder_cpu.lexicon_file LM_DECODER_CPU.LEXICON_FILE
Lexicon file to be used with decoder
--lm_decoder_cpu.beam_size LM_DECODER_CPU.BEAM_SIZE
Maximum number of hypothesis the decoder holds after
each step
--lm_decoder_cpu.beam_size_token LM_DECODER_CPU.BEAM_SIZE_TOKEN
Maximum number of tokens the decoder considers at each
step
--lm_decoder_cpu.beam_threshold LM_DECODER_CPU.BEAM_THRESHOLD
Threshold to prune hypothesis
--lm_decoder_cpu.lm_weight LM_DECODER_CPU.LM_WEIGHT
Weight of language model
--lm_decoder_cpu.word_insertion_score LM_DECODER_CPU.WORD_INSERTION_SCORE
Word insertion score
--lm_decoder_cpu.forerunner_beam_size LM_DECODER_CPU.FORERUNNER_BEAM_SIZE
Maximum number of hypothesis the decoder holds after
each step, for forerunner transcript
--lm_decoder_cpu.forerunner_beam_size_token LM_DECODER_CPU.FORERUNNER_BEAM_SIZE_TOKEN
Maximum number of tokens the decoder considers at each
step, for forerunner transcript
--lm_decoder_cpu.forerunner_beam_threshold LM_DECODER_CPU.FORERUNNER_BEAM_THRESHOLD
Threshold to prune hypothesis, for forerunner
transcript
--lm_decoder_cpu.smearing_mode LM_DECODER_CPU.SMEARING_MODE
Decoder smearing mode. Can be logadd, max or none
--lm_decoder_cpu.forerunner_use_lm LM_DECODER_CPU.FORERUNNER_USE_LM
Bool that controls if the forerunner decoder should
use a language model
Training Language Models¶
Introducing a language model to an ASR pipeline is an easy way to improve accuracy for natural language, and the model can be fine-tuned for niche settings. In short, an n-gram language model estimates the probability distribution over groups of n or fewer consecutive words, P(word-1, …, word-n). By altering or biasing the data on which a language model is trained, and thus the distribution it estimates, you can make different transcriptions more likely and so change the prediction without changing the acoustic model. Riva supports n-gram models trained and exported from either NVIDIA TAO Toolkit or KenLM.
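To make the effect of biasing the training data concrete, here is a minimal sketch that estimates an unsmoothed trigram probability from raw counts. It is a toy illustration only: the sentences and the helper names (ngram_counts, conditional_prob) are made up for this example, and real toolkits such as KenLM apply smoothing rather than plain relative frequencies.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every n-gram (as a tuple of words) in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def conditional_prob(sentences, context, word):
    """Unsmoothed estimate of P(word | context) from relative frequencies."""
    context = tuple(context)
    counts = ngram_counts(sentences, len(context) + 1)
    joint = counts[context + (word,)]
    # Total count of n-grams that start with this context.
    followers = sum(c for gram, c in counts.items() if gram[:-1] == context)
    return joint / followers if followers else 0.0

# Hypothetical corpora: a general one, and one biased toward the target domain.
general = ["i want to recognize speech", "i want to wreck a nice beach"]
domain = general + ["we want to recognize speech"] * 3

p_general = conditional_prob(general, ["want", "to"], "recognize")
p_domain = conditional_prob(domain, ["want", "to"], "recognize")
print(p_general, p_domain)  # 0.5 0.8
```

Adding in-domain sentences raises the estimated probability of "recognize" after "want to", which is the mechanism by which a tailored language model shifts the decoder toward the desired transcript.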
TAO Toolkit Language Model¶
The general TAO Toolkit model development pipeline is outlined in the Model Overview page. To train a new language model, run:
!tao n_gram train -e /specs/nlp/lm/n_gram/train.yaml \
export_to=PATH_TO_TAO_FILE \
training_ds.data_dir=PATH_TO_DATA \
model.order=4 \
model.pruning=[0,1,1,3] \
-k $KEY
To export a pre-trained model, run:
### For export to Riva
!tao n_gram export \
-e /specs/nlp/intent_slot_classification/export.yaml \
-m PATH_TO_TAO_FILE \
export_to=PATH_TO_RIVA_FILE \
binary_type=probing \
-k $KEY
For more information, refer to the TAO Toolkit documentation. For a hands-on example, try the N-Gram Language Model Notebook.
KenLM Setup¶
KenLM is the recommended tool for building language models. This toolkit supports estimating, filtering, and querying n-gram language models. To begin, make sure you have Boost and zlib installed. Depending on your requirements, you may need additional dependencies; check the dependencies list to confirm.
After all dependencies are met, create a separate directory to build KenLM.
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
Estimating¶
The next step is to gather and process data. In most cases, KenLM expects natural-language text that suits your use case. Common preprocessing steps include replacing numerics and removing umlauts, punctuation, or special characters. Most importantly, keep your preprocessing steps consistent between your language model and your acoustic model.
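As a rough illustration of such preprocessing, the following sketch lowercases a line, strips diacritics and punctuation, and collapses whitespace. The function name normalize_line and the exact character set it keeps are assumptions made for this example; the right rules depend on the vocabulary your acoustic model was trained with, and spelling out digits as words is deliberately left out here.

```python
import re
import unicodedata

def normalize_line(text):
    """Lowercase, strip diacritics and punctuation, and collapse whitespace."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks ("ü" -> "u").
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, apostrophe, or space.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())

print(normalize_line("Hello, Zürich!  It's 9 o'clock."))
# hello zurich it's 9 o'clock
```

Applying the same function to both the language model corpus and the acoustic model transcripts is one simple way to keep the two consistent.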
Assuming your current working directory is the build subdirectory of KenLM, bin/lmplz performs estimation on the corpus provided through stdin and writes the ARPA (a human-readable form of the language model) to stdout. Running bin/lmplz documents the command-line arguments; here are a few important ones:
-o: Required. The order of the language model. Depends on the use case, but generally 3-8.
-S: Memory to use. Number followed by % for percentage, b for bytes, K for kilobytes, and so on. Default is 80%.
-T: Temporary file location.
--text arg: Read text from a file instead of stdin.
--arpa arg: Write ARPA to a file instead of stdout.
--prune arg: Prune n-grams with count less than or equal to the given threshold, with one value specified for each order. For example, to prune singleton trigrams, use --prune 0 0 1. The sequence of values must be non-decreasing, and the last value applies to all remaining orders. Default is to not prune. Unigram pruning is not supported, so the first number must be 0.
--limit_vocab_file arg: Read allowed vocabulary separated by whitespace from the file in the argument and prune all n-grams containing vocabulary items not in the list. Can be combined with pruning.
Pruning and limiting vocabulary help to get rid of typos, uncommon words, and general outliers from the dataset, making the resulting ARPA smaller and generally less overfit, but potentially at the cost of losing some jargon or colloquial language.
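The thresholding rule behind --prune and the filtering behind --limit_vocab_file can be sketched on a toy count table as follows. This only mirrors the behavior described above; it is not how KenLM implements pruning, and the helper names and sample counts are invented for illustration.

```python
def prune_counts(counts, thresholds):
    """Drop n-grams whose count is <= the threshold for their order.

    counts maps word tuples to integer counts. thresholds holds one value
    per order, starting at unigrams; the last value is reused for any
    higher order.
    """
    kept = {}
    for gram, count in counts.items():
        order = len(gram)
        threshold = thresholds[min(order, len(thresholds)) - 1]
        if count > threshold:
            kept[gram] = count
    return kept

def limit_vocab(counts, allowed):
    """Drop n-grams containing any word outside the allowed vocabulary."""
    return {g: c for g, c in counts.items() if all(w in allowed for w in g)}

# Hypothetical counts, including the typo "teh".
counts = {
    ("riva",): 3, ("teh",): 1,
    ("use", "riva"): 2, ("use", "teh"): 1,
    ("please", "use", "riva"): 1, ("we", "use", "riva"): 2,
}
pruned = prune_counts(counts, [0, 0, 1])  # prune singleton trigrams only
limited = limit_vocab(pruned, {"riva", "use", "we", "please"})
print(sorted(limited))
# [('riva',), ('use', 'riva'), ('we', 'use', 'riva')]
```

The singleton trigram is removed by pruning, while the typo survives pruning (its orders use threshold 0) and is only removed by the vocabulary filter, which shows why the two options are often combined.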
With the appropriate options, the language model can be estimated.
bin/lmplz -o 4 < text > text.arpa
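If you script estimation, for example to compare several orders or pruning settings, a small wrapper such as the sketch below can assemble the command from the options listed earlier. The function names and the default bin/lmplz path are assumptions for this example; running the command requires a built KenLM and an existing corpus file.

```python
import subprocess

def lmplz_command(order, text_path, arpa_path, prune=None, lmplz="bin/lmplz"):
    """Build the argument list for an lmplz run."""
    cmd = [lmplz, "-o", str(order), "--text", text_path, "--arpa", arpa_path]
    if prune:
        # One threshold per order, e.g. [0, 0, 1] to prune singleton trigrams.
        cmd += ["--prune"] + [str(p) for p in prune]
    return cmd

def estimate(order, text_path, arpa_path, prune=None):
    """Run lmplz and raise CalledProcessError if it exits with an error."""
    subprocess.run(lmplz_command(order, text_path, arpa_path, prune), check=True)

cmd = lmplz_command(4, "text", "text.arpa", prune=[0, 0, 1])
print(" ".join(cmd))
# bin/lmplz -o 4 --text text --arpa text.arpa --prune 0 0 1
```

Passing the arguments as a list avoids shell quoting issues when file paths contain spaces.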
Querying and Evaluation¶
For faster loading, convert the ARPA file to binary.
bin/build_binary text.arpa text.binary
The binary or ARPA file can be queried from the command line.
bin/query text.binary < data
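A common way to compare language models on held-out text is perplexity, which can be derived from per-token log probabilities. The sketch below assumes base-10 log probabilities, the convention used in ARPA files; the function name and the sample scores are hypothetical.

```python
def perplexity(log10_probs):
    """Perplexity of a token sequence from per-token log10 probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token score")
    average = sum(log10_probs) / len(log10_probs)
    return 10 ** (-average)

# Hypothetical scores for a three-token sentence.
scores = [-1.0, -2.0, -3.0]
print(perplexity(scores))  # 100.0
```

Lower perplexity on text from your target domain generally indicates a better fit, although the effect on transcription accuracy should still be checked end to end.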