Speech Recognition

Automatic Speech Recognition (ASR) takes an audio stream or audio buffer as input and returns one or more text transcripts, along with optional additional metadata. Riva ASR is a full, GPU-accelerated speech recognition pipeline with optimized performance and accuracy, and it supports synchronous (offline) and streaming recognition modes.

Riva ASR features include:

  • Support for offline and streaming use cases

  • A streaming mode that returns intermediate transcripts with low latency

  • GPU-accelerated feature extraction

  • Multiple (and growing) acoustic model architecture options accelerated by NVIDIA TensorRT

  • Beam search decoder based on n-gram language models

  • Voice activity detection algorithms (CTC-based)

  • Automatic punctuation

  • Ability to return top-N transcripts from beam decoder

  • Word-level timestamps

  • Inverse Text Normalization (ITN)

For more information, refer to the Speech To Text notebook, which is an end-to-end workflow for speech recognition. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.

Model Architectures

Citrinet

Citrinet is the recommended end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. It is a deep residual neural model that uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence or transducer models.

Details on the model architecture can be found in the paper Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition.

Conformer-CTC

The Conformer-CTC model is a non-autoregressive variant of the Conformer model for Automatic Speech Recognition which uses CTC loss/decoding instead of Transducer. For more information, refer to Conformer-CTC Model.

The model used in Riva is a large version of Conformer-CTC (around 120M parameters) trained on NeMo ASRSet. The model transcribes speech using the lowercase English alphabet, along with spaces and apostrophes.

Jasper

The Jasper model is an end-to-end neural acoustic model for ASR that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment.

The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment during a post-processing step called decoding.

Details on the model architecture can be found in the paper Jasper: An End-to-End Convolutional Neural Acoustic Model.

QuartzNet

QuartzNet is the next generation of the Jasper speech recognition model. It improves on Jasper by replacing 1D convolutions with 1D time-channel separable convolutions. Doing this effectively factorizes the convolution kernels, enabling deeper models while reducing the number of parameters by over an order of magnitude.

Details on the model architecture can be found in the paper QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.

Normalization

Riva implements inverse text normalization (ITN) for ASR requests using weighted finite-state transducer (WFST) based models to convert spoken-domain output from an ASR model into written-domain text, improving the readability of the ASR system's output.

Details on the model architecture can be found in the paper NeMo Inverse Text Normalization: From Development To Production.

Languages Supported

Language    Language code    Supported Architectures
English     en-US            Jasper, QuartzNet, Citrinet, Conformer-CTC
German      de-DE            Citrinet
Russian     ru-RU            Citrinet
Spanish     es-US            Citrinet

Services

Riva ASR supports both offline/batch and streaming inference modes.

Offline Recognition

In offline (synchronous) mode, the full audio signal is first read from a file or captured from a microphone. Once the entire signal has been captured, the client makes a request to the Riva Speech Server to transcribe it and then waits for the response.

Note

This method can have long latency since the processing of the audio signal starts after the full audio signal has been captured or read from the file.
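For example, once the server is running, an offline request can be exercised with the sample riva_asr_client binary included in the Riva client image. This is only an illustrative sketch; the binary name, flags, and server address are assumptions and may differ between releases:

riva_asr_client --riva_uri=localhost:50051 --audio_file=<audio_file.wav>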

Streaming Recognition

In streaming recognition mode, as soon as an audio segment of a specified length is captured or read, a request is made to the server to process that segment. On the server side, a response is returned as soon as an intermediate transcript is available.

Note

You can select the length of the audio segments based on speed and memory requirements.
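As an illustration, a streaming request can be exercised with the sample riva_streaming_asr_client binary included in the Riva client image. Again, this is a sketch; the binary name, flags, and server address are assumptions and may differ between releases:

riva_streaming_asr_client --riva_uri=localhost:50051 --audio_file=<audio_file.wav>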

Refer to the riva/proto/riva_asr.proto documentation for more details.

Pipeline Configuration

In the simplest use case, you can deploy an ASR model without any language model as follows:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --acoustic_model_name=<acoustic_model_name>

where:

  • <encryption_key> is the encryption key used during the export of the .riva file.

  • <pipeline_name> and <acoustic_model_name> are optional user-defined names for the components in the model repository.

    Note

    <acoustic_model_name> is global and can conflict across model pipelines. Override this only in cases when you know what other models will be deployed and there will not be any incompatibilities in model weights or input shapes.

  • <riva_filename> is the name of the riva file to use as input.

  • <rmir_filename> is the Riva rmir file that is generated.

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. Since no language model is specified, the Riva greedy decoder is used to predict the transcript based on the output of the acoustic model. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR and Riva filenames; otherwise, this is unnecessary.
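The resulting RMIR file can then be deployed into a Riva model repository with riva-deploy. A minimal sketch, assuming the /data/models model repository location used by the Quick Start scripts:

riva-deploy /servicemaker-dev/<rmir_filename>:<encryption_key> /data/models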

The following summary lists the riva-build commands used to generate the RMIR files from the Quick Start scripts for different models, modes, and their limitations:

Citrinet-1024, Streaming Low-Latency (no limitations):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-english-asr-streaming \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

Citrinet-1024, Streaming High-Throughput (no limitations):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-english-asr-streaming \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

Citrinet-1024, Offline (limitation: maximum audio duration of 15 minutes):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --offline \
   --name=citrinet-1024-english-asr-offline \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --chunk_size=900 \
   --left_padding_size=0. \
   --right_padding_size=0. \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

Conformer-CTC, Streaming Low-Latency (limitation: ONNX runtime only):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=conformer-en-US-asr-streaming \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

Conformer-CTC, Streaming High-Throughput (limitation: ONNX runtime only):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=conformer-en-US-asr-streaming-throughput \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

Conformer-CTC, Offline (limitations: ONNX runtime only; maximum audio duration of 3 minutes):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --offline \
   --name=conformer-en-US-asr-offline \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --vad.vad_start_history=200 \
   --chunk_size=200 \
   --left_padding_size=0. \
   --right_padding_size=0. \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

QuartzNet, Streaming Low-Latency (no limitations):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=quartznet-en-US-asr-streaming \
   --decoder_type=os2s \
   --decoding_language_model_binary=<lm_binary> \
   --language_code=en-US

QuartzNet, Streaming High-Throughput (no limitations):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=quartznet-en-US-asr-streaming-throughput \
   --chunk_size=0.8 \
   --left_padding_size=0.8 \
   --right_padding_size=0.8 \
   --decoder_type=os2s \
   --decoding_language_model_binary=<lm_binary> \
   --language_code=en-US

QuartzNet, Offline (limitation: maximum audio duration of 15 minutes):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --offline \
   --name=quartznet-en-US-asr-offline \
   --chunk_size=900 \
   --left_padding_size=0. \
   --right_padding_size=0. \
   --decoder_type=os2s \
   --decoding_language_model_binary=<lm_binary> \
   --language_code=en-US

Jasper, Streaming Low-Latency (no limitations):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=jasper-en-US-asr-streaming \
   --decoder_type=os2s \
   --decoding_language_model_binary=<lm_binary> \
   --language_code=en-US

Jasper, Streaming High-Throughput (no limitations):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=jasper-en-US-asr-streaming-throughput \
   --chunk_size=0.8 \
   --left_padding_size=0.8 \
   --right_padding_size=0.8 \
   --decoder_type=os2s \
   --decoding_language_model_binary=<lm_binary> \
   --language_code=en-US

Jasper, Offline (limitation: maximum audio duration of 15 minutes):

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --offline \
   --name=jasper-en-US-asr-offline \
   --chunk_size=900 \
   --left_padding_size=0. \
   --right_padding_size=0. \
   --decoder_type=os2s \
   --decoding_language_model_binary=<lm_binary> \
   --language_code=en-US

Streaming/Offline Configuration

By default, the Riva RMIR file is configured to be used with the Riva StreamingRecognize RPC call, for streaming use cases. To use the Recognize RPC call, generate the Riva RMIR file by adding the --offline option.

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --offline \
    --chunk_size=900 \
    --padding_size=0.

where chunk_size specifies the maximum audio duration in seconds. This value affects GPU memory usage and can be increased or decreased depending on the deployment scenario. Furthermore, the default streaming Riva RMIR configuration provides intermediate transcripts with very low latency. For use cases where supporting additional concurrent audio streams is more important, run:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --chunk_size=0.8 \
    --padding_size=0.8

Citrinet and Conformer-CTC Acoustic Models

The Citrinet and Conformer-CTC acoustic models have different properties than Jasper and QuartzNet. We recommend the following riva-build parameters to export Citrinet or Conformer-CTC for low-latency streaming recognition:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=80 \
    --greedy_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False

For high throughput streaming recognition, chunk_size and padding_size can be set as follows:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --chunk_size=0.8 \
    --padding_size=1.6 \
    --ms_per_timestep=80 \
    --greedy_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False

Finally, for offline recognition, we recommend the following settings:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --offline \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --chunk_size=900. \
    --padding_size=0. \
    --ms_per_timestep=80 \
    --greedy_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False

Language Models

Riva ASR supports decoding with an n-gram language model. The n-gram language model can be provided in a few different ways.

  1. A .riva file exported from TAO Toolkit.

  2. A .arpa format file.

  3. A KenLM binary format file.

For more information on building language models, see Training Language Models.
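As an illustration, a KenLM binary can be produced from a plain-text corpus with the standard KenLM tools. This is a sketch; corpus.txt and the 4-gram order are placeholder choices, and the KenLM tools must be available in your environment:

lmplz -o 4 < corpus.txt > custom_lm.arpa
build_binary custom_lm.arpa custom_lm.bin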

When using the Jasper or QuartzNet acoustic model, you can configure the Riva ASR pipeline to use an n-gram language model stored in .riva format by running:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<acoustic_riva_filename>:<encryption_key> \
    /servicemaker-dev/<n_gram_riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=os2s

When using the Citrinet or Conformer-CTC acoustic model, specify the language model by running the following riva-build command:

riva-build speech_recognition \
    /servicemaker-dev/<jmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    /servicemaker-dev/<n_gram_riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=80 \
    --flashlight_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --decoding_vocab=<vocabulary_filename>

where <vocabulary_filename> is the vocabulary file used by the Flashlight lexicon decoder. The vocabulary file must contain one vocabulary word per line.
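For illustration, one way such a vocabulary file might be produced from a plain-text corpus (corpus.txt is a placeholder) is:

# split the corpus on whitespace and keep one unique word per line
tr -s '[:space:]' '\n' < corpus.txt | sort -u > decoding_vocab.txt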

When using the Jasper or QuartzNet acoustic model, you can configure the Riva ASR pipeline to use an n-gram language model stored in .arpa format by running:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=os2s \
    --decoding_language_model_arpa=<arpa_filename>

When using the Citrinet or Conformer-CTC acoustic model, the language model can be specified with the following riva-build command:

riva-build speech_recognition \
    /servicemaker-dev/<jmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=80 \
    --flashlight_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --decoding_language_model_arpa=<arpa_filename> \
    --decoding_vocab=<vocabulary_filename>

When using a KenLM binary file to specify the language model, one can generate the Riva RMIR with:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=os2s \
    --decoding_language_model_binary=<KenLM_binary_filename>

when using the Jasper or QuartzNet acoustic model and with:

riva-build speech_recognition \
    /servicemaker-dev/<jmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=80 \
    --flashlight_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --decoding_language_model_binary=<KENLM_binary_filename> \
    --decoding_vocab=<vocab_filename>

when using the Citrinet or Conformer-CTC acoustic model.

The decoder language model hyperparameters can also be specified from the riva-build command. When using the Jasper or QuartzNet acoustic models, the language model parameters alpha, beta, and beam_search_width can be specified with:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=os2s \
    --decoding_language_model_binary=<KenLM_binary_filename> \
    --os2s_decoder.beam_search_width=<beam_search_width> \
    --os2s_decoder.language_model_alpha=<language_model_alpha> \
    --os2s_decoder.language_model_beta=<language_model_beta>

With the Citrinet or Conformer-CTC acoustic model, you can specify the Flashlight decoder hyperparameters beam_size, beam_size_token, beam_threshold, lm_weight, and word_insertion_score as follows:

riva-build speech_recognition \
    /servicemaker-dev/<jmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=80 \
    --flashlight_decoder.asr_model_delay=-1 \
    --vad.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --decoding_language_model_binary=<KenLM_binary_filename> \
    --decoding_vocab=<vocab_filename> \
    --flashlight_decoder.beam_size=<beam_size> \
    --flashlight_decoder.beam_size_token=<beam_size_token> \
    --flashlight_decoder.beam_threshold=<beam_threshold> \
    --flashlight_decoder.lm_weight=<lm_weight> \
    --flashlight_decoder.word_insertion_score=<word_insertion_score>

Flashlight Decoder Lexicon

The Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the provided lexicon. The decoder vocabulary can be specified with the --decoding_vocab parameter of the riva-build command; you'll need to ensure that words of interest are in this vocabulary file. The Riva Service Maker automatically tokenizes the words in the vocabulary file to build the lexicon. It's also possible to add additional tokenizations for the words in the lexicon by performing the following steps.

The riva-build and riva-deploy commands provided in the section above generate the lexicon tokenizations in the /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt file.

To add additional tokenizations to the lexicon, copy the lexicon file:

cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt  decoding_lexicon.txt

and modify it to add the SentencePiece tokenizations for the word of interest. For example, one could add:

manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew

to the decoding_lexicon.txt file so that the word manu is generated in the transcript if the acoustic model predicts those tokens. Ensure that the new lines follow the same indentation/space pattern as the rest of the file and that the tokens used are part of the tokenizer model. Once this is done, regenerate the model repository with the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.
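For example, the streaming Citrinet or Conformer-CTC pipeline built earlier with the Flashlight decoder could be regenerated with the modified lexicon. This is only a sketch; the remaining flags mirror the earlier examples:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --decoding_language_model_binary=<lm_binary> \
    --decoding_lexicon=decoding_lexicon.txt \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=80 \
    --flashlight_decoder.asr_model_delay=-1 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False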

GPU-accelerated Decoder

The Riva ASR pipeline can also use a GPU-accelerated weighted finite-state transducer (WFST) decoder that was initially developed for Kaldi. To use the GPU decoder with a language model defined by an .arpa file, run:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoding_language_model_arpa=<decoding_lm_arpa_filename> \
    --decoder_type=kaldi

where <decoding_lm_arpa_filename> is the language model .arpa file to use during the WFST decoding phase.

Note

Conversion from an .arpa file to a WFST graph can take a very long time, especially for large language models.

Also, large language models will increase GPU memory utilization. When using the GPU decoder, it is recommended to use different language models for the WFST decoding phase and the lattice rescoring phase. This can be achieved by using the following riva-build command:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoding_language_model_arpa=<decoding_lm_arpa_filename> \
    --rescoring_language_model_arpa=<rescoring_lm_arpa_filename> \
    --decoder_type=kaldi

where:

  • <decoding_lm_arpa_filename> is the language model .arpa file to use during the WFST decoding phase

  • <rescoring_lm_arpa_filename> is the language model .arpa file to use during the lattice rescoring phase

Typically, one would use a small language model for the WFST decoding phase (for example, a pruned 2 or 3-gram language model) and a larger language model for the lattice rescoring phase (for example, an unpruned 4-gram language model).

For advanced users, it is also possible to configure the GPU decoder by specifying the decoding WFST file and the vocabulary directly, instead of using an .arpa file. For example:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoding_language_model_fst=<decoding_lm_fst_filename> \
    --decoding_language_model_words=<decoding_lm_words_file> \
    --decoder_type=kaldi

Furthermore, you can specify the .carpa files to use when lattice rescoring is needed:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoding_language_model_fst=<decoding_lm_fst_filename> \
    --decoding_language_model_carpa=<decoding_lm_carpa_filename> \
    --decoding_language_model_words=<decoding_lm_words_filename> \
    --rescoring_language_model_carpa=<rescoring_lm_carpa_filename> \
    --decoder_type=kaldi

where:

  • <decoding_lm_carpa_filename> is the language model in constant ARPA (.carpa) representation to use during the WFST decoding phase

  • <rescoring_lm_carpa_filename> is the language model in constant ARPA (.carpa) representation to use during the lattice rescoring phase

The GPU decoder hyperparameters (default_beam, lattice_beam, word_insertion_penalty and acoustic_scale) can be set with the riva-build command as follows:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoding_language_model_arpa=<decoding_lm_arpa_filename> \
    --lattice_beam=<lattice_beam> \
    --kaldi_decoder.default_beam=<default_beam> \
    --kaldi_decoder.acoustic_scale=<acoustic_scale> \
    --rescorer.word_insertion_penalty=<word_insertion_penalty> \
    --decoder_type=kaldi

Beginning/End of Utterance Detection

Riva ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is used to reset the ASR decoder state and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in a 300 ms window are non-blank characters, and the end of an utterance is flagged when 98% of the frames in an 800 ms window are blank characters. You can tune these values for your particular use case with the following riva-build command:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --vad.vad_start_history=300 \
    --vad.vad_start_th=0.2 \
    --vad.vad_stop_history=800 \
    --vad.vad_stop_th=0.98

Additionally, it is possible to disable the beginning/end of utterance detection with the following command:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --vad.vad_type=none

Note that in this case, the decoder state is reset only after the client has sent the full audio signal. Similarly, the punctuator model is called only once.

Inverse Text Normalization

Currently, the grammars are limited to English. In a future release, additional information on training, tuning, and loading custom grammars will be available.

Selecting Custom Model at Runtime

When receiving requests from the client application, the Riva server selects the deployed ASR model to use based on the RecognitionConfig of the client request. If no models are available to fulfill the request, an error is returned. In the case where multiple models might be able to fulfill the client request, one model is selected at random. You can also explicitly select which ASR model to use by setting the model field of the RecognitionConfig protobuf object to the value of <pipeline_name> which was used with the riva-build command. This enables you to deploy multiple ASR pipelines concurrently and select which one to use at runtime.

Generating Multiple Transcript Hypotheses

By default, the Riva ASR pipeline is configured to generate only the best transcript hypothesis for each utterance. It is possible to generate multiple transcript hypotheses by passing the parameter --max_supported_transcripts=N to the riva-build command, where N is the maximum number of hypotheses to generate. With these changes, the client application can retrieve the multiple hypotheses by setting the max_alternatives field of RecognitionConfig to a value greater than 1.
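For example, a sketch of a riva-build invocation that enables up to 5 hypotheses for a Flashlight decoder pipeline (5 is an arbitrary illustrative value, and the other flags follow the earlier examples):

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --decoding_language_model_binary=<lm_binary> \
    --decoding_vocab=<vocabulary_filename> \
    --max_supported_transcripts=5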

Riva-build Optional Parameters

For details about the parameters passed to riva-build to customize the ASR pipeline, issue:

riva-build speech_recognition -h

The following list includes descriptions for all optional parameters currently recognized by riva-build:

usage: riva-build speech_recognition [-h] [-f] [--language_code LANGUAGE_CODE]
                                     [--max_batch_size MAX_BATCH_SIZE]
                                     [--acoustic_model_name ACOUSTIC_MODEL_NAME]
                                     [--name NAME] [--streaming] [--offline]
                                     [--chunk_size CHUNK_SIZE]
                                     [--padding_factor PADDING_FACTOR]
                                     [--left_padding_size LEFT_PADDING_SIZE]
                                     [--right_padding_size RIGHT_PADDING_SIZE]
                                     [--padding_size PADDING_SIZE]
                                     [--max_supported_transcripts MAX_SUPPORTED_TRANSCRIPTS]
                                     [--compute_timestamps COMPUTE_TIMESTAMPS]
                                     [--ms_per_timestep MS_PER_TIMESTEP]
                                     [--lattice_beam LATTICE_BEAM]
                                     [--decoding_language_model_arpa DECODING_LANGUAGE_MODEL_ARPA]
                                     [--decoding_language_model_binary DECODING_LANGUAGE_MODEL_BINARY]
                                     [--decoding_language_model_fst DECODING_LANGUAGE_MODEL_FST]
                                     [--decoding_language_model_words DECODING_LANGUAGE_MODEL_WORDS]
                                     [--rescoring_language_model_arpa RESCORING_LANGUAGE_MODEL_ARPA]
                                     [--decoding_language_model_carpa DECODING_LANGUAGE_MODEL_CARPA]
                                     [--rescoring_language_model_carpa RESCORING_LANGUAGE_MODEL_CARPA]
                                     [--decoding_lexicon DECODING_LEXICON]
                                     [--decoding_vocab DECODING_VOCAB]
                                     [--tokenizer_model TOKENIZER_MODEL]
                                     [--decoder_type DECODER_TYPE]
                                     [--featurizer.max_sequence_idle_microseconds FEATURIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--featurizer.max_batch_size FEATURIZER.MAX_BATCH_SIZE]
                                     [--featurizer.min_batch_size FEATURIZER.MIN_BATCH_SIZE]
                                     [--featurizer.opt_batch_size FEATURIZER.OPT_BATCH_SIZE]
                                     [--featurizer.preferred_batch_size FEATURIZER.PREFERRED_BATCH_SIZE]
                                     [--featurizer.batching_type FEATURIZER.BATCHING_TYPE]
                                     [--featurizer.preserve_ordering FEATURIZER.PRESERVE_ORDERING]
                                     [--featurizer.instance_group_count FEATURIZER.INSTANCE_GROUP_COUNT]
                                     [--featurizer.max_queue_delay_microseconds FEATURIZER.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--featurizer.max_execution_batch_size FEATURIZER.MAX_EXECUTION_BATCH_SIZE]
                                     [--featurizer.gain FEATURIZER.GAIN]
                                     [--featurizer.dither FEATURIZER.DITHER]
                                     [--featurizer.stddev_floor FEATURIZER.STDDEV_FLOOR]
                                     [--featurizer.use_utterance_norm_params FEATURIZER.USE_UTTERANCE_NORM_PARAMS]
                                     [--featurizer.precalc_norm_time_steps FEATURIZER.PRECALC_NORM_TIME_STEPS]
                                     [--featurizer.precalc_norm_params FEATURIZER.PRECALC_NORM_PARAMS]
                                     [--featurizer.norm_per_feature FEATURIZER.NORM_PER_FEATURE]
                                     [--featurizer.mean FEATURIZER.MEAN]
                                     [--featurizer.stddev FEATURIZER.STDDEV]
                                     [--featurizer.transpose FEATURIZER.TRANSPOSE]
                                     [--featurizer.padding_size FEATURIZER.PADDING_SIZE]
                                     [--nn.max_sequence_idle_microseconds NN.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--nn.max_batch_size NN.MAX_BATCH_SIZE]
                                     [--nn.min_batch_size NN.MIN_BATCH_SIZE]
                                     [--nn.opt_batch_size NN.OPT_BATCH_SIZE]
                                     [--nn.preferred_batch_size NN.PREFERRED_BATCH_SIZE]
                                     [--nn.batching_type NN.BATCHING_TYPE]
                                     [--nn.preserve_ordering NN.PRESERVE_ORDERING]
                                     [--nn.instance_group_count NN.INSTANCE_GROUP_COUNT]
                                     [--nn.max_queue_delay_microseconds NN.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--nn.trt_max_workspace_size NN.TRT_MAX_WORKSPACE_SIZE]
                                     [--nn.use_onnx_runtime]
                                     [--nn.use_trt_fp32]
                                     [--vad.max_sequence_idle_microseconds VAD.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--vad.max_batch_size VAD.MAX_BATCH_SIZE]
                                     [--vad.min_batch_size VAD.MIN_BATCH_SIZE]
                                     [--vad.opt_batch_size VAD.OPT_BATCH_SIZE]
                                     [--vad.preferred_batch_size VAD.PREFERRED_BATCH_SIZE]
                                     [--vad.batching_type VAD.BATCHING_TYPE]
                                     [--vad.preserve_ordering VAD.PRESERVE_ORDERING]
                                     [--vad.instance_group_count VAD.INSTANCE_GROUP_COUNT]
                                     [--vad.max_queue_delay_microseconds VAD.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--vad.ms_per_timestep VAD.MS_PER_TIMESTEP]
                                     [--vad.vad_start_history VAD.VAD_START_HISTORY]
                                     [--vad.vad_stop_history VAD.VAD_STOP_HISTORY]
                                     [--vad.vad_start_th VAD.VAD_START_TH]
                                     [--vad.vad_stop_th VAD.VAD_STOP_TH]
                                     [--vad.vad_type VAD.VAD_TYPE]
                                     [--vad.residue_blanks_at_start VAD.RESIDUE_BLANKS_AT_START]
                                     [--vad.residue_blanks_at_end VAD.RESIDUE_BLANKS_AT_END]
                                     [--vad.vocab_file VAD.VOCAB_FILE]
                                     [--flashlight_decoder.max_sequence_idle_microseconds FLASHLIGHT_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--flashlight_decoder.max_batch_size FLASHLIGHT_DECODER.MAX_BATCH_SIZE]
                                     [--flashlight_decoder.min_batch_size FLASHLIGHT_DECODER.MIN_BATCH_SIZE]
                                     [--flashlight_decoder.opt_batch_size FLASHLIGHT_DECODER.OPT_BATCH_SIZE]
                                     [--flashlight_decoder.preferred_batch_size FLASHLIGHT_DECODER.PREFERRED_BATCH_SIZE]
                                     [--flashlight_decoder.batching_type FLASHLIGHT_DECODER.BATCHING_TYPE]
                                     [--flashlight_decoder.preserve_ordering FLASHLIGHT_DECODER.PRESERVE_ORDERING]
                                     [--flashlight_decoder.instance_group_count FLASHLIGHT_DECODER.INSTANCE_GROUP_COUNT]
                                     [--flashlight_decoder.max_queue_delay_microseconds FLASHLIGHT_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--flashlight_decoder.max_execution_batch_size FLASHLIGHT_DECODER.MAX_EXECUTION_BATCH_SIZE]
                                     [--flashlight_decoder.decoder_type FLASHLIGHT_DECODER.DECODER_TYPE]
                                     [--flashlight_decoder.padding_size FLASHLIGHT_DECODER.PADDING_SIZE]
                                     [--flashlight_decoder.max_supported_transcripts FLASHLIGHT_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
                                     [--flashlight_decoder.asr_model_delay FLASHLIGHT_DECODER.ASR_MODEL_DELAY]
                                     [--flashlight_decoder.ms_per_timestep FLASHLIGHT_DECODER.MS_PER_TIMESTEP]
                                     [--flashlight_decoder.vocab_file FLASHLIGHT_DECODER.VOCAB_FILE]
                                     [--flashlight_decoder.decoder_num_worker_threads FLASHLIGHT_DECODER.DECODER_NUM_WORKER_THREADS]
                                     [--flashlight_decoder.language_model_file FLASHLIGHT_DECODER.LANGUAGE_MODEL_FILE]
                                     [--flashlight_decoder.lexicon_file FLASHLIGHT_DECODER.LEXICON_FILE]
                                     [--flashlight_decoder.beam_size FLASHLIGHT_DECODER.BEAM_SIZE]
                                     [--flashlight_decoder.beam_size_token FLASHLIGHT_DECODER.BEAM_SIZE_TOKEN]
                                     [--flashlight_decoder.beam_threshold FLASHLIGHT_DECODER.BEAM_THRESHOLD]
                                     [--flashlight_decoder.lm_weight FLASHLIGHT_DECODER.LM_WEIGHT]
                                     [--flashlight_decoder.blank_token FLASHLIGHT_DECODER.BLANK_TOKEN]
                                     [--flashlight_decoder.sil_token FLASHLIGHT_DECODER.SIL_TOKEN]
                                     [--flashlight_decoder.word_insertion_score FLASHLIGHT_DECODER.WORD_INSERTION_SCORE]
                                     [--flashlight_decoder.forerunner_beam_size FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE]
                                     [--flashlight_decoder.forerunner_beam_size_token FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE_TOKEN]
                                     [--flashlight_decoder.forerunner_beam_threshold FLASHLIGHT_DECODER.FORERUNNER_BEAM_THRESHOLD]
                                     [--flashlight_decoder.smearing_mode FLASHLIGHT_DECODER.SMEARING_MODE]
                                     [--flashlight_decoder.forerunner_use_lm FLASHLIGHT_DECODER.FORERUNNER_USE_LM]
                                     [--greedy_decoder.max_sequence_idle_microseconds GREEDY_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--greedy_decoder.max_batch_size GREEDY_DECODER.MAX_BATCH_SIZE]
                                     [--greedy_decoder.min_batch_size GREEDY_DECODER.MIN_BATCH_SIZE]
                                     [--greedy_decoder.opt_batch_size GREEDY_DECODER.OPT_BATCH_SIZE]
                                     [--greedy_decoder.preferred_batch_size GREEDY_DECODER.PREFERRED_BATCH_SIZE]
                                     [--greedy_decoder.batching_type GREEDY_DECODER.BATCHING_TYPE]
                                     [--greedy_decoder.preserve_ordering GREEDY_DECODER.PRESERVE_ORDERING]
                                     [--greedy_decoder.instance_group_count GREEDY_DECODER.INSTANCE_GROUP_COUNT]
                                     [--greedy_decoder.max_queue_delay_microseconds GREEDY_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--greedy_decoder.max_execution_batch_size GREEDY_DECODER.MAX_EXECUTION_BATCH_SIZE]
                                     [--greedy_decoder.decoder_type GREEDY_DECODER.DECODER_TYPE]
                                     [--greedy_decoder.padding_size GREEDY_DECODER.PADDING_SIZE]
                                     [--greedy_decoder.max_supported_transcripts GREEDY_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
                                     [--greedy_decoder.asr_model_delay GREEDY_DECODER.ASR_MODEL_DELAY]
                                     [--greedy_decoder.ms_per_timestep GREEDY_DECODER.MS_PER_TIMESTEP]
                                     [--greedy_decoder.vocab_file GREEDY_DECODER.VOCAB_FILE]
                                     [--greedy_decoder.decoder_num_worker_threads GREEDY_DECODER.DECODER_NUM_WORKER_THREADS]
                                     [--os2s_decoder.max_sequence_idle_microseconds OS2S_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--os2s_decoder.max_batch_size OS2S_DECODER.MAX_BATCH_SIZE]
                                     [--os2s_decoder.min_batch_size OS2S_DECODER.MIN_BATCH_SIZE]
                                     [--os2s_decoder.opt_batch_size OS2S_DECODER.OPT_BATCH_SIZE]
                                     [--os2s_decoder.preferred_batch_size OS2S_DECODER.PREFERRED_BATCH_SIZE]
                                     [--os2s_decoder.batching_type OS2S_DECODER.BATCHING_TYPE]
                                     [--os2s_decoder.preserve_ordering OS2S_DECODER.PRESERVE_ORDERING]
                                     [--os2s_decoder.instance_group_count OS2S_DECODER.INSTANCE_GROUP_COUNT]
                                     [--os2s_decoder.max_queue_delay_microseconds OS2S_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--os2s_decoder.max_execution_batch_size OS2S_DECODER.MAX_EXECUTION_BATCH_SIZE]
                                     [--os2s_decoder.decoder_type OS2S_DECODER.DECODER_TYPE]
                                     [--os2s_decoder.padding_size OS2S_DECODER.PADDING_SIZE]
                                     [--os2s_decoder.max_supported_transcripts OS2S_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
                                     [--os2s_decoder.asr_model_delay OS2S_DECODER.ASR_MODEL_DELAY]
                                     [--os2s_decoder.ms_per_timestep OS2S_DECODER.MS_PER_TIMESTEP]
                                     [--os2s_decoder.vocab_file OS2S_DECODER.VOCAB_FILE]
                                     [--os2s_decoder.decoder_num_worker_threads OS2S_DECODER.DECODER_NUM_WORKER_THREADS]
                                     [--os2s_decoder.language_model_file OS2S_DECODER.LANGUAGE_MODEL_FILE]
                                     [--os2s_decoder.beam_search_width OS2S_DECODER.BEAM_SEARCH_WIDTH]
                                     [--os2s_decoder.language_model_alpha OS2S_DECODER.LANGUAGE_MODEL_ALPHA]
                                     [--os2s_decoder.language_model_beta OS2S_DECODER.LANGUAGE_MODEL_BETA]
                                     [--kaldi_decoder.max_sequence_idle_microseconds KALDI_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--kaldi_decoder.max_batch_size KALDI_DECODER.MAX_BATCH_SIZE]
                                     [--kaldi_decoder.min_batch_size KALDI_DECODER.MIN_BATCH_SIZE]
                                     [--kaldi_decoder.opt_batch_size KALDI_DECODER.OPT_BATCH_SIZE]
                                     [--kaldi_decoder.preferred_batch_size KALDI_DECODER.PREFERRED_BATCH_SIZE]
                                     [--kaldi_decoder.batching_type KALDI_DECODER.BATCHING_TYPE]
                                     [--kaldi_decoder.preserve_ordering KALDI_DECODER.PRESERVE_ORDERING]
                                     [--kaldi_decoder.instance_group_count KALDI_DECODER.INSTANCE_GROUP_COUNT]
                                     [--kaldi_decoder.max_queue_delay_microseconds KALDI_DECODER.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--kaldi_decoder.max_execution_batch_size KALDI_DECODER.MAX_EXECUTION_BATCH_SIZE]
                                     [--kaldi_decoder.decoder_type KALDI_DECODER.DECODER_TYPE]
                                     [--kaldi_decoder.padding_size KALDI_DECODER.PADDING_SIZE]
                                     [--kaldi_decoder.max_supported_transcripts KALDI_DECODER.MAX_SUPPORTED_TRANSCRIPTS]
                                     [--kaldi_decoder.asr_model_delay KALDI_DECODER.ASR_MODEL_DELAY]
                                     [--kaldi_decoder.ms_per_timestep KALDI_DECODER.MS_PER_TIMESTEP]
                                     [--kaldi_decoder.vocab_file KALDI_DECODER.VOCAB_FILE]
                                     [--kaldi_decoder.decoder_num_worker_threads KALDI_DECODER.DECODER_NUM_WORKER_THREADS]
                                     [--kaldi_decoder.fst_filename KALDI_DECODER.FST_FILENAME]
                                     [--kaldi_decoder.word_syms_filename KALDI_DECODER.WORD_SYMS_FILENAME]
                                     [--kaldi_decoder.default_beam KALDI_DECODER.DEFAULT_BEAM]
                                     [--kaldi_decoder.max_active KALDI_DECODER.MAX_ACTIVE]
                                     [--kaldi_decoder.acoustic_scale KALDI_DECODER.ACOUSTIC_SCALE]
                                     [--kaldi_decoder.decoder_num_copy_threads KALDI_DECODER.DECODER_NUM_COPY_THREADS]
                                     [--kaldi_decoder.determinize_lattice KALDI_DECODER.DETERMINIZE_LATTICE]
                                     [--rescorer.max_sequence_idle_microseconds RESCORER.MAX_SEQUENCE_IDLE_MICROSECONDS]
                                     [--rescorer.max_batch_size RESCORER.MAX_BATCH_SIZE]
                                     [--rescorer.min_batch_size RESCORER.MIN_BATCH_SIZE]
                                     [--rescorer.opt_batch_size RESCORER.OPT_BATCH_SIZE]
                                     [--rescorer.preferred_batch_size RESCORER.PREFERRED_BATCH_SIZE]
                                     [--rescorer.batching_type RESCORER.BATCHING_TYPE]
                                     [--rescorer.preserve_ordering RESCORER.PRESERVE_ORDERING]
                                     [--rescorer.instance_group_count RESCORER.INSTANCE_GROUP_COUNT]
                                     [--rescorer.max_queue_delay_microseconds RESCORER.MAX_QUEUE_DELAY_MICROSECONDS]
                                     [--rescorer.max_supported_transcripts RESCORER.MAX_SUPPORTED_TRANSCRIPTS]
                                     [--rescorer.score_lm_carpa_filename RESCORER.SCORE_LM_CARPA_FILENAME]
                                     [--rescorer.decode_lm_carpa_filename RESCORER.DECODE_LM_CARPA_FILENAME]
                                     [--rescorer.word_syms_filename RESCORER.WORD_SYMS_FILENAME]
                                     [--rescorer.word_insertion_penalty RESCORER.WORD_INSERTION_PENALTY]
                                     [--rescorer.num_worker_threads RESCORER.NUM_WORKER_THREADS]
                                     [--rescorer.ms_per_timestep RESCORER.MS_PER_TIMESTEP]
                                     [--rescorer.boundary_character_ids RESCORER.BOUNDARY_CHARACTER_IDS]
                                     [--rescorer.vocab_file RESCORER.VOCAB_FILE]
                                     [--lm_decoder_cpu.beam_search_width LM_DECODER_CPU.BEAM_SEARCH_WIDTH]
                                     [--lm_decoder_cpu.decoder_type LM_DECODER_CPU.DECODER_TYPE]
                                     [--lm_decoder_cpu.padding_size LM_DECODER_CPU.PADDING_SIZE]
                                     [--lm_decoder_cpu.language_model_file LM_DECODER_CPU.LANGUAGE_MODEL_FILE]
                                     [--lm_decoder_cpu.max_supported_transcripts LM_DECODER_CPU.MAX_SUPPORTED_TRANSCRIPTS]
                                     [--lm_decoder_cpu.asr_model_delay LM_DECODER_CPU.ASR_MODEL_DELAY]
                                     [--lm_decoder_cpu.language_model_alpha LM_DECODER_CPU.LANGUAGE_MODEL_ALPHA]
                                     [--lm_decoder_cpu.language_model_beta LM_DECODER_CPU.LANGUAGE_MODEL_BETA]
                                     [--lm_decoder_cpu.ms_per_timestep LM_DECODER_CPU.MS_PER_TIMESTEP]
                                     [--lm_decoder_cpu.vocab_file LM_DECODER_CPU.VOCAB_FILE]
                                     [--lm_decoder_cpu.lexicon_file LM_DECODER_CPU.LEXICON_FILE]
                                     [--lm_decoder_cpu.beam_size LM_DECODER_CPU.BEAM_SIZE]
                                     [--lm_decoder_cpu.beam_size_token LM_DECODER_CPU.BEAM_SIZE_TOKEN]
                                     [--lm_decoder_cpu.beam_threshold LM_DECODER_CPU.BEAM_THRESHOLD]
                                     [--lm_decoder_cpu.lm_weight LM_DECODER_CPU.LM_WEIGHT]
                                     [--lm_decoder_cpu.word_insertion_score LM_DECODER_CPU.WORD_INSERTION_SCORE]
                                     [--lm_decoder_cpu.forerunner_beam_size LM_DECODER_CPU.FORERUNNER_BEAM_SIZE]
                                     [--lm_decoder_cpu.forerunner_beam_size_token LM_DECODER_CPU.FORERUNNER_BEAM_SIZE_TOKEN]
                                     [--lm_decoder_cpu.forerunner_beam_threshold LM_DECODER_CPU.FORERUNNER_BEAM_THRESHOLD]
                                     [--lm_decoder_cpu.smearing_mode LM_DECODER_CPU.SMEARING_MODE]
                                     [--lm_decoder_cpu.forerunner_use_lm LM_DECODER_CPU.FORERUNNER_USE_LM]
                                     output_path source_path [source_path ...]

Generate a Riva Model from a speech_recognition model trained with NVIDIA
NeMo.

positional arguments:
  output_path           Location to write compiled Riva pipeline
  source_path           Source file(s)

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Overwrite existing artifacts if they exist
  --language_code LANGUAGE_CODE
                        Language of the model
  --max_batch_size MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --acoustic_model_name ACOUSTIC_MODEL_NAME
                        name of the acoustic model
  --name NAME           name of the ASR pipeline, used to set the model names
                        in the Riva model repository
  --streaming           Execute model in streaming mode
  --offline             In streaming mode, do not minimize latency
  --chunk_size CHUNK_SIZE
                        Size of audio chunks to use during inference. If not
                        specified, default will be selected based on
                        online/offline setting
  --padding_factor PADDING_FACTOR
                        Multiple on the chunk_size. Deprecated and will be
                        ignored
  --left_padding_size LEFT_PADDING_SIZE
                        The duration in seconds of the backward looking
                        padding to prepend to the audio chunk. The acoustic
                        model input corresponds to a duration of
                        (left_padding_size + chunk_size + right_padding_size)
                        seconds
  --right_padding_size RIGHT_PADDING_SIZE
                        The duration in seconds of the forward looking padding
                        to append to the audio chunk. The acoustic model input
                        corresponds to a duration of (left_padding_size +
                        chunk_size + right_padding_size) seconds
  --padding_size PADDING_SIZE
                        padding_size
  --max_supported_transcripts MAX_SUPPORTED_TRANSCRIPTS
                        The maximum number of hypothesized transcripts
                        generated per utterance
  --compute_timestamps COMPUTE_TIMESTAMPS
  --ms_per_timestep MS_PER_TIMESTEP
                        The duration in milliseconds of one timestep of the
                        acoustic model output
  --lattice_beam LATTICE_BEAM
  --decoding_language_model_arpa DECODING_LANGUAGE_MODEL_ARPA
                        Language model .arpa used during decoding
  --decoding_language_model_binary DECODING_LANGUAGE_MODEL_BINARY
                        Language model .binary used during decoding
  --decoding_language_model_fst DECODING_LANGUAGE_MODEL_FST
                        Language model fst used during decoding
  --decoding_language_model_words DECODING_LANGUAGE_MODEL_WORDS
                        Language model words used during decoding
  --rescoring_language_model_arpa RESCORING_LANGUAGE_MODEL_ARPA
                        Language model .arpa used during lattice rescoring
  --decoding_language_model_carpa DECODING_LANGUAGE_MODEL_CARPA
                        Language model .carpa used during decoding
  --rescoring_language_model_carpa RESCORING_LANGUAGE_MODEL_CARPA
                        Language model .carpa used during lattice rescoring
  --decoding_lexicon DECODING_LEXICON
                        Lexicon to use when decoding
  --decoding_vocab DECODING_VOCAB
                        File of unique words separated by white space. Only
                        used if decoding_lexicon not provided.
  --tokenizer_model TOKENIZER_MODEL
                        Sentencpiece model to use for encoding. Only include
                        if generating lexicon from vocab.
  --decoder_type DECODER_TYPE
                        Type of decoder to use. Valid entries are greedy,
                        os2s, flashlight or kaldi

featurizer:
  --featurizer.max_sequence_idle_microseconds FEATURIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --featurizer.max_batch_size FEATURIZER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --featurizer.min_batch_size FEATURIZER.MIN_BATCH_SIZE
  --featurizer.opt_batch_size FEATURIZER.OPT_BATCH_SIZE
  --featurizer.preferred_batch_size FEATURIZER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --featurizer.batching_type FEATURIZER.BATCHING_TYPE
  --featurizer.preserve_ordering FEATURIZER.PRESERVE_ORDERING
                        Preserve ordering
  --featurizer.instance_group_count FEATURIZER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --featurizer.max_queue_delay_microseconds FEATURIZER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --featurizer.max_execution_batch_size FEATURIZER.MAX_EXECUTION_BATCH_SIZE
                        Maximum Batch Size
  --featurizer.gain FEATURIZER.GAIN
                        Adjust input signal with this gain multiplier prior to
                        feature extraction
  --featurizer.dither FEATURIZER.DITHER
                        Augment signal with gaussian noise with this gain to
                        prevent quantization artifacts
  --featurizer.stddev_floor FEATURIZER.STDDEV_FLOOR
                        Add this value to computed features standard
                        deviation. Higher values help reduce spurious
                        transcripts with low energy signals.
  --featurizer.use_utterance_norm_params FEATURIZER.USE_UTTERANCE_NORM_PARAMS
                        Apply normalization at utterance level
  --featurizer.precalc_norm_time_steps FEATURIZER.PRECALC_NORM_TIME_STEPS
                        Weight of the precomputed normalization parameters, in
                        timesteps. Setting to 0 will disable use of
                        precalculated normalization parameters.
  --featurizer.precalc_norm_params FEATURIZER.PRECALC_NORM_PARAMS
                        Boolean that controls if precalculated Normalization
                        Parameters should be used
  --featurizer.norm_per_feature FEATURIZER.NORM_PER_FEATURE
                        Normalize Per Feature
  --featurizer.mean FEATURIZER.MEAN
                        Pre-computed mean values
  --featurizer.stddev FEATURIZER.STDDEV
                        Pre-computed Std Dev Values
  --featurizer.transpose FEATURIZER.TRANSPOSE
                        Take transpose of output features
  --featurizer.padding_size FEATURIZER.PADDING_SIZE
                        padding_size

nn:
  --nn.max_sequence_idle_microseconds NN.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --nn.max_batch_size NN.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --nn.min_batch_size NN.MIN_BATCH_SIZE
  --nn.opt_batch_size NN.OPT_BATCH_SIZE
  --nn.preferred_batch_size NN.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --nn.batching_type NN.BATCHING_TYPE
  --nn.preserve_ordering NN.PRESERVE_ORDERING
                        Preserve ordering
  --nn.instance_group_count NN.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --nn.max_queue_delay_microseconds NN.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --nn.trt_max_workspace_size NN.TRT_MAX_WORKSPACE_SIZE
                        Maximum workspace size (in bytes) to use for model
                        export to TensorRT
  --nn.use_onnx_runtime
                        Use ONNX runtime instead of TRT
  --nn.use_trt_fp32     Use TRT engine with fp32 instead of fp16

vad:
  --vad.max_sequence_idle_microseconds VAD.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --vad.max_batch_size VAD.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --vad.min_batch_size VAD.MIN_BATCH_SIZE
  --vad.opt_batch_size VAD.OPT_BATCH_SIZE
  --vad.preferred_batch_size VAD.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --vad.batching_type VAD.BATCHING_TYPE
  --vad.preserve_ordering VAD.PRESERVE_ORDERING
                        Preserve ordering
  --vad.instance_group_count VAD.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --vad.max_queue_delay_microseconds VAD.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --vad.ms_per_timestep VAD.MS_PER_TIMESTEP
  --vad.vad_start_history VAD.VAD_START_HISTORY
                        Size of the window, in milliseconds, to use to detect
                        start of utterance. If (vad_start_th) of
                        (vad_start_history) ms of the acoustic model output
                        have non-blank tokens, start of utterance is detected.
  --vad.vad_stop_history VAD.VAD_STOP_HISTORY
                        Size of the window, in milliseconds, to use to detect
                        end of utterance. If (vad_stop_th) of
                        (vad_stop_history) ms of the acoustic model output
                        have non-blank tokens, end of utterance is detected.
  --vad.vad_start_th VAD.VAD_START_TH
                        Percentage threshold to use to detect start of
                        utterance. If (vad_start_th) of (vad_start_history) ms
                        of the acoustic model output have non-blank tokens,
                        start of utterance is detected.
  --vad.vad_stop_th VAD.VAD_STOP_TH
                        Percentage threshold to use to detect end of
                        utterance. If (vad_stop_th) of (vad_stop_history) ms
                        of the acoustic model output have non-blank tokens,
                        end of utterance is detected.
  --vad.vad_type VAD.VAD_TYPE
                        Type of voice activity detection algorithm to use. Set
                        to none to disable VAD.
  --vad.residue_blanks_at_start VAD.RESIDUE_BLANKS_AT_START
                        (Advanced) Number of time steps to ignore at the
                        beginning of the acoustic model output when trying to
                        detect start/end of speech
  --vad.residue_blanks_at_end VAD.RESIDUE_BLANKS_AT_END
                        (Advanced) Number of time steps to ignore at the end
                        of the acoustic model output when trying to detect
                        start/end of speech
  --vad.vocab_file VAD.VOCAB_FILE
                        Vocab file to be used with decoder

flashlight_decoder:
  --flashlight_decoder.max_sequence_idle_microseconds FLASHLIGHT_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --flashlight_decoder.max_batch_size FLASHLIGHT_DECODER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --flashlight_decoder.min_batch_size FLASHLIGHT_DECODER.MIN_BATCH_SIZE
  --flashlight_decoder.opt_batch_size FLASHLIGHT_DECODER.OPT_BATCH_SIZE
  --flashlight_decoder.preferred_batch_size FLASHLIGHT_DECODER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --flashlight_decoder.batching_type FLASHLIGHT_DECODER.BATCHING_TYPE
  --flashlight_decoder.preserve_ordering FLASHLIGHT_DECODER.PRESERVE_ORDERING
                        Preserve ordering
  --flashlight_decoder.instance_group_count FLASHLIGHT_DECODER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --flashlight_decoder.max_queue_delay_microseconds FLASHLIGHT_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --flashlight_decoder.max_execution_batch_size FLASHLIGHT_DECODER.MAX_EXECUTION_BATCH_SIZE
  --flashlight_decoder.decoder_type FLASHLIGHT_DECODER.DECODER_TYPE
  --flashlight_decoder.padding_size FLASHLIGHT_DECODER.PADDING_SIZE
                        padding_size
  --flashlight_decoder.max_supported_transcripts FLASHLIGHT_DECODER.MAX_SUPPORTED_TRANSCRIPTS
  --flashlight_decoder.asr_model_delay FLASHLIGHT_DECODER.ASR_MODEL_DELAY
                        (Advanced) Number of time steps by which the acoustic
                        model output should be shifted when computing
                        timestamps. This parameter must be tuned since the CTC
                        model is not guaranteed to predict correct alignment.
  --flashlight_decoder.ms_per_timestep FLASHLIGHT_DECODER.MS_PER_TIMESTEP
  --flashlight_decoder.vocab_file FLASHLIGHT_DECODER.VOCAB_FILE
                        Vocab file to be used with decoder
  --flashlight_decoder.decoder_num_worker_threads FLASHLIGHT_DECODER.DECODER_NUM_WORKER_THREADS
                        Number of threads to use for CPU decoders. If < 1,
                        maximum hardware concurrency is used.
  --flashlight_decoder.language_model_file FLASHLIGHT_DECODER.LANGUAGE_MODEL_FILE
                        Language model file in binary format to be used by
                        KenLM
  --flashlight_decoder.lexicon_file FLASHLIGHT_DECODER.LEXICON_FILE
                        Lexicon file to be used with decoder
  --flashlight_decoder.beam_size FLASHLIGHT_DECODER.BEAM_SIZE
                        Maximum number of hypothesis the decoder holds after
                        each step
  --flashlight_decoder.beam_size_token FLASHLIGHT_DECODER.BEAM_SIZE_TOKEN
                        Maximum number of tokens the decoder considers at each
                        step
  --flashlight_decoder.beam_threshold FLASHLIGHT_DECODER.BEAM_THRESHOLD
                        Threshold to prune hypothesis
  --flashlight_decoder.lm_weight FLASHLIGHT_DECODER.LM_WEIGHT
                        Weight of language model
  --flashlight_decoder.blank_token FLASHLIGHT_DECODER.BLANK_TOKEN
                        Blank token
  --flashlight_decoder.sil_token FLASHLIGHT_DECODER.SIL_TOKEN
                        Silence token
  --flashlight_decoder.word_insertion_score FLASHLIGHT_DECODER.WORD_INSERTION_SCORE
                        Word insertion score
  --flashlight_decoder.forerunner_beam_size FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE
                        Maximum number of hypothesis the decoder holds after
                        each step, for forerunner transcript
  --flashlight_decoder.forerunner_beam_size_token FLASHLIGHT_DECODER.FORERUNNER_BEAM_SIZE_TOKEN
                        Maximum number of tokens the decoder considers at each
                        step, for forerunner transcript
  --flashlight_decoder.forerunner_beam_threshold FLASHLIGHT_DECODER.FORERUNNER_BEAM_THRESHOLD
                        Threshold to prune hypothesis, for forerunner
                        transcript
  --flashlight_decoder.smearing_mode FLASHLIGHT_DECODER.SMEARING_MODE
                        Decoder smearing mode. Can be logadd, max or none
  --flashlight_decoder.forerunner_use_lm FLASHLIGHT_DECODER.FORERUNNER_USE_LM
                        Bool that controls if the forerunner decoder should
                        use a language model

greedy_decoder:
  --greedy_decoder.max_sequence_idle_microseconds GREEDY_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --greedy_decoder.max_batch_size GREEDY_DECODER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --greedy_decoder.min_batch_size GREEDY_DECODER.MIN_BATCH_SIZE
  --greedy_decoder.opt_batch_size GREEDY_DECODER.OPT_BATCH_SIZE
  --greedy_decoder.preferred_batch_size GREEDY_DECODER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --greedy_decoder.batching_type GREEDY_DECODER.BATCHING_TYPE
  --greedy_decoder.preserve_ordering GREEDY_DECODER.PRESERVE_ORDERING
                        Preserve ordering
  --greedy_decoder.instance_group_count GREEDY_DECODER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --greedy_decoder.max_queue_delay_microseconds GREEDY_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --greedy_decoder.max_execution_batch_size GREEDY_DECODER.MAX_EXECUTION_BATCH_SIZE
  --greedy_decoder.decoder_type GREEDY_DECODER.DECODER_TYPE
  --greedy_decoder.padding_size GREEDY_DECODER.PADDING_SIZE
                        padding_size
  --greedy_decoder.max_supported_transcripts GREEDY_DECODER.MAX_SUPPORTED_TRANSCRIPTS
  --greedy_decoder.asr_model_delay GREEDY_DECODER.ASR_MODEL_DELAY
                        (Advanced) Number of time steps by which the acoustic
                        model output should be shifted when computing
                        timestamps. This parameter must be tuned since the CTC
                        model is not guaranteed to predict correct alignment.
  --greedy_decoder.ms_per_timestep GREEDY_DECODER.MS_PER_TIMESTEP
  --greedy_decoder.vocab_file GREEDY_DECODER.VOCAB_FILE
                        Vocab file to be used with decoder
  --greedy_decoder.decoder_num_worker_threads GREEDY_DECODER.DECODER_NUM_WORKER_THREADS
                        Number of threads to use for CPU decoders. If < 1,
                        maximum hardware concurrency is used.

os2s_decoder:
  --os2s_decoder.max_sequence_idle_microseconds OS2S_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --os2s_decoder.max_batch_size OS2S_DECODER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --os2s_decoder.min_batch_size OS2S_DECODER.MIN_BATCH_SIZE
  --os2s_decoder.opt_batch_size OS2S_DECODER.OPT_BATCH_SIZE
  --os2s_decoder.preferred_batch_size OS2S_DECODER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --os2s_decoder.batching_type OS2S_DECODER.BATCHING_TYPE
  --os2s_decoder.preserve_ordering OS2S_DECODER.PRESERVE_ORDERING
                        Preserve ordering
  --os2s_decoder.instance_group_count OS2S_DECODER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --os2s_decoder.max_queue_delay_microseconds OS2S_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --os2s_decoder.max_execution_batch_size OS2S_DECODER.MAX_EXECUTION_BATCH_SIZE
  --os2s_decoder.decoder_type OS2S_DECODER.DECODER_TYPE
  --os2s_decoder.padding_size OS2S_DECODER.PADDING_SIZE
                        padding_size
  --os2s_decoder.max_supported_transcripts OS2S_DECODER.MAX_SUPPORTED_TRANSCRIPTS
  --os2s_decoder.asr_model_delay OS2S_DECODER.ASR_MODEL_DELAY
                        (Advanced) Number of time steps by which the acoustic
                        model output should be shifted when computing
                        timestamps. This parameter must be tuned since the CTC
                        model is not guaranteed to predict correct alignment.
  --os2s_decoder.ms_per_timestep OS2S_DECODER.MS_PER_TIMESTEP
  --os2s_decoder.vocab_file OS2S_DECODER.VOCAB_FILE
                        Vocab file to be used with decoder
  --os2s_decoder.decoder_num_worker_threads OS2S_DECODER.DECODER_NUM_WORKER_THREADS
                        Number of threads to use for CPU decoders. If < 1,
                        maximum hardware concurrency is used.
  --os2s_decoder.language_model_file OS2S_DECODER.LANGUAGE_MODEL_FILE
                        Language model file in binary format to be used by
                        KenLM
  --os2s_decoder.beam_search_width OS2S_DECODER.BEAM_SEARCH_WIDTH
  --os2s_decoder.language_model_alpha OS2S_DECODER.LANGUAGE_MODEL_ALPHA
  --os2s_decoder.language_model_beta OS2S_DECODER.LANGUAGE_MODEL_BETA

kaldi_decoder:
  --kaldi_decoder.max_sequence_idle_microseconds KALDI_DECODER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --kaldi_decoder.max_batch_size KALDI_DECODER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --kaldi_decoder.min_batch_size KALDI_DECODER.MIN_BATCH_SIZE
  --kaldi_decoder.opt_batch_size KALDI_DECODER.OPT_BATCH_SIZE
  --kaldi_decoder.preferred_batch_size KALDI_DECODER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --kaldi_decoder.batching_type KALDI_DECODER.BATCHING_TYPE
  --kaldi_decoder.preserve_ordering KALDI_DECODER.PRESERVE_ORDERING
                        Preserve ordering
  --kaldi_decoder.instance_group_count KALDI_DECODER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --kaldi_decoder.max_queue_delay_microseconds KALDI_DECODER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --kaldi_decoder.max_execution_batch_size KALDI_DECODER.MAX_EXECUTION_BATCH_SIZE
  --kaldi_decoder.decoder_type KALDI_DECODER.DECODER_TYPE
  --kaldi_decoder.padding_size KALDI_DECODER.PADDING_SIZE
                        padding_size
  --kaldi_decoder.max_supported_transcripts KALDI_DECODER.MAX_SUPPORTED_TRANSCRIPTS
  --kaldi_decoder.asr_model_delay KALDI_DECODER.ASR_MODEL_DELAY
                        (Advanced) Number of time steps by which the acoustic
                        model output should be shifted when computing
                        timestamps. This parameter must be tuned since the CTC
                        model is not guaranteed to predict correct alignment.
  --kaldi_decoder.ms_per_timestep KALDI_DECODER.MS_PER_TIMESTEP
  --kaldi_decoder.vocab_file KALDI_DECODER.VOCAB_FILE
                        Vocab file to be used with decoder
  --kaldi_decoder.decoder_num_worker_threads KALDI_DECODER.DECODER_NUM_WORKER_THREADS
                        Number of threads to use for CPU decoders. If < 1,
                        maximum hardware concurrency is used.
  --kaldi_decoder.fst_filename KALDI_DECODER.FST_FILENAME
                        Fst file to use during decoding
  --kaldi_decoder.word_syms_filename KALDI_DECODER.WORD_SYMS_FILENAME
  --kaldi_decoder.default_beam KALDI_DECODER.DEFAULT_BEAM
  --kaldi_decoder.max_active KALDI_DECODER.MAX_ACTIVE
  --kaldi_decoder.acoustic_scale KALDI_DECODER.ACOUSTIC_SCALE
  --kaldi_decoder.decoder_num_copy_threads KALDI_DECODER.DECODER_NUM_COPY_THREADS
  --kaldi_decoder.determinize_lattice KALDI_DECODER.DETERMINIZE_LATTICE

rescorer:
  --rescorer.max_sequence_idle_microseconds RESCORER.MAX_SEQUENCE_IDLE_MICROSECONDS
                        Global timeout, in ms
  --rescorer.max_batch_size RESCORER.MAX_BATCH_SIZE
                        Default maximum parallel requests in a single forward
                        pass
  --rescorer.min_batch_size RESCORER.MIN_BATCH_SIZE
  --rescorer.opt_batch_size RESCORER.OPT_BATCH_SIZE
  --rescorer.preferred_batch_size RESCORER.PREFERRED_BATCH_SIZE
                        Preferred batch size, must be smaller than Max batch
                        size
  --rescorer.batching_type RESCORER.BATCHING_TYPE
  --rescorer.preserve_ordering RESCORER.PRESERVE_ORDERING
                        Preserve ordering
  --rescorer.instance_group_count RESCORER.INSTANCE_GROUP_COUNT
                        How many instances in a group
  --rescorer.max_queue_delay_microseconds RESCORER.MAX_QUEUE_DELAY_MICROSECONDS
                        Maximum amount of time to allow requests to queue to
                        form a batch in microseconds
  --rescorer.max_supported_transcripts RESCORER.MAX_SUPPORTED_TRANSCRIPTS
  --rescorer.score_lm_carpa_filename RESCORER.SCORE_LM_CARPA_FILENAME
  --rescorer.decode_lm_carpa_filename RESCORER.DECODE_LM_CARPA_FILENAME
  --rescorer.word_syms_filename RESCORER.WORD_SYMS_FILENAME
  --rescorer.word_insertion_penalty RESCORER.WORD_INSERTION_PENALTY
  --rescorer.num_worker_threads RESCORER.NUM_WORKER_THREADS
  --rescorer.ms_per_timestep RESCORER.MS_PER_TIMESTEP
  --rescorer.boundary_character_ids RESCORER.BOUNDARY_CHARACTER_IDS
  --rescorer.vocab_file RESCORER.VOCAB_FILE
                        Vocab file to be used with decoder

lm_decoder_cpu:
  --lm_decoder_cpu.beam_search_width LM_DECODER_CPU.BEAM_SEARCH_WIDTH
  --lm_decoder_cpu.decoder_type LM_DECODER_CPU.DECODER_TYPE
  --lm_decoder_cpu.padding_size LM_DECODER_CPU.PADDING_SIZE
                        padding_size
  --lm_decoder_cpu.language_model_file LM_DECODER_CPU.LANGUAGE_MODEL_FILE
                        Language model file in binary format to be used by
                        KenLM
  --lm_decoder_cpu.max_supported_transcripts LM_DECODER_CPU.MAX_SUPPORTED_TRANSCRIPTS
  --lm_decoder_cpu.asr_model_delay LM_DECODER_CPU.ASR_MODEL_DELAY
                        (Advanced) Number of time steps by which the acoustic
                        model output should be shifted when computing
                        timestamps. This parameter must be tuned since the CTC
                        model is not guaranteed to predict correct alignment.
  --lm_decoder_cpu.language_model_alpha LM_DECODER_CPU.LANGUAGE_MODEL_ALPHA
  --lm_decoder_cpu.language_model_beta LM_DECODER_CPU.LANGUAGE_MODEL_BETA
  --lm_decoder_cpu.ms_per_timestep LM_DECODER_CPU.MS_PER_TIMESTEP
  --lm_decoder_cpu.vocab_file LM_DECODER_CPU.VOCAB_FILE
                        Vocab file to be used with decoder
  --lm_decoder_cpu.lexicon_file LM_DECODER_CPU.LEXICON_FILE
                        Lexicon file to be used with decoder
  --lm_decoder_cpu.beam_size LM_DECODER_CPU.BEAM_SIZE
                        Maximum number of hypothesis the decoder holds after
                        each step
  --lm_decoder_cpu.beam_size_token LM_DECODER_CPU.BEAM_SIZE_TOKEN
                        Maximum number of tokens the decoder considers at each
                        step
  --lm_decoder_cpu.beam_threshold LM_DECODER_CPU.BEAM_THRESHOLD
                        Threshold to prune hypothesis
  --lm_decoder_cpu.lm_weight LM_DECODER_CPU.LM_WEIGHT
                        Weight of language model
  --lm_decoder_cpu.word_insertion_score LM_DECODER_CPU.WORD_INSERTION_SCORE
                        Word insertion score
  --lm_decoder_cpu.forerunner_beam_size LM_DECODER_CPU.FORERUNNER_BEAM_SIZE
                        Maximum number of hypothesis the decoder holds after
                        each step, for forerunner transcript
  --lm_decoder_cpu.forerunner_beam_size_token LM_DECODER_CPU.FORERUNNER_BEAM_SIZE_TOKEN
                        Maximum number of tokens the decoder considers at each
                        step, for forerunner transcript
  --lm_decoder_cpu.forerunner_beam_threshold LM_DECODER_CPU.FORERUNNER_BEAM_THRESHOLD
                        Threshold to prune hypothesis, for forerunner
                        transcript
  --lm_decoder_cpu.smearing_mode LM_DECODER_CPU.SMEARING_MODE
                        Decoder smearing mode. Can be logadd, max or none
  --lm_decoder_cpu.forerunner_use_lm LM_DECODER_CPU.FORERUNNER_USE_LM
                        Bool that controls if the forerunner decoder should
                        use a language model

Training Language Models

Introducing a language model to an ASR pipeline is an easy way to improve accuracy for natural language, and the model can be fine-tuned for niche settings. In short, an n-gram language model estimates the probability distribution over groups of n or fewer consecutive words, P(word-1, …, word-n). By altering or biasing the data on which a language model is trained, and thus the distribution it estimates, you can make the model rank certain transcriptions as more likely, changing the prediction without modifying the acoustic model. Riva supports n-gram models trained and exported from either NVIDIA TAO Toolkit or KenLM.
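
To make the estimation concrete, the sketch below computes maximum-likelihood bigram probabilities (n = 2) from a toy corpus by relative frequency. The corpus and function names are illustrative; production toolkits such as KenLM and TAO additionally apply smoothing, backoff, and pruning, which are omitted here.

from collections import Counter

# Toy corpus; a real corpus would be preprocessed text matching the acoustic model.
corpus = ["turn the volume up", "turn the lights off", "turn it up"]
tokens = [sentence.split() for sentence in corpus]

unigram_counts = Counter(word for sentence in tokens for word in sentence)
bigram_counts = Counter(pair for sentence in tokens for pair in zip(sentence, sentence[1:]))

def p_bigram(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) by relative frequency."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(p_bigram("turn", "the"))   # 2/3: "turn" is followed by "the" in 2 of its 3 occurrences
print(p_bigram("volume", "up"))  # 1.0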

TAO Toolkit Language Model

The general TAO Toolkit model development pipeline is outlined in the Model Overview page. To train a new language model, run:

!tao n_gram train -e /specs/nlp/lm/n_gram/train.yaml \
                     export_to=PATH_TO_TAO_FILE \
                     training_ds.data_dir=PATH_TO_DATA \
                     model.order=4 \
                     model.pruning=[0,1,1,3]  \
                     -k $KEY

To export a pre-trained model, run:

### For export to Riva
!tao n_gram export \
           -e /specs/nlp/lm/n_gram/export.yaml \
           -m PATH_TO_TAO_FILE \
           export_to=PATH_TO_RIVA_FILE \
           binary_type=probing \
           -k $KEY

KenLM Setup

KenLM is the recommended tool for building language models. This toolkit supports estimating, filtering, and querying n-gram language models. To begin, make sure you have Boost and zlib installed. Depending on your requirements, you may need additional dependencies; double-check against the dependencies list.

After all dependencies are met, create a separate directory to build KenLM.

wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2

Estimating

The next step is to gather and process data. In most cases, KenLM expects the data to be natural language suited to your use case. Common preprocessing steps include replacing numerals and removing umlauts, punctuation, or special characters. Most importantly, the preprocessing must be consistent between your language model and your acoustic model.
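
A minimal preprocessing sketch is shown below. The exact rules are a design choice (the regular expressions and usage here are only illustrative); what matters is that the output alphabet matches what the acoustic model emits. For example, if the acoustic model outputs only lowercase letters, spaces, and apostrophes, the corpus should contain nothing else.

import re
import sys

def normalize_line(line: str) -> str:
    """Lowercase and strip everything outside the assumed acoustic-model alphabet."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)    # drop digits, punctuation, special characters
    return re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace

if __name__ == "__main__":
    # Illustrative usage: python3 normalize.py < raw_corpus.txt > text
    for raw_line in sys.stdin:
        cleaned = normalize_line(raw_line)
        if cleaned:
            print(cleaned)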

Assuming your current working directory is the build subdirectory of KenLM, bin/lmplz performs estimation on the corpus provided through stdin and writes the ARPA file (a human-readable form of the language model) to stdout. Running bin/lmplz prints documentation for its command-line arguments; a few important ones are:

  • -o: Required. The order of the language model. Depends on use case, but generally 3 to 8.

  • -S: Memory to use. Number followed by % for percentage, b for bytes, K for kilobytes, and so on. Default is 80%.

  • -T: Temporary file location.

  • --text arg: Read text from a file instead of stdin.

  • --arpa arg: Write ARPA to a file instead of stdout.

  • --prune arg: Prune n-grams with count less than or equal to the given threshold, with one value specified for each order. For example, to prune singleton trigrams, --prune 0 0 1. The sequence of values must be non-decreasing and the last value applies to all remaining orders. Default is to not prune. Unigram pruning is not supported, so the first number must be 0.

  • --limit_vocab_file arg: Read allowed vocabulary separated by whitespace from file in argument and prune all n-grams containing vocabulary items not from the list. Can be combined with pruning.

Pruning and limiting vocabulary help get rid of typos, uncommon words, and general outliers from the dataset, making the resulting ARPA smaller and generally less overfit, but potentially at the cost of losing some jargon or colloquial language.
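
For instance, the whitespace-separated vocabulary file expected by --limit_vocab_file (the same format as riva-build's --decoding_vocab flag) can be generated from the normalized training text with a few lines of Python; the file names below are illustrative.

# Collect the unique words from the normalized corpus ("text") and write them
# out separated by whitespace ("vocab.txt"); file names are illustrative.
words = set()
with open("text") as corpus_file:
    for line in corpus_file:
        words.update(line.split())

with open("vocab.txt", "w") as vocab_file:
    vocab_file.write(" ".join(sorted(words)))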

With the appropriate options, the language model can be estimated.

bin/lmplz -o 4 < text > text.arpa

Querying and Evaluation

For faster loading, convert the ARPA file to a binary file.

bin/build_binary text.arpa text.binary

The binary or ARPA can be queried via the command-line.

bin/query text.binary < data
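
The same binary model can also be queried from Python, which is convenient for spot-checking scores and perplexity before deployment. The snippet below assumes the KenLM Python bindings (the kenlm module) are installed; the sentence is only an example.

import kenlm

model = kenlm.Model("text.binary")  # also accepts the .arpa file, just slower to load
sentence = "turn the volume up"

print(model.score(sentence, bos=True, eos=True))  # total log10 probability with sentence boundaries
print(model.perplexity(sentence))                 # per-word perplexity

# Per-token breakdown: (log10 probability, n-gram order used, out-of-vocabulary flag)
for log_prob, ngram_length, oov in model.full_scores(sentence):
    print(log_prob, ngram_length, oov)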

Pretrained Models

Deployment with Citrinet is currently recommended for most users. QuartzNet 1.2 is a smaller, more efficient model, suitable when reduced accuracy is an acceptable trade-off for higher throughput and lower latency.

Task | Architecture | Language | Dataset | Sampling Rate (Hz) | Compatibility with TAO Toolkit 3.0-21.08 | Compatibility with NeMo 1.0.0b4 | Link
---- | ------------ | -------- | ------- | ------------------ | ---------------------------------------- | ------------------------------- | ----
Transcription | Citrinet | English | ASR Set 3.0 - 16700 Hours | 16000 | Yes | Yes | Riva
Transcription | Conformer-CTC | English | ASR Set 3.0 - 16700 Hours | 16000 | Yes | Yes | Riva
Transcription | Jasper | English | ASR Set 1.2 with Noisy (profiles: room reverb, echo, wind, keyboard, baby crying) - 7K hours | 16000 | Yes | Yes | Riva
Transcription | QuartzNet | English | ASR Set 1.2 | 16000 | Yes | Yes | Riva
Transcription | Citrinet | Spanish | ASR Set 1.0 - 1800 Hours | 16000 | Yes | Yes | Riva
Transcription | Citrinet | German | ASR Set 1.0 - 2300 Hours | 16000 | Yes | Yes | Riva
Transcription | Citrinet | Russian | ASR Set 1.0 - 1700 Hours | 16000 | Yes | Yes | Riva

Features

Word Boosting

Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time, by giving them a higher score when decoding the output of the acoustic model.

Examples demonstrating how to use word boosting can be found in the /work/examples/transcribe_file_offline.py and /work/examples/transcribe_file.py Python scripts in the Riva client image. The following sample commands show how to run these scripts (and the outputs they generate) from within the Riva client container:

/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav
Final transcript: I had a meeting today with Muhammad Oscar and Katherine Rutherford about the future of Riva at NVIDIA.
/work/examples# python3 transcribe_file.py --server <Riva API endpoint hostname>:<Riva API endpoint port number> --audio-file <audio file path>/audiofile.wav --boosted_lm_words "asghar"
Final transcript: I had a meeting today with Muhammad Asghar and Katherine Rutherford about the future of Riva at NVIDIA.

These scripts show how to add boosted words to RecognitionConfig with SpeechContext (look for the "# Append boosted words/score" comment). For more information about SpeechContext, refer to the riva/proto/riva_asr.proto description.

The following word boosting code snippets are included in these example scripts:

# Creating gRPC channel and RecognitionConfig instance. Assumes gRPC and the Riva
# ASR protos/stubs are imported as in the example scripts (grpc, ra, rasr, rasr_srv).
channel = grpc.insecure_channel(args.server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
config = rasr.RecognitionConfig(
  encoding=ra.AudioEncoding.LINEAR_PCM,
  sample_rate_hertz=wf.getframerate(),
  language_code=args.language_code,
  max_alternatives=1,
  enable_automatic_punctuation=True,
)

# Word Boosting
boosted_lm_words = ["first", "second", "third"]
boosted_lm_score = 10.0
speech_context = rasr.SpeechContext()
speech_context.phrases.extend(boosted_lm_words)
speech_context.boost = boosted_lm_score
config.speech_contexts.append(speech_context)

# Creating StreamingRecognitionConfig instance with config
streaming_config = rasr.StreamingRecognitionConfig(config=config, interim_results=True)

.....

You can also use different boost values for different words. For example, here "first" is boosted by 10 and "second" is boosted by 20:

speech_context1 = rasr.SpeechContext()
speech_context1.phrases.append("first")
speech_context1.boost = 10.
config.speech_contexts.append(speech_context1)

speech_context2 = rasr.SpeechContext()
speech_context2.phrases.append("second")
speech_context2.boost = 20.
config.speech_contexts.append(speech_context2)

Note:

  • There is no limit on the number of words that can be boosted. The Riva Speech Skills v1.10.0-beta release includes significant performance improvements over previous versions: even with ~100 boosted words, there should be no impact on latency for any request except the first, which is expected.

  • By default, no words are boosted on the server side. Only words passed by the client are boosted.

  • With the Riva Speech Skills v1.10.0-beta release, out-of-vocabulary word boosting is supported as well.

  • Boosting phrases or combinations of words is not yet fully supported (although they do work; see the sketch below). This support will be finalized in an upcoming release.
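
As a sketch of phrase boosting, the following snippet applies the same SpeechContext pattern shown above to a multi-word phrase; the phrase and boost value are only examples.

# Boosting a multi-word phrase (not yet fully supported; see the note above).
speech_context = rasr.SpeechContext()
speech_context.phrases.append("Katherine Rutherford")  # multi-word phrase, example only
speech_context.boost = 20.0
config.speech_contexts.append(speech_context)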