Pipeline Configuration#

In the simplest use case, you can deploy an ASR pipeline to be used with the StreamingRecognize API call (refer to riva/proto/riva_asr.proto) without any language model as follows:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --wfst_tokenizer_model=<wfst_tokenizer_model> \
    --wfst_verbalizer_model=<wfst_verbalizer_model> \
    --decoder_type=greedy

where:

  • <rmir_filename> is the name of the RMIR file that riva-build generates

  • <riva_filename> is the name of the riva file to use as input

  • <encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.

  • <pipeline_name>, <acoustic_model_name>, and <featurizer_name> are optional user-defined names for the components in the model repository.

  • <wfst_tokenizer_model> is the name of the WFST tokenizer model file to use for inverse text normalization of ASR transcripts. Refer to inverse-text-normalization for more details.

  • <wfst_verbalizer_model> is the name of the WFST verbalizer model file to use for inverse text normalization of ASR transcripts. Refer to inverse-text-normalization for more details.

  • decoder_type is the type of decoder to use. Valid values are flashlight, os2s, greedy, and pass_through. We recommend using flashlight for all CTC models. Refer to Decoder Hyper-Parameters for more details.

Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. Since no language model is specified, the Riva greedy decoder is used to predict the transcript based on the output of the acoustic model. If your .riva archives are encrypted, you must append :<encryption_key> to both the RMIR filename and the Riva filename; otherwise, the suffix is unnecessary.
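
For example, if the input .riva archive is not encrypted, the key suffixes can be dropped entirely. The filenames below are illustrative placeholders, not shipped artifacts:

riva-build speech_recognition \
    /servicemaker-dev/conformer-en-us-greedy.rmir \
    /servicemaker-dev/conformer-en-us.riva \
    --name=conformer-en-US-greedy \
    --decoder_type=greedy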

For embedded platforms, using a batch size of 1 is recommended since it achieves the lowest memory footprint. To use a batch size of 1, refer to the riva-build-optional-parameters section and set the various min_batch_size, max_batch_size, opt_batch_size, and max_execution_batch_size parameters to 1 while executing the riva-build command.
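
For instance, a minimal greedy-decoder build for an embedded target might look like the following sketch. The exact set of batch-size parameters exposed depends on the pipeline components, so confirm the names with riva-build speech_recognition -h:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --min_batch_size=1 \
    --max_batch_size=1 \
    --opt_batch_size=1 \
    --max_execution_batch_size=1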

The following command is representative of the riva-build commands used by the Quick Start scripts to generate the RMIR files for the different models and modes:

riva-build speech_recognition \
  <rmir_filename>:<key> \
  <riva_file>:<key> \
  --name=conformer-en-US-asr-streaming \
  --return_separate_utterances=False \
  --featurizer.use_utterance_norm_params=False \
  --featurizer.precalc_norm_time_steps=0 \
  --featurizer.precalc_norm_params=False \
  --ms_per_timestep=40 \
  --endpointing.start_history=200 \
  --nn.fp16_needs_obey_precision_pass \
  --endpointing.residue_blanks_at_start=-2 \
  --chunk_size=0.16 \
  --left_padding_size=1.92 \
  --right_padding_size=1.92 \
  --decoder_type=flashlight \
  --flashlight_decoder.asr_model_delay=-1 \
  --decoding_language_model_binary=<bin_file> \
  --decoding_vocab=<txt_decoding_vocab_file> \
  --flashlight_decoder.lm_weight=0.8 \
  --flashlight_decoder.word_insertion_score=1.0 \
  --flashlight_decoder.beam_size=32 \
  --flashlight_decoder.beam_threshold=20.0 \
  --flashlight_decoder.num_tokenization=1 \
  --profane_words_file=<txt_profane_words_file> \
  --language_code=en-US \
  --wfst_tokenizer_model=<far_tokenizer_file> \
  --wfst_verbalizer_model=<far_verbalizer_file> \
  --speech_hints_model=<far_speech_hints_file>

For details about the parameters passed to riva-build to customize the ASR pipeline, run:

riva-build <pipeline> -h

Streaming/Offline Recognition#

The Riva ASR pipeline can be configured for both streaming and offline recognition use cases. When using the StreamingRecognize API call (refer to riva/proto/riva_asr.proto), we recommend the following riva-build parameters for low-latency streaming recognition with the Conformer acoustic model:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --wfst_tokenizer_model=<wfst_tokenizer_model> \
    --wfst_verbalizer_model=<wfst_verbalizer_model> \
    --decoder_type=greedy \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --ms_per_timestep=40 \
    --nn.fp16_needs_obey_precision_pass \
    --greedy_decoder.asr_model_delay=-1 \
    --endpointing.residue_blanks_at_start=-2 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False

For high throughput streaming recognition with the StreamingRecognize API call, chunk_size and padding_size can be set as follows:

    --chunk_size=0.8 \
    --padding_size=1.6

Finally, to configure the ASR pipeline for offline recognition with the Recognize API call (refer to riva/proto/riva_asr.proto), we recommend the following settings with the Conformer acoustic model:

    --offline \
    --chunk_size=4.8 \
    --padding_size=1.6

Note

When deploying the offline ASR models with riva-deploy, TensorRT warnings indicating that memory requirements of format conversion cannot be satisfied might appear in the logs. These warnings should not affect functionality and can be ignored.

Language Models#

Riva ASR supports decoding with an n-gram language model. The language model can be provided in either of two formats:

  1. A .arpa format file.

  2. A KenLM binary format file.

For more information on building language models, refer to the training-language-models section.
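
As a brief sketch of that workflow, an ARPA model can be trained with KenLM's lmplz tool and converted to the binary format with build_binary. These are KenLM utilities, not part of Riva, and are assumed to be installed:

# Train a 4-gram language model in ARPA format from a plain-text corpus.
lmplz -o 4 < corpus.txt > lm.arpa
# Convert the ARPA file to the KenLM binary format for faster loading.
build_binary lm.arpa lm.bin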

ARPA Format Language Model#

To configure the Riva ASR pipeline to use an n-gram language model stored in ARPA format, replace:

    --decoder_type=greedy

with

    --decoder_type=flashlight \
    --decoding_language_model_arpa=<arpa_filename> \
    --decoding_vocab=<decoder_vocab_file>

KenLM Binary Language Model#

To generate the Riva RMIR file when using a KenLM binary file to specify the language model, replace:

    --decoder_type=greedy

with

    --decoder_type=flashlight \
    --decoding_language_model_binary=<KENLM_binary_filename> \
    --decoding_vocab=<decoder_vocab_file>

Decoder Hyper-Parameters#

The decoder language model hyper-parameters can also be specified from the riva-build command.

You can specify the Flashlight decoder hyper-parameters beam_size, beam_size_token, beam_threshold, lm_weight, and word_insertion_score as follows:

    --decoder_type=flashlight \
    --decoding_language_model_binary=<KENLM_binary_filename> \
    --decoding_vocab=<decoder_vocab_file> \
    --flashlight_decoder.beam_size=<beam_size> \
    --flashlight_decoder.beam_size_token=<beam_size_token> \
    --flashlight_decoder.beam_threshold=<beam_threshold> \
    --flashlight_decoder.lm_weight=<lm_weight> \
    --flashlight_decoder.word_insertion_score=<word_insertion_score>

Where:

  • beam_size is the maximum number of hypotheses the decoder holds at each step

  • beam_size_token is the maximum number of tokens the decoder considers at each step

  • beam_threshold is the threshold used to prune hypotheses

  • lm_weight is the weight given to the language model when scoring hypotheses

  • word_insertion_score is the word insertion score used when scoring hypotheses

For advanced users, additional decoder hyper-parameters can also be specified. Refer to Riva-build Optional Parameters for a list of those parameters and their description.

Flashlight Decoder Lexicon#

The Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the decoder vocabulary file passed to the riva-build command. The decoder vocabulary files used to generate the ASR pipelines in the Quick Start scripts include words that cover a wide range of domains and should provide accurate transcripts for most applications.

It is also possible to build an ASR pipeline with your own decoder vocabulary file by using the --decoding_vocab parameter of the riva-build command. For example, you could start from the riva-build commands used to generate the ASR pipelines in the Quick Start scripts (see Pipeline Configuration) and provide your own decoder vocabulary file. Ensure that the words of interest are in the decoder vocabulary file. Riva ServiceMaker automatically tokenizes the words in the decoder vocabulary file; the number of tokenizations for each word can be controlled with the --flashlight_decoder.num_tokenization parameter.

(Advanced) Manually Adding Additional Tokenizations of Words in Lexicon#

It is also possible to manually add tokenizations for words in the decoder vocabulary by performing the following steps.

The riva-build and riva-deploy commands provided in the previous section store the lexicon in the /data/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-streaming/1/lexicon.txt file of the Triton model repository.

To add additional tokenizations to the lexicon, copy the lexicon file:

cp /data/models/citrinet-1024-en-US-asr-streaming-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt

and add the SentencePiece tokenization for the word of interest. For example, you could add:

manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew

to the decoding_lexicon.txt file so that the word manu is generated in the transcript when the acoustic model predicts those token sequences. Ensure that the new lines follow the same whitespace pattern as the rest of the file and that the tokens used are part of the tokenizer model. Then regenerate the model repository with the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.
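
For example, a sketch of the rebuild step, with all other flags kept as in your original build command (the bracketed filenames are placeholders):

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=flashlight \
    --decoding_lexicon=decoding_lexicon.txt \
    --decoding_language_model_binary=<KENLM_binary_filename>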

Flashlight Decoder Lexicon Free#

The Flashlight decoder can also be used without a lexicon. Lexicon-free decoding is performed with a character-based language model. It is enabled by adding --flashlight_decoder.use_lexicon_free_decoding=True to riva-build and specifying a character-based language model with --decoding_language_model_binary=<path/to/charlm>.
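
That is, the lexicon-based decoder flags are replaced with:

    --decoder_type=flashlight \
    --flashlight_decoder.use_lexicon_free_decoding=True \
    --decoding_language_model_binary=<path/to/charlm>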

OpenSeq2Seq Decoder#

Riva also supports the OpenSeq2Seq (os2s) decoder for beam-search decoding with a language model. For example:

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-1024-zh-CN-asr-streaming \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --endpointing.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=os2s \
   --os2s_decoder.language_model_alpha=0.5 \
   --os2s_decoder.language_model_beta=1.0 \
   --os2s_decoder.beam_search_width=128 \
   --language_code=zh-CN

Where:

  • --os2s_decoder.language_model_alpha is the weight given to the language model during the beam search.

  • --os2s_decoder.language_model_beta is the word insertion score.

  • --os2s_decoder.beam_search_width is the number of partial hypotheses to keep at each step of the beam search.

All of these parameters affect performance: latency increases as their values increase. The suggested ranges are listed below.

Parameter                              Minimum  Maximum
--os2s_decoder.beam_search_width       16       64
--os2s_decoder.language_model_alpha    0.5      1.5
--os2s_decoder.language_model_beta     1.0      3.0

Beginning/End of Utterance Detection#

Riva ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is used to reset the ASR decoder state and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in a 300 ms window contain nonblank characters, and the end of an utterance is flagged when 98% of the frames in an 800 ms window are blank characters. You can tune these values for your particular use case by using the following riva-build parameters:

  --endpointing.start_history=300 \
  --endpointing.start_th=0.2 \
  --endpointing.stop_history=800 \
  --endpointing.stop_th=0.98
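
For example, to flag the end of an utterance more aggressively, you might shorten the stop window. These values are purely illustrative, not tuned recommendations:

  --endpointing.stop_history=500 \
  --endpointing.stop_th=0.95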

Additionally, it is possible to disable the beginning/end of utterance detection by passing --endpointing_type=none to riva-build.

Note that in this case, the decoder state resets only after the client has sent the full audio signal, and the punctuator model is called only once.

Neural-Based Voice Activity Detection#

It is possible to use a neural-based Voice Activity Detection (VAD) algorithm in Riva ASR. This can help filter out noise in the audio and reduce spurious words appearing in the ASR transcripts. To use the neural-based VAD algorithm in the ASR pipeline, pass the following additional parameters to riva-build:

Silero VAD#

<silero_vad_riva_filename>:<encryption_key> \
--vad_type=silero \
--neural_vad_nn.optimization_graph_level=-1 \
--neural_vad.filter_speech_first=false \
--neural_vad.min_duration_on=0.2 \
--neural_vad.onset=0.85 \
--neural_vad.offset=0.6

where:

  • <silero_vad_riva_filename> is the .riva Silero VAD model to use. For example, you can use the Silero VAD Riva model available on NGC.

  • <encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
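
Putting it together, the following is a sketch of a full low-latency streaming build with Silero VAD enabled, combining the flags above with the configuration from Streaming/Offline Recognition (all bracketed names are placeholders):

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key> \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    /servicemaker-dev/<silero_vad_riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --decoder_type=greedy \
    --chunk_size=0.16 \
    --padding_size=1.92 \
    --vad_type=silero \
    --neural_vad_nn.optimization_graph_level=-1 \
    --neural_vad.filter_speech_first=false \
    --neural_vad.min_duration_on=0.2 \
    --neural_vad.onset=0.85 \
    --neural_vad.offset=0.6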

MarbleNet VAD#

<marblenet_vad_riva_filename>:<encryption_key> \
--vad_type=neural \
--neural_vad_nn.optimization_graph_level=-1

where:

  • <marblenet_vad_riva_filename> is the .riva MarbleNet VAD model to use. For example, you can use the MarbleNet VAD Riva model available on NGC.

  • <encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.

Note that using a neural VAD component in the ASR pipeline has an impact on the latency and throughput of the deployed Riva ASR server.

Generating Multiple Transcript Hypotheses#

By default, the Riva ASR pipeline is configured to generate only the best transcript hypothesis for each utterance. You can generate multiple hypotheses by passing the parameter --max_supported_transcripts=N to the riva-build command, where N is the maximum number of hypotheses to generate. The client application can then retrieve the multiple hypotheses by setting the max_alternatives field of RecognitionConfig to a value greater than 1.
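
For example, to build a pipeline that can return up to five hypotheses per utterance, add the following to the riva-build command:

    --max_supported_transcripts=5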

Impact of Chunk Size and Padding Size on Performance and Accuracy (Advanced)#

The chunk_size and padding_size parameters used to configure Riva ASR can have a significant impact on accuracy and performance. A brief description of those parameters can be found in section Riva-build Optional Parameters. Riva provides pre-configured ASR pipelines with preset values of chunk_size and padding_size: a low-latency streaming configuration, a high-throughput streaming configuration, and an offline configuration. Those configurations should suit most deployment scenarios. The chunk_size and padding_size values used for those configurations are listed in section Streaming/Offline Recognition.

The chunk_size parameter is the duration, in seconds, of the audio chunk processed by the Riva server for every streaming request. Hence, in streaming mode, Riva returns one response for every chunk_size seconds of audio. A lower chunk_size therefore reduces the user-perceived latency, since the transcript is updated more frequently.

The padding_size parameter is the duration, in seconds, of the padding prepended and appended to each chunk. For every new chunk of audio it receives, the Riva acoustic model processes an input tensor corresponding to an audio duration of 2 * padding_size + chunk_size. Increasing padding_size or chunk_size typically improves transcript accuracy, since the acoustic model has access to more context. However, increasing padding_size reduces the maximum number of concurrent streams supported by Riva ASR, since it increases the size of the input tensor fed to the acoustic model for every new chunk.
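
As a concrete check, the per-chunk input durations implied by the three preset configurations can be computed directly. The following small bash sketch (assuming the bc calculator is available) uses the chunk_size and padding_size values from the Streaming/Offline Recognition section:

# Acoustic model input duration per chunk: 2 * padding_size + chunk_size
for cfg in "low-latency:0.16:1.92" "high-throughput:0.8:1.6" "offline:4.8:1.6"; do
  IFS=: read -r name chunk pad <<< "$cfg"
  echo "$name: 2*$pad + $chunk = $(echo "2*$pad + $chunk" | bc) seconds"
done

The low-latency and high-throughput streaming configurations both feed the acoustic model about 4 seconds of audio per chunk, while the offline configuration feeds it 8 seconds.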

Sharing Acoustic and Feature Extractor Models Across Multiple ASR Pipelines (Advanced)#

It is possible to configure the Riva ASR service so that multiple ASR pipelines share the same feature extractor and acoustic models, which reduces GPU memory usage. This option can be used, for example, to deploy multiple ASR pipelines where each pipeline uses a different language model but shares the same acoustic model and feature extractor. This is achieved by specifying the parameters acoustic_model_name and featurizer_name in the riva-build command:

riva-build speech_recognition \
    /servicemaker-dev/<rmir_filename>:<encryption_key>  \
    /servicemaker-dev/<riva_filename>:<encryption_key> \
    --name=<pipeline_name> \
    --acoustic_model_name=<acoustic_model_name> \
    --featurizer_name=<featurizer_name> \
    --wfst_tokenizer_model=<wfst_tokenizer_model> \
    --wfst_verbalizer_model=<wfst_verbalizer_model> \
    --decoder_type=greedy

where:

  • <acoustic_model_name> is the user-defined name for the acoustic model component of the ASR pipeline

  • <featurizer_name> is the user-defined name for the feature extractor component of the ASR pipeline

If multiple ASR pipelines are built, each with a different name, but with the same acoustic_model_name and featurizer_name, they will share the same acoustic and feature extractor models.

When running the riva-deploy command, you must pass the -f option to ensure that all the ASR pipelines that share the acoustic model and feature extractor are initialized properly.
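
As a sketch, the following pair of builds produces two pipelines with different language models that share one acoustic model and featurizer. All filenames and names here are illustrative placeholders, and riva-deploy is assumed to take the usual <rmir>:<key> and output-directory arguments:

# Pipeline A: shared acoustic model and featurizer, language model A.
riva-build speech_recognition \
    /servicemaker-dev/pipeline_a.rmir:<encryption_key> \
    /servicemaker-dev/conformer-en-us.riva:<encryption_key> \
    --name=conformer-en-US-lm-a \
    --acoustic_model_name=conformer-en-US-shared \
    --featurizer_name=featurizer-en-US-shared \
    --decoder_type=flashlight \
    --decoding_language_model_binary=lm_a.bin \
    --decoding_vocab=vocab_a.txt

# Pipeline B: same shared component names, language model B.
riva-build speech_recognition \
    /servicemaker-dev/pipeline_b.rmir:<encryption_key> \
    /servicemaker-dev/conformer-en-us.riva:<encryption_key> \
    --name=conformer-en-US-lm-b \
    --acoustic_model_name=conformer-en-US-shared \
    --featurizer_name=featurizer-en-US-shared \
    --decoder_type=flashlight \
    --decoding_language_model_binary=lm_b.bin \
    --decoding_vocab=vocab_b.txt

# Deploy both with -f so the shared models are initialized properly.
riva-deploy -f /servicemaker-dev/pipeline_a.rmir:<encryption_key> /data/models
riva-deploy -f /servicemaker-dev/pipeline_b.rmir:<encryption_key> /data/models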

Note

<acoustic_model_name> and <featurizer_name> are global and can conflict across model pipelines. Override them only when you know what other models will be deployed and you want to share the featurizer and/or acoustic models across different ASR pipelines. When specifying <acoustic_model_name>, make sure that there will not be any incompatibilities in acoustic model weights or input shapes. Similarly, when specifying <featurizer_name>, make sure that all ASR pipelines with the same <featurizer_name> use the same feature extractor parameters.

Riva-build Optional Parameters#

For details about the parameters passed to riva-build to customize the ASR pipeline, issue:

riva-build speech_recognition -h

The following list includes descriptions for all optional parameters currently recognized by riva-build: