Pipeline Configuration#
In the simplest use case, you can deploy an ASR pipeline to be used with the StreamingRecognize API call without any language model. Refer to riva/proto/riva_asr.proto for details.
riva-build speech_recognition \
/riva_build_deploy/<rmir_filename>:<encryption_key> \
/riva_build_deploy/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=greedy
where:
<rmir_filename> is the Riva RMIR file that is generated.
<riva_filename> is the name of the riva file to use as input.
<encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models that are uploaded on NGC is tlt_encode.
<name>, <acoustic_model_name>, and <featurizer_name> are optional user-defined names for the components in the model repository.
<wfst_tokenizer_model> is the name of the WFST tokenizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
<wfst_verbalizer_model> is the name of the WFST verbalizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
decoder_type is the type of decoder to use. Valid values are flashlight, greedy, and nemo. We recommend using flashlight for all CTC models. Refer to Decoder Hyper-Parameters for more details.
Upon successful completion of this command, a file named <rmir_filename> is created. Since
no language model is specified, the Riva greedy decoder is used to predict the transcript based on the output of the acoustic model. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR and Riva filenames. Otherwise, this is not necessary.
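For example, a minimal greedy build might look like the following sketch (the file names here are placeholders, and tlt_encode is the encryption key for the pre-trained NGC models):
riva-build speech_recognition \
/riva_build_deploy/conformer-en-US-greedy.rmir:tlt_encode \
/riva_build_deploy/conformer-en-US.riva:tlt_encode \
--name=conformer-en-US-asr-streaming \
--decoder_type=greedy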
The following summary lists the riva-build commands used to generate the RMIR files for different models and modes, along with their limitations:
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=conformer-es-US-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--endpointing.start_history=200 \
--nn.fp16_needs_obey_precision_pass \
--endpointing.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=es-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=conformer-es-US-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--endpointing.start_history=200 \
--nn.fp16_needs_obey_precision_pass \
--endpointing.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=es-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=conformer-es-US-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--endpointing.start_history=200 \
--nn.fp16_needs_obey_precision_pass \
--endpointing.residue_blanks_at_start=-2 \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--max_batch_size=16 \
--featurizer.max_batch_size=512 \
--featurizer.max_execution_batch_size=512 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_file> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=es-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-0.6b-unified-en-US-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.16 \
--left_padding_size=3.92 \
--right_padding_size=3.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=<txt_decoding_lexicon_file> \
--flashlight_decoder.lm_weight=0.1 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=en-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.64 --streaming_diarizer.right_context_size=0.56.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-0.6b-unified-en-US-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.96 \
--left_padding_size=3.92 \
--right_padding_size=3.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=<txt_decoding_lexicon_file> \
--flashlight_decoder.lm_weight=0.1 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=en-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.96 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=parakeet-0.6b-unified-en-US-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=<txt_decoding_lexicon_file> \
--flashlight_decoder.lm_weight=0.1 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=en-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=4.8 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--name=parakeet-0.6b-unified-ml-cs-es-US-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-16 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.32 \
--left_padding_size=3.92 \
--right_padding_size=3.92 \
--decoder_chunk_size=0.96 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--language_code=es-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.64 --streaming_diarizer.right_context_size=0.56.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--name=parakeet-0.6b-unified-ml-cs-es-US-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-16 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.96 \
--left_padding_size=3.92 \
--right_padding_size=3.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--language_code=es-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.96 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--offline \
--name=parakeet-0.6b-unified-ml-cs-es-US-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-16 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--language_code=es-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=4.8 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-0.6b-unified-vi-VN-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.use_trt_fp32 \
--unified_acoustic_model \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_chunk_size=0.96 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=<txt_decoding_lexicon_file> \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.lm_weight=0.5 \
--flashlight_decoder.word_insertion_score=1 \
--flashlight_decoder.asr_model_delay=-1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=vi-VN
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.64 --streaming_diarizer.right_context_size=0.56.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-0.6b-unified-vi-VN-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.use_trt_fp32 \
--unified_acoustic_model \
--chunk_size=0.96 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=<txt_decoding_lexicon_file> \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.lm_weight=0.5 \
--flashlight_decoder.word_insertion_score=1 \
--flashlight_decoder.asr_model_delay=-1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=vi-VN
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.96 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=parakeet-0.6b-unified-vi-VN-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--nn.use_trt_fp32 \
--unified_acoustic_model \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=<txt_decoding_lexicon_file> \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.lm_weight=0.5 \
--flashlight_decoder.word_insertion_score=1 \
--flashlight_decoder.asr_model_delay=-1 \
--endpointing.residue_blanks_at_start=-2 \
--profane_words_file=<txt_profane_words_file> \
--language_code=vi-VN
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=4.8 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-0.6b-unified-zh-CN-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.16 \
--left_padding_size=3.92 \
--right_padding_size=3.92 \
--decoder_chunk_size=0.96 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=30. \
--flashlight_decoder.lm_weight=0.4 \
--flashlight_decoder.word_insertion_score=1.5 \
--profane_words_file=<txt_profane_words_file> \
--language_code=zh-CN \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.64 --streaming_diarizer.right_context_size=0.56.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation: No separate model is required; the ASR model automatically generates punctuated text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-0.6b-unified-zh-CN-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.96 \
--left_padding_size=3.92 \
--right_padding_size=3.92 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=30. \
--flashlight_decoder.lm_weight=0.4 \
--flashlight_decoder.word_insertion_score=1.5 \
--profane_words_file=<txt_profane_words_file> \
--language_code=zh-CN \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.96 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation: No separate model is required; the ASR model automatically generates punctuated text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=parakeet-0.6b-unified-zh-CN-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=30. \
--flashlight_decoder.lm_weight=0.4 \
--flashlight_decoder.word_insertion_score=1.5 \
--profane_words_file=<txt_profane_words_file> \
--language_code=zh-CN \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=4.8 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation: No separate model is required; the ASR model automatically generates punctuated text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-1.1b-en-US-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.fp16_needs_obey_precision_pass \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=en-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.64 --streaming_diarizer.right_context_size=0.56.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--name=parakeet-1.1b-en-US-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--nn.fp16_needs_obey_precision_pass \
--chunk_size=0.96 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=en-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.96 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=parakeet-1.1b-en-US-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--nn.fp16_needs_obey_precision_pass \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_decoding_vocab_file> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.num_tokenization=1 \
--profane_words_file=<txt_profane_words_file> \
--language_code=en-US \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file> \
--speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder-related parameters in the above command with --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=4.8 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US
riva-build speech_recognition <rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-streaming \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=0.32 \
--left_padding_size=4.64 \
--right_padding_size=4.64 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--max_batch_size=32 \
--nn.max_batch_size=32 \
--nn.opt_batch_size=32 \
--endpointing_type=niva \
--endpointing.stop_history=800 \
--endpointing.stop_th=1.0 \
--endpointing.residue_blanks_at_end=0 \
--nemo_decoder.use_stateful_decoding \
--decoder_type=nemo
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: --nemo_decoder.language_model_alpha=0.5 --nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, refer to the nvidia-riva/tutorials repository.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=1.6 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: Not supported.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition <rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-streaming-throughput \
--return_separate_utterances=False \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--endpointing.residue_blanks_at_start=-2 \
--language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=1.6 \
--left_padding_size=4.0 \
--right_padding_size=4.0 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--max_batch_size=64 \
--nn.opt_batch_size=64 \
--endpointing_type=niva \
--endpointing.stop_history=800 \
--endpointing.stop_th=1.0 \
--endpointing.residue_blanks_at_end=0 \
--nemo_decoder.use_stateful_decoding \
--decoder_type=nemo
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: --nemo_decoder.language_model_alpha=0.5 --nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, refer to the nvidia-riva/tutorials repository.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=1.6 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: Not supported.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition <rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--offline \
--name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=8.0 \
--left_padding_size=0 \
--right_padding_size=0 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--max_batch_size=128 \
--nn.opt_batch_size=128 \
--endpointing_type=niva \
--endpointing.stop_history=0 \
--decoder_type=nemo
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: --nemo_decoder.language_model_alpha=0.5 --nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, refer to the nvidia-riva/tutorials repository.
Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=1.6 --streaming_diarizer.right_context_size=0.
Voice Activity Detection: Not supported.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition <rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<txt_profane_words_file> \
--offline \
--name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=80 \
--language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
--nn.fp16_needs_obey_precision_pass \
--unified_acoustic_model \
--chunk_size=8.0 \
--left_padding_size=0 \
--right_padding_size=0 \
--featurizer.max_batch_size=256 \
--featurizer.max_execution_batch_size=256 \
--max_batch_size=128 \
--nn.opt_batch_size=128 \
--endpointing_type=niva \
--endpointing.stop_history=0 \
--decoder_type=nemo
Note
GPU-based Language Model: Not supported.
Speaker Diarization: Not supported.
Voice Activity Detection: Not supported.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=whisper-large-v3-multi-asr-offline \
--return_separate_utterances=True \
--chunk_size 30 \
--left_padding_size 0 \
--right_padding_size 0 \
--decoder_type trtllm \
--unified_acoustic_model \
--feature_extractor_type torch \
--featurizer.norm_per_feature false \
--max_batch_size 8 \
--featurizer.precalc_norm_params False \
--featurizer.max_batch_size=8 \
--featurizer.max_execution_batch_size=8 \
--language_code=en,zh,de,es,ru,ko,fr,ja,pt,tr,pl,ca,nl,ar,sv,it,id,hi,fi,vi,he,uk,el,ms,cs,ro,da,hu,ta,no,th,ur,hr,bg,lt,la,mi,ml,cy,sk,te,fa,lv,bn,sr,az,sl,kn,et,mk,br,eu,is,hy,ne,mn,bs,kk,sq,sw,gl,mr,pa,si,km,sn,yo,so,af,oc,ka,be,tg,sd,gu,am,yi,lo,uz,fo,ht,ps,tk,nn,mt,sa,lb,my,bo,tl,mg,as,tt,haw,ln,ha,ba,jw,su,yue,multi
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--profane_words_file=<profane_words_file> \
--offline \
--name=canary-1b-multi-asr-offline \
--return_separate_utterances=True \
--unified_acoustic_model \
--language_code=en-US,ar-AR,bg-BG,ca-ES,cs-CZ,da-DK,de-AT,de-CH,de-DE,el-GR,el-IL,et-EE,en-AM,en-AU,en-CA,en-EU,en-GB,en-IN,en-ME,en-MY,en-PH,en-SA,en-SG,en-UA,en-ZA,es-AR,es-CL,es-ES,es-LA,es-PY,es-UY,es-US,es-MX,fi-FI,fr-BE,fr-CA,fr-CH,fr-FR,he-IL,hi-IN,hu-HU,hr-HR,id-ID,it-IT,it-CH,lt-LT,lv-LV,ja-JP,km-KH,ko-KR,my-MM,nb-NO,nn-NO,nl-NL,nl-BE,nn-NB,pl-PL,pt-BR,pt-PT,ro-RO,ru-AM,ru-RU,ru-UA,sk-SK,sl-SI,sv-SE,th-TH,tr-TR,uk-UA,vi-VN,zh-CN,zh-TW \
--chunk_size 30 \
--left_padding_size 0 \
--right_padding_size 0 \
--feature_extractor_type torch \
--torch_feature_type nemo \
--max_batch_size 8 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_params=False \
--featurizer.max_batch_size=128 \
--featurizer.max_execution_batch_size=128 \
--ms_per_timestep=80 \
--share_flags=True \
--featurizer.norm_per_feature false \
--decoder_type trtllm \
--trtllm_decoder.max_output_len 200 \
--trtllm_decoder.decoupled_mode true
For details about the parameters passed to riva-build to customize the ASR pipeline, run:
riva-build <pipeline> -h
Fine-Tuning Artifacts#
Riva ASR models can be fine-tuned for specific domains, languages, or use cases to enhance accuracy and performance. Fine-tuning generates specialized artifacts that extend the capabilities of your ASR pipeline configuration. These artifacts include domain-optimized acoustic models, language models, speaker diarization models, and punctuation/capitalization models tailored for specific scenarios.
The table below provides direct links to fine-tuning artifacts for various Riva ASR models. These artifacts are available on NGC (NVIDIA GPU Cloud) and can be integrated into your ASR pipeline using the riva-build command with the appropriate parameters.
Streaming/Offline Recognition#
You can configure the Riva ASR pipeline for both streaming and offline recognition use cases. When using the StreamingRecognize API call, we recommend the following riva-build parameters for low-latency streaming recognition with the Conformer acoustic model. Refer to riva/proto/riva_asr.proto for details.
riva-build speech_recognition \
/riva_build_deploy/<rmir_filename>:<encryption_key> \
/riva_build_deploy/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--wfst_tokenizer_model=<wfst_tokenizer_model> \
--wfst_verbalizer_model=<wfst_verbalizer_model> \
--decoder_type=greedy \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=40 \
--nn.fp16_needs_obey_precision_pass \
--greedy_decoder.asr_model_delay=-1 \
--endpointing.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
For high-throughput streaming recognition with the StreamingRecognize API call, you can set the chunk_size and padding_size parameters:
--chunk_size=0.8 \
--padding_size=1.6
Finally, to configure the ASR pipeline for offline recognition with the Recognize API call (refer to riva/proto/riva_asr.proto), we recommend the following settings with the Parakeet-CTC acoustic model:
--offline \
--chunk_size=4.8 \
--padding_size=1.6
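Assembled into one command, an offline Parakeet-CTC greedy build might look like the following sketch (the file names are placeholders; add the language model and inverse text normalization flags shown earlier as needed):
riva-build speech_recognition \
/riva_build_deploy/parakeet-ctc-offline.rmir:tlt_encode \
/riva_build_deploy/parakeet-ctc.riva:tlt_encode \
--name=parakeet-ctc-en-US-asr-offline \
--offline \
--chunk_size=4.8 \
--padding_size=1.6 \
--ms_per_timestep=80 \
--decoder_type=greedy \
--greedy_decoder.asr_model_delay=-1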
Note
When you deploy the offline ASR models with riva-deploy, TensorRT warnings might appear in the logs indicating that the memory requirements of format conversion cannot be satisfied. These warnings do not affect functionality and you can ignore them.
Language Models#
Riva ASR supports decoding with an n-gram language model. You can provide the n-gram language model in one of the following formats:
A .arpa format file or a KenLM binary format file for CTC models
A .nemo format file for RNNT models
Language Model for Parakeet-CTC Models#
ARPA Format Language Model#
To configure the Riva ASR pipeline to use an n-gram language model stored in ARPA format, replace:
--decoder_type=greedy
with
--decoder_type=flashlight \
--decoding_language_model_arpa=<arpa_filename> \
--decoding_vocab=<decoder_vocab_file>
KenLM Binary Language Model#
To generate the Riva RMIR file when using a KenLM binary file to specify the language model, replace:
--decoder_type=greedy
with
--decoder_type=flashlight \
--decoding_language_model_binary=<KENLM_binary_filename> \
--decoding_vocab=<decoder_vocab_file>
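For example, a streaming build that swaps the greedy decoder for Flashlight with a KenLM binary might look like this sketch (the LM and vocabulary file names are hypothetical):
riva-build speech_recognition \
<rmir_filename>:<encryption_key> \
<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=flashlight \
--decoding_language_model_binary=domain_4gram.bin \
--decoding_vocab=decoding_vocab.txt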
Decoder Hyper-Parameters#
You can also specify the decoder and language model hyper-parameters from the riva-build command. For the Flashlight decoder, you can specify beam_size, beam_size_token, beam_threshold, lm_weight, and word_insertion_score.
--decoder_type=flashlight \
--decoding_language_model_binary=<arpa_filename> \
--decoding_vocab=<decoder_vocab_file> \
--flashlight_decoder.beam_size=<beam_size> \
--flashlight_decoder.beam_size_token=<beam_size_token> \
--flashlight_decoder.beam_threshold=<beam_threshold> \
--flashlight_decoder.lm_weight=<lm_weight> \
--flashlight_decoder.word_insertion_score=<word_insertion_score>
Where:
beam_size is the maximum number of hypotheses the decoder holds at each step.
beam_size_token is the maximum number of tokens the decoder considers at each step.
beam_threshold is the threshold to prune hypotheses.
lm_weight is the weight of the language model that is used when scoring hypotheses.
word_insertion_score is the word insertion score that is used when scoring hypotheses.
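As a concrete illustration, the following hypothetical settings use values in the range of the preset pipelines above:
--decoder_type=flashlight \
--decoding_language_model_binary=lm.bin \
--decoding_vocab=decoding_vocab.txt \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=32 \
--flashlight_decoder.beam_threshold=20. \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2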
Flashlight Decoder Lexicon#
The Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the decoder vocabulary file passed to the riva-build command. The decoder vocabulary file used to generate the ASR pipelines includes words that cover a wide range of domains and should provide accurate transcripts for most applications.
You can also build an ASR pipeline using your own decoder vocabulary file by using the --decoding_vocab parameter of the riva-build command.
For example, you can start with the riva-build commands that are used to generate the ASR pipelines in our pipeline configuration section and provide your own lexicon decoder vocabulary file. Refer to Pipeline Configuration for details.
The Riva ServiceMaker automatically tokenizes the words in the decoder vocabulary file, so double-check that words of interest are included. You can control the number of tokenizations for each word in the decoder vocabulary file with the --flashlight_decoder.num_tokenization parameter.
(Advanced) Manually Adding Additional Tokenizations of Words in Lexicon#
It is also possible to manually add additional tokenizations for the words in the decoder vocabulary by performing the following steps:
The riva-build and riva-deploy commands provided in the previous section store the lexicon in the /data/models/parakeet-1.1b-en-US-asr-streaming-asr-bls-ensemble/1/lexicon.txt file of the Triton model repository.
To add additional tokenizations to the lexicon, copy the lexicon file:
cp /data/models/parakeet-1.1b-en-US-asr-streaming-asr-bls-ensemble/1/lexicon.txt decoding_lexicon.txt
and add the SentencePiece tokenization for the word of interest. For example, you could add:
manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew
to the decoding_lexicon.txt file so that the word manu is generated in the transcript if the acoustic model predicts those tokens. Ensure that the new lines follow the same indentation/spacing pattern as the rest of the file and that the tokens used are part of the tokenizer model. After this is done, regenerate the model repository using the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.
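The full workflow might look like the following sketch (paths are illustrative, and the trailing placeholders stand in for the rest of your original riva-build flags):
cp /data/models/parakeet-1.1b-en-US-asr-streaming-asr-bls-ensemble/1/lexicon.txt decoding_lexicon.txt
# edit decoding_lexicon.txt to append the new tokenizations, then rebuild:
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_lexicon=decoding_lexicon.txt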
Flashlight Decoder Lexicon Free#
The Flashlight decoder can also be used without a lexicon. Lexicon-free decoding is performed with a character-based language model. It can be enabled by adding --flashlight_decoder.use_lexicon_free_decoding=True to riva-build and specifying a character-based language model via --decoding_language_model_binary=<path/to/charlm>.
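For example (a sketch; the character-based LM file name is a placeholder):
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--decoder_type=flashlight \
--flashlight_decoder.use_lexicon_free_decoding=True \
--decoding_language_model_binary=char_lm.bin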
GPU-LM for Parakeet-RNNT Models#
To configure the Riva ASR pipeline to use a GPU-based n-gram language model (GPU-LM) stored in .nemo format, add:
--nemo_decoder.language_model_alpha=0.5 \
--nemo_decoder.language_model_file=<GPU_LM.nemo file>
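In the context of a Parakeet-RNNT build, these flags sit alongside the NeMo decoder options, for example (a sketch; the .nemo LM file name is a placeholder):
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--decoder_type=nemo \
--nemo_decoder.use_stateful_decoding \
--nemo_decoder.language_model_alpha=0.5 \
--nemo_decoder.language_model_file=gpu_lm.nemo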
Beginning/End of Utterance Detection#
Riva ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is used to reset the ASR decoder state and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in a 300 ms window are non-blank characters, and the end of an utterance is flagged when 98% of the frames in an 800 ms window are blank characters. You can tune these values for your particular use case by using the following riva-build parameters:
--endpointing.start_history=300 \
--endpointing.start_th=0.2 \
--endpointing.stop_history=800 \
--endpointing.stop_th=0.98
Additionally, it is possible to disable the beginning/end of utterance detection by passing --endpointing_type=none to riva-build.
Note that in this case, the decoder state resets after the full audio signal has been sent by the client. Similarly, the punctuator model is only called once.
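For example, to disable endpointing in an otherwise default greedy pipeline (a sketch with placeholder file names):
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--decoder_type=greedy \
--endpointing_type=none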
Streaming Speaker Diarization#
Riva currently supports speaker diarization in streaming mode via the Sortformer Diarizer model. For more details on Sortformer speaker diarization, refer to the Streaming Speaker Diarization section in the ASR Overview.
Sortformer#
To enable Sortformer speaker diarization in the ASR pipeline, pass the following additional parameters to riva-build when building a streaming ASR model:
<sortformer_diarizer_riva_filename>:<encryption_key>
--diarizer_type=sortformer
where:
<sortformer_diarizer_riva_filename> is the .riva Sortformer model to use. For example, you can use the Sortformer Diarizer Riva model available on NGC.
<encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
Note: Sortformer currently supports a maximum of 4 speakers.
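Put together with a streaming ASR build, this might look like the following sketch (file names are placeholders; the diarizer chunk sizes follow the low-latency streaming notes above, and the remaining flags of your streaming ASR command are unchanged):
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
<sortformer_diarizer_riva_filename>:<key> \
--diarizer_type=sortformer \
--streaming_diarizer.center_chunk_size=0.64 \
--streaming_diarizer.right_context_size=0.56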
Neural-Based Voice Activity Detection#
It is possible to use a neural-based Voice Activity Detection (VAD) algorithm in Riva ASR. This can help filter out noise in the audio and reduce spurious words in the ASR transcripts.
To use the neural-based VAD algorithm in the ASR pipeline, pass the following additional parameters to riva-build:
Silero VAD#
<silero_vad_riva_filename>:<encryption_key>
--vad_type=silero
--neural_vad_nn.optimization_graph_level=-1
--neural_vad.filter_speech_first false
--neural_vad.onset=0.85
--neural_vad.offset=0.3
--neural_vad.min_duration_on=0.2
--neural_vad.min_duration_off=0.5
--neural_vad.pad_offset=0.08
--neural_vad.pad_onset=0.3
--neural_vad.features_mask_value=-16.635
where:
<silero_vad_riva_filename> is the .riva Silero VAD model to use. For example, you can use the Silero VAD Riva model available on NGC.
<encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
--neural_vad.onset is the minimum probability threshold for detecting the start of a speech segment.
--neural_vad.offset is the minimum probability threshold for detecting the end of a speech segment.
--neural_vad.min_duration_on is the minimum duration of a speech segment to be considered as a speech segment.
--neural_vad.min_duration_off is the minimum duration of a non-speech segment to be considered as a non-speech segment.
--neural_vad.pad_onset is the duration of audio (in seconds) to pad the onset of a speech segment.
--neural_vad.pad_offset is the duration of audio (in seconds) to pad the offset of a speech segment.
--neural_vad.features_mask_value is the value to use to mask the features of a non-speech segment.
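Assembled into a build command, the VAD additions might look like this sketch (file names are placeholders; the threshold values are those recommended in the notes above):
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
<silero_vad_riva_filename>:<key> \
--vad_type=silero \
--neural_vad_nn.optimization_graph_level=-1 \
--neural_vad.filter_speech_first=false \
--neural_vad.onset=0.85 \
--neural_vad.offset=0.3 \
--neural_vad.min_duration_on=0.2 \
--neural_vad.min_duration_off=0.5 \
--neural_vad.pad_onset=0.3 \
--neural_vad.pad_offset=0.08 \
--enable_vad_endpointing=true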
Several of these parameters can be configured at runtime using the custom_configuration parameter. The configurable parameters are:
onset
offset
min_duration_on
min_duration_off
pad_onset
pad_offset
Example of runtime configuration:
--custom_configuration="neural_vad.onset:0.9,neural_vad.offset:0.4,neural_vad.min_duration_on:0.3,neural_vad.min_duration_off:0.6"
MarbleNet VAD#
<marblenet_vad_riva_filename>:<encryption_key>
--vad_type=neural
--neural_vad_nn.optimization_graph_level=-1
where:
<marblenet_vad_riva_filename> is the .riva MarbleNet VAD model to use. For example, you can use the MarbleNet VAD Riva model available on NGC.
<encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
Note that using a neural VAD component in the ASR pipeline has an impact on the latency and throughput of the deployed Riva ASR server.
Generating Multiple Transcript Hypotheses#
By default, the Riva ASR pipeline is configured to only generate the best transcript hypothesis for
each utterance. It is possible to generate multiple transcript hypotheses by passing the parameter
--max_supported_transcripts=N to the riva-build command, where N is the maximum number of
hypotheses to generate. With these changes, the client application can retrieve the multiple hypotheses
by setting the max_alternatives field of RecognitionConfig to values greater than 1.
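For example, to allow up to four hypotheses per utterance (a sketch with placeholder file names):
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--max_supported_transcripts=4
A client can then set max_alternatives in RecognitionConfig to any value up to 4 to receive that many ranked alternatives per result.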
Impact of Chunk Size and Padding Size on Performance and Accuracy (Advanced)#
The chunk_size and padding_size parameters used to configure Riva ASR can have a significant impact on accuracy and performance. Riva provides pre-configured ASR pipelines with preset values of chunk_size and padding_size: a low-latency streaming configuration, a high-throughput streaming configuration, and an offline configuration. You can find the chunk_size and padding_size values used for those configurations in a table in the pipeline configuration section. Refer to Pipeline Configuration for details.
The chunk_size parameter is the duration of the audio chunk in seconds processed by the Riva server for every streaming request. Hence, in streaming mode, Riva returns one response for every chunk_size seconds of audio. A lower value of chunk_size will therefore reduce the user-perceived latency as the transcript will get updated more frequently.
The padding_size parameter is the duration in seconds of the padding prepended and appended to the chunk_size. The Riva acoustic model processes an input tensor corresponding to an audio duration of 2*(padding_size) + chunk_size for every new chunk of audio it receives. Increasing padding_size or chunk_size typically helps to improve accuracy of the transcripts since the acoustic model has access to more context. However, increasing padding_size reduces the maximum number of concurrent streams supported by Riva ASR, since it will increase the size of the input tensor fed to the acoustic model for every new chunk.
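For example, in the offline configuration above (chunk_size=4.8 and padding_size=1.6), each chunk produces an input tensor covering 2 * 1.6 + 4.8 = 8.0 seconds of audio.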