Pipeline Configuration
In the simplest use case, you can deploy an ASR pipeline to be used with the StreamingRecognize API call without any language model. Refer to riva/proto/riva_asr.proto for details.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<encryption_key> \
'source_path=[<riva_filename>:<encryption_key>]' \
name=<pipeline_name> \
wfst_tokenizer_model=<wfst_tokenizer_model> \
wfst_verbalizer_model=<wfst_verbalizer_model> \
decoder=greedy
where:
- <rmir_filename> is the Riva RMIR file that is generated.
- <riva_filename> is the name of the .riva file to use as input.
- <encryption_key> is the key used to encrypt the files. The encryption key for the pre-trained Riva models that are uploaded on NGC is tlt_encode.
- name is an optional user-defined name for the pipeline in the model repository.
- <wfst_tokenizer_model> is the name of the WFST tokenizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
- <wfst_verbalizer_model> is the name of the WFST verbalizer model file to use for inverse text normalization of ASR transcripts. Refer to Inverse Text Normalization for more details.
- decoder is the type of decoder to use. Valid values are flashlight, greedy, and nemo. We recommend using flashlight for all CTC models. Refer to Decoder Hyper-Parameters for more details.
Upon successful completion of this command, a file named <rmir_filename> is created. Because no language model is specified, the Riva greedy decoder predicts the transcript based on the output of the acoustic model. If your .riva archives are encrypted, include :<encryption_key> at the end of the RMIR and Riva filenames; otherwise, omit it.
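As a concrete illustration, a minimal greedy streaming build for a pre-trained NGC model might look like the following sketch. The paths and pipeline name are placeholders, and tlt_encode is the NGC encryption key mentioned above; depending on the model, you may also want the wfst_tokenizer_model and wfst_verbalizer_model arguments shown in the template.

```shell
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=/servicemaker-dev/asr-streaming.rmir:tlt_encode \
'source_path=[/servicemaker-dev/parakeet-0.6b-en-us.riva:tlt_encode]' \
name=my-asr-streaming-pipeline \
decoder=greedy
```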
The following summary lists the riva-build commands used to generate the RMIR files for different models and modes, along with their limitations:
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-en-US-asr-streaming \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=0.16 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder_chunk_size=0.96 \
decoder=flashlight \
flashlight_decoder.asr_model_delay=-1 \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_lexicon_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
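As a rough sanity check on these chunking parameters: with an 80 ms model stride, the low-latency streaming configuration above processes 160 ms of new audio per step inside a roughly 4 s context window. A small Python sketch of the arithmetic (the variable names are illustrative, not riva-build parameters):

```python
# Values taken from the streaming low-latency command above.
ms_per_timestep = 80   # acoustic model output stride, in milliseconds
chunk_size = 0.16      # seconds of new audio consumed per streaming step
left_padding = 1.92    # seconds of left context prepended to each chunk
right_padding = 1.92   # seconds of right context appended to each chunk

# Total audio window seen by the acoustic model on each inference call.
window_s = left_padding + chunk_size + right_padding   # ~4.0 s

# Acoustic-model timesteps emitted for each new chunk of audio.
steps_per_chunk = chunk_size * 1000 / ms_per_timestep  # ~2 timesteps

print(f"window: {window_s:.2f} s, new timesteps per chunk: {steps_per_chunk:.0f}")
```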
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.64 streaming_diarizer.right_context_size=0.64 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build --config-path=pkg://servicemaker.configs.punctuation --config-name=base output_path=<rmir_filename>:<key> 'source_path=[<riva_file>:<key>]' name=riva-punctuation language_code=<language_code>
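Putting the Greedy Mode note into practice, a sketch of the low-latency command with the flashlight_decoder and language-model arguments removed and the greedy parameters added (placeholders unchanged; other parameters from the original command can be kept as needed):

```shell
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
name=parakeet-0.6b-en-US-asr-streaming \
ms_per_timestep=80 \
chunk_size=0.16 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder=greedy \
greedy_decoder.asr_model_delay=-1 \
language_code=en-US
```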
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-en-US-asr-streaming-throughput \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=0.96 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder=flashlight \
flashlight_decoder.asr_model_delay=-1 \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_lexicon_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=256 streaming_diarizer_nn.fifo_len=112.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build --config-path=pkg://servicemaker.configs.punctuation --config-name=base output_path=<rmir_filename>:<key> 'source_path=[<riva_file>:<key>]' name=riva-punctuation language_code=<language_code>
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-en-US-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=4.8 \
left_padding_size=1.6 \
right_padding_size=1.6 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
decoder=flashlight \
flashlight_decoder.asr_model_delay=-1 \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_lexicon_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
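For contrast with the streaming builds, the offline configuration trades latency for throughput: each inference covers 4.8 s of audio with 1.6 s of context on each side, and the featurizer batches up to 256 utterances. The same back-of-the-envelope arithmetic (illustrative variable names only):

```python
# Values taken from the offline command above.
ms_per_timestep = 80   # acoustic model stride, in milliseconds
chunk_size = 4.8       # seconds of audio per offline inference chunk
padding = 1.6          # seconds of context on each side

window_s = padding + chunk_size + padding              # ~8.0 s per window
steps_per_chunk = chunk_size * 1000 / ms_per_timestep  # ~60 timesteps

print(f"window: {window_s:.1f} s, timesteps per chunk: {steps_per_chunk:.0f}")
```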
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=4.8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=480 streaming_diarizer_nn.spkcache_len=332 streaming_diarizer_nn.fifo_len=120.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build --config-path=pkg://servicemaker.configs.punctuation --config-name=base output_path=<rmir_filename>:<key> 'source_path=[<riva_file>:<key>]' name=riva-punctuation language_code=<language_code>
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-ml-cs-es-US-asr-streaming \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-16 \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.32 \
left_padding_size=3.92 \
right_padding_size=3.92 \
decoder_chunk_size=0.96 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=es-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.64 streaming_diarizer.right_context_size=0.64 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-ml-cs-es-US-asr-streaming-throughput \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-16 \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.96 \
left_padding_size=3.92 \
right_padding_size=3.92 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=es-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=256 streaming_diarizer_nn.fifo_len=112.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-ml-cs-es-US-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-16 \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=4.8 \
left_padding_size=1.6 \
right_padding_size=1.6 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=es-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=4.8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=480 streaming_diarizer_nn.spkcache_len=332 streaming_diarizer_nn.fifo_len=120.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-vi-VN-asr-streaming \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.use_trt_fp32=True \
unified_acoustic_model=true \
chunk_size=0.16 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder_chunk_size=0.96 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_lexicon=<txt_decoding_lexicon_file> \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_size_token=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.lm_weight=0.5 \
flashlight_decoder.word_insertion_score=1 \
flashlight_decoder.asr_model_delay=-1 \
language_code=vi-VN
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.64 streaming_diarizer.right_context_size=0.64 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-vi-VN-asr-streaming-throughput \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.use_trt_fp32=True \
unified_acoustic_model=true \
chunk_size=0.96 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_lexicon=<txt_decoding_lexicon_file> \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_size_token=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.lm_weight=0.5 \
flashlight_decoder.word_insertion_score=1 \
flashlight_decoder.asr_model_delay=-1 \
language_code=vi-VN
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=256 streaming_diarizer_nn.fifo_len=112.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-vi-VN-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
nn.use_trt_fp32=True \
unified_acoustic_model=true \
chunk_size=4.8 \
left_padding_size=1.6 \
right_padding_size=1.6 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_lexicon=<txt_decoding_lexicon_file> \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_size_token=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.lm_weight=0.5 \
flashlight_decoder.word_insertion_score=1 \
flashlight_decoder.asr_model_delay=-1 \
endpointing.residue_blanks_at_start=-2 \
language_code=vi-VN
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=4.8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=480 streaming_diarizer_nn.spkcache_len=332 streaming_diarizer_nn.fifo_len=120.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-zh-CN-asr-streaming \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.16 \
left_padding_size=3.92 \
right_padding_size=3.92 \
decoder_chunk_size=0.96 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_vocab_file> \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_size_token=32 \
flashlight_decoder.beam_threshold=30. \
flashlight_decoder.lm_weight=0.4 \
flashlight_decoder.word_insertion_score=1.5 \
language_code=zh-CN \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.64 streaming_diarizer.right_context_size=0.64 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation: No separate model is required; the ASR model automatically generates punctuated text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-zh-CN-asr-streaming-throughput \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.96 \
left_padding_size=3.92 \
right_padding_size=3.92 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_vocab_file> \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_size_token=32 \
flashlight_decoder.beam_threshold=30. \
flashlight_decoder.lm_weight=0.4 \
flashlight_decoder.word_insertion_score=1.5 \
language_code=zh-CN \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=256 streaming_diarizer_nn.fifo_len=112.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation: No separate model required, the ASR model automatically generates punctuated text.
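The timing arguments in these commands are related by simple arithmetic: the acoustic model sees left_padding_size + chunk_size + right_padding_size seconds of audio per inference window, and each chunk_size of new audio yields chunk_size / (ms_per_timestep / 1000) output timesteps. A minimal sketch with the streaming values above (the helper functions are illustrative, not riva-build internals):

```python
# Illustrative arithmetic for the riva-build timing parameters above.
def window_seconds(chunk_size, left_padding, right_padding):
    """Audio span (seconds) the acoustic model sees per inference window."""
    return round(left_padding + chunk_size + right_padding, 2)

def timesteps_per_chunk(chunk_size, ms_per_timestep):
    """Acoustic-model output frames produced for each new chunk of audio."""
    return round(chunk_size * 1000 / ms_per_timestep)

# chunk_size=0.96, left/right padding=3.92, ms_per_timestep=80 as above.
print(window_seconds(0.96, 3.92, 3.92))  # 8.8
print(timesteps_per_chunk(0.96, 80))     # 12
```

The same check applies to the offline configuration below, where chunk_size=4.8 with 1.6 s of padding on each side gives an 8.0 s window.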
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-0.6b-unified-zh-CN-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=4.8 \
left_padding_size=1.6 \
right_padding_size=1.6 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
decoder=flashlight \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_vocab_file> \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_size_token=32 \
flashlight_decoder.beam_threshold=30. \
flashlight_decoder.lm_weight=0.4 \
flashlight_decoder.word_insertion_score=1.5 \
language_code=zh-CN \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=4.8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=480 streaming_diarizer_nn.spkcache_len=332 streaming_diarizer_nn.fifo_len=120.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation: No separate model is required; the ASR model automatically generates punctuated text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-1.1b-en-US-asr-streaming \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=0.16 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder_chunk_size=0.96 \
decoder=flashlight \
flashlight_decoder.asr_model_delay=-1 \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_vocab_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.64 streaming_diarizer.right_context_size=0.64 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate it with the following command: riva-build --config-path=pkg://servicemaker.configs.punctuation --config-name=base output_path=<rmir_filename>:<key> 'source_path=[<riva_file>:<key>]' name=riva-punctuation language_code=<language_code>
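Applying the greedy-mode substitution described above gives a command of roughly this shape: the language-model and flashlight_decoder arguments drop out, and the greedy decoder settings take their place. This is a sketch only; keep the remaining featurizer, endpointing, and ITN parameters exactly as in the full command:

```shell
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
    output_path=<rmir_filename>:<key> \
    'source_path=[<riva_file>:<key>]' \
    name=parakeet-1.1b-en-US-asr-streaming \
    ms_per_timestep=80 \
    chunk_size=0.16 \
    left_padding_size=1.92 \
    right_padding_size=1.92 \
    language_code=en-US \
    decoder=greedy \
    greedy_decoder.asr_model_delay=-1
```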
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-1.1b-en-US-asr-streaming-throughput \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=0.96 \
left_padding_size=1.92 \
right_padding_size=1.92 \
decoder=flashlight \
flashlight_decoder.asr_model_delay=-1 \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_vocab_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=256 streaming_diarizer_nn.fifo_len=112.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate it with the following command: riva-build --config-path=pkg://servicemaker.configs.punctuation --config-name=base output_path=<rmir_filename>:<key> 'source_path=[<riva_file>:<key>]' name=riva-punctuation language_code=<language_code>
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-1.1b-en-US-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=4.8 \
left_padding_size=1.6 \
right_padding_size=1.6 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
decoder=flashlight \
flashlight_decoder.asr_model_delay=-1 \
decoding_language_model_binary=<bin_file> \
decoding_vocab=<txt_decoding_vocab_file> \
flashlight_decoder.lm_weight=0.8 \
flashlight_decoder.word_insertion_score=1.0 \
flashlight_decoder.beam_size=32 \
flashlight_decoder.beam_threshold=20. \
flashlight_decoder.num_tokenization=1 \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
Note
Greedy Mode: To deploy the model in greedy mode, remove the flashlight_decoder-related parameters from the command above and add decoder=greedy greedy_decoder.asr_model_delay=-1.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:onnx,onnx_opset:19,max_dim:1000}}]' in the command above.
FP8 Quantization: To deploy the model with FP8 precision, add nn.use_trt_fp8=True to the command above. FP8 is supported only on GPUs with compute capability 8.9 or higher.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=4.8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=480 streaming_diarizer_nn.spkcache_len=332 streaming_diarizer_nn.fifo_len=120.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: PnC models require a separate RMIR file. Generate it with the following command: riva-build --config-path=pkg://servicemaker.configs.punctuation --config-name=base output_path=<rmir_filename>:<key> 'source_path=[<riva_file>:<key>]' name=riva-punctuation language_code=<language_code>
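The flashlight_decoder.lm_weight and flashlight_decoder.word_insertion_score knobs trade acoustic evidence against language-model evidence: each beam hypothesis is scored as the acoustic score plus lm_weight times the LM score plus word_insertion_score times the word count. Raising lm_weight favors LM-fluent transcripts, while a positive word_insertion_score offsets the LM's bias toward shorter outputs. A toy sketch of that scoring rule (the function and the numbers are illustrative, not Riva internals):

```python
def hypothesis_score(am_logprob, lm_logprob, num_words,
                     lm_weight=0.8, word_insertion_score=1.0):
    """Combined beam-search score: acoustic + weighted LM + length bonus."""
    return am_logprob + lm_weight * lm_logprob + word_insertion_score * num_words

# Two toy hypotheses: the acoustic model prefers the first,
# but with lm_weight=0.8 the LM tips the ranking toward the second.
a = hypothesis_score(am_logprob=-10.0, lm_logprob=-8.0, num_words=3)
b = hypothesis_score(am_logprob=-11.0, lm_logprob=-5.0, num_words=3)
print(a, b)
```

With lm_weight=0 the ranking would follow the acoustic scores alone, which is why tuning these two values on held-out audio matters when swapping in a custom language model.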
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-streaming-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
language_code=\'en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi\' \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.32 \
left_padding_size=4.64 \
right_padding_size=4.64 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=32 \
nn.max_batch_size=32 \
nn.opt_batch_size=32 \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.stop_th=1.0 \
endpointing.residue_blanks_at_end=0 \
nemo_decoder.use_stateful_decoding=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-based language model, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=1.6 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=160 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove the parameters vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
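The neural_vad.onset and neural_vad.offset pair implements hysteresis thresholding: speech starts only when the frame-level speech probability rises above onset (0.85) and ends only when it falls below offset (0.3), which prevents flickering around a single threshold; min_duration_on/off then discard speech segments or gaps shorter than those durations. A minimal sketch of the onset/offset part (illustrative only, not the Silero or Riva implementation):

```python
def hysteresis_vad(probs, onset=0.85, offset=0.3):
    """Mark frames as speech using rise-above-onset / fall-below-offset logic."""
    speech, active = [], False
    for p in probs:
        if not active and p >= onset:
            active = True          # speech begins only above the high bar
        elif active and p < offset:
            active = False         # speech ends only below the low bar
        speech.append(active)
    return speech

probs = [0.1, 0.9, 0.6, 0.5, 0.2, 0.9]
print(hysteresis_vad(probs))  # [False, True, True, True, False, True]
```

Note that the mid-range frames (0.6, 0.5) stay classified as speech because the state only flips below offset, which is exactly the behavior the two thresholds are meant to produce.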
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-streaming-throughput-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
language_code=\'en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi\' \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=1.6 \
left_padding_size=4.0 \
right_padding_size=4.0 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=64 \
nn.opt_batch_size=64 \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.stop_th=1.0 \
endpointing.residue_blanks_at_end=0 \
nemo_decoder.use_stateful_decoding=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-based language model, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=272 streaming_diarizer_nn.fifo_len=96.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove the parameters vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
language_code=\'en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi\' \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=8.0 \
left_padding_size=0 \
right_padding_size=0 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=128 \
nn.opt_batch_size=128 \
endpointing_type=niva \
endpointing.stop_history=0 \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-based language model, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=800 streaming_diarizer_nn.spkcache_len=312 streaming_diarizer_nn.fifo_len=100.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove the parameters vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
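The niva endpointer works in acoustic-model timesteps: with ms_per_timestep=80, endpointing.stop_history=800 corresponds to the last 800 / 80 = 10 output frames, and stop_th appears to set the fraction of those frames that must be blank before the utterance is finalized (that reading of stop_th is inferred from the parameter names; verify it against the Riva endpointing documentation). The offline command sets stop_history=0 because the whole file is decoded in one pass. The conversion, as a sketch:

```python
def endpoint_frames(stop_history_ms, stop_th, ms_per_timestep=80):
    """Trailing frames the endpointer inspects, and the (assumed)
    number of blank frames required among them."""
    frames = stop_history_ms // ms_per_timestep
    return frames, round(frames * stop_th)

print(endpoint_frames(800, 1.0))  # (10, 10)
```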
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-unified-ml-cs-universal-prompt-asr-streaming-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
language_code=\'en-US,en-GB,es-ES,es-US,fr-FR,de-DE,ar-AR,pt-BR,it-IT,ja-JP,ko-KR,ru-RU,hi-IN\' \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.32 \
left_padding_size=4.64 \
right_padding_size=4.64 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=32 \
nn.max_batch_size=32 \
nn.opt_batch_size=32 \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.stop_th=1.0 \
endpointing.residue_blanks_at_end=0 \
nemo_decoder.use_stateful_decoding=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-based language model, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=1.6 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=160 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove the parameters vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-unified-ml-cs-universal-prompt-asr-streaming-throughput-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
language_code=\'en-US,en-GB,es-ES,es-US,fr-FR,de-DE,ar-AR,pt-BR,it-IT,ja-JP,ko-KR,ru-RU,hi-IN\' \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=0.96 \
left_padding_size=4.32 \
right_padding_size=4.32 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=64 \
nn.opt_batch_size=64 \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.stop_th=1.0 \
endpointing.residue_blanks_at_end=0 \
nemo_decoder.use_stateful_decoding=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-based language model, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=1.6 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=160 streaming_diarizer_nn.spkcache_len=264 streaming_diarizer_nn.fifo_len=100.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove the parameters vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-unified-ml-cs-universal-prompt-asr-offline-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
language_code=\'en-US,en-GB,es-ES,es-US,fr-FR,de-DE,ar-AR,pt-BR,it-IT,ja-JP,ko-KR,ru-RU,hi-IN\' \
nn.fp16_needs_obey_precision_pass=True \
unified_acoustic_model=true \
chunk_size=8.0 \
left_padding_size=0 \
right_padding_size=0 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=128 \
nn.opt_batch_size=128 \
endpointing_type=niva \
endpointing.stop_history=0 \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-based language model, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path of the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=800 streaming_diarizer_nn.spkcache_len=312 streaming_diarizer_nn.fifo_len=100.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove the parameters vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-indic-asr-streaming-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
language_code=\'bn-IN,en-US,hi-IN,ta-IN,indic\' \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=0.32 \
left_padding_size=4.64 \
right_padding_size=4.64 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=64 \
nn.opt_batch_size=64 \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.stop_th=1.0 \
endpointing.residue_blanks_at_end=0 \
nemo_decoder.use_stateful_decoding=true \
unified_acoustic_model=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see (nvidia-riva/tutorials).
nemo2riva: For using .nemo checkpoint instead of .riva, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in above command.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=1.6 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=160 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true from the command, and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-indic-asr-streaming-throughput-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
language_code=\'bn-IN,en-US,hi-IN,ta-IN,indic\' \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=0.96 \
left_padding_size=4.32 \
right_padding_size=4.32 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=64 \
nn.opt_batch_size=64 \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.stop_th=1.0 \
endpointing.residue_blanks_at_end=0 \
nemo_decoder.use_stateful_decoding=true \
unified_acoustic_model=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, refer to the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.96 streaming_diarizer.right_context_size=0.32 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=272 streaming_diarizer_nn.fifo_len=96.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true from the command, and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false \
endpointing.stop_history=800 \
enable_vad_endpointing=true \
profane_words_file=<txt_profane_words_file> \
name=parakeet-rnnt-1.1b-indic-asr-offline-silero-vad \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
language_code=\'bn-IN,en-US,hi-IN,ta-IN,indic\' \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=8.0 \
left_padding_size=0 \
right_padding_size=0 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
max_batch_size=128 \
nn.opt_batch_size=128 \
endpointing_type=niva \
endpointing.stop_history=0 \
nemo_decoder.use_stateful_decoding=false \
unified_acoustic_model=true \
decoder=nemo
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, refer to the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=8 streaming_diarizer.right_context_size=0 streaming_diarizer_nn.chunk_len=800 streaming_diarizer_nn.spkcache_len=312 streaming_diarizer_nn.fifo_len=100.
Voice Activity Detection: VAD is already enabled in the build command above and is recommended for better accuracy. To disable it, remove vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true from the command, and remove <VAD_riva_file>:<key> from source_path.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=cache-aware-parakeet-rnnt-en-US-asr-streaming \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
endpointing.residue_blanks_at_start=-2 \
nemo_decoder.use_stateful_decoding=true \
endpointing_type=niva \
endpointing.stop_history=800 \
endpointing.residue_blanks_at_end=0 \
unified_acoustic_model=true \
feature_extractor_type=torch \
torch_feature_type=nemo \
featurizer.use_streaming_torch_fe=true \
nn.fp16_needs_obey_precision_pass=True \
nn.am_cache_len_input_use_int64=true \
att_context_size='[70,1]' \
max_batch_size=32 \
nn.max_batch_size=32 \
nn.opt_batch_size=32 \
decoder=nemo \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
Note
GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: nemo_decoder.language_model_alpha=0.5 nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, refer to the nvidia-riva/tutorials repository.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: To enable speaker diarization, include the Sortformer Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<sortformer_riva_file>:<key>]' and add the following parameters: sortformer=enabled diarizer_type=sortformer streaming_diarizer.center_chunk_size=0.64 streaming_diarizer.right_context_size=0.64 streaming_diarizer_nn.chunk_len=128 streaming_diarizer_nn.spkcache_len=160 streaming_diarizer_nn.fifo_len=80.
Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD Riva file in the source_path in the build command: 'source_path=[<riva_file>:<key>,<VAD_riva_file>:<key>]' and add the following parameters: vad=enabled vad_type=silero neural_vad_nn.optimization_graph_level=-1 neural_vad.filter_speech_first=false neural_vad.min_duration_on=0.2 neural_vad.min_duration_off=0.5 neural_vad.onset=0.85 neural_vad.offset=0.3 neural_vad.pad_offset=0.08 neural_vad.pad_onset=0.3 neural_vad.mask_features=false enable_vad_endpointing=true.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-tdt-0.6b-en-US-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=16 \
left_padding_size=0.0 \
right_padding_size=0.0 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
featurizer.right_pad_features=true \
max_batch_size=64 \
nn.opt_batch_size=64 \
unified_acoustic_model=true \
endpointing_type=niva \
endpointing.stop_history=0 \
nemo_decoder.use_stateful_decoding=False \
decoder=nemo \
language_code=en-US \
wfst_tokenizer_model=<far_tokenizer_file> \
wfst_verbalizer_model=<far_verbalizer_file> \
speech_hints_model=<far_speech_hints_file>
Note
GPU-based Language Model: Not supported.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: Not supported.
Voice Activity Detection: Not supported.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=parakeet-tdt-0.6b-multi-asr-offline \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False \
ms_per_timestep=80 \
language_code=\'bg-BG,hr-HR,cs-CZ,da-DK,nl-NL,en-GB,et-EE,fi-FI,fr-FR,de-DE,el-GR,hu-HU,it-IT,lv-LV,lt-LT,mt-MT,pl-PL,pt-PT,ro-RO,sk-SK,sl-SI,es-ES,sv-SE,ru-RU,uk-UA,multi\' \
nn.fp16_needs_obey_precision_pass=True \
chunk_size=16 \
left_padding_size=0.0 \
right_padding_size=0.0 \
featurizer.max_batch_size=256 \
featurizer.max_execution_batch_size=256 \
featurizer.right_pad_features=true \
max_batch_size=64 \
nn.opt_batch_size=64 \
unified_acoustic_model=true \
endpointing_type=niva \
endpointing.stop_history=0 \
nemo_decoder.use_stateful_decoding=False \
nn.use_trt_bf16=True \
nn.bf16_needs_obey_precision_pass=True \
decoder=nemo
Note
GPU-based Language Model: Not supported.
nemo2riva: To use a .nemo checkpoint instead of a .riva file, replace source_path=[<riva_file>:<key>] with 'source_path=[{path: <path to .nemo checkpoint>, nemo2riva: {format:nemo}}]' in the command above.
Speaker Diarization: Not supported.
Voice Activity Detection: Not supported.
Punctuation and Capitalization: No separate model is required; the ASR model automatically generates punctuated and capitalized text.
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<txt_profane_words_file> \
name=whisper-large-v3-multi-asr-offline \
unified_acoustic_model=true \
chunk_size=30 \
left_padding_size=0 \
right_padding_size=0 \
decoder=trtllm \
feature_extractor_type=torch \
torch_feature_type=whisper \
featurizer.norm_per_feature=false \
max_batch_size=8 \
featurizer.precalc_norm_params=false \
featurizer.max_batch_size=8 \
featurizer.max_execution_batch_size=8 \
language_code=\'en,zh,de,es,ru,ko,fr,ja,pt,tr,pl,ca,nl,ar,sv,it,id,hi,fi,vi,he,uk,el,ms,cs,ro,da,hu,ta,no,th,ur,hr,bg,lt,la,mi,ml,cy,sk,te,fa,lv,bn,sr,az,sl,kn,et,mk,br,eu,is,hy,ne,mn,bs,kk,sq,sw,gl,mr,pa,si,km,sn,yo,so,af,oc,ka,be,tg,sd,gu,am,yi,lo,uz,fo,ht,ps,tk,nn,mt,sa,lb,my,bo,tl,mg,as,tt,haw,ln,ha,ba,jw,su,yue,multi\'
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<key> \
'source_path=[<riva_file>:<key>]' \
profane_words_file=<profane_words_file> \
name=canary-1b-multi-asr-offline \
unified_acoustic_model=true \
use_cpp_postprocessing=False \
language_code=\'en-US,ar-AR,bg-BG,ca-ES,cs-CZ,da-DK,de-AT,de-CH,de-DE,el-GR,el-IL,et-EE,en-AM,en-AU,en-CA,en-EU,en-GB,en-IN,en-ME,en-MY,en-PH,en-SA,en-SG,en-UA,en-ZA,es-AR,es-CL,es-ES,es-LA,es-PY,es-UY,es-US,es-MX,fi-FI,fr-BE,fr-CA,fr-CH,fr-FR,he-IL,hi-IN,hu-HU,hr-HR,id-ID,it-IT,it-CH,lt-LT,lv-LV,ja-JP,km-KH,ko-KR,my-MM,nb-NO,nn-NO,nl-NL,nl-BE,nn-NB,pl-PL,pt-BR,pt-PT,ro-RO,ru-AM,ru-RU,ru-UA,sk-SK,sl-SI,sv-SE,th-TH,tr-TR,uk-UA,vi-VN,zh-CN,zh-TW\' \
chunk_size=30 \
left_padding_size=0 \
right_padding_size=0 \
feature_extractor_type=torch \
torch_feature_type=nemo \
max_batch_size=8 \
featurizer.use_utterance_norm_params=false \
featurizer.precalc_norm_params=false \
featurizer.max_batch_size=128 \
featurizer.max_execution_batch_size=128 \
ms_per_timestep=80 \
share_flags=true \
featurizer.norm_per_feature=false \
decoder=trtllm \
trtllm_decoder.max_output_len=200 \
trtllm_decoder.decoupled_mode=true
For details about the parameters passed to riva-build to customize the ASR pipeline, run:
riva-build --config-path=pkg://servicemaker.configs.asr -h
Finetuning Artifacts#
Riva ASR models can be fine-tuned for specific domains, languages, or use cases to enhance accuracy and performance. Fine-tuning generates specialized artifacts that extend the capabilities of your ASR pipeline configuration. These artifacts include domain-optimized acoustic models, language models, speaker diarization models, and punctuation/capitalization models tailored for specific scenarios.
The table below provides direct links to fine-tuning artifacts for various Riva ASR models. These artifacts are available on NGC (NVIDIA GPU Cloud) and can be integrated into your ASR pipeline using the riva-build command with the appropriate parameters.
Streaming/Offline Recognition#
You can configure the Riva ASR pipeline for both streaming and offline recognition use cases. When using the StreamingRecognize API call, we recommend the following riva-build parameters for low-latency streaming recognition. Use the Hydra config streaming for streaming and offline for batch recognition. Refer to riva/proto/riva_asr.proto for details.
Streaming (low-latency):
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=streaming \
output_path=<rmir_filename>:<encryption_key> \
'source_path=[<riva_filename>:<encryption_key>]' \
name=<pipeline_name> \
wfst_tokenizer_model=<wfst_tokenizer_model> \
wfst_verbalizer_model=<wfst_verbalizer_model> \
decoder=greedy \
chunk_size=0.16 \
decoder_chunk_size=0.96 \
left_padding_size=1.92 \
right_padding_size=1.92 \
ms_per_timestep=40 \
nn.fp16_needs_obey_precision_pass=True \
greedy_decoder.asr_model_delay=-1 \
endpointing.residue_blanks_at_start=-2 \
featurizer.use_utterance_norm_params=False \
featurizer.precalc_norm_time_steps=0 \
featurizer.precalc_norm_params=False
For high-throughput streaming recognition with the StreamingRecognize API call, use --config-name=streaming and set:
chunk_size=0.8 \
left_padding_size=1.6 \
right_padding_size=1.6
To configure the ASR pipeline for offline recognition with the Recognize API call (refer to riva/proto/riva_asr.proto), use the offline config:
riva-build --config-path=pkg://servicemaker.configs.asr --config-name=offline \
output_path=<rmir_filename>:<encryption_key> \
'source_path=[<riva_filename>:<encryption_key>]' \
name=<pipeline_name> \
chunk_size=4.8 \
left_padding_size=1.6 \
right_padding_size=1.6
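A quick way to compare these presets is to compute the audio window the acoustic model sees per chunk, which is the sum of the left padding, chunk, and right padding durations. The sketch below is plain Python using the values from the commands in this section:

```python
# Sketch: the acoustic model processes left_padding_size + chunk_size +
# right_padding_size seconds of audio per chunk. Values are taken from the
# three preset configurations shown in this section.

def input_window_s(chunk_size, left_padding_size, right_padding_size):
    return left_padding_size + chunk_size + right_padding_size

presets = {
    "streaming low-latency": (0.16, 1.92, 1.92),
    "streaming throughput": (0.8, 1.6, 1.6),
    "offline": (4.8, 1.6, 1.6),
}

for name, (chunk, left, right) in presets.items():
    # Both streaming presets see a 4.00 s window; offline sees 8.00 s.
    print(f"{name}: {input_window_s(chunk, left, right):.2f} s per chunk")
```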
Note
When you deploy the offline ASR models with riva-deploy, TensorRT warnings indicating that the memory requirements of format conversion cannot be satisfied can appear in the logs. These warnings do not affect functionality, and you can ignore them.
Language Models#
Riva ASR supports decoding with an n-gram language model. You can provide the n-gram language model in one of the following formats:
An .arpa format file or a KenLM binary format file for CTC models
A .nemo format file for RNNT models
Language Model for Parakeet-CTC Models#
ARPA Format Language Model#
To configure the Riva ASR pipeline to use an n-gram language model stored in .arpa format, replace:
decoder=greedy
with
decoder=flashlight \
decoding_language_model_arpa=<arpa_filename> \
decoding_vocab=<decoder_vocab_file>
KenLM Binary Language Model#
To generate the Riva RMIR file when using a KenLM binary file to specify the language model, replace:
decoder=greedy
with
decoder=flashlight \
decoding_language_model_binary=<KENLM_binary_filename> \
decoding_vocab=<decoder_vocab_file>
Decoder Hyper-Parameters#
You can also specify the decoder language model hyperparameters from the riva-build command.
For the Flashlight decoder, you can specify the hyperparameters beam_size, beam_size_token, beam_threshold, lm_weight, and word_insertion_score:
decoder=flashlight \
decoding_language_model_binary=<KENLM_binary_filename> \
decoding_vocab=<decoder_vocab_file> \
flashlight_decoder.beam_size=<beam_size> \
flashlight_decoder.beam_size_token=<beam_size_token> \
flashlight_decoder.beam_threshold=<beam_threshold> \
flashlight_decoder.lm_weight=<lm_weight> \
flashlight_decoder.word_insertion_score=<word_insertion_score>
where:
beam_size is the maximum number of hypotheses the decoder holds at each step.
beam_size_token is the maximum number of tokens the decoder considers at each step.
beam_threshold is the threshold used to prune hypotheses.
lm_weight is the weight of the language model used when scoring hypotheses.
word_insertion_score is the word insertion score used when scoring hypotheses.
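To see how these settings interact, here is a minimal Python sketch of the pruning and scoring semantics. It is an illustration only, not Riva's Flashlight implementation; the scoring follows the common shallow-fusion form (acoustic score plus weighted LM score plus a per-word insertion bonus):

```python
# Illustrative sketch of beam pruning and hypothesis scoring; not Riva's
# Flashlight decoder.

def combined_score(am_score, lm_score, n_words, lm_weight, word_insertion_score):
    # Total score = acoustic score + weighted LM score + per-word bonus.
    return am_score + lm_weight * lm_score + word_insertion_score * n_words

def prune(hypotheses, beam_size, beam_threshold):
    """hypotheses: list of (text, score); higher scores are better."""
    best = max(score for _, score in hypotheses)
    # beam_threshold: drop hypotheses scoring too far below the best one.
    kept = [h for h in hypotheses if best - h[1] <= beam_threshold]
    # beam_size: keep at most this many hypotheses, best first.
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:beam_size]

hyps = [("the cat", -1.0), ("the cap", -1.5), ("thick at", -9.0)]
# "thick at" falls outside the threshold; the two best survive.
print(prune(hyps, beam_size=2, beam_threshold=5.0))
```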
Flashlight Decoder Lexicon#
The Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the decoder vocabulary file passed to the riva-build command. The decoder vocabulary file used to generate the ASR pipelines includes words that cover a wide range of domains and should provide accurate transcripts for most applications.
You can also build an ASR pipeline using your own decoder vocabulary file by using the decoding_vocab parameter of the riva-build command.
For example, you can start with the riva-build commands that are used to generate the ASR pipelines in our pipeline configuration section and provide your own lexicon decoder vocabulary file. Refer to Pipeline Configuration for details.
The Riva ServiceMaker automatically tokenizes the words in the decoder vocabulary file, so double-check that words of interest are included. You can control the number of tokenizations for each word in the decoder vocabulary file with the flashlight_decoder.num_tokenization parameter.
(Advanced) Manually Adding Additional Tokenizations of Words in Lexicon#
It is also possible to manually add additional tokenizations for the words in the decoder vocabulary by performing the following steps:
The riva-build and riva-deploy commands provided in the previous section store the lexicon in the /data/models/parakeet-1.1b-en-US-asr-streaming-asr-bls-ensemble/1/lexicon.txt file of the Triton model repository.
To add additional tokenizations to the lexicon, copy the lexicon file:
cp /data/models/parakeet-1.1b-en-US-asr-streaming-asr-bls-ensemble/1/lexicon.txt decoding_lexicon.txt
and add the SentencePiece tokenization for the word of interest. For example, you could add:
manu ▁ma n u
manu ▁man u
manu ▁ma nu
to the decoding_lexicon.txt file so that the word manu is generated in the transcript if the acoustic model predicts those tokens. You will need to ensure that the new lines follow the indentation/space pattern like the rest of the file and that the tokens used are part of the tokenizer model. After this is done, regenerate the model repository using the new decoding lexicon by passing decoding_lexicon=decoding_lexicon.txt to riva-build instead of decoding_vocab=decoding_vocab.txt.
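A small helper like the following can append such entries while preserving the one-line-per-tokenization format. This is a sketch, not part of Riva tooling; the helper name and file names are hypothetical:

```python
# Sketch: append extra SentencePiece tokenizations for a word to a copied
# lexicon file. Each lexicon line is "<word> <token> <token> ...".

def add_tokenizations(lexicon_path, word, tokenizations):
    with open(lexicon_path, "a", encoding="utf-8") as f:
        for tokens in tokenizations:
            f.write(f"{word} {' '.join(tokens)}\n")

# Mirrors the "manu" example above; every token must exist in the deployed
# tokenizer model, or the decoder cannot emit the word.
add_tokenizations("decoding_lexicon.txt", "manu",
                  [["▁ma", "n", "u"], ["▁man", "u"]])
```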
Flashlight Decoder Lexicon Free#
The Flashlight decoder can also be used without a lexicon. Lexicon-free decoding is performed with a character-based language model.
To enable lexicon-free decoding with Flashlight, add flashlight_decoder.use_lexicon_free_decoding=True to riva-build and specify a
character-based language model using decoding_language_model_binary=<path/to/charlm>.
GPU-LM for Parakeet-RNNT Models#
To configure the Riva ASR pipeline to use a GPU-based language model (GPU-LM) stored in .nemo format, add:
nemo_decoder.language_model_alpha=0.5 \
nemo_decoder.language_model_file=<GPU_LM.nemo file>
Beginning/End of Utterance Detection#
Riva ASR uses an algorithm that detects the beginning and end of utterances. The algorithm is used to reset the ASR decoder
state and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in
a 300 ms window contain nonblank characters, and the end of an utterance is flagged when 98% of the frames in an 800 ms window are
blank characters. You can tune these values for your particular use case by using the following riva-build parameters:
endpointing.start_history=300 \
endpointing.start_th=0.2 \
endpointing.stop_history=800 \
endpointing.stop_th=0.98
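The windowed thresholding rule above can be sketched as follows. It assumes one blank/nonblank flag per decoder timestep and 40 ms per timestep (both assumptions for illustration; this is not Riva's internal implementation):

```python
# Sketch of the start/end-of-utterance windowed-fraction rule.

def utterance_started(flags, start_history_ms=300, start_th=0.2,
                      ms_per_timestep=40):
    """flags: per-timestep booleans, True = nonblank token emitted.
    Flags a start when at least start_th of the trailing window is nonblank."""
    n = max(1, start_history_ms // ms_per_timestep)
    window = flags[-n:]
    return sum(window) / len(window) >= start_th

def utterance_ended(flags, stop_history_ms=800, stop_th=0.98,
                    ms_per_timestep=40):
    """Flags an end when at least stop_th of the trailing window is blank."""
    n = max(1, stop_history_ms // ms_per_timestep)
    window = flags[-n:]
    return sum(1 for f in window if not f) / len(window) >= stop_th

speech = [False] * 5 + [True] * 3   # a short burst of nonblank timesteps
print(utterance_started(speech))    # 3/7 of the 300 ms window is nonblank
```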
Additionally, it is possible to disable the beginning/end of utterance detection by passing endpointing_type=none to riva-build.
Note that in this case, the decoder state resets after the full audio signal has been sent by the client. Similarly, the punctuator model is only called once.
Streaming Speaker Diarization#
Riva currently supports speaker diarization in streaming mode using the Sortformer Diarizer model. For more details on Sortformer speaker diarization, refer to the Streaming Speaker Diarization section in the ASR Overview.
Sortformer#
To enable Sortformer speaker diarization in the ASR pipeline, include the Sortformer Riva file in source_path and pass the following additional parameters to riva-build when building a streaming ASR model:
'source_path=[<riva_file>:<key>,<sortformer_diarizer_riva_filename>:<encryption_key>]' \
sortformer=enabled \
diarizer_type=sortformer
where:
<sortformer_diarizer_riva_filename> is the .riva Sortformer model to use. For example, you can use the Sortformer Diarizer Riva model available on NGC.
<encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
Note: Sortformer currently supports a maximum of 4 speakers.
Neural-Based Voice Activity Detection#
It is possible to use a neural-based Voice Activity Detection (VAD) algorithm in Riva ASR. This can help filter out noise in the audio and reduce spurious words in the ASR transcripts.
To use the neural-based VAD algorithm in the ASR pipeline, include the VAD Riva file in source_path and pass the following additional parameters to riva-build:
Silero VAD#
'source_path=[<riva_file>:<key>,<silero_vad_riva_filename>:<encryption_key>]' \
vad=enabled \
vad_type=silero \
neural_vad_nn.optimization_graph_level=-1 \
neural_vad.filter_speech_first=false \
neural_vad.onset=0.85 \
neural_vad.offset=0.3 \
neural_vad.min_duration_on=0.2 \
neural_vad.min_duration_off=0.5 \
neural_vad.pad_offset=0.08 \
neural_vad.pad_onset=0.3 \
neural_vad.mask_features=false
where:
<silero_vad_riva_filename> is the .riva Silero VAD model to use. For example, you can use the Silero VAD Riva model available on NGC.
<encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
neural_vad.onset is the minimum probability threshold for detecting the start of a speech segment.
neural_vad.offset is the minimum probability threshold for detecting the end of a speech segment.
neural_vad.min_duration_on is the minimum duration of a segment for it to be considered a speech segment.
neural_vad.min_duration_off is the minimum duration of a gap for it to be considered a non-speech segment.
neural_vad.pad_onset is the duration of audio (in seconds) to pad the onset of a speech segment.
neural_vad.pad_offset is the duration of audio (in seconds) to pad the offset of a speech segment.
neural_vad.mask_features controls feature masking for non-speech segments.
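The way these parameters interact can be sketched as a hysteresis segmenter: enter speech at onset, leave at offset, merge gaps shorter than min_duration_off, drop segments shorter than min_duration_on, then pad the survivors. The frame duration and the exact ordering of these steps below are assumptions for illustration, not Riva's implementation:

```python
# Sketch of onset/offset hysteresis plus duration filtering and padding
# over per-frame speech probabilities.

def segments(probs, onset=0.85, offset=0.3, min_on=0.2, min_off=0.5,
             pad_onset=0.3, pad_offset=0.08, frame_ms=32):
    dt = frame_ms / 1000.0
    segs, start, in_speech = [], None, False
    for i, p in enumerate(probs):
        if not in_speech and p >= onset:        # enter speech on onset
            in_speech, start = True, i * dt
        elif in_speech and p < offset:          # leave speech on offset
            segs.append((start, i * dt))
            in_speech = False
    if in_speech:
        segs.append((start, len(probs) * dt))
    # Merge segments separated by non-speech gaps shorter than min_off.
    merged = []
    for s, e in segs:
        if merged and s - merged[-1][1] < min_off:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # Drop segments shorter than min_on, then pad onset/offset.
    return [(max(0.0, s - pad_onset), e + pad_offset)
            for s, e in merged if e - s >= min_on]
```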
Several of these parameters can be configured at runtime using the custom_configuration parameter. The configurable parameters are:
onset
offset
min_duration_on
min_duration_off
pad_onset
pad_offset
Example of runtime configuration:
--custom_configuration="neural_vad.onset:0.9,neural_vad.offset:0.4,neural_vad.min_duration_on:0.3,neural_vad.min_duration_off:0.6"
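Assuming the comma-separated key:value format shown above, such a string can be parsed like this (an illustrative sketch; the server's actual parsing rules may differ):

```python
# Sketch: parse a custom_configuration string of "key:value" pairs into a
# dict of parameter overrides.

def parse_custom_configuration(s):
    out = {}
    for pair in s.split(","):
        key, _, value = pair.partition(":")
        out[key.strip()] = float(value)
    return out

cfg = parse_custom_configuration("neural_vad.onset:0.9,neural_vad.offset:0.4")
print(cfg["neural_vad.onset"])  # 0.9
```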
MarbleNet VAD#
'source_path=[<riva_file>:<key>,<marblenet_vad_riva_filename>:<encryption_key>]' \
vad=enabled \
vad_type=neural \
neural_vad_nn.optimization_graph_level=-1
where:
<marblenet_vad_riva_filename> is the .riva MarbleNet VAD model to use. For example, you can use the MarbleNet VAD Riva model available on NGC.
<encryption_key> is the key used to encrypt the file. The encryption key for the pre-trained Riva models uploaded on NGC is tlt_encode.
Note that using a neural VAD component in the ASR pipeline will have an impact on latency and throughput of the deployed Riva ASR server.
Generating Multiple Transcript Hypotheses#
By default, the Riva ASR pipeline is configured to only generate the best transcript hypothesis for
each utterance. It is possible to generate multiple transcript hypotheses by passing the parameter
max_supported_transcripts=N to the riva-build command, where N is the maximum number of
hypotheses to generate. With these changes, the client application can retrieve the multiple hypotheses
by setting the max_alternatives field of RecognitionConfig to values greater than 1.
Impact of Chunk Size, Padding Size, and Decoder Chunk Size on Performance and Accuracy (Advanced)#
The chunk_size, left_padding_size/right_padding_size, and decoder_chunk_size parameters used to configure Riva ASR can have a significant impact on accuracy and performance. Riva provides pre-configured ASR pipelines with preset values: a low-latency streaming configuration, a high-throughput streaming configuration, and an offline configuration. Refer to Pipeline Configuration for the values used in those configurations.
The chunk_size parameter is the duration of the audio chunk in seconds processed by the Riva server for every streaming request. Hence, in streaming mode, Riva returns one response for every chunk_size seconds of audio. A lower value of chunk_size will therefore reduce the user-perceived latency as the transcript will get updated more frequently.
The left_padding_size and right_padding_size parameters are the duration in seconds of the padding prepended and appended to the chunk_size. The Riva acoustic model processes an input tensor corresponding to an audio duration of left_padding_size + chunk_size + right_padding_size for every new chunk of audio it receives. Increasing padding or chunk_size typically helps to improve accuracy of the transcripts since the acoustic model has access to more context. However, increasing padding reduces the maximum number of concurrent streams supported by Riva ASR, since it will increase the size of the input tensor fed to the acoustic model for every new chunk.
The decoder_chunk_size parameter is the duration in seconds of the audio chunk fed to the decoder. This allows the decoder to operate at a different granularity than the acoustic model. For example, with chunk_size=0.16 and decoder_chunk_size=0.96, the acoustic model processes audio in 160 ms chunks, but the decoder accumulates multiple chunks and processes them together every 960 ms. A larger decoder_chunk_size improves transcript accuracy by overcoming token misalignment issues that can occur when decoding small chunks individually, while keeping the overall latency the same, since the acoustic model continues processing at its original rate. When set to -1 (the default), the decoder processes frames at the same rate as the acoustic model's chunk_size.
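Under this description, the number of acoustic-model chunks the decoder accumulates per invocation can be estimated as follows (a sketch of the arithmetic only, with the example values from the paragraph above):

```python
# Sketch: how many acoustic-model chunks accumulate per decoder call.

def chunks_per_decoder_call(chunk_size, decoder_chunk_size):
    if decoder_chunk_size == -1:        # default: decode every chunk
        return 1
    return round(decoder_chunk_size / chunk_size)

print(chunks_per_decoder_call(0.16, 0.96))  # 6
```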