<no title>

Conformer-CTC

Spanish (US)

Streaming, low-latency

riva-build speech_recognition \
   <rmir_filename>:<key> \
   <riva_file>:<key> \
   --name=conformer-es-US-asr-streaming \
   --return_separate_utterances=False \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --endpointing.start_history=200 \
   --nn.fp16_needs_obey_precision_pass \
   --endpointing.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --decoding_language_model_binary=<bin_file> \
   --decoding_vocab=<txt_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=es-US \
   --wfst_tokenizer_model=<far_tokenizer_file> \
   --wfst_verbalizer_model=<far_verbalizer_file>

Note

For deploying the model in greedy mode, replace the --flashlight_decoder related parameters from the above command and add --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

Streaming, high-throughput

riva-build speech_recognition \
   <rmir_filename>:<key> \
   <riva_file>:<key> \
   --name=conformer-es-US-asr-streaming-throughput \
   --return_separate_utterances=False \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --endpointing.start_history=200 \
   --nn.fp16_needs_obey_precision_pass \
   --endpointing.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --decoding_language_model_binary=<bin_file> \
   --decoding_vocab=<txt_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=es-US \
   --wfst_tokenizer_model=<far_tokenizer_file> \
   --wfst_verbalizer_model=<far_verbalizer_file>

Note

For deploying the model in greedy mode, replace the --flashlight_decoder related parameters from the above command and add --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

Offline

riva-build speech_recognition \
   <rmir_filename>:<key> \
   <riva_file>:<key> \
   --offline \
   --name=conformer-es-US-asr-offline \
   --return_separate_utterances=True \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --endpointing.start_history=200 \
   --nn.fp16_needs_obey_precision_pass \
   --endpointing.residue_blanks_at_start=-2 \
   --chunk_size=4.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --max_batch_size=16 \
   --featurizer.max_batch_size=512 \
   --featurizer.max_execution_batch_size=512 \
   --decoder_type=flashlight \
   --decoding_language_model_binary=<bin_file> \
   --decoding_vocab=<txt_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=es-US \
   --wfst_tokenizer_model=<far_tokenizer_file> \
   --wfst_verbalizer_model=<far_verbalizer_file>

Note

For deploying the model in greedy mode, replace the --flashlight_decoder related parameters from the above command and add --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

Parakeet-0.6B-CTC

Parakeet-1.1B-CTC

English (US)

Streaming, low-latency

riva-build speech_recognition \
   <rmir_filename>:<key> \
   <riva_file>:<key> \
   --name=parakeet-1.1b-en-US-asr-streaming \
   --return_separate_utterances=False \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80 \
   --endpointing.residue_blanks_at_start=-2 \
   --nn.fp16_needs_obey_precision_pass \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<bin_file> \
   --decoding_vocab=<txt_decoding_vocab_file> \
   --flashlight_decoder.lm_weight=0.8 \
   --flashlight_decoder.word_insertion_score=1.0 \
   --flashlight_decoder.beam_size=32 \
   --flashlight_decoder.beam_threshold=20. \
   --flashlight_decoder.num_tokenization=1 \
   --profane_words_file=<txt_profane_words_file> \
   --language_code=en-US \
   --wfst_tokenizer_model=<far_tokenizer_file> \
   --wfst_verbalizer_model=<far_verbalizer_file> \
   --speech_hints_model=<far_speech_hints_file>

Note

Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder related parameters from the above command and add --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.64 --streaming_diarizer.right_context_size=0.56.

Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.

Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US

Streaming, high-throughput

riva-build speech_recognition \
   <rmir_filename>:<key> \
   <riva_file>:<key> \
   --name=parakeet-1.1b-en-US-asr-streaming-throughput \
   --return_separate_utterances=False \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80 \
   --endpointing.residue_blanks_at_start=-2 \
   --nn.fp16_needs_obey_precision_pass \
   --chunk_size=0.96 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<bin_file> \
   --decoding_vocab=<txt_decoding_vocab_file> \
   --flashlight_decoder.lm_weight=0.8 \
   --flashlight_decoder.word_insertion_score=1.0 \
   --flashlight_decoder.beam_size=32 \
   --flashlight_decoder.beam_threshold=20. \
   --flashlight_decoder.num_tokenization=1 \
   --profane_words_file=<txt_profane_words_file> \
   --language_code=en-US \
   --wfst_tokenizer_model=<far_tokenizer_file> \
   --wfst_verbalizer_model=<far_verbalizer_file> \
   --speech_hints_model=<far_speech_hints_file>

Note

Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder related parameters from the above command and add --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=0.96 --streaming_diarizer.right_context_size=0.

Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.

Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US

Offline

riva-build speech_recognition \
   <rmir_filename>:<key> \
   <riva_file>:<key> \
   --offline \
   --name=parakeet-1.1b-en-US-asr-offline \
   --return_separate_utterances=True \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80 \
   --nn.fp16_needs_obey_precision_pass \
   --chunk_size=4.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --featurizer.max_batch_size=256 \
   --featurizer.max_execution_batch_size=256 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<bin_file> \
   --decoding_vocab=<txt_decoding_vocab_file> \
   --flashlight_decoder.lm_weight=0.8 \
   --flashlight_decoder.word_insertion_score=1.0 \
   --flashlight_decoder.beam_size=32 \
   --flashlight_decoder.beam_threshold=20. \
   --flashlight_decoder.num_tokenization=1 \
   --profane_words_file=<txt_profane_words_file> \
   --language_code=en-US \
   --wfst_tokenizer_model=<far_tokenizer_file> \
   --wfst_verbalizer_model=<far_verbalizer_file> \
   --speech_hints_model=<far_speech_hints_file>

Note

Greedy Mode: To deploy the model in greedy mode, replace the --flashlight_decoder related parameters from the above command and add --decoder_type=greedy --greedy_decoder.asr_model_delay=-1.

Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=4.8 --streaming_diarizer.right_context_size=0.

Voice Activity Detection: To enable VAD for improved noise robustness, include the Silero VAD RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <VAD_riva_file>:<key> and add the following parameters: --vad_type=silero --neural_vad_nn.optimization_graph_level=-1 --neural_vad.filter_speech_first=false --neural_vad.min_duration_on=0.2 --neural_vad.min_duration_off=0.5 --neural_vad.onset=0.85 --neural_vad.offset=0.3 --neural_vad.pad_offset=0.08 --neural_vad.pad_onset=0.3 --enable_vad_endpointing=true.

Punctuation and Capitalization: PnC models require a separate RMIR file. Generate the RMIR file using the following command: riva-build punctuation <rmir_filename>:<key> <riva_file>:<key> --language_code=en-US --name=riva-punctuation-en-US

Parakeet-1.1B-RNNT

Multilingual

Streaming, low-latency

riva-build speech_recognition <rmir_filename>:<key> \
   <riva_file>:<key> \
   --profane_words_file=<txt_profane_words_file> \
   --name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-streaming \
   --return_separate_utterances=False \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80  \
   --endpointing.residue_blanks_at_start=-2  \
   --language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi  \
   --nn.fp16_needs_obey_precision_pass   \
   --unified_acoustic_model  \
   --chunk_size=0.32 \
   --left_padding_size=4.64 \
   --right_padding_size=4.64 \
   --featurizer.max_batch_size=256 \
   --featurizer.max_execution_batch_size=256 \
   --max_batch_size=32 \
   --nn.max_batch_size=32 \
   --nn.opt_batch_size=32 \
   --endpointing_type=niva \
   --endpointing.stop_history=800 \
   --endpointing.stop_th=1.0 \
   --endpointing.residue_blanks_at_end=0 \
   --nemo_decoder.use_stateful_decoding  \
   --decoder_type=nemo

Note

GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: --nemo_decoder.language_model_alpha=0.5 --nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see (nvidia-riva/tutorials).

Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=1.6 --streaming_diarizer.right_context_size=0.

Voice Activity Detection: Not supported.

Punctuation and Capitalization: No separate model required, the ASR model automatically generates punctuated and capitalized text.

Streaming, high-throughput

riva-build speech_recognition <rmir_filename>:<key> \
   <riva_file>:<key> \
   --profane_words_file=<txt_profane_words_file> \
   --name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-streaming-throughput \
   --return_separate_utterances=False \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80 \
   --endpointing.residue_blanks_at_start=-2 \
   --language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
   --nn.fp16_needs_obey_precision_pass \
   --unified_acoustic_model \
   --chunk_size=1.6 \
   --left_padding_size=4.0 \
   --right_padding_size=4.0 \
   --featurizer.max_batch_size=256 \
   --featurizer.max_execution_batch_size=256 \
   --max_batch_size=64 \
   --nn.opt_batch_size=64 \
   --endpointing_type=niva \
   --endpointing.stop_history=800 \
   --endpointing.stop_th=1.0 \
   --endpointing.residue_blanks_at_end=0 \
   --nemo_decoder.use_stateful_decoding \
   --decoder_type=nemo

Note

GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: --nemo_decoder.language_model_alpha=0.5 --nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see the (nvidia-riva/tutorials).

Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=1.6 --streaming_diarizer.right_context_size=0.

Voice Activity Detection: Not supported.

Punctuation and Capitalization: No separate model required, the ASR model automatically generates punctuated and capitalized text.

Offline

riva-build speech_recognition <rmir_filename>:<key> \
   <riva_file>:<key> \
   --profane_words_file=<txt_profane_words_file> \
   --offline \
   --name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline \
   --return_separate_utterances=True \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80 \
   --language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
   --nn.fp16_needs_obey_precision_pass \
   --unified_acoustic_model \
   --chunk_size=8.0 \
   --left_padding_size=0 \
   --right_padding_size=0 \
   --featurizer.max_batch_size=256 \
   --featurizer.max_execution_batch_size=256 \
   --max_batch_size=128 \
   --nn.opt_batch_size=128 \
   --endpointing_type=niva \
   --endpointing.stop_history=0 \
   --decoder_type=nemo

Note

GPU-based Language Model: To deploy with a GPU-LM, add the following parameters: --nemo_decoder.language_model_alpha=0.5 --nemo_decoder.language_model_file=<GPU_LM.nemo file>. For training instructions, see (nvidia-riva/tutorials).

Speaker Diarization: To enable speaker diarization, include the Sortformer RIVA file in the command: riva-build speech_recognition <rmir_filename>:<key> <riva_file>:<key> <sortformer_riva_file>:<key> and add the following parameters: --diarizer_type=sortformer --streaming_diarizer.center_chunk_size=1.6 --streaming_diarizer.right_context_size=0.

Voice Activity Detection: Not supported.

Punctuation and Capitalization: No separate model required, the ASR model automatically generates punctuated and capitalized text.

Parakeet-0.6B-TDT

English (US)

Offline

riva-build speech_recognition <rmir_filename>:<key> \
   <riva_file>:<key> \
   --profane_words_file=<txt_profane_words_file> \
   --offline \
   --name=parakeet-rnnt-1.1b-unified-ml-cs-universal-multi-asr-offline \
   --return_separate_utterances=True \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=80 \
   --language_code=en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN,he-IL,nb-NO,nl-NL,cs-CZ,da-DK,fr-CA,pl-PL,sv-SE,th-TH,tr-TR,pt-PT,nn-NO,multi \
   --nn.fp16_needs_obey_precision_pass \
   --unified_acoustic_model \
   --chunk_size=8.0 \
   --left_padding_size=0 \
   --right_padding_size=0 \
   --featurizer.max_batch_size=256 \
   --featurizer.max_execution_batch_size=256 \
   --max_batch_size=128 \
   --nn.opt_batch_size=128 \
   --endpointing_type=niva \
   --endpointing.stop_history=0 \
   --decoder_type=nemo

Note

GPU-based Language Model: Not supported.

Speaker Diarization: Not supported.

Voice Activity Detection: Not supported.

Punctuation and Capitalization: No separate model required, the ASR model automatically generates punctuated and capitalized text.

Whisper-Large

Multilingual

Offline

trtllm

riva-build speech_recognition \
  <rmir_filename>:<key> \
  <riva_file>:<key> \
  --offline \
  --name=whisper-large-v3-multi-asr-offline \
  --return_separate_utterances=True \
  --chunk_size 30 \
  --left_padding_size 0 \
  --right_padding_size 0 \
  --decoder_type trtllm \
  --unified_acoustic_model \
  --feature_extractor_type torch \
  --featurizer.norm_per_feature false \
  --max_batch_size 8 \
  --featurizer.precalc_norm_params False \
  --featurizer.max_batch_size=8 \
  --featurizer.max_execution_batch_size=8 \
  --language_code=en,zh,de,es,ru,ko,fr,ja,pt,tr,pl,ca,nl,ar,sv,it,id,hi,fi,vi,he,uk,el,ms,cs,ro,da,hu,ta,no,th,ur,hr,bg,lt,la,mi,ml,cy,sk,te,fa,lv,bn,sr,az,sl,kn,et,mk,br,eu,is,hy,ne,mn,bs,kk,sq,sw,gl,mr,pa,si,km,sn,yo,so,af,oc,ka,be,tg,sd,gu,am,yi,lo,uz,fo,ht,ps,tk,nn,mt,sa,lb,my,bo,tl,mg,as,tt,haw,ln,ha,ba,jw,su,yue,multi

Canary-1B

Multilingual

Offline

nemo

riva-build speech_recognition \
  <rmir_filename>:<key> \
  <riva_file>:<key> \
  --profane_words_file=<profane_words_file> \
  --offline \
  --name=canary-1b-multi-asr-offline \
  --return_separate_utterances=True \
  --unified_acoustic_model \
  --language_code=en-US,ar-AR,bg-BG,ca-ES,cs-CZ,da-DK,de-AT,de-CH,de-DE,el-GR,el-IL,et-EE,en-AM,en-AU,en-CA,en-EU,en-GB,en-IN,en-ME,en-MY,en-PH,en-SA,en-SG,en-UA,en-ZA,es-AR,es-CL,es-ES,es-LA,es-PY,es-UY,es-US,es-MX,fi-FI,fr-BE,fr-CA,fr-CH,fr-FR,he-IL,hi-IN,hu-HU,hr-HR,id-ID,it-IT,it-CH,lt-LT,lv-LV,ja-JP,km-KH,ko-KR,my-MM,nb-NO,nn-NO,nl-NL,nl-BE,nn-NB,pl-PL,pt-BR,pt-PT,ro-RO,ru-AM,ru-RU,ru-UA,sk-SK,sl-SI,sv-SE,th-TH,tr-TR,uk-UA,vi-VN,zh-CN,zh-TW \
  --chunk_size 30 \
  --left_padding_size 0 \
  --right_padding_size 0 \
  --feature_extractor_type torch \
  --torch_feature_type nemo \
  --max_batch_size 8 \
  --featurizer.use_utterance_norm_params=False \
  --featurizer.precalc_norm_params=False \
  --featurizer.max_batch_size=128 \
  --featurizer.max_execution_batch_size=128 \
  --ms_per_timestep=80 \
  --share_flags=True \
  --featurizer.norm_per_feature false \
  --decoder_type trtllm \
  --trtllm_decoder.max_output_len 200 \
  --trtllm_decoder.decoupled_mode true