NGPU-LM (GPU-based N-gram Language Model) Language Model Fusion#
ASR systems can achieve significantly improved accuracy by leveraging external language model (LM) shallow fusion during the decoding process. This technique integrates knowledge from an external LM without requiring the ASR model itself to be retrained.
How Shallow Fusion Works:
During shallow fusion, the output probabilities generated by the ASR model are combined with those from a separate, external language model. The final transcription is then determined by selecting the word sequence that yields the highest combined score. These external LMs are typically trained on vast text datasets, allowing them to capture the statistical patterns, syntactic structures, and contextual dependencies of language. This enables them to predict more plausible word sequences, thereby correcting potential errors from the ASR model.
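Conceptually, the two scores are combined as a weighted sum for each candidate transcription (using the notation of the decoding sections below):
combined_score = acoustic_score + ngram_lm_alpha * lm_score
where a larger ngram_lm_alpha gives the external LM more influence over the selected word sequence.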
Domain Adaptation Benefits:
Shallow fusion is particularly valuable for adapting ASR systems to new or specialized domains. By training the external LM on domain-specific text, such as medical, legal, or technical documents, it learns the vocabulary of that field. This specialized knowledge guides the ASR decoding process towards more accurate and contextually relevant transcriptions.
Traditionally, shallow fusion has been performed during beam search decoding, a method that explores multiple promising hypotheses to find the most likely transcription.
NGPU-LM#
A widely used library for training traditional n-gram language models is KenLM. While KenLM (kpu/kenlm) is known for its efficient CPU-based implementation, its reliance on the CPU can limit performance in high-throughput scenarios, especially when dealing with large-scale data.
NGPU-LM, in contrast, is a GPU-accelerated implementation of a statistical n-gram language model. It uses a universal trie-based data structure, which enables fast, batched queries. For full details, please refer to the paper [ngpulm].
This enables shallow fusion during greedy decoding, creating a middle ground between standard greedy decoding and full beam search with a language model. It preserves the speed and simplicity of greedy decoding while regaining much of the accuracy typically achieved with beam search with external LM fusion. While not as accurate as full beam search, greedy decoding with NGPU-LM fusion offers a compelling balance between speed and accuracy.
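As an illustration of the idea only, the Python sketch below fuses weighted LM scores into a toy greedy CTC loop at every step; the function names and the uniform "LM" are hypothetical and do not represent NeMo's NGPU-LM API or its trie-based batched implementation.
import numpy as np

def toy_lm_scores(context, vocab_size):
    # Hypothetical stand-in for NGPU-LM: a real n-gram LM would condition on
    # the decoded context; here it simply returns uniform log-probabilities.
    return np.full(vocab_size, -np.log(vocab_size))

def greedy_ctc_with_fusion(log_probs, lm_alpha, blank_id):
    # log_probs: [T, V] per-frame ASR log-probabilities (V includes blank).
    tokens, prev = [], blank_id
    for frame in log_probs:
        fused = frame.copy()
        non_blank = np.arange(frame.shape[0]) != blank_id
        # Shallow fusion: add weighted LM scores to non-blank tokens only.
        fused[non_blank] += lm_alpha * toy_lm_scores(tokens, frame.shape[0])[non_blank]
        best = int(fused.argmax())
        if best != blank_id and best != prev:  # CTC collapse rule
            tokens.append(best)
        prev = best
    return tokens

# Example: 10 random frames over a 6-token vocabulary (index 5 is blank).
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(6), size=10))
print(greedy_ctc_with_fusion(log_probs, lm_alpha=0.2, blank_id=5))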
NeMo provides efficient, fully GPU-based beam search implementations for all major ASR model types, allowing beam decoding to operate with real-time factors (RTFx) close to those of greedy decoding: at a batch size of 32, the RTFx difference between beam and greedy decoding is only about 20%. These implementations incorporate NGPU-LM, enabling fast, fully GPU-based decoding and customization while maintaining reasonable speed, even in beam search mode. For full details, refer to [beamsearch].
NGPU-LM fusion is supported for BPE-based ASR models (CTC, RNNT, TDT, AED) during both greedy and beam decoding.
Train NGPU-LM#
NGPU-LM is built using .ARPA files generated by the KenLM library. You can train an n-gram LM using the following script: train_kenlm.py.
The generated .ARPA files can be used directly for GPU-based decoding. However, for faster performance, it is recommended to convert the model to the .nemo format by setting the save_nemo flag to true.
python train_kenlm.py nemo_model_file=<path to the .nemo file of the model> \
train_paths=<list of paths to the training text or JSON manifest files> \
kenlm_bin_path=<path to the bin folder of KenLM library> \
kenlm_model_file=<path to store the binary KenLM model> \
ngram_length=<order of N-gram model> \
preserve_arpa=true \
save_nemo=True
For a complete list of arguments and usage details, refer to the Train N-gram LM.
Note
It is recommended that you use 6 as the order of the N-gram model for BPE-based models. Higher orders may require re-compiling KenLM to support them.
Decoding with NGPU-LM#
To run inference with NGPU-LM fusion, the ngram_lm_model and ngram_lm_alpha fields must be specified in the decoding configuration.
Note
For CTC, RNNT, and TDT models, these fields should be set within the respective greedy or beam sub-configurations.
For AED models running in greedy mode, set the beam size to 1 and specify these fields under the beam sub-configuration.
Examples for different model types are provided below.
CTC Decoding with NGPU-LM#
Greedy Search:
You can run NGPU-LM shallow fusion during greedy CTC decoding using the following command:
python examples/asr/speech_to_text_eval.py \
pretrained_name=nvidia/parakeet-ctc-1.1b \
amp=false \
amp_dtype=bfloat16 \
matmul_precision=high \
compute_dtype=bfloat16 \
presort_manifest=true \
cuda=0 \
batch_size=32 \
dataset_manifest=<path to the evaluation JSON manifest file> \
ctc_decoding.greedy.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
ctc_decoding.greedy.ngram_lm_alpha=0.2 \
ctc_decoding.greedy.allow_cuda_graphs=True \
ctc_decoding.strategy="greedy_batch"
Beam Search:
During CTC beam search, each hypothesis is scored using the following formula:
final_score = acoustic_score + ngram_lm_alpha * lm_score + beam_beta * seq_length
where:
- acoustic_score is the score predicted by the ASR model.
- lm_score is the score predicted by NGPU-LM.
- ngram_lm_alpha is the weight given to the language model.
- beam_beta is a penalty term that accounts for sequence length in the scores.
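For example, with acoustic_score = -12.0, lm_score = -6.0, ngram_lm_alpha = 0.2, beam_beta = 1.0, and seq_length = 5 (values chosen purely for illustration), the hypothesis score is -12.0 + 0.2 * (-6.0) + 1.0 * 5 = -8.2.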
The adjustable arguments of the batched CTC beam decoding algorithm (beam_batch) are listed below. An example command for running fully batched GPU-based CTC decoding with NGPU-LM follows the table.
| Argument | Type | Default | Description |
|---|---|---|---|
| ngram_lm_alpha | float | Required | Weight factor applied to the language model scores. |
| beam_size | int | 4 | Beam size. |
| beam_beta | float | 1 | Penalty applied to word insertions to control the trade-off between insertion and deletion errors during beam search decoding. |
| beam_threshold | float | 20 | Threshold used to prune candidate hypotheses by comparing their scores to the best hypothesis. |
python examples/asr/speech_to_text_eval.py \
pretrained_name=nvidia/parakeet-ctc-1.1b \
amp=false \
amp_dtype=bfloat16 \
matmul_precision=high \
compute_dtype=bfloat16 \
presort_manifest=true \
cuda=0 \
batch_size=32 \
dataset_manifest=<path to the evaluation JSON manifest file> \
ctc_decoding.beam.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
ctc_decoding.beam.ngram_lm_alpha=0.2 \
ctc_decoding.beam.beam_size=12 \
ctc_decoding.beam.beam_beta=1.0 \
ctc_decoding.strategy="beam_batch" \
ctc_decoding.beam.allow_cuda_graphs=True
RNN-T/TDT Decoding with NGPU-LM#
Greedy Search:
You can run NGPU-LM shallow fusion during greedy RNN-T / TDT decoding using the following command:
python examples/asr/speech_to_text_eval.py \
pretrained_name=nvidia/parakeet-rnnt-1.1b \
amp=false \
amp_dtype=bfloat16 \
matmul_precision=high \
compute_dtype=bfloat16 \
presort_manifest=true \
cuda=0 \
batch_size=32 \
dataset_manifest=<path to the evaluation JSON manifest file> \
rnnt_decoding.greedy.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
rnnt_decoding.greedy.ngram_lm_alpha=0.2 \
rnnt_decoding.greedy.allow_cuda_graphs=True \
rnnt_decoding.strategy="greedy_batch"
Note
To run inference with a TDT model, provide a pretrained TDT model in the pretrained_name field (for example, nvidia/parakeet-tdt_ctc-1.1b).
Beam Search:
During RNN-T / TDT beam search, each hypothesis is scored using the following formula:
final_score = acoustic_score + ngram_lm_alpha * lm_score
where:
- acoustic_score is the score predicted by the ASR model.
- lm_score is the score predicted by NGPU-LM.
- ngram_lm_alpha is the weight given to the language model.
The final hypothesis is chosen based on the normalized score final_score / seq_length.
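For illustration, with acoustic_score = -10.0, lm_score = -5.0, and ngram_lm_alpha = 0.2, the combined score is -10.0 + 0.2 * (-5.0) = -11.0; for a hypothesis of length seq_length = 4, the normalized score used for final selection is -11.0 / 4 = -2.75.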
Blank Scoring in Transducer Models
Transducer models include a blank symbol (∅) for frame transitions, while LMs do not model blanks. During shallow fusion, the LM is typically applied only to non-blank tokens. This can lead to excessive blank predictions at higher LM weights, increasing deletion errors. NeMo supports a blank-aware scoring method that adjusts LM contributions to better balance predictions.
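As a minimal sketch, the two schemes can be contrasted as follows, with P_asr and P_lm denoting ASR and LM probabilities; the blank-aware formulation below is only an assumed reading of the lm_weighted_full mode, and [ngpulm] and [beamsearch] should be consulted for the exact definition.
Standard fusion (LM applied to non-blank tokens only):
score(token) = log P_asr(token) + ngram_lm_alpha * log P_lm(token)
score(blank) = log P_asr(blank)
Blank-aware fusion (assumed lm_weighted_full behavior):
score(token) = log P_asr(token) + ngram_lm_alpha * log P_lm(token)
score(blank) = (1 + ngram_lm_alpha) * log P_asr(blank)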
Early vs. Late Pruning
In shallow fusion, LM and ASR scores can be combined at different stages:
Early pruning: ASR selects top hypotheses, then LM rescoring is applied. Efficient for small beams.
Late pruning: ASR and LM scores are combined before pruning. More accurate but requires full-vocab LM queries.
For Transducer models, late pruning with the blank-aware scoring method generally yields better performance than the standard approach.
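The toy NumPy functions below (hypothetical names, not NeMo's API) contrast the two orders of operations on a single decoding step.
import numpy as np

def early_pruning_step(asr_scores, lm_scores, alpha, beam_size):
    # Prune with ASR scores first, then rescore only the survivors with the LM.
    survivors = np.argsort(asr_scores)[-beam_size:]
    fused = asr_scores[survivors] + alpha * lm_scores[survivors]
    return survivors[np.argsort(fused)[::-1]]

def late_pruning_step(asr_scores, lm_scores, alpha, beam_size):
    # Combine ASR and LM scores over the full vocabulary, then prune.
    fused = asr_scores + alpha * lm_scores
    return np.argsort(fused)[-beam_size:][::-1]

asr = np.log(np.array([0.50, 0.30, 0.15, 0.04, 0.01]))
lm = np.log(np.array([0.05, 0.05, 0.10, 0.40, 0.40]))
print(early_pruning_step(asr, lm, alpha=1.0, beam_size=2))  # keeps tokens 0 and 1
print(late_pruning_step(asr, lm, alpha=1.0, beam_size=2))   # keeps tokens 0 and 3
In this example, the token favored by the LM (index 3) survives late pruning but is discarded by early pruning, which only rescores the ASR top candidates.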
Beam Search Strategies:
In NeMo, fully batched implementations of the following strategies are supported:
malsd_batch: fully batched implementation of modified Alignment-Length Synchronous Decoding [alsd], supporting both RNNT and TDT models.
maes_batch: fully batched implementation of modified Adaptive Expansion Search [aes], supporting only RNNT models. CUDA graphs are not supported.
The following is the list of adjustable arguments for the batched RNN-T / TDT beam decoding algorithms (malsd_batch and maes_batch):
| Argument | Type | Strategy | Default | Description |
|---|---|---|---|---|
| ngram_lm_alpha | float | malsd_batch, maes_batch | Required | Weight factor applied to the language model scores. |
| beam_size | int | malsd_batch, maes_batch | 4 | Beam size. |
| pruning_mode | str | malsd_batch, maes_batch | late | Mode for hypothesis pruning. Can be early or late. |
| blank_lm_score_mode | str | malsd_batch, maes_batch | lm_weighted_full | Mode for blank symbol scoring. |
| max_symbols_per_step | int | malsd_batch | 10 | Max symbols to emit on each step to avoid infinite looping. |
| maes_num_step | int | maes_batch | 2 | Number of adaptive steps to take. |
| maes_expansion_beta | float | maes_batch | 1.0 | Maximum number of prefix expansions allowed, in addition to the beam size. |
| maes_expansion_gamma | float | maes_batch | 2.3 | Threshold used to prune candidate hypotheses by comparing their scores to the best hypothesis. |
You can run NGPU-LM shallow fusion during beam RNN-T / TDT decoding using the following command:
python examples/asr/speech_to_text_eval.py \
pretrained_name=nvidia/parakeet-rnnt-1.1b \
amp=false \
amp_dtype=bfloat16 \
matmul_precision=high \
compute_dtype=bfloat16 \
presort_manifest=true \
cuda=0 \
batch_size=32 \
dataset_manifest=<path to the evaluation JSON manifest file> \
rnnt_decoding.beam.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
rnnt_decoding.beam.ngram_lm_alpha=0.2 \
rnnt_decoding.beam.beam_size=12 \
rnnt_decoding.beam.pruning_mode="late" \
rnnt_decoding.beam.blank_lm_score_mode="lm_weighted_full" \
rnnt_decoding.beam.allow_cuda_graphs=True \
rnnt_decoding.strategy="malsd_batch"
Note
To run inference with a TDT model, provide a pretrained TDT model in the pretrained_name field (for example, nvidia/parakeet-tdt_ctc-1.1b).
AED Decoding with NGPU-LM#
Beam Search:
You can run NGPU-LM shallow fusion during AED beam search decoding using the following command:
python examples/asr/speech_to_text_eval.py \
pretrained_name="nvidia/canary-1b" \
amp=false \
amp_dtype=bfloat16 \
matmul_precision=high \
compute_dtype=bfloat16 \
presort_manifest=true \
cuda=0 \
batch_size=32 \
dataset_manifest=<dataset_manifest> \
multitask_decoding.beam.beam_size=4 \
multitask_decoding.beam.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
multitask_decoding.beam.ngram_lm_alpha=0.2 \
multitask_decoding.strategy="beam"
Note
For greedy decoding with NGPU-LM, use beam search with beam_size=1.
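For example, the beam search command above can be switched to greedy behavior by changing multitask_decoding.beam.beam_size=4 to multitask_decoding.beam.beam_size=1 while keeping the remaining arguments unchanged.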
References#
[ngpulm] V. Bataev, A. Andrusenko, L. Grigoryan, A. Laptev, V. Lavrukhin, and B. Ginsburg. NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding. arXiv:2505.22857, 2025. https://arxiv.org/abs/2505.22857
[beamsearch] L. Grigoryan, V. Bataev, A. Andrusenko, H. Xu, V. Lavrukhin, and B. Ginsburg. Pushing the Limits of Beam Search Decoding for Transducer-based ASR Models. arXiv:2506.00185, 2025. https://arxiv.org/abs/2506.00185
[alsd] G. Saon, Z. Tüske, and K. Audhkhasi. Alignment-Length Synchronous Decoding for RNN Transducer. In: ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7804–7808, 2020. https://doi.org/10.1109/ICASSP40776.2020.9053040
[aes] J. Kim, Y. Lee, and E. Kim. Accelerating RNN Transducer Inference via Adaptive Expansion Search. IEEE Signal Processing Letters, vol. 27, pp. 2019–2023, 2020. https://doi.org/10.1109/LSP.2020.3036335