NGPU-LM (GPU-based N-gram Language Model) Language Model Fusion#

ASR systems can achieve significantly improved accuracy by leveraging external language model (LM) shallow fusion during the decoding process. This technique integrates knowledge from an external LM without requiring the ASR model itself to be retrained.

How Shallow Fusion Works:

During shallow fusion, the output probabilities generated by the ASR model are combined with those from a separate, external language model. The final transcription is then determined by selecting the word sequence that yields the highest combined score. These external LMs are typically trained on vast text datasets, allowing them to capture the statistical patterns, syntactic structures, and contextual dependencies of language. This enables them to predict more plausible word sequences, thereby correcting potential errors from the ASR model.
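
As a minimal illustration (with made-up probabilities, not NeMo's internal API), the combined score for each candidate token is a weighted sum of the two models' log-probabilities:

import torch

# Hypothetical per-token probabilities over a 4-token vocabulary.
log_p_asr = torch.log(torch.tensor([0.50, 0.30, 0.15, 0.05]))  # ASR model
log_p_lm = torch.log(torch.tensor([0.10, 0.60, 0.20, 0.10]))   # external LM
lm_alpha = 0.5  # LM weight, typically tuned on a development set

# Shallow fusion: combine log-scores, then pick the best-scoring token.
combined = log_p_asr + lm_alpha * log_p_lm
print(combined.argmax().item())  # 1: the LM shifts the choice away from token 0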

Domain Adaptation Benefits:

Shallow fusion is particularly valuable for adapting ASR systems to new or specialized domains. By training the external LM on domain-specific text-such as medical, legal, or technical documents-it learns the vocabulary of that field. This specialized knowledge guides the ASR decoding process towards more accurate and contextually relevant transcriptions.

Traditionally, shallow fusion has been performed during beam search decoding, a method that explores multiple promising hypotheses to find the most likely transcription.

NGPU-LM#

A widely used library for training traditional n-gram language models is KenLM. While KenLM (kpu/kenlm) is known for its efficient CPU-based implementation, its reliance on the CPU can limit performance in high-throughput scenarios, especially when dealing with large-scale data.

NGPU-LM, in contrast, is a GPU-accelerated implementation of a statistical n-gram language model. It uses a universal trie-based data structure that enables fast, batched queries. For full details, please refer to the paper [ngpulm].

This enables shallow fusion during greedy decoding, creating a middle ground between standard greedy decoding and full beam search with a language model. While not as accurate as full beam search, greedy decoding with NGPU-LM fusion preserves the speed and simplicity of greedy decoding while regaining much of the accuracy typically achieved with beam search and external LM fusion.
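
A hedged sketch of the idea (illustrative only, not NeMo's actual implementation): at each greedy step, LM scores for the entire batch are fetched in a single batched query and added to the ASR log-probabilities before the argmax:

import torch

def greedy_step_with_lm(asr_log_probs: torch.Tensor,
                        lm_log_probs: torch.Tensor,
                        lm_alpha: float = 0.2) -> torch.Tensor:
    """One greedy decoding step with n-gram LM shallow fusion.

    asr_log_probs: (batch, vocab) ASR log-probabilities for the current step.
    lm_log_probs:  (batch, vocab) LM scores for all hypotheses, e.g. obtained
                   from a GPU-resident trie in one batched lookup.
    """
    fused = asr_log_probs + lm_alpha * lm_log_probs
    return fused.argmax(dim=-1)  # (batch,) next token per utterance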

NeMo provides efficient, fully GPU-based beam search implementations for all major ASR model types, allowing beam decoding to operate with real-time factors (RTFx) close to those of greedy decoding. At a batch size of 32, the RTFx difference between beam and greedy decoding is only about 20%. These implementations incorporate NGPU-LM, so users can customize decoding while maintaining reasonable speed, even in beam search mode. For full details, please refer to the paper [beamsearch].

NGPU-LM fusion is supported for BPE-based ASR models (CTC, RNNT, TDT, AED) during both greedy and beam decoding.

Train NGPU-LM#

NGPU-LM is built using .ARPA files generated by the KenLM library. You can train an n-gram LM using the following script: train_kenlm.py.

The generated .ARPA files can be directly used for GPU-based decoding. However, for faster performance, it is recommended to convert the model to the .nemo format by setting the save_nemo flag to true.

python train_kenlm.py nemo_model_file=<path to the .nemo file of the model> \
                          train_paths=<list of paths to the training text or JSON manifest files> \
                          kenlm_bin_path=<path to the bin folder of KenLM library> \
                          kenlm_model_file=<path to store the binary KenLM model> \
                          ngram_length=<order of N-gram model> \
                          preserve_arpa=true \
                          save_nemo=True

For a complete list of arguments and usage details, refer to the Train N-gram LM documentation.

Note

It is recommended that you use 6 as the order of the N-gram model for BPE-based models. Higher orders may require re-compiling KenLM to support them.

Decoding with NGPU-LM#

To run inference with NGPU-LM fusion, the ngram_lm_model and ngram_lm_alpha fields must be specified in the decoding configuration.

Note

For CTC, RNNT, and TDT models, these fields should be set within the respective greedy or beam sub-configurations. For AED models running in greedy mode, set the beam size to 1 and specify these fields under the beam sub-configuration.

Examples for different model types are provided below.

CTC Decoding with NGPU-LM#

Greedy Search:

You can run NGPU-LM shallow fusion during greedy CTC decoding using the following command:

python examples/asr/speech_to_text_eval.py \
    pretrained_name=nvidia/parakeet-ctc-1.1b \
    amp=false \
    amp_dtype=bfloat16 \
    matmul_precision=high \
    compute_dtype=bfloat16 \
    presort_manifest=true \
    cuda=0 \
    batch_size=32 \
    dataset_manifest=<path to the evaluation JSON manifest file> \
    ctc_decoding.greedy.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
    ctc_decoding.greedy.ngram_lm_alpha=0.2 \
    ctc_decoding.greedy.allow_cuda_graphs=True \
    ctc_decoding.strategy="greedy_batch"

Beam Search:

During CTC beam search, each hypothesis is scored using the following formula:

final_score = acoustic_score + ngram_lm_alpha * lm_score + beam_beta * seq_length

where:

  • acoustic_score is the score predicted by the ASR model.

  • lm_score is the score predicted by the NGPU-LM.

  • ngram_lm_alpha is the weight given to the language model.

  • beam_beta is a penalty term that accounts for sequence length in the scores.
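
For instance, with ngram_lm_alpha=0.2 and beam_beta=1.0, a hypothesis of length 5 with illustrative scores acoustic_score=-12.0 and lm_score=-8.0 receives:

# Worked example with made-up numbers:
acoustic_score, lm_score, seq_length = -12.0, -8.0, 5
final_score = acoustic_score + 0.2 * lm_score + 1.0 * seq_length
# final_score = -12.0 - 1.6 + 5.0 = -8.6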

The following is the list of adjustable arguments for the batched CTC beam decoding algorithm beam_batch:

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| ngram_lm_alpha | float | Required | Weight factor applied to the language model scores. |
| beam_size | int | 4 | Beam size. |
| beam_beta | float | 1 | Penalty applied to word insertions to control the trade-off between insertion and deletion errors during beam search decoding. |
| beam_threshold | float | 20 | Threshold used to prune candidate hypotheses by comparing their scores to the best hypothesis. |

To run fully batched GPU-based CTC beam decoding with NGPU-LM, use the following command:

python examples/asr/speech_to_text_eval.py \
    pretrained_name=nvidia/parakeet-ctc-1.1b \
    amp=false \
    amp_dtype=bfloat16 \
    matmul_precision=high \
    compute_dtype=bfloat16 \
    presort_manifest=true \
    cuda=0 \
    batch_size=32 \
    dataset_manifest=<path to the evaluation JSON manifest file> \
    ctc_decoding.beam.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
    ctc_decoding.beam.ngram_lm_alpha=0.2 \
    ctc_decoding.beam.beam_size=12 \
    ctc_decoding.beam.beam_beta=1.0 \
    ctc_decoding.strategy="beam_batch" \
    ctc_decoding.beam.allow_cuda_graphs=True

RNN-T/TDT Decoding with NGPU-LM#

Greedy Search:

You can run NGPU-LM shallow fusion during greedy RNN-T / TDT decoding using the following command:

python examples/asr/speech_to_text_eval.py \
    pretrained_name=nvidia/parakeet-rnnt-1.1b \
    amp=false \
    amp_dtype=bfloat16 \
    matmul_precision=high \
    compute_dtype=bfloat16 \
    presort_manifest=true \
    cuda=0 \
    batch_size=32 \
    dataset_manifest=<path to the evaluation JSON manifest file> \
    rnnt_decoding.greedy.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
    rnnt_decoding.greedy.ngram_lm_alpha=0.2 \
    rnnt_decoding.greedy.allow_cuda_graphs=True \
    rnnt_decoding.strategy="greedy_batch"

Note

To run inference with a TDT model, provide a pretrained TDT model in the pretrained_name field (for example, nvidia/parakeet-tdt_ctc-1.1b).

Beam Search:

During RNN-T / TDT beam search, each hypothesis is scored using the following formula:

final_score = acoustic_score + ngram_lm_alpha * lm_score

where:

  • acoustic_score is the score predicted by the ASR model.

  • lm_score is the score predicted by the NGPU-LM.

  • ngram_lm_alpha is the weight given to the language model.

The final hypothesis is chosen based on the length-normalized score final_score / seq_length.
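
As a small sketch with made-up scores, length normalization keeps longer hypotheses competitive:

# Hypothetical final beam: (final_score, seq_length) pairs.
beam = [(-9.0, 6), (-7.0, 4)]
best = max(beam, key=lambda h: h[0] / h[1])
# -9.0 / 6 = -1.5 beats -7.0 / 4 = -1.75, so the longer hypothesis is chosen.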

Blank Scoring in Transducer Models

Transducer models include a blank symbol (∅) for frame transitions, while LMs do not model blanks. During shallow fusion, the LM is typically applied only to non-blank tokens:

\[\begin{split}\ln p_{\text{tot}}[k] = \begin{cases} \ln p[k] + \lambda \ln p_{\text{LM}}[k], & k \in V \\ \ln p[\emptyset], & k = \emptyset \end{cases}\end{split}\]

This can lead to excessive blank predictions at higher LM weights, increasing deletion errors. NeMo supports a blank-aware scoring method that adjusts LM contributions to better balance predictions:

\[\begin{split}\ln p_{\text{tot}}[k] = \begin{cases} \ln p[k] + \lambda \ln((1 - p[\emptyset]) \cdot p_{\text{LM}}[k]), & k \in V \\ (1 + \lambda) \ln p[\emptyset], & k = \emptyset \end{cases}\end{split}\]
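
A minimal sketch of the two scoring modes in log space (illustrative, not NeMo's internal code):

import torch

def fuse_transducer_scores(log_p: torch.Tensor, log_p_lm: torch.Tensor,
                           lm_alpha: float, blank_id: int,
                           blank_aware: bool = True) -> torch.Tensor:
    """Combine ASR and LM log-probabilities over [vocab + blank].

    log_p:    (vocab + 1,) ASR log-probabilities, blank at index blank_id.
    log_p_lm: (vocab,) LM log-probabilities over the non-blank tokens.
    """
    fused = log_p.clone()
    non_blank = torch.arange(log_p.shape[0]) != blank_id
    if blank_aware:  # "lm_weighted_full": weight the LM by the non-blank mass
        log_non_blank_mass = torch.log1p(-log_p[blank_id].exp())  # ln(1 - p[blank])
        fused[non_blank] += lm_alpha * (log_non_blank_mass + log_p_lm)
        fused[blank_id] = (1.0 + lm_alpha) * log_p[blank_id]
    else:            # standard "no_score": blank is left untouched
        fused[non_blank] += lm_alpha * log_p_lm
    return fused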

Early vs. Late Pruning

In shallow fusion, LM and ASR scores can be combined at different stages:

  • Early pruning: the ASR model selects the top candidates first, and LM scores are applied only to them. Efficient for small beams.

  • Late pruning: ASR and LM scores are combined over the full vocabulary before pruning. More accurate, but requires full-vocabulary LM queries.

For Transducer models, late pruning with the blank-aware scoring method generally yields better performance than the standard approach.
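
A schematic contrast between the two modes, where lm_scores_fn stands in for a (hypothetical) batched LM lookup:

import torch

def early_pruning(asr_scores, lm_scores_fn, alpha, beam_size):
    # Prune by ASR scores first, then query the LM only for the survivors.
    top_scores, top_idx = asr_scores.topk(beam_size)
    return top_scores + alpha * lm_scores_fn(top_idx), top_idx

def late_pruning(asr_scores, full_lm_scores, alpha, beam_size):
    # Combine ASR scores with full-vocabulary LM scores, then prune.
    fused = asr_scores + alpha * full_lm_scores
    return fused.topk(beam_size)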

Beam Search Strategies:

NeMo provides fully batched implementations of the following strategies:

  • malsd_batch: fully batched implementation of modified Alignment-Length Synchronous Decoding [alsd], supporting both RNN-T and TDT models.

  • maes_batch: fully batched implementation of modified Adaptive Expansion Search [aes], supporting only RNN-T models. CUDA graphs are not supported.

The following is the list of adjustable arguments for the batched RNN-T/TDT beam decoding strategies:

| Argument | Type | Strategy | Default | Description |
|----------|------|----------|---------|-------------|
| ngram_lm_alpha | float | malsd_batch, maes_batch | Required | Weight factor applied to the language model scores. |
| beam_size | int | malsd_batch, maes_batch | 4 | Beam size. |
| pruning_mode | str | malsd_batch, maes_batch | late | Mode for hypothesis pruning. Can be early or late. |
| blank_lm_score_mode | str | malsd_batch, maes_batch | lm_weighted_full | Mode for blank symbol scoring. Can be no_score or lm_weighted_full. |
| max_symbols_per_step | int | malsd_batch | 10 | Maximum symbols to emit per step, to avoid infinite looping. |
| maes_num_step | int | maes_batch | 2 | Number of adaptive steps to take. |
| maes_expansion_beta | float | maes_batch | 1.0 | Maximum number of prefix expansions allowed, in addition to the beam size. |
| maes_expansion_gamma | float | maes_batch | 2.3 | Threshold used to prune candidate hypotheses by comparing their scores to the best hypothesis. |

You can run NGPU-LM shallow fusion during beam RNN-T / TDT decoding using the following command:

python examples/asr/speech_to_text_eval.py \
    pretrained_name=nvidia/parakeet-rnnt-1.1b \
    amp=false \
    amp_dtype=bfloat16 \
    matmul_precision=high \
    compute_dtype=bfloat16 \
    presort_manifest=true \
    cuda=0 \
    batch_size=32 \
    dataset_manifest=<path to the evaluation JSON manifest file> \
    rnnt_decoding.beam.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
    rnnt_decoding.beam.ngram_lm_alpha=0.2 \
    rnnt_decoding.beam.beam_size=12 \
    rnnt_decoding.beam.pruning_mode="late" \
    rnnt_decoding.beam.blank_lm_score_mode="lm_weighted_full" \
    rnnt_decoding.beam.allow_cuda_graphs=True \
    rnnt_decoding.strategy="malsd_batch"

Note

To run inference with a TDT model, provide a pretrained TDT model in the pretrained_name field (for example, nvidia/parakeet-tdt_ctc-1.1b).

AED Decoding with NGPU-LM#

Beam Search:

You can run NGPU-LM shallow fusion during AED beam decoding using the following command:

python examples/asr/speech_to_text_eval.py \
    pretrained_name="nvidia/canary-1b" \
    amp=false \
    amp_dtype=bfloat16 \
    matmul_precision=high \
    compute_dtype=bfloat16 \
    presort_manifest=true \
    cuda=0 \
    batch_size=32 \
    dataset_manifest=<dataset_manifest> \
    multitask_decoding.beam.beam_size=4 \
    multitask_decoding.beam.ngram_lm_model=<path to the .nemo/.ARPA file of the NGPU-LM model> \
    multitask_decoding.beam.ngram_lm_alpha=0.2 \
    multitask_decoding.strategy="beam"

Note

For greedy decoding with NGPU-LM, use beam search with beam_size=1.

References#

[ngpulm]

V. Bataev, A. Andrusenko, L. Grigoryan, A. Laptev, V. Lavrukhin, and B. Ginsburg. NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding. arXiv:2505.22857, 2025. Available at: https://arxiv.org/abs/2505.22857

[beamsearch]

L. Grigoryan, V. Bataev, A. Andrusenko, H. Xu, V. Lavrukhin, and B. Ginsburg. Pushing the Limits of Beam Search Decoding for Transducer-based ASR Models. arXiv:2506.00185, 2025. Available at: https://arxiv.org/abs/2506.00185

[alsd]

G. Saon, Z. Tüske, and K. Audhkhasi. Alignment-Length Synchronous Decoding for RNN Transducer. In: ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7804–7808, 2020. doi: https://doi.org/10.1109/ICASSP40776.2020.9053040

[aes]

J. Kim, Y. Lee, and E. Kim. Accelerating RNN Transducer Inference via Adaptive Expansion Search. IEEE Signal Processing Letters, vol. 27, pp. 2019–2023, 2020. doi: https://doi.org/10.1109/LSP.2020.3036335