Speculative Decoding#
NeMo 2.0 provides integration with NVIDIA TensorRT Model Optimizer (ModelOpt) for Speculative Decoding, making it easy to optimize your models for faster inference. This section explains how to use this feature effectively.
Speculative Decoding is a technique that improves inference speed by having a lightweight draft module propose several candidate tokens ahead of time and then verifying them against the main model's output, so that multiple tokens can be accepted per forward pass of the main model. This approach can significantly reduce latency while maintaining output quality.
In NeMo, Speculative Decoding is enabled by NVIDIA TensorRT Model Optimizer (ModelOpt), a library for optimizing deep learning models for inference on GPUs.
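To make the mechanism concrete, the following is a minimal, framework-free sketch of the draft-and-verify loop. The draft_next and target_next functions are toy stand-ins, not NeMo or ModelOpt APIs, and a real implementation verifies all drafted positions in a single forward pass of the main model:

def draft_next(tokens):
    # Toy draft model: a cheap guess at the next token.
    return (tokens[-1] + 1) % 100

def target_next(tokens):
    # Toy target model: usually agrees with the draft, sometimes diverges.
    step = 2 if len(tokens) % 7 == 0 else 1
    return (tokens[-1] + step) % 100

def speculative_step(tokens, k=4):
    # 1) Draft k candidate tokens autoregressively with the cheap model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_next(draft))
    # 2) Verify the candidates against the target model. In a real system,
    #    all k positions are scored in one target-model forward pass.
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        expected = target_next(draft[:i])
        if draft[i] == expected:
            accepted.append(draft[i])   # draft token verified; keep it
        else:
            accepted.append(expected)   # first mismatch: take the target's
            break                       # token and stop accepting drafts
    return accepted

tokens = [1]
for _ in range(4):
    tokens = speculative_step(tokens)
print(tokens)  # several draft tokens are accepted per verification step

When the draft module predicts the main model well, most drafted tokens are accepted, so several tokens are produced per main-model forward pass; this is where the latency savings come from.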
Speculative Decoding Process#
The workflow involves three steps:
Convert Checkpoint: Enhance the base model checkpoint with speculative decoding capabilities and save it as a new NeMo checkpoint.
Train Speculative Decoding Module: Pretrain or fine-tune the model in standard NeMo fashion to train the new speculative decoding module.
Export Checkpoint: Export the enhanced model checkpoint for use in inference frameworks with improved performance.
Limitations#
Only GPT-based NeMo 2.0 checkpoints are supported.
Currently, only the Eagle 3 method is supported for speculative decoding.
Converting a Model#
To convert a model to use speculative decoding, use the provided conversion script:
# Paths to the base NeMo 2.0 checkpoint and the converted output
MODEL_PATH="path/to/base/nemo2-checkpoint/"
OUTPUT_PATH="path/to/output/checkpoint/"

# Convert the checkpoint to the Eagle 3 speculative decoding architecture
python scripts/llm/gpt_convert_speculative.py \
    --model_path ${MODEL_PATH} \
    --specdec_algo eagle3 \
    --export_dir ${OUTPUT_PATH}
Once converted, the model can be used with any NeMo operation that supports the base model type. During training, the converted model automatically freezes all modules from the original checkpoint and uses a custom loss function so that only the new speculative decoding module is trained. In evaluation and inference mode, the speculative decoding module is used automatically.
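As an illustrative sketch (not a prescribed recipe), the converted checkpoint can be restored into a standard NeMo 2.0 fine-tuning recipe to train the speculative decoding module. The recipe name, paths, and executor settings below are assumptions for a Llama-3-8B base model and will differ for other setups:

import nemo_run as run
from nemo.collections import llm

# Hypothetical example: adapt the recipe to your base model and cluster.
recipe = llm.llama3_8b.finetune_recipe(
    name="train_specdec_module",
    dir="path/to/experiment/dir",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme=None,  # full-model recipe; the converted model freezes the base weights itself
)

# Restore the checkpoint produced by gpt_convert_speculative.py.
recipe.resume.restore_config.path = "path/to/output/checkpoint/"

run.run(recipe, executor=run.LocalExecutor(launcher="torchrun", ntasks_per_node=8))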
Exporting#
When exporting a model with speculative decoding to Hugging Face format, you can choose to export only the speculative decoding module. This is useful for downstream applications such as TRT-LLM or SGLang, which can use the speculative decoding module alongside an existing inference-exported base model:
from nemo.collections.llm.api import export_ckpt

export_ckpt(
    model_path="path/to/converted/checkpoint/",
    output_path="path/to/hf/output/",
    modelopt_export_kwargs={
        "export_extra_modules": True,  # export only the speculative decoding module
    },
)
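As a concrete (but assumed) downstream example, the exported module can serve as the draft model for an SGLang server. The flags follow SGLang's EAGLE-3 serving options, and the model paths and tuning values below are placeholders:

python3 -m sglang.launch_server \
    --model-path path/to/base/hf/model/ \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path path/to/hf/output/ \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32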