Train an EAGLE Drafter for Speculative Decoding — End-to-End Guide
Train an EAGLE Drafter for Speculative Decoding — End-to-End Guide
A step-by-step guide for training an EAGLE speculative decoding drafter to accelerate LLM inference using NeMo AutoModel.
What is EAGLE Speculative Decoding?
Large language models generate text one token at a time — each token requires a full forward pass through the entire model. Speculative decoding speeds this up by pairing the large target model with a small, fast drafter model. The drafter guesses multiple tokens ahead; the target model then verifies them all in a single forward pass, accepting correct guesses and rejecting wrong ones. The output is mathematically identical to running the target model alone, but 2-3x faster.
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a family of speculative decoding methods. NeMo AutoModel supports three variants:
The Task
We train an EAGLE-3 drafter for Llama 3.1 8B Instruct on the PerfectBlend dataset — a chat corpus whose assistant turns were generated by the same Llama 3.1 8B model, ensuring distribution alignment between training data and target.
After training, we serve the target + drafter together via SGLang for accelerated inference.
Guide Overview
Hardware Requirements
The target model is loaded in full precision and frozen during training. Only the small drafter model (a few transformer layers) has trainable parameters, so GPU memory is dominated by the target model size.
Step 0 — Environment Setup
This guide runs inside the NeMo AutoModel Docker container:
For SGLang serving (Step 5), install it in the same environment:
Step 1 — Understand EAGLE Architecture
How EAGLE-3 Works
EAGLE-3 pairs a frozen target LLM with a small trainable drafter. During training, the drafter learns to predict what the target model would produce next, using a technique called test-time training (TTT) unroll:
Key components:
- Target model: The full LLM (e.g., Llama 3.1 8B), completely frozen. Provides hidden states from selected intermediate layers as auxiliary inputs to the drafter.
- Draft model: A shallow transformer (typically 2 layers) with:
- A fusion layer (
fc) that combines auxiliary hidden states from 3 target layers - Its own attention layers, MLP, and layer norm
- A smaller output vocabulary (e.g., 8192 or 32000 tokens instead of 128k) to reduce compute
- A fusion layer (
- TTT unroll: The drafter runs multiple autoregressive steps (default 4) during
training, with exponentially decaying loss weights (
0.8^i). This teaches the drafter to make multi-step predictions — exactly what speculative decoding needs.
EAGLE-3.1 Drafter Toggles
The same train_eagle3 recipe supports the EAGLE-3.1 drafter variant via two
optional flags in recipe_args. Both default to false, so existing EAGLE-3
configs and checkpoints behave identically. Setting them applies the EAGLE-3.1
architectural changes from
vllm-project/vllm#42764 to
the Llama-style draft. The MLA-backbone community release
lightseekorg/kimi-k2.6-eagle3.1-mla
is a separate architecture (Eagle3DeepseekV2ForCausalLM) and is not
produced by this recipe.
Together they remove the “attention drift” pattern (loss of focus on sink tokens at deeper speculation depths) reported by the EAGLE-3.1 paper and let the drafter behave more like a recurrently applied module than a stack of extra layers bolted onto the target.
How EAGLE-1 Differs
EAGLE-1 is simpler: it uses a single transformer layer, predicts the full
vocabulary, and trains with a combined loss of MSE on hidden states
(hidden_loss_weight) and cross-entropy on tokens (token_loss_weight).
No TTT unroll, no vocabulary mapping.
Step 2 — Prepare the Training Dataset
Data format
EAGLE training expects chat data in the OpenAI messages format — either
JSONL files or HuggingFace datasets with a messages column:
Option A: Use a pre-regenerated dataset (recommended)
For best results, the assistant turns in your training data should come from the same model you’ll use as the target at inference time. The PerfectBlend dataset already has answers regenerated by Llama 3.1 8B Instruct:
Expected output:
PerfectBlend uses a conversations column, but ChatDataset expects messages.
Rename the column before training:
Option B: Regenerate answers from your target model
If you have a chat dataset whose answers were generated by a different model, you can regenerate them using your target. This aligns the training data distribution with the model the drafter will actually assist at inference time.
Step B.1 — Start the target server (in one shell):
Wait for Uvicorn running on http://0.0.0.0:30000 before proceeding.
Step B.2 — Regenerate (in another shell):
For each sample, the script:
- Loads the conversation from the input dataset
- Drops the trailing assistant turn, keeping the user prompt context
- Calls the target server to generate a new assistant response
- Saves the rebuilt conversation to parquet shards
The output directory contains parquet files with a messages column — ready
for EAGLE training. The script is resumable: rerun with --resume to skip
completed shards.
Step 3 — Configure and Launch EAGLE-3 Training
YAML config
Save the following as eagle3_llama8b.yaml:
Config field reference
Launch training
Multi-GPU (8x A100, production):
Single-GPU (quick test with Llama 3.2 1B):
For a quick test, use the MVP config with Llama 3.2 1B and a small dataset:
For GPUs with FlashAttention support, add draft_attn_implementation: flash_attention_2
to recipe_args for faster training. See
llama_eagle3_mvp_flash_attn.yaml
for a complete example.
Step 4 — Monitor Training and Inspect Checkpoints
What to watch
Training loss should drop steadily. Here is a sample log from the PerfectBlend 200k run on 8x A100:
Checkpoint layout
Each checkpoint is saved under <checkpoint_dir>/epoch_<E>_step_<S>/:
Resume from checkpoint
If training is interrupted, resume from the latest checkpoint:
Or point to a specific checkpoint:
Step 5 — Serve with SGLang
The serve_sglang helper converts the training checkpoint into an
HF/SGLang-compatible format and launches the server in one command.
Launch the server
On first launch, the helper:
- Loads
draft_model.ptandeagle3_meta.ptfrom the checkpoint - Rewrites the architecture name for SGLang compatibility (
LlamaEagle3DraftModel->LlamaForCausalLMEagle3) - Exports
model.safetensorsandspeculative_token_map.ptinto amodel/subdirectory - Launches SGLang with the correct speculative decoding flags
Serving parameters
Pass extra SGLang flags after --:
Smoke-test the server
Once you see Uvicorn running on http://0.0.0.0:30000, test it:
Expected output:
The accept_length_per_step metric shows how many tokens the target model
accepts per speculative step on average. Higher is better — a value of 3.0+
indicates the drafter is accurately predicting the target’s behavior.
OpenAI-compatible endpoint
SGLang also exposes an OpenAI-compatible API:
Expected output:
Step 6 — (Bonus) Train an EAGLE-1 Drafter
EAGLE-1 is simpler and faster to train, making it a good starting point for experimentation. It uses a single transformer layer and trains with a combined hidden-state MSE + token cross-entropy loss.
YAML config
Save as eagle1_llama8b.yaml:
Launch
Serve
Use --algorithm EAGLE (not EAGLE3) for EAGLE-1/2 drafters: