# Train an EAGLE Drafter for Speculative Decoding — End-to-End Guide

> Step-by-step guide for training EAGLE-3 and EAGLE-1 speculative decoding drafters with NeMo AutoModel, from data preparation through SGLang serving.

**A step-by-step guide for training an EAGLE speculative decoding drafter to
accelerate LLM inference using
[NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel).**

***

## What is EAGLE Speculative Decoding?

Large language models generate text one token at a time — each token requires a
full forward pass through the entire model. Speculative decoding speeds this up
by pairing the large **target model** with a small, fast **drafter model**. The
drafter guesses multiple tokens ahead; the target model then verifies them all
in a single forward pass, accepting correct guesses and rejecting wrong ones.
The output matches what the target model would produce on its own, and
generation is typically 2-3x faster, depending on how often drafts are accepted.
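
A toy sketch of the draft-then-verify loop under greedy decoding may make this concrete. The `target` and `drafter` functions below are hypothetical stand-ins (plain Python functions over token ids, not real models), and the per-position check stands in for what a real system does in one batched forward pass:

```python
from typing import Callable, List

# Toy "model": maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_generate(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_generate(target: NextToken, drafter: NextToken, prompt: List[int],
                         max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The drafter proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2. The target checks each drafted position in order.
        accepted: List[int] = []
        for guess in draft:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always emit the target's token
            if guess != expected:
                break  # first mismatch: discard the remaining drafts
        else:
            # All k drafts accepted: the target's next token comes as a bonus.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit]


# Hypothetical toy models over a 7-token vocabulary.
def target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 7


def drafter(prefix: List[int]) -> int:
    # Guesses 0 whenever the prefix length is a multiple of 3, otherwise copies the target.
    return 0 if len(prefix) % 3 == 0 else target(prefix)


print(speculative_generate(target, drafter, [1, 2], 10) == greedy_generate(target, [1, 2], 10))
```

Every emitted token is the target's own greedy choice, so the result equals plain greedy decoding; the drafter only changes how many tokens each verification round yields.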

**EAGLE** (Extrapolation Algorithm for Greater Language-model Efficiency) is a
family of speculative decoding methods. NeMo AutoModel supports three variants:

| Variant     | Recipe         | Description                                                                                |
| ----------- | -------------- | ------------------------------------------------------------------------------------------ |
| **EAGLE-1** | `train_eagle1` | Lightweight 1-layer draft transformer; learns to predict target hidden states + next token |
| **EAGLE-2** | `train_eagle2` | Same architecture as EAGLE-1 (alias recipe)                                                |
| **EAGLE-3** | `train_eagle3` | Advanced drafter with test-time training (TTT) unroll and vocabulary mapping; best speed   |

## The Task

We train an EAGLE-3 drafter for **Llama 3.1 8B Instruct** on the
[PerfectBlend](https://huggingface.co/datasets/frankleeeee/PerfectBlend-Regenerated-Llama-3.1-8B-Instruct)
dataset — a chat corpus whose assistant turns were generated by the same Llama
3.1 8B model, ensuring distribution alignment between training data and target.

After training, we serve the target + drafter together via
[SGLang](https://github.com/sgl-project/sglang) for accelerated inference.

## Guide Overview

| Step       | Description                              |
| ---------- | ---------------------------------------- |
| **Step 0** | Environment setup                        |
| **Step 1** | Understand EAGLE architecture            |
| **Step 2** | Prepare the training dataset             |
| **Step 3** | Configure and launch EAGLE-3 training    |
| **Step 4** | Monitor training and inspect checkpoints |
| **Step 5** | Serve with SGLang                        |
| **Step 6** | (Bonus) Train an EAGLE-1 drafter         |

## Hardware Requirements

| Setup            | Target Model          | GPUs          | Training Time                  |
| ---------------- | --------------------- | ------------- | ------------------------------ |
| MVP (quick test) | Llama 3.2 1B          | 1x A100 80 GB | \~10 min (1 epoch, 1k samples) |
| Production       | Llama 3.1 8B Instruct | 8x A100 80 GB | \~2 h (1 epoch, 200k samples)  |

The target model is loaded in full precision and frozen during training. Only the
small drafter model (a few transformer layers) has trainable parameters, so GPU
memory is dominated by the target model size.
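
A back-of-the-envelope estimate of the weight footprint can help when sizing GPUs. The sketch below is an approximation under stated assumptions: the parameter counts are rounded, the bytes-per-parameter values (2 for bf16, 4 for fp32) are illustrative since the exact load dtype depends on your configuration, and activations and drafter optimizer state are not included.

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Rounded parameter counts (assumed for illustration).
targets = {"Llama 3.2 1B": 1.2e9, "Llama 3.1 8B": 8.0e9}

for name, num_params in targets.items():
    bf16 = weights_gib(num_params, 2)
    fp32 = weights_gib(num_params, 4)
    print(f"{name}: ~{bf16:.1f} GiB in bf16, ~{fp32:.1f} GiB in fp32")
```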

***

## Step 0 — Environment Setup

This guide runs **inside** the NeMo AutoModel Docker container:

```bash
docker run -it --rm --gpus all --ipc=host --network host \
    -v "$(pwd)":/workspace \
    nvcr.io/nvidia/nemo-automodel:26.06.00
```

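Inside the container, log in to Hugging Face so the gated Llama weights can be
downloaded:

```bash
huggingface-cli login
```

Then change to `/opt/Automodel`, the directory from which the relative
`examples/...` config paths in this guide are resolved:

```bash
cd /opt/Automodel
```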

For SGLang serving (Step 5), install it in the same environment:

```bash
uv pip install "sglang>=0.5.9"
```

***

## Step 1 — Understand EAGLE Architecture

### How EAGLE-3 Works

EAGLE-3 pairs a frozen target LLM with a small trainable drafter. During
training, the drafter learns to predict what the target model would produce
next, using a technique called **test-time training (TTT) unroll**:

```text
Target (frozen)          Drafter (trainable)
┌──────────────┐         ┌──────────────┐
│ Llama 3.1 8B │ ──────> │  2-layer     │
│              │ hidden  │  transformer │
│ Full model   │ states  │  + fc fusion │
│ (frozen)     │         │  + lm_head   │
└──────────────┘         └──────────────┘
                               │
                         predict next token
                         + hidden states
```

Key components:

* **Target model**: The full LLM (e.g., Llama 3.1 8B), completely frozen. Provides
  hidden states from selected intermediate layers as auxiliary inputs to the drafter.
* **Draft model**: A shallow transformer (typically 2 layers) with:
  * A **fusion layer** (`fc`) that combines auxiliary hidden states from 3 target layers
  * Its own attention layers, MLP, and layer norm
  * A smaller **output vocabulary** (e.g., 8192 or 32000 tokens instead of 128k) to reduce compute
* **TTT unroll**: The drafter runs multiple autoregressive steps (default 4) during
  training, with exponentially decaying loss weights (`0.8^i`). This teaches the
  drafter to make multi-step predictions — exactly what speculative decoding needs.
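
As a rough illustration of the decaying weights, the sketch below computes `0.8^i` for each unroll step and combines per-step losses. It assumes the index `i` starts at 0 and that the weighted losses are simply summed; the function names are hypothetical and the recipe's actual reduction may differ.

```python
from typing import List, Sequence


def ttt_loss_weights(ttt_steps: int = 4, decay: float = 0.8) -> List[float]:
    """Weight for unroll step i is decay**i (assumption: i starts at 0)."""
    return [decay ** i for i in range(ttt_steps)]


def weighted_ttt_loss(step_losses: Sequence[float], decay: float = 0.8) -> float:
    """Combine per-step losses with exponentially decaying weights."""
    weights = ttt_loss_weights(len(step_losses), decay)
    return sum(w * loss for w, loss in zip(weights, step_losses))


for step, weight in enumerate(ttt_loss_weights()):
    print(f"step {step}: weight {weight:.3f}")
```

Later unroll steps contribute less to the total, so errors on the first drafted token dominate the gradient while deeper predictions still receive a training signal.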

### EAGLE-3.1 Drafter Toggles

The same `train_eagle3` recipe supports the EAGLE-3.1 drafter variant via two
optional flags in `recipe_args`. Both default to `false`, so existing EAGLE-3
configs and checkpoints behave identically. Setting them applies the EAGLE-3.1
architectural changes from
[vllm-project/vllm#42764](https://github.com/vllm-project/vllm/pull/42764) to
the Llama-style draft. The MLA-backbone community release
[`lightseekorg/kimi-k2.6-eagle3.1-mla`](https://huggingface.co/lightseekorg/kimi-k2.6-eagle3.1-mla)
is a separate architecture (`Eagle3DeepseekV2ForCausalLM`) and is **not**
produced by this recipe.

| Flag          | Effect                                                                                                                                                                                                                                                                                                                                 |
| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `fc_norm`     | Apply an independent RMSNorm to each of the `num_aux_hidden_states` (3 by default) auxiliary hidden-state chunks before they are concatenated and projected by `model.fc`. Stored as an `nn.ModuleList` with on-disk keys `model.fc_norm.0.weight`, `model.fc_norm.1.weight`, ... matching vLLM's layout so checkpoints load directly. |
| `norm_output` | Route the existing final RMSNorm (`model.norm`) over the per-step hidden state returned by the drafter so the next TTT step (and `lm_head`) consume the post-norm state instead of the raw decoder output. Adds no parameters.                                                                                                         |

Together they remove the "attention drift" pattern (loss of focus on sink
tokens at deeper speculation depths) reported by the EAGLE-3.1 paper and let
the drafter behave more like a recurrently applied module than a stack of
extra layers bolted onto the target.

```yaml
recipe_args:
  # ... standard EAGLE-3 fields ...
  fc_norm: true
  norm_output: true
```

### How EAGLE-1 Differs

EAGLE-1 is simpler: it uses a single transformer layer, predicts the full
vocabulary, and trains with a combined loss of MSE on hidden states
(`hidden_loss_weight`) and cross-entropy on tokens (`token_loss_weight`).
No TTT unroll, no vocabulary mapping.
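
The combined objective can be sketched in plain Python for a single position. This is a minimal illustration, assuming the two terms are a mean-squared error and a softmax cross-entropy added with the two weights; the helper names are hypothetical, and the default weights mirror the EAGLE-1 config in Step 6.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target hidden states."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one position, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[label]


def eagle1_loss(pred_hidden, target_hidden, logits, label,
                hidden_loss_weight: float = 1.0, token_loss_weight: float = 0.1) -> float:
    """Weighted sum of the hidden-state loss and the token loss."""
    return (hidden_loss_weight * mse(pred_hidden, target_hidden)
            + token_loss_weight * cross_entropy(logits, label))


print(eagle1_loss([0.9, 2.1], [1.0, 2.0], logits=[2.0, 0.5, -1.0], label=0))
```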

***

## Step 2 — Prepare the Training Dataset

### Data format

EAGLE training expects chat data in the **OpenAI messages format** — either
JSONL files or HuggingFace datasets with a `messages` column:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is the capital of France?"},
  {"role": "assistant", "content": "The capital of France is Paris."}
]}
```
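
Before training on your own JSONL file, it can help to sanity-check each line against this shape. The checker below is a hypothetical sketch: the allowed role set and the requirement of at least one assistant turn are assumptions for illustration, not the validation that `ChatDataset` itself performs.

```python
import json
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set


def validate_record(line: str) -> List[Dict[str, str]]:
    """Parse one JSONL line and return its messages, or raise ValueError."""
    record = json.loads(line)
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for index, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise ValueError(f"message {index} is not an object")
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"message {index} has no string 'content'")
    if not any(msg["role"] == "assistant" for msg in messages):
        raise ValueError("no assistant turn to train on")
    return messages


sample = json.dumps({"messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]})
print(len(validate_record(sample)), "messages OK")
```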

### Option A: Use a pre-regenerated dataset (recommended)

For best results, the assistant turns in your training data should come from the
**same model** you'll use as the target at inference time. The PerfectBlend
dataset already has answers regenerated by Llama 3.1 8B Instruct:

```bash
python -c "
from datasets import load_dataset
ds = load_dataset(
    'frankleeeee/PerfectBlend-Regenerated-Llama-3.1-8B-Instruct',
    split='train[:5]'
)
print(f'Columns: {ds.column_names}')
print('Sample conversation:')
for msg in ds[0]['conversations'][:3]:
    role = msg['role']
    text = msg['content'][:80]
    print(f'  [{role}] {text}...')
"
```

Expected output:

```
Columns: ['conversations']
Sample conversation:
  [system] You are a helpful assistant....
  [user] What are the main differences between Python 2 and Python 3?...
  [assistant] Here are the key differences between Python 2 and Python 3:

1. **P...
```

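PerfectBlend uses a `conversations` column, but `ChatDataset` expects `messages`.
Rename the column before training. In the script below, replace `<download_dir>`
with the local directory that holds the dataset's `train-*.parquet` shards: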

```bash
python -c "
import pandas as pd
from pathlib import Path
src = Path('<download_dir>')
dst = Path('./data/perfectblend_renamed')
dst.mkdir(parents=True, exist_ok=True)
for f in sorted(src.glob('train-*.parquet')):
    df = pd.read_parquet(f)
    df = df.rename(columns={'conversations': 'messages'})
    df.to_parquet(dst / f.name, index=False)
print('Done. Point train_data_path at:', dst)
"
```

### Option B: Regenerate answers from your target model

If you have a chat dataset whose answers were generated by a different model, you
can regenerate them using your target. This aligns the training data distribution
with the model the drafter will actually assist at inference time.

**Step B.1 — Start the target server** (in one shell):

```bash
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000
```

Wait for `Uvicorn running on http://0.0.0.0:30000` before proceeding.

**Step B.2 — Regenerate** (in another shell):

```bash
python -m nemo_automodel.components.speculative.regenerate \
    --input-data Aeala/ShareGPT_Vicuna_unfiltered \
    --output-dir ./regenerated/sharegpt_llama31_8b \
    --target-server http://localhost:30000/v1 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --concurrency 64 \
    --shard-size 1000
```

For each sample, the script:

1. Loads the conversation from the input dataset
2. Drops the trailing assistant turn, keeping the user prompt context
3. Calls the target server to generate a new assistant response
4. Saves the rebuilt conversation to parquet shards

The output directory contains parquet files with a `messages` column — ready
for EAGLE training. The script is **resumable**: rerun with `--resume` to skip
completed shards.
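
The per-sample steps listed above can be sketched as follows. This is an illustration rather than the script's actual code: `generate` is a hypothetical callable standing in for the request to the target server, and skipping samples that do not end on a user turn is an assumption.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def strip_trailing_assistant(messages: List[Message]) -> List[Message]:
    """Drop assistant turns at the end so the context ends on the prompt."""
    context = list(messages)
    while context and context[-1]["role"] == "assistant":
        context.pop()
    return context


def rebuild_conversation(messages: List[Message],
                         generate: Callable[[List[Message]], str]) -> Optional[List[Message]]:
    """Replace the final assistant answer with one produced by `generate`."""
    context = strip_trailing_assistant(messages)
    if not context or context[-1]["role"] != "user":
        return None  # nothing to answer; skip this sample
    reply = generate(context)
    return context + [{"role": "assistant", "content": reply}]


# Stand-in for the target server call (hypothetical).
def fake_generate(context: List[Message]) -> str:
    return f"(regenerated answer to: {context[-1]['content']})"


conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "An answer written by some other model."},
]
print(rebuild_conversation(conversation, fake_generate))
```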

| Flag               | Default | Notes                                            |
| ------------------ | ------- | ------------------------------------------------ |
| `--concurrency`    | 32      | In-flight requests; raise to saturate the server |
| `--shard-size`     | 1000    | Rows per parquet file                            |
| `--temperature`    | 0.0     | Greedy by default (recommended for EAGLE)        |
| `--max-new-tokens` | 1024    | Cap per-answer length                            |
| `--split`          | `train` | Supports HF slice syntax, e.g., `train[:10000]`  |

***

## Step 3 — Configure and Launch EAGLE-3 Training

### YAML config

Save the following as `eagle3_llama8b.yaml`:

```yaml
recipe: TrainEagle3Recipe

dist_env:
  backend: nccl
  timeout_minutes: 60

recipe_args:
  target_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct

  # Point to your training data (HF dataset id, local parquet dir, or JSONL)
  train_data_path: ./data/perfectblend_renamed
  val_data_path: null

  # Slice to 200k samples for a ~2h training run
  train_split: "train[:200000]"
  val_split: null

  output_dir: ./outputs/eagle3_llama8b
  seq_length: 2048
  micro_batch_size: 1
  grad_accumulation_steps: 4   # effective batch = 8 GPUs * 1 * 4 = 32
  num_workers: 4
  num_epochs: 1

  # EAGLE-3 specific
  ttt_steps: 4                 # TTT unroll depth (higher = better but slower)
  draft_vocab_size: 32000      # smaller vocab = faster drafter

  # The drafter copies the target's embedding table at init; this flag
  # freezes those copied weights so only the draft transformer layers
  # and lm_head are trained.
  freeze_embeddings: true
  shuffle_seed: 42
  log_every_steps: 20
  max_grad_norm: 1.0

optimizer:
  lr: 2.0e-4
  betas: [0.9, 0.95]
  weight_decay: 0.0
  warmup_ratio: 0.05           # 5% warmup
  min_lr_ratio: 0.1

checkpoint:
  enabled: true
  checkpoint_dir: ./outputs/eagle3_llama8b/checkpoints
  # The recipe defaults to safetensors + consolidated; these lines are
  # shown explicitly for clarity but can be omitted.
  model_save_format: safetensors
  save_consolidated: true
```
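
The batch-size comment in the config implies a fixed number of optimizer steps for the run. The sketch below works through that arithmetic, assuming one conversation per sample, a dropped final partial batch, and warmup steps rounded down; the function name is hypothetical and the recipe may round differently.

```python
from typing import Tuple


def training_schedule(num_samples: int, num_gpus: int, micro_batch_size: int,
                      grad_accumulation_steps: int, warmup_ratio: float) -> Tuple[int, int, int]:
    """Return (effective batch size, optimizer steps per epoch, warmup steps)."""
    effective_batch = num_gpus * micro_batch_size * grad_accumulation_steps
    steps_per_epoch = num_samples // effective_batch
    warmup_steps = int(steps_per_epoch * warmup_ratio)
    return effective_batch, steps_per_epoch, warmup_steps


# Values from the config above: 200k samples, 8 GPUs, micro batch 1, accumulation 4.
print(training_schedule(200_000, 8, 1, 4, 0.05))  # (32, 6250, 312)
```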

### Config field reference

| Field                        | What It Does                                                                                                                                                                                          |
| ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `target_model_name_or_path`  | HuggingFace model ID for the frozen target LLM                                                                                                                                                        |
| `train_data_path`            | Path to chat data (HF dataset id, parquet dir, or JSONL)                                                                                                                                              |
| `train_split`                | Optional HF slice syntax to limit data size                                                                                                                                                           |
| `seq_length`                 | Context window length (1024 for quick tests, 2048 for production)                                                                                                                                     |
| `micro_batch_size`           | Per-GPU batch size                                                                                                                                                                                    |
| `grad_accumulation_steps`    | Gradient accumulation for larger effective batches                                                                                                                                                    |
| `ttt_steps`                  | TTT unroll depth; 4 is the default, cost is linear per step                                                                                                                                           |
| `draft_vocab_size`           | Draft output vocabulary size; smaller = faster inference                                                                                                                                              |
| `freeze_embeddings`          | Freeze the embedding table copied from the target so only draft layers train (recommended `true`)                                                                                                     |
| `target_attn_implementation` | Optional attention backend for the frozen target (e.g. `sdpa`); defaults to HF auto-select. Set `sdpa` if the target's FlashAttention path is broken on your build (e.g. the Qwen3 FA2 `s_aux` crash) |
| `fc_norm`                    | EAGLE-3.1: per-chunk independent RMSNorm (`ModuleList`) on auxiliary hidden states before the `fc` projection (default `false`)                                                                       |
| `norm_output`                | EAGLE-3.1: feed the post-`model.norm` hidden state into the next TTT step and `lm_head` (default `false`)                                                                                             |
| `warmup_ratio`               | Fraction of total steps for LR warmup                                                                                                                                                                 |

### Launch training

**Multi-GPU (8x A100, production):**

```bash
torchrun --nproc-per-node=8 \
    -m nemo_automodel.recipes.llm.train_eagle3 \
    eagle3_llama8b.yaml
```

**Single-GPU (quick test with Llama 3.2 1B):**

For a quick test, use the MVP config with Llama 3.2 1B and a small dataset:

```bash
python -m nemo_automodel.recipes.llm.train_eagle3 \
    examples/speculative/eagle3/llama_eagle3_mvp.yaml
```

For GPUs with FlashAttention support, add `draft_attn_implementation: flash_attention_2`
to `recipe_args` for faster training. See
[llama\_eagle3\_mvp\_flash\_attn.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle3/llama_eagle3_mvp_flash_attn.yaml)
for a complete example.

***

## Step 4 — Monitor Training and Inspect Checkpoints

### What to watch

Training loss should drop steadily. Here is a sample log from the PerfectBlend
200k run on 8x A100:

```
[epoch 0] step   20 | loss 3.4521 | grad_norm  12.84 | lr 8.00e-05 | tokens/s 2841
[epoch 0] step   40 | loss 2.8973 | grad_norm   8.31 | lr 1.60e-04 | tokens/s 3102
[epoch 0] step  100 | loss 2.1245 | grad_norm   5.62 | lr 2.00e-04 | tokens/s 3254
[epoch 0] step  500 | loss 1.5832 | grad_norm   3.17 | lr 1.98e-04 | tokens/s 3198
[epoch 0] step 1000 | loss 1.3401 | grad_norm   2.45 | lr 1.92e-04 | tokens/s 3221
[epoch 0] step 3000 | loss 1.1056 | grad_norm   1.89 | lr 1.58e-04 | tokens/s 3245
[epoch 0] step 6000 | loss 0.9823 | grad_norm   1.52 | lr 4.20e-05 | tokens/s 3230
```
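
To check the trend programmatically, the log lines can be parsed with a regular expression. The sketch below assumes the exact line layout shown in the sample above; the helper names are hypothetical.

```python
import re
from typing import Dict, List

LOG_RE = re.compile(
    r"\[epoch (?P<epoch>\d+)\] step\s+(?P<step>\d+) \| loss (?P<loss>\S+) \| "
    r"grad_norm\s+(?P<grad_norm>\S+) \| lr (?P<lr>\S+) \| tokens/s (?P<tps>\d+)"
)


def parse_log(text: str) -> List[Dict[str, float]]:
    """Extract step, loss, grad_norm and lr from every matching log line."""
    rows = []
    for match in LOG_RE.finditer(text):
        rows.append({
            "step": int(match["step"]),
            "loss": float(match["loss"]),
            "grad_norm": float(match["grad_norm"]),
            "lr": float(match["lr"]),
        })
    return rows


def loss_is_decreasing(rows: List[Dict[str, float]]) -> bool:
    """True if each logged loss is lower than the previous one (NaN fails the check)."""
    return all(later["loss"] < earlier["loss"] for earlier, later in zip(rows, rows[1:]))


log_excerpt = """\
[epoch 0] step   20 | loss 3.4521 | grad_norm  12.84 | lr 8.00e-05 | tokens/s 2841
[epoch 0] step 1000 | loss 1.3401 | grad_norm   2.45 | lr 1.92e-04 | tokens/s 3221
"""
rows = parse_log(log_excerpt)
print([row["step"] for row in rows], loss_is_decreasing(rows))
```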

### Checkpoint layout

Each checkpoint is saved under `<checkpoint_dir>/epoch_<E>_step_<S>/`:

```
outputs/eagle3_llama8b/checkpoints/
  epoch_0_step_1000/
    config.json              # draft model config
    draft_model.pt           # draft model weights
    eagle3_meta.pt           # token mapping (selected_token_ids + mask)
    optimizer.pt             # Adam state (for resume)
    scheduler.pt             # LR scheduler state
    rng/                     # distributed RNG state
  epoch_0_step_2000/
    ...
  LATEST -> epoch_0_step_6250
```
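
The `eagle3_meta.pt` entry above stores the token mapping for the reduced draft vocabulary. One plausible way to build such a mapping, assumed here rather than taken from the recipe, is to keep the most frequent target token ids seen in the training corpus. The function name is hypothetical; `selected_token_ids` and the mask follow the names in the layout comment.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_draft_vocab(token_ids: Iterable[int], target_vocab_size: int,
                      draft_vocab_size: int) -> Tuple[List[int], List[bool]]:
    """Pick the most frequent target token ids and mark them in a boolean mask."""
    counts = Counter(token_ids)
    # most_common returns highest counts first; sort the kept ids for a stable layout.
    selected_token_ids = sorted(tok for tok, _ in counts.most_common(draft_vocab_size))
    mask = [False] * target_vocab_size
    for tok in selected_token_ids:
        mask[tok] = True
    return selected_token_ids, mask


# Tiny corpus over a 10-token target vocabulary, reduced to 2 draft tokens.
selected, mask = build_draft_vocab([5, 5, 5, 2, 2, 9, 1], target_vocab_size=10, draft_vocab_size=2)
print(selected)  # [2, 5]
```

With this layout, draft index `j` maps back to target token `selected_token_ids[j]`. If the corpus contains fewer distinct tokens than `draft_vocab_size`, this sketch simply returns a shorter list.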

### Resume from checkpoint

If training is interrupted, resume from the latest checkpoint:

```yaml
checkpoint:
  restore_from: LATEST
```

Or point to a specific checkpoint:

```yaml
checkpoint:
  restore_from: epoch_0_step_3000
```

***

## Step 5 — Serve with SGLang

The `serve_sglang` helper converts the training checkpoint into an
HF/SGLang-compatible format and launches the server in one command.

### Launch the server

```bash
python -m nemo_automodel.components.speculative.serve_sglang \
    --target meta-llama/Llama-3.1-8B-Instruct \
    --draft ./outputs/eagle3_llama8b/checkpoints/LATEST \
    --algorithm EAGLE3 \
    --num-steps 3 \
    --topk 1 \
    --num-draft-tokens 4 \
    --port 30000
```

On first launch, the helper:

1. Loads `draft_model.pt` and `eagle3_meta.pt` from the checkpoint
2. Rewrites the architecture name for SGLang compatibility (`LlamaEagle3DraftModel` -> `LlamaForCausalLMEagle3`)
3. Exports `model.safetensors` and `speculative_token_map.pt` into a `model/` subdirectory
4. Launches SGLang with the correct speculative decoding flags

### Serving parameters

| Flag                 | Default | Notes                                                |
| -------------------- | ------- | ---------------------------------------------------- |
| `--algorithm`        | —       | `EAGLE3` for EAGLE-3 drafters, `EAGLE` for EAGLE-1/2 |
| `--num-steps`        | 3       | Speculative steps per draft iteration                |
| `--topk`             | 1       | Branching factor for tree search                     |
| `--num-draft-tokens` | 4       | Maximum draft tokens verified per iteration          |
| `--dtype`            | auto    | Must match training dtype (e.g., `bfloat16`)         |
| `--tp-size`          | 1       | Tensor parallelism (shards the target model only)    |
| `--print-only`       | —       | Inspect the resolved command without launching       |

Pass extra SGLang flags after `--`:

```bash
python -m nemo_automodel.components.speculative.serve_sglang \
    --target meta-llama/Llama-3.1-8B-Instruct \
    --draft ./outputs/eagle3_llama8b/checkpoints/LATEST \
    --algorithm EAGLE3 \
    -- --enable-torch-compile --schedule-conservativeness 1.2
```

### Smoke-test the server

Once you see `Uvicorn running on http://0.0.0.0:30000`, test it:

```bash
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{
      "text": "Hello, my name is",
      "sampling_params": {"max_new_tokens": 64}
    }'
```

Expected output:

```json
{
  "text": "Hello, my name is Sarah and I am a 25-year-old software engineer...",
  "meta_info": {
    "prompt_tokens": 6,
    "completion_tokens": 64,
    "accept_length_per_step": 3.2
  }
}
```

The `accept_length_per_step` metric shows how many tokens the target model
accepts per speculative step on average. Higher is better — a value of 3.0+
indicates the drafter is accurately predicting the target's behavior.
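
The same smoke test can be scripted so the acceptance metric is easy to track across checkpoints. This sketch uses only the standard library and assumes the request and response shapes shown in the curl example above; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def generate(prompt: str, max_new_tokens: int = 64,
             url: str = "http://localhost:30000/generate") -> dict:
    """POST a prompt to the /generate endpoint and return the parsed JSON response."""
    payload = {"text": prompt, "sampling_params": {"max_new_tokens": max_new_tokens}}
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def accept_length(response: dict) -> Optional[float]:
    """Read the acceptance metric from meta_info; None if the field is absent."""
    return response.get("meta_info", {}).get("accept_length_per_step")


# Usage, once the server is running:
#   result = generate("Hello, my name is")
#   print(accept_length(result))
```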

### OpenAI-compatible endpoint

SGLang also exposes an OpenAI-compatible API:

```bash
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [
        {"role": "user", "content": "Explain speculative decoding in one paragraph."}
      ],
      "max_tokens": 256
    }'
```

Expected output:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Speculative decoding is a technique for accelerating autoregressive language model inference. It works by using a small, fast \"draft\" model to predict multiple future tokens, which are then verified in parallel by the larger \"target\" model in a single forward pass. Tokens that match the target model's predictions are accepted, while incorrect tokens are rejected and regenerated. Because verification is cheaper than sequential generation (it processes all candidate tokens simultaneously), the overall throughput increases significantly — typically 2-3x — while producing output that is mathematically identical to running the target model alone."
    }
  }],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 112
  }
}
```

***

## Step 6 — (Bonus) Train an EAGLE-1 Drafter

EAGLE-1 is simpler and faster to train, making it a good starting point for
experimentation. It uses a single transformer layer and trains with a combined
hidden-state MSE + token cross-entropy loss.

### YAML config

Save as `eagle1_llama8b.yaml`:

```yaml
recipe: TrainEagle1Recipe

dist_env:
  backend: nccl
  timeout_minutes: 30

recipe_args:
  target_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  train_data_path: ./data/perfectblend_renamed
  val_data_path: null
  train_split: "train[:200000]"
  val_split: null
  output_dir: ./outputs/eagle1_llama8b
  seq_length: 2048
  micro_batch_size: 1
  grad_accumulation_steps: 4
  num_workers: 4
  num_epochs: 1

  # EAGLE-1 specific
  draft_num_hidden_layers: 1    # number of draft transformer layers
  hidden_loss_weight: 1.0       # MSE loss on hidden states
  token_loss_weight: 0.1        # cross-entropy loss on tokens

  freeze_embeddings: true
  shuffle_seed: 42
  log_every_steps: 10
  max_grad_norm: 1.0

optimizer:
  lr: 1.0e-4
  betas: [0.9, 0.95]
  weight_decay: 0.0

checkpoint:
  enabled: true
  checkpoint_dir: ./outputs/eagle1_llama8b/checkpoints
  # Defaults to safetensors + consolidated; can be omitted.
  model_save_format: safetensors
  save_consolidated: true
```

### Launch

```bash
torchrun --nproc-per-node=8 \
    -m nemo_automodel.recipes.llm.train_eagle1 \
    eagle1_llama8b.yaml
```

### Serve

Use `--algorithm EAGLE` (not `EAGLE3`) for EAGLE-1/2 drafters:

```bash
python -m nemo_automodel.components.speculative.serve_sglang \
    --target meta-llama/Llama-3.1-8B-Instruct \
    --draft ./outputs/eagle1_llama8b/checkpoints/LATEST \
    --algorithm EAGLE \
    --num-steps 3 --topk 1 --num-draft-tokens 4 \
    --port 30000
```

### EAGLE-1 vs EAGLE-3

|                        | EAGLE-1               | EAGLE-3                    |
| ---------------------- | --------------------- | -------------------------- |
| **Draft layers**       | 1 (configurable)      | 2 (with aux fusion)        |
| **Training objective** | Hidden MSE + token CE | TTT unroll with decay      |
| **Vocabulary**         | Full target vocab     | Reduced (e.g., 8k-32k)     |
| **Training speed**     | Faster                | Slower (due to TTT unroll) |
| **Inference speedup**  | Good (2-2.5x)         | Better (2.5-3x)            |
| **Best for**           | Quick experiments     | Production deployment      |

***

## Example Configs Reference

| Config                                                                                                                                                   | Target       | Variant   | Notes                                                    |
| -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | --------- | -------------------------------------------------------- |
| [llama\_eagle3\_mvp.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle3/llama_eagle3_mvp.yaml)                          | Llama 3.2 1B | EAGLE-3   | Quick test, single GPU                                   |
| [llama\_eagle3\_mvp\_flash\_attn.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle3/llama_eagle3_mvp_flash_attn.yaml)  | Llama 3.2 1B | EAGLE-3   | With FlashAttention-2                                    |
| [llama\_eagle3\_perfectblend.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle3/llama_eagle3_perfectblend.yaml)        | Llama 3.1 8B | EAGLE-3   | Production config, 200k samples                          |
| [llama\_eagle3\_1\_perfectblend.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle3_1/llama_eagle3_1_perfectblend.yaml) | Llama 3.1 8B | EAGLE-3.1 | Production config with `fc_norm` + `norm_output` enabled |
| [llama\_eagle1\_mvp.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle1/llama_eagle1_mvp.yaml)                          | Llama 3.2 1B | EAGLE-1   | Quick test, single GPU                                   |
| [llama\_eagle2\_mvp.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/speculative/eagle2/llama_eagle2_mvp.yaml)                          | Llama 3.2 1B | EAGLE-2   | Same as EAGLE-1 (alias)                                  |

## Troubleshooting

| Symptom                                    | Fix                                                                                                 |
| ------------------------------------------ | --------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` during training         | Reduce `seq_length` (1024 instead of 2048) or `micro_batch_size`                                    |
| Loss stays flat or NaN                     | Check `max_grad_norm` (default 1.0), reduce `lr`                                                    |
| SGLang `model not found` error             | Ensure `--algorithm` matches the recipe (`EAGLE3` for train\_eagle3, `EAGLE` for train\_eagle1/2)   |
| `dtype mismatch` at serving                | Pass `--dtype bfloat16` to match training precision                                                 |
| `conversations` vs `messages` column error | Rename the column in your dataset (see Option A in Step 2)                                         |
| Checkpoint resume fails                    | Use `restore_from: LATEST` or the exact subdirectory name like `epoch_0_step_1000`                  |
| Low `accept_length_per_step` at serving    | Train longer, use more data, or try regenerating answers with the target model (Option B in Step 2) |