Serving an Automodel-trained EAGLE drafter with SGLang#
The EAGLE / EAGLE-3 drafters trained by the recipes under
nemo_automodel/recipes/llm/train_eagle{1,2,3}.py are saved in a recipe-native
layout (draft_model.pt plus metadata). The helper script below converts that
checkpoint into an HF/SGLang-readable model/ directory on first launch, so
you can point it directly at epoch_<E>_step_<S>.
Install SGLang#
SGLang is not bundled with the NeMo-AutoModel container image and is not
declared as a dependency in pyproject.toml. Install it yourself in the same
Python environment where you ran training:
uv pip install "sglang>=0.5.9"
See https://github.com/sgl-project/sglang for the version matching your CUDA and PyTorch stack.
Locate the checkpoint#
A successful training run leaves the recipe checkpoint at:
<output_dir>/checkpoints/epoch_<E>_step_<S>/
config.json
draft_model.pt
eagle3_meta.pt # EAGLE-3 only, contains the small-vocab token map
eagle1_meta.pt # EAGLE-1 / EAGLE-2 only, holds bookkeeping (no token map)
On first launch the helper exports:
<output_dir>/checkpoints/epoch_<E>_step_<S>/model/
config.json # architectures rewritten to LlamaForCausalLMEagle3 for EAGLE3
model.safetensors
speculative_token_map.pt # EAGLE-3 only, materialized from selected_token_ids
The helper accepts either the outer epoch_<E>_step_<S> directory or the
inner model/ directory. If you point at the inner model/ and the
speculative_token_map.pt is missing, the helper will look one directory up
for eagle3_meta.pt and regenerate the token map from it.
Launch the server#
Use the helper module to start an SGLang HTTP server with the trained drafter:
python -m nemo_automodel.components.speculative.serve_sglang \
--target meta-llama/Llama-3.1-8B-Instruct \
--draft ./checkpoints/epoch_0_step_1000 \
--algorithm EAGLE3 \
--num-steps 3 --topk 1 --num-draft-tokens 4 \
--port 30000
Add --print-only to inspect the resolved sglang.launch_server command
without executing it. In that mode the helper skips the checkpoint export
step entirely — the printed --speculative-draft-model-path reflects the
path that would be produced on a real launch. Pass any extra SGLang flags
after a -- separator, e.g.:
python -m nemo_automodel.components.speculative.serve_sglang \
--target meta-llama/Llama-3.1-8B-Instruct \
--draft ./checkpoints/epoch_0_step_1000 \
-- --enable-torch-compile --schedule-conservativeness 1.2
Smoke-test the server#
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 64}}'
The acceptance length (and the resulting tokens/sec speedup) depends on
--num-steps, --topk, and --num-draft-tokens. Sweep them via
benchmarks/ in SpecForge or the SGLang bench suite to find the best
configuration for your workload.
Notes & caveats#
dtype must match training. Pass
--dtype bfloat16if you trained the drafter in bf16. Mixed precision between target and drafter degrades acceptance.EAGLE vs EAGLE3. Use
--algorithm EAGLE3for drafters trained withtrain_eagle3.py; useEAGLEfor drafters trained withtrain_eagle1.pyortrain_eagle2.py.Tensor parallelism.
--tp-sizecontrols SGLang’s TP. The drafter itself is a single-layer transformer so TP only meaningfully shards the target model.Custom target architectures. Pass
--trust-remote-codeif the target model relies onauto_mapentries in its config.