nemo_automodel.components.speculative.serve_sglang
Serve an Automodel-trained EAGLE / EAGLE-3 drafter with SGLang.
The EAGLE drafter checkpoints produced by the EAGLE recipes
(recipes/llm/train_eagle{1,2,3}.py) are written by the consolidated
checkpointer as an HF-style model/ directory (model.safetensors +
config.json) inside each epoch_<E>_step_<S>/ checkpoint, alongside the
EAGLE-3 vocab metadata (eagle_meta.pt). This script resolves that directory
(regenerating SGLang’s speculative token map from the metadata when needed),
still accepts an older draft_model.pt layout as a fallback for
hand-exported checkpoints, and then shells out to
python -m sglang.launch_server with the right speculative-decoding flags.
NOTE: P-EAGLE (parallel-drafting) heads are NOT servable here. A draft trained
with parallel_drafting: true only loads into vLLM’s parallel-drafting
runtime (vLLM >= 0.16); SGLang support is tracked upstream in
https://github.com/sgl-project/sglang/issues/23171. This script rejects such a
checkpoint with an actionable error rather than mis-serving it as a plain
EAGLE-3 head.
NOTE: SGLang is NOT bundled with the NeMo-AutoModel container image and
is intentionally NOT declared in pyproject.toml. To use this entry
point, install it yourself into the same environment:
uv pip install "sglang>=0.5.9"
Refer to https://github.com/sgl-project/sglang for the version matching your CUDA / PyTorch stack. If SGLang is missing, this script exits with a clear install hint rather than crashing on import.
Typical usage (after training produces a checkpoint at
./checkpoints/epoch_0_step_1000):
python -m nemo_automodel.components.speculative.serve_sglang \
    --target meta-llama/Llama-3.1-8B-Instruct \
    --draft ./checkpoints/epoch_0_step_1000 \
    --algorithm EAGLE3 \
    --num-steps 3 --topk 1 --num-draft-tokens 4
Pass --print-only to inspect the command without launching it; in that
mode no checkpoint export is performed and the printed paths reflect what
would be produced on a real launch.
Module Contents
Functions
Data
_SGLANG_ARCHITECTURE_FOR_ALGORITHM
API
Verify the sglang package can actually be imported, else exit (code 2).
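A check of this kind can be sketched as follows. The helper name `require_package` and the hint text are hypothetical and chosen only for illustration; the actual function in this module may be structured differently.

```python
import importlib
import sys


def require_package(name: str, install_hint: str) -> None:
    """Exit with status 2 if `name` cannot be imported (hypothetical helper)."""
    try:
        # Import for real rather than only checking that the package is
        # installed: a package can be present on disk and still fail to import.
        importlib.import_module(name)
    except ImportError as exc:
        print(f"error: cannot import {name!r}: {exc}", file=sys.stderr)
        print(f"hint: {install_hint}", file=sys.stderr)
        sys.exit(2)
```

Calling `require_package("sglang", 'uv pip install "sglang>=0.5.9"')` before building the launch command would surface the install hint early instead of a traceback.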
Return True when config_path does not match the SGLang architecture for algorithm.
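As a rough illustration, such a comparison might look like the sketch below. The mapping contents and the function name `config_needs_rewrite` are assumptions: only the class name LlamaForCausalLMEagle3 appears on this page, and returning False for an unmapped algorithm is a choice made for the sketch, not documented behavior.

```python
import json
from pathlib import Path

# Assumed mapping for illustration; the real table lives in
# _SGLANG_ARCHITECTURE_FOR_ALGORITHM and may contain other entries.
SGLANG_ARCH = {"EAGLE3": "LlamaForCausalLMEagle3"}


def config_needs_rewrite(config_path: Path, algorithm: str) -> bool:
    """Return True when the config's architectures differ from the expected one."""
    expected = SGLANG_ARCH.get(algorithm)
    if expected is None:
        # Assumption: algorithms without a canonical class never need a rewrite.
        return False
    config = json.loads(Path(config_path).read_text())
    return config.get("architectures") != [expected]
```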
Return the EAGLE-3 vocab-metadata file in checkpoint_dir, if present.
The recipes write the metadata as eagle_meta.pt; older hand-exported
checkpoints used eagle3_meta.pt. Prefer the current name and fall back to
the legacy one, mirroring the recipe’s own loader (_load_extra_state in
recipes/llm/train_eagle3.py).
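The prefer-current, fall-back-to-legacy lookup described above can be sketched like this. The function name `find_meta_file` is hypothetical; the two file names are the ones given in the text.

```python
from pathlib import Path
from typing import Optional

# Current name first, then the legacy name used by hand-exported checkpoints.
META_NAMES = ("eagle_meta.pt", "eagle3_meta.pt")


def find_meta_file(checkpoint_dir: Path) -> Optional[Path]:
    """Return the first metadata file found in checkpoint_dir, or None."""
    for name in META_NAMES:
        candidate = Path(checkpoint_dir) / name
        if candidate.is_file():
            return candidate
    return None
```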
Return True if path already contains a HF-style weight artifact.
Infer num_hidden_layers from a state dict by counting unique layer indices.
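One way to count unique layer indices is to match a `layers.<N>.` segment in each parameter name. The key pattern below is an assumption based on common Hugging Face naming; the drafter's actual keys may differ, and `infer_num_hidden_layers` is an illustrative name.

```python
import re
from typing import Iterable

# Assumed key shape: "...layers.<index>.<rest>", e.g. "model.layers.0.mlp.up_proj.weight".
_LAYER_INDEX = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def infer_num_hidden_layers(keys: Iterable[str]) -> int:
    """Count the distinct layer indices that appear in the given state-dict keys."""
    indices = set()
    for key in keys:
        match = _LAYER_INDEX.search(key)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)
```

Passing `state_dict.keys()` is enough; tensor values are never touched, so the count is cheap even for large checkpoints.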
Return safetensors.torch.save_file or exit with an install hint.
Export a legacy bare draft_model.pt checkpoint into an HF/SGLang model/ directory.
The standard recipe output is already a consolidated model/ directory and
is resolved directly by resolve_draft_artifacts; this is the fallback for
the older layout where the draft weights were hand-saved as a bare
draft_model.pt. When checkpoint_dir has no draft_model.pt +
config.json it is returned unchanged (the model/ path handles it).
Parameters:
A dir holding a legacy draft_model.pt + config.json
(and eagle_meta.pt for EAGLE-3, or the legacy eagle3_meta.pt);
otherwise this is a no-op.
Speculative algorithm name, used to pick the right SGLang architecture and to decide whether a token map is needed.
When True, return the paths that would be produced without writing anything.
Returns: tuple[Path, Path | None]
(export_dir, token_map_path_or_None).
Parse command-line arguments for the serve helper.
Reject P-EAGLE (parallel-drafting) drafts: SGLang cannot serve them yet.
A P-EAGLE head carries parallel_drafting: true (and a mask_hidden
tensor) and only loads into vLLM’s parallel-drafting runtime
(https://github.com/vllm-project/speculators/pull/480). Serving it through
SGLang’s EAGLE-3 path would silently produce wrong drafts because SGLang
ignores mask_hidden / mask_token_id / the COD config. SGLang support
is tracked upstream in https://github.com/sgl-project/sglang/issues/23171;
until it lands, fail loudly with an actionable message instead.
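A guard along these lines could be sketched as below. It assumes the parallel_drafting flag is stored as a top-level key in the draft's config.json, which the text does not state explicitly; the exception type and function name are hypothetical.

```python
import json
from pathlib import Path


class UnsupportedDraftError(RuntimeError):
    """Hypothetical error type for drafts that SGLang cannot serve."""


def reject_parallel_drafting(config_path: Path) -> None:
    """Raise if the draft config marks the head as a P-EAGLE (parallel-drafting) head."""
    config = json.loads(Path(config_path).read_text())
    # Assumption: the flag is a top-level boolean in config.json.
    if config.get("parallel_drafting") is True:
        raise UnsupportedDraftError(
            f"{config_path}: draft was trained with parallel_drafting: true; "
            "SGLang cannot serve P-EAGLE heads yet "
            "(see https://github.com/sgl-project/sglang/issues/23171). "
            "Serve it with vLLM's parallel-drafting runtime instead."
        )
```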
Extract selected_token_ids from a recipe meta file into a SGLang token map.
Copy src_config_path to dst_config_path and normalize architectures.
For algorithms in _SGLANG_ARCHITECTURE_FOR_ALGORITHM the
architectures field is rewritten to the SGLang-canonical class name
(e.g. LlamaForCausalLMEagle3). For other algorithms the original
field is preserved. When num_hidden_layers is provided it is written
into the config so the exported drafter reflects its actual depth rather
than the target model’s depth. The write is staged through a sibling
.tmp file and finalized with os.replace so an interrupted write
cannot leave the destination half-truncated when rewriting in place.
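The staged write can be illustrated with a short sketch. The function name and parameters here are assumptions made for the example; only the sibling .tmp file and the os.replace step come from the description above.

```python
import json
import os
from pathlib import Path
from typing import Optional


def write_config_atomically(
    src: Path,
    dst: Path,
    architecture: Optional[str] = None,
    num_hidden_layers: Optional[int] = None,
) -> None:
    """Copy a JSON config to dst, optionally overriding two fields (illustrative)."""
    config = json.loads(Path(src).read_text())
    if architecture is not None:
        config["architectures"] = [architecture]
    if num_hidden_layers is not None:
        config["num_hidden_layers"] = num_hidden_layers

    dst = Path(dst)
    tmp = dst.with_name(dst.name + ".tmp")  # sibling file in the same directory
    tmp.write_text(json.dumps(config, indent=2) + "\n")
    # os.replace swaps the finished file into place in one step, so dst is
    # always either the old content or the new content, even when src == dst.
    os.replace(tmp, dst)
```

Keeping the temporary file in the same directory as the destination matters: os.replace is only atomic when both paths are on the same filesystem.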
Load a torch pickle, preferring weights_only=True when supported.
Build the python -m sglang.launch_server argv for a given config.
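For orientation, an argv builder might look roughly like the following. The function signature is hypothetical, and the mapping from this script's options to SGLang's `--speculative-*` server flags is an assumption; verify the flag names against the SGLang version you have installed.

```python
import sys
from typing import List, Optional


def build_launch_argv(
    target: str,
    draft: str,
    algorithm: str,
    num_steps: int,
    topk: int,
    num_draft_tokens: int,
    token_map: Optional[str] = None,
) -> List[str]:
    """Assemble a `python -m sglang.launch_server` command line (illustrative)."""
    argv = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", target,
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft,
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]
    if token_map is not None:
        # Only passed when a token map was produced for the draft.
        argv += ["--speculative-token-map", token_map]
    return argv
```

Building a list rather than a shell string avoids quoting problems when paths contain spaces, and the same list can be printed for inspection or handed to `subprocess` unchanged.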
Validate the environment, resolve the drafter checkpoint, then launch the SGLang server.
Returns the SGLang server’s exit code, or 2 if SGLang or safetensors
is missing.
Resolve a user-supplied drafter path to the model and token-map paths SGLang expects.
Accepts either the outer epoch_<E>_step_<S> directory or the inner
model/ directory; HF Hub repo ids are passed through untouched.
Parameters:
A local path or HF Hub repo id.
Speculative algorithm name.
When True, no on-disk export is performed and the returned paths reflect what would be produced on a real launch.
Returns: tuple[str, str | None]
(draft_path, token_map_path_or_None) suitable for SGLang flags.
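The path-resolution part of this behavior can be sketched as follows. This is a simplified illustration that ignores the token map and the legacy draft_model.pt export; the function name is hypothetical, and treating any string that is not an existing local path as a Hub repo id is an assumption.

```python
from pathlib import Path


def resolve_draft_dir(draft: str) -> str:
    """Map a user-supplied draft argument to the directory holding config.json."""
    path = Path(draft)
    if not path.exists():
        # Assumption: anything that is not on disk is an HF Hub repo id.
        return draft
    inner = path / "model"
    if (inner / "config.json").is_file():
        # Outer epoch_<E>_step_<S> directory: descend into model/.
        return str(inner)
    # Already the inner model/ directory (or another local layout).
    return str(path)
```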