nemo_automodel.components.speculative.serve_sglang


Serve an Automodel-trained EAGLE / EAGLE-3 drafter with SGLang.

The EAGLE recipes (recipes/llm/train_eagle{1,2,3}.py) write drafter checkpoints through the consolidated checkpointer: each epoch_<E>_step_<S>/ checkpoint contains an HF-style model/ directory (model.safetensors + config.json) alongside the EAGLE-3 vocab metadata (eagle_meta.pt). This script resolves that directory, regenerating SGLang's speculative token map from the metadata when needed, and then shells out to python -m sglang.launch_server with the right speculative-decoding flags. An older draft_model.pt layout is still accepted as a fallback for hand-exported checkpoints.

NOTE — P-EAGLE (parallel-drafting) heads are NOT servable here. A draft trained with parallel_drafting: true only loads into vLLM’s parallel-drafting runtime (vLLM >= 0.16); SGLang support is tracked upstream in https://github.com/sgl-project/sglang/issues/23171. This script rejects such a checkpoint with an actionable error rather than mis-serving it as a plain EAGLE-3 head.

NOTE — SGLang is NOT bundled with the NeMo-AutoModel container image and is intentionally NOT declared in pyproject.toml. To use this entry point, install it yourself into the same environment:

uv pip install "sglang>=0.5.9"

Refer to https://github.com/sgl-project/sglang for the version matching your CUDA / PyTorch stack. If SGLang is missing this script exits with a clear install hint rather than crashing on import.

Typical usage (after training produces a checkpoint at ./checkpoints/epoch_0_step_1000):

python -m nemo_automodel.components.speculative.serve_sglang \
  --target meta-llama/Llama-3.1-8B-Instruct \
  --draft ./checkpoints/epoch_0_step_1000 \
  --algorithm EAGLE3 \
  --num-steps 3 --topk 1 --num-draft-tokens 4

Pass --print-only to inspect the command without launching it; in that mode no checkpoint export is performed and the printed paths reflect what would be produced on a real launch.

Module Contents

Functions

| Name | Description |
| --- | --- |
| _check_sglang_available | Verify the sglang package can actually be imported, else exit (code 2). |
| _config_needs_rewrite | Return True when config_path does not match the SGLang architecture for algorithm. |
| _find_eagle_meta | Return the EAGLE-3 vocab-metadata file in checkpoint_dir, if present. |
| _has_hf_weight_file | Return True if path already contains a HF-style weight artifact. |
| _infer_num_hidden_layers | Infer num_hidden_layers from a state dict by counting unique layer indices. |
| _load_safetensors_save_file | Return safetensors.torch.save_file or exit with an install hint. |
| _maybe_export_training_checkpoint | Export a legacy bare draft_model.pt checkpoint into an HF/SGLang model/ directory. |
| _parse_args | Parse command-line arguments for the serve helper. |
| _raise_if_parallel_drafting | Reject P-EAGLE (parallel-drafting) drafts: SGLang cannot serve them yet. |
| _regenerate_token_map | Extract selected_token_ids from a recipe meta file into a SGLang token map. |
| _rewrite_config_for_sglang | Copy src_config_path to dst_config_path and normalize architectures. |
| _torch_load | Load a torch pickle, preferring weights_only=True when supported. |
| build_sglang_argv | Build the python -m sglang.launch_server argv for a given config. |
| main | Validate the environment, resolve the drafter ckpt, then exec sglang. |
| resolve_draft_artifacts | Resolve a user-supplied drafter path to the model and token-map paths SGLang expects. |

Data

_SAFETENSORS_INSTALL_HINT

_SGLANG_ARCHITECTURE_FOR_ALGORITHM

_SGLANG_INSTALL_HINT

logger

API

nemo_automodel.components.speculative.serve_sglang._check_sglang_available() -> None

Verify the sglang package can actually be imported, else exit (code 2).
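The check-then-exit behavior described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name check_importable and its two parameters are hypothetical, and printing the hint to stderr is an assumption.

```python
import importlib
import sys


def check_importable(module_name: str, install_hint: str) -> None:
    """Exit with code 2 and print a hint if module_name cannot be imported.

    Hypothetical helper: actually importing the module (rather than only
    looking it up) also catches installs that are present but broken.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Assumption: the hint goes to stderr before exiting.
        print(install_hint, file=sys.stderr)
        raise SystemExit(2)
```

Called as check_importable("sglang", hint), a missing package ends the process with exit code 2 instead of a traceback.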

nemo_automodel.components.speculative.serve_sglang._config_needs_rewrite(
config_path: pathlib.Path,
algorithm: str
) -> bool

Return True when config_path does not match the SGLang architecture for algorithm.
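One way such a check could work is shown below. The mapping value comes from _SGLANG_ARCHITECTURE_FOR_ALGORITHM as listed under Data; the function name, the exact comparison against a single-element architectures list, and the choice to return False for unmapped algorithms are assumptions made for illustration.

```python
import json
from pathlib import Path

# Value taken from the Data section of this page.
ARCHITECTURE_FOR_ALGORITHM = {"EAGLE3": "LlamaForCausalLMEagle3"}


def config_needs_rewrite_sketch(config_path: Path, algorithm: str) -> bool:
    """Return True when the config's architectures differ from the expected one."""
    expected = ARCHITECTURE_FOR_ALGORITHM.get(algorithm)
    if expected is None:
        # Assumption: algorithms without a mapping never need a rewrite.
        return False
    config = json.loads(config_path.read_text())
    return config.get("architectures") != [expected]
```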

nemo_automodel.components.speculative.serve_sglang._find_eagle_meta(
checkpoint_dir: pathlib.Path
) -> pathlib.Path | None

Return the EAGLE-3 vocab-metadata file in checkpoint_dir, if present.

The recipes write the metadata as eagle_meta.pt; older hand-exported checkpoints used eagle3_meta.pt. Prefer the current name and fall back to the legacy one, mirroring the recipe’s own loader (_load_extra_state in recipes/llm/train_eagle3.py).
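The prefer-current, fall-back-to-legacy lookup can be sketched like this. The two file names come from the description above; the function name and the use of is_file() are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional

# Current name first, legacy name second, so the current one wins when both exist.
META_FILE_NAMES = ("eagle_meta.pt", "eagle3_meta.pt")


def find_eagle_meta_sketch(checkpoint_dir: Path) -> Optional[Path]:
    """Return the first metadata file found in checkpoint_dir, or None."""
    for name in META_FILE_NAMES:
        candidate = checkpoint_dir / name
        if candidate.is_file():
            return candidate
    return None
```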

nemo_automodel.components.speculative.serve_sglang._has_hf_weight_file(
path: pathlib.Path
) -> bool

Return True if path already contains a HF-style weight artifact.

nemo_automodel.components.speculative.serve_sglang._infer_num_hidden_layers(
state_dict: dict[str, typing.Any]
) -> int | None

Infer num_hidden_layers from a state dict by counting unique layer indices.
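Counting unique layer indices might look like the sketch below. It assumes parameter names follow the common "layers.<index>." convention (for example model.layers.0.self_attn.q_proj.weight); the actual key pattern the module matches is not stated on this page, and the function name is hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Assumption: layer parameters are named "...layers.<index>....".
_LAYER_KEY = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def infer_num_hidden_layers_sketch(state_dict: Dict[str, Any]) -> Optional[int]:
    """Count distinct layer indices in the keys; None if no key matches."""
    indices = set()
    for key in state_dict:
        match = _LAYER_KEY.search(key)
        if match:
            indices.add(int(match.group(1)))
    return len(indices) or None
```

Returning None rather than 0 lets a caller distinguish "could not infer" from a real depth and leave the config's existing value alone.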

nemo_automodel.components.speculative.serve_sglang._load_safetensors_save_file() -> typing.Callable[..., None]

Return safetensors.torch.save_file or exit with an install hint.

nemo_automodel.components.speculative.serve_sglang._maybe_export_training_checkpoint(
checkpoint_dir: pathlib.Path,
algorithm: str,
dry_run: bool = False
) -> tuple[pathlib.Path, pathlib.Path | None]

Export a legacy bare draft_model.pt checkpoint into an HF/SGLang model/ directory.

The standard recipe output is already a consolidated model/ directory and is resolved directly by resolve_draft_artifacts; this function is the fallback for the older layout where the draft weights were hand-saved as a bare draft_model.pt. When checkpoint_dir does not hold a draft_model.pt + config.json pair, it is returned unchanged, because the model/ layout is already handled by resolve_draft_artifacts.

Parameters:

checkpoint_dir
Path

A dir holding a legacy draft_model.pt + config.json (and eagle_meta.pt for EAGLE-3, or the legacy eagle3_meta.pt); otherwise this is a no-op.

algorithm
str

Speculative algorithm name, used to pick the right SGLang architecture and to decide whether a token map is needed.

dry_run
bool, defaults to False

When True, return the paths that would be produced without writing anything.

Returns: tuple[Path, Path | None]

(export_dir, token_map_path_or_None).

nemo_automodel.components.speculative.serve_sglang._parse_args(
argv: list[str] | None = None
) -> argparse.Namespace

Parse command-line arguments for the serve helper.

nemo_automodel.components.speculative.serve_sglang._raise_if_parallel_drafting(
config_path: pathlib.Path
) -> None

Reject P-EAGLE (parallel-drafting) drafts: SGLang cannot serve them yet.

A P-EAGLE head carries parallel_drafting: true (and a mask_hidden tensor) and only loads into vLLM’s parallel-drafting runtime (https://github.com/vllm-project/speculators/pull/480). Serving it through SGLang’s EAGLE-3 path would silently produce wrong drafts because SGLang ignores mask_hidden / mask_token_id / the COD config. SGLang support is tracked upstream in https://github.com/sgl-project/sglang/issues/23171; until it lands, fail loudly with an actionable message instead.
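A fail-loudly guard of this kind could be sketched as follows. The parallel_drafting key comes from the description above; the function name, reading the key at the top level of config.json, the ValueError type, and the message wording are all assumptions.

```python
import json
from pathlib import Path


def raise_if_parallel_drafting_sketch(config_path: Path) -> None:
    """Raise if the draft config marks the head as a P-EAGLE draft."""
    config = json.loads(config_path.read_text())
    # Assumption: the flag is a top-level boolean in config.json.
    if config.get("parallel_drafting") is True:
        raise ValueError(
            f"{config_path} was trained with parallel_drafting: true; "
            "SGLang cannot serve P-EAGLE drafts yet. Use vLLM's "
            "parallel-drafting runtime instead."
        )
```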

nemo_automodel.components.speculative.serve_sglang._regenerate_token_map(
meta_path: pathlib.Path,
token_map_path: pathlib.Path
) -> None

Extract selected_token_ids from a recipe meta file into a SGLang token map.

nemo_automodel.components.speculative.serve_sglang._rewrite_config_for_sglang(
src_config_path: pathlib.Path,
dst_config_path: pathlib.Path,
algorithm: str,
num_hidden_layers: int | None = None
) -> None

Copy src_config_path to dst_config_path and normalize architectures.

For algorithms in _SGLANG_ARCHITECTURE_FOR_ALGORITHM the architectures field is rewritten to the SGLang-canonical class name (e.g. LlamaForCausalLMEagle3). For other algorithms the original field is preserved. When num_hidden_layers is provided it is written into the config so the exported drafter reflects its actual depth rather than the target model’s depth. The write is staged through a sibling .tmp file and finalized with os.replace so an interrupted write cannot leave the destination half-truncated when rewriting in place.
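The staged write described above can be illustrated with a short sketch. The .tmp sibling and os.replace come from the description; the function name, the JSON formatting, and the exact temporary file name are assumptions.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Value taken from the Data section of this page.
ARCHITECTURE_FOR_ALGORITHM = {"EAGLE3": "LlamaForCausalLMEagle3"}


def rewrite_config_sketch(
    src: Path, dst: Path, algorithm: str, num_hidden_layers: Optional[int] = None
) -> None:
    """Copy src to dst, normalizing architectures, via an atomic replace."""
    config = json.loads(src.read_text())
    architecture = ARCHITECTURE_FOR_ALGORITHM.get(algorithm)
    if architecture is not None:
        config["architectures"] = [architecture]
    if num_hidden_layers is not None:
        config["num_hidden_layers"] = num_hidden_layers
    # Write to a sibling file first; dst is only touched by the final rename.
    tmp = dst.with_name(dst.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2))
    os.replace(tmp, dst)
```

Because os.replace swaps the file in a single step on the same filesystem, a reader sees either the old config or the complete new one, even when src and dst are the same path.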

nemo_automodel.components.speculative.serve_sglang._torch_load(
path: pathlib.Path
) -> typing.Any

Load a torch pickle, preferring weights_only=True when supported.

nemo_automodel.components.speculative.serve_sglang.build_sglang_argv(
args: argparse.Namespace
) -> list[str]

Build the python -m sglang.launch_server argv for a given config.
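As a rough illustration of what such an argv might contain, the sketch below maps the options from the usage example onto SGLang's speculative-decoding server flags. The function name and its keyword parameters are hypothetical, and the exact flag set the module emits is not documented on this page, so treat the mapping as an assumption.

```python
import sys
from typing import List, Optional


def build_launch_argv_sketch(
    target: str,
    draft: str,
    algorithm: str,
    num_steps: int,
    topk: int,
    num_draft_tokens: int,
    token_map: Optional[str] = None,
) -> List[str]:
    """Assemble a python -m sglang.launch_server command line."""
    argv = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", target,
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft,
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]
    # Assumption: the token map is only passed when one was resolved.
    if token_map is not None:
        argv += ["--speculative-token-map", token_map]
    return argv
```

Building a list instead of a shell string avoids quoting problems when a checkpoint path contains spaces.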

nemo_automodel.components.speculative.serve_sglang.main(
argv: list[str] | None = None
) -> int

Validate the environment, resolve the drafter ckpt, then exec sglang.

Returns the SGLang server’s exit code, or 2 if SGLang or safetensors is missing.

nemo_automodel.components.speculative.serve_sglang.resolve_draft_artifacts(
draft: str,
algorithm: str,
dry_run: bool = False
) -> tuple[str, str | None]

Resolve a user-supplied drafter path to the model and token-map paths SGLang expects.

Accepts either the outer epoch_<E>_step_<S> directory or the inner model/ directory; HF Hub repo ids are passed through untouched.

Parameters:

draft
str

A local path or HF Hub repo id.

algorithm
str

Speculative algorithm name.

dry_run
bool, defaults to False

When True, no on-disk export is performed and the returned paths reflect what would be produced on a real launch.

Returns: tuple[str, str | None]

(draft_path, token_map_path_or_None) suitable for SGLang flags.
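The path-resolution part of this contract can be sketched as below. The sketch covers only the model directory, not the token map, and it rests on two assumptions: a string that does not exist on disk is treated as an HF Hub repo id, and the outer checkpoint directory is recognized by its model/ subdirectory. The function name is hypothetical.

```python
from pathlib import Path


def resolve_model_dir_sketch(draft: str) -> str:
    """Map a user-supplied draft argument to the directory SGLang should load."""
    path = Path(draft)
    if not path.exists():
        # Assumption: anything not on disk is an HF Hub repo id; pass it through.
        return draft
    inner = path / "model"
    if inner.is_dir():
        # Outer epoch_<E>_step_<S> directory: descend into model/.
        return str(inner)
    # Already the inner model/ directory (or another local export).
    return str(path)
```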

nemo_automodel.components.speculative.serve_sglang._SAFETENSORS_INSTALL_HINT = 'safetensors is required to export Automodel EAGLE checkpoints for SGLang. Insta...
nemo_automodel.components.speculative.serve_sglang._SGLANG_ARCHITECTURE_FOR_ALGORITHM = {'EAGLE3': 'LlamaForCausalLMEagle3'}
nemo_automodel.components.speculative.serve_sglang._SGLANG_INSTALL_HINT = 'sglang is not installed in this environment. Install it manually with `uv pip i...
nemo_automodel.components.speculative.serve_sglang.logger = logging.getLogger(__name__)