nemo_automodel.components.speculative.serve_sglang

Serve an Automodel-trained EAGLE / EAGLE-3 drafter with SGLang.

The EAGLE drafter checkpoints produced by the EAGLE recipes (recipes/llm/train_eagle{1,2,3}.py) are saved as draft_model.pt plus recipe metadata. This script converts that layout into an HF/SGLang-readable model/ directory when needed, then shells out to python -m sglang.launch_server with the right speculative-decoding flags.

NOTE: SGLang is NOT bundled with the NeMo-AutoModel container image and is intentionally NOT declared in pyproject.toml. To use this entry point, install it yourself into the same environment:

uv pip install "sglang>=0.5.9"

Refer to https://github.com/sgl-project/sglang for the version matching your CUDA / PyTorch stack. If SGLang is missing, this script exits with a clear install hint rather than crashing on import.

Typical usage (after training produces a checkpoint at ./checkpoints/epoch_0_step_1000):

python -m nemo_automodel.components.speculative.serve_sglang \
    --target meta-llama/Llama-3.1-8B-Instruct \
    --draft ./checkpoints/epoch_0_step_1000 \
    --algorithm EAGLE3 \
    --num-steps 3 --topk 1 --num-draft-tokens 4

Pass --print-only to inspect the command without launching it; in that mode no checkpoint export is performed and the printed paths reflect what would be produced on a real launch.

Module Contents

Functions

_has_hf_weight_file

Return True if path already contains a HF-style weight artifact.

_check_sglang_available

Verify the sglang package can actually be imported, else exit (code 2).

_load_safetensors_save_file

Return safetensors.torch.save_file or exit with an install hint.

_torch_load

Load a torch pickle, preferring weights_only=True when supported.

_rewrite_config_for_sglang

Copy src_config_path to dst_config_path and normalize architectures.

_config_needs_rewrite

Return True when config_path does not match the SGLang architecture for algorithm.

_infer_num_hidden_layers

Infer num_hidden_layers from a state dict by counting unique layer indices.

_regenerate_token_map

Extract selected_token_ids from a recipe meta file into a SGLang token map.

_maybe_export_training_checkpoint

Convert recipe-native EAGLE checkpoints into an HF/SGLang-readable directory.

resolve_draft_artifacts

Resolve a user-supplied drafter path to the model and token-map paths SGLang expects.

build_sglang_argv

Build the python -m sglang.launch_server argv for a given config.

_parse_args

Parse command-line arguments for the serve helper.

main

Validate the environment, resolve the drafter checkpoint, then run the SGLang server.

Data

API

nemo_automodel.components.speculative.serve_sglang.logger

‘getLogger(…)’

nemo_automodel.components.speculative.serve_sglang._SGLANG_INSTALL_HINT

‘sglang is not installed in this environment. Install it manually with `uv pip install “sglang>=0.5.9…’

nemo_automodel.components.speculative.serve_sglang._SAFETENSORS_INSTALL_HINT

‘safetensors is required to export Automodel EAGLE checkpoints for SGLang. Install it with `uv pip in…’

nemo_automodel.components.speculative.serve_sglang._SGLANG_ARCHITECTURE_FOR_ALGORITHM

None

nemo_automodel.components.speculative.serve_sglang._has_hf_weight_file(path: pathlib.Path) -> bool

Return True if path already contains a HF-style weight artifact.
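
A minimal sketch of what such a check might look like. The list of file names is an assumption (the common Hugging Face weight artifacts), and has_hf_weight_file is a hypothetical stand-in, not the module's actual implementation.

```python
from pathlib import Path

# Assumed file names: the usual Hugging Face weight artifacts. The real
# helper may look for a different set.
HF_WEIGHT_NAMES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
)


def has_hf_weight_file(path: Path) -> bool:
    """Return True if `path` is a directory holding any known weight artifact."""
    return path.is_dir() and any((path / name).is_file() for name in HF_WEIGHT_NAMES)
```

A check of this kind lets the export step be skipped when the directory is already HF-readable.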

nemo_automodel.components.speculative.serve_sglang._check_sglang_available() -> None

Verify the sglang package can actually be imported, else exit (code 2).
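
The pattern described here (attempt a real import, exit with code 2 and a hint on failure) can be sketched as follows. The function name check_importable and the hint wording are hypothetical; only the exit code and the `uv pip install "sglang>=0.5.9"` command come from this page.

```python
import importlib
import sys

# Hypothetical hint text; the module's real constant is _SGLANG_INSTALL_HINT.
INSTALL_HINT = 'sglang is not installed. Install it with: uv pip install "sglang>=0.5.9"'


def check_importable(module_name: str = "sglang", hint: str = INSTALL_HINT) -> None:
    """Import `module_name`; on failure print `hint` and exit with code 2."""
    try:
        # A real import catches broken installs that a bare spec lookup would miss.
        importlib.import_module(module_name)
    except ImportError:
        print(hint, file=sys.stderr)
        raise SystemExit(2)
```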

nemo_automodel.components.speculative.serve_sglang._load_safetensors_save_file() -> Callable[..., None]

Return safetensors.torch.save_file or exit with an install hint.

nemo_automodel.components.speculative.serve_sglang._torch_load(path: pathlib.Path) -> Any

Load a torch pickle, preferring weights_only=True when supported.
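
One plausible way to "prefer weights_only=True when supported" is to try the keyword and fall back if the loader rejects it. The sketch below assumes that approach; torch_load_sketch and its load_fn parameter are hypothetical (the parameter exists only so the fallback can be exercised without PyTorch installed), and map_location="cpu" is likewise an assumption.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def torch_load_sketch(path: Path, load_fn: Optional[Callable[..., Any]] = None) -> Any:
    """Load a checkpoint, trying the safer weights_only=True mode first."""
    if load_fn is None:
        import torch  # deferred so the sketch imports without PyTorch

        load_fn = torch.load
    try:
        return load_fn(path, map_location="cpu", weights_only=True)
    except TypeError:
        # Assumed fallback: loaders that do not know the keyword reject it.
        return load_fn(path, map_location="cpu")
```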

nemo_automodel.components.speculative.serve_sglang._rewrite_config_for_sglang(
src_config_path: pathlib.Path,
dst_config_path: pathlib.Path,
algorithm: str,
*,
num_hidden_layers: int | None = None,
) -> None

Copy src_config_path to dst_config_path and normalize architectures.

For algorithms in _SGLANG_ARCHITECTURE_FOR_ALGORITHM the architectures field is rewritten to the SGLang-canonical class name (e.g. LlamaForCausalLMEagle3). For other algorithms the original field is preserved. When num_hidden_layers is provided it is written into the config so the exported drafter reflects its actual depth rather than the target model’s depth. The write is staged through a sibling .tmp file and finalized with os.replace so an interrupted write cannot leave the destination half-truncated when rewriting in place.
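
The staged write described above can be sketched as below. The mapping contents are an assumption (only the LlamaForCausalLMEagle3 class name appears on this page), and rewrite_config is a hypothetical stand-in for the private helper.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Assumed mapping; the real _SGLANG_ARCHITECTURE_FOR_ALGORITHM may hold more entries.
ARCHITECTURE_FOR_ALGORITHM = {"EAGLE3": "LlamaForCausalLMEagle3"}


def rewrite_config(
    src: Path, dst: Path, algorithm: str, *, num_hidden_layers: Optional[int] = None
) -> None:
    config = json.loads(src.read_text())
    arch = ARCHITECTURE_FOR_ALGORITHM.get(algorithm)
    if arch is not None:
        config["architectures"] = [arch]  # unknown algorithms keep the original field
    if num_hidden_layers is not None:
        config["num_hidden_layers"] = num_hidden_layers
    # Stage through a sibling .tmp file, then swap it into place in one step.
    tmp = dst.with_name(dst.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2))
    os.replace(tmp, dst)
```

Because os.replace swaps the file in a single step, a reader sees either the old config or the new one, even when src and dst are the same path.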

nemo_automodel.components.speculative.serve_sglang._config_needs_rewrite(
config_path: pathlib.Path,
algorithm: str,
) -> bool

Return True when config_path does not match the SGLang architecture for algorithm.
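
A rough sketch of such a check, under two assumptions: the mapping contains the EAGLE3 entry mentioned on this page, and algorithms absent from the mapping never need a rewrite. config_needs_rewrite is a hypothetical name.

```python
import json
from pathlib import Path

# Assumed mapping, mirroring the example class name given on this page.
ARCHITECTURE_FOR_ALGORITHM = {"EAGLE3": "LlamaForCausalLMEagle3"}


def config_needs_rewrite(config_path: Path, algorithm: str) -> bool:
    expected = ARCHITECTURE_FOR_ALGORITHM.get(algorithm)
    if expected is None:
        return False  # assumption: no canonical class, so nothing to normalize
    config = json.loads(config_path.read_text())
    return config.get("architectures") != [expected]
```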

nemo_automodel.components.speculative.serve_sglang._infer_num_hidden_layers(
state_dict: dict[str, Any],
) -> int | None

Infer num_hidden_layers from a state dict by counting unique layer indices.
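
Counting unique layer indices might look like the sketch below. The key pattern (a `layers.<N>.` segment, as in typical Llama-style state dicts) is an assumption about the drafter's parameter names, and infer_num_hidden_layers is a hypothetical stand-in.

```python
import re
from typing import Any, Optional

# Assumed key layout: "...layers.<index>.<param>" as in Llama-style models.
_LAYER_KEY = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def infer_num_hidden_layers(state_dict: dict[str, Any]) -> Optional[int]:
    indices = set()
    for key in state_dict:
        match = _LAYER_KEY.search(key)
        if match:
            indices.add(int(match.group(1)))
    # Return None when no layer-indexed keys are present.
    return len(indices) or None
```

For example, keys under `model.layers.0.` and `model.layers.1.` together yield 2, while a state dict with no `layers.<N>.` keys yields None.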

nemo_automodel.components.speculative.serve_sglang._regenerate_token_map(
meta_path: pathlib.Path,
token_map_path: pathlib.Path,
) -> None

Extract selected_token_ids from a recipe meta file into a SGLang token map.

nemo_automodel.components.speculative.serve_sglang._maybe_export_training_checkpoint(
checkpoint_dir: pathlib.Path,
algorithm: str,
*,
dry_run: bool = False,
) -> tuple[pathlib.Path, pathlib.Path | None]

Convert recipe-native EAGLE checkpoints into an HF/SGLang-readable directory.

Parameters:
  • checkpoint_dir – Recipe checkpoint dir, expected to contain draft_model.pt and config.json (and eagle3_meta.pt for EAGLE-3).

  • algorithm – Speculative algorithm name, used to pick the right SGLang architecture and to decide whether a token map is needed.

  • dry_run – When True, return the paths that would be produced without writing anything.

Returns:

(export_dir, token_map_path_or_None).

nemo_automodel.components.speculative.serve_sglang.resolve_draft_artifacts(
draft: str,
algorithm: str,
*,
dry_run: bool = False,
) -> tuple[str, str | None]

Resolve a user-supplied drafter path to the model and token-map paths SGLang expects.

Accepts either the outer epoch_<E>_step_<S> directory or the inner model/ directory; HF Hub repo ids are passed through untouched.

Parameters:
  • draft – A local path or HF Hub repo id.

  • algorithm – Speculative algorithm name.

  • dry_run – When True, no on-disk export is performed and the returned paths reflect what would be produced on a real launch.

Returns:

(draft_path, token_map_path_or_None) suitable for SGLang flags.
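
The three accepted input shapes can be illustrated with a small classifier. This is only a sketch: classify_draft is hypothetical, and treating any string that is not an existing local path as a Hub repo id is an assumption about how the pass-through is decided.

```python
from pathlib import Path
from typing import Optional, Tuple


def classify_draft(draft: str) -> Tuple[str, Optional[Path]]:
    """Return (kind, local_path) where kind is 'hub', 'model_dir' or 'checkpoint_dir'."""
    path = Path(draft)
    if not path.exists():
        return "hub", None  # assumption: non-local strings are Hub repo ids
    if path.name == "model":
        return "model_dir", path  # inner model/ directory
    return "checkpoint_dir", path  # outer epoch_<E>_step_<S> directory
```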

nemo_automodel.components.speculative.serve_sglang.build_sglang_argv(args: argparse.Namespace) -> list[str]

Build the python -m sglang.launch_server argv for a given config.
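
As an illustration, the argv for the usage example at the top of this page might be assembled as below. The SGLang flag names are taken from SGLang's speculative-decoding options as best understood and should be checked against the installed version; build_argv_sketch, the token_map attribute, and the use of sys.executable are assumptions.

```python
import argparse
import sys


def build_argv_sketch(args: argparse.Namespace) -> list[str]:
    argv = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", args.target,
        "--speculative-algorithm", args.algorithm,
        "--speculative-draft-model-path", args.draft,
        "--speculative-num-steps", str(args.num_steps),
        "--speculative-eagle-topk", str(args.topk),
        "--speculative-num-draft-tokens", str(args.num_draft_tokens),
    ]
    # Hypothetical attribute: only passed when a token map was resolved.
    token_map = getattr(args, "token_map", None)
    if token_map:
        argv += ["--speculative-token-map", token_map]
    return argv
```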

nemo_automodel.components.speculative.serve_sglang._parse_args(argv: list[str] | None = None) -> argparse.Namespace

Parse command-line arguments for the serve helper.
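
A parser covering the flags shown in the usage example could be sketched like this. The flag names come from this page; the defaults, the required markers, and the name parse_args_sketch are assumptions, and the real parser may define further options.

```python
import argparse
from typing import Optional


def parse_args_sketch(argv: Optional[list[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="serve_sglang")
    parser.add_argument("--target", required=True, help="Target model path or Hub id")
    parser.add_argument("--draft", required=True, help="Drafter checkpoint path or Hub id")
    # Defaults below are illustrative, copied from the usage example.
    parser.add_argument("--algorithm", default="EAGLE3")
    parser.add_argument("--num-steps", type=int, default=3)
    parser.add_argument("--topk", type=int, default=1)
    parser.add_argument("--num-draft-tokens", type=int, default=4)
    parser.add_argument("--print-only", action="store_true")
    return parser.parse_args(argv)
```

argparse maps dashed flags to underscored attributes, so `--num-draft-tokens` is read back as `args.num_draft_tokens`.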

nemo_automodel.components.speculative.serve_sglang.main(argv: list[str] | None = None) -> int

Validate the environment, resolve the drafter checkpoint, then run the SGLang server.

Returns the SGLang server’s exit code, or 2 if SGLang or safetensors is missing.