
# nemo_automodel.components.speculative.serve_sglang

Serve an Automodel-trained EAGLE / EAGLE-3 drafter with SGLang.

The EAGLE drafter checkpoints produced by the EAGLE recipes
(`recipes/llm/train_eagle{1,2,3}.py`) are written by the consolidated
checkpointer as an HF-style `model/` directory (`model.safetensors` +
`config.json`) inside each `epoch_<E>_step_<S>/` checkpoint, alongside the
EAGLE-3 vocab metadata (`eagle_meta.pt`). This script resolves that directory,
regenerating SGLang's speculative token map from the metadata when needed,
then shells out to `python -m sglang.launch_server` with the right
speculative-decoding flags. An older `draft_model.pt` layout is still accepted
as a fallback for hand-exported checkpoints.

NOTE: P-EAGLE (parallel-drafting) heads are NOT servable here. A draft trained
with `parallel_drafting: true` only loads into vLLM's parallel-drafting
runtime (vLLM >= 0.16); SGLang support is tracked upstream in
[https://github.com/sgl-project/sglang/issues/23171](https://github.com/sgl-project/sglang/issues/23171). This script rejects such a
checkpoint with an actionable error rather than mis-serving it as a plain
EAGLE-3 head.

NOTE: SGLang is NOT bundled with the NeMo-AutoModel container image and
is intentionally NOT declared in `pyproject.toml`. To use this entry
point, install it yourself into the same environment:

    uv pip install "sglang>=0.5.9"

Refer to [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang) for the version matching
your CUDA / PyTorch stack. If SGLang is missing this script exits with a
clear install hint rather than crashing on import.

Typical usage (after training produces a checkpoint at
`./checkpoints/epoch_0_step_1000`):

    python -m nemo_automodel.components.speculative.serve_sglang \
        --target meta-llama/Llama-3.1-8B-Instruct \
        --draft ./checkpoints/epoch_0_step_1000 \
        --algorithm EAGLE3 \
        --num-steps 3 --topk 1 --num-draft-tokens 4

Pass `--print-only` to inspect the command without launching it; in that
mode no checkpoint export is performed and the printed paths reflect what
would be produced on a real launch.

## Module Contents

### Functions

| Name                                                                                                                         | Description                                                                            |
| ---------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| [`_check_sglang_available`](#nemo_automodel-components-speculative-serve_sglang-_check_sglang_available)                     | Verify the `sglang` package can actually be imported, else exit (code 2).              |
| [`_config_needs_rewrite`](#nemo_automodel-components-speculative-serve_sglang-_config_needs_rewrite)                         | Return True when `config_path` does not match the SGLang architecture for `algorithm`. |
| [`_find_eagle_meta`](#nemo_automodel-components-speculative-serve_sglang-_find_eagle_meta)                                   | Return the EAGLE-3 vocab-metadata file in `checkpoint_dir`, if present.                |
| [`_has_hf_weight_file`](#nemo_automodel-components-speculative-serve_sglang-_has_hf_weight_file)                             | Return True if `path` already contains a HF-style weight artifact.                     |
| [`_infer_num_hidden_layers`](#nemo_automodel-components-speculative-serve_sglang-_infer_num_hidden_layers)                   | Infer `num_hidden_layers` from a state dict by counting unique layer indices.          |
| [`_load_safetensors_save_file`](#nemo_automodel-components-speculative-serve_sglang-_load_safetensors_save_file)             | Return `safetensors.torch.save_file` or exit with an install hint.                     |
| [`_maybe_export_training_checkpoint`](#nemo_automodel-components-speculative-serve_sglang-_maybe_export_training_checkpoint) | Export a legacy bare `draft_model.pt` checkpoint into an HF/SGLang `model/` directory. |
| [`_parse_args`](#nemo_automodel-components-speculative-serve_sglang-_parse_args)                                             | Parse command-line arguments for the serve helper.                                     |
| [`_raise_if_parallel_drafting`](#nemo_automodel-components-speculative-serve_sglang-_raise_if_parallel_drafting)             | Reject P-EAGLE (parallel-drafting) drafts: SGLang cannot serve them yet.               |
| [`_regenerate_token_map`](#nemo_automodel-components-speculative-serve_sglang-_regenerate_token_map)                         | Extract `selected_token_ids` from a recipe meta file into a SGLang token map.          |
| [`_rewrite_config_for_sglang`](#nemo_automodel-components-speculative-serve_sglang-_rewrite_config_for_sglang)               | Copy `src_config_path` to `dst_config_path` and normalize `architectures`.             |
| [`_torch_load`](#nemo_automodel-components-speculative-serve_sglang-_torch_load)                                             | Load a torch pickle, preferring `weights_only=True` when supported.                    |
| [`build_sglang_argv`](#nemo_automodel-components-speculative-serve_sglang-build_sglang_argv)                                 | Build the `python -m sglang.launch_server` argv for a given config.                    |
| [`main`](#nemo_automodel-components-speculative-serve_sglang-main)                                                           | Validate the environment, resolve the drafter ckpt, then exec sglang.                  |
| [`resolve_draft_artifacts`](#nemo_automodel-components-speculative-serve_sglang-resolve_draft_artifacts)                     | Resolve a user-supplied drafter path to the model and token-map paths SGLang expects.  |

### Data

[`_SAFETENSORS_INSTALL_HINT`](#nemo_automodel-components-speculative-serve_sglang-_SAFETENSORS_INSTALL_HINT)

[`_SGLANG_ARCHITECTURE_FOR_ALGORITHM`](#nemo_automodel-components-speculative-serve_sglang-_SGLANG_ARCHITECTURE_FOR_ALGORITHM)

[`_SGLANG_INSTALL_HINT`](#nemo_automodel-components-speculative-serve_sglang-_SGLANG_INSTALL_HINT)

[`logger`](#nemo_automodel-components-speculative-serve_sglang-logger)

### API

```python
nemo_automodel.components.speculative.serve_sglang._check_sglang_available() -> None
```

Verify the `sglang` package can actually be imported, else exit (code 2).
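
A minimal sketch of this kind of availability check, assuming the package is probed with `importlib`. The helper name `check_importable` and the hint text passed to it are hypothetical and are not taken from the module's source:

```python
import importlib
import sys


def check_importable(package: str, install_hint: str) -> None:
    """Exit with code 2 and print a hint if `package` cannot be imported.

    Hypothetical helper: the real function's internals are not shown in
    this reference page.
    """
    try:
        # Importing (rather than only locating) the package also catches
        # installs that are present on disk but fail at import time.
        importlib.import_module(package)
    except ImportError:
        print(install_hint, file=sys.stderr)
        raise SystemExit(2)
```

Exiting through `SystemExit(2)` lets a caller such as `main` surface the documented exit code without a traceback.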

```python
nemo_automodel.components.speculative.serve_sglang._config_needs_rewrite(
    config_path: pathlib.Path,
    algorithm: str
) -> bool
```

Return True when `config_path` does not match the SGLang architecture for `algorithm`.

```python
nemo_automodel.components.speculative.serve_sglang._find_eagle_meta(
    checkpoint_dir: pathlib.Path
) -> pathlib.Path | None
```

Return the EAGLE-3 vocab-metadata file in `checkpoint_dir`, if present.

The recipes write the metadata as `eagle_meta.pt`; older hand-exported
checkpoints used `eagle3_meta.pt`. Prefer the current name and fall back to
the legacy one, mirroring the recipe's own loader (`_load_extra_state` in
`recipes/llm/train_eagle3.py`).
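
The lookup order described above can be sketched as follows. This is an illustration under the assumption that a plain file-existence check is enough; the function name `find_eagle_meta` is a stand-in for the private helper:

```python
from __future__ import annotations

from pathlib import Path

# Current name first, legacy name second, as described above.
_META_CANDIDATES = ("eagle_meta.pt", "eagle3_meta.pt")


def find_eagle_meta(checkpoint_dir: Path) -> Path | None:
    """Return the first metadata file found in `checkpoint_dir`, else None."""
    for name in _META_CANDIDATES:
        candidate = checkpoint_dir / name
        if candidate.is_file():
            return candidate
    return None
```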

```python
nemo_automodel.components.speculative.serve_sglang._has_hf_weight_file(
    path: pathlib.Path
) -> bool
```

Return True if `path` already contains a HF-style weight artifact.

```python
nemo_automodel.components.speculative.serve_sglang._infer_num_hidden_layers(
    state_dict: dict[str, typing.Any]
) -> int | None
```

Infer `num_hidden_layers` from a state dict by counting unique layer indices.
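
One way to count unique layer indices is sketched below. It assumes parameter names contain a `layers.<index>.` segment, which is common for HF-style Llama checkpoints but is not confirmed for this module; the regular expression and the function name are illustrative:

```python
from __future__ import annotations

import re
from typing import Any

# Assumption: layer parameters are named like "model.layers.3.mlp.up_proj.weight".
_LAYER_INDEX = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def infer_num_hidden_layers(state_dict: dict[str, Any]) -> int | None:
    """Count distinct layer indices found in the keys, or return None if there are none."""
    indices: set[int] = set()
    for key in state_dict:
        match = _LAYER_INDEX.search(key)
        if match:
            indices.add(int(match.group(1)))
    return len(indices) if indices else None
```

Returning `None` when no layer keys are found lets the caller leave the config's existing depth untouched.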

```python
nemo_automodel.components.speculative.serve_sglang._load_safetensors_save_file() -> typing.Callable[..., None]
```

Return `safetensors.torch.save_file` or exit with an install hint.

```python
nemo_automodel.components.speculative.serve_sglang._maybe_export_training_checkpoint(
    checkpoint_dir: pathlib.Path,
    algorithm: str,
    dry_run: bool = False
) -> tuple[pathlib.Path, pathlib.Path | None]
```

Export a legacy bare `draft_model.pt` checkpoint into an HF/SGLang `model/` directory.

The standard recipe output is already a consolidated `model/` directory and
is resolved directly by `resolve_draft_artifacts`; this is the fallback for
the older layout where the draft weights were hand-saved as a bare
`draft_model.pt`. When `checkpoint_dir` has no `draft_model.pt` +
`config.json` it is returned unchanged (the `model/` path handles it).

**Parameters:**

`checkpoint_dir`: A directory holding a legacy `draft_model.pt` + `config.json`
(and `eagle_meta.pt` for EAGLE-3, or the legacy `eagle3_meta.pt`);
otherwise this is a no-op.

`algorithm`: Speculative algorithm name, used to pick the right
SGLang architecture and to decide whether a token map is needed.

`dry_run`: When True, return the paths that *would* be produced
without writing anything.

**Returns:** `tuple[Path, Path | None]`

`(export_dir, token_map_path_or_None)`.

```python
nemo_automodel.components.speculative.serve_sglang._parse_args(
    argv: list[str] | None = None
) -> argparse.Namespace
```

Parse command-line arguments for the serve helper.

```python
nemo_automodel.components.speculative.serve_sglang._raise_if_parallel_drafting(
    config_path: pathlib.Path
) -> None
```

Reject P-EAGLE (parallel-drafting) drafts: SGLang cannot serve them yet.

A P-EAGLE head carries `parallel_drafting: true` (and a `mask_hidden`
tensor) and only loads into vLLM's parallel-drafting runtime
([https://github.com/vllm-project/speculators/pull/480](https://github.com/vllm-project/speculators/pull/480)). Serving it through
SGLang's EAGLE-3 path would silently produce wrong drafts because SGLang
ignores `mask_hidden` / `mask_token_id` / the COD config. SGLang support
is tracked upstream in [https://github.com/sgl-project/sglang/issues/23171](https://github.com/sgl-project/sglang/issues/23171);
until it lands, fail loudly with an actionable message instead.
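
A rough sketch of such a guard is shown below. It assumes `parallel_drafting` is a top-level key in `config.json` and uses `ValueError` as the error type; both are assumptions, since this page does not state where the key lives or which exception the module raises:

```python
import json
from pathlib import Path


def raise_if_parallel_drafting(config_path: Path) -> None:
    """Refuse to continue when the draft config is marked as a P-EAGLE head."""
    config = json.loads(config_path.read_text())
    # Assumption: the flag is stored at the top level of config.json.
    if config.get("parallel_drafting") is True:
        raise ValueError(
            f"{config_path} was trained with parallel_drafting: true; "
            "SGLang cannot serve P-EAGLE heads yet. Serve it with vLLM's "
            "parallel-drafting runtime instead."
        )
```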

```python
nemo_automodel.components.speculative.serve_sglang._regenerate_token_map(
    meta_path: pathlib.Path,
    token_map_path: pathlib.Path
) -> None
```

Extract `selected_token_ids` from a recipe meta file into a SGLang token map.

```python
nemo_automodel.components.speculative.serve_sglang._rewrite_config_for_sglang(
    src_config_path: pathlib.Path,
    dst_config_path: pathlib.Path,
    algorithm: str,
    num_hidden_layers: int | None = None
) -> None
```

Copy `src_config_path` to `dst_config_path` and normalize `architectures`.

For algorithms in `_SGLANG_ARCHITECTURE_FOR_ALGORITHM` the
`architectures` field is rewritten to the SGLang-canonical class name
(e.g. `LlamaForCausalLMEagle3`). For other algorithms the original
field is preserved. When `num_hidden_layers` is provided it is written
into the config so the exported drafter reflects its actual depth rather
than the target model's depth. The write is staged through a sibling
`.tmp` file and finalized with `os.replace` so an interrupted write
cannot leave the destination half-truncated when rewriting in place.
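
The stage-then-replace pattern described above can be sketched as follows. The mapping value comes from `_SGLANG_ARCHITECTURE_FOR_ALGORITHM` as listed on this page; the function name, the JSON formatting, and the exact `.tmp` file name are assumptions:

```python
from __future__ import annotations

import json
import os
from pathlib import Path

ARCHITECTURE_FOR_ALGORITHM = {"EAGLE3": "LlamaForCausalLMEagle3"}


def rewrite_config(
    src: Path, dst: Path, algorithm: str, num_hidden_layers: int | None = None
) -> None:
    """Copy a config to `dst`, normalizing fields, via an atomic replace."""
    config = json.loads(src.read_text())
    architecture = ARCHITECTURE_FOR_ALGORITHM.get(algorithm)
    if architecture is not None:
        config["architectures"] = [architecture]
    if num_hidden_layers is not None:
        config["num_hidden_layers"] = num_hidden_layers
    # Write to a sibling temp file first, then swap it into place so that
    # `dst` is never observed half-written, even when src and dst are the same.
    tmp = dst.with_name(dst.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2))
    os.replace(tmp, dst)
```

Because the temp file sits in the same directory as the destination, `os.replace` stays on one filesystem, which is what makes the swap atomic on POSIX systems.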

```python
nemo_automodel.components.speculative.serve_sglang._torch_load(
    path: pathlib.Path
) -> typing.Any
```

Load a torch pickle, preferring `weights_only=True` when supported.

```python
nemo_automodel.components.speculative.serve_sglang.build_sglang_argv(
    args: argparse.Namespace
) -> list[str]
```

Build the `python -m sglang.launch_server` argv for a given config.
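
As an illustration only, the sketch below maps the CLI options from the usage example onto SGLang's speculative-decoding server flags. The `build_argv` helper, its extra parameters, and the use of `sys.executable` are hypothetical; the `--speculative-*` flag names should be checked against the installed SGLang version:

```python
from __future__ import annotations

import argparse
import sys


def build_argv(
    args: argparse.Namespace, draft_path: str, token_map: str | None = None
) -> list[str]:
    """Assemble a launch command for sglang.launch_server (illustrative)."""
    argv = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", args.target,
        "--speculative-algorithm", args.algorithm,
        "--speculative-draft-model-path", draft_path,
        "--speculative-num-steps", str(args.num_steps),
        "--speculative-eagle-topk", str(args.topk),
        "--speculative-num-draft-tokens", str(args.num_draft_tokens),
    ]
    # Only pass a token map when one was resolved for the drafter.
    if token_map is not None:
        argv += ["--speculative-token-map", token_map]
    return argv
```

Building the command as a list rather than a shell string avoids quoting problems with paths, and `shlex.join(argv)` gives a printable form for a `--print-only` style preview.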

```python
nemo_automodel.components.speculative.serve_sglang.main(
    argv: list[str] | None = None
) -> int
```

Validate the environment, resolve the drafter ckpt, then exec sglang.

Returns the SGLang server's exit code, or `2` if SGLang or safetensors
is missing.

```python
nemo_automodel.components.speculative.serve_sglang.resolve_draft_artifacts(
    draft: str,
    algorithm: str,
    dry_run: bool = False
) -> tuple[str, str | None]
```

Resolve a user-supplied drafter path to the model and token-map paths SGLang expects.

Accepts either the outer `epoch_<E>_step_<S>` directory or the inner
`model/` directory; HF Hub repo ids are passed through untouched.

**Parameters:**

`draft`: A local path or HF Hub repo id.

`algorithm`: Speculative algorithm name.

`dry_run`: When True, no on-disk export is performed and the returned
paths reflect what *would* be produced on a real launch.

**Returns:** `tuple[str, str | None]`

`(draft_path, token_map_path_or_None)` suitable for SGLang flags.
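
The path handling alone (without token-map or legacy-export logic) might look like the sketch below. It assumes that any string that is not an existing local path is treated as an HF Hub repo id; that rule and the name `resolve_model_dir` are assumptions, not the module's documented behavior:

```python
from pathlib import Path


def resolve_model_dir(draft: str) -> str:
    """Map a user-supplied drafter location to the directory to serve."""
    path = Path(draft)
    if not path.exists():
        # Assumption: not a local path, so pass it through as a Hub repo id.
        return draft
    inner = path / "model"
    if inner.is_dir():
        # Outer epoch_<E>_step_<S> directory: descend into model/.
        return str(inner)
    # Already the inner model/ directory (or another local layout).
    return str(path)
```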

```python
nemo_automodel.components.speculative.serve_sglang._SAFETENSORS_INSTALL_HINT = 'safetensors is required to export Automodel EAGLE checkpoints for SGLang. Insta...
```

```python
nemo_automodel.components.speculative.serve_sglang._SGLANG_ARCHITECTURE_FOR_ALGORITHM = {'EAGLE3': 'LlamaForCausalLMEagle3'}
```

```python
nemo_automodel.components.speculative.serve_sglang._SGLANG_INSTALL_HINT = 'sglang is not installed in this environment. Install it manually with `uv pip i...
```

```python
nemo_automodel.components.speculative.serve_sglang.logger = logging.getLogger(__name__)
```