> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.speculative.bench_sglang

Offline acceptance / speedup benchmark for a trained EAGLE drafter on SGLang.

Training reports draft loss and top-1 token accuracy, but the metric that
actually matters for deployment is the *speculative-decoding acceptance length*:
how many draft tokens the target accepts per verification step. This script
drives a workload against a running SGLang server that hosts the drafter and
reports:

* `accept_length` -- SGLang's `avg_spec_accept_length` (mean tokens emitted
  per target verify step, including the one guaranteed bonus token). This is the
  "tokens accepted" headline number.
* `acceptance_rate` -- the fraction of the proposed draft chain that is
  accepted, derived as `(accept_length - 1) / speculative_num_steps`.
* `output_throughput_tok_s` -- measured decode throughput (output tokens per
  wall-clock second).
* `speedup` -- optional: `output_throughput` divided by the same workload's
  throughput against a `--baseline-server` running *without* speculation.

The acceptance length is read exactly the way SGLang's own `bench_serving`
reads it -- `GET /server_info` -> `internal_states[0].avg_spec_accept_length`
(unwrapping a `decode` stage for PD-disaggregated servers). Because that value
is a server-cumulative running average, point this benchmark at a *freshly
started* server dedicated to the run for an accurate number.

Typical usage (after `serve_sglang` launches the drafter on port 30000):

python -m nemo\_automodel.components.speculative.bench\_sglang \
\--server [http://localhost:30000](http://localhost:30000) \
\--model meta-llama/Llama-3.1-8B-Instruct \
\--input-data Aeala/ShareGPT\_Vicuna\_unfiltered \
\--num-prompts 64 --concurrency 16 --max-new-tokens 256

Add `--baseline-server http://localhost:30001` (a second server started
without `--speculative-algorithm`) to also report the end-to-end speedup.

SGLang is intentionally NOT a dependency of this script -- it talks to the
server over HTTP, so only `aiohttp` is required (already pulled in by the
project). The server itself must be running separately; see `serve_sglang`.

## Module Contents

### Classes

| Name                                                                                   | Description                                              |
| -------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| [`WorkloadResult`](#nemo_automodel-components-speculative-bench_sglang-WorkloadResult) | Aggregate timing for one workload pass against a server. |

### Functions

| Name                                                                                                   | Description                                                                            |
| ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------- |
| [`_acceptance_rate`](#nemo_automodel-components-speculative-bench_sglang-_acceptance_rate)             | Fraction of the proposed draft chain accepted: `(accept_length - 1) / num_steps`.      |
| [`_build_parser`](#nemo_automodel-components-speculative-bench_sglang-_build_parser)                   | -                                                                                      |
| [`_chat_completion`](#nemo_automodel-components-speculative-bench_sglang-_chat_completion)             | POST one chat completion and return its `completion_tokens` (0 on no usage).           |
| [`_extract_accept_length`](#nemo_automodel-components-speculative-bench_sglang-_extract_accept_length) | Read `avg_spec_accept_length` the way SGLang's bench\_serving does.                    |
| [`_extract_num_steps`](#nemo_automodel-components-speculative-bench_sglang-_extract_num_steps)         | Read `speculative_num_steps` from `/server_info` if the server reports it.             |
| [`_fetch_server_info`](#nemo_automodel-components-speculative-bench_sglang-_fetch_server_info)         | GET `&lt;server&gt;/server_info`; return the parsed JSON or `None` on failure.         |
| [`_internal_state`](#nemo_automodel-components-speculative-bench_sglang-_internal_state)               | Return `internal_states[0]` from a `/server_info` payload, or `None`.                  |
| [`_load_prompts`](#nemo_automodel-components-speculative-bench_sglang-_load_prompts)                   | Load up to `--num-prompts` chat prompts (trailing assistant turn dropped).             |
| [`_normalize_server_url`](#nemo_automodel-components-speculative-bench_sglang-_normalize_server_url)   | Return the SGLang root URL without a trailing slash or `/v1` suffix.                   |
| [`_output_throughput`](#nemo_automodel-components-speculative-bench_sglang-_output_throughput)         | Output tokens per wall-clock second, or `None` if nothing was timed.                   |
| [`_run`](#nemo_automodel-components-speculative-bench_sglang-_run)                                     | Async driver: load prompts, run the workload(s), report metrics. Returns an exit code. |
| [`_run_workload`](#nemo_automodel-components-speculative-bench_sglang-_run_workload)                   | Send every prompt through `&lt;server&gt;/v1/chat/completions` and time the pass.      |
| [`_speedup`](#nemo_automodel-components-speculative-bench_sglang-_speedup)                             | Return `spec / baseline` output throughput, or `None` if not computable.               |
| [`_summarize`](#nemo_automodel-components-speculative-bench_sglang-_summarize)                         | Assemble the metrics dict reported to stdout / `--output-json`.                        |
| [`_unwrap_server_info`](#nemo_automodel-components-speculative-bench_sglang-_unwrap_server_info)       | Return the dict that holds `internal_states`.                                          |
| [`_validate_args`](#nemo_automodel-components-speculative-bench_sglang-_validate_args)                 | Reject invalid CLI values before any network work starts.                              |
| [`main`](#nemo_automodel-components-speculative-bench_sglang-main)                                     | CLI entry point. Parses `argv` and returns the process exit code.                      |

### Data

[`logger`](#nemo_automodel-components-speculative-bench_sglang-logger)

### API

```python
class nemo_automodel.components.speculative.bench_sglang.WorkloadResult(
    wall_clock_s: float,
    output_tokens: int,
    completed: int,
    failed: int
)
```

Dataclass

Aggregate timing for one workload pass against a server.

```python
nemo_automodel.components.speculative.bench_sglang._acceptance_rate(
    accept_length: float | None,
    num_steps: int | None
) -> float | None
```

Fraction of the proposed draft chain accepted: `(accept_length - 1) / num_steps`.

`accept_length` counts the one guaranteed bonus token from the target, so
`accept_length - 1` is the mean number of *draft* tokens accepted per step,
and dividing by the proposed depth `num_steps` gives a \[0, 1] rate. This is
exact for a linear draft chain (topk=1) and approximate for tree drafting.
Returns `None` when either input is unavailable.

```python
nemo_automodel.components.speculative.bench_sglang._build_parser() -> argparse.ArgumentParser
```

```python
nemo_automodel.components.speculative.bench_sglang._chat_completion(
    session,
    url: str,
    payload: dict[str, typing.Any],
    timeout_s: float,
    max_retries: int
) -> int
```

async

POST one chat completion and return its `completion_tokens` (0 on no usage).

```python
nemo_automodel.components.speculative.bench_sglang._extract_accept_length(
    server_info_json: typing.Any
) -> float | None
```

Read `avg_spec_accept_length` the way SGLang's bench\_serving does.

```python
nemo_automodel.components.speculative.bench_sglang._extract_num_steps(
    server_info_json: typing.Any
) -> int | None
```

Read `speculative_num_steps` from `/server_info` if the server reports it.

```python
nemo_automodel.components.speculative.bench_sglang._fetch_server_info(
    server: str,
    timeout_s: float
) -> dict[str, typing.Any] | None
```

async

GET `&lt;server&gt;/server_info`; return the parsed JSON or `None` on failure.

```python
nemo_automodel.components.speculative.bench_sglang._internal_state(
    server_info_json: typing.Any
) -> dict[str, typing.Any] | None
```

Return `internal_states[0]` from a `/server_info` payload, or `None`.

```python
nemo_automodel.components.speculative.bench_sglang._load_prompts(
    args: argparse.Namespace
) -> list[list[dict[str, typing.Any]]]
```

Load up to `--num-prompts` chat prompts (trailing assistant turn dropped).

```python
nemo_automodel.components.speculative.bench_sglang._normalize_server_url(
    url: str
) -> str
```

Return the SGLang root URL without a trailing slash or `/v1` suffix.

Chat completions live at `&lt;root&gt;/v1/chat/completions` and server info at
`&lt;root&gt;/server_info`; accept either `http://host:port` or the OpenAI-style
`http://host:port/v1` so the flag is forgiving.

```python
nemo_automodel.components.speculative.bench_sglang._output_throughput(
    result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult
) -> float | None
```

Output tokens per wall-clock second, or `None` if nothing was timed.

```python
nemo_automodel.components.speculative.bench_sglang._run(
    args: argparse.Namespace
) -> int
```

async

Async driver: load prompts, run the workload(s), report metrics. Returns an exit code.

```python
nemo_automodel.components.speculative.bench_sglang._run_workload(
    server: str,
    prompts: list[list[dict[str, typing.Any]]],
    gen_cfg: nemo_automodel.components.speculative.regenerate.GenerationConfig,
    concurrency: int,
    timeout_s: float,
    max_retries: int
) -> nemo_automodel.components.speculative.bench_sglang.WorkloadResult
```

async

Send every prompt through `&lt;server&gt;/v1/chat/completions` and time the pass.

```python
nemo_automodel.components.speculative.bench_sglang._speedup(
    spec_throughput: float | None,
    baseline_throughput: float | None
) -> float | None
```

Return `spec / baseline` output throughput, or `None` if not computable.

```python
nemo_automodel.components.speculative.bench_sglang._summarize(
    gen_cfg: nemo_automodel.components.speculative.regenerate.GenerationConfig,
    spec_result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult,
    server_info: dict[str, typing.Any] | None,
    num_steps_arg: int | None,
    baseline_result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult | None
) -> dict[str, typing.Any]
```

Assemble the metrics dict reported to stdout / `--output-json`.

```python
nemo_automodel.components.speculative.bench_sglang._unwrap_server_info(
    server_info_json: typing.Any
) -> dict[str, typing.Any] | None
```

Return the dict that holds `internal_states`.

PD-disaggregated servers nest the decode engine's state under a `decode`
list; `bench_serving` unwraps `server_info_json["decode"][0]` before
reading `internal_states`. Mirror that so both server topologies work.

```python
nemo_automodel.components.speculative.bench_sglang._validate_args(
    args: argparse.Namespace
) -> None
```

Reject invalid CLI values before any network work starts.

```python
nemo_automodel.components.speculative.bench_sglang.main(
    argv: list[str] | None = None
) -> int
```

CLI entry point. Parses `argv` and returns the process exit code.

```python
nemo_automodel.components.speculative.bench_sglang.logger = logging.getLogger(__name__)
```