nemo_automodel.components.speculative.bench_sglang

Offline acceptance / speedup benchmark for a trained EAGLE drafter on SGLang.

Training reports draft loss and top-1 token accuracy, but the metric that actually matters for deployment is the speculative-decoding acceptance length: how many draft tokens the target accepts per verification step. This script drives a workload against a running SGLang server that hosts the drafter and reports:

accept_length — SGLang’s avg_spec_accept_length (mean tokens emitted per target verify step, including the one guaranteed bonus token). This is the “tokens accepted” headline number.
acceptance_rate — the fraction of the proposed draft chain that is accepted, derived as (accept_length - 1) / speculative_num_steps.
output_throughput_tok_s — measured decode throughput (output tokens per wall-clock second).
speedup — optional: output_throughput divided by the same workload’s throughput against a --baseline-server running without speculation.

The acceptance length is read exactly the way SGLang’s own bench_serving reads it — GET /server_info -> internal_states[0].avg_spec_accept_length (unwrapping a decode stage for PD-disaggregated servers). Because that value is a server-cumulative running average, point this benchmark at a freshly started server dedicated to the run for an accurate number.

Typical usage (after serve_sglang launches the drafter on port 30000):

python -m nemo_automodel.components.speculative.bench_sglang
—server http://localhost:30000
—model meta-llama/Llama-3.1-8B-Instruct
—input-data Aeala/ShareGPT_Vicuna_unfiltered
—num-prompts 64 —concurrency 16 —max-new-tokens 256

Add --baseline-server http://localhost:30001 (a second server started without --speculative-algorithm) to also report the end-to-end speedup.

SGLang is intentionally NOT a dependency of this script — it talks to the server over HTTP, so only aiohttp is required (already pulled in by the project). The server itself must be running separately; see serve_sglang.

Module Contents

Classes

Name	Description
`WorkloadResult`	Aggregate timing for one workload pass against a server.

Functions

Name	Description
`_acceptance_rate`	Fraction of the proposed draft chain accepted: `(accept_length - 1) / num_steps`.
`_build_parser`	-
`_chat_completion`	POST one chat completion and return its `completion_tokens` (0 on no usage).
`_extract_accept_length`	Read `avg_spec_accept_length` the way SGLang’s bench_serving does.
`_extract_num_steps`	Read `speculative_num_steps` from `/server_info` if the server reports it.
`_fetch_server_info`	GET `<server>/server_info`; return the parsed JSON or `None` on failure.
`_internal_state`	Return `internal_states[0]` from a `/server_info` payload, or `None`.
`_load_prompts`	Load up to `--num-prompts` chat prompts (trailing assistant turn dropped).
`_normalize_server_url`	Return the SGLang root URL without a trailing slash or `/v1` suffix.
`_output_throughput`	Output tokens per wall-clock second, or `None` if nothing was timed.
`_run`	Async driver: load prompts, run the workload(s), report metrics. Returns an exit code.
`_run_workload`	Send every prompt through `<server>/v1/chat/completions` and time the pass.
`_speedup`	Return `spec / baseline` output throughput, or `None` if not computable.
`_summarize`	Assemble the metrics dict reported to stdout / `--output-json`.
`_unwrap_server_info`	Return the dict that holds `internal_states`.
`_validate_args`	Reject invalid CLI values before any network work starts.
`main`	CLI entry point. Parses `argv` and returns the process exit code.

Data

logger

API

class nemo_automodel.components.speculative.bench_sglang.WorkloadResult(
    wall_clock_s: float,
    output_tokens: int,
    completed: int,
    failed: int
)

Dataclass

Aggregate timing for one workload pass against a server.

completed

int

failed

int

output_tokens

int

wall_clock_s

float

nemo_automodel.components.speculative.bench_sglang._acceptance_rate(
    accept_length: float | None,
    num_steps: int | None
) -> float | None

Fraction of the proposed draft chain accepted: (accept_length - 1) / num_steps.

accept_length counts the one guaranteed bonus token from the target, so accept_length - 1 is the mean number of draft tokens accepted per step, and dividing by the proposed depth num_steps gives a [0, 1] rate. This is exact for a linear draft chain (topk=1) and approximate for tree drafting. Returns None when either input is unavailable.

nemo_automodel.components.speculative.bench_sglang._build_parser() -> argparse.ArgumentParser

nemo_automodel.components.speculative.bench_sglang._chat_completion(
    session,
    url: str,
    payload: dict[str, typing.Any],
    timeout_s: float,
    max_retries: int
) -> int

async

POST one chat completion and return its completion_tokens (0 on no usage).

nemo_automodel.components.speculative.bench_sglang._extract_accept_length(
    server_info_json: typing.Any
) -> float | None

Read avg_spec_accept_length the way SGLang’s bench_serving does.

nemo_automodel.components.speculative.bench_sglang._extract_num_steps(
    server_info_json: typing.Any
) -> int | None

Read speculative_num_steps from /server_info if the server reports it.

nemo_automodel.components.speculative.bench_sglang._fetch_server_info(
    server: str,
    timeout_s: float
) -> dict[str, typing.Any] | None

async

GET <server>/server_info; return the parsed JSON or None on failure.

nemo_automodel.components.speculative.bench_sglang._internal_state(
    server_info_json: typing.Any
) -> dict[str, typing.Any] | None

Return internal_states[0] from a /server_info payload, or None.

nemo_automodel.components.speculative.bench_sglang._load_prompts(
    args: argparse.Namespace
) -> list[list[dict[str, typing.Any]]]

Load up to --num-prompts chat prompts (trailing assistant turn dropped).

nemo_automodel.components.speculative.bench_sglang._normalize_server_url(
    url: str
) -> str

Return the SGLang root URL without a trailing slash or /v1 suffix.

Chat completions live at <root>/v1/chat/completions and server info at <root>/server_info; accept either http://host:port or the OpenAI-style http://host:port/v1 so the flag is forgiving.

nemo_automodel.components.speculative.bench_sglang._output_throughput(
    result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult
) -> float | None

Output tokens per wall-clock second, or None if nothing was timed.

nemo_automodel.components.speculative.bench_sglang._run(
    args: argparse.Namespace
) -> int

async

Async driver: load prompts, run the workload(s), report metrics. Returns an exit code.

nemo_automodel.components.speculative.bench_sglang._run_workload(
    server: str,
    prompts: list[list[dict[str, typing.Any]]],
    gen_cfg: nemo_automodel.components.speculative.regenerate.GenerationConfig,
    concurrency: int,
    timeout_s: float,
    max_retries: int
) -> nemo_automodel.components.speculative.bench_sglang.WorkloadResult

async

Send every prompt through <server>/v1/chat/completions and time the pass.

nemo_automodel.components.speculative.bench_sglang._speedup(
    spec_throughput: float | None,
    baseline_throughput: float | None
) -> float | None

Return spec / baseline output throughput, or None if not computable.

nemo_automodel.components.speculative.bench_sglang._summarize(
    gen_cfg: nemo_automodel.components.speculative.regenerate.GenerationConfig,
    spec_result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult,
    server_info: dict[str, typing.Any] | None,
    num_steps_arg: int | None,
    baseline_result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult | None
) -> dict[str, typing.Any]

Assemble the metrics dict reported to stdout / --output-json.

nemo_automodel.components.speculative.bench_sglang._unwrap_server_info(
    server_info_json: typing.Any
) -> dict[str, typing.Any] | None

Return the dict that holds internal_states.

PD-disaggregated servers nest the decode engine’s state under a decode list; bench_serving unwraps server_info_json["decode"][0] before reading internal_states. Mirror that so both server topologies work.

nemo_automodel.components.speculative.bench_sglang._validate_args(
    args: argparse.Namespace
) -> None

Reject invalid CLI values before any network work starts.

nemo_automodel.components.speculative.bench_sglang.main(
    argv: list[str] | None = None
) -> int

CLI entry point. Parses argv and returns the process exit code.

nemo_automodel.components.speculative.bench_sglang.logger = logging.getLogger(__name__)