nemo_automodel.components.speculative.bench_sglang

View as Markdown

Offline acceptance / speedup benchmark for a trained EAGLE drafter on SGLang.

Training reports draft loss and top-1 token accuracy, but the metric that actually matters for deployment is the speculative-decoding acceptance length: how many draft tokens the target accepts per verification step. This script drives a workload against a running SGLang server that hosts the drafter and reports:

  • accept_length — SGLang’s avg_spec_accept_length (mean tokens emitted per target verify step, including the one guaranteed bonus token). This is the “tokens accepted” headline number.
  • acceptance_rate — the fraction of the proposed draft chain that is accepted, derived as (accept_length - 1) / speculative_num_steps.
  • output_throughput_tok_s — measured decode throughput (output tokens per wall-clock second).
  • speedup — optional: output_throughput divided by the same workload’s throughput against a --baseline-server running without speculation.

The acceptance length is read exactly the way SGLang’s own bench_serving reads it — GET /server_info -> internal_states[0].avg_spec_accept_length (unwrapping a decode stage for PD-disaggregated servers). Because that value is a server-cumulative running average, point this benchmark at a freshly started server dedicated to the run for an accurate number.

Typical usage (after serve_sglang launches the drafter on port 30000):

python -m nemo_automodel.components.speculative.bench_sglang
—server http://localhost:30000
—model meta-llama/Llama-3.1-8B-Instruct
—input-data Aeala/ShareGPT_Vicuna_unfiltered
—num-prompts 64 —concurrency 16 —max-new-tokens 256

Add --baseline-server http://localhost:30001 (a second server started without --speculative-algorithm) to also report the end-to-end speedup.

SGLang is intentionally NOT a dependency of this script — it talks to the server over HTTP, so only aiohttp is required (already pulled in by the project). The server itself must be running separately; see serve_sglang.

Module Contents

Classes

NameDescription
WorkloadResultAggregate timing for one workload pass against a server.

Functions

NameDescription
_acceptance_rateFraction of the proposed draft chain accepted: (accept_length - 1) / num_steps.
_build_parser-
_chat_completionPOST one chat completion and return its completion_tokens (0 on no usage).
_extract_accept_lengthRead avg_spec_accept_length the way SGLang’s bench_serving does.
_extract_num_stepsRead speculative_num_steps from /server_info if the server reports it.
_fetch_server_infoGET <server>/server_info; return the parsed JSON or None on failure.
_internal_stateReturn internal_states[0] from a /server_info payload, or None.
_load_promptsLoad up to --num-prompts chat prompts (trailing assistant turn dropped).
_normalize_server_urlReturn the SGLang root URL without a trailing slash or /v1 suffix.
_output_throughputOutput tokens per wall-clock second, or None if nothing was timed.
_runAsync driver: load prompts, run the workload(s), report metrics. Returns an exit code.
_run_workloadSend every prompt through <server>/v1/chat/completions and time the pass.
_speedupReturn spec / baseline output throughput, or None if not computable.
_summarizeAssemble the metrics dict reported to stdout / --output-json.
_unwrap_server_infoReturn the dict that holds internal_states.
_validate_argsReject invalid CLI values before any network work starts.
mainCLI entry point. Parses argv and returns the process exit code.

Data

logger

API

class nemo_automodel.components.speculative.bench_sglang.WorkloadResult(
wall_clock_s: float,
output_tokens: int,
completed: int,
failed: int
)
Dataclass

Aggregate timing for one workload pass against a server.

completed
int
failed
int
output_tokens
int
wall_clock_s
float
nemo_automodel.components.speculative.bench_sglang._acceptance_rate(
accept_length: float | None,
num_steps: int | None
) -> float | None

Fraction of the proposed draft chain accepted: (accept_length - 1) / num_steps.

accept_length counts the one guaranteed bonus token from the target, so accept_length - 1 is the mean number of draft tokens accepted per step, and dividing by the proposed depth num_steps gives a [0, 1] rate. This is exact for a linear draft chain (topk=1) and approximate for tree drafting. Returns None when either input is unavailable.

nemo_automodel.components.speculative.bench_sglang._build_parser() -> argparse.ArgumentParser
nemo_automodel.components.speculative.bench_sglang._chat_completion(
session,
url: str,
payload: dict[str, typing.Any],
timeout_s: float,
max_retries: int
) -> int
async

POST one chat completion and return its completion_tokens (0 on no usage).

nemo_automodel.components.speculative.bench_sglang._extract_accept_length(
server_info_json: typing.Any
) -> float | None

Read avg_spec_accept_length the way SGLang’s bench_serving does.

nemo_automodel.components.speculative.bench_sglang._extract_num_steps(
server_info_json: typing.Any
) -> int | None

Read speculative_num_steps from /server_info if the server reports it.

nemo_automodel.components.speculative.bench_sglang._fetch_server_info(
server: str,
timeout_s: float
) -> dict[str, typing.Any] | None
async

GET <server>/server_info; return the parsed JSON or None on failure.

nemo_automodel.components.speculative.bench_sglang._internal_state(
server_info_json: typing.Any
) -> dict[str, typing.Any] | None

Return internal_states[0] from a /server_info payload, or None.

nemo_automodel.components.speculative.bench_sglang._load_prompts(
args: argparse.Namespace
) -> list[list[dict[str, typing.Any]]]

Load up to --num-prompts chat prompts (trailing assistant turn dropped).

nemo_automodel.components.speculative.bench_sglang._normalize_server_url(
url: str
) -> str

Return the SGLang root URL without a trailing slash or /v1 suffix.

Chat completions live at <root>/v1/chat/completions and server info at <root>/server_info; accept either http://host:port or the OpenAI-style http://host:port/v1 so the flag is forgiving.

nemo_automodel.components.speculative.bench_sglang._output_throughput(
result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult
) -> float | None

Output tokens per wall-clock second, or None if nothing was timed.

nemo_automodel.components.speculative.bench_sglang._run(
args: argparse.Namespace
) -> int
async

Async driver: load prompts, run the workload(s), report metrics. Returns an exit code.

nemo_automodel.components.speculative.bench_sglang._run_workload(
server: str,
prompts: list[list[dict[str, typing.Any]]],
gen_cfg: nemo_automodel.components.speculative.regenerate.GenerationConfig,
concurrency: int,
timeout_s: float,
max_retries: int
) -> nemo_automodel.components.speculative.bench_sglang.WorkloadResult
async

Send every prompt through <server>/v1/chat/completions and time the pass.

nemo_automodel.components.speculative.bench_sglang._speedup(
spec_throughput: float | None,
baseline_throughput: float | None
) -> float | None

Return spec / baseline output throughput, or None if not computable.

nemo_automodel.components.speculative.bench_sglang._summarize(
gen_cfg: nemo_automodel.components.speculative.regenerate.GenerationConfig,
spec_result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult,
server_info: dict[str, typing.Any] | None,
num_steps_arg: int | None,
baseline_result: nemo_automodel.components.speculative.bench_sglang.WorkloadResult | None
) -> dict[str, typing.Any]

Assemble the metrics dict reported to stdout / --output-json.

nemo_automodel.components.speculative.bench_sglang._unwrap_server_info(
server_info_json: typing.Any
) -> dict[str, typing.Any] | None

Return the dict that holds internal_states.

PD-disaggregated servers nest the decode engine’s state under a decode list; bench_serving unwraps server_info_json["decode"][0] before reading internal_states. Mirror that so both server topologies work.

nemo_automodel.components.speculative.bench_sglang._validate_args(
args: argparse.Namespace
) -> None

Reject invalid CLI values before any network work starts.

nemo_automodel.components.speculative.bench_sglang.main(
argv: list[str] | None = None
) -> int

CLI entry point. Parses argv and returns the process exit code.

nemo_automodel.components.speculative.bench_sglang.logger = logging.getLogger(__name__)