nemo_automodel.components.speculative.bench_sglang
nemo_automodel.components.speculative.bench_sglang
Offline acceptance / speedup benchmark for a trained EAGLE drafter on SGLang.
Training reports draft loss and top-1 token accuracy, but the metric that actually matters for deployment is the speculative-decoding acceptance length: how many draft tokens the target accepts per verification step. This script drives a workload against a running SGLang server that hosts the drafter and reports:
accept_length— SGLang’savg_spec_accept_length(mean tokens emitted per target verify step, including the one guaranteed bonus token). This is the “tokens accepted” headline number.acceptance_rate— the fraction of the proposed draft chain that is accepted, derived as(accept_length - 1) / speculative_num_steps.output_throughput_tok_s— measured decode throughput (output tokens per wall-clock second).speedup— optional:output_throughputdivided by the same workload’s throughput against a--baseline-serverrunning without speculation.
The acceptance length is read exactly the way SGLang’s own bench_serving
reads it — GET /server_info -> internal_states[0].avg_spec_accept_length
(unwrapping a decode stage for PD-disaggregated servers). Because that value
is a server-cumulative running average, point this benchmark at a freshly
started server dedicated to the run for an accurate number.
Typical usage (after serve_sglang launches the drafter on port 30000):
python -m nemo_automodel.components.speculative.bench_sglang
—server http://localhost:30000
—model meta-llama/Llama-3.1-8B-Instruct
—input-data Aeala/ShareGPT_Vicuna_unfiltered
—num-prompts 64 —concurrency 16 —max-new-tokens 256
Add --baseline-server http://localhost:30001 (a second server started
without --speculative-algorithm) to also report the end-to-end speedup.
SGLang is intentionally NOT a dependency of this script — it talks to the
server over HTTP, so only aiohttp is required (already pulled in by the
project). The server itself must be running separately; see serve_sglang.
Module Contents
Classes
Functions
Data
API
Aggregate timing for one workload pass against a server.
Fraction of the proposed draft chain accepted: (accept_length - 1) / num_steps.
accept_length counts the one guaranteed bonus token from the target, so
accept_length - 1 is the mean number of draft tokens accepted per step,
and dividing by the proposed depth num_steps gives a [0, 1] rate. This is
exact for a linear draft chain (topk=1) and approximate for tree drafting.
Returns None when either input is unavailable.
POST one chat completion and return its completion_tokens (0 on no usage).
Read avg_spec_accept_length the way SGLang’s bench_serving does.
Read speculative_num_steps from /server_info if the server reports it.
GET <server>/server_info; return the parsed JSON or None on failure.
Return internal_states[0] from a /server_info payload, or None.
Load up to --num-prompts chat prompts (trailing assistant turn dropped).
Return the SGLang root URL without a trailing slash or /v1 suffix.
Chat completions live at <root>/v1/chat/completions and server info at
<root>/server_info; accept either http://host:port or the OpenAI-style
http://host:port/v1 so the flag is forgiving.
Output tokens per wall-clock second, or None if nothing was timed.
Async driver: load prompts, run the workload(s), report metrics. Returns an exit code.
Send every prompt through <server>/v1/chat/completions and time the pass.
Return spec / baseline output throughput, or None if not computable.
Assemble the metrics dict reported to stdout / --output-json.
Return the dict that holds internal_states.
PD-disaggregated servers nest the decode engine’s state under a decode
list; bench_serving unwraps server_info_json["decode"][0] before
reading internal_states. Mirror that so both server topologies work.
Reject invalid CLI values before any network work starts.
CLI entry point. Parses argv and returns the process exit code.