Mocker is the live simulated engine in DynoSim. It runs as a Dynamo backend, registers workers, publishes KV events, and exercises the real frontend/router/planner path without requiring GPUs.
The mocker core is implemented in Rust and models the scheduling, memory management, and timing behavior of production engines. It can use polynomial timing, profile-derived timing, or AIC-backed timing. AIC predicts prefill/decode duration; Mocker still owns the scheduler, KV cache lifecycle, prefix-cache behavior, and request execution model.
The mocker simulates:
Note: While the mocker uses vLLM as its primary reference implementation, these core components—block-based KV cache management, continuous batching schedulers, LRU evictors, and prefix caching—are fundamental to all modern LLM inference engines, including SGLang and TensorRT-LLM. The architectural patterns simulated here are engine-agnostic and apply broadly across the inference ecosystem.
Note: For local scale tests and router benchmarks, prefer --num-workers over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.
Mocker also powers DynoSim runs through the dedicated python -m dynamo.replay CLI, which exposes
offline|online, round_robin|kv_router, arrival_speedup_ratio, closed-loop concurrency
admission, synthetic workload generation, and offline disaggregated prefill/decode simulation directly:
The DynoSim CLI defaults to --replay-mode offline and --router-mode round_robin. Aggregated
runs use --extra-engine-args. Offline disaggregated runs instead use
--prefill-engine-args plus --decode-engine-args, together with
--num-prefill-workers and --num-decode-workers.
The same CLI also supports synthetic workloads without a trace file:
Synthetic workloads also support shared-prefix and multi-turn tests:
For trace files, DynoSim also understands multi-turn sessions when records share session_id. The
first turn uses timestamp/created_time; later turns can use delay or delay_ms:
For trace-file runs, --trace-block-size controls how many tokens each hash_id represents in
the dataset, while engine block_size still controls the simulated engine and router hashing. Public
Mooncake/toolagent traces use --trace-block-size 512; engine block_size can still stay at 64
to match the live runtime configuration.
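As a rough illustration of how the two block sizes relate, the sketch below counts the engine blocks a trace record would span. It assumes every hash_id covers exactly trace_block_size tokens (a real record may end in a partially filled block), and the helper name is hypothetical rather than part of DynoSim.

```python
import math


def engine_blocks_for_record(num_hash_ids: int,
                             trace_block_size: int = 512,
                             engine_block_size: int = 64) -> int:
    """Hypothetical helper: engine blocks needed for one trace record.

    Assumption: each hash_id stands for exactly trace_block_size tokens.
    """
    tokens = num_hash_ids * trace_block_size
    return math.ceil(tokens / engine_block_size)


# Three 512-token trace blocks map onto 64-token engine blocks.
print(engine_blocks_for_record(3))  # 24
```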
The standalone DynoSim CLI prints an AIPerf-style summary table to stdout and writes the full report JSON to disk.
Timing semantics:
For full usage, constraints, and benchmarking guidance, see DynoSim Runs.
DynoSim runs support aggregated vllm and sglang engine configs. Internally the simulator uses canonical
block_size; for sglang, sglang.page_size is still accepted as a compatibility alias as long
as it matches block_size when both are provided.
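The alias rule above can be sketched as a small validation step. This is an illustrative reading of the rule, not the simulator's actual code; the function name and the None return for "neither provided" are assumptions.

```python
from typing import Optional


def resolve_block_size(block_size: Optional[int] = None,
                       page_size: Optional[int] = None) -> Optional[int]:
    """Hypothetical resolver for block_size and its sglang.page_size alias.

    Returns the canonical block size, or None when neither value is given
    (what the simulator does in that case is not modeled here).
    """
    if block_size is not None and page_size is not None:
        if block_size != page_size:
            raise ValueError(
                f"sglang.page_size ({page_size}) must match "
                f"block_size ({block_size}) when both are provided"
            )
        return block_size
    # Only one (or neither) was provided: the alias stands in for block_size.
    return block_size if block_size is not None else page_size
```

With this reading, resolve_block_size(64, 64) and resolve_block_size(None, 64) both yield 64, while resolve_block_size(64, 32) is rejected.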
Offline DynoSim runs also support disaggregated kv_router mode. In that mode:
- --prefill-engine-args must describe a prefill worker
- --decode-engine-args must describe a decode worker
- --router-mode must be kv_router

Example:
By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass --planner-profile-data with either:
- an .npz file, or
- a profiler-style results directory

The mocker automatically accepts profiler-style results directories and converts them internally.
It also accepts older raw-data directories containing:
- prefill_raw_data.json
- decode_raw_data.json

To use the AIC SDK for latency prediction:
The AIC model automatically uses --model-path and --engine-type to select the appropriate performance data. Available systems include h200_sxm, h100_sxm, etc. (see AIC SDK documentation for the full list).
Important notes:
- Without --aic-perf-model, python -m dynamo.mocker does not use AIC.
- python -m dynamo.replay has two separate AIC surfaces:
  - --extra-engine-args / staged engine JSON
  - --aic-* flags plus router_prefill_load_model="aic" in --router-config
- dynamo._internal.aic module. Mocker CLI behavior is unchanged; this just removes duplicate AIC session code.
- aiconfigurator must be able to load the requested performance database for the selected system/backend/version. If the SDK is installed but the backing systems data is missing or unreadable, mocker now fails fast at startup with a clear error instead of failing later on first request.
- aiconfigurator with real Git LFS payloads materialized in its systems/ directory.

This mocker AIC path is separate from the router-side prefill-load estimator. Live router,
frontend, and DynoSim runs all use router_prefill_load_model="aic" plus top-level --aic-* flags for
oldest-prefill prompt-load decay. DynoSim still uses engine-args AIC separately when you want the
mocked worker timing model itself to come from AIC.
For aggregated DynoSim runs, engine timing AIC still comes from --extra-engine-args:
For offline disaggregated DynoSim runs, pass the staged engine configs instead:
The aic_backend field enables the AIC perf model and should match engine_type ("vllm" or "sglang"). The aic_model_path field is the equivalent of --model-path in dynamo.mocker.
DynoSim router-side AIC prompt-load modeling is configured separately with top-level flags:
For MoE models that require AIC MoE parallelism, pass the same fields on the router-side AIC surface.
For Kimi-style TP-only MoE simulation, use --aic-moe-tp-size equal to --aic-tp-size,
--aic-moe-ep-size 1, and --aic-attention-dp-size 1.
For offline disaggregated DynoSim runs, the same top-level --aic-* flags drive the prefill-stage router only;
the decode-stage router keeps prompt tracking disabled.
Example --reasoning configuration:
The profile results directory should contain:
- selected_prefill_interpolation/raw_data.npz
- selected_decode_interpolation/raw_data.npz

To generate profile data for your own model and hardware, run the profiler and then point --planner-profile-data at the resulting output directory.
The default event path uses the local indexer / event-plane subscriber flow. The older durable KV-events mode is still available through --durable-kv-events, but it is deprecated and should not be the preferred setup for new tests.
For router and indexer experiments that need native wire-format event forwarding, the mocker also supports a ZMQ path:
- --event-plane zmq
- --zmq-kv-events-ports for per-worker PUB base ports
- --zmq-replay-ports for optional replay/gap-recovery ROUTER base ports

When set, each worker binds on its base port plus dp_rank, so the number of comma-separated base ports must match --num-workers.
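The base-port rule can be illustrated with a short sketch. The function name, the dp_size parameter, and the example port numbers are assumptions made for illustration; only the "base port plus dp_rank" arithmetic and the one-base-port-per-worker check come from the text above.

```python
from typing import List


def worker_bind_ports(base_ports_csv: str, num_workers: int,
                      dp_size: int = 1) -> List[List[int]]:
    """Hypothetical helper: ports each worker would bind, one per dp_rank."""
    base_ports = [int(p) for p in base_ports_csv.split(",") if p.strip()]
    if len(base_ports) != num_workers:
        raise ValueError(
            f"expected {num_workers} base ports, got {len(base_ports)}"
        )
    # Each worker binds on its base port plus dp_rank.
    return [[base + dp_rank for dp_rank in range(dp_size)]
            for base in base_ports]


# Two workers, two DP ranks each (port numbers are arbitrary examples).
print(worker_bind_ports("5557,5567", num_workers=2, dp_size=2))
# [[5557, 5558], [5567, 5568]]
```

Base ports therefore need enough spacing between them that the per-rank ranges of different workers do not overlap.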
--bootstrap-ports takes a comma-separated list of base ports, one per worker. In multi-worker mode, the number of listed ports must exactly match --num-workers.
Prefill workers listen on these ports and publish the bootstrap endpoint through discovery. Decode workers use the matching ports to rendezvous before decode begins.
The mocker can be deployed through example DynamoGraphDeployment manifests for both aggregated and disaggregated setups:
The mocker is organized into several cooperating components that mirror the internal architecture of production LLM inference engines. The scheduler (vLLM-style and SGLang-style variants) and KV block manager live inside the engine core. Multi-engine behavior — KV transfer/offloading simulation, KV router simulation, planner simulation — is added by the DynoSim run harness on top of multiple engine cores; see DynoSim Runs for the component-level diagram and for offline internals under lib/mocker/src/replay/offline/.
The mocker now has two scheduler shapes rather than one generic queue model:
- waiting + running scheduler. Each request tracks computed tokens, the scheduler spends one token budget across the running set first, and decode pressure triggers inline preemption of running requests.

Both schedulers simulate continuous batching, prefix reuse, chunked prefill, memory pressure, and decode token emission while publishing metrics about current resource utilization.
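To make the token-budget idea concrete, here is a deliberately simplified sketch of one scheduling step for the waiting + running scheduler described above. It ignores KV memory limits and preemption, and every name in it is hypothetical; the real scheduler is implemented in Rust and is considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    prompt_tokens: int
    computed_tokens: int = 0

    def tokens_wanted(self) -> int:
        # Remaining prefill tokens, or one token per step once decoding.
        remaining = self.prompt_tokens - self.computed_tokens
        return remaining if remaining > 0 else 1


def schedule_step(running: List[Request], waiting: List[Request],
                  token_budget: int) -> Dict[str, int]:
    """Return {request name: tokens scheduled} for a single step."""
    scheduled: Dict[str, int] = {}
    # 1. Spend the budget on already-running requests first.
    for req in running:
        if token_budget == 0:
            break
        n = min(req.tokens_wanted(), token_budget)
        scheduled[req.name] = n
        token_budget -= n
    # 2. Admit waiting requests with whatever budget is left; a long
    #    prompt only gets a chunk of its prefill (chunked prefill).
    while waiting and token_budget > 0:
        req = waiting.pop(0)
        n = min(req.tokens_wanted(), token_budget)
        scheduled[req.name] = n
        token_budget -= n
        running.append(req)
    # 3. Record progress so the next step sees updated computed tokens.
    for req in running:
        req.computed_tokens += scheduled.get(req.name, 0)
    return scheduled


running = [Request("a", prompt_tokens=10, computed_tokens=10)]  # decoding
waiting = [Request("b", prompt_tokens=100)]                     # not started
print(schedule_step(running, waiting, token_budget=32))  # {'a': 1, 'b': 31}
```

The decoding request takes one token, and the newly admitted request receives the remaining 31 tokens as its first prefill chunk.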
When resources become constrained, the mocker simulates the engine’s real recovery path:
The mocker’s KV block manager is now built on kvbm-logical::BlockManager<G1>, the same logical block manager the real Dynamo runtime uses. The mocker wraps it in lib/mocker/src/kv_manager/kvbm_backend.rs and translates its own MoveBlock protocol onto kvbm-logical’s RAII lifecycle (allocate → stage → register → drop).
Blocks still conceptually live in one of two pools:
- MutableBlock<G1>; full blocks are held as ImmutableBlock<G1> clones (the clone vec length is the mocker’s refcount, one per Use).

The lifecycle is RAII: dropping the last ImmutableBlock clone transitions the block from active to inactive (kvbm-logical’s reset pool), with no explicit deref/evict bookkeeping on the mocker side. When a sequence completes or is preempted, the mocker simply drops its handles; kvbm-logical recovers the capacity.
Three Use outcomes are tracked for KV-event emission: ActiveHit (bump refcount on an already-pinned block), InactiveHit (reactivate via match_blocks(plh)), and NewStore (fresh allocation). Only NewStore emits a Stored KV event — the router radix tree already knows about the other two and only forgets on explicit Removed.
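The three outcomes and the event rule can be modeled with a toy pool. This is a Python stand-in for intuition only: the class and method names are hypothetical, and the real implementation relies on kvbm-logical's RAII handles rather than explicit reference counts.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class UseOutcome(Enum):
    ACTIVE_HIT = "ActiveHit"
    INACTIVE_HIT = "InactiveHit"
    NEW_STORE = "NewStore"


class ToyBlockPool:
    """Toy model: tracks active refcounts, inactive blocks, and KV events."""

    def __init__(self) -> None:
        self.active: Dict[int, int] = {}   # block hash -> refcount
        self.inactive: Set[int] = set()    # cached but unreferenced blocks
        self.events: List[Tuple[str, int]] = []

    def use(self, block_hash: int) -> UseOutcome:
        if block_hash in self.active:
            self.active[block_hash] += 1   # bump refcount on a pinned block
            return UseOutcome.ACTIVE_HIT
        if block_hash in self.inactive:
            self.inactive.remove(block_hash)  # reactivate a cached block
            self.active[block_hash] = 1
            return UseOutcome.INACTIVE_HIT
        self.active[block_hash] = 1        # fresh allocation
        self.events.append(("Stored", block_hash))  # only NewStore emits
        return UseOutcome.NEW_STORE

    def release(self, block_hash: int) -> None:
        # Dropping the last reference moves the block to the inactive pool.
        self.active[block_hash] -= 1
        if self.active[block_hash] == 0:
            del self.active[block_hash]
            self.inactive.add(block_hash)
```

In this model a block that is stored, shared, released, and then reused produces a single Stored event, which matches the rule that the router only needs to hear about new stores and explicit removals.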
The kvbm-logical inactive pool selects eviction victims via one of three backends, exposed as MockerEvictionBackend in lib/mocker/src/common/protocols.rs:
- Lineage (default): parent-chain aware; evicts leaf blocks first, preserving shared prefix chains. Subsumes the preemption-priority behavior the old hand-rolled LRUEvictor::push_front used to provide.
- Lru: plain recency-based LRU.
- MultiLru: 4-tier frequency-aware LRU built on a TinyLFU tracker.

All three give the same “suffix blocks evicted before shared prefixes” outcome that the old evictor was designed to produce; Lineage does it structurally (via the block parent chain) rather than via monotonic counters.
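A minimal sketch of the leaf-first idea behind Lineage follows. It is not kvbm-logical's algorithm: the data layout, function names, and the tie-break among leaves (insertion order) are assumptions chosen to keep the example short.

```python
from typing import Dict, List, Optional


def pick_lineage_victim(parents: Dict[str, Optional[str]]) -> str:
    """parents maps each inactive block id to its parent id (None for a root).

    A leaf is a block that no other inactive block names as its parent.
    Assumption: ties among leaves are broken by dict insertion order.
    """
    has_child = {p for p in parents.values() if p is not None}
    leaves = [block for block in parents if block not in has_child]
    return leaves[0]


def eviction_order(parents: Dict[str, Optional[str]]) -> List[str]:
    remaining = dict(parents)
    order: List[str] = []
    while remaining:
        victim = pick_lineage_victim(remaining)
        order.append(victim)
        del remaining[victim]
    return order


# Two sequences sharing the prefix block "root": root -> a -> b and root -> c.
chains = {"root": None, "a": "root", "b": "a", "c": "root"}
print(eviction_order(chains))  # ['b', 'a', 'c', 'root']
```

The shared prefix block is evicted last because it only becomes a leaf once every suffix that depends on it is gone.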
Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (MutableBlock<G1>, still being filled) versus full (ImmutableBlock<G1>, complete and hashable for prefix caching). When a partial block fills up, it gets “promoted” to a full block with a content-based SequenceHash (or collapses onto an existing registered handle if the PLH is already present), enabling future cache hits from requests with matching prefixes.
The mocker supports three timing prediction modes:
Polynomial Model (Default): Uses hardcoded polynomial formulas that approximate typical GPU behavior. Prefill time scales quadratically with token count, while decode time depends on the total active KV cache size.
Interpolated Model: Loads actual profiling data from an NPZ file containing measured prefill and decode latencies. The mocker interpolates between data points to predict timing for any input size. This enables high-fidelity simulation matching a specific hardware configuration.
AIC Model (--aic-perf-model): Uses the NVIDIA AI Configurator (AIC) SDK for latency prediction. AIC provides calibrated performance models for specific GPU/model/engine combinations, predicting prefill and decode latency as a function of batch size, sequence length, and prefix cache hits. The model path is automatically derived from --model-path, and the engine type from --engine-type. This mode is opt-in and requires both the aiconfigurator SDK and loadable systems/perf data for the requested tuple.
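The first two modes can be contrasted with a small sketch. The coefficients and sample points below are made up for illustration and are not the mocker's actual constants or profile data; the real interpolated model may also depend on more than one input (for example, active KV size during decode).

```python
from bisect import bisect_left
from typing import Sequence


def polynomial_prefill_ms(num_tokens: int, a: float = 1e-5,
                          b: float = 0.02, c: float = 5.0) -> float:
    """Quadratic in token count; a, b, c are illustrative values only."""
    return a * num_tokens ** 2 + b * num_tokens + c


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation over sorted xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)          # xs[i - 1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical measured prefill latencies (ms) at three prompt lengths.
measured_tokens = [0, 100, 200]
measured_ms = [0.0, 10.0, 30.0]
print(interpolate(measured_tokens, measured_ms, 150))  # 20.0
```

The polynomial mode needs no input data but only approximates typical hardware, whereas the interpolated mode is only as good as the profile points it is given.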
For disaggregated prefill/decode deployments, prefill and decode workers coordinate via a simple TCP-based rendezvous protocol. The decode worker connects to the prefill worker’s bootstrap port and waits until the prefill phase completes and KV cache is ready. Either side can arrive first—the rendezvous completes when both are ready.
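The "either side can arrive first" property can be illustrated with an in-process stand-in that uses threads instead of TCP. This is only a sketch of the synchronization pattern; the class and method names are hypothetical and nothing here reflects the actual wire protocol.

```python
import threading
from typing import Optional


class Rendezvous:
    """Toy stand-in for the bootstrap rendezvous between two workers."""

    def __init__(self) -> None:
        self._prefill_done = threading.Event()
        self._decode_arrived = threading.Event()

    def prefill_complete(self, timeout: Optional[float] = None) -> bool:
        # Prefill side: announce that KV is ready, then wait for decode.
        self._prefill_done.set()
        return self._decode_arrived.wait(timeout)

    def decode_connect(self, timeout: Optional[float] = None) -> bool:
        # Decode side: announce arrival, then wait until prefill is done.
        self._decode_arrived.set()
        return self._prefill_done.wait(timeout)


rv = Rendezvous()
decode = threading.Thread(target=rv.decode_connect, kwargs={"timeout": 2})
decode.start()                           # decode arrives first here
print(rv.prefill_complete(timeout=2))    # True once both sides are ready
decode.join()
```

Because each side sets its own flag before waiting on the other's, the order of arrival does not matter and neither side can miss the other's signal.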
The mocker simulates KV cache transfer time between prefill and decode workers. Before the prefill worker emits its first (and only) token, it sleeps for a duration based on:
- KV bytes per token: kv_bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * dtype_bytes. The dtype_bytes value is determined by --kv-cache-dtype: when set to auto (default), it uses the model’s dtype from config; when explicitly set (e.g., fp8), it uses the specified dtype instead. The per-token size can also be overridden directly with --kv-bytes-per-token.
- Transfer time: num_input_tokens * kv_bytes_per_token / bandwidth

This delay is injected after the scheduler’s prefill compute simulation completes, modeling the sequential flow: prefill computation → KV transfer → decode begins. Set --kv-transfer-bandwidth 0 to disable.
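A worked example of the two formulas, written as a sketch. The function names and the model dimensions are illustrative, and the bandwidth is assumed to be in bytes per second for this example; check the CLI help for the unit --kv-transfer-bandwidth actually expects.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # The factor 2 accounts for storing both the K and the V tensor.
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def kv_transfer_delay_s(num_input_tokens: int, bytes_per_token: int,
                        bandwidth_bytes_per_s: float) -> float:
    # A bandwidth of 0 disables the simulated transfer delay.
    if bandwidth_bytes_per_s == 0:
        return 0.0
    return num_input_tokens * bytes_per_token / bandwidth_bytes_per_s


# Illustrative dimensions: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# 4096 prompt tokens over an assumed 64e9 bytes/s link.
print(kv_transfer_delay_s(4096, per_token, 64e9))
```

Halving dtype_bytes (for example, a 1-byte KV cache dtype instead of a 2-byte one) halves both the per-token size and the simulated transfer delay.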
When prefix caching is enabled, the mocker publishes KV cache events to the distributed runtime. These events notify the system when blocks are stored (new content cached) or removed (evicted). This enables the KV-aware router to make intelligent routing decisions based on which workers have which prefixes cached.
Each scheduler publishes metrics about its current state, including the number of active decode blocks per DP rank. The router uses these metrics for load-aware routing decisions.
The mocker is particularly useful for:
For the broader mocker enhancement roadmap, see #6383.
The following features are not yet supported by the mocker: