> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Planner DynoSim Benchmarking

This guide shows how to benchmark the Dynamo Planner against a recorded trace by running it inside DynoSim. Use it to compare `agg` vs `disagg` topologies, tune SLA targets, and study how deployment realities (engine startup time, worker counts) affect planner behavior — all without bringing up a live cluster.

For the general mechanics of DynoSim runs (input format, arrival speedup, router modes, synthetic workloads), see [DynoSim Runs](/dynamo/user-guides/dynosim/runs). This guide focuses on the `--planner-config` path.

## 1. Setup

### Build

Install the Dynamo Python components and build the Rust runtime bindings. The
`python -m dynamo.replay` CLI is part of the Python components, and it imports the
runtime `_core` module from the bindings package:

```bash
.venv/bin/maturin develop --release -m lib/bindings/python/Cargo.toml
uv pip install -e .
```

The `--release` flag is strongly recommended. DynoSim execution is largely single-threaded and CPU-bound on the mocker engine core; a debug build can be 5–10× slower, which compounds across sweep runs.

### Key Planner Config Knobs

Passed as JSON via `--planner-config`. Uses the same schema as the live planner. The fields most relevant to benchmarking:

| Field | Purpose |
|---|---|
| `mode` | `"agg"` or `"disagg"` — picks scaling strategy and required engine args. |
| `optimization_target` | `"sla"` uses TTFT/ITL targets; `"throughput"` uses static queue/KV thresholds. |
| `ttft_ms` / `itl_ms` | SLA targets in ms. Drives load-scaling decisions. |
| `enable_throughput_scaling` | Periodic scaling based on predicted steady-state load. |
| `enable_load_scaling` | Reactive scaling to short-term traffic spikes. |
| `throughput_adjustment_interval_seconds` | Seconds between throughput-scaling decisions. |
| `load_adjustment_interval_seconds` | Seconds between load-scaling decisions. Short intervals mean faster reaction but more flapping. |
| `pre_deployment_sweeping_mode` | `"rapid"` uses AIC for optional bootstrap data and native perf-model identity; `"none"` lets planner warm from native AIC or live FPMs. |
| `prefill_engine_num_gpu` / `decode_engine_num_gpu` | GPUs per engine replica. **Must be set explicitly** — both default to `None`, and the simulation adapter silently treats `None` as `0`, which collapses the cumulative-GPU-hours metric in the report to zero. |
| `speculative_nextn` | Manual speculative-decode depth fallback. Replay normally supplies this from engine args, but the planner config fallback is useful when engine metadata is absent. |
| `report_filename` | Output HTML filename under `./planner_reports/`. |

### Key Engine Arg Knobs

Passed as JSON via `--extra-engine-args` (agg) or `--prefill-engine-args` / `--decode-engine-args` (disagg). DynoSim uses the mocker engine, so "engine args" means the analytical perf model inputs:

| Field | Purpose |
|---|---|
| `aic_backend` | Backend the analytical model should emulate, e.g. `"vllm"`, `"trtllm"`, `"sglang"`. |
| `aic_system` | GPU SKU for the perf model, e.g. `"h200_sxm"`, `"h100_sxm"`, `"b200"`. |
| `aic_model_path` | HF model identifier used by the perf model. |
| `aic_tp_size` | Tensor-parallel size of each engine replica. |
| `aic_nextn` | Speculative-decode depth for AIC-backed MTP runs. In disagg, put this on the decode engine args. |
| `aic_nextn_accept_rates` | Conditional draft-token accept rates used by the mocker burst sampler, for example `"1,1"` with `aic_nextn=2` accepts up to three visible tokens per decode forward. |
| `startup_time` | Simulated seconds between a planner scale-up decision and the new worker becoming active. Unset or `0` means workers activate instantly. |

Other fields follow the standard mocker engine protocol (see [DynoSim Runs](/dynamo/user-guides/dynosim/runs)).

### Planner Traffic Metrics From Replay

Planner-in-the-loop replay feeds the same traffic shape that the live planner consumes, but the observation source is the offline simulator instead of Prometheus. On planner traffic ticks, replay drains:

| Replay metric | Planner meaning |
|---|---|
| `num_req` | Completed requests in the observation window. |
| `avg_isl` / `avg_osl` | Raw request input and output lengths. `avg_osl` is not divided by speculative decode accept length. |
| `avg_kv_hit_rate` | Mean router prefix-cache hit rate at admission time, computed as the arithmetic mean of per-request `overlap_blocks / isl_blocks`. |
| `avg_accept_length` | Mean visible output tokens per decode request-forward, including the base token. Agg records the decode portion of the mixed engine pass; disagg records decode workers only. |

The planner treats KV hit rate and accept length as runtime metadata with last-value semantics. Missing accept-length samples leave the last valid value unchanged; without MTP metadata or without a prior valid sample, the effective accept length is `1.0`. When MTP is enabled, the planner clamps accept length to `[1.0, nextn + 1.0]`, where `nextn` comes from replay engine capabilities (`aic_nextn`) or the `speculative_nextn` planner fallback.

MTP changes decode capacity by discounting ITL, not by rewriting raw OSL. Accept length can be slightly below `nextn + 1` when the final decode burst is partial; for example, `nextn=2` with all draft tokens accepted over `OSL=64` gives `64 / 22 = 2.91`. With decode batch size `16` and raw per-forward wall time `40 ms`:

| Case | Accept length | Effective ITL | Engine RPS |
|---|---:|---:|---:|
| No MTP | `1.0` | `40.0 ms` | `16 / (64 * 0.040) = 6.25` |
| MTP | `2.91` | `40 / 2.91 = 13.7 ms` | `16 / (64 * 0.0137) = 18.2` |

The raw `64` output tokens still feed KV residency, context-length estimates, and request length. In agg capacity, raw OSL remains the request output length, but accept length also tightens the prefill/decode balance because faster decode egress requires more prefill admissions per forward. Decode ITL SLA checks and decode RPS use the discounted ITL.

For AIC-backed MTP replay, set `aic_nextn` on the agg engine args or on the disagg decode engine args, and set `aic_nextn_accept_rates` to control the mocker burst sampler. The planner bootstrap path asks AIC for raw forward iteration time with zero accept rates internally, so the regression is trained on undiscounted wall time and the planner applies the observed replay accept length exactly once.

## 2. Example: Agg vs Disagg On The Mooncake Agentic Trace

Download the trace:

```bash
mkdir -p traces/mooncake_fast25 && cd traces/mooncake_fast25
curl -sLO https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/FAST25-release/traces/toolagent_trace.jsonl
```

Run agg (2 workers, TP=1):

```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace.jsonl \
  --planner-config '{
    "mode": "agg",
    "optimization_target": "sla",
    "ttft_ms": 1500, "itl_ms": 50,
    "enable_throughput_scaling": true, "enable_load_scaling": true,
    "pre_deployment_sweeping_mode": "rapid",
    "throughput_adjustment_interval_seconds": 300, "load_adjustment_interval_seconds": 10,
    "prefill_engine_num_gpu": 1, "decode_engine_num_gpu": 1,
    "report_filename": "dynosim_agg.html"
  }' \
  --extra-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}' \
  --num-workers 2 --arrival-speedup-ratio 1.0
```

Run disagg (1P1D, TP=1):

```bash
.venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace.jsonl \
  --planner-config '{
    "mode": "disagg",
    "optimization_target": "sla",
    "ttft_ms": 1500, "itl_ms": 50,
    "enable_throughput_scaling": true, "enable_load_scaling": true,
    "pre_deployment_sweeping_mode": "rapid",
    "throughput_adjustment_interval_seconds": 300, "load_adjustment_interval_seconds": 10,
    "prefill_engine_num_gpu": 1, "decode_engine_num_gpu": 1,
    "report_filename": "dynosim_disagg.html"
  }' \
  --prefill-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}' \
  --decode-engine-args  '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}' \
  --num-prefill-workers 1 --num-decode-workers 1 --arrival-speedup-ratio 1.0
```

Each run prints the AIPerf summary table to stdout and writes an HTML diagnostics report to `./planner_reports/<report_filename>`. For this trace with a long ISL and short OSL, agg is better than disagg, which gets slightly better ITL at the cost noticeably more GPU-hours.

## 3. Example: Cold-Start-Time Sweep

How sensitive is SLA attainment to engine startup time? Sweep `startup_time` from 0 to 300 seconds in 10-second steps and record TTFT/ITL/GPU-hours per run.

```bash
#!/usr/bin/env bash
set -euo pipefail

TRACE=traces/mooncake_fast25/toolagent_trace.jsonl
OUT=planner_reports/sweep_startup
mkdir -p "$OUT"

run_one() {
  local s=$1
  local name=$(printf "dynosim_agg_startup_%03d.html" "$s")
  local extra
  if [[ "$s" -eq 0 ]]; then
    extra='{"aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1}'
  else
    extra=$(printf '{"aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1,"startup_time":%d}' "$s")
  fi
  .venv/bin/python -m dynamo.replay "$TRACE" \
    --planner-config "$(printf '{"mode":"agg","optimization_target":"sla","ttft_ms":1500,"itl_ms":50,"enable_throughput_scaling":true,"enable_load_scaling":true,"pre_deployment_sweeping_mode":"rapid","throughput_adjustment_interval_seconds":300,"load_adjustment_interval_seconds":10,"prefill_engine_num_gpu":1,"decode_engine_num_gpu":1,"report_filename":"%s"}' "$name")" \
    --extra-engine-args "$extra" \
    --num-workers 2 --arrival-speedup-ratio 1.0 \
    --report-json "$OUT/startup_$(printf '%03d' "$s").json" \
    >"$OUT/startup_$(printf '%03d' "$s").log" 2>&1
}

export -f run_one
# Run 12 sweeps in parallel; adjust -P for your machine.
seq 0 10 300 | xargs -n1 -P12 -I{} bash -c 'run_one "$@"' _ {}
```

Each run emits the AIPerf metrics table (parse TTFT / ITL avg / p90) and its HTML report (grep `GPU hours: <float>`). Plotting those against `startup_time` gives:

![Planner DynoSim — startup time sweep](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/cebfa5849a4bd80ac622ee88350002820851b3d4a6c4ecd2d48738936584ddf4/pages-v1.2.1/assets/img/planner-replay-startup-sweep.png)

Observations from this sweep (agg, TTFT SLA 1,500 ms, ITL SLA 50 ms, H200-SXM, Llama-3.1-8B-FP8, TP=1):

- **SLA cliff near 100–120 s.** Below that, the planner scales up fast enough to hold TTFT; above it, p99 TTFT diverges and the system stays perpetually backlogged.
- **Scaling-event count drops monotonically** (42 → 8) as startup grows — long-startup runs require load planner to wait for stabilization before the next scaling decision.
- **ITL is less sensitive than TTFT** until the queue saturates. Below the cliff, ITL rises modestly (25 → 30 ms avg); above it, p90 ITL jumps to ~200 ms as decode requests starve.