Planner Guide
The Dynamo Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.
For a quick overview, see the Planner overview. For architecture internals, see Planner Design.
Scaling Modes
The planner supports three optimization targets that determine how scaling decisions are made:
- `throughput` (default): Uses static thresholds on queue depth and KV cache utilization. No SLA targets or profiling needed. Works out of the box.
- `latency`: Same approach as `throughput` but with more aggressive thresholds — scales up earlier and tolerates less queuing. Ideal for latency-sensitive workloads.
- `sla`: Uses regression-based performance models with specific TTFT/ITL targets. Supports both throughput-based (predictive) and load-based (reactive) scaling modes. For advanced users who need precise SLA control.
When to use which:
- Start with `throughput` (the default) — it works immediately with no configuration.
- Switch to `latency` if your workload has strict latency requirements and you prefer to over-provision rather than queue.
- Use `sla` when you have pre-deployment profiling data and want to target specific TTFT/ITL values.
PlannerConfig Reference
The planner is configured via a `PlannerConfig` JSON/YAML object. When using the profiler, this is placed under the `features.planner` section of the DGDR spec:
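As a minimal sketch, the placement looks like the following. Only the `features.planner` path and the `optimization_target` field are taken from this guide; the surrounding DGDR fields (API version, kind, metadata) are illustrative assumptions, not the authoritative schema:

```yaml
# Hypothetical DGDR excerpt -- only the features.planner placement is from
# this guide; the surrounding fields are illustrative.
apiVersion: nvidia.com/v1alpha1          # assumed API group/version
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model
spec:
  features:
    planner:
      optimization_target: throughput    # default mode; no profiling needed
```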
For SLA-based scaling:
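A sketch of an SLA-mode config follows. The `optimization_target`, `ttft`, `itl`, and `profile_results_dir` field names appear in this guide; the example values, assumed units (milliseconds), and the path are illustrative:

```yaml
features:
  planner:
    optimization_target: sla
    ttft: 200        # target time-to-first-token, assumed milliseconds
    itl: 20          # target inter-token latency, assumed milliseconds
    profile_results_dir: /workspace/profiling_results   # illustrative path to profiler output
```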
Optimization Target
When `optimization_target` is `throughput` or `latency`, load-based scaling is automatically enabled and throughput-based scaling is disabled. The `ttft`/`itl` fields are ignored.
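The rule above can be sketched as a small normalization step. This is a hypothetical illustration of the behavior, not the planner's real code, and the toggle key names (`load_based_scaling`, `throughput_based_scaling`) are invented for the sketch:

```python
# Hypothetical sketch of the mode-normalization rule described above.
# Key names other than optimization_target, ttft, and itl are invented.
def normalize(config: dict) -> dict:
    """Apply the throughput/latency-mode overrides to a raw planner config."""
    cfg = dict(config)
    if cfg.get("optimization_target", "throughput") in ("throughput", "latency"):
        cfg["load_based_scaling"] = True          # forced on in these modes
        cfg["throughput_based_scaling"] = False   # forced off in these modes
        cfg.pop("ttft", None)                     # SLA targets are ignored
        cfg.pop("itl", None)
    return cfg
```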
Scaling Mode Fields (SLA mode)
At least one scaling mode must be enabled when using optimization_target: sla.
Pre-Deployment Sweeping
When throughput-based scaling is enabled, the planner needs engine performance data. At startup, it first tries to fetch self-benchmark results from the `get_perf_metrics` Dynamo endpoint (see PR #7779). If unavailable, it falls back to profiler-generated data (`.npz` or JSON) at `profile_results_dir`. Both sources are converted to `ForwardPassMetrics` and fed into the FPM regression model.
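The fallback order can be sketched as follows. The endpoint name (`get_perf_metrics`) and `profile_results_dir` come from this guide; the function names and return shapes are hypothetical stand-ins, not the planner's actual API:

```python
# Sketch of the startup fallback order: prefer live self-benchmark results,
# then fall back to profiler-generated files on disk. Illustrative only.
from pathlib import Path


def load_perf_data(fetch_self_benchmark, profile_results_dir: str):
    """Return (data, source) for the FPM regression model."""
    data = fetch_self_benchmark()   # e.g. call the get_perf_metrics endpoint
    if data is not None:
        return data, "self-benchmark"
    # Fall back to profiler-generated .npz / .json files.
    files = sorted(
        p for p in Path(profile_results_dir).glob("*")
        if p.suffix in (".npz", ".json")
    )
    if not files:
        raise RuntimeError("no performance data available for throughput-based scaling")
    return files, "profiler"
```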
Throughput-Based Scaling Settings
Load-Based Scaling Settings
General Settings
Traffic Prediction Settings
Kalman Filter Settings
Diagnostics Reports
The same diagnostic signals surfaced in these reports are also exported as Prometheus metrics under the `dynamo_planner_*` prefix — for example estimated TTFT/ITL (`dynamo_planner_estimated_ttft_ms`, `dynamo_planner_estimated_itl_ms`), per-engine capacity and FPM queue depths, and load/throughput scaling decision enums.
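These metrics can feed standard Prometheus alerting. As one hedged example, an alert on the planner's TTFT estimate might use a query like the following; the metric name is from this guide, while the threshold and window are illustrative:

```promql
# Fires when the estimated TTFT stays above 200 ms over a 5-minute window
# (threshold and window are illustrative, not recommendations).
max_over_time(dynamo_planner_estimated_ttft_ms[5m]) > 200
```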
Integration with Profiler
When the profiler runs with planner enabled, it:
- Selects the best prefill and decode engine configurations
- Generates engine performance data (prefill TTFT vs ISL, decode ITL vs KV-cache utilization)
- Saves the `PlannerConfig` and performance data into separate Kubernetes ConfigMaps
- Adds the planner service to the generated DGD, configured to read from those ConfigMaps
The planner receives its config via `--config /path/to/planner_config.json`, which is mounted from the `planner-config-XXXX` ConfigMap. Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap.
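As a sketch of that wiring, a pod-spec excerpt might look like the following. The `--config` flag and ConfigMap names are from this guide (`XXXX` is a generated suffix and is left as-is); the container and volume layout is illustrative, not the exact generated DGD:

```yaml
# Illustrative pod-spec excerpt; not the exact generated DGD.
containers:
  - name: planner
    args: ["--config", "/path/to/planner_config.json"]
    volumeMounts:
      - name: planner-config
        mountPath: /path/to
volumes:
  - name: planner-config
    configMap:
      name: planner-config-XXXX   # XXXX is the generated suffix
```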
See the Profiler Guide for the full profiling workflow and how to configure pre-deployment sweeping.
Hierarchical Deployments
If you want one public endpoint for a model but multiple private DGDs optimized for different request classes, use a hierarchical deployment:
- one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner`
- one or more prefill pool DGDs
- one or more decode pool DGDs
In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See the Global Planner Guide.
See Also
- Planner overview — Why LLM inference needs a different autoscaler
- Planner Design — Architecture and algorithm internals
- Planner Examples — DGDR YAML examples, sample configurations, advanced patterns
- Global Planner Guide — Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments
- Profiler Guide — How profiling data is generated