Scaling a traditional web service is straightforward: watch CPU or request rate, add replicas when load is high, remove them when it’s low. Tools like HPA and KEDA work well for this because resource usage grows roughly linearly with load: twice the requests means roughly twice the CPU, so a simple threshold policy keeps response times stable.
LLM inference breaks these assumptions:
The Dynamo Planner is an autoscaler purpose-built for these constraints. It understands engine profiling data, tracks per-worker GPU utilization, predicts traffic patterns, and makes scaling decisions that directly target TTFT and ITL SLAs — not proxy metrics.
The planner offers three optimization_target settings that control how scaling decisions are made:
We recommend starting with the default throughput target — it works out of the box with zero configuration. Switch to latency if your workload is latency-sensitive, or to sla when you need precise SLA targeting with pre-deployment profiling.
New to the Planner? Start with the Planner Guide for a complete workflow including profiling and deployment.
Need multi-DGD coordination? See the Global Planner Guide for shared-policy coordination across multiple DGDs and single-endpoint multi-pool deployments.
The Planner supports two scaling modes that can run independently or together:
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
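The interaction between the two modes can be pictured as taking the larger of two replica targets. The following is a minimal sketch under that assumption, not the Planner's actual implementation; `ReplicaPlan`, its fields, and the sample numbers are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPlan:
    """Hypothetical container for the two per-tick replica targets."""

    throughput_floor: int  # capacity floor from the slower, throughput-based loop
    load_target: int       # replica count requested by the faster, load-based loop

    def effective(self) -> int:
        # Load-based scaling may add replicas above the floor,
        # but it never takes the deployment below it.
        return max(self.throughput_floor, self.load_target)


# Three illustrative ticks: load below the floor, load above it, floor raised.
plans = [ReplicaPlan(2, 1), ReplicaPlan(2, 4), ReplicaPlan(3, 2)]
effective = [p.effective() for p in plans]
print(effective)  # prints [2, 4, 3]
```

In the first tick the load-based target (1) is ignored because it falls below the floor; in the second, the load-based target wins because it sits above the floor.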
--adjustment-interval for throughput-based scaling.
The planner works out of the box with no configuration needed. By default, optimization_target is set to throughput, which uses static thresholds on queue depth and KV cache utilization, with no SLAs or profiling required:
For latency-sensitive workloads:
For precise SLA targeting with pre-deployment profiling, set optimization_target: sla:
The fastest path to SLA-based scaling is through a DynamoGraphDeploymentRequest, which automatically profiles your model:
See the Planner Guide for the full workflow.
Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these.
Requires ForwardPassMetrics (FPM). Load-based scaling uses per-engine per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). FPM is currently only available for vllm and is automatically enabled when the engine uses InstrumentedScheduler and DYN_FORWARDPASS_METRIC_PORT is set. The KV Router is not required for load-based scaling.
In-flight requests during scale-down. When the Planner scales down a worker, the worker is terminated without waiting for in-flight requests to complete. Requests that were mid-prefill on the terminated worker will fail. In disaggregated deployments, this can also affect decode workers that were waiting on KV cache transfers from the terminated prefill worker. Workaround: Set --min-endpoint to a value that avoids scaling below your steady-state traffic floor, and use a lower --loadbased-scaling-down-sensitivity value to reduce the frequency of scale-down events.
Deploy the planner dashboard:
The dashboard shows:
When PLANNER_PROMETHEUS_PORT is set, the planner serves its own metrics endpoint. Exported series use the dynamo_planner_* naming convention (underscores and standard unit suffixes), replacing older planner:*-style names.
Throughput-based scaling pulls traffic metrics from the cluster-wide Prometheus server:
Load-based scaling uses ForwardPassMetrics (FPM) from the Dynamo event plane:
FpmEventSubscriber with automatic engine discovery and lifecycle tracking/metrics scraping required
Core gauges on the planner port include replica counts (dynamo_planner_num_prefill_replicas, dynamo_planner_num_decode_replicas), observed traffic (dynamo_planner_observed_*), replica decisions (dynamo_planner_predicted_num_prefill_replicas, dynamo_planner_predicted_num_decode_replicas), and cumulative dynamo_planner_gpu_hours.
Throughput prediction gauges dynamo_planner_predicted_requests_per_second, dynamo_planner_predicted_input_sequence_tokens, and dynamo_planner_predicted_output_sequence_tokens are wired from throughput-scaling traffic prediction and exposed alongside observed sequence-length metrics.
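As a rough illustration of how these gauges might be consumed, the sketch below parses Prometheus text-format output and compares current replica counts with the planner's replica decisions. The metric names come from the text above; the `SAMPLE` payload and all of its values are hand-written for this example and are not real planner output, and `parse_unlabeled_gauges` is a hypothetical helper.

```python
# Hand-written sample in Prometheus text exposition format (illustrative values only).
SAMPLE = """\
# TYPE dynamo_planner_num_prefill_replicas gauge
dynamo_planner_num_prefill_replicas 2.0
# TYPE dynamo_planner_num_decode_replicas gauge
dynamo_planner_num_decode_replicas 4.0
# TYPE dynamo_planner_predicted_num_prefill_replicas gauge
dynamo_planner_predicted_num_prefill_replicas 3.0
# TYPE dynamo_planner_predicted_num_decode_replicas gauge
dynamo_planner_predicted_num_decode_replicas 4.0
# TYPE dynamo_planner_gpu_hours gauge
dynamo_planner_gpu_hours 18.25
"""


def parse_unlabeled_gauges(text: str, prefix: str = "dynamo_planner_") -> dict:
    """Return {metric_name: value} for unlabeled samples whose name starts with prefix."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, rest = line.partition(" ")
        if "{" in name or not name.startswith(prefix):
            continue  # skip labeled series and unrelated metrics
        values[name] = float(rest.split()[0])
    return values


gauges = parse_unlabeled_gauges(SAMPLE)

# Difference between the planner's replica decision and the current replica count.
pending = {
    role: int(
        gauges[f"dynamo_planner_predicted_num_{role}_replicas"]
        - gauges[f"dynamo_planner_num_{role}_replicas"]
    )
    for role in ("prefill", "decode")
}
print(pending)  # prints {'prefill': 1, 'decode': 0}
```

In a live deployment the text would come from the planner's own metrics endpoint (the port set by PLANNER_PROMETHEUS_PORT) rather than from a string literal; the exact URL path is not stated here, so check your deployment before wiring this up.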
Additional series support dashboards and offline analysis:
dynamo_planner_estimated_ttft_ms and dynamo_planner_estimated_itl_ms reflect the maximum estimated TTFT and ITL from the online regression across engines.
dynamo_planner_engine_prefill_requests_per_second and dynamo_planner_engine_decode_requests_per_second report single-engine prefill and decode capacity under the configured SLA.
dynamo_planner_load_scaling_decision and dynamo_planner_throughput_scaling_decision are Enum gauges whose state labels encode why each mode chose to scale, hold, or skip (for example scale_up, no_fpm_data, set_lower_bound).
dynamo_planner_engine_queued_prefill_tokens, dynamo_planner_engine_queued_decode_kv_tokens, and dynamo_planner_engine_inflight_decode_kv_tokens are labeled with worker_id and dp_rank for each engine.
The planner can emit periodic, self-contained HTML diagnostics files with interactive Plotly charts.
Configure this in PlannerConfig (or the equivalent YAML / constructor wiring your deployment uses):
report_interval_hours: interval in simulated time between reports (default 24.0 hours); set to None to disable.
report_output_dir: directory where HTML files are written (default ./planner_reports).
live_dashboard_port: port for a real-time HTTP dashboard (default 8080). Set to 0 to disable. An aiohttp server starts on the given port and serves the current accumulated snapshot data as an interactive Plotly report at http://<host>:<port>/. Unlike periodic reports, the live dashboard does not clear snapshots; it always shows all data accumulated since the last periodic report (or since startup if periodic reports are disabled).
Reports aggregate per-tick snapshots and use TickInput.now_s for timestamps, so they behave the same in live runs (wall clock) and in replay with a simulated clock. Typical charts cover worker counts, observed versus estimated latencies versus SLA targets, request rate, engine capacity, scaling decision timelines, and input/output sequence lengths.
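To make the report settings and their defaults concrete, here is a small stand-in sketch. `ReportSettings` and `should_emit_report` are hypothetical and are not part of the Planner; only the three field names and their defaults come from the description above, and the interval check is one plausible way a per-tick timestamp such as TickInput.now_s could drive periodic reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReportSettings:
    """Hypothetical stand-in for the report-related fields of PlannerConfig."""

    report_interval_hours: Optional[float] = 24.0  # None disables periodic reports
    report_output_dir: str = "./planner_reports"   # where HTML files are written
    live_dashboard_port: int = 8080                # 0 disables the live dashboard

    @property
    def live_dashboard_enabled(self) -> bool:
        return self.live_dashboard_port != 0


def should_emit_report(settings: ReportSettings, last_report_s: float, now_s: float) -> bool:
    """Return True once a full report interval has elapsed.

    now_s stands in for a per-tick timestamp, so the same check works with
    a wall clock and with a simulated clock in replay.
    """
    if settings.report_interval_hours is None:
        return False
    return now_s - last_report_s >= settings.report_interval_hours * 3600.0


defaults = ReportSettings()
print(should_emit_report(defaults, last_report_s=0.0, now_s=24 * 3600.0))  # prints True
```

Because the check depends only on the timestamps passed in, a replay with a simulated clock produces reports at the same simulated intervals as a live run.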