Tier 3 design documentation for contributors and architects. For user-facing docs, see docs/components/planner/.
The Planner is Dynamo’s autoscaling controller. It supports two scaling modes: throughput-based (using profiling data and traffic prediction) and load-based (using real-time engine metrics and online regression). This document covers the internal architecture, algorithms, and design trade-offs for both modes.
Every adjustment_interval seconds, the planner queries Prometheus for:
The Prometheus query targets the Frontend’s /metrics endpoint, which exposes histograms and counters.
The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
These factors account for hard-to-model effects such as:
The correction factors are applied as multipliers to the next scaling decision. Setting --no-correction disables this for debugging or when cold-start artifacts dominate.
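As a minimal sketch of the multiplier idea, the snippet below derives a factor from the ratio of observed to profiled latency and applies it to a predicted value. The ratio definition, the function names, and the `no_correction` parameter are assumptions made for illustration; the source does not give the exact formula the planner uses.

```python
def correction_factor(observed_latency_ms: float, predicted_latency_ms: float) -> float:
    """Ratio of observed to profiled latency (assumed definition).

    1.0 means profiling matched reality; values above 1.0 mean the
    deployment is slower than the profile predicted.
    """
    if predicted_latency_ms <= 0:
        return 1.0  # no usable prediction, so leave the estimate uncorrected
    return observed_latency_ms / predicted_latency_ms


def apply_correction(predicted: float, factor: float, no_correction: bool = False) -> float:
    """Apply the factor as a multiplier, or skip it (mirrors --no-correction)."""
    return predicted if no_correction else predicted * factor


# Made-up numbers: the profile predicted 200 ms, the deployment shows 240 ms.
factor = correction_factor(observed_latency_ms=240.0, predicted_latency_ms=200.0)
corrected = apply_correction(1000.0, factor)
uncorrected = apply_correction(1000.0, factor, no_correction=True)
```

With correction disabled the prediction passes through unchanged, which is the behavior wanted when cold-start measurements would otherwise distort the factor.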
The planner forecasts three values for the next interval:
- next_num_req: Number of requests
- next_isl: Average input sequence length
- next_osl: Average output sequence length
Four predictor implementations are available:
All predictors support warm-starting from trace files (--load-predictor-warmup-trace).
Prefill replicas:
The prefill correction factor has a linear effect on throughput because prefill is single-batched.
Decode replicas:
The planner calls connector.set_component_replicas() with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
Design decisions:
- DYN_PARENT_DGD_K8S_NAME to find its parent DGD (injected by operator)
- subComponentType field (prefill/decode), with fallback to legacy component names
For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via VirtualConnectorCoordinator (Rust binding). External systems use VirtualConnectorClient to poll decisions and report completion.
Scaling decision flow:
1. (num_prefill, num_decode, decision_id) to runtime
2. client.wait()
3. client.complete(decision)
4. scaled_decision_id >= decision_id and proceeds
Timeout: If scaling isn’t acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
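The handshake above can be sketched with an in-memory stand-in. `ToyRuntime`, `publish`, and `may_proceed` are hypothetical names invented for this example; they are not the API of VirtualConnectorCoordinator or VirtualConnectorClient. Only the `scaled_decision_id >= decision_id` check and the 1800s default timeout come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    num_prefill: int
    num_decode: int
    decision_id: int


class ToyRuntime:
    """In-memory stand-in for the distributed runtime (hypothetical)."""

    def __init__(self) -> None:
        self.latest: Optional[Decision] = None
        self.scaled_decision_id = -1  # highest decision id reported as applied

    def publish(self, num_prefill: int, num_decode: int) -> Decision:
        """Planner side: write a new decision with the next decision_id."""
        next_id = 0 if self.latest is None else self.latest.decision_id + 1
        self.latest = Decision(num_prefill, num_decode, next_id)
        return self.latest

    def complete(self, decision: Decision) -> None:
        """External side: report that a decision has been applied."""
        self.scaled_decision_id = max(self.scaled_decision_id, decision.decision_id)

    def may_proceed(self, decision: Decision, issued_at: float, now: float,
                    timeout_s: float = 1800.0) -> bool:
        """Planner side: proceed once acknowledged, or once the timeout expires."""
        acknowledged = self.scaled_decision_id >= decision.decision_id
        timed_out = (now - issued_at) >= timeout_s
        return acknowledged or timed_out


runtime = ToyRuntime()
decision = runtime.publish(num_prefill=2, num_decode=4)
before_ack = runtime.may_proceed(decision, issued_at=0.0, now=10.0)
runtime.complete(decision)  # the external orchestrator finished scaling
after_ack = runtime.may_proceed(decision, issued_at=0.0, now=10.0)
```

Passing timestamps in explicitly keeps the sketch deterministic; a real loop would read a monotonic clock instead.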
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
Two interpolators are maintained:
The interpolators use the profiling sweep granularity to determine precision. Finer granularity requires more profiling samples but yields more accurate interpolation.
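To illustrate why sweep granularity bounds precision, here is a minimal one-dimensional sketch that interpolates TTFT between profiled ISL points. The function name and the sample values are made up for illustration, and the planner's interpolators work over more dimensions than this.

```python
from bisect import bisect_right
from typing import Sequence


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation over sorted xs, clamped to the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # now xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profiling sweep: ISL sample points and the TTFT measured at each.
isl_samples = [1000, 2000, 4000]
ttft_ms = [50.0, 110.0, 250.0]

estimate = interpolate(isl_samples, ttft_ms, 3000)
print(estimate)  # 180.0
```

The estimate at ISL 3000 is only as good as the straight line between the samples at 2000 and 4000. Adding a profiled point at 3000 would replace the estimate with a measurement, at the cost of one more profiling run.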
The planner currently waits 30 seconds (INIT_PLANNER_START_DELAY in components/src/dynamo/planner/__main__.py) as a temporary workaround while other components (frontend, workers) register and stabilize; see Known Limitations for the planned readiness-probing replacement.
After the delay:
- --environment)
- adjustment_interval is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
- --no-correction flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
- prefillInterpolationGranularity and decodeInterpolationGranularity in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- --kalman-min-points observations. During warm-up, the planner uses the constant predictor as fallback.
The load-based mode uses real-time per-worker metrics from the router to make SLA-aware scaling decisions without requiring profiling data.
The planner pulls per-worker load metrics directly from the frontend’s /metrics endpoint:
A sliding-window linear regression maps load to latency:
- (active_prefill_tokens + ISL) -> TTFT
- active_decode_blocks -> ITL
Given a TTFT/ITL SLA target, the model reverse-solves for the maximum load that satisfies the SLA.
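The regression and the reverse solve can be sketched as follows. This is a plain ordinary-least-squares fit over a bounded window; the class name, window size, and sample values are assumptions for illustration, not the planner's actual implementation.

```python
from collections import deque
from typing import Optional, Tuple


class SlidingWindowRegression:
    """Fit latency = slope * load + intercept over the most recent samples."""

    def __init__(self, window_size: int = 50) -> None:
        # deque(maxlen=...) drops the oldest sample automatically.
        self.samples = deque(maxlen=window_size)

    def add(self, load: float, latency: float) -> None:
        self.samples.append((load, latency))

    def fit(self) -> Optional[Tuple[float, float]]:
        """Return (slope, intercept), or None if the window cannot be fitted."""
        n = len(self.samples)
        if n < 2:
            return None
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            return None  # all loads identical, slope is undefined
        cov = sum((x - mean_x) * (y - mean_y) for x, y in self.samples)
        slope = cov / var_x
        return slope, mean_y - slope * mean_x

    def max_load_for(self, latency_target: float) -> Optional[float]:
        """Invert the fitted line: the largest load that still meets the target."""
        fitted = self.fit()
        if fitted is None:
            return None
        slope, intercept = fitted
        if slope <= 0:
            return None  # latency does not grow with load, so no finite bound
        return (latency_target - intercept) / slope


# Made-up TTFT samples lying on the line ttft_ms = 0.05 * load + 20.
ttft_model = SlidingWindowRegression(window_size=3)
for load, ttft_ms in [(1000, 70.0), (2000, 120.0), (3000, 170.0)]:
    ttft_model.add(load, ttft_ms)

max_load = ttft_model.max_load_for(latency_target=220.0)  # approximately 4000
```

Returning None when the fit is degenerate leaves the caller free to hold the current replica count rather than act on an unreliable model.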
- (num_workers - 1) / num_workers * sensitivity / 100
When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor.
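A minimal sketch of the combined behavior, assuming the two modes are reconciled by taking the larger of the two targets. The function name and the sample numbers are hypothetical.

```python
from typing import List


def combine_targets(throughput_floor: int, load_based_target: int) -> int:
    """Load-based scaling may raise the replica count but never drop below the floor."""
    return max(throughput_floor, load_based_target)


# The floor is recomputed on the longer interval; load-based targets arrive more often.
throughput_floor = 3
load_based_targets: List[int] = [2, 3, 5, 4, 1]
applied = [combine_targets(throughput_floor, t) for t in load_based_targets]
print(applied)  # [3, 3, 5, 4, 3]
```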
In aggregated mode (--mode agg), engines handle both prefill and decode via chunked prefill. The planner maintains both TTFT and ITL regression models but uses per-worker time-averaged metrics (not instantaneous) for regression training to smooth out chunked prefill noise. Scale up if either prefill or decode signals overload; scale down only if both signal underload.
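The asymmetric rule in the last sentence can be sketched as a small decision function. The `Signal` enum and the +1/0/-1 return convention are assumptions made for this example.

```python
from enum import Enum


class Signal(Enum):
    UNDERLOAD = -1
    OK = 0
    OVERLOAD = 1


def aggregated_decision(prefill: Signal, decode: Signal) -> int:
    """Return +1 to scale up, -1 to scale down, 0 to hold."""
    # Either signal reporting overload is enough to scale up.
    if Signal.OVERLOAD in (prefill, decode):
        return 1
    # Scaling down requires both signals to agree on underload.
    if prefill is Signal.UNDERLOAD and decode is Signal.UNDERLOAD:
        return -1
    return 0


scale_up = aggregated_decision(Signal.OVERLOAD, Signal.UNDERLOAD)
scale_down = aggregated_decision(Signal.UNDERLOAD, Signal.UNDERLOAD)
hold = aggregated_decision(Signal.UNDERLOAD, Signal.OK)
```

Checking overload first means a worker that is prefill-bound but decode-idle still triggers a scale-up, which biases the rule toward protecting the SLA over saving replicas.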
adjustment_interval < time to scale, scaling decisions can pile up. The planner logs warnings but doesn’t queue.