Scaling a traditional web service is straightforward: watch CPU or request rate, add replicas when load is high, remove them when it’s low. Tools like the Kubernetes Horizontal Pod Autoscaler (HPA) and KEDA work well for this because the relationship between load and resource usage is roughly linear: twice the requests means roughly twice the CPU, so a simple threshold policy keeps response times stable.
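As a point of comparison, the threshold policy described above can be sketched with the proportional rule the Kubernetes HPA documents (desired replicas = ceil(current replicas × current metric / target metric)). This is a minimal illustration, not HPA source code: the function name and the `min_replicas`/`max_replicas` defaults are assumptions, and the tolerance band and stabilization window that the real HPA applies are omitted.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,   # hypothetical default, for illustration only
    max_replicas: int = 10,  # hypothetical default, for illustration only
) -> int:
    """Proportional scaling rule in the style of the Kubernetes HPA."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the replica count by the ratio of observed to target metric.
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90.0, 60.0))
```

The rule only works when the observed metric scales roughly in proportion to load, which is the assumption the rest of this page argues does not hold for LLM inference.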
LLM inference breaks these assumptions:
The Dynamo Planner is an autoscaler purpose-built for these constraints. It understands engine profiling data, tracks per-worker GPU utilization, predicts traffic patterns, and makes scaling decisions that directly target time-to-first-token (TTFT) and inter-token latency (ITL) SLAs rather than proxy metrics.
New to the Planner? Start with the Planner Guide for a complete workflow including profiling and deployment.
Need multi-DGD coordination? See the Global Planner Guide for shared-policy coordination across multiple DynamoGraphDeployments (DGDs) and single-endpoint multi-pool deployments.
The Planner supports two scaling modes that can run independently or together: throughput-based scaling and load-based scaling.
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
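One way to picture the interaction between the two modes is as a floor-and-adjust rule. The sketch below is an assumption about the shape of that logic, not the Planner's actual implementation; `combined_target`, `throughput_floor`, `load_based_target`, and `max_replicas` are hypothetical names introduced for illustration.

```python
def combined_target(
    throughput_floor: int,
    load_based_target: int,
    max_replicas: int,
) -> int:
    """Combine the two scaling signals into one replica target.

    throughput_floor:  replicas suggested by long-term, throughput-based planning
    load_based_target: replicas suggested by real-time, load-based scaling
    max_replicas:      hypothetical upper bound (for example, a GPU budget)
    """
    # Load-based scaling may add replicas above the floor but never go below it.
    target = max(throughput_floor, load_based_target)
    # Respect the overall capacity limit.
    return min(target, max_replicas)


if __name__ == "__main__":
    print(combined_target(throughput_floor=3, load_based_target=2, max_replicas=8))
    print(combined_target(throughput_floor=3, load_based_target=5, max_replicas=8))
```

In the first call the real-time signal asks for fewer replicas than the floor, so the floor wins. In the second, the load-based target is above the floor and is used as-is.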
--adjustment-interval for throughput-based scaling.
For throughput-based scaling, pre-deployment profiling is also required (see the Profiling Guide).
The fastest path to a throughput-based planner deployment is through a DynamoGraphDeploymentRequest, which automatically profiles your model:
See Planner Guide for the full workflow.
To deploy with load-based scaling only (no profiling required), add these arguments to the planner service in your DGD:
The planner will auto-discover the frontend metrics endpoint from the DGD. See disagg_planner_load.yaml for a complete example.
For manual control with throughput-based scaling, use the disaggregated planner templates:
Load-based scaling is experimental and has the following known limitations. These are actively being addressed as part of the metrics refactor work. Throughput-based scaling is not affected by any of these.
Requires the KV Router. Load-based scaling relies on per-worker engine metrics (active prefill tokens, active KV blocks) published by the KV Router. Other routing strategies (round-robin, random) do not emit these metrics, so load-based scaling cannot operate without the KV Router.
Scale-down with idle workers. If a worker receives no requests (for example, because the router is not distributing traffic evenly), the router does not publish metrics for that worker. Without metrics, the Planner cannot evaluate whether the worker is underutilized, which can prevent scale-down decisions. Workaround: Ensure traffic distribution reaches all workers. If you observe workers stuck at zero load, check your router configuration.
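To check for this condition in your own tooling, you can compare the workers you expect against the workers that actually appear in the published metrics. The sketch below is a hypothetical helper, not part of the Planner. The worker IDs and metric keys (`active_prefill_tokens`, `active_kv_blocks`) are illustrative stand-ins for whatever your deployment exposes.

```python
from typing import Dict, Iterable, List


def workers_without_metrics(
    known_workers: Iterable[str],
    published_metrics: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the workers that have no entry in the published metrics."""
    return sorted(w for w in known_workers if w not in published_metrics)


def can_evaluate_scale_down(
    known_workers: Iterable[str],
    published_metrics: Dict[str, Dict[str, float]],
) -> bool:
    """Utilization can only be judged if every known worker reported metrics."""
    return not workers_without_metrics(known_workers, published_metrics)


if __name__ == "__main__":
    workers = ["prefill-0", "prefill-1", "prefill-2"]  # hypothetical worker IDs
    metrics = {
        "prefill-0": {"active_prefill_tokens": 2048.0, "active_kv_blocks": 96.0},
        "prefill-1": {"active_prefill_tokens": 512.0, "active_kv_blocks": 24.0},
        # "prefill-2" received no requests, so nothing was published for it.
    }
    print(workers_without_metrics(workers, metrics))
    print(can_evaluate_scale_down(workers, metrics))
```

A non-empty result is a hint to inspect the router configuration, as suggested in the workaround above.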
In-flight requests during scale-down. When the Planner scales down a worker, the worker is terminated without waiting for in-flight requests to complete. Requests that were mid-prefill on the terminated worker will fail. In disaggregated deployments, this can also affect decode workers that were waiting on KV cache transfers from the terminated prefill worker. Workaround: Set --min-endpoint to a value that avoids scaling below your steady-state traffic floor, and use a lower --loadbased-scaling-down-sensitivity value to reduce the frequency of scale-down events.
Deploy the planner dashboard:
The dashboard shows:
Throughput-based scaling pulls traffic metrics from the cluster-wide Prometheus server:
Load-based scaling pulls per-engine status directly from the frontend’s /metrics endpoint: