Planner | NVIDIA Dynamo Documentation

Why LLM Inference Needs a Different Autoscaler

Scaling a traditional web service is straightforward: watch CPU or request rate, add replicas when load is high, remove them when it’s low. Tools like HPA and KEDA work well for this because the relationship between load and latency is roughly linear — twice the requests means roughly twice the CPU, so a simple threshold policy keeps response times stable.

LLM inference breaks these assumptions:

Latency depends on request content, not just request count. A single request with a 32K-token prompt consumes orders of magnitude more compute than a short one. Two requests per second can mean completely different GPU loads depending on input/output sequence lengths.
Prefill and decode have different scaling characteristics. In disaggregated serving, prefill is compute-bound (scales with input length) while decode is memory-bound (scales with concurrent sequences and KV cache usage). A single replica count doesn’t capture both.
The metrics that matter aren’t standard. The SLAs users care about — Time to First Token (TTFT) and Inter-Token Latency (ITL) — don’t map cleanly to CPU utilization or request throughput. HPA can’t target “keep P95 TTFT under 500ms” because that requires understanding the relationship between sequence lengths, GPU memory pressure, and latency.
Scaling decisions are expensive. Spinning up a GPU worker takes minutes, not seconds. Overscaling wastes GPU-hours at cloud prices; underscaling violates SLAs. The autoscaler needs to predict demand, not just react to it.

The Dynamo Planner is an autoscaler purpose-built for these constraints. It understands engine profiling data, tracks per-worker GPU utilization, predicts traffic patterns, and makes scaling decisions that directly target TTFT and ITL SLAs — not proxy metrics.

Getting Started: Optimization Targets

The planner offers three optimization_target settings that control how scaling decisions are made:

Target	Description	Requires SLA?	Requires Profiling?
`throughput` (default)	Maximizes throughput by scaling based on queue depth and KV cache utilization. Scales up when engines are saturated, scales down when utilization drops.	No	No
`latency`	Minimizes latency by scaling aggressively to keep queues short. Scales up at lower utilization thresholds.	No	No
`sla`	Targets specific TTFT/ITL SLA values using the Rust engine perf shim: native AIC estimates when available, online FPM tuning, and FPM regression fallback.	Yes (`ttft_ms`, `itl_ms`)	Recommended

We recommend starting with the default throughput target — it works out of the box with zero configuration. Switch to latency if your workload is latency-sensitive, or to sla when you need precise SLA targeting with native AIC or FPM-based performance modeling.

New to the Planner? Start with the Planner Guide for a complete workflow including profiling and deployment.

Need multi-DGD coordination? See the Global Planner Guide for shared-policy coordination across multiple DGDs and single-endpoint multi-pool deployments.

Scaling Modes

The Planner supports two scaling modes that can run independently or together:

Throughput-based scaling: Uses the engine perf shim and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. The shim can use native AIC estimates, self-benchmark/profiler FPM bootstrap data, and live FPM tuning. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
Load-based scaling: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and queries the same perf shim for short-term TTFT/ITL estimates. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.

When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.

Feature Matrix

Feature	Throughput-Based	Load-Based
Deployment
Disaggregated	Supported	Supported
Aggregated	Supported	Supported
LLM Framework
SGLang	Supported	Supported
TensorRT-LLM	Supported	Supported
vLLM	Supported	Supported
Requires Pre-deployment Data	No; recommended for faster warmup when native AIC is unavailable	No
Load Predictors	ARIMA, Prophet, Kalman, Constant	N/A
Router
Any (round-robin, random, etc.)	Supported	Not supported
KV Router	Supported	Supported
Connectors
KubernetesConnector	Supported	Supported
VirtualConnector	Supported	Supported

When to Use Which Mode

Throughput-based scaling should be enabled for SLA mode when you want stable, prediction-based capacity planning. Native AIC or bootstrap FPMs make it ready sooner; otherwise it warms from live FPMs.
Load-based scaling should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
Both modes together: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer throughput_adjustment_interval_seconds than load_adjustment_interval_seconds.

Quick Start

Prerequisites

Dynamo platform installed on Kubernetes (Installation Guide)
kube-prometheus-stack installed (Metrics Setup)

Default Mode (zero config)

The planner works out of the box with no configuration needed. By default, optimization_target is set to throughput, which uses static thresholds on queue depth and KV cache utilization — no SLAs or profiling required:

1 # Minimal planner config — uses throughput optimization by default
2 features:
3   planner:
4     mode: disagg
5     backend: vllm

For latency-sensitive workloads:

1 features:
2   planner:
3     mode: disagg
4     backend: vllm
5     optimization_target: latency

SLA-Based Scaling (advanced)

For precise SLA targeting with native AIC estimates, optional bootstrap profiling data, or live FPM warmup, set optimization_target: sla:

1 features:
2   planner:
3     optimization_target: sla
4     enable_throughput_scaling: true
5     enable_load_scaling: true
6     ttft_ms: 500.0
7     itl_ms: 50.0
8     pre_deployment_sweeping_mode: rapid

The fastest path to SLA-based scaling is through a DynamoGraphDeploymentRequest, which automatically profiles your model. See Planner Examples for copyable DGDR manifests.

See Planner Guide for the full workflow.

Current Limitations

Load-based scaling

Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these.

Requires ForwardPassMetrics (FPM). Load-based scaling uses per-engine per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). The KV Router is not required for load-based scaling. FPM availability by backend:

vLLM — supported. Automatically enabled when the engine uses InstrumentedScheduler and DYN_FORWARDPASS_METRIC_PORT is set.
TensorRT-LLM — supported for non-attention-DP workers (attention_dp_size == 1); gated off when attention_dp_size > 1 pending per-rank FPM emission.
SGLang — pipeline wired in Dynamo, but the upstream SGLang FPM module is not included in the current 1.2.0 SGLang runtime image. See the SGLang FPM section for the runtime-image prerequisite.

General

In-flight requests during scale-down. When the Planner scales down a worker, the worker is terminated without waiting for in-flight requests to complete. Requests that were mid-prefill on the terminated worker will fail. In disaggregated deployments, this can also affect decode workers that were waiting on KV cache transfers from the terminated prefill worker. Workaround: Set min_endpoint to a value that avoids scaling below your steady-state traffic floor, and use a lower load_scaling_down_sensitivity value to reduce the frequency of scale-down events.

Documentation

Document	Description
Planner Guide	Deployment, configuration, integration
Planner Design	Architecture and algorithm internals
Planner Examples	DGDR YAML examples, sample configurations, advanced patterns
Global Planner Guide	Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments

Configuration Reference

Key PlannerConfig Fields

The planner process is launched with --config /path/to/planner_config.json. DGDR planner features and generated ConfigMaps are materialized into these PlannerConfig fields.

Field	Default	Description
Common
`namespace`	`$DYN_NAMESPACE` or `dynamo`	Dynamo logical namespace
`backend`	`vllm`	Backend framework (`sglang`, `trtllm`, `vllm`)
`mode`	`disagg`	Planner mode (`disagg`, `prefill`, `decode`, `agg`)
`optimization_target`	`throughput`	Scaling target: `throughput` (queue/util thresholds), `latency` (aggressive low-latency), `sla` (Rust engine perf model SLA targeting)
`environment`	`kubernetes`	Deployment environment
`ttft_ms`	`500.0`	Target Time To First Token (ms)
`itl_ms`	`50.0`	Target Inter-Token Latency (ms)
`max_gpu_budget`	`8`	Maximum GPUs across all workers
`min_endpoint`	`1`	Minimum replicas per worker type
`decode_engine_num_gpu`	`1`	GPUs per decode engine
`prefill_engine_num_gpu`	`1`	GPUs per prefill engine
`advisory`	`false`	Suggestion-only mode. The Planner computes and reports recommended replica counts, but does not execute scaling actions or change the deployment.
Throughput-based scaling
`enable_throughput_scaling`	`true`	Enable throughput-based scaling
`throughput_adjustment_interval_seconds`	`180`	Seconds between throughput-based scaling decisions
`profile_results_dir`	`profiling_results`	Path to profiling data (NPZ/JSON)
`load_predictor`	`arima`	Prediction model (`arima`, `prophet`, `kalman`, `constant`)
Load-based scaling
`enable_load_scaling`	`false`	Enable load-based scaling
`load_adjustment_interval_seconds`	`5`	Seconds between FPM tuning updates and load-based scaling decisions
`max_num_fpm_samples`	`64`	Maximum retained FPM observations for online tuning or regression
`fpm_sample_bucket_size`	`16`	Number of buckets for observation retirement (must be perfect square)
`load_scaling_down_sensitivity`	`80`	Scale-down sensitivity 0-100 (0=never, 100=aggressive)
`load_min_observations`	`5`	Minimum observations before regression activates
`prefill_scale_up_queue_tokens` / `prefill_scale_down_queue_tokens`	`null`	Queue token thresholds for `optimization_target: load` prefill scaling.
`decode_scale_up_kv_rate` / `decode_scale_down_kv_rate`	`null`	Decode KV utilization thresholds for `optimization_target: load` decode scaling.
Plugin pipeline
`scheduling.scale_interval_seconds`	gcd of enabled builtin intervals	Base pipeline cadence. Plugins fire according to their own execution intervals.
`scheduling.tick_max_duration_seconds`	`30.0`	Deadline for one full plugin pipeline tick.
`plugin_registration.transport.request_timeout_seconds`	`5.0`	Per-plugin RPC timeout.

Environment Variables

Variable	Default	Description
`DYN_NAMESPACE`	`dynamo`	Dynamo logical namespace
`DYN_PARENT_DGD_K8S_NAME`	(required)	Parent DGD K8s resource name
`PROMETHEUS_ENDPOINT`	`http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090`	Prometheus URL
`PLANNER_PROMETHEUS_PORT`	`0` (disabled)	Port for planner’s own Prometheus metrics

Monitoring

Grafana Dashboard

Deploy the planner dashboard:

$ kubectl apply -n monitoring -f deploy/observability/grafana-planner-dashboard-configmap.yaml

The dashboard shows:

Worker counts and GPU usage over time
Observed TTFT, ITL, request rate, sequence lengths
Predicted load and recommended replica counts
Engine perf model status

Prometheus Metrics

When PLANNER_PROMETHEUS_PORT is set, the planner serves its own metrics endpoint. Exported series use the dynamo_planner_* naming convention (underscores and standard unit suffixes), replacing older planner:*-style names.

Throughput-based scaling pulls traffic metrics from the cluster-wide Prometheus server:

Request count and duration
TTFT and ITL distributions
Input/output sequence lengths

Planner can read these traffic signals from either the public Frontend or a pool-local LocalRouter. Use throughput_metrics_source: "frontend" for a single-DGD deployment. Use throughput_metrics_source: "router" for GlobalPlanner / multi-pool deployments so each pool Planner reads its own router traffic instead of the shared public endpoint.

Planner input	Frontend source	Router source
Request count	`dynamo_frontend_requests_total`	`dynamo_component_router_requests_total`
TTFT	`dynamo_frontend_time_to_first_token_seconds`	`dynamo_component_router_time_to_first_token_seconds`
ITL	`dynamo_frontend_inter_token_latency_seconds`	`dynamo_component_router_inter_token_latency_seconds`
Request duration	`dynamo_frontend_request_duration_seconds`	`dynamo_component_request_duration_seconds` until router-specific duration metrics are available
Input sequence length / ISL	`dynamo_frontend_input_sequence_tokens`	`dynamo_component_router_input_sequence_tokens`
Output sequence length / OSL	`dynamo_frontend_output_sequence_tokens`	`dynamo_component_router_output_sequence_tokens`
KV hit rate	Not available from frontend source	`dynamo_component_router_kv_hit_rate`

The throughput planner uses request count, ISL, OSL, and optional KV hit rate as the core traffic forecast inputs. TTFT, ITL, and request duration are also scraped and exported as observed diagnostics.

Load-based scaling uses ForwardPassMetrics (FPM) from the Dynamo event plane:

Per-iteration wall time, scheduled prefill/decode tokens, and queued request status
Delivered via FpmEventSubscriber with automatic engine discovery and lifecycle tracking
No router /metrics scraping required

FPM observes engine-side scheduled and queued work. It does not include requests still queued in the LocalRouter before engine assignment.

Core gauges on the planner port include replica counts (dynamo_planner_num_prefill_replicas, dynamo_planner_num_decode_replicas), observed traffic (dynamo_planner_observed_*), replica recommendations (dynamo_planner_predicted_num_prefill_replicas, dynamo_planner_predicted_num_decode_replicas), and cumulative dynamo_planner_gpu_hours.

Throughput prediction gauges dynamo_planner_predicted_requests_per_second, dynamo_planner_predicted_input_sequence_tokens, and dynamo_planner_predicted_output_sequence_tokens are wired from throughput-scaling traffic prediction and exposed alongside observed sequence-length metrics.

Advisory mode

Set advisory: true to run the local Planner in suggestion-only mode. This is recommended when you are evaluating a new Planner configuration, validating SLA targets, or reviewing how the Planner would react to production traffic before allowing it to scale workers.

In advisory mode, the Planner still observes traffic and FPM data, computes recommended prefill and decode replica counts, logs recommendation summaries, exports predicted replica metrics, and includes recommendations in diagnostics reports. The recommendations are not applied as scaling decisions: the Planner does not execute scaling actions, send replica changes to Kubernetes or GlobalPlanner, or mutate the deployment.

Diagnostics metrics

Additional series support dashboards and offline analysis:

Perf-model latency estimates: dynamo_planner_estimated_ttft_ms and dynamo_planner_estimated_itl_ms reflect the maximum estimated TTFT and ITL from the engine perf model across engines.
Engine capacity: dynamo_planner_engine_prefill_requests_per_second and dynamo_planner_engine_decode_requests_per_second report single-engine prefill and decode capacity under the configured SLA.
Scaling decision reasons: dynamo_planner_load_scaling_decision and dynamo_planner_throughput_scaling_decision are Enum gauges whose state labels encode why each mode chose to scale, hold, or skip (for example scale_up, no_fpm_data, set_lower_bound).
Per-engine FPM queue depths: dynamo_planner_engine_queued_prefill_tokens, dynamo_planner_engine_queued_decode_kv_tokens, and dynamo_planner_engine_inflight_decode_kv_tokens are labeled with worker_id and dp_rank for each engine.

HTML diagnostics reports

The planner can emit periodic, self-contained HTML diagnostics files with interactive Plotly charts.

Configure this in PlannerConfig (or the equivalent YAML / constructor wiring your deployment uses):

report_interval_hours: interval in simulated time between reports (default 24.0 hours); set to None to disable.
report_output_dir: directory where HTML files are written (default ./planner_reports).
live_dashboard_port: port for a real-time HTTP dashboard (default 8080). Set to 0 to disable. An aiohttp server starts on the given port and serves the current accumulated snapshot data as an interactive Plotly report at http://<host>:<port>/. Unlike periodic reports, the live dashboard does not clear snapshots — it always shows all data accumulated since the last periodic report (or since startup if periodic reports are disabled).

Reports aggregate per-tick snapshots and use TickInput.now_s for timestamps, so they behave the same in live runs (wall clock) and in replay with a simulated clock. Typical charts cover worker counts, recommended replica counts, observed versus estimated latencies versus SLA targets, request rate, engine capacity, scaling decision timelines, and input/output sequence lengths. In the Replica Counts plot, actual replicas are shown as lines and the Planner’s recommended prefill and decode replica counts are shown as discrete markers at the ticks where recommendations were produced. This is especially useful with advisory: true because those recommendations are suggestions only.