Planner Design | NVIDIA Dynamo Documentation

Tier 3 design documentation for contributors and architects. For user-facing docs, see docs/components/planner/.

Overview

The Planner is Dynamo’s autoscaling controller. It supports two scaling modes: throughput-based (using profiling data and traffic prediction) and load-based (using real-time engine metrics and online regression). This document covers the internal architecture, algorithms, and design trade-offs for both modes.

Throughput-Based Scaling

Planner architecture showing Metric Collector, Load Predictor, and Performance Interpolator feeding into the Scaling Algorithm and Connector Layer

Scaling Algorithm

Step 1: Metric Collection

Every adjustment_interval seconds, the planner queries Prometheus for:

Average TTFT and ITL over the interval
Total request count
Average input sequence length (ISL) and output sequence length (OSL)

The Prometheus query targets the Frontend’s /metrics endpoint, which exposes histograms and counters.

Step 2: Correction Factor Calculation

The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:

prefill_correction = actual_ttft / expected_ttft
decode_correction  = actual_itl  / expected_itl

These factors account for hard to model factors such as:

Request queueing: Bursty traffic causes higher TTFT than profiled steady-state
Prefix cache hits: KV reuse reduces effective prefill tokens, lowering actual TTFT
Chunked prefill in decode: Small prefills processed in decode engine affect ITL
Metric variance: Average ISL/OSL may not represent the actual distribution

The correction factors are applied as multipliers to the next scaling decision. Setting --no-correction disables this for debugging or when cold-start artifacts dominate.

Step 3: Load Prediction

The planner forecasts three values for the next interval:

next_num_req: Number of requests
next_isl: Average input sequence length
next_osl: Average output sequence length

Four predictor implementations are available:

Predictor	Algorithm	Best For
Constant	`next = current`	Stable workloads, long intervals
ARIMA	Auto-ARIMA with optional log1p transform	Trending/seasonal patterns
Kalman	Local linear trend Kalman filter	Bursty traffics
Prophet	Facebook Prophet time-series model	Complex seasonality

All predictors support warm-starting from trace files (--load-predictor-warmup-trace).

Step 4: Replica Calculation

Prefill replicas:

1 predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
2 prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)

The prefill correction factor has a linear effect on throughput because prefill is single-batched.

Decode replicas:

1 # Apply correction to the ITL SLA target
2 corrected_itl = target_itl / decode_correction_factor
3 
4 # Find best throughput/GPU that achieves corrected ITL at predicted context length
5 throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
6     itl=corrected_itl,
7     context_length=next_isl + next_osl / 2
8 )
9 
10 # Calculate required replicas
11 decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)

Step 5: Scaling Execution

The planner calls connector.set_component_replicas() with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.

Connector Design

Interface

1 class PlannerConnector(ABC):
2     async def add_component(self, component_name)
3     async def remove_component(self, component_name)
4     # Extended interface (not on ABC, but implemented by both connectors):
5     async def set_component_replicas(self, targets, blocking)
6     async def validate_deployment(self, ...)
7     async def wait_for_deployment_ready(self)

KubernetesConnector

Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.

Design decisions:

Uses DYN_PARENT_DGD_K8S_NAME to find its parent DGD (injected by operator)
Resolves services by subComponentType field (prefill/decode), with fallback to legacy component names
Validates deployment structure on startup: checks that prefill and decode services exist and model names match

VirtualConnector

For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via VirtualConnectorCoordinator (Rust binding). External systems use VirtualConnectorClient to poll decisions and report completion.

Scaling decision flow:

Planner writes (num_prefill, num_decode, decision_id) to runtime
External system reads decision via client.wait()
External system executes scaling
External system reports completion via client.complete(decision)
Planner sees scaled_decision_id >= decision_id and proceeds

Timeout: If scaling isn’t acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.

Performance Interpolation

The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).

Two interpolators are maintained:

Prefill interpolator: Maps (throughput_per_gpu, ISL) -> TTFT
Decode interpolator: Maps (throughput_per_gpu, context_length) -> ITL

The interpolators use the profiling sweep granularity to determine precision. Finer granularity means more profiling samples but more accurate interpolation.

Initialization

The planner currently waits 30 seconds (INIT_PLANNER_START_DELAY in components/src/dynamo/planner/__main__.py) as a temporary workaround while other components (frontend, workers) register and stabilize; see Known Limitations for the planned readiness-probing replacement.

After the delay:

Initialize the connector (K8s or Virtual based on --environment)
Validate deployment structure
Load profiling results
Build interpolators
Initialize load predictor
Enter main scaling loop

Performance Considerations

Adjustment interval sizing: The interval must be long enough for scaling operations to complete. If adjustment_interval is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
Correction factor stability: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The --no-correction flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
Interpolation accuracy vs profiling cost: Higher prefillInterpolationGranularity and decodeInterpolationGranularity in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
Predictor warm-up period: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after --kalman-min-points observations. During warm-up, the planner uses the constant predictor as fallback.

Load-Based Scaling

The load-based mode uses ForwardPassMetrics (FPM) from the Dynamo event plane to make SLA-aware scaling decisions without requiring profiling data or the KV Router.

Metrics

Each engine emits per-iteration ForwardPassMetrics via ZMQ -> FpmEventRelay -> event plane. The planner subscribes via FpmEventSubscriber with automatic engine discovery and MDC-based lifecycle tracking. Key fields used:

wall_time: per-iteration execution time (regression target)
scheduled_requests.sum_prefill_tokens: prefill regression input
scheduled_requests.sum_decode_kv_tokens: decode regression input
queued_requests: queued prefill/decode load for TTFT/ITL simulation
Idle heartbeats (wall_time=0) are skipped

Diagnostics

Each tick, the scaling state machine fills TickDiagnostics with intermediate decision data—estimated latencies, predicted load, per-engine RPS, and decision reasons—via internal _diag_* fields. The adapter layer reads this from PlannerEffects.diagnostics and:

Sets Prometheus gauges (e.g. dynamo_planner_estimated_ttft_ms and related estimates)
Records enum metrics for load-scaling decision reasons (dynamo_planner_load_scaling_decision)
Feeds DiagnosticsRecorder, which accumulates per-tick snapshots and emits Plotly-based HTML reports on a schedule

Per-engine FPM queue depths from _collect_fpm() are exported as labeled Prometheus gauges.

Regression Models

Three specialized regression models (fpm_regression.py):

PrefillRegressionModel: 1D regression sum_prefill_tokens -> wall_time. Estimates TTFT by simulating chunked prefill scheduling (chunks of max_num_batched_tokens).
DecodeRegressionModel: 1D regression sum_decode_kv_tokens -> wall_time. Estimates ITL for total decode load (scheduled + queued + avg decode length).
AggRegressionModel: 2D regression (sum_prefill_tokens, sum_decode_kv_tokens) -> wall_time. Estimates both TTFT (simulated prefill with piggybacked decode) and ITL (decode with average piggybacked prefill).

Scaling Decisions

Prefill/Decode: Scale up if ALL engines’ estimated TTFT/ITL > SLA; scale down if ALL < SLA * sensitivity
Agg: Scale up if (ALL TTFT > SLA) OR (ALL ITL > SLA); scale down if (ALL TTFT < SLA * sensitivity) AND (ALL ITL < SLA * sensitivity)
Only scales by +/-1 per interval (non-blocking with pending-desired guard: metrics continue to be observed while scaling is in progress, but no new scaling action is issued until the previous one completes)

Co-existence with Throughput-Based Scaling

When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor.

Aggregated Mode

In aggregated mode (--mode agg), engines handle both prefill and decode via chunked prefill. The planner maintains both TTFT and ITL regression models but uses per-worker time-averaged metrics (not instantaneous) for regression training to smooth out chunked prefill noise. Scale up if either prefill or decode signals overload; scale down only if both signal underload.

Known Limitations

30-second startup delay: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
Adjustment interval vs scaling latency: If adjustment_interval < time to scale, scaling decisions can pile up. The planner logs warnings but doesn’t queue.
Average-based interpolation: Throughput-based scaling uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
Single DGD scope: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.

Future Work

Multi-DGD coordination for shared-cluster scenarios
Distribution-aware interpolation (beyond mean ISL/OSL)
Adaptive adjustment interval based on observed scaling latency

File Map

File	Purpose
`planner_core.py`	Base planner, shared scaling loop, algorithm core
`disagg_planner.py`	Disaggregated mode orchestrator (prefill + decode)
`agg_planner.py`	Aggregated mode orchestrator (load-based only)
`prefill_planner.py`	Prefill-specific scaling logic
`decode_planner.py`	Decode-specific scaling logic
`load_based_regression.py`	Sliding-window linear regression for load-based scaling
`prometheus.py`	Prometheus/router metrics clients, data classes
`perf_interpolation.py`	NPZ data loading and throughput/latency interpolation
`load_predictor.py`	ARIMA, Prophet, Kalman, Constant predictors
`pre_swept_results_utils.py`	Pre-computed H100/H200 profiling data loader
`kubernetes_connector.py`	K8s API integration for DGD scaling
`kube.py`	Low-level K8s client wrapper
`exceptions.py`	Custom exception hierarchy
`defaults.py`	Default configs, backend name mappings
`planner_argparse.py`	CLI argument definitions