
Copyright © 2026, NVIDIA Corporation.

Planner Guide

The Dynamo Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.

For a quick overview, see the Planner overview. For architecture internals, see Planner Design.

Scaling Modes

The planner supports two scaling modes that can be used independently or together:

  • Throughput-based scaling (enable_throughput_scaling: true): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the Profiler.
  • Load-based scaling (enable_load_scaling: true): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data. Requires the KV Router — see Current Limitations.

When to use which:

  • Enable throughput-based scaling whenever profiling data is available. It provides stable, prediction-based capacity planning.
  • Enable load-based scaling when traffic is bursty. It reacts quickly to real-time load changes.
  • Enable both to combine their strengths: throughput-based scaling provides a capacity floor, and load-based scaling handles bursts above it. When both are enabled, use a longer throughput_adjustment_interval.
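
As a rough sketch of the combined setup, the fragment below enables both modes and lengthens the throughput interval. It assumes the features.planner placement described in the PlannerConfig Reference; the value 300 is illustrative only, chosen to be longer than the documented default of 180 seconds.

```yaml
features:
  planner:
    enable_throughput_scaling: true       # capacity floor from profiling data
    enable_load_scaling: true             # reacts to bursts above the floor
    throughput_adjustment_interval: 300   # illustrative value, longer than the 180 s default
    load_adjustment_interval: 5           # default; must stay shorter than the throughput interval
```

The right interval depends on the workload, so treat these numbers as a starting point rather than a recommendation.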

PlannerConfig Reference

The planner is configured via a PlannerConfig JSON/YAML object. When using the profiler, this is placed under the features.planner section of the DGDR spec:

features:
  planner:
    enable_throughput_scaling: true
    enable_load_scaling: false
    pre_deployment_sweeping_mode: rapid
    mode: disagg
    backend: vllm

Scaling Mode Fields

| Field | Type | Default | Description |
|---|---|---|---|
| enable_throughput_scaling | bool | true | Enable throughput-based scaling (requires pre-deployment profiling data). |
| enable_load_scaling | bool | false | Enable load-based scaling (no pre-deployment profiling data required). |

At least one scaling mode must be enabled.
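
For illustration, a load-only configuration might look like the sketch below. It assumes the same features.planner placement, and it assumes that skipping the sweep with pre_deployment_sweeping_mode: none is acceptable when only load-based scaling is on, since that mode does not need profiling data. Load-based scaling also requires the KV Router.

```yaml
features:
  planner:
    enable_throughput_scaling: false     # default is true, so it must be turned off explicitly
    enable_load_scaling: true            # at least one mode has to stay enabled
    pre_deployment_sweeping_mode: none   # assumption: no interpolation data needed for load-only
```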

Pre-Deployment Sweeping

| Field | Type | Default | Description |
|---|---|---|---|
| pre_deployment_sweeping_mode | string | rapid | How to generate engine interpolation data: rapid (AIC simulation, ~30s), thorough (real GPUs, 2-4h), or none (skip). |

When throughput-based scaling is enabled, the planner needs interpolation curves that map ISL to TTFT (prefill) and KV-cache utilization to ITL (decode). The profiler generates this data based on the pre_deployment_sweeping_mode setting. See the Profiler Guide for details on how this data is produced.

Throughput-Based Scaling Settings

| Field | Type | Default | Description |
|---|---|---|---|
| throughput_adjustment_interval | int | 180 | Seconds between throughput-based scaling decisions. |
| min_endpoint | int | 1 | Minimum number of engine endpoints to maintain. |
| max_gpu_budget | int | 8 | Maximum total GPUs the planner may allocate. |
| ttft | float | 500.0 | TTFT SLA target (ms) for scaling decisions. |
| itl | float | 50.0 | ITL SLA target (ms) for scaling decisions. |
| no_correction | bool | true | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
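
The fields above could be combined as in the following sketch, which tightens the SLA targets and raises the GPU budget. All non-default values are illustrative assumptions, and the fragment assumes the features.planner placement shown in the PlannerConfig Reference.

```yaml
features:
  planner:
    enable_throughput_scaling: true
    throughput_adjustment_interval: 180   # default, in seconds
    min_endpoint: 2                       # illustrative: keep at least two endpoints
    max_gpu_budget: 16                    # illustrative: cap on total GPUs
    ttft: 300.0                           # illustrative TTFT target in ms (default 500.0)
    itl: 30.0                             # illustrative ITL target in ms (default 50.0)
```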

Load-Based Scaling Settings

| Field | Type | Default | Description |
|---|---|---|---|
| load_adjustment_interval | int | 5 | Seconds between load-based scaling decisions. Must be shorter than throughput_adjustment_interval. |
| load_learning_window | int | 50 | Sliding window size for regression model. |
| load_scaling_down_sensitivity | int | 80 | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
| load_metric_samples | int | 10 | Number of metric samples to collect per decision. |
| load_min_observations | int | 5 | Minimum observations before making scaling decisions. |
| load_router_metrics_url | string | null | Router metrics endpoint. Auto-discovered in Kubernetes mode. |
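
A possible load-based tuning, sketched under the same features.planner assumption, is shown below. Only load_scaling_down_sensitivity departs from its default; the value 40 is illustrative and, given the documented 0 to 100 scale, should make scale-down more conservative than the default of 80.

```yaml
features:
  planner:
    enable_load_scaling: true
    load_adjustment_interval: 5           # default; shorter than throughput_adjustment_interval
    load_learning_window: 50              # default sliding window for the regression model
    load_scaling_down_sensitivity: 40     # illustrative: less aggressive than the default 80
    load_metric_samples: 10               # default
    load_min_observations: 5              # default
    load_router_metrics_url: null         # unset, relying on auto-discovery in Kubernetes mode
```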

General Settings

| Field | Type | Default | Description |
|---|---|---|---|
| mode | string | disagg | Planner mode: disagg, prefill, decode, or agg. |
| backend | string | vllm | Backend: vllm, sglang, trtllm, or mocker. |
| environment | string | kubernetes | Runtime environment: kubernetes, virtual, or global-planner. |
| namespace | string | env DYN_NAMESPACE | Kubernetes namespace for the deployment. |

Traffic Prediction Settings

| Field | Type | Default | Description |
|---|---|---|---|
| load_predictor | string | arima | Prediction method: constant, arima, kalman, or prophet. |
| load_predictor_log1p | bool | false | Apply log1p transform to load data before prediction. |
| prophet_window_size | int | 50 | Window size (seconds) for Prophet predictor. |
| load_predictor_warmup_trace | string | null | Path to a warmup trace file for bootstrapping predictions. |
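
As an illustration, the sketch below switches the predictor to Prophet and turns on the log1p transform, which compresses large load values before prediction. It assumes the features.planner placement and assumes that prophet_window_size only takes effect when load_predictor is set to prophet.

```yaml
features:
  planner:
    enable_throughput_scaling: true
    load_predictor: prophet          # one of: constant, arima, kalman, prophet
    prophet_window_size: 50          # default
    load_predictor_log1p: true       # illustrative: default is false
```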

Kalman Filter Settings

| Field | Type | Default | Description |
|---|---|---|---|
| kalman_q_level | float | 1.0 | Process noise for level component. |
| kalman_q_trend | float | 0.1 | Process noise for trend component. |
| kalman_r | float | 10.0 | Measurement noise. |
| kalman_min_points | int | 5 | Minimum data points before Kalman predictions activate. |
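
A Kalman-based predictor could be selected as in this sketch, again assuming the features.planner placement. The values shown are the documented defaults. In a standard Kalman filter, raising the measurement noise relative to the process noise makes the estimate follow new samples less closely; whether the planner's predictor behaves exactly this way is an assumption.

```yaml
features:
  planner:
    enable_throughput_scaling: true
    load_predictor: kalman
    kalman_q_level: 1.0      # default process noise, level component
    kalman_q_trend: 0.1      # default process noise, trend component
    kalman_r: 10.0           # default measurement noise
    kalman_min_points: 5     # default number of points before predictions activate
```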

Integration with Profiler

When the profiler runs with the planner enabled, it:

  1. Selects the best prefill and decode engine configurations
  2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
  3. Saves the PlannerConfig and profiling data into separate Kubernetes ConfigMaps
  4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps

The planner receives its config via --config /path/to/planner_config.json, which is mounted from the planner-config-XXXX ConfigMap. Profiling data is mounted from the planner-profile-data-XXXX ConfigMap.
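
The mount mechanism can be pictured with a generic Kubernetes pod-spec fragment. This is not the DGD the profiler generates: the container name, volume name, and the assumption that the ConfigMap holds a key named planner_config.json are all hypothetical. The XXXX suffix and the /path/to placeholder are kept exactly as written above.

```yaml
# Hypothetical pod-spec fragment, for illustration only.
spec:
  containers:
    - name: planner                      # hypothetical container name
      args: ["--config", "/path/to/planner_config.json"]
      volumeMounts:
        - name: planner-config           # hypothetical volume name
          mountPath: /path/to            # placeholder directory from the text above
  volumes:
    - name: planner-config
      configMap:
        name: planner-config-XXXX        # suffix left as a placeholder
```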

See the Profiler Guide for the full profiling workflow and how to configure pre-deployment sweeping.

Hierarchical Deployments

If you want one public endpoint for a model but multiple private DGDs optimized for different request classes, use a hierarchical deployment:

  • one control DGD with Frontend, GlobalRouter, and GlobalPlanner
  • one or more prefill pool DGDs
  • one or more decode pool DGDs

In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See the Global Planner Guide.

See Also

  • Planner overview — Why LLM inference needs a different autoscaler
  • Planner Design — Architecture and algorithm internals
  • Planner Examples — DGDR YAML examples, sample configurations, advanced patterns
  • Global Planner Guide — Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments
  • Profiler Guide — How profiling data is generated