# Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.
## Feature Matrix

| Category | Feature | Status |
|---|---|---|
| Backend | Local (bare metal) | Deprecated |
| Backend | Kubernetes | Supported |
| LLM Framework | vLLM | Supported |
| LLM Framework | TensorRT-LLM | Supported |
| LLM Framework | SGLang | Supported |
| Serving Type | Aggregated | Unsupported |
| Serving Type | Disaggregated | Supported |
| Scaling Mode | SLA-based (TTFT/ITL targets) | Supported (primary) |
| Scaling Mode | Load-based (KV cache/queue thresholds) | Deprecated |
| Load Predictors | ARIMA | Supported |
| Load Predictors | Prophet | Supported |
| Load Predictors | Kalman filter | Supported |
| Load Predictors | Constant (current = next) | Supported |
| Connectors | KubernetesConnector (native DGD scaling) | Supported |
| Connectors | VirtualConnector (external environments) | Supported |
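
The load-predictor options in the matrix differ only in how they forecast the next interval's load from recent observations. As a point of reference, here is a minimal sketch of the simplest one, the "Constant (current = next)" predictor listed above; the class and method names are illustrative assumptions, not the Planner's actual API.

```python
from collections import deque


class ConstantLoadPredictor:
    """Illustrative sketch (not the Planner's actual API): the constant
    strategy assumes the next interval's load equals the latest observed
    load (current = next)."""

    def __init__(self, window: int = 10):
        # Keep a short history. The constant predictor only needs the last
        # point; statistical predictors would fit a model to this window.
        self.history: deque[float] = deque(maxlen=window)

    def observe(self, requests_per_second: float) -> None:
        """Record the load measured over the last adjustment interval."""
        self.history.append(requests_per_second)

    def predict_next(self) -> float:
        """Forecast the load for the next adjustment interval."""
        return self.history[-1] if self.history else 0.0
```

ARIMA, Prophet, and the Kalman filter fill the same role but fit a statistical model to the observation window instead of echoing the last value.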
## Quick Start

### Prerequisites

- Dynamo platform installed on Kubernetes (Installation Guide)
- kube-prometheus-stack installed (Metrics Setup)
- Pre-deployment profiling completed (Profiling Guide)
### Deploy with DGDR (Recommended)

The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest (DGDR):

```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
This automatically profiles your model and deploys with the SLA planner. See SLA Planner Guide for the full workflow.
### Deploy with DGD (Manual)

For manual control, use the disaggregated planner templates:

```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Documentation

| Document | Description |
|---|---|
| | Deployment, configuration, integration, troubleshooting |
| | DGDR YAML examples, sample configurations, advanced patterns |
| | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| | Scaling algorithm, correction factors, load prediction details |
| | Legacy load-based scaling (deprecated) |
| | Pre-deployment profiling process and configuration |
| | Architecture deep-dive for contributors |
## Configuration Reference

### Key Arguments

| Argument | Default | Description |
|---|---|---|
| | | Dynamo logical namespace |
| | | Backend framework |
| | | Deployment environment |
| | | Seconds between scaling decisions |
| | | Target Time To First Token (ms) |
| | | Target Inter-Token Latency (ms) |
| | | Expected average input sequence length |
| | | Expected average output sequence length |
| | | Prediction model |
| | | Maximum GPUs across all workers |
| | | Minimum replicas per worker type |
| | | GPUs per decode engine |
| | | GPUs per prefill engine |
| | | Observation mode (no actual scaling) |
| | | Disable correction factors |
| | | Path to profiling data (NPZ/JSON) |
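
Several of these arguments (the TTFT/ITL targets, expected ISL/OSL, per-engine GPU counts, and the profiling data) feed the replica calculation covered in the scaling-algorithm documentation. The following is a simplified, hypothetical sketch of that style of calculation; the function, parameter names, and numbers are illustrative assumptions rather than the Planner's actual implementation.

```python
import math


def recommend_replicas(
    predicted_rps: float,           # forecast requests/sec for the next interval
    isl: int,                       # expected average input sequence length (tokens)
    osl: int,                       # expected average output sequence length (tokens)
    prefill_tokens_per_gpu: float,  # profiled prefill throughput that still meets the TTFT target
    decode_tokens_per_gpu: float,   # profiled decode throughput that still meets the ITL target
    gpus_per_prefill_engine: int,
    gpus_per_decode_engine: int,
    min_replicas: int = 1,
):
    """Hypothetical sketch of an SLA-driven replica calculation.

    The idea: convert predicted request rate into prefill and decode token
    rates, divide by the per-GPU throughput that profiling says still meets
    the latency targets, and round up to whole engines.
    """
    prefill_gpus = (predicted_rps * isl) / prefill_tokens_per_gpu
    decode_gpus = (predicted_rps * osl) / decode_tokens_per_gpu

    prefill_replicas = max(min_replicas, math.ceil(prefill_gpus / gpus_per_prefill_engine))
    decode_replicas = max(min_replicas, math.ceil(decode_gpus / gpus_per_decode_engine))
    return prefill_replicas, decode_replicas


# Example: 20 req/s, ISL 3000 / OSL 150, illustrative profiled throughputs.
print(recommend_replicas(20, 3000, 150, 40_000, 1_500, 1, 1))  # -> (2, 2)
```

The correction factors listed in the feature matrix would then adjust the profiled per-GPU throughputs whenever observed TTFT/ITL drifts from what profiling predicted, before the next recommendation is made.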
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| | | Dynamo logical namespace |
| | (required) | Parent DGD K8s resource name |
| | | Prometheus URL |
| | | Port for planner’s own Prometheus metrics |
## Monitoring

### Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
The dashboard shows:

- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
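
The correction-factor panels compare observed latency against what the profiling data predicted for the current load. A minimal sketch of that comparison, with hypothetical names and numbers:

```python
def correction_factor(observed_latency_ms: float, expected_latency_ms: float) -> float:
    """Hypothetical sketch: ratio of observed to expected (profiled) latency.

    A value above 1.0 means the deployment is slower than profiling predicted,
    so the planner should be more conservative (effectively discounting the
    profiled throughput); below 1.0 means it has headroom.
    """
    return observed_latency_ms / expected_latency_ms


# Example: observed TTFT of 240 ms against an expected 200 ms -> factor 1.2.
print(correction_factor(240.0, 200.0))
```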
### Prometheus Metrics

The planner queries the frontend’s `/metrics` endpoint via Prometheus. Required metrics:

- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
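
As a reference for ad-hoc debugging, the sketch below pulls this kind of data through Prometheus's standard HTTP query API. The `/api/v1/query` endpoint is standard Prometheus; the metric names in the example queries are placeholders, since the frontend's actual metric names are not listed here.

```python
import requests

# Prometheus base URL; in-cluster this is typically the kube-prometheus-stack
# service (the exact URL depends on your installation).
PROMETHEUS_URL = "http://prometheus-server.monitoring.svc.cluster.local:9090"

# Placeholder queries -- substitute the frontend's actual metric names.
QUERIES = {
    "request_rate": "rate(frontend_requests_total[1m])",
    "p95_ttft_seconds": (
        "histogram_quantile(0.95, "
        "rate(frontend_time_to_first_token_seconds_bucket[1m]))"
    ),
}


def query_prometheus(promql: str) -> list:
    """Run an instant query against Prometheus's /api/v1/query endpoint."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    for name, promql in QUERIES.items():
        print(name, query_prometheus(promql))
```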