Planner#

The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.

New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

Feature Matrix#

| Category | Feature | Status |
|---|---|---|
| Backend | Local (bare metal) | Deprecated |
| Backend | Kubernetes | Supported |
| LLM Framework | vLLM | Supported |
| LLM Framework | TensorRT-LLM | Supported |
| LLM Framework | SGLang | Supported |
| Serving Type | Aggregated | Unsupported |
| Serving Type | Disaggregated | Supported |
| Scaling Mode | SLA-based (TTFT/ITL targets) | Supported (primary) |
| Scaling Mode | Load-based (KV cache/queue thresholds) | Deprecated |
| Load Predictors | ARIMA | Supported |
| Load Predictors | Prophet | Supported |
| Load Predictors | Kalman filter | Supported |
| Load Predictors | Constant (current = next) | Supported |
| Connectors | KubernetesConnector (native DGD scaling) | Supported |
| Connectors | VirtualConnector (external environments) | Supported |

Quick Start#

Prerequisites#

SLA-driven profiling must be completed before the planner is deployed, and Prometheus must be reachable at the configured endpoint; see the SLA-Driven Profiling guide and the environment variables below.

Deploy with DGD (Manual)#

For manual control, use the disaggregated planner templates:

# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
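
After applying the manifest, you can confirm that the planner started (the pod name below is a placeholder; the actual name depends on your DGD):

# Verify that the planner and worker pods are running
kubectl get pods -n $NAMESPACE
# Tail the planner logs to watch its scaling decisions
kubectl logs -n $NAMESPACE <planner-pod-name> --follow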

Documentation#

| Document | Description |
|---|---|
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA Planner Guide | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| SLA-based Planner | Scaling algorithm, correction factors, load prediction details |
| Load-based Planner | Legacy load-based scaling (deprecated) |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

Configuration Reference#

Key Arguments#

| Argument | Default | Description |
|---|---|---|
| --namespace | $DYN_NAMESPACE or dynamo | Dynamo logical namespace |
| --backend | vllm | Backend framework (vllm, sglang, trtllm) |
| --environment | kubernetes | Deployment environment |
| --adjustment-interval | 180 | Seconds between scaling decisions |
| --ttft | 500.0 | Target Time To First Token (ms) |
| --itl | 50.0 | Target Inter-Token Latency (ms) |
| --isl | 3000 | Expected average input sequence length (tokens) |
| --osl | 150 | Expected average output sequence length (tokens) |
| --load-predictor | arima | Prediction model (arima, prophet, kalman, constant) |
| --max-gpu-budget | 8 | Maximum GPUs across all workers |
| --min-endpoint | 1 | Minimum replicas per worker type |
| --decode-engine-num-gpu | 1 | GPUs per decode engine |
| --prefill-engine-num-gpu | 1 | GPUs per prefill engine |
| --no-operation | false | Observation mode (no actual scaling) |
| --no-correction | false | Disable correction factors |
| --profile-results-dir | profiling_results | Path to profiling data (NPZ/JSON) |
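
As a rough sketch of how these flags combine (the entrypoint below is a placeholder; in practice the arguments are set on the planner component in the DGD manifest rather than run by hand, and flag syntax may differ slightly):

# Hypothetical invocation: observation-only mode with a 4-GPU budget and tighter SLAs
python -m <planner-entrypoint> \
  --environment=kubernetes \
  --backend=vllm \
  --ttft=300 \
  --itl=40 \
  --adjustment-interval=120 \
  --max-gpu-budget=4 \
  --no-operation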

Environment Variables#

| Variable | Default | Description |
|---|---|---|
| DYN_NAMESPACE | dynamo | Dynamo logical namespace |
| DYN_PARENT_DGD_K8S_NAME | (required) | Parent DGD K8s resource name |
| PROMETHEUS_ENDPOINT | http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090 | Prometheus URL |
| PLANNER_PROMETHEUS_PORT | 0 (disabled) | Port for the planner's own Prometheus metrics |
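
For example, the planner container's environment might look like the following (values are the defaults from the table above; DYN_PARENT_DGD_K8S_NAME has no default and must name your DGD resource):

# Illustrative environment; set these on the planner container or in your shell
export DYN_NAMESPACE=dynamo
export DYN_PARENT_DGD_K8S_NAME=<your-dgd-name>
export PROMETHEUS_ENDPOINT=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
export PLANNER_PROMETHEUS_PORT=<port>   # any non-zero port exposes the planner's own metrics; 0 disables them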

Monitoring#

Grafana Dashboard#

Deploy the planner dashboard:

kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml

The dashboard shows:

  • Worker counts and GPU usage over time

  • Observed TTFT, ITL, request rate, sequence lengths

  • Predicted load and recommended replica counts

  • Correction factors (actual vs. expected performance)
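
To view the dashboard locally, one option is to port-forward the Grafana service (the service name below assumes a standard kube-prometheus-stack install and may differ in your cluster):

# Port-forward Grafana and open http://localhost:3000 in a browser
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80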

Prometheus Metrics#

The planner reads frontend metrics from Prometheus, which scrapes the frontend's /metrics endpoint. The following metrics are required:

  • Request count and duration

  • TTFT and ITL distributions

  • Input/output sequence lengths
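
To check that Prometheus is reachable and scraping the frontend, you can port-forward the Prometheus service from the default PROMETHEUS_ENDPOINT above and list the metric names it knows about (the grep pattern is only a heuristic for spotting the frontend's metrics):

# Port-forward Prometheus (service and namespace taken from the default PROMETHEUS_ENDPOINT)
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
# List known metric names and look for the frontend's request/TTFT/ITL series
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | tr ',' '\n' | grep -i frontend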