This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the sglang-agg example from examples/backends/sglang/deploy/agg.yaml.
All examples in this guide use the following DGD:
Key identifiers:
DGD name: sglang-agg
Namespace: default
Services: Frontend, decode
dynamo_namespace: default-sglang-agg (used for metric filtering)
Dynamo provides flexible autoscaling through the DynamoGraphDeploymentScalingAdapter (DGDSA) resource. To have the operator create a DGDSA for a service, follow the Enabling DGDSA for a Service section below. These adapters implement the Kubernetes Scale subresource, enabling integration with HPA, KEDA, and the Dynamo Planner.
⚠️ Deprecation Notice: The spec.services[X].autoscaling field in DGD is deprecated and ignored. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with autoscaling configured, you’ll see a warning. Remove the field to silence the warning.
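How it works: each adapter exposes the /scale subresource, which autoscalers use to set the replica count of the service behind it.
After deploying the sglang-agg DGD, verify the auto-created adapters: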
When DGDSA is enabled, it becomes the source of truth for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
Autoscalers write to the adapter's spec.replicas, which then takes precedence over spec.services[X].replicas in the DGD.
When DGDSA is enabled, use kubectl scale on the adapter (not the DGD):
By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:
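As a minimal sketch, enabling the adapter for the decode service of sglang-agg might look like the fragment below. The scalingAdapter.enabled field is the one this guide refers to. The apiVersion and the layout of services as a map keyed by service name are assumptions, so check them against your installed CRD.

```yaml
# Sketch only: apiVersion and field layout are assumptions, not copied from agg.yaml.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      replicas: 1          # no scalingAdapter block: replicas are managed directly in the DGD
    decode:
      replicas: 1          # initial replica count
      scalingAdapter:
        enabled: true      # ask the operator to create a DGDSA for this service
      # ...remaining service fields from agg.yaml omitted
```

With a fragment like this, only decode would get an adapter, and Frontend would stay under direct replica management.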
When to enable DGDSA:
When to keep DGDSA disabled (default):
The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
When to use Planner:
How Planner works:
Planner is deployed as a service component within your DGD. It:
Deployment:
The recommended way to deploy Planner is via DynamoGraphDeploymentRequest (DGDR). See the SLA Planner Quick Start for complete instructions.
Example configurations with Planner:
examples/backends/vllm/deploy/disagg_planner.yaml
examples/backends/sglang/deploy/disagg_planner.yaml
examples/backends/trtllm/deploy/disagg_planner.yaml
For more details, see the SLA Planner documentation.
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ native autoscaling solution.
When to use HPA:
For custom metrics (like TTFT or queue depth), consider using KEDA instead; it is simpler to configure.
Dynamo exports several metrics useful for autoscaling. These are available at the /metrics endpoint on each frontend pod.
See also: For a complete list of all Dynamo metrics, see the Metrics Reference. For Prometheus and Grafana setup, see the Prometheus and Grafana Setup Guide.
For the full definition of the stage and phase labels and derived-signal formulas, see Stage and phase labels in the Metrics Reference.
Dynamo metrics include these labels for filtering:
When you have multiple DGDs in the same namespace, use dynamo_namespace to filter metrics for a specific DGD.
Using HPA with Prometheus Adapter requires configuring external metrics.
Step 1: Configure Prometheus Adapter
Add this to your Helm values file (e.g., prometheus-adapter-values.yaml):
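The following is a hedged sketch of such a values file, exposing a TTFT quantile as an external metric. The Prometheus URL, the metric name dynamo_ttft_p95_seconds, the 0.95 quantile, and the 5m rate window are illustrative assumptions. It also assumes your scrape configuration attaches a namespace label to the series.

```yaml
# prometheus-adapter-values.yaml (sketch; adjust names and addresses to your cluster)
prometheus:
  url: http://prometheus-operated.monitoring.svc   # assumption: your Prometheus service address
  port: 9090

rules:
  external:
    - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}   # map the Prometheus label to the Kubernetes namespace
      name:
        as: "dynamo_ttft_p95_seconds"          # hypothetical external metric name
      metricsQuery: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
          by (le, namespace, dynamo_namespace))
```

Grouping by dynamo_namespace keeps one series per DGD, so an HPA can select a single deployment by label.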
Step 2: Install Prometheus Adapter
Step 3: Verify the metric is available
Step 4: Create the HPA
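A minimal sketch of an HPA that targets the sglang-agg-decode adapter on an external TTFT metric. The HPA name, the metric name dynamo_ttft_p95_seconds (which must match whatever your Prometheus Adapter rule exposes), the DGDSA apiVersion, the replica bounds, and the 0.5 s target are all assumptions for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa              # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1        # assumption: API group/version of the DGDSA CRD
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_ttft_p95_seconds    # hypothetical; must match your adapter rule
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"
        target:
          type: Value
          value: "500m"                    # 0.5 seconds, expressed as a Kubernetes quantity
```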
How it works:
The Prometheus Adapter rule derives TTFT from the dynamo_frontend_time_to_first_token_seconds histogram.
The rule groups results by dynamo_namespace.
The HPA selects the series with dynamo_namespace: "default-sglang-agg".
The HPA scales the sglang-agg-decode adapter, which controls the decode service.
“Queue depth” here means the number of requests that have entered the frontend but haven’t yet received a first token, i.e. the sum of dynamo_frontend_stage_requests across the preprocess, route, and dispatch stages. This replaces the deprecated dynamo_frontend_queued_requests gauge.
Add this rule to your prometheus-adapter-values.yaml (alongside the TTFT rule):
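One way such a rule could be written is sketched below. The external metric name dynamo_queue_depth is hypothetical, and the label name stage is assumed from the stage names this guide lists; confirm both against the Metrics Reference.

```yaml
# Fragment of prometheus-adapter-values.yaml (sketch)
rules:
  external:
    - seriesQuery: 'dynamo_frontend_stage_requests{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "dynamo_queue_depth"             # hypothetical external metric name
      metricsQuery: |
        sum(dynamo_frontend_stage_requests{stage=~"preprocess|route|dispatch",<<.LabelMatchers>>})
        by (namespace, dynamo_namespace)
```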
Then create the HPA:
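A hedged sketch of a queue-depth HPA. The metric name dynamo_queue_depth, the HPA name, the DGDSA apiVersion, and the target of 10 queued requests per replica are illustrative assumptions. AverageValue divides the metric by the current replica count, which suits a total that should shrink as replicas are added.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue-hpa        # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1        # assumption: API group/version of the DGDSA CRD
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_queue_depth         # hypothetical; must match your adapter rule
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"
        target:
          type: AverageValue
          averageValue: "10"               # illustrative: about 10 queued requests per replica
```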
KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.
Advantages over HPA + Prometheus Adapter:
Setup amounts to running kubectl apply on the ScaledObject.
When to use KEDA:
If you have Prometheus Adapter installed, either uninstall it first (helm uninstall prometheus-adapter -n monitoring) or install KEDA with --set metricsServer.enabled=false to avoid API conflicts.
Using the sglang-agg DGD from examples/backends/sglang/deploy/agg.yaml:
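A minimal ScaledObject sketch for the decode adapter, using KEDA's prometheus trigger. The object name, Prometheus address, quantile, rate window, threshold, and DGDSA apiVersion are assumptions chosen for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler           # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1        # assumption: API group/version of the DGDSA CRD
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15                      # seconds between metric checks
  cooldownPeriod: 60                       # KEDA cooldown setting, in seconds
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090   # assumption
        query: |
          histogram_quantile(0.95, sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m])) by (le))
        threshold: "0.5"                   # illustrative TTFT target in seconds
```

The query is written inline in the trigger, so no Prometheus Adapter rule is involved.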
Apply it:
KEDA creates and manages an HPA under the hood:
For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:
When DGDSA is enabled, scale via the adapter:
Verify the scaling:
If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
If you’ve disabled the scaling adapter for a service, edit the DGD directly:
Or edit the YAML (no scalingAdapter.enabled: true means direct edits are allowed):
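For illustration, a direct edit could look like the fragment below. The apiVersion and the map-style services layout are assumptions, and the other service fields are omitted.

```yaml
# Sketch: decode has no scalingAdapter block, so its replicas are set in the DGD itself.
apiVersion: nvidia.com/v1alpha1            # assumption
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    decode:
      replicas: 3                          # edited directly
      # ...remaining service fields omitted
```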
Avoid configuring multiple autoscalers for the same service:
Prevent thrashing with appropriate stabilization:
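As a sketch, the behavior section of an autoscaling/v2 HPA can scale up quickly while scaling down cautiously. The specific windows and step sizes below are illustrative, not recommendations from this guide.

```yaml
# Fragment of a HorizontalPodAutoscaler (autoscaling/v2) spec
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0        # react to load increases immediately
      policies:
        - type: Pods
          value: 2                         # add at most 2 replicas per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300      # scale down only after 5 minutes of lower recommendations
      policies:
        - type: Pods
          value: 1                         # remove at most 1 replica per period
          periodSeconds: 120
```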
Always configure minimum and maximum replicas in your HPA/KEDA to prevent:
If HPA/KEDA shows <unknown> for metrics:
If you see unstable scaling:
Increase cooldownPeriod in the KEDA ScaledObject.
Increase stabilizationWindowSeconds in the HPA behavior section.