This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the sglang-agg example from examples/backends/sglang/deploy/agg.yaml.
All examples in this guide use the following DGD:
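The authoritative manifest is the file at examples/backends/sglang/deploy/agg.yaml. The abridged sketch below keeps only the fields this guide relies on (DGD name, namespace, service names, replica counts). The apiVersion, the componentType values, and the initial replica counts are assumptions; images, arguments, and resource requests are omitted.

```yaml
# Abridged sketch, not a copy of agg.yaml; consult the repository file for the real manifest.
apiVersion: nvidia.com/v1alpha1   # assumed API group/version
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      componentType: frontend     # assumed value
      replicas: 1                 # assumed initial count
    decode:
      componentType: worker       # assumed value
      replicas: 1                 # assumed initial count
```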
Key identifiers:
- DGD name: sglang-agg
- Namespace: default
- Services: Frontend, decode
- dynamo_namespace label: default-sglang-agg (used for metric filtering)

Dynamo provides flexible autoscaling through the DynamoGraphDeploymentScalingAdapter (DGDSA) resource. To have the operator create a DGDSA for a service, follow the Enabling DGDSA for a Service section below. These adapters implement the Kubernetes Scale subresource, enabling integration with HPA, KEDA, and the Dynamo Planner.
⚠️ Deprecation Notice: The spec.services[X].autoscaling field in DGD is deprecated and ignored. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with autoscaling configured, you’ll see a warning. Remove the field to silence the warning.
How it works:
Autoscalers act on each adapter through its /scale subresource.

After deploying the sglang-agg DGD, verify the auto-created adapters:
When DGDSA is enabled, it becomes the source of truth for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
In practice, replica changes go to the adapter's spec.replicas rather than to spec.services[X].replicas in the DGD.

When DGDSA is enabled, use kubectl scale on the adapter (not the DGD):
By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:
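As a minimal sketch, enabling the adapter for the decode service of the sglang-agg DGD might look like the fragment below. The scalingAdapter.enabled field is the one this guide refers to; its exact placement under the service entry and the replica count shown are assumptions, so check the fragment against the CRD before applying it.

```yaml
# Fragment of the DGD spec; other service fields are omitted.
spec:
  services:
    decode:
      replicas: 1            # illustrative starting count
      scalingAdapter:
        enabled: true        # asks the operator to create a DGDSA for this service
```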
When to enable DGDSA:
When to keep DGDSA disabled (default):
The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
When to use Planner:
How Planner works:
Planner is deployed as a service component within your DGD. It:
Deployment:
The recommended way to deploy Planner is via DynamoGraphDeploymentRequest (DGDR). See the SLA Planner Quick Start for complete instructions.
Example configurations with Planner:
- examples/backends/vllm/deploy/disagg_planner.yaml
- examples/backends/sglang/deploy/disagg_planner.yaml
- examples/backends/trtllm/deploy/disagg_planner.yaml

For more details, see the SLA Planner documentation.
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ native autoscaling solution.
When to use HPA:
For custom metrics (like TTFT or queue depth), consider using KEDA instead - it’s simpler to configure.
Dynamo exports several metrics useful for autoscaling. These are available at the /metrics endpoint on each frontend pod.
See also: For a complete list of all Dynamo metrics, see the Metrics Reference. For Prometheus and Grafana setup, see the Prometheus and Grafana Setup Guide.
Dynamo metrics include these labels for filtering:
When you have multiple DGDs in the same namespace, use dynamo_namespace to filter metrics for a specific DGD.
Using HPA with Prometheus Adapter requires configuring external metrics.
Step 1: Configure Prometheus Adapter
Add this to your Helm values file (e.g., prometheus-adapter-values.yaml):
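A hedged sketch of such a values file follows. It exposes a 95th-percentile TTFT series, computed from the dynamo_frontend_time_to_first_token_seconds histogram, as an external metric. The Prometheus service address, the 5m rate window, the choice of the 95th percentile, and the external metric name dynamo_ttft_p95_seconds are all assumptions for illustration. The rule also assumes your scrape configuration attaches a namespace label to the series.

```yaml
# prometheus-adapter-values.yaml (sketch)
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc  # assumed service address
  port: 9090

rules:
  external:
    # Select the histogram buckets exported by the frontend pods.
    - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "dynamo_ttft_p95_seconds"   # hypothetical external metric name
      # Keep dynamo_namespace in the grouping so an HPA can filter on it.
      metricsQuery: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
          by (le, namespace, dynamo_namespace)
        )
```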
Step 2: Install Prometheus Adapter
Step 3: Verify the metric is available
Step 4: Create the HPA
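One possible HPA for the decode service is sketched below. It targets the sglang-agg-decode adapter and filters the external metric by dynamo_namespace. The HPA name, the adapter's apiVersion, the replica bounds, the 0.5 s target, and the metric name are assumptions; the metric name must match whatever your Prometheus Adapter rule exposes.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa            # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1      # assumed API group/version of the DGDSA CRD
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1                         # illustrative bounds
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_ttft_p95_seconds  # hypothetical; must match the adapter rule
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"
        target:
          type: Value
          value: "500m"                  # 0.5 s TTFT target, illustrative
```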
How it works:
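- The frontend exports the dynamo_frontend_time_to_first_token_seconds histogram.
- Prometheus Adapter aggregates it per dynamo_namespace.
- The HPA selects the series with dynamo_namespace: "default-sglang-agg".
- The HPA scales the sglang-agg-decode adapter.
- The adapter propagates the replica count to the decode service.

Add this rule to your prometheus-adapter-values.yaml (alongside the TTFT rule):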
Then create the HPA:
KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.
Advantages over HPA + Prometheus Adapter:
- Simpler setup: kubectl apply the ScaledObject.

When to use KEDA:
If you have Prometheus Adapter installed, either uninstall it first (helm uninstall prometheus-adapter -n monitoring) or install KEDA with --set metricsServer.enabled=false to avoid API conflicts.
Using the sglang-agg DGD from examples/backends/sglang/deploy/agg.yaml:
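A minimal ScaledObject sketch for the decode adapter is shown below. It queries Prometheus directly, so no adapter rule is involved. The object name, the adapter's apiVersion, the Prometheus address, the query, the threshold, and the timing values are assumptions chosen for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler         # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1      # assumed API group/version of the DGDSA CRD
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1                     # illustrative bounds
  maxReplicaCount: 10
  pollingInterval: 15                    # seconds between metric checks
  cooldownPeriod: 60                     # seconds; illustrative
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090  # assumed address
        query: |
          histogram_quantile(0.95,
            sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
            by (le)
          )
        threshold: "0.5"                 # 0.5 s TTFT target, illustrative
```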
Apply it:
KEDA creates and manages an HPA under the hood:
For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:
When DGDSA is enabled, scale via the adapter:
Verify the scaling:
If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
If you’ve disabled the scaling adapter for a service, edit the DGD directly:
Or edit the YAML (no scalingAdapter.enabled: true means direct edits are allowed):
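For instance, a service entry with no scalingAdapter block could be edited along these lines. The replica count of 3 is illustrative, and other service fields are omitted.

```yaml
# Fragment of the DGD spec for a service without a scaling adapter.
spec:
  services:
    decode:
      replicas: 3   # edited by hand; the DGD holds the replica count directly
```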
Avoid configuring multiple autoscalers for the same service:
Prevent thrashing with appropriate stabilization:
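As one way to express this in an HPA, the behavior fragment below scales up immediately but waits five minutes before scaling down, changing at most one pod per period. The window lengths and policy values are illustrative assumptions, not recommendations from this guide.

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react to load increases right away
    policies:
      - type: Pods
        value: 1                       # add at most one pod per period
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300    # require five minutes of lower demand first
    policies:
      - type: Pods
        value: 1                       # remove at most one pod per period
        periodSeconds: 120
```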
Always configure minimum and maximum replicas in your HPA/KEDA to prevent:
If HPA/KEDA shows <unknown> for metrics:
If you see unstable scaling:
- Increase cooldownPeriod in the KEDA ScaledObject.
- Increase stabilizationWindowSeconds in the HPA behavior.