Autoscaling | NVIDIA Dynamo Documentation

This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the sglang-agg example from examples/backends/sglang/deploy/agg.yaml.

Example DGD

All examples in this guide use the following DGD:

# examples/backends/sglang/deploy/agg.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1

    decode:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"

Key identifiers:

DGD name: sglang-agg
Namespace: default
Services: Frontend, decode
dynamo_namespace label: default-sglang-agg (used for metric filtering)

Overview

Dynamo provides flexible autoscaling through the DynamoGraphDeploymentScalingAdapter (DGDSA) resource. To have the operator create a DGDSA for a service, follow the Enabling DGDSA for a Service section below. These adapters implement the Kubernetes Scale subresource, enabling integration with:

Autoscaler	Description	Best For
KEDA	Event-driven autoscaling (recommended)	Most use cases
Kubernetes HPA	Native horizontal pod autoscaling	Simple CPU/memory-based scaling
Dynamo Planner	LLM-aware autoscaling with SLA optimization	Production LLM workloads
Custom Controllers	Any scale-subresource-compatible controller	Custom requirements

⚠️ Deprecation Notice: The spec.services[X].autoscaling field in DGD is deprecated and ignored. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with autoscaling configured, you’ll see a warning. Remove the field to silence the warning.

Architecture

┌──────────────────────────────────┐          ┌─────────────────────────────────────┐
│   DynamoGraphDeployment          │          │   Scaling Adapters (auto-created)   │
│   "sglang-agg"                   │          │   (one per service)                 │
├──────────────────────────────────┤          ├─────────────────────────────────────┤
│                                  │          │                                     │
│  spec.services:                  │          │  ┌─────────────────────────────┐    │      ┌──────────────────┐
│                                  │          │  │ sglang-agg-frontend         │◄───┼──────│   Autoscalers    │
│    ┌────────────────────────┐◄───┼──────────┼──│ spec.replicas: 1            │    │      │                  │
│    │ Frontend: 1 replica    │    │          │  └─────────────────────────────┘    │      │  • KEDA          │
│    └────────────────────────┘    │          │                                     │      │  • HPA           │
│                                  │          │  ┌─────────────────────────────┐    │      │  • Planner       │
│    ┌────────────────────────┐◄───┼──────────┼──│ sglang-agg-decode           │◄───┼──────│  • Custom        │
│    │ decode:   1 replica    │    │          │  │ spec.replicas: 1            │    │      │                  │
│    └────────────────────────┘    │          │  └─────────────────────────────┘    │      └──────────────────┘
│                                  │          │                                     │
└──────────────────────────────────┘          └─────────────────────────────────────┘

How it works:

You deploy a DGD with services (Frontend, decode)
The operator creates one DGDSA for each service that has scalingAdapter.enabled: true
Autoscalers (KEDA, HPA, Planner) target the adapters via /scale subresource
Adapter controller syncs replica changes to the DGD
DGD controller reconciles the underlying pods

Viewing Scaling Adapters

After deploying the sglang-agg DGD with the scaling adapter enabled for each service, list the adapters the operator created:

$ kubectl get dgdsa -n default

# Example output:
# NAME                  DGD         SERVICE    REPLICAS   AGE
# sglang-agg-frontend   sglang-agg  Frontend   1          5m
# sglang-agg-decode     sglang-agg  decode     1          5m

Replica Ownership Model

When DGDSA is enabled, it becomes the source of truth for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.

How It Works

DGDSA owns replicas: Autoscalers (HPA, KEDA, Planner) update the DGDSA’s spec.replicas
DGDSA syncs to DGD: The DGDSA controller writes the replica count to the DGD’s service
Direct DGD edits blocked: A validating webhook prevents users from directly editing spec.services[X].replicas in the DGD
Controllers allowed: Only authorized controllers (operator, Planner) can modify DGD replicas

Manual Scaling with DGDSA Enabled

When DGDSA is enabled, use kubectl scale on the adapter (not the DGD):

# ✅ Correct - scale via DGDSA
$ kubectl scale dgdsa sglang-agg-decode --replicas=3

# ❌ Blocked - direct DGD edit rejected by webhook
$ kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
#        use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead

Enabling DGDSA for a Service

By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
spec:
  services:
    Frontend:
      replicas: 2        # ← No DGDSA by default, direct edits allowed

    decode:
      replicas: 1
      scalingAdapter:
        enabled: true    # ← DGDSA created, managed via adapter

When to enable DGDSA:

You want to use HPA, KEDA, or Planner for autoscaling
You want a clear separation between “desired scale” (adapter) and “deployment config” (DGD)
You want protection against accidental direct replica edits

When to keep DGDSA disabled (default):

You want simple, manual replica management
You don’t need autoscaling for that service
You prefer direct DGD edits over adapter-based scaling

Autoscaling with Dynamo Planner

The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.

When to use Planner:

You want LLM-optimized autoscaling out of the box
You need coordinated scaling across prefill/decode services
You want SLA-driven scaling (e.g., target TTFT < 500ms)

How Planner works:

Planner is deployed as a service component within your DGD. It:

Queries Prometheus for frontend metrics (request rate, latency, etc.)
Uses profiling data to predict optimal replica counts
Scales prefill/decode workers to meet SLA targets

Deployment:

The recommended way to deploy Planner is via DynamoGraphDeploymentRequest (DGDR). See the SLA Planner Quick Start for complete instructions.

Example configurations with Planner:

examples/backends/vllm/deploy/disagg_planner.yaml
examples/backends/sglang/deploy/disagg_planner.yaml
examples/backends/trtllm/deploy/disagg_planner.yaml

For more details, see the SLA Planner documentation.

Autoscaling with Kubernetes HPA

The Horizontal Pod Autoscaler (HPA) is Kubernetes’ native autoscaling solution.

When to use HPA:

You have simple, predictable scaling requirements
You want to use standard Kubernetes tooling
You need CPU or memory-based scaling

For custom metrics (like TTFT or queue depth), consider using KEDA instead - it’s simpler to configure.

Basic HPA (CPU-based)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
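
To make the effect of averageUtilization: 70 concrete, the sketch below applies the replica formula described in the Kubernetes HPA documentation: desired = ceil(currentReplicas × currentMetric / targetMetric), left unchanged when the ratio is within the default 10% tolerance, then clamped to minReplicas and maxReplicas. The function name hpa_desired_replicas and the sample utilization values are hypothetical, and the sketch ignores the behavior policies and stabilization windows.

```shell
# Hypothetical helper: approximate the replica count an HPA would request.
# Arguments: current_replicas current_metric target_metric min_replicas max_replicas
hpa_desired_replicas() {
  awk -v cur="$1" -v metric="$2" -v target="$3" -v min="$4" -v max="$5" 'BEGIN {
    ratio = metric / target
    diff = ratio - 1
    if (diff < 0) diff = -diff
    if (diff <= 0.1) {
      desired = cur + 0                 # within tolerance: keep the current count
    } else {
      raw = cur * ratio
      desired = int(raw)
      if (desired < raw) desired++      # round up (ceil) for positive values
    }
    if (desired < min + 0) desired = min + 0
    if (desired > max + 0) desired = max + 0
    print desired
  }'
}

hpa_desired_replicas 2 105 70 1 10   # 105% CPU vs 70% target: scale 2 -> 3
hpa_desired_replicas 2 72 70 1 10    # within tolerance: stay at 2
hpa_desired_replicas 8 140 70 1 10   # 16 requested, capped at maxReplicas 10
```

The same ratio logic applies to the External metrics with target type Value used later in this guide.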

HPA with Dynamo Metrics

Dynamo exports several metrics useful for autoscaling. These are available at the /metrics endpoint on each frontend pod.

See also: For a complete list of all Dynamo metrics, see the Metrics Reference. For Prometheus and Grafana setup, see the Prometheus and Grafana Setup Guide.

Available Dynamo Metrics

Metric	Type	Description	Good for scaling
`dynamo_frontend_active_requests`	Gauge	Total concurrent requests from HTTP entry to response complete	✅ All services
`dynamo_frontend_stage_requests{stage,phase}`	Gauge	Requests currently in a given frontend pipeline stage (`preprocess`, `route`, `dispatch`)	✅ Workers — use `sum(...)` for queue-depth behavior, or `stage="dispatch"` for backend-prefill saturation
`dynamo_frontend_time_to_first_token_seconds`	Histogram	TTFT latency	✅ Workers
`dynamo_frontend_inter_token_latency_seconds`	Histogram	ITL latency	✅ Decode
`dynamo_frontend_request_duration_seconds`	Histogram	Total request duration	⚠️ General
`dynamo_frontend_inflight_requests`	Gauge	Concurrent requests to engine	⚠️ Deprecated — use `dynamo_frontend_active_requests`
`dynamo_frontend_queued_requests`	Gauge	Requests waiting in HTTP queue	⚠️ Deprecated — use `sum(dynamo_frontend_stage_requests)` across `preprocess` + `route` + `dispatch`

For the full definition of the stage and phase labels and derived-signal formulas, see Stage and phase labels in the Metrics Reference.

Metric Labels

Dynamo metrics include these labels for filtering:

Label	Description	Example
`dynamo_namespace`	Unique DGD identifier (`{k8s-namespace}-{dgd-name}`)	`default-sglang-agg`
`model`	Model being served	`Qwen/Qwen3-0.6B`

When you have multiple DGDs in the same namespace, use dynamo_namespace to filter metrics for a specific DGD.
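
Both the dynamo_namespace label and the adapter names used in this guide follow fixed patterns: {k8s-namespace}-{dgd-name} for the label, and the DGD name plus the lowercased service name for the adapter. As a minimal sketch, the helpers below derive both strings so they can be reused in scripts. The function names are hypothetical and not part of Dynamo; they only format strings and do not contact the cluster.

```shell
# Hypothetical helpers for illustration only.

# dynamo_namespace label: {k8s-namespace}-{dgd-name}
dynamo_namespace() {
  printf '%s-%s\n' "$1" "$2"
}

# Scaling adapter name: {dgd-name}-{service name in lowercase}
adapter_name() {
  printf '%s-%s\n' "$1" "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

dynamo_namespace default sglang-agg   # default-sglang-agg
adapter_name sglang-agg Frontend      # sglang-agg-frontend
adapter_name sglang-agg decode        # sglang-agg-decode
```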

Example: Scale Decode Service Based on TTFT

Using HPA with Prometheus Adapter requires configuring external metrics.

Step 1: Configure Prometheus Adapter

Add this to your Helm values file (e.g., prometheus-adapter-values.yaml):

# prometheus-adapter-values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090

rules:
  external:
  # TTFT p95 from frontend - used to scale decode
  - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      as: "dynamo_ttft_p95_seconds"
    metricsQuery: |
      histogram_quantile(0.95,
        sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
        by (le, namespace, dynamo_namespace)
      )

Step 2: Install Prometheus Adapter

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update

$ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
>   -n monitoring --create-namespace \
>   -f prometheus-adapter-values.yaml

Step 3: Verify the metric is available

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<your-namespace>/dynamo_ttft_p95_seconds" | jq

Step 4: Create the HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode              # ← DGD name + service name (lowercase)
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_ttft_p95_seconds
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"  # ← {namespace}-{dgd-name}
      target:
        type: Value
        value: "500m"  # 500m = 0.5 (seconds); scale up when TTFT p95 > 500ms
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 0      # Scale up immediately
      policies:
      - type: Pods
        value: 2
        periodSeconds: 30
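
The target value "500m" uses Kubernetes quantity notation, where the m suffix means one thousandth. Because the TTFT metric is reported in seconds, 500m corresponds to 0.5 s. The sketch below converts such values to decimals. The function name quantity_to_decimal is hypothetical, and it handles only plain numbers and the m suffix, not the full quantity grammar.

```shell
# Hypothetical helper: convert a quantity with an optional "m" (milli) suffix
# to a decimal number. Other suffixes (k, Mi, ...) are not handled.
quantity_to_decimal() {
  case "$1" in
    *m) awk -v v="${1%m}" 'BEGIN { printf "%g\n", v / 1000 }' ;;
    *)  awk -v v="$1"     'BEGIN { printf "%g\n", v + 0 }' ;;
  esac
}

quantity_to_decimal 500m   # 0.5   (the HPA target: 0.5 s = 500 ms)
quantity_to_decimal 45m    # 0.045 (a measured TTFT p95 of 45 ms)
quantity_to_decimal 10     # 10
```

The same notation appears in kubectl get hpa output, where a TARGETS value such as 45m/500m reads as 0.045 s measured against a 0.5 s target.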

How it works:

Frontend pods export dynamo_frontend_time_to_first_token_seconds histogram
Prometheus Adapter calculates p95 TTFT per dynamo_namespace
HPA monitors this metric filtered by dynamo_namespace: "default-sglang-agg"
When TTFT p95 > 500ms, HPA scales up the sglang-agg-decode adapter
Adapter controller syncs the replica count to the DGD’s decode service
More decode workers are created, reducing TTFT
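
The p95 value in the second step comes from histogram_quantile, which estimates a quantile from cumulative bucket counts. As a rough illustration, the sketch below applies linear interpolation inside the bucket that contains the target rank, the general approach Prometheus documents for classic histograms. The function name estimate_quantile and the bucket counts are invented for illustration; the sketch assumes 0 < q <= 1 and sorted input, and it skips edge cases that the real function handles.

```shell
# Hypothetical helper: estimate a quantile from cumulative histogram buckets.
# Argument: q (for example 0.95)
# stdin: one "<le> <cumulative_count>" pair per line, sorted by le,
#        ending with the +Inf bucket.
estimate_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; count[NR] = $2 + 0 }
    END {
      if (NR < 2 || count[NR] == 0) { print "NaN"; exit }
      rank = q * count[NR]
      b = 1
      while (b < NR && count[b] < rank) b++
      # Rank falls in the +Inf bucket: report the highest finite bound.
      if (b == NR) { print le[NR - 1]; exit }
      lower = (b == 1) ? 0 : le[b - 1] + 0
      below = (b == 1) ? 0 : count[b - 1]
      printf "%g\n", lower + (le[b] - lower) * (rank - below) / (count[b] - below)
    }
  '
}

# Illustrative TTFT buckets (seconds): 90 of 100 requests finished within 0.5 s.
estimate_quantile 0.95 <<'EOF'
0.1 10
0.25 40
0.5 90
1 100
+Inf 100
EOF
```

With these sample buckets the estimate is 0.75 s, which exceeds the 0.5 s target and would lead the HPA to scale up.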

Example: Scale Based on Queue Depth

“Queue depth” here means the number of requests that have entered the frontend but haven’t yet received a first token — i.e. the sum of dynamo_frontend_stage_requests across the preprocess, route, and dispatch stages. This replaces the deprecated dynamo_frontend_queued_requests gauge.

Add this rule to your prometheus-adapter-values.yaml (alongside the TTFT rule):

# Add to rules.external in prometheus-adapter-values.yaml
- seriesQuery: 'dynamo_frontend_stage_requests{namespace!="",stage=~"preprocess|route|dispatch"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "dynamo_frontend_pending_requests"
  metricsQuery: |
    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)

Then create the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_frontend_pending_requests
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"
      target:
        type: Value
        value: "10"  # Scale up when queue > 10 requests

Autoscaling with KEDA (Recommended)

KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.

Advantages over HPA + Prometheus Adapter:

No Prometheus Adapter configuration needed
PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
Easy to update - just kubectl apply the ScaledObject
Can scale to zero when idle
Supports multiple triggers per object

When to use KEDA:

You want simpler configuration (no Prometheus Adapter to manage)
You need event-driven scaling (e.g., queue depth, Kafka, etc.)
You want to scale to zero when idle

Installing KEDA

# Add KEDA Helm repo
$ helm repo add kedacore https://kedacore.github.io/charts
$ helm repo update

# Install KEDA
$ helm install keda kedacore/keda \
>   --namespace keda \
>   --create-namespace

# Verify installation
$ kubectl get pods -n keda

If you have Prometheus Adapter installed, either uninstall it first (helm uninstall prometheus-adapter -n monitoring) or install KEDA with --set metricsServer.enabled=false to avoid API conflicts.

Example: Scale Decode Based on TTFT

Using the sglang-agg DGD from examples/backends/sglang/deploy/agg.yaml:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15      # Check metrics every 15 seconds
  cooldownPeriod: 60       # Wait 60s after the last active trigger before scaling to zero (N -> 1 scale-down follows HPA behavior)
  triggers:
  - type: prometheus
    metadata:
      # Update this URL to match your Prometheus service
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: dynamo_ttft_p95
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le)
        )
      threshold: "0.5"              # Scale up when TTFT p95 > 500ms (0.5 seconds)
      activationThreshold: "0.1"    # Activation (0 -> 1) threshold of 100ms; only relevant when minReplicaCount is 0

Apply it:

$ kubectl apply -f sglang-agg-decode-scaler.yaml

Verify KEDA Scaling

# Check ScaledObject status
$ kubectl get scaledobject -n default

# KEDA creates an HPA under the hood - you can see it
$ kubectl get hpa -n default

# Example output:
# NAME                                REFERENCE                                              TARGETS      MINPODS   MAXPODS   REPLICAS
# keda-hpa-sglang-agg-decode-scaler   DynamoGraphDeploymentScalingAdapter/sglang-agg-decode  45m/500m     1         10        1

# Get detailed status
$ kubectl describe scaledobject sglang-agg-decode-scaler -n default

Example: Scale Based on Queue Depth

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-queue-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: dynamo_frontend_pending_requests
      query: |
        sum(dynamo_frontend_stage_requests{dynamo_namespace="default-sglang-agg",stage=~"preprocess|route|dispatch"})
      threshold: "10"    # Scale up when queue > 10 requests
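
To reason about what a threshold of 10 means in practice, the sketch below assumes KEDA's default trigger metricType of AverageValue, under which the generated HPA divides the query result by the threshold and rounds up, so the threshold behaves like a per-replica target. The function name keda_desired_replicas and the sample queue depths are hypothetical, and the sketch omits the HPA tolerance and scaling policies.

```shell
# Hypothetical helper: approximate replicas for an AverageValue target.
# Arguments: query_result threshold min_replicas max_replicas
keda_desired_replicas() {
  awk -v value="$1" -v threshold="$2" -v min="$3" -v max="$4" 'BEGIN {
    raw = value / threshold
    desired = int(raw)
    if (desired < raw) desired++          # round up (ceil) for positive values
    if (desired < min + 0) desired = min + 0
    if (desired > max + 0) desired = max + 0
    print desired
  }'
}

keda_desired_replicas 35 10 1 10    # 35 pending requests: 4 replicas
keda_desired_replicas 5 10 1 10     # light load: held at minReplicaCount 1
keda_desired_replicas 250 10 1 10   # heavy load: capped at maxReplicaCount 10
```

Under this assumption the threshold reads as roughly 10 pending requests per decode replica rather than an absolute queue limit.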

How KEDA Works

KEDA creates and manages an HPA under the hood:

┌──────────────────────────────────────────────────────────────────────┐
│  You create: ScaledObject                                            │
│    - scaleTargetRef: sglang-agg-decode                               │
│    - triggers: prometheus query                                      │
└──────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│  KEDA Operator automatically creates: HPA                            │
│    - name: keda-hpa-sglang-agg-decode-scaler                         │
│    - scaleTargetRef: sglang-agg-decode                               │
│    - metrics: External (from KEDA metrics server)                    │
└──────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│  DynamoGraphDeploymentScalingAdapter: sglang-agg-decode              │
│    - spec.replicas: updated by HPA                                   │
└──────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│  DynamoGraphDeployment: sglang-agg                                   │
│    - spec.services.decode.replicas: synced from adapter              │
└──────────────────────────────────────────────────────────────────────┘

Mixed Autoscaling

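You can use different autoscaling strategies for different services, which is especially useful for disaggregated deployments (prefill + decode). The example below applies HPA to the Frontend and KEDA to the decode service of sglang-agg: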

---
# HPA for Frontend (CPU-based)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

---
# KEDA for Decode (TTFT-based)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le)
        )
      threshold: "0.5"

Manual Scaling

With DGDSA Enabled

When DGDSA is enabled, scale via the adapter:

$ kubectl scale dgdsa sglang-agg-decode -n default --replicas=3

Verify the scaling:

$ kubectl get dgdsa sglang-agg-decode -n default

# Output:
# NAME                DGD         SERVICE   REPLICAS   AGE
# sglang-agg-decode   sglang-agg  decode    3          10m

If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.

With DGDSA Disabled (default)

If the scaling adapter is not enabled for a service (the default), edit the DGD directly:

$ kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'

Or edit the YAML (no scalingAdapter.enabled: true means direct edits are allowed):

spec:
  services:
    decode:
      replicas: 3
      # No scalingAdapter.enabled means replicas can be edited directly

Best Practices

1. Choose One Autoscaler Per Service

Avoid configuring multiple autoscalers for the same service:

Configuration	Status
HPA for frontend, Planner for prefill/decode	✅ Good
KEDA for all services	✅ Good
Planner only (default)	✅ Good
HPA + Planner both targeting decode	❌ Bad - they will fight

2. Use Appropriate Metrics

Service Type	Recommended Metrics	Dynamo Metric
Frontend	CPU utilization, request rate	`dynamo_frontend_requests_total`
Prefill	Dispatch-stage depth (backend prefill saturation), TTFT	`dynamo_frontend_stage_requests{stage="dispatch"}`, `dynamo_frontend_time_to_first_token_seconds`
Decode	ITL, active concurrency	`dynamo_frontend_inter_token_latency_seconds`, `dynamo_frontend_active_requests`

3. Configure Stabilization Windows

Prevent thrashing with appropriate stabilization:

# HPA
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately

# KEDA
spec:
  cooldownPeriod: 300                # Wait 5 min after the last active trigger before scaling to zero
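
As described in the Kubernetes HPA documentation, a scale-down stabilization window makes the controller use the highest replica recommendation computed within that window, so a brief dip in load does not remove pods immediately. The sketch below illustrates that selection. The function name stabilized_recommendation and the sample history are hypothetical.

```shell
# Hypothetical helper: pick the replica count a scale-down would use.
# Argument: stabilization window in seconds
# stdin: "<age_seconds> <recommended_replicas>" per line (age 0 = newest).
stabilized_recommendation() {
  awk -v window="$1" '
    $1 + 0 <= window + 0 && $2 + 0 > best { best = $2 + 0 }
    END { print best + 0 }
  '
}

# The newest recommendation is 3, but 5 was recommended two minutes ago.
stabilized_recommendation 300 <<'EOF'
0 3
120 5
400 8
EOF
```

With a 300-second window the result is 5, because the 400-second-old recommendation has expired and the 120-second-old one has not. With a window of 0 only the newest recommendation (3) counts.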

4. Set Sensible Min/Max Replicas

Always configure minimum and maximum replicas in your HPA/KEDA to prevent:

Scaling to zero (unless intentional)
Unbounded scaling that exhausts cluster resources

Troubleshooting

Adapters Not Created

# Check DGD status
$ kubectl describe dgd sglang-agg -n default

# Check operator logs
$ kubectl logs -n dynamo-system deployment/dynamo-operator

Scaling Not Working

# Check adapter status
$ kubectl describe dgdsa sglang-agg-decode -n default

# Check HPA/KEDA status
$ kubectl describe hpa sglang-agg-decode-hpa -n default
$ kubectl describe scaledobject sglang-agg-decode-scaler -n default

# Verify metrics are available in Kubernetes metrics API
$ kubectl get --raw /apis/external.metrics.k8s.io/v1beta1

Metrics Not Available

If HPA/KEDA shows <unknown> for metrics:

# Check if Dynamo metrics are being exported
# (kubectl port-forward blocks, so run it in a separate terminal)
$ kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
$ curl http://localhost:8000/metrics | grep dynamo_frontend

# Example output (note: stage_requests has no `model` label; it is per frontend pod):
# dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
# dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
# dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 0
# dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2
# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2        # deprecated
# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5      # deprecated

# Verify Prometheus is scraping the metrics
$ kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then query: dynamo_frontend_time_to_first_token_seconds_bucket

# Check KEDA operator logs
$ kubectl logs -n keda deployment/keda-operator
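
To cross-check the queue-depth signal against the raw frontend output, the stage gauges can be summed locally. The sketch below reads Prometheus text-format lines on stdin and adds up dynamo_frontend_stage_requests across the preprocess, route, and dispatch stages. The function name pending_requests is hypothetical, the sample values are illustrative, and the parsing assumes each sample line ends with the value (no trailing timestamp).

```shell
# Hypothetical helper: sum the stage gauges that make up "queue depth".
# stdin: Prometheus text-format metric lines.
pending_requests() {
  awk '
    /^dynamo_frontend_stage_requests[{]/ && /stage="(preprocess|route|dispatch)"/ {
      total += $NF                       # last field is the sample value
    }
    END { print total + 0 }
  '
}

# Sample values for illustration only.
pending_requests <<'EOF'
dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 0
dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2
EOF
```

Against a live pod, the same function could be fed from the curl command above in place of the sample input.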

Rapid Scaling Up and Down

If you see unstable scaling:

Check if multiple autoscalers are targeting the same adapter
Increase cooldownPeriod in the KEDA ScaledObject (it controls how long KEDA waits before scaling back to zero)
Increase stabilizationWindowSeconds in HPA behavior
