SLA Planner Quick Start Guide#

Complete workflow to deploy SLA-based autoscaling for Dynamo deployments. This guide consolidates all necessary steps into a clear, sequential process.

Important

Prerequisites: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the Dynamo Platform installation.

Overview#

The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.

The deployment process consists of two mandatory phases:

  1. Pre-Deployment Profiling (2-4 hours) - Generates performance data

  2. SLA Planner Deployment (5-10 minutes) - Enables autoscaling

Tip

Fast Profiling with AI Configurator: For TensorRT-LLM users, we provide AI Configurator (AIC), which can complete profiling in 20-30 seconds using performance simulation instead of real deployments. Support for vLLM and SGLang is coming soon. See the AI Configurator section in the Profiling Guide.

flowchart TD
    A[Start Setup] --> B{Profiling Done?}
    B -->|No| C[Run Profiling<br/>2-4 hours]
    C --> D[Verify Results]
    D --> E[Deploy Planner<br/>5-10 minutes]
    B -->|Yes| E
    E --> F[Test System]
    F --> G[Ready!]

    style A fill:#e1f5fe
    style C fill:#fff3e0
    style E fill:#e8f5e8
    style G fill:#f3e5f5
    style B fill:#fff8e1

Prerequisites#

Before deploying the SLA planner, ensure:

  • Dynamo platform installed (see Installation Guide)

  • kube-prometheus-stack installed and running. By default, the planner expects the Prometheus server in the monitoring namespace. If it is deployed to a different namespace, set dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090" (see the sketch after this list).

  • Benchmarking resources set up (see Kubernetes utilities for Dynamo Benchmarking and Profiling). The setup script creates a dynamo-pvc with ReadWriteMany access; if your cluster’s default storageClassName does not support ReadWriteMany, specify a different storageClassName that does in deploy/utils/manifests/pvc.yaml.
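If your Prometheus server runs outside the monitoring namespace, the endpoint override can be applied when installing or upgrading the platform chart. A minimal sketch, assuming a Helm release named dynamo-platform; the release name, chart reference, and platform namespace are placeholders to adjust for your installation:

# hypothetical release/chart names -- substitute your own
helm upgrade --install dynamo-platform <path-or-repo/chart> \
  --namespace <platform-namespace> \
  --set "dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"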

Pre-Deployment Profiling#

Deploying the planner starts with pre-deployment profiling.

Warning

MANDATORY: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model’s performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.

Step 1.1: Set Up Profiling Environment#

Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.

export NAMESPACE=your-namespace
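If the namespace does not exist yet, a minimal way to create it is shown below; the PVC and other benchmarking resources are set up separately, per the utilities guide linked in the prerequisites:

# create the profiling namespace if it is missing
kubectl get namespace "$NAMESPACE" >/dev/null 2>&1 || kubectl create namespace "$NAMESPACE"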

Prerequisites: Ensure all dependencies are installed:

pip install -r deploy/utils/requirements.txt

Step 1.2: Inject Your Configuration#

Use the injector utility to place your DGD manifest into the PVC:

# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml

# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml

Note: All paths must start with /data/ for security reasons.

Step 1.3: Configure SLA Targets#

For dense models, edit $DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml:

spec:
  template:
    spec:
      containers:
        - name: profile-sla
          args:
            - --isl
            - "3000" # average ISL is 3000 tokens
            - --osl
            - "150" # average OSL is 150 tokens
            - --ttft
            - "200" # target TTFT is 200ms
            - --itl
            - "20" # target ITL is 20ms
            - --backend
            - <vllm/sglang>
            - --deploy-after-profile

For MoE models, edit $DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml instead.

To automatically deploy the optimized DGD with planner after profiling, add --deploy-after-profile to the profiling job. The job then deploys the DGD using the engine configuration and parallelization mapping found to be optimal for the SLA targets.

Step 1.4: Run Profiling#

Set the container image and config path:

export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
export DGD_CONFIG_FILE=/data/configs/disagg.yaml

Run profiling:

# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -

# for MoE models
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -

# using aiconfigurator instead of a real profiling sweep (see the AI Configurator section in the Profiling Guide)
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -

Step 1.5: Monitor Profiling Progress#

kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
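To block until profiling finishes instead of polling, you can follow the job logs or wait on its completion condition; a sketch using plain kubectl, with an arbitrary generous timeout:

# stream the job logs until the job exits
kubectl logs -f job/profile-sla -n $NAMESPACE

# or block until the job reports completion (timeout is an arbitrary upper bound)
kubectl wait --for=condition=complete job/profile-sla -n $NAMESPACE --timeout=5h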

Note

Time Investment: This profiling process is comprehensive and typically takes 2-4 hours to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.

Step 1.6: Download Profiling Results (Optional)#

If you want to view the profiling results and performance plots:

# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results

For detailed information about the output structure, performance plots, and how to analyze the results, see the Viewing Profiling Results section in the Profiling Guide.

Verify Success: Look for terminal output like:

Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
...
Final DGD config with planner: {...}
Deploying the optimized DGD with planner...

Step 1.7: Wait for Deployment to be Ready#

kubectl get pods -n $NAMESPACE

Expected pods (all should be 1/1 Running):

vllm-disagg-planner-frontend-*            1/1 Running
vllm-disagg-planner-planner-*             1/1 Running
vllm-disagg-planner-backend-*             1/1 Running
vllm-disagg-planner-prefill-*             1/1 Running
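Instead of polling, you can wait for every pod in the namespace to report Ready (a sketch; large models can take well over ten minutes to load, so use a generous timeout):

kubectl wait --for=condition=Ready pod --all -n $NAMESPACE --timeout=20m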

Step 1.8: Test the System#

# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000

# Send a request
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "Hello, how are you?"
    }
    ],
    "stream":true,
    "max_tokens": 30
  }'
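A single request verifies the path end to end, but the planner only makes scaling decisions once it observes sustained traffic. One simple way to generate light, continuous load against the port-forwarded frontend is sketched below, reusing the model from the example above; adjust it to your deployment:

# send one streaming request per second until interrupted
while true; do
  curl -s -N http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello, how are you?"}], "stream": true, "max_tokens": 30}' \
    > /dev/null
  sleep 1
done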

Step 1.9: Monitor Scaling#

# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10

Expected successful output (after streaming requests):

New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX
Observed ttft: X.XXms itl: X.XXms
Number of prefill workers: 1, number of decode workers: 1
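In a second terminal, you can watch the prefill and decode worker pods change as the planner applies its decisions; the pod name prefixes below are taken from the expected pod list above:

watch -n 10 "kubectl get pods -n $NAMESPACE | grep -E 'vllm-disagg-planner-(prefill|backend)'"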

Production Readiness#

Monitoring Metrics#

  • Basic metrics (request count): Available with any request type

  • Latency metrics (TTFT/ITL): Available for both streaming and non-streaming requests

  • Scaling decisions: Require sufficient request volume (see the query sketch below to inspect what Prometheus has recorded)
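Once Prometheus is reachable locally, you can check which frontend metrics it has scraped by matching on the metric-name prefix. A sketch using a standard PromQL name matcher and the port-forward from the troubleshooting section below:

# forward the Prometheus service locally, then query by metric-name prefix
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090 &
sleep 2
curl -G "http://localhost:9090/api/v1/query" --data-urlencode 'query={__name__=~"nv_llm_http_service.*"}'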

Troubleshooting#

Connection Issues:

# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"

Missing Metrics:

# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service

Worker Issues:

  • Large models can take 10+ minutes to initialize

  • Check worker logs: kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend

  • Ensure GPU resources are available for workers (see the node check below)
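A quick way to see how many GPUs each node exposes to the scheduler, assuming the standard nvidia.com/gpu resource name advertised by the NVIDIA device plugin:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'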

Unknown Field subComponentType:

If you encounter the following error when applying the deployment:

Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"

This happens because the subComponentType field was only added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD by following the instructions here.
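To confirm whether the CRD installed in your cluster already defines the field, one option is to grep its schema. The CRD name below assumes the nvidia.com API group used by DynamoGraphDeployment; adjust it if your installation differs:

# a count of 0 means the installed CRD predates subComponentType and needs upgrading
kubectl get crd dynamographdeployments.nvidia.com -o yaml | grep -c subComponentType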

Next Steps#

Quick Reference#

| Phase | Duration | Purpose | Status Check |
| --- | --- | --- | --- |
| Profiling | 2-4 hours | Generate performance data | kubectl logs job/profile-sla |
| Deployment | 5-10 minutes | Enable autoscaling | kubectl get pods |
| Testing | 5 minutes | Verify functionality | kubectl logs deployment/planner |


Tip

Need Help? If you encounter issues, check the troubleshooting section or refer to the detailed guides linked in Next Steps.