SLA Planner Quick Start Guide#
Complete workflow to deploy SLA-based autoscaling for Dynamo deployments. This guide consolidates all necessary steps into a clear, sequential process.
Important
Prerequisites: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the Dynamo Platform installation.
Overview#
The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.
The deployment process consists of two mandatory phases:
Pre-Deployment Profiling (2-4 hours) - Generates performance data
SLA Planner Deployment (5-10 minutes) - Enables autoscaling
Tip
Fast Profiling with AI Configurator: For TensorRT-LLM users, AI Configurator (AIC) can complete profiling in 20-30 seconds by using performance simulation instead of real deployments. Support for vLLM and SGLang is coming soon. See the AI Configurator section in the Profiling Guide.
flowchart TD
A[Start Setup] --> B{Profiling Done?}
B -->|No| C[Run Profiling<br/>2-4 hours]
C --> D[Verify Results]
D --> E[Deploy Planner<br/>5-10 minutes]
B -->|Yes| E
E --> F[Test System]
F --> G[Ready!]
style A fill:#e1f5fe
style C fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style B fill:#fff8e1
Prerequisites#
Before deploying the SLA planner, ensure:
Dynamo platform installed (see Installation Guide)
kube-prometheus-stack installed and running. The default Prometheus endpoint assumes the monitoring namespace; if Prometheus is deployed to a different namespace, set dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090".
Benchmarking resources set up (see Kubernetes utilities for Dynamo Benchmarking and Profiling). The setup script creates a dynamo-pvc with ReadWriteMany access; if your cluster's default storageClassName does not allow ReadWriteMany, specify a different storageClassName in deploy/utils/manifests/pvc.yaml that does support it (a sketch follows this list).
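For reference, a minimal sketch of what deploy/utils/manifests/pvc.yaml might look like with an explicit storageClassName (the class name and size below are illustrative assumptions; check the manifest shipped in the repo for the authoritative fields):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-pvc
spec:
  accessModes:
  - ReadWriteMany              # required by the profiling workflow
  storageClassName: nfs-client # illustrative: use a class that supports ReadWriteMany
  resources:
    requests:
      storage: 50Gi            # illustrative size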
Pre-Deployment Profiling#
Deploying the planner starts with running pre-deployment profiling.
Warning
MANDATORY: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model’s performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.
Step 1.1: Set Up Profiling Environment#
Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.
export NAMESPACE=your-namespace
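If the namespace does not exist yet, you can create it idempotently before proceeding (standard kubectl, not specific to Dynamo):
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -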
Prerequisites: Ensure all dependencies are installed:
pip install -r deploy/utils/requirements.txt
Step 1.2: Inject Your Configuration#
Use the injector utility to place your DGD manifest into the PVC:
# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
Note: All paths must start with /data/ for security reasons.
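To sanity-check that the manifest landed in the PVC, one option is to reuse the download utility from Step 1.6 and pull the configs folder back down (same flags as shown there; the exact layout of the downloaded files depends on the utility):
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./pvc-check --folder /data/configs
ls ./pvc-check   # the injected disagg.yaml should appear here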
Step 1.3: Configure SLA Targets#
For dense models, edit $DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml:
spec:
  template:
    spec:
      containers:
      - name: profile-sla
        args:
        - --isl
        - "3000" # average ISL is 3000 tokens
        - --osl
        - "150" # average OSL is 150 tokens
        - --ttft
        - "200" # target TTFT is 200ms
        - --itl
        - "20" # target ITL is 20ms
        - --backend
        - <vllm/sglang>
        - --deploy-after-profile
For MoE models, edit $DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml instead.
To automatically deploy the optimized DGD with planner after profiling, add --deploy-after-profile to the profiling job's args (as in the example above). Once profiling completes, it deploys the DGD with the engine configuration of the optimal parallelization mapping found for the SLA targets.
Step 1.4: Run Profiling#
Set the container image and config path:
export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
export DGD_CONFIG_FILE=/data/configs/disagg.yaml
Run profiling:
# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
# for MoE models
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -
# using aiconfigurator instead of real sweeping (see below for more details)
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -
Step 1.5: Monitor Profiling Progress#
kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
Note
Time Investment: This profiling process is comprehensive and typically takes 2-4 hours to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.
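Because the sweep can run for hours, it may be convenient to block until the job finishes rather than polling (standard kubectl; adjust the timeout to your expected runtime):
kubectl wait --for=condition=complete job/profile-sla -n $NAMESPACE --timeout=4h
# or follow the logs live in another terminal
kubectl logs -f job/profile-sla -n $NAMESPACE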
Step 1.6: Download Profiling Results (Optional)#
If you want to view the profiling results and performance plots:
# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results
For detailed information about the output structure, performance plots, and how to analyze the results, see the Viewing Profiling Results section in the Profiling Guide.
Verify Success: Look for terminal output like:
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
...
Final DGD config with planner: {...}
Deploying the optimized DGD with planner...
Step 1.7: Wait for Deployment to be Ready#
kubectl get pods -n $NAMESPACE
Expected pods (all should be 1/1 Running):
vllm-disagg-planner-frontend-* 1/1 Running
vllm-disagg-planner-planner-* 1/1 Running
vllm-disagg-planner-backend-* 1/1 Running
vllm-disagg-planner-prefill-* 1/1 Running
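If you prefer to block until these components finish rolling out instead of polling kubectl get pods, a sketch using standard kubectl (assuming each component is exposed as a Deployment, as the commands elsewhere in this guide suggest; large models may need a longer timeout):
for d in frontend planner backend prefill; do
  kubectl rollout status deployment/vllm-disagg-planner-$d -n $NAMESPACE --timeout=15m
done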
Step 1.8: Test the System#
# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
# Send a request
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
],
"stream":true,
"max_tokens": 30
}'
Step 1.9: Monitor Scaling#
# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
Expected successful output (after streaming requests):
New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX
Observed ttft: X.XXms itl: X.XXms
Number of prefill workers: 1, number of decode workers: 1
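To see scaling decisions take effect, you can also watch the worker pod counts change over time (pod name patterns taken from Step 1.7):
kubectl get pods -n $NAMESPACE -w | grep -E 'prefill|backend'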
Production Readiness#
Monitoring Metrics#
Basic metrics (request count): Available with any request type
Latency metrics (TTFT/ITL): Available for both streaming and non-streaming requests
Scaling decisions: Require sufficient request volume
Troubleshooting#
Connection Issues:
# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
Missing Metrics:
# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service
Worker Issues:
Large models can take 10+ minutes to initialize
Check worker logs:
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend
Ensure GPU resources are available for workers (see the check below)
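One way to confirm schedulable GPU capacity is to inspect the nvidia.com/gpu resource reported by your nodes (standard Kubernetes GPU resource name; output format varies by cluster):
kubectl describe nodes | grep -i 'nvidia.com/gpu'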
Unknown Field subComponentType:
If you encounter the following error when applying the deployment:
Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
This is because the subComponentType field was only added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD by following the instructions here.
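Before upgrading, you can check whether your installed CRD already knows about subComponentType; a rough sketch (the exact CRD resource name is an assumption, use whatever kubectl get crd lists for DynamoGraphDeployment):
kubectl get crd | grep -i dynamographdeployment
kubectl get crd <crd-name-from-above> -o yaml | grep -c subComponentType   # 0 means the field is missing and the CRD needs upgrading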
Next Steps#
Architecture Details: See SLA-based Planner Architecture for technical details
Performance Tuning: See Pre-Deployment Profiling Guide for advanced profiling options
Load Testing: See SLA Planner Load Test for comprehensive testing tools
Quick Reference#
| Phase | Duration | Purpose | Status Check |
|---|---|---|---|
| Profiling | 2-4 hours | Generate performance data | kubectl get jobs -n $NAMESPACE (Step 1.5) |
| Deployment | 5-10 minutes | Enable autoscaling | kubectl get pods -n $NAMESPACE (Step 1.7) |
| Testing | 5 minutes | Verify functionality | curl request to the frontend (Step 1.8) |
Tip
Need Help? If you encounter issues, check the troubleshooting section or refer to the detailed guides linked in Next Steps.