SLA-Driven Profiling and Planner Deployment Quick Start Guide
Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
Prerequisites: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the Dynamo Platform installation.
The DGDR workflow automates the entire process from SLA specification to deployment:
A DynamoGraphDeploymentRequest (DGDR) is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a “deployment order” where you specify:
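- The model to deploy (model)
- The SLA targets it must meet (ttft, itl)
- The inference backend (backend: vllm, sglang, or trtllm)
- The container images to use (profilingConfig.profilerImage, deploymentOverrides.workersImage)

The Dynamo Operator watches for DGDRs and automatically: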
Key Benefits:
Before creating a DGDR, ensure:
- Image pull secrets are in place where required (for example, nvcr-imagepullsecret for NVIDIA images)

Each DGDR requires you to specify container images for the profiling and deployment process:
profilingConfig.profilerImage (Required): Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
deploymentOverrides.workersImage (Optional): Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
If workersImage is omitted, the image from the base config file (e.g., disagg.yaml) is used. You may use our public images (0.6.1 and later) or build and push your own.
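As a rough illustration of how the fields described so far fit together, the following Python sketch assembles a DGDR manifest as a plain dictionary and prints it as JSON, which kubectl accepts in place of YAML. The helper name build_dgdr, the apiVersion value, the placement of the SLA targets under profilingConfig.config.sla, and the placeholder model and image strings are all assumptions made for this example; check them against the sample DGDRs shipped with your release before using anything like this.

```python
import json

SUPPORTED_BACKENDS = ("vllm", "sglang", "trtllm")


def build_dgdr(name, model, backend, profiler_image, ttft, itl, workers_image=None):
    """Return a DGDR manifest as a dict (hypothetical helper, not part of Dynamo)."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")

    spec = {
        "model": model,
        "backend": backend,
        "profilingConfig": {
            # Required: image that runs the profiling job.
            "profilerImage": profiler_image,
            # Assumed nesting: SLA targets live under profilingConfig.config.sla.
            "config": {"sla": {"ttft": ttft, "itl": itl}},
        },
    }
    # Optional: when omitted, the image from the base config file is used.
    if workers_image is not None:
        spec["deploymentOverrides"] = {"workersImage": workers_image}

    return {
        "apiVersion": "nvidia.com/v1alpha1",  # assumed API group/version
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = build_dgdr(
        name="sla-aic",
        model="my-org/my-model",                        # placeholder model id
        backend="vllm",
        profiler_image="example.invalid/profiler:tag",  # placeholder image
        ttft=200,                                       # illustrative SLA values
        itl=20,
    )
    print(json.dumps(manifest, indent=2))
```

Leaving workers_image unset mirrors the documented default, where the worker image comes from the base config file.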
Dynamo provides sample DGDR configurations in benchmarks/profiler/deploy/. You can use these as starting points:
Available Sample DGDRs:
- profile_sla_dgdr.yaml: Standard online profiling for dense models
- profile_sla_aic_dgdr.yaml: Fast offline profiling using AI Configurator
- profile_sla_moe_dgdr.yaml: Online profiling for MoE models (SGLang)

Or, you can create your own DGDR for your own needs.
Important - Profiling Config Cases: Prior to 0.8.1, fields under profilingConfig.config use snake_case. Starting with 0.8.1, fields under profilingConfig.config use camelCase for uniformity. Releases from 0.8.1 onward still accept snake_case for backwards compatibility, but all example DGDRs use camelCase, so anyone on a release prior to 0.8.1 must manually convert the config fields in the examples to snake_case.
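For users on a release prior to 0.8.1, the manual conversion can be scripted. The sketch below is one possible way to rewrite camelCase keys as snake_case throughout a nested config; it is not a Dynamo utility, and the function names and the sample keys in the usage lines are hypothetical. The simple word-boundary rule may not match every real field name (acronyms in particular), so compare the result against the DGDR Configuration Reference for your release.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def camel_to_snake(name):
    """Convert a single camelCase key to snake_case."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


def to_snake_case_keys(value):
    """Recursively convert dict keys to snake_case; values are left unchanged."""
    if isinstance(value, dict):
        return {camel_to_snake(k): to_snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_snake_case_keys(item) for item in value]
    return value


if __name__ == "__main__":
    # Hypothetical keys, used only to show the transformation.
    example = {"someSection": {"someFieldName": 1, "ttft": 200}}
    print(to_snake_case_keys(example))
```

Only keys are rewritten; string values such as model names or image tags pass through untouched.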
For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the DGDR Configuration Reference.
The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
The Dynamo Operator will immediately begin processing your request.
Watch the DGDR status:
DGDR Status States:
- Pending: Initial state, preparing to profile
- Profiling: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- Deploying: Generating and applying DGD configuration
- Ready: DGD successfully deployed and running
- Failed: Error occurred (check events for details)

With AI Configurator, profiling completes in 20-30 seconds, far faster than online profiling, which takes 2-4 hours.
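If you script around the DGDR lifecycle, the states above suggest a simple polling loop that stops at Ready or Failed. The following is a minimal sketch under that assumption. The function wait_for_terminal_state and its parameters are hypothetical, and get_state stands in for whatever mechanism you use to read the DGDR's current state from the cluster, which is not shown here.

```python
import time

# Ready and Failed end the workflow; Pending, Profiling, and Deploying are transitional.
TERMINAL_STATES = {"Ready", "Failed"}


def wait_for_terminal_state(get_state, timeout_s, poll_interval_s=5.0, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or the timeout expires.

    get_state is a caller-supplied callable returning the current state string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"DGDR still in state {state!r} after {timeout_s} seconds")
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated state sequence standing in for real cluster reads.
    fake_states = iter(["Pending", "Profiling", "Deploying", "Ready"])
    print(wait_for_terminal_state(lambda: next(fake_states), timeout_s=60, sleep=lambda s: None))
```

Pick the timeout to match the profiling mode: a short one is enough for AI Configurator, while online profiling needs a budget of several hours.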
Once the DGDR reaches Ready state, your model is deployed and ready to serve:
If you want to monitor the SLA Planner’s decision-making in real-time, you can deploy the Planner Grafana dashboard.
Follow the instructions in Dynamo Metrics Collection on Kubernetes to access the Grafana UI and select the Dynamo Planner Dashboard.
The dashboard displays:
Use the Namespace dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.
The sla section defines performance requirements and workload characteristics:
Choosing SLA Values:
Choose between online profiling (real measurements, 2-4 hours) or offline profiling with AI Configurator (estimated, 20-30 seconds):
For detailed comparison, supported configurations, and limitations, see SLA-Driven Profiling Documentation.
For details on hardware configuration and GPU discovery options, see Hardware Configuration in SLA-Driven Profiling.
If you have an existing DynamoGraphDeployment config (e.g., from examples/backends/*/deploy/disagg.yaml or custom recipes), you can reference it via ConfigMap:
Step 1: Create ConfigMap from your DGD config file:
Step 2: Reference the ConfigMap in your DGDR:
What’s happening: The profiler uses the DGD config from the ConfigMap as a base template, then optimizes it based on your SLA targets. The controller automatically injects spec.model into deployment.model and spec.backend into engine.backend in the final configuration.
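The injection step can be pictured as a small dictionary merge. This sketch is only a model of the behavior described above, not the controller's actual code; the function name and the sample values are hypothetical.

```python
import copy


def inject_model_and_backend(dgdr_spec, profiler_config):
    """Return a copy of profiler_config with the model and backend filled in from the DGDR spec."""
    merged = copy.deepcopy(profiler_config)  # leave the caller's config untouched
    merged.setdefault("deployment", {})["model"] = dgdr_spec["model"]
    merged.setdefault("engine", {})["backend"] = dgdr_spec["backend"]
    return merged


if __name__ == "__main__":
    spec = {"model": "my-org/my-model", "backend": "vllm"}  # placeholder values
    config = {"sla": {"ttft": 200, "itl": 20}}               # illustrative config
    print(inject_model_and_backend(spec, config))
```

In this model, values taken from spec.model and spec.backend overwrite anything already present at deployment.model and engine.backend.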
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your model and backend:
Note: engine.config is a file path to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
Add planner-specific settings:
For details about the profiling process, performance plots, and interpolation data, see SLA-Driven Profiling Documentation.
Instead of a real DGD that uses GPU resources, you can deploy a mocker deployment that uses simulated engines rather than GPUs. Mocker is available in all backend images and uses profiling data to simulate realistic GPU timing behavior. It is useful for:
To deploy mocker instead of the real backend, set useMocker: true:
Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
Starting in Dynamo 0.8.1, for large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. See Model Cache PVC for configuration details.
DGDRs are immutable - if you need to update SLAs or configuration:
- Delete the existing DGDR: kubectl delete dgdr sla-aic
- Create a new DGDR with the updated specification

There are two ways to manually control deployment after profiling:
Disable auto-deployment to review the generated DGD before applying:
Then manually extract and apply the generated DGD:
The generated DGD includes optimized configurations and the SLA planner component. The required planner-profile-data ConfigMap is automatically created when profiling completes, so the DGD will deploy successfully.
For advanced use cases, you can manually deploy using the standalone planner templates in examples/backends/*/deploy/disagg_planner.yaml:
Note: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended as it’s automatically tuned to your profiling results and SLA targets.
Important - Prometheus Configuration: The planner queries Prometheus to get frontend request metrics for scaling decisions. If you see errors like “Failed to resolve prometheus service”, ensure the PROMETHEUS_ENDPOINT environment variable in your planner configuration correctly points to your Prometheus service. See the comments in the example templates for details.
The DGDR controller generates a DGD that:
The generated DGD is tracked via labels:
By default, profiling jobs save essential data to ConfigMaps for planner integration. Advanced users who need access to detailed artifacts (logs, performance plots, AIPerf results, etc.) can configure the DGDR to use dynamo-pvc. This is optional and does not affect the functionality of the profiler or the Planner.
What’s available in ConfigMaps (always created):
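- Essential profiling data for planner integration (.json files)

What’s available in PVC if attached to DGDR (optional):

- Detailed artifacts such as logs, performance plots, and AIPerf results, plus raw data (.npz files)

Setup:

- Add outputPVC to your DGDR’s profilingConfig:

For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see SLA-Driven Profiling Troubleshooting.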
For comprehensive documentation of all DGDR configuration options, see the DGDR Configuration Reference.
This includes detailed explanations of: