Copyright © 2026, NVIDIA Corporation.

Planner Examples

Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the Planner Guide. For a quick overview, see the Planner README.

Basic Examples

Minimal DGDR with AIC (Fastest)

The simplest way to deploy with the SLA planner. It uses AI Configurator (AIC) for offline profiling, which takes 20-30 seconds instead of hours:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-aic
spec:
  model: Qwen/Qwen3-32B
  backend: vllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"

  autoApply: true

Deploy:

$ export NAMESPACE=your-namespace
$ kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
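To see what the SLA targets in this DGDR imply for a single request, the arithmetic below combines them into an end-to-end latency budget. It is a back-of-the-envelope sketch, assuming ttft and itl are expressed in milliseconds and that every output token after the first costs one ITL. The function names are illustrative and not part of Dynamo.

```python
def e2e_latency_ms(ttft_ms: float, itl_ms: float, osl: int) -> float:
    """Approximate end-to-end latency: first token, then (osl - 1) further tokens."""
    if osl < 1:
        raise ValueError("osl must be at least 1")
    return ttft_ms + itl_ms * (osl - 1)


def decode_tokens_per_second(itl_ms: float) -> float:
    """Per-request decode rate implied by an ITL target."""
    return 1000.0 / itl_ms


# SLA values from the minimal DGDR (units assumed to be milliseconds)
budget = e2e_latency_ms(ttft_ms=200, itl_ms=20, osl=150)
rate = decode_tokens_per_second(20)
print(f"end-to-end budget: {budget:.0f} ms, per-request decode rate: {rate:.0f} tok/s")
```

Under these assumptions, a request with 150 output tokens has a budget of 200 + 149 × 20 = 3180 ms.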

Online Profiling (Real Measurements)

Standard online profiling runs real GPU measurements for more accurate results. It takes 2-4 hours:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-online
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: false
        prefillInterpolationGranularity: 16
        decodeInterpolationGranularity: 6

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"

  autoApply: true

Deploy:

$ kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE

Available sample DGDRs in benchmarks/profiler/deploy/:

  • profile_sla_dgdr.yaml: Standard online profiling for dense models
  • profile_sla_aic_dgdr.yaml: Fast offline profiling using AI Configurator
  • profile_sla_moe_dgdr.yaml: Online profiling for MoE models (SGLang)

Profiling Config Cases: Prior to 0.8.1, fields under profilingConfig.config use snake_case. Starting with 0.8.1, fields use camelCase. snake_case is still accepted for backward compatibility, but the example DGDRs use camelCase.
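If you maintain older configs and want them to match the camelCase examples, the key renaming is mechanical. The sketch below is one way to do it, not a Dynamo utility: the helper names are hypothetical, and the snake_case field names in the sample are assumed to be the direct snake_case spellings of the camelCase fields shown on this page.

```python
import re
from typing import Any


def snake_to_camel(key: str) -> str:
    """Convert one snake_case key to camelCase; camelCase keys pass through unchanged."""
    return re.sub(r"_([a-zA-Z0-9])", lambda m: m.group(1).upper(), key)


def camelize_keys(node: Any) -> Any:
    """Recursively rename dictionary keys, leaving values untouched."""
    if isinstance(node, dict):
        return {snake_to_camel(k): camelize_keys(v) for k, v in node.items()}
    if isinstance(node, list):
        return [camelize_keys(item) for item in node]
    return node


# Assumed pre-0.8.1 spelling of the sweep fields
old_style = {
    "sweep": {
        "use_ai_configurator": True,
        "aic_system": "h200_sxm",
        "prefill_interpolation_granularity": 16,
    }
}
print(camelize_keys(old_style))
```

Only keys are rewritten, so values that happen to contain underscores, such as h200_sxm, stay as they are.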

Kubernetes Examples

MoE Models (SGLang)

For Mixture-of-Experts models like DeepSeek-R1, use the SGLang backend:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-moe
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: false

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"

  autoApply: true

Deploy:

$ kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
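The dense and MoE DGDRs on this page share one shape and differ only in model, backend, image, SLA targets, and sweep options. If you generate many variants, a small builder can keep them consistent. The following is a sketch under that assumption: build_dgdr is a hypothetical helper, it mirrors only the fields shown in the examples, and it reuses one image for both the profiler and the workers as the examples do.

```python
import json


def build_dgdr(name, model, backend, image, sla, sweep, auto_apply=True):
    """Assemble a DynamoGraphDeploymentRequest with the fields used on this page."""
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "profilingConfig": {
                "profilerImage": image,
                "config": {"sla": dict(sla), "sweep": dict(sweep)},
            },
            "deploymentOverrides": {"workersImage": image},
            "autoApply": auto_apply,
        },
    }


moe = build_dgdr(
    name="sla-moe",
    model="deepseek-ai/DeepSeek-R1",
    backend="sglang",
    image="nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1",
    sla={"isl": 4000, "osl": 500, "ttft": 300, "itl": 10},
    sweep={"useAiConfigurator": False},
)
print(json.dumps(moe, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output can be piped to kubectl apply -f - -n $NAMESPACE.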

Using Existing DGD Configs (Custom Setups)

Reference an existing DynamoGraphDeployment config via ConfigMap:

Step 1: Create ConfigMap from your DGD config:

$ kubectl create configmap deepseek-r1-config \
    --from-file=disagg.yaml=/path/to/your/disagg.yaml \
    --namespace $NAMESPACE \
    --dry-run=client -o yaml | kubectl apply -f -

Step 2: Reference it in your DGDR:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml  # Must match the key used in --from-file
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"

  autoApply: true

The profiler uses the DGD config from the ConfigMap as a base template, then optimizes it based on your SLA targets. The controller automatically injects spec.model and spec.backend into the final configuration.

Inline Configuration (Simple Use Cases)

For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:

profilingConfig:
  config:
    sla:
      isl: 8000
      osl: 200
      ttft: 200.0
      itl: 10.0

    hardware:
      minNumGpusPerEngine: 2
      maxNumGpusPerEngine: 8
      gpuType: h200_sxm

    sweep:
      prefillInterpolationGranularity: 16
      decodeInterpolationGranularity: 6

Mocker Deployment (Testing)

Deploy a mocker backend that simulates GPU timing behavior without real GPUs. This is useful for:

  • Large-scale experiments without GPU resources
  • Testing planner behavior and infrastructure
  • Validating deployment configurations

spec:
  model: <model-name>
  backend: trtllm  # Real backend for profiling
  useMocker: true  # Deploy mocker instead of real backend

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h100_sxm
  autoApply: true

Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.

Model Cache PVC (0.8.1+)

For large models, use a pre-populated PVC instead of downloading from HuggingFace. See SLA-Driven Profiling for configuration details.

Advanced Examples

Custom Load Predictors

Warm-starting with Trace Data

Pre-load predictors with historical request patterns before live traffic:

# In planner arguments
args:
  - --load-predictor arima
  - --load-predictor-warmup-trace /data/trace.jsonl
  - --load-predictor-log1p

The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
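Before pointing the planner at a trace, it can help to check what request-count, ISL, and OSL samples the file would yield. The sketch below buckets records by time and reports those three numbers per interval. It assumes the field names timestamp (in milliseconds), input_length, and output_length, as used in the publicly released Mooncake traces. Whether the planner expects exactly these fields is not confirmed here, and summarize_trace is a hypothetical helper.

```python
import json
from collections import defaultdict


def summarize_trace(lines, interval_s=60.0):
    """Bucket requests by time and return (request_count, avg_isl, avg_osl) per bucket."""
    # Per bucket: [request count, total input tokens, total output tokens]
    buckets = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        bucket = int(record["timestamp"] / 1000.0 // interval_s)  # timestamp assumed in ms
        stats = buckets[bucket]
        stats[0] += 1
        stats[1] += record["input_length"]
        stats[2] += record["output_length"]
    return [
        (count, isl_sum / count, osl_sum / count)
        for _, (count, isl_sum, osl_sum) in sorted(buckets.items())
    ]


sample = [
    '{"timestamp": 0, "input_length": 3000, "output_length": 150}',
    '{"timestamp": 30000, "input_length": 5000, "output_length": 250}',
    '{"timestamp": 61000, "input_length": 4000, "output_length": 200}',
]
print(summarize_trace(sample))
```

To inspect a real file, pass an open file handle, for example summarize_trace(open("/data/trace.jsonl")). Intervals with no requests are omitted from the result rather than reported as zero.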

Kalman Filter Tuning

For workloads with rapid changes, tune the Kalman filter:

args:
  - --load-predictor kalman
  - --kalman-q-level 2.0     # Higher = more responsive to level changes
  - --kalman-q-trend 0.5     # Higher = trend changes faster
  - --kalman-r 5.0           # Lower = trusts new measurements more
  - --kalman-min-points 3    # Fewer points before forecasting starts
  - --load-predictor-log1p   # Often helps with request-rate series
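To build intuition for how these knobs interact, the following is a minimal local-linear-trend Kalman filter with a [level, trend] state. It is a textbook sketch, not the planner's implementation: the class name is hypothetical, the parameters are only assumed to play the roles described in the comments above, and the baseline values are chosen for illustration.

```python
class LocalLinearTrendKalman:
    """Scalar-observation Kalman filter with a [level, trend] state."""

    def __init__(self, q_level=1.0, q_trend=0.1, r=10.0, min_points=5):
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.min_points = min_points
        self.level, self.trend = 0.0, 0.0
        # Covariance entries: p00 = var(level), p01 = cov(level, trend), p11 = var(trend).
        # The level starts at the first observation, so its variance starts at r;
        # the trend is unknown, so its variance starts large.
        self.p00, self.p01, self.p11 = r, 0.0, 1e3
        self.n = 0

    def update(self, z):
        if self.n == 0:
            self.level = z
        else:
            # Predict: level advances by trend; process noise inflates the covariance
            level = self.level + self.trend
            p00 = self.p00 + 2 * self.p01 + self.p11 + self.q_level
            p01 = self.p01 + self.p11
            p11 = self.p11 + self.q_trend
            # Correct: the gain grows as r shrinks, so new measurements count for more
            s = p00 + self.r
            k0, k1 = p00 / s, p01 / s
            resid = z - level
            self.level = level + k0 * resid
            self.trend = self.trend + k1 * resid
            self.p00 = (1 - k0) * p00
            self.p01 = (1 - k0) * p01
            self.p11 = p11 - k1 * p01
        self.n += 1

    def forecast(self):
        """One-step-ahead forecast, or None until min_points observations have arrived."""
        if self.n < self.min_points:
            return None
        return self.level + self.trend


def level_after_step(**params):
    """Feed 30 flat samples, then one jump to 200, and return the filtered level."""
    kf = LocalLinearTrendKalman(**params)
    for z in [100.0] * 30 + [200.0]:
        kf.update(z)
    return kf.level


baseline = level_after_step(q_level=1.0, q_trend=0.1, r=10.0)
tuned = level_after_step(q_level=2.0, q_trend=0.5, r=5.0)
print(f"level after the jump: baseline={baseline:.1f}, tuned={tuned:.1f}")
```

With larger process noise and smaller measurement noise, the tuned filter moves further toward 200 after a single post-jump sample, which is the "more responsive" behavior the flag comments describe. The cost is that it also follows noise more closely.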

Prophet for Seasonal Workloads

For workloads with daily/weekly patterns:

args:
  - --load-predictor prophet
  - --prophet-window-size 100   # Larger window for seasonal detection
  - --load-predictor-log1p
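The --load-predictor-log1p flag appears in each predictor example. Judging by its name, it presumably fits the predictor on log1p-transformed values and inverts the forecast afterwards. The sketch below shows that transform pair in isolation, assuming this interpretation. The helper names are hypothetical, and the mean is only a placeholder for a real predictor.

```python
import math


def to_log_space(series):
    """log1p compresses large spikes and is defined at zero (log1p(0) == 0)."""
    return [math.log1p(x) for x in series]


def from_log_space(value):
    """Invert with expm1 and clamp at zero so a forecast never goes negative."""
    return max(0.0, math.expm1(value))


requests_per_interval = [0, 3, 12, 400, 9]
transformed = to_log_space(requests_per_interval)
# A predictor would be fit on `transformed`; here the mean stands in for its forecast.
stand_in_forecast = sum(transformed) / len(transformed)
print(from_log_space(stand_in_forecast))
```

Working in log space keeps a single spike such as 400 from dominating the fit, and the inverse transform cannot produce a negative request count.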

Virtual Connector

For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:

from dynamo._core import DistributedRuntime, VirtualConnectorClient


# The client calls are awaited, so the loop must run inside a coroutine.
async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str) -> None:
    # Initialize client
    client = VirtualConnectorClient(distributed_runtime, namespace)

    # Main loop: watch for planner decisions and execute them
    while True:
        # Block until the planner makes a new scaling decision
        await client.wait()

        # Read the decision
        decision = await client.get()
        print(f"Scale to: prefill={decision.num_prefill_workers}, "
              f"decode={decision.num_decode_workers}, "
              f"id={decision.decision_id}")

        # Execute scaling in your environment; scale_prefill_workers and
        # scale_decode_workers are functions you provide
        scale_prefill_workers(decision.num_prefill_workers)
        scale_decode_workers(decision.num_decode_workers)

        # Report completion
        await client.complete(decision)

See components/planner/test/test_virtual_connector.py for a full working example.
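To exercise your own scaling callbacks without a running Dynamo runtime, you can substitute a stand-in client for the real one. The sketch below is a test double only: FakeConnectorClient and FakeDecision are hypothetical, and they mimic just the wait, get, and complete calls and the three decision fields used in the example above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeDecision:
    num_prefill_workers: int
    num_decode_workers: int
    decision_id: int


class FakeConnectorClient:
    """Hypothetical stand-in exposing the wait/get/complete calls used above."""

    def __init__(self, decisions):
        self._pending = asyncio.Queue()
        for d in decisions:
            self._pending.put_nowait(d)
        self._current = None
        self.completed = []

    async def wait(self):
        # Blocks until a queued decision is available
        self._current = await self._pending.get()

    async def get(self):
        return self._current

    async def complete(self, decision):
        self.completed.append(decision.decision_id)


async def run_once(client, scale_prefill, scale_decode):
    """One iteration of the loop: wait, read, scale, acknowledge."""
    await client.wait()
    decision = await client.get()
    scale_prefill(decision.num_prefill_workers)
    scale_decode(decision.num_decode_workers)
    await client.complete(decision)
    return decision


async def main():
    client = FakeConnectorClient([FakeDecision(2, 4, 1), FakeDecision(3, 6, 2)])
    applied = []
    for _ in range(2):
        await run_once(
            client,
            lambda n: applied.append(("prefill", n)),
            lambda n: applied.append(("decode", n)),
        )
    return client.completed, applied


completed, applied = asyncio.run(main())
print(completed, applied)
```

Swapping the lambdas for your real scaling functions lets you confirm they are called in the expected order before connecting to a planner.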

Planner Configuration Passthrough

Pass planner-specific settings through the DGDR:

profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2

Review Before Deploy (autoApply: false)

Disable auto-deployment to inspect the generated DGD:

spec:
  autoApply: false

After profiling completes:

# Extract and review generated DGD
$ kubectl get dgdr sla-aic -n $NAMESPACE \
    -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml

# Review and modify as needed
$ vi my-dgd.yaml

# Deploy manually
$ kubectl apply -f my-dgd.yaml -n $NAMESPACE

Profiling Artifacts with PVC

Save detailed profiling artifacts (plots, logs, raw data) to a PVC:

spec:
  profilingConfig:
    outputPVC: "dynamo-pvc"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20

Setup:

$ export NAMESPACE=your-namespace
$ deploy/utils/setup_benchmarking_resources.sh

Access results:

$ kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
$ kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
$ kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
$ kubectl delete pod pvc-access-pod -n $NAMESPACE

Related Documentation

  • Planner README — Overview and quick start
  • Planner Guide — Deployment, configuration, integration
  • Planner Design — Architecture deep-dive
  • DGDR Configuration Reference
  • SLA-Driven Profiling