Planner Examples | NVIDIA Dynamo Documentation

Practical examples for deploying the Planner with throughput-based scaling. The DGDR workflow can use native AIC estimates, optional bootstrap profiling data, or live FPM warmup depending on the model/backend combination. For deployment concepts, see the Planner Guide. For a quick overview, see the Planner README.

Basic Examples

Minimal DGDR with AIC (Fastest)

The simplest way to deploy with the Planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: sla-aic
5 spec:
6   model: Qwen/Qwen3-32B
7   backend: vllm
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0

Deploy:

$ export NAMESPACE=your-namespace
$ # Save the manifest above as sla-aic.yaml first.
$ kubectl apply -f sla-aic.yaml -n $NAMESPACE

Online Profiling (Real Measurements)

Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours:

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: sla-online
5 spec:
6   model: meta-llama/Llama-3.3-70B-Instruct
7   backend: vllm
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0

Deploy:

$ # Save the manifest above as sla-online.yaml first.
$ kubectl apply -f sla-online.yaml -n $NAMESPACE

Note: Starting with Dynamo 1.0.0 (DGDR API version v1beta1), DGDR fields use structured spec fields (e.g., spec.workload, spec.sla, spec.hardware) instead of the nested profilingConfig.config blob used in v1alpha1.

Kubernetes Examples

MoE Models (SGLang)

For Mixture-of-Experts models like DeepSeek-R1, use SGLang backend:

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: sla-moe
5 spec:
6   model: deepseek-ai/DeepSeek-R1
7   backend: sglang
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0

Deploy:

$ # Save the manifest above as sla-moe.yaml first.
$ kubectl apply -f sla-moe.yaml -n $NAMESPACE

Using Existing DGD Configs (Custom Setups)

Reference an existing DynamoGraphDeployment config via ConfigMap:

Step 1: Create ConfigMap from your DGD config:

$ kubectl create configmap deepseek-r1-config \
>   --from-file=disagg.yaml=/path/to/your/disagg.yaml \
>   --namespace $NAMESPACE \
>   --dry-run=client -o yaml | kubectl apply -f -

Step 2: Reference it in your DGDR:

1 apiVersion: nvidia.com/v1beta1
2 kind: DynamoGraphDeploymentRequest
3 metadata:
4   name: deepseek-r1
5 spec:
6   model: deepseek-ai/DeepSeek-R1
7   backend: sglang
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0

The profiler uses the DGD config from the ConfigMap as a base template, then optimizes it based on your SLA targets. The controller automatically injects spec.model and spec.backend into the final configuration.

Inline Configuration (Simple Use Cases)

For simple use cases without a custom DGD config, provide the configuration directly in the v1beta1 DGDR spec fields. The profiler auto-generates a basic DGD configuration:

1 spec:
2   workload:
3     isl: 8000
4     osl: 200
5 
6   sla:
7     ttft: 200.0
8     itl: 10.0
9 
10   hardware:
11     gpuSku: h200_sxm
12 
13   searchStrategy: rapid

Simulation with Mocker

Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:

Large-scale experiments without GPU resources
Testing planner behavior and infrastructure
Validating deployment configurations

1 spec:
2   model: <model-name>
3   backend: trtllm  # Real backend for profiling
4   features:
5     mocker:
6       enabled: true  # Deploy mocker instead of real backend
7 
8   image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0

Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.

Model Cache PVC (0.8.1+)

For large models, use a pre-populated PVC instead of downloading from HuggingFace:

See SLA-Driven Profiling for configuration details.

Advanced Examples

Custom Load Predictors

Warm-starting with Trace Data

Pre-load predictors with historical request patterns before live traffic:

1 # In planner arguments
2 args:
3   - --load-predictor arima
4   - --load-predictor-warmup-trace /data/trace.jsonl
5   - --load-predictor-log1p

The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.

Kalman Filter Tuning

For workloads with rapid changes, tune the Kalman filter:

1 args:
2   - --load-predictor kalman
3   - --kalman-q-level 2.0      # Higher = more responsive to level changes
4   - --kalman-q-trend 0.5      # Higher = trend changes faster
5   - --kalman-r 5.0            # Lower = trusts new measurements more
6   - --kalman-min-points 3     # Fewer points before forecasting starts
7   - --load-predictor-log1p    # Often helps with request-rate series

Prophet for Seasonal Workloads

For workloads with daily/weekly patterns:

1 args:
2   - --load-predictor prophet
3   - --prophet-window-size 100   # Larger window for seasonal detection
4   - --load-predictor-log1p

Virtual Connector

For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:

1 from dynamo._core import DistributedRuntime, VirtualConnectorClient
2 
3 # Initialize client
4 client = VirtualConnectorClient(distributed_runtime, namespace)
5 
6 # Main loop: watch for planner decisions and execute them
7 while True:
8     # Block until the planner makes a new scaling decision
9     await client.wait()
10 
11     # Read the decision
12     decision = await client.get()
13     print(f"Scale to: prefill={decision.num_prefill_workers}, "
14           f"decode={decision.num_decode_workers}, "
15           f"id={decision.decision_id}")
16 
17     # Execute scaling in your environment
18     scale_prefill_workers(decision.num_prefill_workers)
19     scale_decode_workers(decision.num_decode_workers)
20 
21     # Report completion
22     await client.complete(decision)

See components/planner/test/test_virtual_connector.py for a full working example.

Planner Configuration Passthrough

Pass planner-specific settings through the DGDR:

1 features:
2   planner:
3     optimization_target: sla
4     min_endpoint: 2

Review Before Deploy (autoApply: false)

Disable auto-deployment to inspect the generated DGD:

1 spec:
2   autoApply: false

After profiling completes:

$ # Extract and review generated DGD
$ kubectl get dgdr sla-aic -n $NAMESPACE \
>   -o jsonpath='{.status.profilingResults.selectedConfig}' > my-dgd.yaml
$ 
$ # Review and modify as needed
$ vi my-dgd.yaml
$ 
$ # Deploy manually
$ kubectl apply -f my-dgd.yaml -n $NAMESPACE

Profiling Artifacts with PVC

Save detailed profiling artifacts (plots, logs, raw data) to a PVC:

1 spec:
2   workload:
3     isl: 3000
4     osl: 150
5 
6   sla:
7     ttft: 200
8     itl: 20

Setup:

$ export NAMESPACE=your-namespace
$ deploy/utils/setup_benchmarking_resources.sh

Access results:

$ kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
$ kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
$ kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
$ kubectl delete pod pvc-access-pod -n $NAMESPACE

Planner README — Overview and quick start
Planner Guide — Deployment, configuration, integration
Planner Design — Architecture deep-dive
DGDR Configuration Reference
SLA-Driven Profiling

Basic Examples

Minimal DGDR with AIC (Fastest)

Online Profiling (Real Measurements)

Kubernetes Examples

MoE Models (SGLang)

Using Existing DGD Configs (Custom Setups)

Inline Configuration (Simple Use Cases)

Simulation with Mocker

Model Cache PVC (0.8.1+)

Advanced Examples

Custom Load Predictors

Warm-starting with Trace Data

Kalman Filter Tuning

Prophet for Seasonal Workloads

Virtual Connector

Planner Configuration Passthrough

Review Before Deploy (autoApply: false)

Profiling Artifacts with PVC

Related Documentation