Global Planner Deployment Guide

This guide explains how to deploy GlobalPlanner and when to use it. GlobalPlanner is the centralized scaling execution layer for deployments where multiple DGDs should delegate scaling through one component, whether those DGDs expose separate endpoints or sit behind one shared endpoint.

New to Planner? We recommend starting with a single-DGD deployment using either throughput-based or load-based scaling before adopting GlobalPlanner. See the Planner overview and Planner Guide to get started.

Why Global Planner?

Without GlobalPlanner, each DGD’s local planner scales only its own deployment. That works for isolated deployments, but it becomes limiting when you want one place to:

  • apply centralized scaling policy across multiple DGDs
  • enforce shared constraints such as authorization or total GPU budget
  • coordinate scaling for a single-endpoint, multi-pool deployment

GlobalPlanner solves that by becoming the common scale-execution endpoint for multiple local planners.

Terminology

  • Planner: The dynamo.planner component that computes desired replica counts to maintain latency SLAs. See the Planner overview.
  • Local Planner: A pool-local instance of the Planner running inside a single DGD.
  • Global Planner: The centralized execution and policy layer that receives scale requests from local planners.
  • Single-endpoint multi-pool deployment: One model endpoint backed by multiple DGDs for the same model. This pattern uses both GlobalRouter and GlobalPlanner.

Deployment Patterns

Use GlobalPlanner in one of these two patterns:

| Pattern | Use when | Needs GlobalRouter | Public endpoint shape |
|---|---|---|---|
| Multiple model endpoints or independent DGDs | Separate DGDs should share centralized scaling policy, such as authorization or total GPU budget | No | One endpoint per DGD, or however each DGD is exposed |
| One model endpoint, multiple DGDs | One model should be reachable through one public endpoint, but different request classes should land on different DGDs | Yes | One shared endpoint |

Pattern 1: Multiple Model Endpoints Or Independent DGDs

Use this pattern when you have multiple DGDs, often for different models, and you want them to share centralized scaling policy without collapsing them into one endpoint.

Typical examples:

  • DGD A: qwen-0.6b disaggregated deployment with its own local planner
  • DGD B: qwen-32b disaggregated deployment with its own local planner
  • one shared GlobalPlanner that all local planners delegate to

In this pattern:

  • each DGD keeps its own normal local planner
  • each local planner is configured with environment: "global-planner"
  • all those planners point at the same global_planner_namespace
  • each DGD keeps its own endpoint or frontend as needed
  • you do not need GlobalRouter

This is the pattern to use when the goal is centralized scaling control across multiple deployments or models.

Pattern 2: One Model Endpoint, Multiple DGDs

Use this pattern when all of the following are true:

  • You want one public endpoint for a single model.
  • You want different private pools for different request classes, such as short ISL vs. long ISL requests, or different latency targets.
  • You want each pool to autoscale independently.
  • You want routing and scale execution to be centralized instead of exposing multiple endpoints to clients.

Typical examples:

  • short-input requests are cheaper on a smaller prefill pool
  • long-input requests need a larger prefill pool
  • decode capacity should scale independently from prefill capacity

If you only need one pool for one model, use a single Local Planner and DGD/DGDR instead.

What You Deploy

In the current implementation, the single-endpoint pattern is composed from multiple resources:

| Resource | Purpose | Typical contents |
|---|---|---|
| Control DGD | Public entrypoint and centralized control plane | Frontend, GlobalRouter, GlobalPlanner |
| Prefill pool DGD(s) | Private prefill capacity pools | LocalRouter, prefill worker(s), Planner |
| Decode pool DGD(s) | Private decode capacity pools | LocalRouter, decode worker(s), Planner |
| Optional DGDR(s) | Generate or validate one optimized pool shape at a time | Model, workload, SLA, hardware inputs |

Current workflow

A single DGDR does not generate the full single-endpoint multi-pool topology today. Instead, run one DGDR or profiling job per intended pool, then compose the final control DGD plus pool DGDs manually.

Architecture

```text
Client
  |
  v
Frontend (single public endpoint)
  |
  v
GlobalRouter
  |
  +--> Prefill pool 0 Dynamo namespace --> LocalRouter --> Prefill workers --> Pool Planner
  +--> Prefill pool 1 Dynamo namespace --> LocalRouter --> Prefill workers --> Pool Planner
  |
  +--> Decode pool 0 Dynamo namespace --> LocalRouter --> Decode workers --> Pool Planner
  +--> Decode pool 1 Dynamo namespace --> LocalRouter --> Decode workers --> Pool Planner

Pool Planners
  |
  v
GlobalPlanner
  |
  v
Kubernetes scaling updates on the target DGDs
```

The Frontend exposes a single model endpoint. GlobalRouter selects the best pool for each request. Each pool-local Planner decides how much capacity its own pool needs. GlobalPlanner receives those scale requests and applies the Kubernetes replica changes centrally.

Prerequisites

  • Dynamo Kubernetes Platform installed. See Kubernetes Quickstart.
  • Prometheus deployed and scraping router metrics. The global planner examples assume cluster Prometheus is available.
  • Backend images available for your chosen framework (vllm, sglang, or trtllm).
  • Secrets for model access, such as a Hugging Face token secret.
  • A storage strategy for model weights if your workers should share a model cache PVC.

For throughput-based scaling, you also need profiling data for each pool. See Profiler Guide.

Inputs You Need To Decide Up Front

Before writing manifests, decide the following:

| Input | Why it matters | Example |
|---|---|---|
| Model name | All pools in one hierarchy serve the same model | meta-llama/Llama-3.3-70B-Instruct |
| Backend | Worker args and profiling flow depend on it | vllm |
| Pool inventory | Number of specialized prefill and decode pools | 2 prefill pools, 1 decode pool |
| Workload classes | Determines how many pool profiles you generate | short ISL, long ISL, long context decode |
| SLA targets | Guides profiling and routing decisions | ttft: 200 ms, itl: 20 ms |
| Worker shape | Tensor parallelism, GPUs per worker, and memory footprint | TP1 prefill vs. TP2 prefill |
| Routing policy | Maps requests to pools at runtime | low-ISL requests -> pool 0 |
| Optional global budget | Caps total GPUs across managed pools | --max-total-gpus 16 |

Step 1: Profile Each Intended Pool Independently

Start by deciding what each pool should specialize in. Common examples:

  • Prefill pool 0: lower-cost pool for short prompts.
  • Prefill pool 1: larger pool for long prompts.
  • Decode pool 0: standard decode pool for most requests.

For each intended pool, run a separate DGDR or profiling job with the workload and SLA that represent that pool.

Example DGDR skeleton:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-prefill-short
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:<tag>
  workload:
    isl: 2048
    osl: 256
  sla:
    ttft: 200.0
    itl: 20.0
  searchStrategy: rapid
  autoApply: false
```

Repeat this once per planned pool, changing the workload and SLA inputs for each request class.
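Because the per-pool DGDRs differ only in their workload and SLA inputs, the repetition is easy to script. The sketch below renders one manifest per pool class from a spec list using plain string templating; the pool names and workload numbers here are hypothetical placeholders, not values from the reference example:

```python
# Render one DGDR manifest per planned pool class.
# Illustrative sketch; pool specs below are hypothetical examples.
DGDR_TEMPLATE = """\
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: {name}
spec:
  model: {model}
  backend: vllm
  image: {image}
  workload:
    isl: {isl}
    osl: {osl}
  sla:
    ttft: {ttft}
    itl: {itl}
  searchStrategy: rapid
  autoApply: false
"""

POOLS = [
    {"name": "llama-prefill-short", "isl": 2048,  "osl": 256,  "ttft": 200.0, "itl": 20.0},
    {"name": "llama-prefill-long",  "isl": 16384, "osl": 256,  "ttft": 500.0, "itl": 20.0},
    {"name": "llama-decode",        "isl": 4096,  "osl": 1024, "ttft": 200.0, "itl": 20.0},
]

def render_dgdrs(model: str, image: str, pools: list[dict]) -> list[str]:
    """Return one DGDR manifest string per pool spec."""
    return [DGDR_TEMPLATE.format(model=model, image=image, **p) for p in pools]
```

Write each rendered manifest to its own file and apply them one at a time, reviewing each profiling result before moving on.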

What to keep from each profiling result:

  • Worker shape (tensor-parallel-size, GPUs per worker, memory/caching settings).
  • Planner profile data directory or generated ConfigMaps.
  • Planner settings such as prefill_engine_num_gpu or decode_engine_num_gpu.
  • Any backend-specific flags that differ across pools.

See Planner Examples and Profiler Guide for DGDR details.

Step 2: Create The Control DGD

Deploy one control DGD that contains:

  • Frontend: the single public model endpoint.
  • GlobalRouter: chooses which pool receives each request.
  • GlobalPlanner: receives scale requests from pool planners and applies replica changes.

The vLLM example topology is in examples/global_planner/global-planner-vllm-test.yaml.

The GlobalPlanner section is minimal:

```yaml
GlobalPlanner:
  componentType: default
  replicas: 1
  extraPodSpec:
    mainContainer:
      image: ${DYNAMO_IMAGE}
      command:
        - python3
        - -m
        - dynamo.global_planner
      args:
        - --managed-namespaces
        - ${K8S_NAMESPACE}-gp-prefill-0
        - ${K8S_NAMESPACE}-gp-prefill-1
        - ${K8S_NAMESPACE}-gp-decode-0
```

The values passed to --managed-namespaces are the pool planners’ Dynamo namespaces (caller_namespace), not raw Kubernetes namespaces. In many examples they share the same string prefix, but they are logically different identifiers.

Management modes: When --managed-namespaces is set (explicit mode), only the listed Dynamo namespaces are authorized to send scale requests, and only their corresponding DGDs count toward the GPU budget. DGD names are derived from the Dynamo namespace using the operator convention DYN_NAMESPACE = {k8s_namespace}-{dgd_name}. When omitted (implicit mode), any caller is accepted and all DGDs in the Kubernetes namespace count toward the GPU budget.
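To make the naming convention concrete, a small helper (hypothetical, not part of the dynamo codebase) can recover the DGD name that would be derived from a managed Dynamo namespace:

```python
def dgd_name_from_dyn_namespace(dyn_namespace: str, k8s_namespace: str) -> str:
    """Derive the DGD name from a Dynamo namespace using the operator
    convention DYN_NAMESPACE = {k8s_namespace}-{dgd_name}.
    Illustrative helper, not part of dynamo itself."""
    prefix = k8s_namespace + "-"
    if not dyn_namespace.startswith(prefix):
        raise ValueError(f"{dyn_namespace!r} does not start with {prefix!r}")
    return dyn_namespace[len(prefix):]

# In Kubernetes namespace "my-llama", the managed Dynamo namespace
# "my-llama-gp-prefill-0" corresponds to the DGD named "gp-prefill-0".
```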

If you want the central executor to reject scale requests that exceed a total GPU budget, add --max-total-gpus. See examples/global_planner/global-planner-gpu-budget.yaml.
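The budget check can be pictured as a simple admission test: sum the GPUs the managed DGDs would use after the requested change, and reject the request if the total exceeds the cap. This is an illustrative sketch of the logic implied by --max-total-gpus, not the actual GlobalPlanner implementation; the function name and inputs are assumptions:

```python
def within_gpu_budget(current_gpus_by_dgd: dict[str, int], dgd: str,
                      requested_replicas: int, gpus_per_replica: int,
                      max_total_gpus: int) -> bool:
    """Return True if scaling `dgd` to `requested_replicas` keeps the total
    GPU count across all managed DGDs within the budget.
    Sketch of the --max-total-gpus admission check, for illustration only."""
    proposed = dict(current_gpus_by_dgd)
    proposed[dgd] = requested_replicas * gpus_per_replica
    return sum(proposed.values()) <= max_total_gpus
```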

Step 3: Create One DGD Per Pool

Each private pool gets its own DGD. A pool DGD usually contains:

  • LocalRouter
  • one worker type (prefill or decode)
  • one Planner

The planner inside each pool must be configured for global-planner mode so it delegates scaling to the control stack:

```json
{
  "environment": "global-planner",
  "global_planner_namespace": "${K8S_NAMESPACE}-gp-ctrl",
  "backend": "vllm",
  "mode": "prefill",
  "enable_load_scaling": false,
  "enable_throughput_scaling": true,
  "throughput_metrics_source": "router",
  "ttft": 2000,
  "prefill_engine_num_gpu": 2,
  "model_name": "${MODEL_NAME}",
  "profile_results_dir": "/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"
}
```

global_planner_namespace must point to the control stack’s Dynamo namespace. In the reference manifests, that is the namespace string passed to the control Frontend and GlobalRouter.

Use:

  • mode: "prefill" for prefill-only pools
  • mode: "decode" for decode-only pools

The worker and planner settings for each pool come from the pool-specific profiling result you created in Step 1.

In the reference vLLM example:

  • gp-prefill-0 uses a 1-GPU TP1 prefill worker
  • gp-prefill-1 uses a 2-GPU TP2 prefill worker
  • gp-decode-0 uses a 1-GPU TP1 decode worker

See global-planner-vllm-test.yaml.

Step 4: Configure GlobalRouter To Select Pools

GlobalRouter reads a JSON config that lists the pool namespaces and a routing grid for each request type.

Example:

```json
{
  "num_prefill_pools": 2,
  "num_decode_pools": 1,
  "prefill_pool_dynamo_namespaces": [
    "${K8S_NAMESPACE}-gp-prefill-0",
    "${K8S_NAMESPACE}-gp-prefill-1"
  ],
  "decode_pool_dynamo_namespaces": [
    "${K8S_NAMESPACE}-gp-decode-0"
  ],
  "prefill_pool_selection_strategy": {
    "ttft_min": 10,
    "ttft_max": 3000,
    "ttft_resolution": 2,
    "isl_min": 0,
    "isl_max": 32000,
    "isl_resolution": 2,
    "prefill_pool_mapping": [[0, 1], [0, 1]]
  },
  "decode_pool_selection_strategy": {
    "itl_min": 10,
    "itl_max": 500,
    "itl_resolution": 2,
    "context_length_min": 0,
    "context_length_max": 32000,
    "context_length_resolution": 2,
    "decode_pool_mapping": [[0, 0], [0, 0]]
  }
}
```

The prefill_pool_dynamo_namespaces and decode_pool_dynamo_namespaces entries are Dynamo namespaces that the pool-local routers register under.

Important runtime behavior:

  • Prefill pool selection uses ISL + TTFT target
  • Decode pool selection uses context length + ITL target
  • OSL is useful for designing and profiling pools, but it is not a direct routing key in the current GlobalRouter
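To make the grid lookup concrete, here is a sketch of how a request's ISL and TTFT target could index into prefill_pool_mapping from the example config above. The equal-width bucketing math and axis order are assumptions for illustration; the real GlobalRouter lookup may differ:

```python
def bucket(value: float, lo: float, hi: float, resolution: int) -> int:
    """Map a value into one of `resolution` equal-width buckets over [lo, hi),
    clamping out-of-range values to the edge buckets."""
    if value <= lo:
        return 0
    if value >= hi:
        return resolution - 1
    return int((value - lo) / (hi - lo) * resolution)

def select_prefill_pool(isl: int, ttft_target: float, strategy: dict) -> int:
    """Illustrative grid lookup: bucket the TTFT target and ISL, then read
    the pool index from prefill_pool_mapping. Axis order is an assumption."""
    t = bucket(ttft_target, strategy["ttft_min"], strategy["ttft_max"],
               strategy["ttft_resolution"])
    i = bucket(isl, strategy["isl_min"], strategy["isl_max"],
               strategy["isl_resolution"])
    return strategy["prefill_pool_mapping"][t][i]
```

With the example mapping `[[0, 1], [0, 1]]`, short-ISL requests land on pool 0 and long-ISL requests on pool 1, regardless of TTFT bucket.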

Clients can pass request targets through extra_args:

```json
{
  "extra_args": {
    "ttft_target": 200,
    "itl_target": 20
  }
}
```

For more details, see Global Router README.
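As an illustration, a client call carrying these per-request targets might look like the following. This assumes the control Frontend exposes an OpenAI-compatible chat completions endpoint reachable on localhost port 8000 (for example via port-forward); the URL, port, and payload shape beyond `extra_args` are assumptions:

```python
import json
import urllib.request

# Hypothetical request against the control Frontend; only the extra_args
# block comes from the documented config above.
payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "extra_args": {"ttft_target": 200, "itl_target": 20},
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment against a live deployment
```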

Step 5: Deploy In Order

For a fresh cluster, the usual order is:

  1. Install Dynamo platform and Prometheus.
  2. Create secrets and PVCs needed by workers.
  3. Create the GlobalRouter ConfigMap.
  4. Apply the control DGD.
  5. Apply the pool DGDs.
  6. Wait for all DGDs to reach ready state.
  7. Expose or port-forward the control Frontend.

Example:

```bash
export K8S_NAMESPACE=my-llama
export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
export DYNAMO_IMAGE=<dynamo-image>
export DYNAMO_VLLM_IMAGE=<vllm-image>
export STORAGE_CLASS_NAME=<rwx-storage-class>

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${K8S_NAMESPACE}

envsubst < examples/global_planner/global-planner-vllm-test.yaml | \
  kubectl apply -n ${K8S_NAMESPACE} -f -
```

The single user-facing endpoint is the Frontend in the control DGD, not the pool DGDs.

Step 6: Validate The Stack

Validate the deployment from outside in:

  • Confirm the control Frontend is healthy and serving the model endpoint.
  • Confirm GlobalRouter logs show requests being assigned to the expected pool namespaces.
  • Confirm pool-local planners are producing scale requests.
  • Confirm GlobalPlanner logs show accepted scale operations.
  • Confirm the target DGDs’ replica counts change as expected.

If you use Prometheus and Grafana, also inspect:

  • TTFT and ITL over time
  • per-pool worker counts
  • per-pool request mix
  • total GPU usage

For most teams, the easiest way to build this deployment is:

  1. Design your pool classes from expected traffic patterns.
  2. Run one DGDR per pool class to generate or validate the pool configuration.
  3. Copy the selected worker shape and planner settings into the final pool DGDs.
  4. Build one control DGD with Frontend, GlobalRouter, and GlobalPlanner.
  5. Route all client traffic through the control Frontend.

This keeps profiling and pool selection simple while still giving you one public endpoint for the model.

Current Limitations

  • Single-endpoint GlobalPlanner deployments are assembled manually today. One DGDR does not emit the full control DGD plus pool DGDs topology.
  • GlobalRouter routes by ISL/TTFT and context-length/ITL grids, not directly by OSL.
  • In the single-endpoint pattern, all pools are expected to serve the same model.

See Also