This guide explains how to deploy GlobalPlanner and when to use it. GlobalPlanner is the centralized scaling execution layer for deployments where multiple DGDs should delegate scaling through one component, whether those DGDs expose separate endpoints or sit behind one shared endpoint.
New to Planner? We recommend starting with a single-DGD deployment using either throughput-based or load-based scaling before adopting GlobalPlanner. See the Planner overview and Planner Guide to get started.
Without GlobalPlanner, each DGD’s local planner scales only its own deployment directly. That is fine for isolated deployments, but it becomes awkward when you want one place to:
GlobalPlanner solves that by becoming the common scale-execution endpoint for multiple local planners.
- dynamo.planner component that computes desired replica counts to maintain latency SLAs. See the Planner overview.
- GlobalRouter and GlobalPlanner.
Use GlobalPlanner in one of these two patterns:
Use this pattern when you have multiple DGDs, often for different models, and you want them to share centralized scaling policy without collapsing them into one endpoint.
Typical examples:
- qwen-0.6b disaggregated deployment with its own local planner
- qwen-32b disaggregated deployment with its own local planner
- GlobalPlanner that all local planners delegate to
In this pattern:
- environment: "global-planner"
- global_planner_namespace
- GlobalRouter
This is the pattern to use when the goal is centralized scaling control across multiple deployments or models.
Use this pattern when all of the following are true:
Typical examples:
If you only need one pool for one model, use a single Local Planner and DGD/DGDR instead.
In the current implementation, the single-endpoint pattern is composed from multiple resources:
Current workflow
A single DGDR does not generate the full single-endpoint multi-pool topology today. Instead, run one DGDR or profiling job per intended pool, then compose the final control DGD plus pool DGDs manually.
The Frontend exposes a single model endpoint. GlobalRouter selects the best pool for each request. Each pool-local Planner decides how much capacity its own pool needs. GlobalPlanner receives those scale requests and applies the Kubernetes replica changes centrally.
- vllm, sglang, or trtllm).
For throughput-based scaling, you also need profiling data for each pool. See Profiler Guide.
Before writing manifests, decide the following:
Start by deciding what each pool should specialize in. Common examples:
For each intended pool, run a separate DGDR or profiling job with the workload and SLA that represent that pool.
Example DGDR skeleton:
Repeat this once per planned pool, changing the workload and SLA inputs for each request class.
What to keep from each profiling result:
- tensor-parallel-size, GPUs per worker, memory/caching settings).
- prefill_engine_num_gpu or decode_engine_num_gpu.
See Planner Examples and Profiler Guide for DGDR details.
Deploy one control DGD that contains:
- Frontend: the single public model endpoint.
- GlobalRouter: chooses which pool receives each request.
- GlobalPlanner: receives scale requests from pool planners and applies replica changes.
The vLLM example topology is in examples/global_planner/global-planner-vllm-test.yaml.
The GlobalPlanner section is minimal:
The values passed to --managed-namespaces are the pool planners’ Dynamo namespaces (caller_namespace), not raw Kubernetes namespaces. In many examples they share the same string prefix, but they are logically different identifiers.
Management modes: When --managed-namespaces is set (explicit mode), only the listed Dynamo namespaces are authorized to send scale requests, and only their corresponding DGDs count toward the GPU budget. DGD names are derived from the Dynamo namespace using the operator convention DYN_NAMESPACE = {k8s_namespace}-{dgd_name}. When omitted (implicit mode), any caller is accepted and all DGDs in the Kubernetes namespace count toward the GPU budget.
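The naming convention and the two management modes can be sketched in a few lines of Python. This is an illustration only, not GlobalPlanner's actual code: the helper names `dgd_name_from_dyn_namespace` and `is_authorized` are hypothetical, and the sketch assumes the Kubernetes namespace is already known so that it can be stripped as a prefix.

```python
from typing import Optional, Set


def dgd_name_from_dyn_namespace(dyn_namespace: str, k8s_namespace: str) -> str:
    """Recover the DGD name from DYN_NAMESPACE = {k8s_namespace}-{dgd_name}.

    Hypothetical helper: it simply strips the '{k8s_namespace}-' prefix.
    """
    prefix = f"{k8s_namespace}-"
    if not dyn_namespace.startswith(prefix):
        raise ValueError(
            f"{dyn_namespace!r} does not start with {prefix!r}"
        )
    dgd_name = dyn_namespace[len(prefix):]
    if not dgd_name:
        raise ValueError("Dynamo namespace has no DGD name after the prefix")
    return dgd_name


def is_authorized(caller_namespace: str, managed: Optional[Set[str]]) -> bool:
    """Decide whether a scale request from caller_namespace is accepted.

    managed is None     -> implicit mode: any caller is accepted.
    managed is a set    -> explicit mode: only listed Dynamo namespaces pass.
    """
    if managed is None:
        return True
    return caller_namespace in managed
```

For example, with a Kubernetes namespace of `prod` (an assumed value), the Dynamo namespace `prod-gp-prefill-0` maps to the DGD `gp-prefill-0`, and in explicit mode a caller is accepted only if its Dynamo namespace appears in the managed set.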
If you want the central executor to reject scale requests that exceed a total GPU budget, add --max-total-gpus. See examples/global_planner/global-planner-gpu-budget.yaml.
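One way to picture the budget check is as a comparison between the projected GPU total and the configured limit. The sketch below is an assumption about the arithmetic, not GlobalPlanner's implementation: `PoolState`, `projected_total_gpus`, and `accept_scale_request` are hypothetical names, and it assumes each pool has a fixed GPU count per replica.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PoolState:
    replicas: int          # current replica count of the pool's worker
    gpus_per_replica: int  # GPUs consumed by one replica


def projected_total_gpus(
    pools: Dict[str, PoolState], target: str, new_replicas: int
) -> int:
    """Total GPUs across all pools if `target` were scaled to `new_replicas`."""
    total = 0
    for name, state in pools.items():
        replicas = new_replicas if name == target else state.replicas
        total += replicas * state.gpus_per_replica
    return total


def accept_scale_request(
    pools: Dict[str, PoolState],
    target: str,
    new_replicas: int,
    max_total_gpus: int,
) -> bool:
    """Accept the request only if the projected total stays within budget."""
    if target not in pools:
        raise KeyError(f"unknown pool: {target}")
    if new_replicas < 0:
        raise ValueError("replica count must be non-negative")
    return projected_total_gpus(pools, target, new_replicas) <= max_total_gpus


# Pools sized like the reference vLLM example, one replica each (4 GPUs total).
pools = {
    "gp-prefill-0": PoolState(replicas=1, gpus_per_replica=1),
    "gp-prefill-1": PoolState(replicas=1, gpus_per_replica=2),
    "gp-decode-0": PoolState(replicas=1, gpus_per_replica=1),
}
```

With these assumed numbers, scaling `gp-prefill-1` to three replicas projects to 8 GPUs, so it would pass a budget of 8 and be rejected under a budget of 6.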
Each private pool gets its own DGD. A pool DGD usually contains:
- LocalRouter
- prefill or decode)
- Planner
The planner inside each pool must be configured for global-planner mode so it delegates scaling to the control stack:
global_planner_namespace must point to the control stack’s Dynamo namespace. In the reference manifests, that is the namespace string passed to the control Frontend and GlobalRouter.
Use:
- mode: "prefill" for prefill-only pools
- mode: "decode" for decode-only pools
The worker and planner settings for each pool come from the pool-specific profiling result you created in Step 1.
In the reference vLLM example:
- gp-prefill-0 uses a 1-GPU TP1 prefill worker
- gp-prefill-1 uses a 2-GPU TP2 prefill worker
- gp-decode-0 uses a 1-GPU TP1 decode worker
See global-planner-vllm-test.yaml.
GlobalRouter reads a JSON config that lists the pool namespaces and a routing grid for each request type.
Example:
The prefill_pool_dynamo_namespaces and decode_pool_dynamo_namespaces entries are Dynamo namespaces that the pool-local routers register under.
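To make the idea of a routing grid concrete, here is a minimal Python sketch of a grid lookup for prefill requests, keyed by input sequence length (ISL) and a TTFT target. It is not the GlobalRouter implementation or its config schema: the boundary lists, the grid layout, the function names, and the namespace strings are all hypothetical, chosen only to show how a request could be mapped to an index into prefill_pool_dynamo_namespaces.

```python
from bisect import bisect_right
from typing import List, Sequence


def bucket(value: float, boundaries: Sequence[float]) -> int:
    """Return the index of the bucket that `value` falls into.

    `boundaries` must be sorted ascending; N boundaries define N + 1 buckets.
    """
    return bisect_right(boundaries, value)


def select_prefill_pool(
    isl: int,
    ttft_target_ms: float,
    isl_boundaries: Sequence[float],
    ttft_boundaries: Sequence[float],
    grid: List[List[int]],
    pool_namespaces: Sequence[str],
) -> str:
    """Look up the pool namespace for a request in a 2D (ISL x TTFT) grid."""
    row = bucket(isl, isl_boundaries)
    col = bucket(ttft_target_ms, ttft_boundaries)
    return pool_namespaces[grid[row][col]]


# Hypothetical values: two ISL buckets (split at 2048 tokens) and two TTFT
# buckets (split at 500 ms). Each cell holds an index into the pool list.
prefill_pool_dynamo_namespaces = ["example-gp-prefill-0", "example-gp-prefill-1"]
isl_boundaries = [2048]
ttft_boundaries = [500.0]
grid = [
    [0, 0],  # short prompts: the smaller pool is enough
    [1, 0],  # long prompts: the larger pool only when the TTFT target is tight
]

chosen = select_prefill_pool(
    4096, 200.0, isl_boundaries, ttft_boundaries, grid,
    prefill_pool_dynamo_namespaces,
)
```

In this sketch a long prompt with a tight TTFT target lands on the second pool, while short prompts and relaxed targets stay on the first. A decode grid keyed by context length and ITL would follow the same shape.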
Important runtime behavior:
- GlobalRouter
Clients can pass request targets through extra_args:
For more details, see Global Router README.
For a fresh cluster, the usual order is:
- GlobalRouter ConfigMap.
- Frontend.
Example:
The single user-facing endpoint is the Frontend in the control DGD, not the pool DGDs.
Validate the deployment from outside in:
- Frontend is healthy and serving the model endpoint.
- GlobalRouter logs show requests being assigned to the expected pool namespaces.
- GlobalPlanner logs show accepted scale operations.
If you use Prometheus and Grafana, also inspect:
For most teams, the easiest way to build this deployment is:
- Frontend, GlobalRouter, and GlobalPlanner.
- Frontend.
This keeps profiling and pool selection simple while still giving you one public endpoint for the model.
- GlobalPlanner deployments are assembled manually today. One DGDR does not emit the full control DGD plus pool DGDs topology.
- GlobalRouter routes by ISL/TTFT and context-length/ITL grids, not directly by OSL.