Topology-Aware KV Transfer

Keep disaggregated prefill and decode KV-cache transfers within a selected topology domain

Topology-aware KV transfer lets a disaggregated Dynamo deployment route decode requests toward workers that share the selected prefill worker’s topology domain, such as zone or rack. This reduces slow cross-domain KV-cache transfers when prefill and decode workers exchange KV data over NIXL.

Use this feature when:

  • Your deployment uses separate prefill and decode workers.
  • Your cluster exposes useful node labels, such as topology.kubernetes.io/zone or a rack/block label.
  • Same-domain KV transfer is required for correctness or strongly preferred for latency and bandwidth.

This page covers the Kubernetes operator path. For router and runtime behavior, see Router Topology-Aware KV Transfer. For RDMA/NIXL transport setup, see Disagg Communication.

How It Works

The operator configures worker pods from spec.experimental.kvTransferPolicy:

  • Adds a nvidia.com/topology-label-key annotation to worker pods.
  • Runs a topology-label controller that copies the configured node label onto the worker pod after scheduling.
  • Projects that pod label into /etc/dynamo/topology/<domain> with a Downward API volume.
  • Injects worker environment variables that tell the backend runtime which topology domain and enforcement policy to publish.

The frontend does not read this policy from its own environment. Workers publish the topology metadata in their ModelRuntimeConfig; the router reads it from runtime discovery.
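The label-copy step can be pictured as a small pure function. The sketch below is illustrative only and is not the operator's code: `plan_label_copy` and its return values are hypothetical names, and the only details taken from this page are the `nvidia.com/topology-label-key` annotation and the fact that a missing node label leaves topology metadata unavailable.

```python
from typing import Dict, Optional, Tuple

# Annotation name taken from this page; everything else here is hypothetical.
ANNOTATION = "nvidia.com/topology-label-key"


def plan_label_copy(
    pod_annotations: Dict[str, str],
    node_labels: Dict[str, str],
) -> Tuple[str, Optional[Dict[str, str]]]:
    """Decide what a label-copying controller could do for one scheduled pod.

    Returns:
      ("skip", None)          if the pod does not carry the annotation
      ("patch", {key: value}) if the node carries the configured label
      ("missing", None)       if the node does not carry the configured label
    """
    label_key = pod_annotations.get(ANNOTATION)
    if not label_key:
        return ("skip", None)
    value = node_labels.get(label_key)
    if value is None:
        return ("missing", None)
    # The pod label uses the same key as the node label.
    return ("patch", {label_key: value})
```

In this model, the `"patch"` result corresponds to the pod label that the Downward API volume later projects into the topology file, and the `"missing"` result corresponds to the case described under Troubleshooting.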

Prerequisites

| Requirement | Details |
| --- | --- |
| Disaggregated serving | Separate prefill and decode worker services. |
| KV router | The frontend should use DYN_ROUTER_MODE=kv. |
| Node topology labels | Every node that can host a worker must carry the configured labelKey. |
| Dynamo operator | The operator must include the topology-label controller and node-read RBAC. |
| KV transfer transport | RDMA, EFA, or another NIXL-compatible transport should already be configured for production disaggregated deployments. |

Confirm that the label you plan to use exists on worker nodes:

$ kubectl get nodes -L topology.kubernetes.io/zone

Required Same-Domain Routing

enforcement: required constrains decode worker selection to workers whose topology value matches the selected prefill worker for the configured domain. If no decode worker satisfies the generated constraint, the router fails the request instead of silently crossing the domain.

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-disagg-zone
spec:
  experimental:
    kvTransferPolicy:
      labelKey: topology.kubernetes.io/zone
      domain: zone
      enforcement: required
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              env:
                - name: DYN_ROUTER_MODE
                  value: kv
    - name: VllmPrefillWorker
      type: worker
      replicas: 2
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              command: ["python3", "-m", "dynamo.vllm"]
              args: ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "prefill"]
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"
    - name: VllmDecodeWorker
      type: worker
      replicas: 2
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              command: ["python3", "-m", "dynamo.vllm"]
              args: ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "decode"]
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"

enforcement defaults to required when omitted.

required is a decode-routing constraint, not a capacity planner. The DynamoGraphDeployment author or cluster administrator must ensure that every topology domain that can receive prefill workers also has sufficient same-domain decode capacity. If a domain has prefill workers but no matching decode workers, or too little decode capacity, the router cannot spill to another domain without violating the policy.
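As a rough model of the required policy, the sketch below filters decode workers by the prefill worker's value for the configured domain and raises when nothing matches. It is not the router's implementation: the `Worker` class, the exception name, and the choice to fail when the prefill worker publishes no value for the domain are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Worker:
    """Hypothetical stand-in for a worker's published topology metadata."""
    name: str
    topology_domains: Dict[str, str] = field(default_factory=dict)


class NoSameDomainDecodeWorker(RuntimeError):
    """Raised instead of silently selecting a cross-domain decode worker."""


def required_decode_candidates(
    prefill: Worker, decodes: List[Worker], domain: str
) -> List[Worker]:
    """Return decode workers whose value for `domain` matches the prefill worker."""
    target = prefill.topology_domains.get(domain)
    matches = [
        w for w in decodes
        # Assumption: a prefill worker without a value for the domain matches nothing.
        if target is not None and w.topology_domains.get(domain) == target
    ]
    if not matches:
        raise NoSameDomainDecodeWorker(
            f"no decode worker in {domain}={target!r} for prefill worker {prefill.name}"
        )
    return matches
```

The function never falls back to another domain, which is why per-domain decode capacity has to be planned ahead of time.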

Capacity Planning Across Domains

Plan prefill and decode capacity per topology domain before enabling enforcement: required. For example, assume:

  • Two availability zones: az-1 and az-2.
  • The target fleet is 60 prefill workers and 120 decode workers.
  • The fleet should be split evenly across the two zones.
  • The target prefill-to-decode ratio is 1:2 in each zone.

That means each zone should run 30 prefill workers and 60 decode workers:

| Zone | Prefill workers | Decode workers | Ratio |
| --- | --- | --- | --- |
| az-1 | 30 | 60 | 1:2 |
| az-2 | 30 | 60 | 1:2 |
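The per-zone numbers follow from dividing the fleet evenly and checking the ratio in each zone. A minimal sketch of that arithmetic, using the hypothetical helper names `per_zone_plan` and `ratio_holds`:

```python
from fractions import Fraction
from typing import Dict, List


def per_zone_plan(
    total_prefill: int, total_decode: int, zones: List[str]
) -> Dict[str, Dict[str, int]]:
    """Split a fleet evenly across zones; reject fleets that do not divide evenly."""
    if not zones:
        raise ValueError("at least one zone is required")
    n = len(zones)
    if total_prefill % n or total_decode % n:
        raise ValueError("fleet does not split evenly across zones")
    return {
        zone: {"prefill": total_prefill // n, "decode": total_decode // n}
        for zone in zones
    }


def ratio_holds(plan: Dict[str, Dict[str, int]], target: Fraction) -> bool:
    """Check that every zone keeps the target prefill-to-decode ratio."""
    return all(
        counts["decode"] > 0
        and Fraction(counts["prefill"], counts["decode"]) == target
        for counts in plan.values()
    )


plan = per_zone_plan(60, 120, ["az-1", "az-2"])
# Each zone gets 30 prefill and 60 decode workers, a 1:2 ratio.
assert ratio_holds(plan, Fraction(1, 2))
```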

In a DynamoGraphDeployment (DGD), express this as separate prefill and decode components per zone. Pin each component to its zone with node affinity, and set kvTransferPolicy.enforcement to required so the router refuses cross-zone decode selection. The DGD author or cluster administrator must ensure each zone has enough schedulable capacity for its pinned replicas. Worker command and args are omitted here; configure each worker for prefill or decode mode as in the base disaggregated serving manifest:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-disagg-zone-capacity
spec:
  experimental:
    kvTransferPolicy:
      labelKey: topology.kubernetes.io/zone
      domain: zone
      enforcement: required
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              env:
                - name: DYN_ROUTER_MODE
                  value: kv
    - name: VllmPrefillWorkerAz1
      type: worker
      replicas: 30
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-1"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorkerAz1
      type: worker
      replicas: 60
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-1"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmPrefillWorkerAz2
      type: worker
      replicas: 30
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-2"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorkerAz2
      type: worker
      replicas: 60
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-2"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              envFrom:
                - secretRef:
                    name: hf-token-secret

Preferred Same-Domain Routing

enforcement: preferred keeps all decode workers eligible but biases worker selection toward the same topology domain.

spec:
  experimental:
    kvTransferPolicy:
      labelKey: topology.kubernetes.io/zone
      domain: zone
      enforcement: preferred
      preferredWeight: 0.85

preferredWeight is required with enforcement: preferred. It must be between 0 and 1. A higher value creates a stronger same-domain preference, but it is not a probability and does not guarantee same-domain selection.

Field Reference

| Field | Required | Description |
| --- | --- | --- |
| labelKey | Yes | Kubernetes node label key to copy onto worker pods, for example topology.kubernetes.io/zone. |
| domain | Yes | Logical topology domain name published by workers, for example zone or rack. Must match ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$. |
| enforcement | No | required or preferred. Defaults to required. |
| preferredWeight | Only with preferred | Bias weight from 0 to 1; only valid with enforcement: preferred. |
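The field rules in the table can be checked before applying a manifest. The following is a sketch of such a pre-flight check, not the operator's admission logic: `validate_kv_transfer_policy` is a hypothetical helper, and treating the 0 and 1 bounds as inclusive is an assumption, since the page only says "from 0 to 1".

```python
import re
from typing import Any, Dict, List

# Pattern copied from the field reference table.
DOMAIN_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def validate_kv_transfer_policy(policy: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations for a kvTransferPolicy mapping."""
    errors: List[str] = []

    if not policy.get("labelKey"):
        errors.append("labelKey is required")

    domain = policy.get("domain")
    if not isinstance(domain, str) or not DOMAIN_RE.fullmatch(domain):
        errors.append("domain is required and must match " + DOMAIN_RE.pattern)

    # enforcement defaults to required when omitted.
    enforcement = policy.get("enforcement", "required")
    if enforcement not in ("required", "preferred"):
        errors.append("enforcement must be required or preferred")

    weight = policy.get("preferredWeight")
    if enforcement == "preferred":
        if weight is None:
            errors.append("preferredWeight is required with enforcement: preferred")
        elif isinstance(weight, bool) or not isinstance(weight, (int, float)):
            errors.append("preferredWeight must be a number")
        elif not 0 <= weight <= 1:  # assumption: bounds are inclusive
            errors.append("preferredWeight must be between 0 and 1")
    elif weight is not None:
        errors.append("preferredWeight is only valid with enforcement: preferred")

    return errors
```

An empty list means the mapping satisfies the rules listed in the table; it says nothing about whether the label actually exists on the nodes.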

The runtime uses domain, not the Kubernetes label key, when creating routing constraints. For example, labelKey: topology.kubernetes.io/zone and domain: zone produce worker topology metadata like:

{
  "topology_domains": {
    "zone": "us-east-1a"
  },
  "kv_transfer_domain": "zone",
  "kv_transfer_enforcement": "required"
}
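To make the mapping from the projected file to this metadata concrete, here is a small sketch that reads the topology value for one domain and assembles a dictionary with the same keys as the example above. How the actual runtime reads and parses the file is not described on this page; `build_topology_metadata` and the whitespace stripping are assumptions.

```python
from pathlib import Path
from typing import Any, Dict


def build_topology_metadata(
    topology_dir: str, domain: str, enforcement: str = "required"
) -> Dict[str, Any]:
    """Read <topology_dir>/<domain> and build metadata shaped like the example above."""
    # Assumption: the file holds the bare label value, possibly with a trailing newline.
    value = (Path(topology_dir) / domain).read_text().strip()
    if not value:
        raise ValueError(f"topology file for domain {domain!r} is empty")
    return {
        "topology_domains": {domain: value},
        "kv_transfer_domain": domain,
        "kv_transfer_enforcement": enforcement,
    }
```

On a worker pod, `topology_dir` would be `/etc/dynamo/topology`. Note that the dictionary is keyed by the domain name (`zone`), not by the Kubernetes label key.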

Verify the Deployment

After the DGD creates worker pods, verify the operator pipeline from node label to runtime topology file.

$ export NAMESPACE=<namespace>
$ export POD=<worker-pod>

$ kubectl get pod "$POD" -n "$NAMESPACE" \
    -o jsonpath='{.metadata.annotations.nvidia\.com/topology-label-key}{"\n"}'

$ kubectl get pod "$POD" -n "$NAMESPACE" \
    -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'

$ kubectl exec "$POD" -n "$NAMESPACE" -- \
    sh -c 'find /etc/dynamo/topology -maxdepth 1 -type f -print -exec cat {} \;'

Expected results:

  • The annotation value is the configured labelKey.
  • The worker pod has the copied topology label.
  • /etc/dynamo/topology/<domain> exists and contains the topology value.

Worker logs should include topology config during startup:

$ kubectl logs "$POD" -n "$NAMESPACE" | grep -i "Topology config"

Troubleshooting

Pod Has No Copied Topology Label

Check whether the node has the configured label:

$ NODE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.nodeName}')
$ kubectl get node "$NODE" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'

If the label is missing, the topology-label controller emits a warning event with reason TopologyLabelMissing and leaves topology metadata unavailable for that worker.

$ kubectl get events -n "$NAMESPACE" \
    --field-selector involvedObject.name="$POD",reason=TopologyLabelMissing

Worker Exits While Waiting for Topology

When topology is enabled, the worker waits for the transfer-domain file under /etc/dynamo/topology to appear and contain data. If the file stays missing or empty, check that:

  • spec.experimental.kvTransferPolicy.domain matches the projected file name.
  • spec.experimental.kvTransferPolicy.labelKey exists on the worker’s node.
  • The worker pod has the nvidia.com/topology-label-key annotation.
  • The topology-label controller is running and has node get RBAC.
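The waiting behavior can be modeled as a poll loop with a deadline. This is a sketch of the general pattern only: the function name, the timeout, and the poll interval are hypothetical values chosen for illustration and are not the worker's actual settings.

```python
import time
from pathlib import Path


def wait_for_topology_value(
    path: str, timeout_s: float = 30.0, poll_s: float = 0.5
) -> str:
    """Poll until the file exists and is non-empty, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    file_path = Path(path)
    while True:
        if file_path.is_file():
            value = file_path.read_text().strip()
            if value:
                return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not appear with data within {timeout_s}s")
        time.sleep(poll_s)
```

Each item in the checklist above corresponds to a reason the file would never be populated, so a loop like this would run until its deadline.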

Required Policy Fails Requests

With enforcement: required, decode routing fails if no decode worker has the same generated topology taint as the selected prefill worker. Verify that both prefill and decode workers publish the same domain, and that each domain where prefill workers can be selected has enough matching decode workers for the expected prefill-to-decode ratio.

Use preferred while validating a heterogeneous rollout if cross-domain routing is acceptable during partial capacity.
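One way to spot the failing case is to group workers by their published topology value and list the values that have prefill workers but no decode workers. The sketch below assumes you have already collected each worker's role and topology metadata into plain dictionaries; the `role` key and the `<missing>` placeholder are hypothetical and are not part of the published metadata.

```python
from collections import Counter
from typing import Any, Dict, List


def domains_without_decode(workers: List[Dict[str, Any]], domain: str) -> List[str]:
    """List topology values that have prefill workers but no decode workers."""
    prefill: Counter = Counter()
    decode: Counter = Counter()
    for worker in workers:
        # Workers that publish no value for the domain are grouped separately.
        value = worker.get("topology_domains", {}).get(domain, "<missing>")
        if worker["role"] == "prefill":
            prefill[value] += 1
        else:
            decode[value] += 1
    return sorted(value for value in prefill if decode[value] == 0)


workers = [
    {"role": "prefill", "topology_domains": {"zone": "az-1"}},
    {"role": "decode", "topology_domains": {"zone": "az-1"}},
    {"role": "prefill", "topology_domains": {"zone": "az-2"}},
]
# az-2 has a prefill worker but no decode worker.
assert domains_without_decode(workers, "zone") == ["az-2"]
```

A non-empty result identifies the domains where a required policy would fail requests. It does not detect domains that have some decode workers but too few for the expected load.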

Relationship to Topology Aware Scheduling

Topology Aware Scheduling controls where Kubernetes places pods. Topology-aware KV transfer controls how Dynamo routes between already-running prefill and decode workers.

Use them together when possible:

  • Topology Aware Scheduling keeps workers placed inside useful topology boundaries.
  • Topology-aware KV transfer prevents the router from choosing a decode worker outside the selected prefill worker’s transfer domain.