Disaggregated Serving

Split prefill and decode into independently scalable worker pools

Disaggregated serving separates the two main phases of LLM inference:

| Phase | What it does | Scaling pressure |
| --- | --- | --- |
| Prefill | Processes the prompt and produces the initial KV cache. | Input length, prompt reuse, context size |
| Decode | Generates output tokens using the KV cache. | Concurrency, output length, active KV memory |

In an aggregated deployment, each worker does both phases. In a disaggregated deployment, prefill workers and decode workers are separate pools. Dynamo routes each request through prefill first, transfers or exposes the KV cache state to decode, and streams the response from the decode worker.
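The request path described above can be sketched as a toy model. This is an illustration only, not Dynamo's implementation: `KVHandle`, `PrefillWorker`, `DecodeWorker`, and `serve` are hypothetical names, and the workers return placeholder tokens instead of running a model.

```python
from dataclasses import dataclass


@dataclass
class KVHandle:
    """Hypothetical stand-in for the KV cache state handed from prefill to decode."""
    request_id: str
    prompt_tokens: int


@dataclass
class PrefillWorker:
    name: str

    def prefill(self, request_id: str, prompt: list[str]) -> KVHandle:
        # A real worker would run the model over the prompt and build the KV cache.
        return KVHandle(request_id=request_id, prompt_tokens=len(prompt))


@dataclass
class DecodeWorker:
    name: str

    def decode(self, kv: KVHandle, max_tokens: int):
        # A real worker would read the transferred KV cache and sample tokens.
        for i in range(max_tokens):
            yield f"tok{i}"


def serve(prompt, prefill_pool, decode_pool, request_id="req-0", max_tokens=3):
    """Route one request: prefill first, hand over KV state, stream from decode."""
    prefill_worker = prefill_pool[0]   # worker selection is simplified to "first in pool"
    decode_worker = decode_pool[0]
    kv = prefill_worker.prefill(request_id, prompt)
    yield from decode_worker.decode(kv, max_tokens)


if __name__ == "__main__":
    for token in serve(["hello", "world"], [PrefillWorker("p0")], [DecodeWorker("d0")]):
        print(token)
```

The point of the sketch is the ordering: the response stream comes from the decode worker, and it cannot start until the KV handle from prefill is available.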

When It Helps

Disaggregated serving is most useful when prefill and decode need different resource shapes:

  • long prompts or retrieval-heavy traffic make prefill expensive
  • long generations or high concurrency make decode the bottleneck
  • you want to scale prefill and decode replicas independently
  • you want to pair prefill/decode separation with KV-aware routing
  • large models need different parallelism for prompt processing and generation

It is not automatically better for every workload. For small models, short prompts, low concurrency, or clusters without fast KV transfer, an aggregated deployment may be simpler and faster.

Mental Model

Disaggregated serving usually combines four pieces:

| Piece | Role |
| --- | --- |
| Frontend/router | Accepts OpenAI-compatible requests and coordinates routing. |
| Prefill workers | Run the prompt phase and prepare KV transfer state. |
| Decode workers | Continue generation after prefill completes. |
| KV transfer path | Moves or exposes KV cache state between prefill and decode workers. |

KV-aware routing is related but separate. Disaggregated serving splits prefill and decode. KV-aware routing chooses workers based on cache locality. Many production deployments use both, but you can reason about them independently.
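To make the distinction concrete, here is a minimal sketch of the cache-locality idea behind KV-aware routing, independent of any prefill/decode split. It is not Dynamo's routing algorithm; `shared_prefix_len`, `pick_worker`, and the `cached_prefixes` structure are hypothetical, and a production router would also weigh load and other signals.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt_tokens, cached_prefixes):
    """Pick the worker whose cached sequences overlap most with the prompt.

    cached_prefixes maps a worker name to the token sequences it is assumed
    to hold in its KV cache. Returns (worker_name, overlap_in_tokens).
    """
    best_worker, best_overlap = None, -1
    for worker, prefixes in cached_prefixes.items():
        overlap = max(
            (shared_prefix_len(prompt_tokens, p) for p in prefixes),
            default=0,
        )
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker, best_overlap


if __name__ == "__main__":
    cache_state = {"w0": [[1, 2, 3]], "w1": [[1, 2, 9, 9]]}
    print(pick_worker([1, 2, 3, 4], cache_state))
```

Nothing in this selection logic depends on whether the chosen worker runs prefill only or both phases, which is why the two features can be reasoned about separately.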

For router-specific behavior, see Router: Disaggregated Serving and KV Cache Aware Routing.

KV Transfer Is the Critical Path

Disaggregation only helps when decode workers can access the KV cache produced by prefill quickly. In cross-node or high-throughput deployments, the KV transfer path commonly depends on RDMA-capable networking through the backend’s transfer layer, such as NIXL/UCX. If RDMA is missing or silently falls back to TCP, TTFT and throughput can be dominated by KV movement rather than model compute.

Treat KV transfer as an early validation step, not a final tuning detail. Common failure modes include:

  • missing RDMA device-plugin resources
  • pods without the needed rdma/ib requests or IPC_LOCK capability
  • UCX/NIXL transport errors
  • mismatched model or KV cache settings between prefill and decode workers
  • benchmarks that run through local port-forwarding instead of inside the cluster

Symptoms usually look like high TTFT despite available prefill capacity, decode workers sitting idle while prefill workers are busy, or disaggregated throughput falling below an aggregated baseline after splitting workers across nodes.

Production cross-node disaggregated deployments usually require RDMA or an equivalent fast fabric for KV cache transfer. Without it, the backend may fall back to TCP and KV transfer can dominate TTFT and throughput. Validate the transfer path before spending time tuning replica counts.
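A back-of-the-envelope estimate shows why the fabric matters. The sketch below uses the standard KV cache size formula (one K and one V tensor per layer) and divides by link bandwidth. The model shape and the two link speeds are illustrative assumptions, not values from this guide, and real transfers reach only a fraction of the nominal line rate.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem):
    """Size of the KV cache for one sequence.

    The factor of 2 accounts for the separate K and V tensors in each layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(num_bytes, gigabits_per_second):
    """Ideal transfer time at the nominal line rate (no protocol overhead)."""
    return num_bytes * 8 / (gigabits_per_second * 1e9)


if __name__ == "__main__":
    # Assumed model shape: 64 layers, 8 KV heads, head dim 128, FP8 KV cache
    # (1 byte per element), 8192-token prompt. Check your model's config.
    size = kv_cache_bytes(
        num_layers=64, num_kv_heads=8, head_dim=128,
        num_tokens=8192, bytes_per_elem=1,
    )
    print(f"KV cache per request: {size / 2**30:.2f} GiB")
    # Hypothetical link speeds for comparison.
    for label, gbps in [("400 Gb/s fabric", 400), ("10 Gb/s TCP path", 10)]:
        print(f"{label}: {transfer_seconds(size, gbps) * 1000:.1f} ms")
```

Under these assumptions the KV cache for a single long prompt is about 1 GiB, so the gap between the two links is the gap between tens of milliseconds and most of a second added to TTFT for every request.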

Deploying Disaggregated with RDMA

Disaggregated deployments transfer KV cache between prefill and decode workers. Without RDMA or another fast transfer path, this movement can become the main performance bottleneck.

Prerequisites for a production cross-node deployment:

  1. RDMA-capable network such as InfiniBand, RoCE, or an equivalent fast fabric.
  2. RDMA device plugin installed on the cluster so worker pods can request rdma/ib resources.
  3. etcd and NATS deployed for Dynamo coordination.

The following example shows the RDMA-relevant fields in a disaggregated vLLM DynamoGraphDeployment. Start from a validated recipe when one exists, then adapt the resource requests, model, image, and parallelism for your cluster.

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeployment
    metadata:
      name: dynamo-disagg
      namespace: your-namespace
    spec:
      backendFramework: vllm
      pvcs:
        - name: model-cache
          create: false
      services:
        Frontend:
          componentType: frontend
          replicas: 1
          volumeMounts:
            - name: model-cache
              mountPoint: /opt/models
          envs:
            - name: HF_HOME
              value: /opt/models
          extraPodSpec:
            mainContainer:
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              imagePullPolicy: IfNotPresent

        VLLMPrefillWorker:
          envFromSecret: hf-token-secret
          componentType: worker
          subComponentType: prefill
          replicas: 2
          resources:
            limits:
              gpu: "2"
          sharedMemory:
            size: 16Gi
          volumeMounts:
            - name: model-cache
              mountPoint: /opt/models
          envs:
            - name: HF_HOME
              value: /opt/models
            - name: UCX_TLS
              value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
            - name: UCX_RNDV_SCHEME
              value: "get_zcopy"
            - name: UCX_RNDV_THRESH
              value: "0"
          extraPodSpec:
            mainContainer:
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              workingDir: /workspace
              imagePullPolicy: IfNotPresent
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              resources:
                limits:
                  rdma/ib: "2"
                requests:
                  rdma/ib: "2"
              command: ["python3", "-m", "dynamo.vllm"]
              args:
                - --model
                - "Qwen/Qwen3-32B-FP8"
                - "--tensor-parallel-size"
                - "2"
                - "--kv-cache-dtype"
                - "fp8"
                - "--max-num-seqs"
                - "1"
                - --disaggregation-mode
                - prefill

        VLLMDecodeWorker:
          envFromSecret: hf-token-secret
          componentType: worker
          subComponentType: decode
          replicas: 1
          resources:
            limits:
              gpu: "4"
          sharedMemory:
            size: 16Gi
          volumeMounts:
            - name: model-cache
              mountPoint: /opt/models
          envs:
            - name: HF_HOME
              value: /opt/models
            - name: UCX_TLS
              value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
            - name: UCX_RNDV_SCHEME
              value: "get_zcopy"
            - name: UCX_RNDV_THRESH
              value: "0"
          extraPodSpec:
            mainContainer:
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              workingDir: /workspace
              imagePullPolicy: IfNotPresent
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              resources:
                limits:
                  rdma/ib: "4"
                requests:
                  rdma/ib: "4"
              command: ["python3", "-m", "dynamo.vllm"]
              args:
                - --model
                - "Qwen/Qwen3-32B-FP8"
                - "--tensor-parallel-size"
                - "4"
                - "--kv-cache-dtype"
                - "fp8"
                - "--max-num-seqs"
                - "1024"
                - --disaggregation-mode
                - decode

Critical RDMA settings:

| Setting | Purpose |
| --- | --- |
| rdma/ib: "N" | Requests RDMA resources for the worker pod. In most disaggregated vLLM deployments, match this to the worker tensor-parallel (TP) size. |
| IPC_LOCK capability | Allows RDMA memory registration and pinned-memory use. |
| UCX_TLS | Enables RDMA-capable UCX transports such as rc_x and dc_x, plus CUDA transports for GPU buffers. |
| UCX_RNDV_SCHEME=get_zcopy | Enables zero-copy RDMA transfers for large KV-cache movement. |
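These settings are easy to get out of sync when adapting a manifest, so a small pre-deploy check can help. The sketch below works on a plain Python dict shaped like one entry under `services` in the manifest (how you load the YAML is up to you). The function names are hypothetical, and the rule that rdma/ib should equal the TP size is this guide's rule of thumb rather than a hard requirement.

```python
def flag_value(args, flag, default=None):
    """Return the token that follows `flag` in an args list, or `default`."""
    if flag in args:
        idx = args.index(flag)
        if idx + 1 < len(args):
            return args[idx + 1]
    return default


def check_rdma_settings(service):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    container = service.get("extraPodSpec", {}).get("mainContainer", {})
    # Treat a missing flag as TP size 1.
    tp_size = str(flag_value(container.get("args", []), "--tensor-parallel-size", "1"))

    resources = container.get("resources", {})
    for section in ("limits", "requests"):
        value = resources.get(section, {}).get("rdma/ib")
        if value is None:
            problems.append(f"resources.{section} has no rdma/ib entry")
        elif str(value) != tp_size:
            problems.append(
                f"resources.{section} rdma/ib={value} does not match TP size {tp_size}"
            )

    added = container.get("securityContext", {}).get("capabilities", {}).get("add", [])
    if "IPC_LOCK" not in added:
        problems.append("IPC_LOCK capability is not added")

    env_names = {env.get("name") for env in service.get("envs", [])}
    for name in ("UCX_TLS", "UCX_RNDV_SCHEME"):
        if name not in env_names:
            problems.append(f"{name} is not set")
    return problems


# Trimmed copy of the prefill worker from the example manifest.
prefill_service = {
    "envs": [
        {"name": "UCX_TLS", "value": "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"},
        {"name": "UCX_RNDV_SCHEME", "value": "get_zcopy"},
    ],
    "extraPodSpec": {
        "mainContainer": {
            "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
            "resources": {
                "limits": {"rdma/ib": "2"},
                "requests": {"rdma/ib": "2"},
            },
            "args": ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "2"],
        }
    },
}

if __name__ == "__main__":
    print(check_rdma_settings(prefill_service) or "no problems found")
```

A check like this only catches manifest mistakes. It cannot tell whether the device plugin is installed or whether RDMA is active at runtime, so the log check that follows is still needed.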

After deployment, check the worker logs for UCX/NIXL initialization:

    $ kubectl logs <prefill-worker-pod> | grep -i "UCX\|NIXL"

Expected output includes:

    NIXL INFO Backend UCX was instantiated

If logs only show TCP transports, RDMA is not active. Check the RDMA device plugin, worker rdma/ib resource requests, security context, and UCX settings. For full transport setup and troubleshooting, see the Disaggregated Communication Guide.
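When checking many pods, the same log inspection can be scripted. This sketch mirrors the grep filter and looks for the expected line quoted above. `summarize_transfer_logs` is a hypothetical helper, and it deliberately does not try to recognize TCP fallback messages, because their exact wording depends on the backend and version.

```python
import re
import sys

# Same filter as the grep command: any line that mentions UCX or NIXL.
TRANSFER_LINE = re.compile(r"UCX|NIXL", re.IGNORECASE)
EXPECTED = "Backend UCX was instantiated"


def summarize_transfer_logs(log_text):
    """Collect UCX/NIXL log lines and report whether the expected line appears."""
    matching = [line for line in log_text.splitlines() if TRANSFER_LINE.search(line)]
    return {
        "matching_lines": matching,
        "ucx_backend_instantiated": any(EXPECTED in line for line in matching),
    }


if __name__ == "__main__":
    # Assumed usage: kubectl logs <prefill-worker-pod> | python3 check_logs.py
    summary = summarize_transfer_logs(sys.stdin.read())
    for line in summary["matching_lines"]:
        print(line)
    if not summary["ucx_backend_instantiated"]:
        sys.exit("UCX backend line not found; inspect the transport setup")
```

Seeing the expected line confirms that the UCX backend loaded, not that RDMA transports were selected, so still read the matching lines for the transports in use.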

Deployment Paths

Choose the path that matches how much control you need:

| Starting point | Use when |
| --- | --- |
| Dynamo Recipes | A recipe matches your model, backend, hardware, and serving mode. Start here for validated baselines and perf.yaml benchmarks. |
| Direct DynamoGraphDeployment (DGD) | You already know the prefill/decode layout, images, parallelism, and KV transfer settings. |
| DGDR (DynamoGraphDeploymentRequest) | You want Dynamo to generate a DGD from model, backend, hardware, workload, and SLA intent. |
| Sizing with AIConfigurator | You want to compare aggregated vs. disaggregated layouts and estimate prefill/decode sizing before deployment. |

Good recipe starting points include:

For the Kubernetes resource model, see the Deployment Overview.

Backend Examples

Each built-in backend has examples that show the concrete worker flags and transfer settings:

| Backend | Examples |
| --- | --- |
| vLLM | Deployment examples, including disagg.yaml, disagg_router.yaml, and disagg_planner.yaml |
| TensorRT-LLM | Deployment examples, including disaggregated, router, and planner variants |
| SGLang | Deployment examples, including NIXL-based disaggregated serving |

Operational Notes

Disaggregated deployments add a data-movement path between workers. Before moving to production, verify:

  • KV transfer backend and network fabric are configured for your backend
  • RDMA resources, UCX/NIXL settings, and security context are active when your deployment depends on RDMA
  • prefill and decode workers agree on model, dtype, block size, and KV layout
  • pods have the required GPU, shared memory, and network resources
  • frontend/router flags match your routing strategy
  • benchmarks run inside the cluster, not through local port-forwarding, when validating high-load performance

Use Dynamo Benchmarking to compare aggregated and disaggregated configurations with the same workload.
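The checklist item on prefill/decode agreement can be partly automated by comparing the worker argument lists. The sketch below is one way to do it, assuming vLLM-style `--flag value` arguments. `parse_flags` and `mismatched_flags` are hypothetical helpers, and the `MUST_MATCH` list is an assumption to extend for your backend.

```python
def parse_flags(args):
    """Turn ['--flag', 'value', '--switch'] into {'--flag': 'value', '--switch': True}."""
    flags = {}
    i = 0
    while i < len(args):
        token = args[i]
        if token.startswith("--"):
            has_value = i + 1 < len(args) and not args[i + 1].startswith("--")
            flags[token] = args[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1
    return flags


# Flags assumed to require identical values on both pools.
MUST_MATCH = ("--model", "--kv-cache-dtype", "--block-size")


def mismatched_flags(prefill_args, decode_args, keys=MUST_MATCH):
    """Return {flag: (prefill_value, decode_value)} for every flag that differs."""
    prefill, decode = parse_flags(prefill_args), parse_flags(decode_args)
    return {
        key: (prefill.get(key), decode.get(key))
        for key in keys
        if prefill.get(key) != decode.get(key)
    }


if __name__ == "__main__":
    prefill_args = ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "2",
                    "--kv-cache-dtype", "fp8", "--disaggregation-mode", "prefill"]
    decode_args = ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "4",
                   "--kv-cache-dtype", "fp8", "--disaggregation-mode", "decode"]
    print(mismatched_flags(prefill_args, decode_args) or "flags agree")
```

Flags such as `--tensor-parallel-size` and `--max-num-seqs` are left out of the comparison on purpose, since sizing the two pools differently is the reason to disaggregate in the first place.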

Next Steps

  1. Start from a matching Dynamo Recipe when available.
  2. Read the backend-specific deployment example for your engine.
  3. Use Sizing with AIConfigurator or DGDR when you need help choosing prefill/decode sizing.
  4. Validate the result with Dynamo Benchmarking.