Disaggregated Serving

Disaggregated serving separates the two main phases of LLM inference:

Phase	What it does	Scaling pressure
Prefill	Processes the prompt and produces the initial KV cache.	Input length, prompt reuse, context size
Decode	Generates output tokens using the KV cache.	Concurrency, output length, active KV memory

In an aggregated deployment, each worker does both phases. In a disaggregated deployment, prefill workers and decode workers are separate pools. Dynamo routes each request through prefill first, transfers or exposes the KV cache state to decode, and streams the response from the decode worker.
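The request flow described above can be sketched as a toy simulation. This is a minimal illustration, not Dynamo's implementation: `ToyRouter`, `PrefillWorker`, `DecodeWorker`, `KVHandle`, and the round-robin policy are all hypothetical names and choices made for this example, and no model is actually run.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVHandle:
    """Stand-in for the KV cache state that prefill hands to decode."""
    request_id: str
    prompt_tokens: int
    produced_by: str


@dataclass
class PrefillWorker:
    name: str

    def prefill(self, request_id, prompt):
        # A real worker runs the model over the prompt; here we only count
        # whitespace-separated words as a rough proxy for tokens.
        return KVHandle(request_id, len(prompt.split()), self.name)


@dataclass
class DecodeWorker:
    name: str

    def decode(self, kv, max_tokens):
        # A real worker generates tokens from the KV cache; this placeholder
        # yields labelled dummy tokens so the streaming shape is visible.
        for i in range(max_tokens):
            yield f"<tok{i}@{self.name}>"


class ToyRouter:
    """Round-robin over two separate pools (hypothetical policy)."""

    def __init__(self, prefill_pool, decode_pool):
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)

    def handle(self, request_id, prompt, max_tokens):
        kv = next(self._prefill).prefill(request_id, prompt)  # 1. prefill
        decoder = next(self._decode)                          # 2. hand KV state to decode
        return kv, decoder.decode(kv, max_tokens)             # 3. stream from decode


router = ToyRouter(
    [PrefillWorker("prefill-0"), PrefillWorker("prefill-1")],
    [DecodeWorker("decode-0")],
)
kv, stream = router.handle("req-1", "Explain disaggregated serving briefly", 3)
tokens = list(stream)
print(kv.produced_by, kv.prompt_tokens, tokens)
```

The point of the sketch is the ordering: the response stream comes from the decode worker, and it cannot start until the prefill worker has produced the KV state and handed it over.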

When It Helps

Disaggregated serving is most useful when prefill and decode need different resource shapes:

long prompts or retrieval-heavy traffic make prefill expensive
long generations or high concurrency make decode the bottleneck
you want to scale prefill and decode replicas independently
you want to pair prefill/decode separation with KV-aware routing
large models need different parallelism for prompt processing and generation

It is not automatically better for every workload. For small models, short prompts, low concurrency, or clusters without fast KV transfer, an aggregated deployment may be simpler and faster.

Mental Model

Disaggregated serving usually combines four pieces:

Piece	Role
Frontend/router	Accepts OpenAI-compatible requests and coordinates routing.
Prefill workers	Run the prompt phase and prepare KV transfer state.
Decode workers	Continue generation after prefill completes.
KV transfer path	Moves or exposes KV cache state between prefill and decode workers.

KV-aware routing is related but separate. Disaggregated serving splits prefill and decode. KV-aware routing chooses workers based on cache locality. Many production deployments use both, but you can reason about them independently.

For router-specific behavior, see Router: Disaggregated Serving and KV Cache Aware Routing.
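To make the distinction concrete, the following sketch shows one simple way to choose a worker by cache locality: pick the worker whose cached token prefixes overlap the incoming prompt the most. It is an assumption-laden illustration, not the Dynamo router's algorithm; the function names, the `cached_prefixes` bookkeeping, and the tie-breaking rule are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_by_cache_locality(prompt_tokens, cached_prefixes):
    """Return the worker whose cached prefixes overlap the prompt the most.

    cached_prefixes maps a worker name to a list of token sequences that the
    worker is assumed to hold in its KV cache (hypothetical bookkeeping).
    """
    def best_overlap(worker):
        return max(
            (shared_prefix_len(prompt_tokens, p) for p in cached_prefixes[worker]),
            default=0,
        )

    # Ties resolve to the first worker in sorted-name order.
    return max(sorted(cached_prefixes), key=best_overlap)


cached = {
    "worker-a": [[1, 2, 3]],
    "worker-b": [[1, 2, 3, 4, 5], [9]],
    "worker-c": [],
}
print(pick_by_cache_locality([1, 2, 3, 4, 7], cached))  # worker-b (overlap of 4 tokens)
```

Nothing in this selection logic depends on whether prefill and decode run in separate pools, which is why the two features can be reasoned about independently.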

KV Transfer Is the Critical Path

Disaggregation only helps when decode workers can access the KV cache produced by prefill quickly. In cross-node or high-throughput deployments, the KV transfer path commonly depends on RDMA-capable networking through the backend’s transfer layer, such as NIXL/UCX. If RDMA is missing or silently falls back to TCP, time to first token (TTFT) and throughput can be dominated by KV movement rather than model compute.

Treat KV transfer as an early validation step, not a final tuning detail. Common failure modes include missing RDMA device-plugin resources, pods without the needed rdma/ib requests or IPC_LOCK capability, UCX/NIXL transport errors, mismatched model or KV cache settings between prefill and decode workers, and benchmarks that run through local port-forwarding instead of inside the cluster.

Symptoms usually look like high TTFT despite available prefill capacity, decode workers sitting idle while prefill workers are busy, or disaggregated throughput falling below an aggregated baseline after splitting workers across nodes.

Validate the transfer path before spending time tuning replica counts. Production cross-node disaggregated deployments usually require RDMA or an equivalent fast fabric for KV cache transfer, because a TCP fallback lets KV movement dominate TTFT and throughput.
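A back-of-the-envelope estimate shows why the transfer path matters. The sketch below computes the KV cache size for one request and the ideal time to move it at two link speeds. All numbers are illustrative assumptions rather than measurements or values read from a specific model config: 64 layers, 8 KV heads, head dimension 128, an 8192-token prompt, a 1-byte (fp8) KV element, and links running at full line rate.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_element):
    # Factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * tokens


def transfer_seconds(num_bytes, link_gbit_per_s):
    # Ideal line-rate estimate: ignores protocol overhead, latency, and contention.
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


size = kv_cache_bytes(
    tokens=8192, num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_element=1
)
print(f"KV cache for one request: {size / 2**30:.2f} GiB")  # 1.00 GiB
for label, gbit in [("400 Gb/s fabric", 400), ("10 Gb/s path", 10)]:
    # Prints 21.5 ms for 400 Gb/s and 859.0 ms for 10 Gb/s.
    print(f"{label}: {transfer_seconds(size, gbit) * 1e3:.1f} ms")
```

Under these assumptions the slower path adds most of a second to every long-prompt request before decode can start, which is the kind of gap that shows up directly in TTFT.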

Deploying Disaggregated with RDMA

Disaggregated deployments transfer KV cache between prefill and decode workers. Without RDMA or another fast transfer path, this movement can become the main performance bottleneck.

Prerequisites for a production cross-node deployment:

RDMA-capable network such as InfiniBand, RoCE, or an equivalent fast fabric.
RDMA device plugin installed on the cluster so worker pods can request rdma/ib resources.
etcd and NATS deployed for Dynamo coordination.

The following example shows the RDMA-relevant fields in a disaggregated vLLM DynamoGraphDeployment. Start from a validated recipe when one exists, then adapt the resource requests, model, image, and parallelism for your cluster.

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-disagg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          imagePullPolicy: IfNotPresent

    VLLMPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 2
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          resources:
            limits:
              rdma/ib: "2"
            requests:
              rdma/ib: "2"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "2"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1"
            - --disaggregation-mode
            - prefill

    VLLMDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "4"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          resources:
            limits:
              rdma/ib: "4"
            requests:
              rdma/ib: "4"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "4"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1024"
            - --disaggregation-mode
            - decode

Critical RDMA settings:

Setting	Purpose
`rdma/ib: "N"`	Requests RDMA resources for the worker pod. In most disaggregated vLLM deployments, match this to the worker TP size.
`IPC_LOCK` capability	Allows RDMA memory registration and pinned-memory use.
`UCX_TLS`	Enables RDMA-capable UCX transports such as `rc_x` and `dc_x`, plus CUDA transports for GPU buffers.
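`UCX_RNDV_SCHEME=get_zcopy`	Selects the zero-copy GET rendezvous scheme, so large KV-cache transfers avoid intermediate copies.
`UCX_RNDV_THRESH=0`	Sets the rendezvous threshold to 0 bytes so that transfers of every size use the rendezvous protocol.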
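The settings in this table lend themselves to a quick pre-deployment check. The sketch below inspects one worker entry, represented as a plain Python dict with the same field names as the example manifest, and reports anything that deviates from the table. The function name and the checks themselves are assumptions made for illustration. In particular, matching `rdma/ib` to the tensor-parallel size is the guideline stated above for most deployments, not a hard rule.

```python
RDMA_TRANSPORTS = ("rc_x", "rc", "dc_x", "dc")


def check_rdma_settings(worker):
    """Return human-readable problems found in one worker's spec (empty if none)."""
    problems = []
    container = worker.get("extraPodSpec", {}).get("mainContainer", {})

    # Tensor-parallel size is the value that follows the flag in args (default 1).
    args = [str(a) for a in container.get("args", [])]
    tp = "1"
    if "--tensor-parallel-size" in args:
        idx = args.index("--tensor-parallel-size")
        if idx + 1 < len(args):
            tp = args[idx + 1]

    resources = container.get("resources", {})
    for section in ("requests", "limits"):
        value = resources.get(section, {}).get("rdma/ib")
        if value is None:
            problems.append(f"{section} has no rdma/ib entry")
        elif str(value) != tp:
            problems.append(f"{section} rdma/ib={value} differs from TP size {tp}")

    caps = container.get("securityContext", {}).get("capabilities", {}).get("add", [])
    if "IPC_LOCK" not in caps:
        problems.append("IPC_LOCK capability is not added")

    envs = {e["name"]: e.get("value", "") for e in worker.get("envs", [])}
    transports = envs.get("UCX_TLS", "").split(",")
    if not any(t in RDMA_TRANSPORTS for t in transports):
        problems.append("UCX_TLS lists no RDMA transport (rc_x, rc, dc_x, dc)")
    if envs.get("UCX_RNDV_SCHEME") != "get_zcopy":
        problems.append("UCX_RNDV_SCHEME is not get_zcopy")
    return problems


# Worker entry mirroring the decode worker from the example manifest.
decode_worker = {
    "envs": [
        {"name": "UCX_TLS", "value": "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"},
        {"name": "UCX_RNDV_SCHEME", "value": "get_zcopy"},
    ],
    "extraPodSpec": {"mainContainer": {
        "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
        "resources": {"limits": {"rdma/ib": "4"}, "requests": {"rdma/ib": "4"}},
        "args": ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "4"],
    }},
}
print(check_rdma_settings(decode_worker))  # []
```

A check like this only validates the manifest. It cannot confirm that the RDMA device plugin is installed or that UCX selected an RDMA transport at runtime, so the log inspection that follows is still needed.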

After deployment, check the worker logs for UCX/NIXL initialization:

$ kubectl logs <prefill-worker-pod> -n your-namespace | grep -iE "UCX|NIXL"

Expected output includes:

NIXL INFO Backend UCX was instantiated

If logs only show TCP transports, RDMA is not active. Check the RDMA device plugin, worker rdma/ib resource requests, security context, and UCX settings. For full transport setup and troubleshooting, see the Disaggregated Communication Guide.
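When the same check needs to run in a script or CI job, the log inspection can be automated. The sketch below filters log text for UCX or NIXL lines, as the grep command does, and looks for the expected initialization message quoted above. The function names are hypothetical and the first sample line is made up. Finding the message shows only that the UCX backend was instantiated, not which transports UCX selected.

```python
import re

EXPECTED_LINE = "Backend UCX was instantiated"  # from the expected output above


def transfer_log_lines(log_text):
    """Keep only lines mentioning UCX or NIXL, like the grep filter does."""
    pattern = re.compile(r"UCX|NIXL", re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


def ucx_backend_instantiated(log_text):
    """True if any line reports that the NIXL UCX backend was instantiated."""
    return any(EXPECTED_LINE in line for line in transfer_log_lines(log_text))


sample = "\n".join([
    "worker starting (hypothetical log line)",
    "NIXL INFO Backend UCX was instantiated",
])
print(ucx_backend_instantiated(sample))                                     # True
print(ucx_backend_instantiated("worker starting (hypothetical log line)"))  # False
```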

Deployment Paths

Choose the path that matches how much control you need:

Starting point	Use when
Dynamo Recipes	A recipe matches your model, backend, hardware, and serving mode. Start here for validated baselines and `perf.yaml` benchmarks.
Direct `DynamoGraphDeployment`	You already know the prefill/decode layout, images, parallelism, and KV transfer settings.
DGDR	You want Dynamo to generate a DGD from model, backend, hardware, workload, and SLA intent.
Sizing with AIConfigurator	You want to compare aggregated vs. disaggregated layouts and estimate prefill/decode sizing before deployment.

For the Kubernetes resource model, see the Deployment Overview.

Backend Examples

Each built-in backend has examples that show the concrete worker flags and transfer settings:

Backend	Examples
vLLM	Deployment examples, including `disagg.yaml`, `disagg_router.yaml`, and `disagg_planner.yaml`
TensorRT-LLM	Deployment examples, including disaggregated, router, and planner variants
SGLang	Deployment examples, including NIXL-based disaggregated serving

Operational Notes

Disaggregated deployments add a data-movement path between workers. Before moving to production, verify:

KV transfer backend and network fabric are configured for your backend
RDMA resources, UCX/NIXL settings, and security context are active when your deployment depends on RDMA
prefill and decode workers agree on model, dtype, block size, and KV layout
pods have the required GPU, shared memory, and network resources
frontend/router flags match your routing strategy
benchmarks run inside the cluster, not through local port-forwarding, when validating high-load performance
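The agreement check between prefill and decode workers can be partly automated by comparing their launch arguments. The sketch below is a minimal illustration under stated assumptions: the helper names are hypothetical, and the `MUST_MATCH` list is one reasonable choice of vLLM flags, not an exhaustive or authoritative set. Flags such as `--tensor-parallel-size` and `--max-num-seqs` are left out because the example manifest sets them differently per role on purpose.

```python
MUST_MATCH = ("--model", "--kv-cache-dtype", "--block-size")


def flag_values(args):
    """Map each '--flag' to the value that follows it, or None if there is none."""
    args = [str(a) for a in args]
    values = {}
    for i, arg in enumerate(args):
        if arg.startswith("--"):
            nxt = args[i + 1] if i + 1 < len(args) else None
            values[arg] = None if nxt is None or nxt.startswith("--") else nxt
    return values


def mismatches(prefill_args, decode_args, flags=MUST_MATCH):
    """Return {flag: (prefill_value, decode_value)} for flags that disagree."""
    p, d = flag_values(prefill_args), flag_values(decode_args)
    return {f: (p.get(f), d.get(f)) for f in flags if p.get(f) != d.get(f)}


prefill_args = ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "2",
                "--kv-cache-dtype", "fp8", "--disaggregation-mode", "prefill"]
decode_args = ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "4",
               "--kv-cache-dtype", "fp8", "--disaggregation-mode", "decode"]
print(mismatches(prefill_args, decode_args))  # {}

# A decode worker left on a different KV cache dtype is reported.
drifted = [a if a != "fp8" else "auto" for a in decode_args]
print(mismatches(prefill_args, drifted))  # {'--kv-cache-dtype': ('fp8', 'auto')}
```

A flag that is absent on both sides compares equal here, so the check catches drift between the two roles but does not verify that either side uses a sensible default.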

Use Dynamo Benchmarking to compare aggregated and disaggregated configurations with the same workload.

Next Steps

Start from a matching Dynamo Recipe when available.
Read the backend-specific deployment example for your engine.
Use Sizing with AIConfigurator or DGDR when you need help choosing prefill/decode sizing.
Validate the result with Dynamo Benchmarking.