Disaggregated Inference Communication Guide
Best practices for prefill/decode worker communication on Kubernetes
This guide explains how prefill and decode workers communicate in Dynamo’s disaggregated inference architecture on Kubernetes. It answers the frequently asked question: Why can’t prefill and decode workers use NVLink to communicate on the same node?
NVLink is a direct GPU-to-GPU interconnect that operates at the hardware level. It requires:
Both GPUs in the same process so cudaDeviceEnablePeerAccess() can be called
Kubernetes pods violate all three requirements:
Process Isolation: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B’s memory space.
GPU Partitioning: The Kubernetes device plugin assigns specific GPUs to each pod via CUDA_VISIBLE_DEVICES. Pod A’s GPU 0 and Pod B’s GPU 0 are physically different devices.
Process/Namespace Isolation: Each pod runs in a separate process namespace. NVLink peer-to-peer transfers require both GPUs to be within the same process so cudaDeviceEnablePeerAccess() can be called.
Memory Registration: NVLink transfers use cudaMemcpy with peer access enabled. This requires calling cudaDeviceEnablePeerAccess(), which is impossible across process boundaries.
NVLink works within a pod for parallelism strategies (TP, EP) where all GPUs are in the same process:
When prefill and decode workers are on the same physical node:
Options (best to worst):
Best Practice: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.
When prefill and decode workers are on different nodes:
Requirements for optimal cross-node performance:
UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.
Excluding transports: Use ^ prefix to exclude (e.g., UCX_TLS=^mm excludes memory mapping).
Note: When specifying UCX_TLS explicitly with GPU memory, you must include cuda_copy or cuda_ipc for UCX to recognize GPU buffers.
Recommendation: Use get_zcopy with threshold 0 for KV cache transfers (always large).
⚠️ AWS EFA Exception: Do NOT use get_zcopy on AWS with Ubuntu 24.04 + Kernel ≥6.8. See AWS EFA Configuration for required settings.
Note: UCX statistics (UCX_STATS_DEST, UCX_STATS_TRIGGER) require a UCX build configured with the --enable-stats flag, which default builds do not enable.
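The notes above can be sketched as a small helper that builds the UCX entries for a worker container's env list. This is a minimal sketch, not Dynamo's own tooling: the function name is hypothetical, and mapping "get_zcopy with threshold 0" onto UCX_RNDV_SCHEME and UCX_RNDV_THRESH is an assumption based on standard UCX configuration names. The transport list in the usage line is only an example.

```python
def build_ucx_env(transports, rndv_scheme="get_zcopy", rndv_thresh="0"):
    """Return Kubernetes-style env entries for UCX (hypothetical helper).

    transports: explicit UCX_TLS entries, e.g. ["rc", "cuda_copy", "cuda_ipc"].
    """
    # With an explicit UCX_TLS list, UCX needs cuda_copy or cuda_ipc
    # to recognize GPU buffers, so reject lists that contain neither.
    if not any(t in ("cuda_copy", "cuda_ipc") for t in transports):
        raise ValueError("UCX_TLS must include cuda_copy or cuda_ipc for GPU memory")
    return [
        {"name": "UCX_TLS", "value": ",".join(transports)},
        # Assumed mapping of the rendezvous recommendation onto UCX variables.
        {"name": "UCX_RNDV_SCHEME", "value": rndv_scheme},
        {"name": "UCX_RNDV_THRESH", "value": rndv_thresh},
    ]


if __name__ == "__main__":
    for entry in build_ucx_env(["rc", "cuda_copy", "cuda_ipc"]):
        print(f'{entry["name"]}={entry["value"]}')
```

Apply the same entries to both prefill and decode worker pods. Do not use the get_zcopy default on the AWS EFA configuration called out above.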
For clusters with InfiniBand RDMA (e.g., ConnectX NICs), use UCX with the rc (Reliable Connection) transport. This is the standard path for on-premises and bare-metal Kubernetes clusters.
RDMA Resources:
Request one rdma/ib device per GPU. The RDMA device plugin injects /dev/infiniband/* devices automatically:
No pod annotations are needed. InfiniBand devices are injected by the device plugin.
Security Context:
Add the IPC_LOCK and SYS_RESOURCE capabilities. IPC_LOCK allows RDMA memory pinning; SYS_RESOURCE allows raising the memlock limit:
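As an illustration of the two settings above, the following sketch builds the resources and securityContext portions of a worker container spec as a Python dictionary. The function name is hypothetical, and nvidia.com/gpu is assumed to be the GPU resource name exposed by your device plugin; rdma/ib and the two capabilities come from this guide.

```python
import json


def ib_container_fragment(gpu_count):
    """Build resources and securityContext for an InfiniBand worker container."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    quantities = {
        "nvidia.com/gpu": str(gpu_count),  # assumed GPU resource name
        "rdma/ib": str(gpu_count),         # one RDMA device per GPU
    }
    return {
        # Extended resources need matching requests and limits.
        "resources": {"requests": dict(quantities), "limits": dict(quantities)},
        "securityContext": {
            "capabilities": {"add": ["IPC_LOCK", "SYS_RESOURCE"]},
        },
    }


if __name__ == "__main__":
    print(json.dumps(ib_container_fragment(8), indent=2))
```

The printed JSON can be merged into a container entry of the pod spec; the GPU count of 8 is only an example.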
Environment Variables (worker containers):
Note: UCX_IB_ADDR_TYPE=eth is the most common missing setting when bringing up NIXL disagg on InfiniBand clusters. If NIXL init succeeds but transfers fail with NIXL_ERR_REMOTE_DISCONNECT, this is likely the cause.
Known Issue — Bonded IB devices:
Some clusters expose bonded InfiniBand devices (e.g., mlx5_bond_0) with LID=0. If UCX selects a bonded device, transfers may fail. Verify device LIDs and select a non-bonded device:
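One way to automate the LID check is to parse device listing output and keep only ports with a non-zero LID. The sketch below assumes text in the style of ibv_devinfo, with hca_id, port, and port_lid fields; the SAMPLE text is an abbreviated, made-up illustration rather than captured output, and both function names are hypothetical.

```python
import re

# Illustrative, abbreviated listing (not real captured output).
SAMPLE = """\
hca_id: mlx5_bond_0
    port:   1
        state:      PORT_ACTIVE (4)
        port_lid:   0
hca_id: mlx5_0
    port:   1
        state:      PORT_ACTIVE (4)
        port_lid:   12
"""


def devices_with_valid_lid(devinfo_text):
    """Return (device, port) pairs whose port_lid is non-zero."""
    valid = []
    device = port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        if m := re.match(r"hca_id:\s*(\S+)", line):
            device, port = m.group(1), None
        elif m := re.match(r"port:\s*(\d+)", line):
            port = int(m.group(1))
        elif m := re.match(r"port_lid:\s*(\d+)", line):
            if device is not None and port is not None and int(m.group(1)) != 0:
                valid.append((device, port))
    return valid


def ucx_net_devices(devinfo_text):
    """Format the valid ports as a UCX_NET_DEVICES value (device:port list)."""
    return ",".join(f"{d}:{p}" for d, p in devices_with_valid_lid(devinfo_text))


if __name__ == "__main__":
    print(ucx_net_devices(SAMPLE))  # prints "mlx5_0:1"
```

On a real node, replace SAMPLE with the output captured inside the worker container and compare the result against the devices you expect before setting UCX_NET_DEVICES.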
NIXL supports libfabric as the backend for AWS EFA deployments. This is the recommended approach for disaggregated inference on AWS, achieving ~9.6 GB/s KV transfer bandwidth. See the AWS EFA with NIXL documentation for complete setup instructions.
Requirements:
/opt/amazon/efa
nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1-efa-amd64
Kernel Compatibility:
GDRCopy v2.5.1 has a build failure on kernel 6.15+ due to a vm_flags_set redefinition. Pin your Ubuntu EKS AMI to kernel 6.14 or earlier until GDRCopy v2.5.2 is available in GPU Operator.
Pod Anti-Affinity (Required):
EFA is designed for cross-node communication. Prefill and decode workers must be scheduled on different nodes to avoid EAGAIN errors during KV transfer.
Note: Anti-affinity only needs to be configured on one side (here, the decode worker). The Kubernetes scheduler enforces the constraint symmetrically: if decode cannot be placed with prefill, the two end up on different nodes regardless of which pod carries the rule.
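A required anti-affinity rule of this kind can be sketched as a Python dictionary that mirrors the Kubernetes affinity schema. The function name is hypothetical, and the label value "prefill" is an assumption about how the prefill pods are labeled; the label key nvidia.com/dynamo-component is the one this guide recommends, and kubernetes.io/hostname is the standard per-node topology key.

```python
import json


def anti_affinity(avoid_component, label_key="nvidia.com/dynamo-component"):
    """Build an affinity block that keeps a pod off nodes running avoid_component."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [avoid_component],
                            }
                        ]
                    },
                    # One topology domain per node.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


if __name__ == "__main__":
    # Attach to the decode worker so it avoids nodes hosting prefill pods.
    print(json.dumps(anti_affinity("prefill"), indent=2))
```

Check the actual label values on your pods before relying on the "prefill" value used here.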
EFA Resource Requests:
Request EFA interfaces in your pod spec. The p5.48xlarge instance has 32 EFA interfaces (32 network cards × 1 interface each) with 3200 Gbps total bandwidth. The number of interfaces to allocate per worker depends on your deployment:
Example with 4 EFA interfaces (validated configuration):
Note: NIXL/libfabric automatically stripes traffic across all allocated EFA interfaces. The 4-interface configuration achieved ~9.6 GB/s in testing, which is sufficient for Llama-3.1-8B KV cache transfers at ISL=8000. Increase the count if your workload requires higher bandwidth (e.g., larger models or higher TP).
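To put the interface count in context, the following sketch derives the nominal line rate for a given number of EFA interfaces from the p5.48xlarge figures above, and builds the matching resource request. It assumes the 3200 Gbps total is split evenly across the 32 interfaces; the function names are hypothetical. The result is a nominal ceiling, not a measured throughput.

```python
TOTAL_GBPS = 3200        # p5.48xlarge aggregate EFA bandwidth
TOTAL_INTERFACES = 32    # p5.48xlarge EFA interface count


def efa_line_rate_gbytes(interfaces):
    """Nominal line rate in GB/s, assuming an even split across interfaces."""
    per_interface_gbps = TOTAL_GBPS / TOTAL_INTERFACES  # 100 Gbps each
    return interfaces * per_interface_gbps / 8          # bits to bytes


def efa_resources(interfaces):
    """Resource request for the EFA device plugin resource named in this guide."""
    quantities = {"vpc.amazonaws.com/efa": str(interfaces)}
    return {"requests": dict(quantities), "limits": dict(quantities)}


if __name__ == "__main__":
    print(efa_line_rate_gbytes(4))  # prints 50.0
    print(efa_resources(4))
```

Four interfaces therefore have a nominal ceiling of 50 GB/s, well above the ~9.6 GB/s measured in testing, so the measured figure reflects the end-to-end transfer path rather than the raw link capacity.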
Environment Variables:
vLLM Configuration:
Verification:
Expected Log Output:
RDMA Device Plugin: Exposes rdma/ib or vpc.amazonaws.com/efa resources to Kubernetes
RDMA Network: One of:
GPUDirect RDMA (optional but recommended):
nvidia-peermem kernel module loaded (InfiniBand/RoCE)
Expected output shows InfiniBand or RoCE devices:
Look for GPU memory support:
If you only see host: GPUDirect RDMA is not working. KV transfers will use host staging.
Expected bandwidth:
Deploy the NIXL benchmark to validate end-to-end KV transfer performance:
This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.
Good output:
Bad output (RDMA not working):
Check Grafana dashboards for:
Red flags indicating RDMA issues:
Unsupported operation errors in logs
Note: For AWS EFA deployments, use libfabric with GDRCopy to enable GPUDirect RDMA. UCX on AWS EFA does not support GPUDirect on kernel ≥6.8 and results in severely degraded performance. See AWS EFA Configuration for setup instructions.
Use disaggregated architecture when:
Use aggregated architecture when:
The KV transfer overhead is amortized across output tokens. Measured data from Llama-3.1-8B-Instruct on AWS p5.48xlarge with NIXL+libfabric:
Key Insight: The KV transfer overhead via libfabric+EFA is only ~37ms. Combined with 41% faster decode (ITL), disaggregated inference delivers 22% higher throughput for prefill-bound workloads.
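The amortization argument is simple arithmetic: a fixed transfer cost is spread over every output token of the request. The sketch below uses the ~37ms figure above and assumes, for illustration only, an output length of 50 tokens; the function name is hypothetical.

```python
def amortized_overhead_ms(transfer_ms, output_tokens):
    """Per-token share of a one-time KV transfer cost."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return transfer_ms / output_tokens


if __name__ == "__main__":
    # 37 ms spread over 50 output tokens is 0.74 ms per token.
    print(round(amortized_overhead_ms(37, 50), 2))  # prints 0.74
```

Longer outputs shrink the per-token share further, which is why the transfer cost matters less as output length grows.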
Disagg advantage scales with input length (ISL) (all at OSL=50, concurrency=10):
Symptoms: TTFT degrades from expected 200-500ms to 10+ seconds
Root Cause: RDMA not active, falling back to TCP
Diagnosis:
Solutions:
Add rdma/ib resource requests to the pod spec
Add the IPC_LOCK capability
Symptoms: Logs show Unexpected UCX error: Unsupported operation
Root Cause: UCX attempting GPU RDMA on hardware that doesn’t support it
Solutions:
ucx_info -d | grep cuda
Set UCX_RNDV_THRESH=inf to disable GPU RDMA
Verify that the nvidia-peermem module is loaded
Symptoms: 3x performance degradation on AWS despite EFA configured
Root Cause: GPU Direct RDMA not functional on kernel ≥6.8 with EFA when using UCX
Solution: Use libfabric instead of UCX for AWS EFA deployments. Libfabric with GDRCopy provides efficient GPU Direct RDMA operations on AWS. See the AWS EFA Configuration section for setup instructions.
Alternative options (if libfabric is not available):
Symptoms: Decode worker logs show repeated EAGAIN errors:
Root Cause: Prefill and decode workers are scheduled on the same node. AWS EFA is designed for cross-node communication and does not function correctly for intra-node transfers.
Diagnosis:
If both prefill and decode workers show the same NODE, this is the problem.
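The same check can be scripted once pod placements have been collected. This is a minimal sketch that assumes you already have (pod name, component, node) tuples, for example gathered from a wide pod listing; the tuple layout, the component names "prefill" and "decode", and the function name are all assumptions for illustration.

```python
from collections import defaultdict


def colocated_nodes(pods):
    """Return nodes that host both a prefill and a decode worker.

    pods: iterable of (pod_name, component, node) tuples.
    """
    components_by_node = defaultdict(set)
    for _pod_name, component, node in pods:
        components_by_node[node].add(component)
    return sorted(
        node
        for node, components in components_by_node.items()
        if {"prefill", "decode"} <= components
    )


if __name__ == "__main__":
    placements = [
        ("prefill-0", "prefill", "node-a"),
        ("decode-0", "decode", "node-a"),
        ("decode-1", "decode", "node-b"),
    ]
    print(colocated_nodes(placements))  # prints ['node-a']
```

Any node in the returned list hosts both worker types and is a candidate source of the EAGAIN errors.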
Solution: Add pod anti-affinity rules to ensure workers are scheduled on different nodes:
Note: Use nvidia.com/dynamo-component as the label key, not app.kubernetes.io/component. The Dynamo operator uses this label to identify component types.
Symptoms: NIXL backend creation fails immediately with NIXL_ERR_BACKEND. UCX logs show:
Or:
Root Causes:
Bonded IB device with LID=0: UCX selects mlx5_bond_0 by default, but bonded devices may have LID=0 (invalid for UD transport). Fix: set UCX_NET_DEVICES to a non-bonded device with a valid LID.
UCX/OFED version mismatch: The container’s UCX mlx5 library may be compiled against a different devx ABI than the host kernel driver. Any transport using IB (rc, cuda_ipc with IB) triggers the devx crash.
Missing RDMA device injection: If rdma/ib is not requested in the pod spec, no IB devices are injected into the container.
Diagnosis:
Solutions:
Request rdma/ib resources (1 per GPU) in the pod spec
Set UCX_NET_DEVICES to a non-bonded device if mlx5_bond_0 has LID=0
Symptoms: Sporadic getXferStatus: backend 'UCX' returned error status
Diagnosis:
Common causes:
rdma/ib resources visible: kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'
kubectl logs <pod> | grep "Backend"
For UCX deployments:
ucx_info -d | grep "Transport: rc"
ucx_info -d | grep "memory types.*cuda"
For libfabric deployments (AWS EFA):
fi_info -p efa
ls /dev/gdrdrv