Disaggregated Inference Communication Guide
Best practices for prefill/decode worker communication on Kubernetes
This guide explains how prefill and decode workers communicate in Dynamo’s disaggregated inference architecture on Kubernetes. It answers the frequently asked question: Why can’t prefill and decode workers use NVLink to communicate on the same node?
Summary
- NVLink cannot be used between Kubernetes pods due to process isolation and GPU partitioning
- RDMA (InfiniBand/RoCE) is required for production disaggregated deployments
- Without RDMA, expect 200-500x performance degradation in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA
- UCX is the communication layer that NIXL uses to transfer KV cache between workers
Architecture Overview
Communication Stack
Component Responsibilities
Why NVLink Cannot Be Used Between Pods
The Fundamental Constraint
NVLink is a direct GPU-to-GPU interconnect that operates at the hardware level. It requires:
- Same process: both GPUs must be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called
- Direct memory access: the process must have permission to access both GPU memory regions
- Peer-to-peer mapping: the CUDA runtime must establish memory mappings between the GPUs
Kubernetes pods violate all three requirements:
Technical Explanation
- Process isolation: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B’s memory space, and NVLink peer-to-peer transfers require both GPUs to be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called.
- GPU partitioning: the Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Pod A’s GPU 0 and Pod B’s GPU 0 are physically different devices.
- Memory registration: NVLink transfers use `cudaMemcpy` with peer access enabled. This requires calling `cudaDeviceEnablePeerAccess()`, which is impossible across process boundaries.
Where NVLink DOES Work
NVLink works within a pod for parallelism strategies such as tensor parallelism (TP) and expert parallelism (EP), where all GPUs belong to the same process:
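To confirm which GPU pairs inside a pod are NVLink-connected, the GPU topology matrix can be inspected (assumes the NVIDIA driver utilities are available in the container):

```shell
# Print the GPU interconnect topology matrix.
# "NV1", "NV2", ... entries mark NVLink-connected GPU pairs;
# "PIX"/"PHB"/"SYS" indicate PCIe or system-level paths instead.
nvidia-smi topo -m
```

If every GPU pair shows a PCIe path rather than an NV# entry, intra-pod parallelism will not benefit from NVLink on that node.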
Supported Communication Options
Transport Comparison
Same-Node Communication
When prefill and decode workers are on the same physical node:
Options (best to worst):
- InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
- RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
- Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
- TCP (fallback) → GPU→CPU→TCP→CPU→GPU
Best Practice: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.
Cross-Node Communication
When prefill and decode workers are on different nodes:
Requirements for optimal cross-node performance:
- InfiniBand or RoCE network fabric
- GPUDirect RDMA enabled (GPU memory registered with NIC)
- Proper UCX configuration
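A quick way to sanity-check these prerequisites from a worker pod or node (assumes the `rdma-core` utilities are installed and the standard peer-memory module names):

```shell
# RDMA NICs visible, and which link layer (InfiniBand vs Ethernet/RoCE)?
ibv_devinfo | grep -E "hca_id|link_layer"

# GPUDirect peer-memory kernel module loaded on the node?
lsmod | grep -E "nvidia_peermem|nv_peer_mem"
```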
UCX Configuration Reference
Environment Variables
UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.
Core Transport Selection
Excluding transports: use the `^` prefix to exclude (e.g., `UCX_TLS=^mm` excludes the memory-mapping transport).
Note: when specifying `UCX_TLS` explicitly with GPU memory, you must include `cuda_copy` or `cuda_ipc` for UCX to recognize GPU buffers.
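As a sketch, a transport list for an RDMA-plus-GPU deployment might look like the following (the exact list depends on your fabric and UCX build):

```shell
# RC verbs for RDMA, plus CUDA memory support so UCX recognizes GPU buffers
export UCX_TLS=rc,cuda_copy,cuda_ipc

# Alternatively, exclude a single transport with the ^ prefix:
# export UCX_TLS=^mm
```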
Rendezvous Protocol Settings
Recommendation: use `get_zcopy` with `UCX_RNDV_THRESH=0` for KV cache transfers (they are always large).
⚠️ AWS EFA Exception: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See AWS EFA Configuration for required settings.
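For non-AWS RDMA fabrics, the recommendation above translates to two environment variables (a sketch; set them on both prefill and decode pods):

```shell
export UCX_RNDV_SCHEME=get_zcopy   # zero-copy rendezvous for large transfers
export UCX_RNDV_THRESH=0           # KV cache transfers are always large, so always use rendezvous
```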
Memory Registration
Debugging and Diagnostics
Note: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX compiled with the `--enable-stats` flag, which is not enabled in default builds.
Complete Production Configuration
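A production configuration for an InfiniBand/RoCE cluster might look like the following sketch, combining the recommendations above (do not copy this onto AWS EFA; see the next section):

```shell
# UCX settings for prefill and decode worker pods on InfiniBand/RoCE.
export UCX_TLS=rc,cuda_copy,cuda_ipc   # RDMA RC transport + CUDA memory support
export UCX_RNDV_SCHEME=get_zcopy       # zero-copy rendezvous for KV cache transfers
export UCX_RNDV_THRESH=0               # treat all transfers as large

# Optional, for troubleshooting only (verbose):
# export UCX_LOG_LEVEL=info
```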
AWS EFA Configuration
⚠️ Critical: Zero-Copy RDMA causes crashes on AWS Kernel 6.8+
On AWS Ubuntu 24.04 with Kernel ≥6.8, using `UCX_RNDV_SCHEME=get_zcopy` triggers a fatal `NIXL_ERR_BACKEND` crash. The EFA provider cannot register CUDA memory due to incomplete DMA-BUF support in `efa_nv_peermem`. You MUST use the configuration below; do not copy the standard InfiniBand settings.
Note: NIXL is migrating from UCX to libfabric for AWS. The Dynamo team is transitioning NIXL to use libfabric instead of UCX for AWS EFA deployments. This change is driven by:
- Better topology awareness: libfabric provides hierarchical topology awareness similar to NCCL
- Native EFA support: libfabric is the recommended communication layer for AWS EFA
Current status: UCX over EFA works but is not recommended for production. Published AWS examples are functional but not performant. Check with the Dynamo team for libfabric availability timeline.
Required AWS EFA Configuration (Ubuntu 24.04 + Kernel ≥6.8):
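Based on the explanation that follows, the mandatory settings reduce to two environment variables:

```shell
# Required on AWS EFA with Ubuntu 24.04 + kernel >= 6.8
export UCX_RNDV_SCHEME=auto     # never force zero-copy RDMA on CUDA buffers
export UCX_RNDV_THRESH=8192     # large transfers fall back to host staging
```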
Why these settings are mandatory:
- `UCX_RNDV_SCHEME=auto` prevents UCX from forcing zero-copy RDMA on CUDA buffers
- `UCX_RNDV_THRESH=8192` ensures large KV cache transfers use host staging instead of GPU-direct (which fails)
- Using `get_zcopy` or threshold `0` will cause `remote invalid RD request` errors and worker crashes
Known Limitations:
- GPU Direct RDMA is non-functional on AWS EFA with Ubuntu 24.04 + kernel ≥6.8
- Expect 3x performance degradation compared to InfiniBand (host-staged transfers)
- For optimal disaggregated performance, consider clusters with InfiniBand/RoCE, or wait for libfabric support on AWS
Deployment Configuration
Kubernetes Resource Requirements
Required Capabilities and Resources
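A minimal pod-spec fragment covering the capabilities and resources named in this guide (resource counts are illustrative; adjust to your cluster):

```yaml
# Worker pod fragment (illustrative values)
resources:
  limits:
    nvidia.com/gpu: 1
    rdma/ib: 1          # exposed by the RDMA device plugin
securityContext:
  capabilities:
    add: ["IPC_LOCK"]   # allows pinning memory for RDMA registration
```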
Infrastructure Prerequisites
- RDMA Device Plugin: exposes `rdma/ib` resources to Kubernetes
- InfiniBand/RoCE Network: physical RDMA fabric connecting nodes
- GPUDirect RDMA (optional but recommended):
  - NVIDIA driver with GPUDirect enabled
  - `nvidia-peermem` kernel module loaded
  - NIC firmware supporting GPUDirect
Diagnostics and Performance Validation
Pre-Deployment Validation
1. Verify RDMA Availability
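A typical check from inside a worker pod (assumes the `rdma-core` utilities are installed):

```shell
# List RDMA devices visible to this pod; use ibv_devinfo for full details
ibv_devices
```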
Expected output shows InfiniBand or RoCE devices.
2. Check UCX Transport Capabilities
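One way to inspect this, using the same pattern as the checklist at the end of this guide:

```shell
# Show UCX transports and the memory types each supports
ucx_info -d | grep -iE "Transport|memory types"
```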
Look for GPU memory support in the output. If you only see `host`, GPUDirect RDMA is not working and KV transfers will use host staging.
3. Test UCX Performance
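Raw UCX bandwidth can be measured with `ucx_perftest` (a sketch; start the server side first, and `-m cuda` requires a CUDA-enabled UCX build):

```shell
# Server (node A): streaming-bandwidth test on CUDA memory
ucx_perftest -t tag_bw -m cuda

# Client (node B; <server-ip> is a placeholder)
ucx_perftest <server-ip> -t tag_bw -m cuda
```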
Expected bandwidth:
- InfiniBand HDR: 20-25 GB/s per port
- RoCE 100GbE: 10-12 GB/s
- TCP fallback: 1-2 GB/s
NIXL Benchmark Tool
Deploy the NIXL benchmark to validate end-to-end KV transfer performance:
This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.
Runtime Diagnostics
Verify NIXL Backend Initialization
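The backend line can be checked with the same log pattern used in the checklist at the end of this guide (`<pod>` is a placeholder):

```shell
kubectl logs <pod> | grep -i "Backend UCX"
```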
Good output:
Bad output (RDMA not working):
Monitor Transfer Performance
Check Grafana dashboards for:
- NIXL transfer bandwidth: Should show GB/s, not MB/s
- KV cache transfer latency: Should be under 500ms for typical workloads
Red flags indicating RDMA issues:
- Transfer bandwidth under 1 GB/s
- TTFT > 10 seconds
- `Unsupported operation` errors in logs
Common Diagnostic Commands
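The checks used throughout this guide, gathered in one place (`<pod>` is a placeholder):

```shell
# RDMA resources allocatable per node
kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'

# RDMA transports available to UCX
ucx_info -d | grep "Transport: rc"

# GPU memory support in UCX
ucx_info -d | grep "memory types.*cuda"

# NIXL backend initialization in worker logs
kubectl logs <pod> | grep "Backend UCX"
```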
Performance Expectations
KV Cache Transfer Overhead
Note: InfiniBand/RoCE numbers with GPUDirect are expected values based on hardware specifications and have not been validated. AWS measurements reflect EFA without functional GPUDirect RDMA (see AWS EFA Configuration for details).
When Disaggregated Makes Sense
Use disaggregated architecture when:
- Output sequence length (OSL) > 1000 tokens (overhead amortized)
- You need independent scaling of prefill vs decode capacity
- Prefill and decode have different hardware requirements
Use aggregated architecture when:
- Low-latency TTFT is critical
- Short outputs (OSL under 500 tokens)
- RDMA is not available
Break-Even Analysis
The KV transfer overhead is amortized across output tokens. Example data from Llama-3.1-8B-Instruct on AWS p5.48xlarge:
Troubleshooting Guide
Problem: TTFT is 10+ seconds
Symptoms: TTFT degrades from expected 200-500ms to 10+ seconds
Root Cause: RDMA not active, falling back to TCP
Diagnosis:
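A quick diagnosis, using patterns that appear elsewhere in this guide (`<pod>` is a placeholder):

```shell
# Did NIXL come up with a UCX/RDMA backend, or fall back to TCP?
kubectl logs <pod> | grep -i "Backend UCX"

# Does the pod actually request RDMA resources?
kubectl get pod <pod> -o jsonpath='{..limits.rdma/ib}'
```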
Solutions:
- Verify RDMA device plugin is installed
- Add `rdma/ib` resource requests to the pod spec
- Add the `IPC_LOCK` capability
- Set UCX environment variables
Problem: “Unsupported operation” errors
Symptoms: Logs show `Unexpected UCX error: Unsupported operation`
Root Cause: UCX attempting GPU RDMA on hardware that doesn’t support it
Solutions:
- Check if GPUDirect RDMA is enabled: `ucx_info -d | grep cuda`
- If not supported, set `UCX_RNDV_THRESH=inf` to disable GPU RDMA
- Verify the `nvidia-peermem` module is loaded
Problem: AWS EFA not using GPU Direct
Symptoms: 3x performance degradation on AWS despite EFA configured
Root Cause: GPU Direct RDMA not functional on kernel ≥6.8 with EFA
Current Status: This is a known limitation. Options:
- Use kernel before 6.8 (Ubuntu 22.04 with kernel 5.15)
- Accept host-staging performance penalty
- Wait for AWS to update EFA DMA-BUF support
Problem: Intermittent transfer failures
Symptoms: Sporadic `getXferStatus: backend 'UCX' returned error status` errors
Diagnosis:
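Since mismatched UCX versions between pods are a common cause, compare them directly (pod names are placeholders):

```shell
# UCX versions should match between prefill and decode workers
kubectl exec <prefill-pod> -- ucx_info -v
kubectl exec <decode-pod> -- ucx_info -v
```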
Common causes:
- Network congestion or packet loss
- Mismatched UCX versions between pods
- RDMA resource exhaustion
Quick Reference
Minimum Viable RDMA Configuration
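A minimal sketch combining the pieces covered in this guide; resource names and settings are the ones discussed above, and values should be adjusted to your cluster (on AWS EFA, use `auto`/`8192` for the rendezvous settings instead):

```yaml
# Worker pod fragment: minimum viable RDMA setup (illustrative)
resources:
  limits:
    rdma/ib: 1
securityContext:
  capabilities:
    add: ["IPC_LOCK"]
env:
  - name: UCX_TLS
    value: "rc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
```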
Diagnostic Checklist
- `rdma/ib` resources visible: `kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'`
- UCX sees RDMA devices: `ucx_info -d | grep "Transport: rc"`
- UCX sees GPU memory: `ucx_info -d | grep "memory types.*cuda"`
- NIXL initialized with UCX: `kubectl logs <pod> | grep "Backend UCX"`
- Transfer bandwidth > 1 GB/s (check Grafana metrics)