---
sidebar-title: Disagg Communication
subtitle: Best practices for prefill/decode worker communication on Kubernetes
---

# Disaggregated Inference Communication Guide

This guide explains how prefill and decode workers communicate in Dynamo's disaggregated inference architecture on Kubernetes. It answers the frequently asked question: **Why can't prefill and decode workers use NVLink to communicate on the same node?**

## Summary

- **NVLink cannot be used between Kubernetes pods** due to process isolation and GPU partitioning
- **RDMA (InfiniBand/RoCE) is required** for production disaggregated deployments
- **Without RDMA, expect 200-500x performance degradation** in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA
- **UCX is the communication layer** that NIXL uses to transfer KV cache between workers

---

## Architecture Overview

### Communication Stack

*Figure: Disaggregated inference communication stack showing NIXL, UCX, and transport layers*

### Component Responsibilities

| Component | Role | Location |
|-----------|------|----------|
| **NIXL** | High-level KV cache transfer API | Dynamo runtime library |
| **UCX** | Low-level communication framework | System library |
| **Transports** | Physical data movement | Hardware/kernel drivers |

---

## Why NVLink Cannot Be Used Between Pods

### The Fundamental Constraint

NVLink is a **direct GPU-to-GPU interconnect** that operates at the hardware level. It requires:

1. **Same process** - Both GPUs must be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called
2. **Direct memory access** - The process must have permission to access both GPU memory regions
3. **Peer-to-peer mapping** - The CUDA runtime must establish memory mappings between GPUs

**Kubernetes pods violate all three requirements:**

*Figure: Why NVLink cannot work between Kubernetes pods due to process isolation*

### Technical Explanation

1. **Process Isolation**: Kubernetes pods run in separate Linux namespaces.
   Even on the same node, Pod A cannot directly access Pod B's memory space.
2. **GPU Partitioning**: The Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Pod A's GPU 0 and Pod B's GPU 0 are physically different devices.
3. **Memory Registration**: NVLink transfers use `cudaMemcpy` with peer access enabled. This requires calling `cudaDeviceEnablePeerAccess()`, which can only map memory between GPUs owned by the same process - impossible across pod boundaries.

### Where NVLink DOES Work

NVLink works **within a pod** for parallelism strategies (TP, EP) where all GPUs are in the same process:

```yaml
# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"  # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"  # NVLink used for TP/EP communication within pod
```

---

## Supported Communication Options

### Transport Comparison

| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
|-----------|-----------|---------|-----------|------------|------------|
| **NVLink** | 450-900 GB/s | ~µs | ✅ (intra-pod only) | ❌ | ✅ |
| **InfiniBand RDMA** | 20-50 GB/s | ~1 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **RoCE RDMA** | 10-25 GB/s | ~2 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **TCP** | 1-3 GB/s | ~50 µs | ✅ | ✅ | ❌ (host staging) |

### Same-Node Communication

When prefill and decode workers are on the **same physical node**:

*Figure: Same-node RDMA communication between prefill and decode pods*

**Options (best to worst):**

1. InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
2. RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
3. Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
4. TCP (fallback) → GPU→CPU→TCP→CPU→GPU

**Best Practice**: Use RDMA even for same-node communication.
The overhead is minimal, and it provides consistent behavior whether pods land on the same or different nodes.

### Cross-Node Communication

When prefill and decode workers are on **different nodes**:

*Figure: Cross-node RDMA communication between prefill and decode pods on separate nodes*

**Requirements for optimal cross-node performance:**

- InfiniBand or RoCE network fabric
- GPUDirect RDMA enabled (GPU memory registered with NIC)
- Proper UCX configuration

---

## UCX Configuration Reference

### Environment Variables

UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.

#### Core Transport Selection

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
```

| Transport | Description | When to Use |
|-----------|-------------|-------------|
| `rc_x` | Reliable Connection (accelerated) | Primary RDMA transport |
| `rc` | Reliable Connection (standard) | Fallback RDMA |
| `dc_x` | Dynamically Connected (accelerated) | Scalable RDMA (many endpoints) |
| `dc` | Dynamically Connected (standard) | Fallback scalable RDMA |
| `cuda_copy` | GPU↔Host memory staging | Required for GPU buffers |
| `cuda_ipc` | CUDA IPC (same-node, same-pod) | Intra-pod GPU transfers |
| `tcp` | TCP sockets | Fallback when RDMA unavailable |
| `srd` | Scalable Reliable Datagram (AWS EFA) | AWS-specific (provided by EFA, not core UCX) |

**Excluding transports**: Use the `^` prefix to exclude (e.g., `UCX_TLS=^mm` excludes memory mapping).

**Note**: When specifying `UCX_TLS` explicitly with GPU memory, you must include `cuda_copy` or `cuda_ipc` for UCX to recognize GPU buffers.
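These transport-selection rules lend themselves to a quick pre-deployment sanity check. The sketch below flags a `UCX_TLS` value that omits the CUDA transports or every RDMA transport; the `validate_ucx_tls` helper is hypothetical, not part of Dynamo or UCX:

```python
def validate_ucx_tls(tls: str) -> list:
    """Return warnings for a UCX_TLS value intended for GPU KV cache transfers."""
    warnings = []
    if tls.startswith("^"):
        # Exclusion form: everything except the listed transports stays enabled,
        # so CUDA support is lost only if both CUDA transports are excluded.
        excluded = set(tls[1:].split(","))
        if {"cuda_copy", "cuda_ipc"} <= excluded:
            warnings.append("all CUDA transports excluded; UCX cannot handle GPU buffers")
    else:
        transports = set(tls.split(","))
        if not transports & {"cuda_copy", "cuda_ipc"}:
            warnings.append("missing cuda_copy/cuda_ipc; UCX will not recognize GPU buffers")
        if not transports & {"rc_x", "rc", "dc_x", "dc", "srd"}:
            warnings.append("no RDMA transport listed; transfers may fall back to TCP")
    return warnings

print(validate_ucx_tls("rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"))  # []
```

Running this in an init container or CI step catches the most common misconfiguration (a hand-edited `UCX_TLS` that silently drops GPU buffer support) before it surfaces as a 98-second TTFT.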
#### Rendezvous Protocol Settings

```yaml
env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
```

| Variable | Value | Description |
|----------|-------|-------------|
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA GET (receiver pulls data) |
| `UCX_RNDV_SCHEME` | `put_zcopy` | Zero-copy RDMA PUT (sender pushes data) |
| `UCX_RNDV_SCHEME` | `auto` | Let UCX choose based on message size |
| `UCX_RNDV_THRESH` | `0` | Use rendezvous for all message sizes |
| `UCX_RNDV_THRESH` | `8192` | Use rendezvous for messages ≥8KB |
| `UCX_RNDV_THRESH` | `auto` | Let UCX calculate optimal threshold |

**Recommendation**: Use `get_zcopy` with threshold `0` for KV cache transfers (always large).

> **⚠️ AWS EFA Exception**: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See [AWS EFA Configuration](#aws-efa-configuration) for required settings.

#### Memory Registration

```yaml
env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
```

| Method | Description |
|--------|-------------|
| `odp` | On-Demand Paging (dynamic registration) |
| `rcache` | Registration cache (reuse registrations) |
| `direct` | Direct registration (each transfer) |

#### Debugging and Diagnostics

```yaml
env:
  - name: UCX_LOG_LEVEL
    value: "info"  # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log"  # Optional: log to file instead of stdout
```

**Note**: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX compiled with the `--enable-stats` flag, which is not enabled in default builds.
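A threshold of `0` is safe because KV cache payloads dwarf any plausible eager/rendezvous cutoff. A back-of-the-envelope sketch using the standard KV cache size formula (the Llama-3.1-8B shape values used here, 32 layers, 8 KV heads, head dimension 128, fp16, are assumptions for illustration):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Standard KV cache size: K and V tensors per layer, one vector per KV head per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
prefill_4k = kv_cache_bytes(32, 8, 128, seq_len=4000)
print(per_token)            # 131072 bytes, i.e. 128 KiB per token
print(prefill_4k // 2**20)  # 500 (MiB) for a 4000-token prompt
```

Even a single token's KV block (128 KiB under these assumptions) is well above an 8 KiB rendezvous threshold, so forcing rendezvous for all sizes costs nothing for this workload.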
### Complete Production Configuration

```yaml
env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3"  # RoCE v2 GID index (cluster-specific)
```

### AWS EFA Configuration

> **⚠️ Critical: Zero-copy RDMA causes crashes on AWS Kernel 6.8+**
>
> On AWS Ubuntu 24.04 with Kernel ≥6.8, using `UCX_RNDV_SCHEME=get_zcopy` triggers a fatal `NIXL_ERR_BACKEND` crash. The EFA provider cannot register CUDA memory due to incomplete DMA-BUF support in `efa_nv_peermem`.
>
> **You MUST use the configuration below** — do not copy the standard InfiniBand settings.

> **Note: NIXL is migrating from UCX to libfabric for AWS**
>
> The Dynamo team is transitioning NIXL to use **libfabric** instead of UCX for AWS EFA deployments. This change is driven by:
>
> - **Better topology awareness**: libfabric provides hierarchical topology awareness similar to NCCL
> - **Native EFA support**: libfabric is the recommended communication layer for AWS EFA
>
> **Current status**: UCX over EFA works but is not recommended for production. Published AWS examples are functional but not performant. Check with the Dynamo team for the libfabric availability timeline.
**Required AWS EFA Configuration** (Ubuntu 24.04 + Kernel ≥6.8):

```yaml
env:
  - name: UCX_TLS
    value: "srd,cuda_copy,tcp"  # SRD is EFA's RDMA transport
  - name: UCX_RNDV_SCHEME
    value: "auto"  # DO NOT use get_zcopy - causes crashes
  - name: UCX_RNDV_THRESH
    value: "8192"  # Avoid CUDA zero-copy for large transfers
```

**Why these settings are mandatory**:

- `UCX_RNDV_SCHEME=auto` prevents UCX from forcing zero-copy RDMA on CUDA buffers
- `UCX_RNDV_THRESH=8192` ensures large KV cache transfers use host staging instead of GPU-direct (which fails)
- Using `get_zcopy` or threshold `0` will cause `remote invalid RD request` errors and worker crashes

**Known Limitations**:

- GPUDirect RDMA is non-functional on AWS EFA with Ubuntu 24.04 + kernel ≥6.8
- Expect ~3x performance degradation compared to InfiniBand (host-staged transfers)
- For optimal disaggregated performance, consider clusters with InfiniBand/RoCE, or wait for libfabric support on AWS

---

## Deployment Configuration

### Kubernetes Resource Requirements

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]  # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2"  # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"
```

### Required Capabilities and Resources

| Setting | Purpose | Notes |
|---------|---------|-------|
| `IPC_LOCK` capability | Pin memory for RDMA | Bypasses RLIMIT_MEMLOCK; required for `ibv_reg_mr()` to pin GPU/host buffers |
| `rdma/ib` resources | RDMA NIC access | Provided by RDMA device plugin |
| `sharedMemory.size` | IPC between processes | 16Gi for vLLM, 80Gi for TRT-LLM |

### Infrastructure Prerequisites

1. **RDMA Device Plugin**: Exposes `rdma/ib` resources to Kubernetes

   ```bash
   kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'
   ```

2. **InfiniBand/RoCE Network**: Physical RDMA fabric connecting nodes

3. **GPUDirect RDMA** (optional but recommended):
   - NVIDIA driver with GPUDirect enabled
   - `nvidia-peermem` kernel module loaded
   - NIC firmware supporting GPUDirect

---

## Diagnostics and Performance Validation

### Pre-Deployment Validation

#### 1. Verify RDMA Availability

```bash
# Check RDMA devices on node (replace <node-name> with the target node)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
ibv_devinfo
```

Expected output shows InfiniBand or RoCE devices:

```text
hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             28.35.2000
    ...
```

#### 2. Check UCX Transport Capabilities

```bash
# Inside a Dynamo worker pod
ucx_info -d
```

Look for GPU memory support:

```text
# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
#                                            ^^^^ GPU memory supported
```

**If you only see `host`**: GPUDirect RDMA is not working. KV transfers will use host staging.

#### 3. Test UCX Performance

```bash
# Server (on decode worker pod)
ucx_perftest -t tag_bw -n 100 -s 134217728

# Client (on prefill worker pod; replace <decode-pod-ip> with the server pod's IP)
ucx_perftest <decode-pod-ip> -t tag_bw -n 100 -s 134217728
```

**Expected bandwidth**:

- InfiniBand HDR: 20-25 GB/s per port
- RoCE 100GbE: 10-12 GB/s
- TCP fallback: 1-2 GB/s

### NIXL Benchmark Tool

Deploy the NIXL benchmark to validate end-to-end KV transfer performance:

```bash
cd deploy/pre-deployment/nixl
./build_and_deploy.sh
```

This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.
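When wiring checks like step 2 into CI or a preflight job, the `ucx_info -d` scrape can be automated. A minimal heuristic sketch (the exact output format varies between UCX versions, so treat this parsing as an assumption rather than a stable interface):

```python
def gpu_direct_supported(ucx_info_output: str) -> bool:
    """Heuristic scan of `ucx_info -d` output: True if any memory domain
    advertises cuda memory types, suggesting a GPUDirect RDMA path."""
    return any(
        "memory types" in line and "cuda" in line
        for line in ucx_info_output.splitlines()
    )

healthy = "# Memory domain: mlx5_0\n#   Component: ib\n#   memory types: host (access,reg,cache), cuda (access,reg,cache)"
degraded = "# Memory domain: mlx5_0\n#   Component: ib\n#   memory types: host (access,reg,cache)"
print(gpu_direct_supported(healthy), gpu_direct_supported(degraded))  # True False
```

A preflight job could run `ucx_info -d` in each worker pod, feed the output through this check, and fail the rollout early instead of discovering host-staged transfers via degraded TTFT.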
### Runtime Diagnostics

#### Verify NIXL Backend Initialization

```bash
kubectl logs <pod-name> | grep -i "NIXL\|UCX"
```

**Good output**:

```text
NIXL INFO Backend UCX was instantiated
```

**Bad output** (RDMA not working):

```text
UCX  WARN  no RDMA transports available
NIXL INFO falling back to TCP transport
```

#### Monitor Transfer Performance

Check Grafana dashboards for:

- **NIXL transfer bandwidth**: Should show GB/s, not MB/s
- **KV cache transfer latency**: Should be under 500ms for typical workloads

**Red flags indicating RDMA issues**:

- Transfer bandwidth under 1 GB/s
- TTFT > 10 seconds
- `Unsupported operation` errors in logs

### Common Diagnostic Commands

```bash
# Check UCX transport selection
kubectl exec <pod-name> -- env | grep UCX

# Verify RDMA device visibility
kubectl exec <pod-name> -- ls /dev/infiniband/

# Check GPUDirect RDMA status (on node)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- \
  nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"

# Test basic connectivity between pods
kubectl exec <prefill-pod> -- ping -c 3 <decode-pod-ip>
```

---

## Performance Expectations

### KV Cache Transfer Overhead

| Configuration | TTFT Overhead | Source |
|---------------|---------------|--------|
| Aggregated (baseline) | 0 | No KV transfer needed |
| Disagg + InfiniBand RDMA with GPUDirect | +200-500ms | *Expected* based on hardware specs |
| Disagg + RoCE RDMA with GPUDirect | +300-800ms | *Expected* based on hardware specs |
| Disagg + Host-staged (no GPUDirect) | +1-3s | *Expected* - CPU bottleneck |
| Disagg + AWS EFA (without GPUDirect) | ~3x slower than aggregated | *Measured* on AWS p5.48xlarge |
| Disagg + TCP fallback | **+90-100s** | *Measured* ~98s TTFT on AWS p5.48xlarge |

> **Note**: InfiniBand/RoCE numbers with GPUDirect are expected values based on hardware specifications and have not been validated. AWS measurements reflect EFA without functional GPUDirect RDMA (see [AWS EFA Configuration](#aws-efa-configuration) for details).
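The table above can be sanity-checked against raw wire time. The sketch below divides an assumed ~0.5 GB KV payload (roughly Llama-3.1-8B at ISL=4000) by nominal transport bandwidths; note that the measured TCP figure is far worse than this lower bound because TCP transfers also pay for host staging and per-block protocol overhead:

```python
def transfer_time_ms(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Lower-bound wire time for one KV cache transfer, ignoring all protocol overhead."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3

kv_bytes = 0.5e9  # assumed ~0.5 GB KV cache for one 4000-token prompt
print(f"InfiniBand @25  GB/s: {transfer_time_ms(kv_bytes, 25):.0f} ms")   # 20 ms
print(f"RoCE       @12  GB/s: {transfer_time_ms(kv_bytes, 12):.0f} ms")   # 42 ms
print(f"TCP        @1.5 GB/s: {transfer_time_ms(kv_bytes, 1.5):.0f} ms")  # 333 ms
```

If your observed per-transfer latency is orders of magnitude above this bound for your payload size, the transport is almost certainly host-staged or TCP rather than GPUDirect RDMA.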
### When Disaggregated Makes Sense

**Use disaggregated architecture when:**

- Output sequence length (OSL) > 1000 tokens (overhead amortized)
- You need independent scaling of prefill vs decode capacity
- Prefill and decode have different hardware requirements

**Use aggregated architecture when:**

- Low-latency TTFT is critical
- Outputs are short (OSL under 500 tokens)
- RDMA is not available

### Break-Even Analysis

The KV transfer overhead is amortized across output tokens. Example data from **Llama-3.1-8B-Instruct** on AWS p5.48xlarge:

```text
Total Latency = TTFT + (OSL × ITL)

Example (Llama-3.1-8B, ISL=4000):
- Aggregated:     218ms + (OSL × 8.0ms)
- Disaggregated: 2400ms + (OSL × 7.8ms)

Break-even: 2400 - 218 = 2182ms overhead
            2182ms / (8.0 - 7.8)ms per token = 10,910 tokens

At OSL=2000: Disagg is 1.1x slower (acceptable)
At OSL=100:  Disagg is 3.1x slower (not recommended)
```

---

## Troubleshooting Guide

### Problem: TTFT is 10+ seconds

**Symptoms**: TTFT degrades from the expected 200-500ms to 10+ seconds

**Root Cause**: RDMA not active, falling back to TCP

**Diagnosis**:

```bash
kubectl logs <pod-name> | grep -i "transport\|UCX\|TCP"
```

**Solutions**:

1. Verify the RDMA device plugin is installed
2. Add `rdma/ib` resource requests to the pod spec
3. Add the `IPC_LOCK` capability
4. Set the UCX environment variables

### Problem: "Unsupported operation" errors

**Symptoms**: Logs show `Unexpected UCX error: Unsupported operation`

**Root Cause**: UCX attempting GPU RDMA on hardware that doesn't support it

**Solutions**:

1. Check if GPUDirect RDMA is enabled: `ucx_info -d | grep cuda`
2. If not supported, set `UCX_RNDV_THRESH=inf` to disable GPU RDMA
3. Verify the `nvidia-peermem` module is loaded

### Problem: AWS EFA not using GPUDirect

**Symptoms**: 3x performance degradation on AWS despite EFA being configured

**Root Cause**: GPUDirect RDMA not functional on kernel ≥6.8 with EFA

**Current Status**: This is a known limitation. Options:

1. Use a kernel before 6.8 (Ubuntu 22.04 with kernel 5.15)
2. Accept the host-staging performance penalty
3. Wait for AWS to update EFA DMA-BUF support

### Problem: Intermittent transfer failures

**Symptoms**: Sporadic `getXferStatus: backend 'UCX' returned error status`

**Diagnosis**:

```bash
# Enable UCX debug logging
kubectl set env deployment/<deployment-name> UCX_LOG_LEVEL=debug
kubectl logs <pod-name> | grep -i error
```

**Common causes**:

- Network congestion or packet loss
- Mismatched UCX versions between pods
- RDMA resource exhaustion

---

## Quick Reference

### Minimum Viable RDMA Configuration

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
securityContext:
  capabilities:
    add: ["IPC_LOCK"]
resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"
```

### Diagnostic Checklist

- [ ] `rdma/ib` resources visible: `kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'`
- [ ] UCX sees RDMA devices: `ucx_info -d | grep "Transport: rc"`
- [ ] UCX sees GPU memory: `ucx_info -d | grep "memory types.*cuda"`
- [ ] NIXL initialized with UCX: `kubectl logs <pod-name> | grep "Backend UCX"`
- [ ] Transfer bandwidth > 1 GB/s (check Grafana metrics)

---

## Related Documentation

- [Disaggregated Serving Architecture](/dynamo/dev/design-docs/disaggregated-serving)
- [AIConfigurator Deployment Guide](/dynamo/dev/user-guides/disaggregated-serving)
- [NIXL Benchmark Deployment](../../deploy/pre-deployment/nixl/README.md)
- [KV Cache Transfer Methods](/dynamo/dev/additional-resources/tensor-rt-llm-details/kv-cache-transfer)