> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Disaggregated Inference Communication Guide

This guide explains how prefill and decode workers communicate in Dynamo's disaggregated inference architecture on Kubernetes. It answers the frequently asked question: **Why can't prefill and decode workers use NVLink to communicate on the same node?**

## Summary

- **NVLink cannot be used between Kubernetes pods** due to process isolation and GPU partitioning
- **RDMA (InfiniBand/RoCE) is required** for production disaggregated deployments
- **Without RDMA, expect 200-500x performance degradation** in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA
- **UCX is the communication layer** that NIXL uses to transfer KV cache between workers

---

## Architecture Overview

### Communication Stack

<Frame>
  <img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/788f7efb1c158d5f51a0dcc86156104b7d127d4002312096a9190e690113b1af/pages-v1.0.2/assets/img/disagg-comm-stack.svg" alt="Disaggregated inference communication stack showing NIXL, UCX, and transport layers" />
</Frame>

### Component Responsibilities

| Component | Role | Location |
|-----------|------|----------|
| **NIXL** | High-level KV cache transfer API | Dynamo runtime library |
| **UCX** | Low-level communication framework | System library |
| **Transports** | Physical data movement | Hardware/kernel drivers |

---

## Why NVLink Cannot Be Used Between Pods

### The Fundamental Constraint

NVLink is a **direct GPU-to-GPU interconnect** that operates at the hardware level. It requires:

1. **Same process** - Both GPUs must be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called
2. **Direct memory access** - Process must have permission to access both GPU memory regions
3. **Peer-to-peer mapping** - CUDA runtime must establish memory mappings between GPUs

**Kubernetes pods violate all three requirements:**

<Frame>
  <img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/aa3eb60364dcc4251ec02a3c60e351b6753b94f8b95aaec0e54f25d282d3c155/pages-v1.0.2/assets/img/disagg-nvlink-limitation.svg" alt="Why NVLink cannot work between Kubernetes pods due to process isolation" />
</Frame>

### Technical Explanation

1. **Process Isolation**: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B's memory space.

2. **GPU Partitioning**: The Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Pod A's GPU 0 and Pod B's GPU 0 are physically different devices.

3. **Process/Namespace Isolation**: Each pod runs in a separate process namespace. NVLink peer-to-peer transfers require both GPUs to be within the same process so `cudaDeviceEnablePeerAccess()` can be called.

4. **Memory Registration**: NVLink transfers use `cudaMemcpy` with peer access enabled. This requires calling `cudaDeviceEnablePeerAccess()` - impossible across process boundaries.

### Where NVLink DOES Work

NVLink works **within a pod** for parallelism strategies (TP, EP) where all GPUs are in the same process:

```yaml
# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"   # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"        # NVLink used for TP/EP communication within pod
```

---

## Supported Communication Options

### Transport Comparison

| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
|-----------|-----------|---------|-----------|------------|------------|
| **NVLink** | 450-900 GB/s | ~µs | ✅ (intra-pod only) | ❌ | ✅ |
| **InfiniBand RDMA** | 20-50 GB/s | ~1 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **RoCE RDMA** | 10-25 GB/s | ~2 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **TCP** | 1-3 GB/s | ~50 µs | ✅ | ✅ | ❌ (host staging) |

### Same-Node Communication

When prefill and decode workers are on the **same physical node**:

<Frame>
  <img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/789ed86d31914a13c80cd1210f3c8cf4f445456fa3939841c010ce0e10b2228c/pages-v1.0.2/assets/img/disagg-same-node.svg" alt="Same-node RDMA communication between prefill and decode pods" />
</Frame>

**Options (best to worst):**
1. InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
2. RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
3. Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
4. TCP (fallback) → GPU→CPU→TCP→CPU→GPU

**Best Practice**: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.

### Cross-Node Communication

When prefill and decode workers are on **different nodes**:

<Frame>
  <img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/e7295f41b6a962ab1bc52f43ebddaa1a9d5094042aab37f21e5b95cab8b9dd48/pages-v1.0.2/assets/img/disagg-cross-node.svg" alt="Cross-node RDMA communication between prefill and decode pods on separate nodes" />
</Frame>

**Requirements for optimal cross-node performance:**
- InfiniBand or RoCE network fabric
- GPUDirect RDMA enabled (GPU memory registered with NIC)
- Proper UCX configuration

---

## UCX Configuration Reference

### Environment Variables

UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.

#### Core Transport Selection

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
```

| Transport | Description | When to Use |
|-----------|-------------|-------------|
| `rc_x` | Reliable Connection (accelerated) | Primary RDMA transport |
| `rc` | Reliable Connection (standard) | Fallback RDMA |
| `dc_x` | Dynamically Connected (accelerated) | Scalable RDMA (many endpoints) |
| `dc` | Dynamically Connected (standard) | Fallback scalable RDMA |
| `cuda_copy` | GPU↔Host memory staging | Required for GPU buffers |
| `cuda_ipc` | CUDA IPC (same-node, same-pod) | Intra-pod GPU transfers |
| `tcp` | TCP sockets | Fallback when RDMA unavailable |
| `srd` | Scalable Reliable Datagram (AWS EFA) | AWS-specific (provided by EFA, not core UCX) |

**Excluding transports**: Use `^` prefix to exclude (e.g., `UCX_TLS=^mm` excludes memory mapping).

**Note**: When specifying `UCX_TLS` explicitly with GPU memory, you must include `cuda_copy` or `cuda_ipc` for UCX to recognize GPU buffers.

#### Rendezvous Protocol Settings

```yaml
env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
```

| Variable | Value | Description |
|----------|-------|-------------|
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA GET (receiver pulls data) |
| `UCX_RNDV_SCHEME` | `put_zcopy` | Zero-copy RDMA PUT (sender pushes data) |
| `UCX_RNDV_SCHEME` | `auto` | Let UCX choose based on message size |
| `UCX_RNDV_THRESH` | `0` | Use rendezvous for all message sizes |
| `UCX_RNDV_THRESH` | `8192` | Use rendezvous for messages ≥8KB |
| `UCX_RNDV_THRESH` | `auto` | Let UCX calculate optimal threshold |

**Recommendation**: Use `get_zcopy` with threshold `0` for KV cache transfers (always large).

> **⚠️ AWS EFA Exception**: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See [AWS EFA Configuration](#aws-efa-configuration) for required settings.

#### Memory Registration

```yaml
env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
```

| Method | Description |
|--------|-------------|
| `odp` | On-Demand Paging (dynamic registration) |
| `rcache` | Registration cache (reuse registrations) |
| `direct` | Direct registration (each transfer) |

#### Debugging and Diagnostics

```yaml
env:
  - name: UCX_LOG_LEVEL
    value: "info"        # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log" # Optional: log to file instead of stdout
```

**Note**: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX compiled with `--enable-stats` flag, which is not enabled in default builds.

### Complete Production Configuration

```yaml
env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"

  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"

  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3"           # RoCE v2 GID index (cluster-specific)
```

### AWS EFA Configuration

> **⚠️ Critical: Zero-Copy RDMA causes crashes on AWS Kernel 6.8+**
>
> On AWS Ubuntu 24.04 with Kernel ≥6.8, using `UCX_RNDV_SCHEME=get_zcopy` triggers a fatal `NIXL_ERR_BACKEND` crash. The EFA provider cannot register CUDA memory due to incomplete DMA-BUF support in `efa_nv_peermem`.
>
> **You MUST use the configuration below** — do not copy the standard InfiniBand settings.

> **Note: NIXL is migrating from UCX to libfabric for AWS**
> The Dynamo team is transitioning NIXL to use **libfabric** instead of UCX for AWS EFA deployments. This change is driven by:
> - **Better topology awareness**: libfabric provides hierarchical topology awareness similar to NCCL
> - **Native EFA support**: libfabric is the recommended communication layer for AWS EFA
>
> **Current status**: UCX over EFA works but is not recommended for production. Published AWS examples are functional but not performant. Check with the Dynamo team for libfabric availability timeline.

**Required AWS EFA Configuration** (Ubuntu 24.04 + Kernel ≥6.8):

```yaml
env:
  - name: UCX_TLS
    value: "srd,cuda_copy,tcp"    # SRD is EFA's RDMA transport
  - name: UCX_RNDV_SCHEME
    value: "auto"                  # DO NOT use get_zcopy - causes crashes
  - name: UCX_RNDV_THRESH
    value: "8192"                  # Avoid CUDA zero-copy for large transfers
```

**Why these settings are mandatory**:
- `UCX_RNDV_SCHEME=auto` prevents UCX from forcing zero-copy RDMA on CUDA buffers
- `UCX_RNDV_THRESH=8192` ensures large KV cache transfers use host-staging instead of GPU-direct (which fails)
- Using `get_zcopy` or threshold `0` will cause `remote invalid RD request` errors and worker crashes

**Known Limitations**:
- GPU Direct RDMA is non-functional on AWS EFA with Ubuntu 24.04 + kernel ≥6.8
- Expect 3x performance degradation compared to InfiniBand (host-staged transfers)
- For optimal disaggregated performance, consider clusters with InfiniBand/RoCE, or wait for libfabric support on AWS

---

## Deployment Configuration

### Kubernetes Resource Requirements

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]      # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2"           # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"
```

### Required Capabilities and Resources

| Setting | Purpose | Notes |
|---------|---------|-------|
| `IPC_LOCK` capability | Pin memory for RDMA | Bypasses RLIMIT_MEMLOCK; required for `ibv_reg_mr()` to pin GPU/host buffers |
| `rdma/ib` resources | RDMA NIC access | Provided by RDMA device plugin |
| `sharedMemory.size` | IPC between processes | 16Gi for vLLM, 80Gi for TRT-LLM |

### Infrastructure Prerequisites

1. **RDMA Device Plugin**: Exposes `rdma/ib` resources to Kubernetes
   ```bash
   kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'
   ```

2. **InfiniBand/RoCE Network**: Physical RDMA fabric connecting nodes

3. **GPUDirect RDMA** (optional but recommended):
   - NVIDIA driver with GPUDirect enabled
   - `nvidia-peermem` kernel module loaded
   - NIC firmware supporting GPUDirect

---

## Diagnostics and Performance Validation

### Pre-Deployment Validation

#### 1. Verify RDMA Availability

```bash
# Check RDMA devices on node
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
ibv_devinfo
```

Expected output shows InfiniBand or RoCE devices:
```text
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.35.2000
        ...
```

#### 2. Check UCX Transport Capabilities

```bash
# Inside a Dynamo worker pod
ucx_info -d
```

Look for GPU memory support:
```text
# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
#                                            ^^^^ GPU memory supported
```

**If you only see `host`**: GPUDirect RDMA is not working. KV transfers will use host staging.

#### 3. Test UCX Performance

```bash
# Server (on decode worker pod)
ucx_perftest -t tag_bw -n 100 -s 134217728

# Client (on prefill worker pod)
ucx_perftest <server-ip> -t tag_bw -n 100 -s 134217728
```

**Expected bandwidth**:
- InfiniBand HDR: 20-25 GB/s per port
- RoCE 100GbE: 10-12 GB/s
- TCP fallback: 1-2 GB/s

### NIXL Benchmark Tool

Deploy the NIXL benchmark to validate end-to-end KV transfer performance:

```bash
cd deploy/pre-deployment/nixl
./build_and_deploy.sh
```

This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.

### Runtime Diagnostics

#### Verify NIXL Backend Initialization

```bash
kubectl logs <worker-pod> | grep -i "NIXL\|UCX"
```

**Good output**:
```text
NIXL INFO Backend UCX was instantiated
```

**Bad output** (RDMA not working):
```text
UCX WARN no RDMA transports available
NIXL INFO falling back to TCP transport
```

#### Monitor Transfer Performance

Check Grafana dashboards for:
- **NIXL transfer bandwidth**: Should show GB/s, not MB/s
- **KV cache transfer latency**: Should be under 500ms for typical workloads

**Red flags indicating RDMA issues**:
- Transfer bandwidth under 1 GB/s
- TTFT > 10 seconds
- `Unsupported operation` errors in logs

### Common Diagnostic Commands

```bash
# Check UCX transport selection
kubectl exec <pod> -- env | grep UCX

# Verify RDMA device visibility
kubectl exec <pod> -- ls /dev/infiniband/

# Check GPUDirect RDMA status (on node)
kubectl debug node/<node> -it --image=ubuntu:22.04 -- \
  nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"

# Test basic connectivity between pods
kubectl exec <prefill-pod> -- ping -c 3 <decode-pod-ip>
```

---

## Performance Expectations

### KV Cache Transfer Overhead

| Configuration | TTFT Overhead | Source |
|---------------|---------------|--------|
| Aggregated (baseline) | 0 | No KV transfer needed |
| Disagg + InfiniBand RDMA with GPUDirect | +200-500ms | *Expected* based on hardware specs |
| Disagg + RoCE RDMA with GPUDirect | +300-800ms | *Expected* based on hardware specs |
| Disagg + Host-staged (no GPUDirect) | +1-3s | *Expected* - CPU bottleneck |
| Disagg + AWS EFA (without GPUDirect) | ~3x slower than aggregated | *Measured* on AWS p5.48xlarge |
| Disagg + TCP fallback | **+90-100s** | *Measured* ~98s TTFT on AWS p5.48xlarge |

> **Note**: InfiniBand/RoCE numbers with GPUDirect are expected values based on hardware specifications and have not been validated. AWS measurements reflect EFA without functional GPUDirect RDMA (see [AWS EFA Configuration](#aws-efa-configuration) for details).

### When Disaggregated Makes Sense

**Use disaggregated architecture when:**
- Output sequence length (OSL) > 1000 tokens (overhead amortized)
- You need independent scaling of prefill vs decode capacity
- Prefill and decode have different hardware requirements

**Use aggregated architecture when:**
- Low-latency TTFT is critical
- Short outputs (OSL under 500 tokens)
- RDMA is not available

### Break-Even Analysis

The KV transfer overhead is amortized across output tokens. Example data from **Llama-3.1-8B-Instruct** on AWS p5.48xlarge:

```text
Total Latency = TTFT + (OSL × ITL)

Example (Llama-3.1-8B, ISL=4000):
- Aggregated:    218ms + (OSL × 8.0ms)
- Disaggregated: 2400ms + (OSL × 7.8ms)

Break-even: 2400 - 218 = 2182ms overhead
            2182ms / (8.0 - 7.8)ms per token = 10,910 tokens

At OSL=2000: Disagg is 1.1x slower (acceptable)
At OSL=100:  Disagg is 3.1x slower (not recommended)
```

---

## Troubleshooting Guide

### Problem: TTFT is 10+ seconds

**Symptoms**: TTFT degrades from expected 200-500ms to 10+ seconds

**Root Cause**: RDMA not active, falling back to TCP

**Diagnosis**:
```bash
kubectl logs <worker-pod> | grep -i "transport\|UCX\|TCP"
```

**Solutions**:
1. Verify RDMA device plugin is installed
2. Add `rdma/ib` resource requests to pod spec
3. Add `IPC_LOCK` capability
4. Set UCX environment variables

### Problem: "Unsupported operation" errors

**Symptoms**: Logs show `Unexpected UCX error: Unsupported operation`

**Root Cause**: UCX attempting GPU RDMA on hardware that doesn't support it

**Solutions**:
1. Check if GPUDirect RDMA is enabled: `ucx_info -d | grep cuda`
2. If not supported, set `UCX_RNDV_THRESH=inf` to disable GPU RDMA
3. Verify `nvidia-peermem` module is loaded

### Problem: AWS EFA not using GPU Direct

**Symptoms**: 3x performance degradation on AWS despite EFA configured

**Root Cause**: GPU Direct RDMA not functional on kernel ≥6.8 with EFA

**Current Status**: This is a known limitation. Options:
1. Use kernel before 6.8 (Ubuntu 22.04 with kernel 5.15)
2. Accept host-staging performance penalty
3. Wait for AWS to update EFA DMA-BUF support

### Problem: Intermittent transfer failures

**Symptoms**: Sporadic `getXferStatus: backend 'UCX' returned error status`

**Diagnosis**:
```bash
# Enable UCX debug logging
kubectl set env deployment/<worker> UCX_LOG_LEVEL=debug
kubectl logs <worker-pod> | grep -i error
```

**Common causes**:
- Network congestion or packet loss
- Mismatched UCX versions between pods
- RDMA resource exhaustion

---

## Quick Reference

### Minimum Viable RDMA Configuration

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

securityContext:
  capabilities:
    add: ["IPC_LOCK"]

resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"
```

### Diagnostic Checklist

- [ ] `rdma/ib` resources visible: `kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'`
- [ ] UCX sees RDMA devices: `ucx_info -d | grep "Transport: rc"`
- [ ] UCX sees GPU memory: `ucx_info -d | grep "memory types.*cuda"`
- [ ] NIXL initialized with UCX: `kubectl logs <pod> | grep "Backend UCX"`
- [ ] Transfer bandwidth > 1 GB/s (check Grafana metrics)

---

## Related Documentation

- [Disaggregated Serving Architecture](/dynamo/design-docs/disaggregated-serving)
- [AIConfigurator Deployment Guide](/dynamo/user-guides/disaggregated-serving)
- [NIXL Benchmark Deployment](../../deploy/pre-deployment/nixl/README.md)
- [KV Cache Transfer Methods](/dynamo/additional-resources/tensor-rt-llm-details/kv-cache-transfer)