Disaggregated Inference Communication Guide

Best practices for prefill/decode worker communication on Kubernetes


This guide explains how prefill and decode workers communicate in Dynamo’s disaggregated inference architecture on Kubernetes. It answers the frequently asked question: Why can’t prefill and decode workers use NVLink to communicate on the same node?

Summary

  • NVLink cannot be used between Kubernetes pods due to process isolation and GPU partitioning
  • RDMA (InfiniBand/RoCE) is required for production disaggregated deployments
  • Without RDMA, expect 200-500x performance degradation in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA
  • UCX is the communication layer that NIXL uses to transfer KV cache between workers

Architecture Overview

Communication Stack

[Figure: Disaggregated inference communication stack showing NIXL, UCX, and transport layers]

Component Responsibilities

| Component | Role | Location |
|-----------|------|----------|
| NIXL | High-level KV cache transfer API | Dynamo runtime library |
| UCX | Low-level communication framework | System library |
| Transports | Physical data movement | Hardware/kernel drivers |

The Fundamental Constraint

NVLink is a direct GPU-to-GPU interconnect that operates at the hardware level. It requires:

  1. Same process - Both GPUs must be visible to a single process so cudaDeviceEnablePeerAccess() can be called
  2. Direct memory access - Process must have permission to access both GPU memory regions
  3. Peer-to-peer mapping - CUDA runtime must establish memory mappings between GPUs

Kubernetes pods violate all three requirements:

[Figure: Why NVLink cannot work between Kubernetes pods due to process isolation]

Technical Explanation

  1. Process Isolation: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B’s memory space.

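  2. GPU Partitioning: The Kubernetes device plugin exposes only the assigned GPUs to each pod's containers (typically through the NVIDIA_VISIBLE_DEVICES environment variable consumed by the NVIDIA container runtime), and each container numbers its devices starting from 0. Pod A’s GPU 0 and Pod B’s GPU 0 are therefore physically different devices.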

  3. Process/Namespace Isolation: Each pod runs in a separate process namespace. NVLink peer-to-peer transfers require both GPUs to be within the same process so cudaDeviceEnablePeerAccess() can be called.

  4. Memory Registration: NVLink transfers use cudaMemcpy with peer access enabled. This requires calling cudaDeviceEnablePeerAccess() - impossible across process boundaries.
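The GPU partitioning point can be checked directly. The following is a minimal sketch, assuming `nvidia-smi` is available inside the worker containers; the helper names `gpu_uuids` and `shared_gpus` are hypothetical and not part of Dynamo. It extracts GPU UUIDs from `nvidia-smi -L` output so that the device lists of two pods can be compared.

```shell
# Extract GPU UUIDs from `nvidia-smi -L` output read on stdin.
# Each GPU line looks like: GPU 0: <name> (UUID: GPU-xxxxxxxx-...)
gpu_uuids() {
    sed -n 's/.*(UUID: \(GPU-[^)]*\)).*/\1/p' | sort
}

# Print the UUIDs that appear in both sorted lists (files $1 and $2).
# An empty result means the two pods were assigned disjoint GPUs.
shared_gpus() {
    comm -12 "$1" "$2"
}

# Example usage (pod names are placeholders):
#   kubectl exec <prefill-pod> -- nvidia-smi -L | gpu_uuids > prefill.txt
#   kubectl exec <decode-pod>  -- nvidia-smi -L | gpu_uuids > decode.txt
#   shared_gpus prefill.txt decode.txt
```

Both pods usually report a "GPU 0", but the UUIDs differ, which is why neither process can enable peer access to the other pod's device.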

NVLink works within a pod for parallelism strategies (TP, EP) where all GPUs are in the same process:

# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"  # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"  # NVLink used for TP/EP communication within pod

Supported Communication Options

Transport Comparison

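| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
|-----------|-----------|---------|-----------|------------|------------|
| NVLink | 450-900 GB/s | ~µs | ✅ (intra-pod only) | ❌ | ✅ |
| InfiniBand RDMA | 20-50 GB/s | ~1 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| RoCE RDMA | 10-25 GB/s | ~2 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| TCP | 1-3 GB/s | ~50 µs | ✅ | ✅ | ❌ (host staging) |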
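To get a feel for what these bandwidth figures mean for a KV cache transfer, the sketch below divides a payload size by a transport bandwidth. It is a back-of-the-envelope estimate only: it ignores latency, protocol overhead, and host staging, and the 2 GB payload in the usage comment is a hypothetical value, not a measurement from this guide.

```shell
# transfer_ms SIZE_GB BANDWIDTH_GB_PER_S
# Prints the idealized transfer time in milliseconds (size / bandwidth).
transfer_ms() {
    awk -v size="$1" -v bw="$2" 'BEGIN { printf "%.0f\n", size / bw * 1000 }'
}

# Example: a hypothetical 2 GB KV cache
#   transfer_ms 2 25   # InfiniBand-class bandwidth -> 80
#   transfer_ms 2 1    # TCP-class bandwidth        -> 2000
```

Even this idealized estimate shows more than an order of magnitude between RDMA-class and TCP-class transports before any CPU staging cost is added.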

Same-Node Communication

When prefill and decode workers are on the same physical node:

[Figure: Same-node RDMA communication between prefill and decode pods]

Options (best to worst):

  1. InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
  2. RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
  3. Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
  4. TCP (fallback) → GPU→CPU→TCP→CPU→GPU

Best Practice: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.

Cross-Node Communication

When prefill and decode workers are on different nodes:

[Figure: Cross-node RDMA communication between prefill and decode pods on separate nodes]

Requirements for optimal cross-node performance:

  • InfiniBand or RoCE network fabric
  • GPUDirect RDMA enabled (GPU memory registered with NIC)
  • Proper UCX configuration

UCX Configuration Reference

Environment Variables

UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.

Core Transport Selection

env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"

| Transport | Description | When to Use |
|-----------|-------------|-------------|
| rc_x | Reliable Connection (accelerated) | Primary RDMA transport |
| rc | Reliable Connection (standard) | Fallback RDMA |
| dc_x | Dynamically Connected (accelerated) | Scalable RDMA (many endpoints) |
| dc | Dynamically Connected (standard) | Fallback scalable RDMA |
| cuda_copy | GPU↔Host memory staging | Required for GPU buffers |
| cuda_ipc | CUDA IPC (same-node, same-pod) | Intra-pod GPU transfers |
| tcp | TCP sockets | Fallback when RDMA unavailable |
| srd | Scalable Reliable Datagram (AWS EFA) | AWS-specific (provided by EFA, not core UCX) |

Excluding transports: Prefix the list with ^ to exclude transports instead of selecting them (e.g., UCX_TLS=^mm excludes the memory-mapped shared-memory transports).

Note: When specifying UCX_TLS explicitly with GPU memory, you must include cuda_copy or cuda_ipc for UCX to recognize GPU buffers.
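The note above lends itself to a quick pre-flight check. This is a minimal sketch with a hypothetical helper name, `ucx_tls_has_cuda`; it inspects only the string value and does not query UCX itself. Exclusion lists (values starting with `^`) are reported as "not checked" rather than guessed at.

```shell
# ucx_tls_has_cuda VALUE
# Returns 0 if an explicit UCX_TLS list names cuda_copy or cuda_ipc,
# 1 if it names neither, and 2 for exclusion lists (not checked here).
ucx_tls_has_cuda() {
    case "$1" in
        '^'*) return 2 ;;
    esac
    case ",$1," in
        *,cuda_copy,*|*,cuda_ipc,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage inside a worker pod:
#   ucx_tls_has_cuda "$UCX_TLS" || echo "UCX_TLS may not cover GPU buffers"
```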

Rendezvous Protocol Settings

env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

| Variable | Value | Description |
|----------|-------|-------------|
| UCX_RNDV_SCHEME | get_zcopy | Zero-copy RDMA GET (receiver pulls data) |
| UCX_RNDV_SCHEME | put_zcopy | Zero-copy RDMA PUT (sender pushes data) |
| UCX_RNDV_SCHEME | auto | Let UCX choose based on message size |
| UCX_RNDV_THRESH | 0 | Use rendezvous for all message sizes |
| UCX_RNDV_THRESH | 8192 | Use rendezvous for messages ≥8KB |
| UCX_RNDV_THRESH | auto | Let UCX calculate optimal threshold |

Recommendation: Use get_zcopy with threshold 0 for KV cache transfers (always large).
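As an illustration of the threshold rule in the table, the sketch below models only the documented behavior: a message uses the rendezvous protocol when its size is at least UCX_RNDV_THRESH. The helper name `uses_rendezvous` is hypothetical, only plain byte counts are handled, and the `auto` case is deliberately left unmodeled because UCX decides it at runtime.

```shell
# uses_rendezvous MESSAGE_BYTES THRESHOLD
# Returns 0 if the message would use rendezvous, 1 if not,
# and 2 when the threshold is "auto" (decided by UCX, not modeled).
uses_rendezvous() {
    case "$2" in
        inf)  return 1 ;;
        auto) return 2 ;;
    esac
    [ "$1" -ge "$2" ]
}

# Example: with threshold 0, every transfer qualifies
#   uses_rendezvous 134217728 0 && echo "rendezvous"
```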

⚠️ AWS EFA Exception: Do NOT use get_zcopy on AWS with Ubuntu 24.04 + Kernel ≥6.8. See AWS EFA Configuration for required settings.

Memory Registration

env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"

| Method | Description |
|--------|-------------|
| odp | On-Demand Paging (dynamic registration) |
| rcache | Registration cache (reuse registrations) |
| direct | Direct registration (each transfer) |

Debugging and Diagnostics

env:
  - name: UCX_LOG_LEVEL
    value: "info"  # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log"  # Optional: log to file instead of stdout

Note: UCX statistics (UCX_STATS_DEST, UCX_STATS_TRIGGER) require UCX compiled with --enable-stats flag, which is not enabled in default builds.

Complete Production Configuration

env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"

  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"

  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3"  # RoCE v2 GID index (cluster-specific)

AWS EFA Configuration

⚠️ Critical: Zero-Copy RDMA causes crashes on AWS Kernel 6.8+

On AWS Ubuntu 24.04 with Kernel ≥6.8, using UCX_RNDV_SCHEME=get_zcopy triggers a fatal NIXL_ERR_BACKEND crash. The EFA provider cannot register CUDA memory due to incomplete DMA-BUF support in efa_nv_peermem.

You MUST use the configuration below — do not copy the standard InfiniBand settings.

Note: NIXL is migrating from UCX to libfabric for AWS. The Dynamo team is transitioning NIXL to use libfabric instead of UCX for AWS EFA deployments. This change is driven by:

  • Better topology awareness: libfabric provides hierarchical topology awareness similar to NCCL
  • Native EFA support: libfabric is the recommended communication layer for AWS EFA

Current status: UCX over EFA works but is not recommended for production. Published AWS examples are functional but not performant. Check with the Dynamo team for libfabric availability timeline.

Required AWS EFA Configuration (Ubuntu 24.04 + Kernel ≥6.8):

env:
  - name: UCX_TLS
    value: "srd,cuda_copy,tcp"  # SRD is EFA's RDMA transport
  - name: UCX_RNDV_SCHEME
    value: "auto"  # DO NOT use get_zcopy - causes crashes
  - name: UCX_RNDV_THRESH
    value: "8192"  # Avoid CUDA zero-copy for large transfers

Why these settings are mandatory:

  • UCX_RNDV_SCHEME=auto prevents UCX from forcing zero-copy RDMA on CUDA buffers
  • UCX_RNDV_THRESH=8192 ensures large KV cache transfers use host-staging instead of GPU-direct (which fails)
  • Using get_zcopy or threshold 0 will cause remote invalid RD request errors and worker crashes
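Because the workaround is keyed to the kernel version, a small version check can help decide which settings a node needs. The sketch below is an assumption-laden helper, not part of Dynamo: the names `kernel_at_least` and `needs_efa_workaround` are hypothetical, it parses only the major and minor numbers of a `uname -r` style string, and it does not verify the Ubuntu release or that the node actually uses EFA.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# RELEASE is a `uname -r` style string such as 6.8.0-1015-aws.
kernel_at_least() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Succeeds when the kernel is 6.8 or newer, the range this guide
# flags as incompatible with get_zcopy on AWS EFA.
needs_efa_workaround() {
    kernel_at_least "$1" 6 8
}

# Example usage on a node:
#   needs_efa_workaround "$(uname -r)" && echo "use UCX_RNDV_SCHEME=auto"
```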

Known Limitations:

  • GPU Direct RDMA is non-functional on AWS EFA with Ubuntu 24.04 + kernel ≥6.8
  • Expect 3x performance degradation compared to InfiniBand (host-staged transfers)
  • For optimal disaggregated performance, consider clusters with InfiniBand/RoCE, or wait for libfabric support on AWS

Deployment Configuration

Kubernetes Resource Requirements

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]  # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2"  # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"

Required Capabilities and Resources

| Setting | Purpose | Notes |
|---------|---------|-------|
| IPC_LOCK capability | Pin memory for RDMA | Bypasses RLIMIT_MEMLOCK; required for ibv_reg_mr() to pin GPU/host buffers |
| rdma/ib resources | RDMA NIC access | Provided by RDMA device plugin |
| sharedMemory.size | IPC between processes | 16Gi for vLLM, 80Gi for TRT-LLM |
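Whether the IPC_LOCK capability actually reached the container can be confirmed from the process's effective capability mask. The sketch below assumes a Linux container where `/proc/self/status` exposes a `CapEff` hex mask; the helper name `has_ipc_lock` is hypothetical. CAP_IPC_LOCK is capability number 14, so the check tests bit 14 (0x4000).

```shell
# has_ipc_lock CAPEFF_HEX
# Succeeds when bit 14 (CAP_IPC_LOCK) is set in the hex capability mask.
has_ipc_lock() {
    [ $(( 0x$1 & 0x4000 )) -ne 0 ]
}

# Example usage inside a worker pod:
#   has_ipc_lock "$(awk '/^CapEff/ {print $2}' /proc/self/status)" \
#       && echo "IPC_LOCK present" || echo "IPC_LOCK missing"
```

If the check fails, RDMA memory registration is limited by RLIMIT_MEMLOCK and pinning large KV cache buffers is likely to fail.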

Infrastructure Prerequisites

  1. RDMA Device Plugin: Exposes rdma/ib resources to Kubernetes

    $ kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'
  2. InfiniBand/RoCE Network: Physical RDMA fabric connecting nodes

  3. GPUDirect RDMA (optional but recommended):

    • NVIDIA driver with GPUDirect enabled
    • nvidia-peermem kernel module loaded
    • NIC firmware supporting GPUDirect

Diagnostics and Performance Validation

Pre-Deployment Validation

1. Verify RDMA Availability

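# Check RDMA devices on node
# (ibv_devinfo is provided by the ibverbs-utils package; install it in the
# debug container if the image does not include it)
$ kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
$ ibv_devinfo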

Expected output shows InfiniBand or RoCE devices:

hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.35.2000
...

2. Check UCX Transport Capabilities

# Inside a Dynamo worker pod
$ ucx_info -d

Look for GPU memory support:

# Memory domain: mlx5_0
# Component: ib
# memory types: host (access,reg,cache), cuda (access,reg,cache)
# ^^^^ GPU memory supported

If the memory types line lists only host, GPUDirect RDMA is not working and KV transfers will use host staging.

3. Test UCX Performance

# Server (on decode worker pod)
$ ucx_perftest -t tag_bw -n 100 -s 134217728

# Client (on prefill worker pod)
$ ucx_perftest <server-ip> -t tag_bw -n 100 -s 134217728

Expected bandwidth:

  • InfiniBand HDR: 20-25 GB/s per port
  • RoCE 100GbE: 10-12 GB/s
  • TCP fallback: 1-2 GB/s
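The expected figures follow from the link rates: InfiniBand HDR runs at 200 Gb/s per port and 100GbE at 100 Gb/s, and dividing by 8 gives the theoretical ceiling in GB/s. The sketch below converts a link rate and reports how much of that ceiling a measured result reaches. The helper names are hypothetical, and the 22 GB/s figure in the usage comment is an illustrative value, not a measurement.

```shell
# line_rate_gbytes LINK_GBIT_PER_S
# Prints the theoretical ceiling in GB/s (bits to bytes).
line_rate_gbytes() {
    awk -v gbit="$1" 'BEGIN { printf "%.1f\n", gbit / 8 }'
}

# link_efficiency MEASURED_GB_PER_S LINK_GBIT_PER_S
# Prints the measured bandwidth as a percentage of the ceiling.
link_efficiency() {
    awk -v bw="$1" -v gbit="$2" 'BEGIN { printf "%.0f\n", bw / (gbit / 8) * 100 }'
}

# Example:
#   line_rate_gbytes 200      # HDR ceiling   -> 25.0
#   line_rate_gbytes 100      # 100GbE ceiling -> 12.5
#   link_efficiency 22 200    # illustrative measurement -> 88
```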

NIXL Benchmark Tool

Deploy the NIXL benchmark to validate end-to-end KV transfer performance:

$ cd deploy/pre-deployment/nixl
$ ./build_and_deploy.sh

This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.

Runtime Diagnostics

Verify NIXL Backend Initialization

$ kubectl logs <worker-pod> | grep -i "NIXL\|UCX"

Good output:

NIXL INFO Backend UCX was instantiated

Bad output (RDMA not working):

UCX WARN no RDMA transports available
NIXL INFO falling back to TCP transport
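The good and bad patterns above can be turned into a simple log classifier. This is a minimal sketch that matches only the message fragments quoted in this section; the helper name `classify_nixl_log` is hypothetical, and real log wording may differ between NIXL and UCX versions.

```shell
# classify_nixl_log
# Reads worker log text on stdin and prints one of:
#   tcp-fallback, ucx-backend-up, unknown
# Fallback messages are checked first because a log can contain both.
classify_nixl_log() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q -e 'falling back to TCP' -e 'no RDMA transports available'; then
        echo tcp-fallback
    elif printf '%s\n' "$log" | grep -q 'Backend UCX was instantiated'; then
        echo ucx-backend-up
    else
        echo unknown
    fi
}

# Example usage:
#   kubectl logs <worker-pod> | classify_nixl_log
```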

Monitor Transfer Performance

Check Grafana dashboards for:

  • NIXL transfer bandwidth: Should show GB/s, not MB/s
  • KV cache transfer latency: Should be under 500ms for typical workloads

Red flags indicating RDMA issues:

  • Transfer bandwidth under 1 GB/s
  • TTFT > 10 seconds
  • Unsupported operation errors in logs
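The two numeric red flags can be checked mechanically once the values are read off a dashboard. The sketch below applies the thresholds from the list above (under 1 GB/s, over 10 seconds); the helper name `rdma_red_flags` is hypothetical and the inputs are assumed to be supplied by the operator, since metric names are deployment-specific.

```shell
# rdma_red_flags BANDWIDTH_GB_PER_S TTFT_SECONDS
# Prints one line per red flag and returns non-zero if any was raised.
rdma_red_flags() {
    awk -v bw="$1" -v ttft="$2" 'BEGIN {
        n = 0
        if (bw < 1)    { print "transfer bandwidth under 1 GB/s"; n++ }
        if (ttft > 10) { print "TTFT over 10 seconds"; n++ }
        exit (n > 0)
    }'
}

# Example usage:
#   rdma_red_flags 0.3 98 || echo "investigate RDMA configuration"
```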

Common Diagnostic Commands

# Check UCX transport selection
$ kubectl exec <pod> -- env | grep UCX

# Verify RDMA device visibility
$ kubectl exec <pod> -- ls /dev/infiniband/

# Check GPUDirect RDMA status (on node)
$ kubectl debug node/<node> -it --image=ubuntu:22.04 -- \
>   nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"

# Test basic connectivity between pods
$ kubectl exec <prefill-pod> -- ping -c 3 <decode-pod-ip>

Performance Expectations

KV Cache Transfer Overhead

| Configuration | TTFT Overhead | Source |
|---------------|---------------|--------|
| Aggregated (baseline) | 0 | No KV transfer needed |
| Disagg + InfiniBand RDMA with GPUDirect | +200-500ms | Expected based on hardware specs |
| Disagg + RoCE RDMA with GPUDirect | +300-800ms | Expected based on hardware specs |
| Disagg + Host-staged (no GPUDirect) | +1-3s | Expected - CPU bottleneck |
| Disagg + AWS EFA (without GPUDirect) | ~3x slower than aggregated | Measured on AWS p5.48xlarge |
| Disagg + TCP fallback | +90-100s | Measured ~98s TTFT on AWS p5.48xlarge |

Note: InfiniBand/RoCE numbers with GPUDirect are expected values based on hardware specifications and have not been validated. AWS measurements reflect EFA without functional GPUDirect RDMA (see AWS EFA Configuration for details).

When Disaggregated Makes Sense

Use disaggregated architecture when:

  • Output sequence length (OSL) > 1000 tokens (overhead amortized)
  • You need independent scaling of prefill vs decode capacity
  • Prefill and decode have different hardware requirements

Use aggregated architecture when:

  • Low-latency TTFT is critical
  • Short outputs (OSL under 500 tokens)
  • RDMA is not available

Break-Even Analysis

The KV transfer overhead is amortized across output tokens. Example data from Llama-3.1-8B-Instruct on AWS p5.48xlarge:

Total Latency = TTFT + (OSL × ITL)

Example (Llama-3.1-8B, ISL=4000):
  - Aggregated:    218ms + (OSL × 8.0ms)
  - Disaggregated: 2400ms + (OSL × 7.8ms)

Break-even: 2400 - 218 = 2182ms overhead
            2182ms / (8.0 - 7.8)ms per token = 10,910 tokens

At OSL=2000: Disagg is 1.1x slower (acceptable)
At OSL=100:  Disagg is 3.1x slower (not recommended)
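The same arithmetic can be reproduced for other measurements. The sketch below implements the Total Latency formula above with two hypothetical helpers, `break_even_osl` and `slowdown`; all inputs are in milliseconds and come from your own benchmarks. It assumes disaggregated TTFT is higher than aggregated TTFT, as in the example.

```shell
# break_even_osl AGG_TTFT DISAGG_TTFT AGG_ITL DISAGG_ITL
# Prints the output length at which both modes have equal total latency,
# or "never" if disaggregated ITL is not lower than aggregated ITL.
break_even_osl() {
    awk -v at="$1" -v dt="$2" -v ai="$3" -v di="$4" 'BEGIN {
        if (ai <= di) { print "never"; exit }
        printf "%.0f\n", (dt - at) / (ai - di)
    }'
}

# slowdown OSL AGG_TTFT DISAGG_TTFT AGG_ITL DISAGG_ITL
# Prints disaggregated total latency divided by aggregated total latency.
slowdown() {
    awk -v osl="$1" -v at="$2" -v dt="$3" -v ai="$4" -v di="$5" 'BEGIN {
        printf "%.1f\n", (dt + osl * di) / (at + osl * ai)
    }'
}

# Values from the example above:
#   break_even_osl 218 2400 8.0 7.8    # -> 10910
#   slowdown 2000 218 2400 8.0 7.8     # -> 1.1
#   slowdown 100 218 2400 8.0 7.8      # -> 3.1
```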

Troubleshooting Guide

Problem: TTFT is 10+ seconds

Symptoms: TTFT degrades from expected 200-500ms to 10+ seconds

Root Cause: RDMA not active, falling back to TCP

Diagnosis:

$ kubectl logs <worker-pod> | grep -i "transport\|UCX\|TCP"

Solutions:

  1. Verify RDMA device plugin is installed
  2. Add rdma/ib resource requests to pod spec
  3. Add IPC_LOCK capability
  4. Set UCX environment variables

Problem: “Unsupported operation” errors

Symptoms: Logs show Unexpected UCX error: Unsupported operation

Root Cause: UCX attempting GPU RDMA on hardware that doesn’t support it

Solutions:

  1. Check if GPUDirect RDMA is enabled: ucx_info -d | grep cuda
  2. If not supported, set UCX_RNDV_THRESH=inf to disable GPU RDMA
  3. Verify nvidia-peermem module is loaded

Problem: AWS EFA not using GPU Direct

Symptoms: 3x performance degradation on AWS despite EFA configured

Root Cause: GPU Direct RDMA not functional on kernel ≥6.8 with EFA

Current Status: This is a known limitation. Options:

  1. Use a kernel older than 6.8 (for example, Ubuntu 22.04 with kernel 5.15)
  2. Accept host-staging performance penalty
  3. Wait for AWS to update EFA DMA-BUF support

Problem: Intermittent transfer failures

Symptoms: Sporadic getXferStatus: backend 'UCX' returned error status

Diagnosis:

# Enable UCX debug logging
$ kubectl set env deployment/<worker> UCX_LOG_LEVEL=debug
$ kubectl logs <worker-pod> | grep -i error

Common causes:

  • Network congestion or packet loss
  • Mismatched UCX versions between pods
  • RDMA resource exhaustion

Quick Reference

Minimum Viable RDMA Configuration

env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

securityContext:
  capabilities:
    add: ["IPC_LOCK"]

resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"

Diagnostic Checklist

  • rdma/ib resources visible: kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'
  • UCX sees RDMA devices: ucx_info -d | grep "Transport: rc"
  • UCX sees GPU memory: ucx_info -d | grep "memory types.*cuda"
  • NIXL initialized with UCX: kubectl logs <pod> | grep "Backend UCX"
  • Transfer bandwidth > 1 GB/s (check Grafana metrics)