Disaggregated Inference Communication Guide | NVIDIA Dynamo Documentation

This guide explains how prefill and decode workers communicate in Dynamo’s disaggregated inference architecture on Kubernetes. It answers the frequently asked question: Why can’t prefill and decode workers use NVLink to communicate on the same node?

Summary

NVLink cannot be used between Kubernetes pods due to process isolation and GPU partitioning
RDMA (InfiniBand, RoCE, or AWS EFA) is required for production disaggregated deployments
Without RDMA, expect 200-500x performance degradation in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA
UCX and libfabric are the communication layers that NIXL uses to transfer KV cache between workers
Topology-aware KV transfer can constrain or bias decode routing so KV transfers stay within a selected topology domain such as zone or rack. See Topology-Aware KV Transfer.
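The 200-500x figure can be checked from the TTFT values quoted above. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and not part of Dynamo.

```python
# TTFT values quoted in the summary above, in seconds.
TCP_TTFT_S = 98.0
RDMA_TTFT_RANGE_S = (0.2, 0.5)  # ~200-500 ms


def slowdown_range(slow_s, fast_range_s):
    """Return the (min, max) slowdown factor of slow_s relative to a fast range."""
    fastest, slowest = min(fast_range_s), max(fast_range_s)
    return slow_s / slowest, slow_s / fastest


low, high = slowdown_range(TCP_TTFT_S, RDMA_TTFT_RANGE_S)
print(f"TCP is roughly {low:.0f}x to {high:.0f}x slower than RDMA")
```

With these inputs the ratio lands at about 196x to 490x, which the summary rounds to 200-500x.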

Architecture Overview

Communication Stack

Component Responsibilities

Component	Role	Location
NIXL	High-level KV cache transfer API	Dynamo runtime library
UCX or libfabric	Low-level communication framework	System library
Transports	Physical data movement	Hardware/kernel drivers

Why NVLink Cannot Be Used Between Pods

The Fundamental Constraint

NVLink is a direct GPU-to-GPU interconnect that operates at the hardware level. It requires:

Same process - Both GPUs must be visible to a single process so cudaDeviceEnablePeerAccess() can be called
Direct memory access - Process must have permission to access both GPU memory regions
Peer-to-peer mapping - CUDA runtime must establish memory mappings between GPUs

Kubernetes pods violate all three requirements:

[Figure: Why NVLink cannot work between Kubernetes pods due to process isolation]

Technical Explanation

Process Isolation: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B’s memory space.
GPU Partitioning: The Kubernetes device plugin assigns specific GPUs to each pod via CUDA_VISIBLE_DEVICES. Pod A’s GPU 0 and Pod B’s GPU 0 are physically different devices.
Single-Process Requirement: Each pod runs in its own process namespace, while NVLink peer-to-peer transfers require both GPUs to be visible to the same process so that cudaDeviceEnablePeerAccess() can be called.
Memory Registration: NVLink transfers use cudaMemcpy with peer access enabled. This requires calling cudaDeviceEnablePeerAccess() - impossible across process boundaries.

Where NVLink DOES Work

NVLink works within a pod for parallelism strategies (TP, EP) where all GPUs are in the same process:

# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"   # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"        # NVLink used for TP/EP communication within pod

Supported Communication Options

Transport Comparison

Transport	Bandwidth	Latency	Same-Node	Cross-Node	GPU Direct
NVLink	450-900 GB/s	~µs	✅ (intra-pod only)	❌	✅
InfiniBand RDMA	20-50 GB/s	~1 µs	✅	✅	✅ (with GPUDirect)
RoCE RDMA	10-25 GB/s	~2 µs	✅	✅	✅ (with GPUDirect)
TCP	1-3 GB/s	~50 µs	✅	✅	❌ (host staging)
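To put the bandwidth column in perspective, an idealized transfer time is payload size divided by bandwidth. The sketch below applies that to the lower bound of each range in the table. The 1 GB payload is a hypothetical value chosen only for illustration, and the model ignores latency, protocol overhead, and any overlap of transfer with compute.

```python
# Lower-bound bandwidths from the transport comparison table, in GB/s.
BANDWIDTH_GB_PER_S = {
    "NVLink": 450.0,
    "InfiniBand RDMA": 20.0,
    "RoCE RDMA": 10.0,
    "TCP": 1.0,
}

PAYLOAD_GB = 1.0  # hypothetical KV cache size, not a measured value


def transfer_time_ms(payload_gb, bandwidth_gb_per_s):
    """Idealized transfer time in milliseconds: payload size / bandwidth."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


for name, bandwidth in BANDWIDTH_GB_PER_S.items():
    print(f"{name:16s} {transfer_time_ms(PAYLOAD_GB, bandwidth):8.1f} ms")
```

Under these assumptions the same payload takes about 50 ms over InfiniBand at 20 GB/s and about 1000 ms over TCP at 1 GB/s.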

Same-Node Communication

When prefill and decode workers are on the same physical node:

[Figure: Same-node RDMA communication between prefill and decode pods]

Options (best to worst):

InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
TCP (fallback) → GPU→CPU→TCP→CPU→GPU

Best Practice: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.

Cross-Node Communication

When prefill and decode workers are on different nodes:

[Figure: Cross-node RDMA communication between prefill and decode pods on separate nodes]

Requirements for optimal cross-node performance:

RDMA network fabric (InfiniBand, RoCE, or AWS EFA)
GPUDirect RDMA enabled (GPU memory registered with NIC)
Proper UCX or libfabric configuration

UCX Configuration Reference

Environment Variables

UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.

Core Transport Selection

env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"

Transport	Description	When to Use
`rc_x`	Reliable Connection (accelerated)	Primary RDMA transport
`rc`	Reliable Connection (standard)	Fallback RDMA
`dc_x`	Dynamically Connected (accelerated)	Scalable RDMA (many endpoints)
`dc`	Dynamically Connected (standard)	Fallback scalable RDMA
`cuda_copy`	GPU↔Host memory staging	Required for GPU buffers
`cuda_ipc`	CUDA IPC (same-node, same-pod)	Intra-pod GPU transfers
`tcp`	TCP sockets	Fallback when RDMA unavailable
`srd`	Scalable Reliable Datagram (AWS EFA)	AWS-specific (provided by EFA, not core UCX)

Excluding transports: Use the ^ prefix to exclude transports (e.g., UCX_TLS=^mm excludes the mm shared-memory transports).

Note: When specifying UCX_TLS explicitly with GPU memory, you must include cuda_copy or cuda_ipc for UCX to recognize GPU buffers.
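The rule in the note above is easy to check mechanically before deploying. The following is a minimal sketch of such a check. The function name is hypothetical, and it only understands literal transport names, not UCX aliases that may expand to CUDA transports.

```python
def check_ucx_tls(ucx_tls):
    """Return a list of warnings for a UCX_TLS value used with GPU memory.

    Encodes only the rule from the note above: an explicit transport list
    should contain cuda_copy or cuda_ipc.
    """
    value = ucx_tls.strip()
    if not value or value == "all" or value.startswith("^"):
        # Unset, "all", or an exclusion list: transports are not enumerated
        # explicitly, so the rule does not apply.
        return []
    transports = {t.strip() for t in value.split(",") if t.strip()}
    if not transports & {"cuda_copy", "cuda_ipc"}:
        return ["UCX_TLS lists transports explicitly but has no cuda_copy or cuda_ipc"]
    return []


print(check_ucx_tls("rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"))  # no warnings
print(check_ucx_tls("rc_x,rc"))                             # one warning
```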

Rendezvous Protocol Settings

env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

Variable	Value	Description
`UCX_RNDV_SCHEME`	`get_zcopy`	Zero-copy RDMA GET (receiver pulls data)
`UCX_RNDV_SCHEME`	`put_zcopy`	Zero-copy RDMA PUT (sender pushes data)
`UCX_RNDV_SCHEME`	`auto`	Let UCX choose based on message size
`UCX_RNDV_THRESH`	`0`	Use rendezvous for all message sizes
`UCX_RNDV_THRESH`	`8192`	Use rendezvous for messages ≥8KB
`UCX_RNDV_THRESH`	`auto`	Let UCX calculate optimal threshold

Recommendation: Use get_zcopy with threshold 0 for KV cache transfers (always large).

⚠️ AWS EFA Exception: Do NOT use get_zcopy on AWS with Ubuntu 24.04 + Kernel ≥6.8. See AWS EFA Configuration for required settings.

Memory Registration

env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"

Method	Description
`odp`	On-Demand Paging (dynamic registration)
`rcache`	Registration cache (reuse registrations)
`direct`	Direct registration (each transfer)

Debugging and Diagnostics

env:
  - name: UCX_LOG_LEVEL
    value: "info"        # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log" # Optional: log to file instead of stdout

Note: UCX statistics (UCX_STATS_DEST, UCX_STATS_TRIGGER) require UCX compiled with --enable-stats flag, which is not enabled in default builds.

Complete Production Configuration

env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"

  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"

  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3"           # RoCE v2 GID index (cluster-specific)

InfiniBand Configuration

For clusters with InfiniBand RDMA (e.g., ConnectX NICs), use UCX with the rc (Reliable Connection) transport. This is the standard path for on-premises and bare-metal Kubernetes clusters.

RDMA Resources:

Request one rdma/ib device per GPU. The RDMA device plugin injects /dev/infiniband/* devices automatically:

resources:
  limits:
    gpu: "4"
    custom:
      rdma/ib: "4"

No pod annotations are needed. InfiniBand devices are injected by the device plugin.

Security Context:

Add IPC_LOCK and SYS_RESOURCE capabilities. IPC_LOCK allows RDMA memory pinning, SYS_RESOURCE allows memlock limit escalation:

securityContext:
  runAsUser: 0
  capabilities:
    add:
      - IPC_LOCK
      - SYS_RESOURCE

Environment Variables (worker containers):

env:
  # --- UCX (RDMA transport) ---
  - name: UCX_TLS
    value: "rc_x,rc,cuda_copy,cuda_ipc"
  - name: UCX_NET_DEVICES
    value: "<ib-device>:1"       # e.g. "mlx5_0:1"; run `ibv_devinfo` to find your device
  - name: UCX_IB_ADDR_TYPE
    value: "eth"                 # required for cross-pod IB on Kubernetes
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
  - name: UCX_RC_TIMEOUT
    value: "600s"
  - name: UCX_KEEPALIVE_INTERVAL
    value: "300s"

Variable	Description
`UCX_TLS`	`rc_x` (accelerated RC) listed first for optimal RDMA performance
`UCX_NET_DEVICES`	Bind to a specific IB device. Run `ibv_devinfo` inside a pod to list available devices. Use a non-bonded device with a valid LID.
`UCX_IB_ADDR_TYPE`	Must be `eth` for cross-pod communication on Kubernetes. Without this, UCX uses LID-based addressing which does not route between pods.
`UCX_RNDV_SCHEME`	`get_zcopy` enables zero-copy RDMA GET, optimal for large KV cache transfers

Note: UCX_IB_ADDR_TYPE=eth is the most common missing setting when bringing up NIXL disagg on InfiniBand clusters. If NIXL init succeeds but transfers fail with NIXL_ERR_REMOTE_DISCONNECT, this is likely the cause.

Known Issue — Bonded IB devices:

Some clusters expose bonded InfiniBand devices (e.g., mlx5_bond_0) with LID=0. If UCX selects a bonded device, transfers may fail. Verify device LIDs and select a non-bonded device:

# Inside a pod with rdma/ib resources:
$ ibv_devinfo | grep -E "hca_id|lid"
# Use a device with a non-zero LID in UCX_NET_DEVICES
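The device selection can also be scripted. The sketch below parses ibv_devinfo-style text and returns device:port candidates for UCX_NET_DEVICES. The sample text is a shortened, made-up excerpt, the function name is hypothetical, and treating any device whose name contains "bond" as bonded is an assumption based on the mlx5_bond_0 example above.

```python
import re

# Shortened, made-up excerpt in the style of `ibv_devinfo` output.
SAMPLE_IBV_DEVINFO = """\
hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
                port:   1
                        state:                  PORT_ACTIVE (4)
                        sm_lid:                 0
                        port_lid:               0
hca_id: mlx5_0
        transport:                      InfiniBand (0)
                port:   1
                        state:                  PORT_ACTIVE (4)
                        sm_lid:                 1
                        port_lid:               17
"""


def usable_ib_devices(devinfo_text):
    """Return 'device:port' strings for non-bonded devices with a non-zero port LID."""
    usable = []
    device = port = None
    for raw_line in devinfo_text.splitlines():
        line = raw_line.strip()
        if m := re.match(r"hca_id:\s*(\S+)", line):
            device, port = m.group(1), None
        elif m := re.match(r"port:\s*(\d+)", line):
            port = m.group(1)
        elif m := re.match(r"port_lid:\s*(\d+)", line):
            lid = int(m.group(1))
            # Assumption: bonded devices are recognizable by "bond" in the name.
            if device and port and "bond" not in device and lid != 0:
                usable.append(f"{device}:{port}")
    return usable


print(usable_ib_devices(SAMPLE_IBV_DEVINFO))
```

For the sample text this selects mlx5_0:1 and skips mlx5_bond_0, whose port LID is 0.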

AWS EFA Configuration

NIXL supports libfabric as the backend for AWS EFA deployments. This is the recommended approach for disaggregated inference on AWS, achieving ~9.6 GB/s KV transfer bandwidth. See the AWS EFA with NIXL documentation for complete setup instructions.

Requirements:

EFA installer version 1.47.0 or later
Libfabric (installed via EFA installer at /opt/amazon/efa)
GDRCopy for GPU Direct RDMA operations (GPU Operator v26.x installs this automatically)
EFA-enabled container image (e.g., nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1-efa-amd64)

Kernel Compatibility:

GDRCopy v2.5.1 has a build failure on kernel 6.15+ due to a vm_flags_set redefinition. Pin your Ubuntu EKS AMI to kernel 6.14 or earlier until GDRCopy v2.5.2 is available in GPU Operator.

Kernel Version	GDRCopy v2.5.1	GDRCopy v2.5.2
6.14 and below	✅ Works	✅ Works
6.15+	❌ Build fails	✅ Works

Pod Anti-Affinity (Required):

EFA is designed for cross-node communication. Prefill and decode workers must be scheduled on different nodes to avoid EAGAIN errors during KV transfer.

VllmDecodeWorker:
  extraPodSpec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: nvidia.com/dynamo-component
                  operator: In
                  values:
                    - VllmPrefillWorker
            topologyKey: kubernetes.io/hostname

Note: Anti-affinity only needs to be configured on one side (here, the decode worker). The Kubernetes scheduler enforces the constraint symmetrically—if decode cannot be placed with prefill, they will end up on different nodes regardless of which pod has the rule.

EFA Resource Requests:

Request EFA interfaces in your pod spec. The p5.48xlarge instance has 32 EFA interfaces (32 network cards × 1 interface each) with 3200 Gbps total bandwidth. The number of interfaces to allocate per worker depends on your deployment:

Deployment	EFA per Worker	Rationale
1P + 1D per node pair	4	Achieved ~9.6 GB/s; leaves 28 of the node's 32 interfaces for other pods
Multi-worker per node	2-4	Balance between workers sharing the node
Maximum bandwidth	8-16	For very large KV cache transfers or TP>1

Example with 4 EFA interfaces (validated configuration):

extraPodSpec:
  mainContainer:
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        vpc.amazonaws.com/efa: "4"
      requests:
        vpc.amazonaws.com/efa: "4"

Note: NIXL/libfabric automatically stripes traffic across all allocated EFA interfaces. The 4-interface configuration achieved ~9.6 GB/s in testing, which is sufficient for Llama-3.1-8B KV cache transfers at ISL=8000. Increase the count if your workload requires higher bandwidth (e.g., larger models or higher TP).
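When sizing the interface count, it can help to know the nominal line rate behind each allocation. The sketch below derives it from the p5.48xlarge figures quoted above (3200 Gbps across 32 interfaces). It is arithmetic only: the names are illustrative, and nominal line rate is an upper bound, not a prediction of achieved KV transfer bandwidth.

```python
# p5.48xlarge figures quoted in the text above.
TOTAL_EFA_GBPS = 3200
TOTAL_EFA_INTERFACES = 32


def nominal_gb_per_s(interfaces):
    """Nominal line rate in GB/s for a number of EFA interfaces (8 bits per byte)."""
    per_interface_gbps = TOTAL_EFA_GBPS / TOTAL_EFA_INTERFACES
    return interfaces * per_interface_gbps / 8


for count in (2, 4, 8, 16):
    print(f"{count:2d} interfaces: {nominal_gb_per_s(count):5.1f} GB/s nominal")
```

Four interfaces correspond to 50 GB/s nominal, so the measured ~9.6 GB/s sits well below line rate. This suggests that factors other than interface count also bound the achieved bandwidth.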

Environment Variables:

env:
  - name: NIXL_LOG_LEVEL
    value: "INFO"
  - name: LD_LIBRARY_PATH
    value: "/usr/local/nixl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib64:$(LD_LIBRARY_PATH)"

vLLM Configuration:

$ vllm serve <your-model> \
>     --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cuda","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

Parameter	Value	Purpose
`kv_connector`	`NixlConnector`	Enables NIXL for KV-cache transfer
`kv_role`	`kv_both`	Symmetric functionality (producer and consumer)
`kv_buffer_device`	`cuda`	Uses GPU memory for KV-cache buffer
`backends`	`["LIBFABRIC"]`	Routes NIXL traffic over EFA

Verification:

# Confirm EFA/libfabric installation
$ fi_info -p efa -t FI_EP_RDM

# Verify GDRCopy device
$ ls -la /dev/gdrdrv

# Check NIXL initialization in pod logs (should show 32 EFA devices on p5.48xlarge)
$ kubectl logs <worker-pod> | grep -i "NIXL\|libfabric\|efa"

Expected Log Output:

NIXL  INFO Loaded backend plugin: LIBFABRIC
NIXL  INFO Found 32 fabric devices

Deployment Configuration

Kubernetes Resource Requirements

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]      # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2"           # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"

Required Capabilities and Resources

Setting	Purpose	Notes
`IPC_LOCK` capability	Pin memory for RDMA	Bypasses RLIMIT_MEMLOCK; required for `ibv_reg_mr()` to pin GPU/host buffers
`rdma/ib` resources	RDMA NIC access	Provided by RDMA device plugin
`sharedMemory.size`	IPC between processes	16Gi for vLLM, 80Gi for TRT-LLM

Infrastructure Prerequisites

RDMA Device Plugin: Exposes rdma/ib or vpc.amazonaws.com/efa resources to Kubernetes

# InfiniBand/RoCE
$ kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'
# AWS EFA
$ kubectl get nodes -o jsonpath='{.items[*].status.allocatable.vpc\.amazonaws\.com/efa}'

RDMA Network: One of:
- InfiniBand or RoCE fabric
- AWS EFA (Elastic Fabric Adapter)

GPUDirect RDMA (optional but recommended):
- NVIDIA driver with GPUDirect enabled
- nvidia-peermem kernel module loaded (InfiniBand/RoCE)
- GDRCopy installed (AWS EFA with libfabric)

Diagnostics and Performance Validation

Pre-Deployment Validation

1. Verify RDMA Availability

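# Check RDMA devices on node
$ kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
# Inside the debug container (ibv_devinfo comes from the ibverbs-utils package, which the stock ubuntu image does not include)
$ ibv_devinfo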

Expected output shows InfiniBand or RoCE devices:

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.35.2000
        ...

2. Check UCX Transport Capabilities

# Inside a Dynamo worker pod
$ ucx_info -d

Look for GPU memory support:

# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
#                                            ^^^^ GPU memory supported

If you only see host, GPUDirect RDMA is not working and KV transfers will use host staging.
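The same check can be scripted against captured ucx_info -d output. The following is a minimal sketch. The sample text is a made-up excerpt modeled on the snippet above (the second domain is invented to show the failing case), and the function name is hypothetical.

```python
import re

# Made-up excerpt in the style of `ucx_info -d`; mlx5_1 illustrates a
# domain without GPU memory support.
SAMPLE_UCX_INFO = """\
# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
# Memory domain: mlx5_1
#     Component: ib
#     memory types: host (access,reg,cache)
"""


def cuda_capable_domains(ucx_info_text):
    """Map each memory domain to True if its 'memory types' line mentions cuda."""
    result = {}
    domain = None
    for line in ucx_info_text.splitlines():
        if m := re.search(r"Memory domain:\s*(\S+)", line):
            domain = m.group(1)
        elif domain and "memory types:" in line:
            result[domain] = bool(re.search(r"\bcuda\b", line))
    return result


print(cuda_capable_domains(SAMPLE_UCX_INFO))
```

A domain that maps to False corresponds to the host-only case described above.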

3. Test UCX Performance

# Server (on decode worker pod)
$ ucx_perftest -t tag_bw -n 100 -s 134217728

# Client (on prefill worker pod)
$ ucx_perftest <server-ip> -t tag_bw -n 100 -s 134217728

Expected bandwidth:

InfiniBand HDR: 20-25 GB/s per port
RoCE 100GbE: 10-12 GB/s
TCP fallback: 1-2 GB/s

NIXL Benchmark Tool

Deploy the NIXL benchmark to validate end-to-end KV transfer performance:

$ cd deploy/pre-deployment/nixl
$ ./build_and_deploy.sh

This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.

Runtime Diagnostics

Verify NIXL Backend Initialization

$ kubectl logs <worker-pod> | grep -i "NIXL\|UCX"

Good output:

NIXL INFO Backend UCX was instantiated

Bad output (RDMA not working):

UCX WARN no RDMA transports available
NIXL INFO falling back to TCP transport

Monitor Transfer Performance

Check Grafana dashboards for:

NIXL transfer bandwidth: Should show GB/s, not MB/s
KV cache transfer latency: Should be under 500ms for typical workloads

Red flags indicating RDMA issues:

Transfer bandwidth under 1 GB/s
TTFT > 10 seconds
Unsupported operation errors in logs
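The red flags above translate directly into a small check that could run against scraped metrics and logs. This is a sketch only: the function and its inputs are hypothetical, and how bandwidth and TTFT are obtained from Grafana or elsewhere is left out.

```python
def rdma_red_flags(bandwidth_gb_per_s, ttft_s, log_text):
    """Return the red flags from the list above that match the observations."""
    flags = []
    if bandwidth_gb_per_s < 1.0:
        flags.append("transfer bandwidth under 1 GB/s")
    if ttft_s > 10.0:
        flags.append("TTFT above 10 seconds")
    if "unsupported operation" in log_text.lower():
        flags.append("'Unsupported operation' errors in logs")
    return flags


# Healthy-looking observations produce no flags.
print(rdma_red_flags(9.6, 0.21, "NIXL INFO Backend UCX was instantiated"))
# Values typical of a TCP fallback trip two of the three checks.
print(rdma_red_flags(0.1, 98.0, "UCX WARN no RDMA transports available"))
```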

Common Diagnostic Commands

# Check UCX transport selection
$ kubectl exec <pod> -- env | grep UCX

# Verify RDMA device visibility
$ kubectl exec <pod> -- ls /dev/infiniband/

# Check GPUDirect RDMA status (on node)
$ kubectl debug node/<node> -it --image=ubuntu:22.04 -- \
>   nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"

# Test basic connectivity between pods
$ kubectl exec <prefill-pod> -- ping -c 3 <decode-pod-ip>

Performance Expectations

KV Cache Transfer Overhead

Configuration	TTFT Overhead (avg)	KV Transfer BW	Source
Aggregated (baseline)	0	N/A	No KV transfer needed
Disagg + InfiniBand RDMA with GPUDirect	+200-500ms	20-50 GB/s	Expected based on hardware specs
Disagg + RoCE RDMA with GPUDirect	+300-800ms	10-25 GB/s	Expected based on hardware specs
Disagg + AWS EFA with libfabric + GDRCopy	+37ms	~9.6 GB/s	Measured on AWS p5.48xlarge (Llama-3.1-8B, ISL=8000, OSL=50)
Disagg + Host-staged (no GPUDirect)	+1-3s	1-3 GB/s	Expected - CPU bottleneck
Disagg + AWS EFA with UCX (without GPUDirect)	~3x slower than aggregated	~1 GB/s	Measured on AWS p5.48xlarge
Disagg + TCP fallback	+90-100s	~100 MB/s	Measured ~98s TTFT on AWS p5.48xlarge

Note: For AWS EFA deployments, use libfabric with GDRCopy to enable GPUDirect RDMA. UCX on AWS EFA does not support GPUDirect on kernel ≥6.8 and results in severely degraded performance. See AWS EFA Configuration for setup instructions.

When Disaggregated Makes Sense

Use disaggregated architecture when:

Input sequence length (ISL) ≥ 4000 tokens (14-22% throughput gain)
You need independent scaling of prefill vs decode capacity
Prefill and decode have different hardware requirements

Use aggregated architecture when:

Low-latency TTFT is critical
Input sequences under 2000 tokens (minimal disagg benefit)
RDMA is not available
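The two lists above can be read as a simple decision rule. The sketch below encodes it. The function name and the "either" result for 2000-3999 tokens are assumptions, since the guide gives no firm rule for that range, and the independent-scaling and hardware criteria are not modeled.

```python
def suggest_architecture(isl_tokens, rdma_available, ttft_critical=False):
    """Return 'aggregated', 'disaggregated', or 'either' from the rules above."""
    if not rdma_available or ttft_critical:
        return "aggregated"
    if isl_tokens >= 4000:
        return "disaggregated"
    if isl_tokens < 2000:
        return "aggregated"
    return "either"  # 2000-3999 tokens: no firm guidance in the lists above


print(suggest_architecture(8000, rdma_available=True))
print(suggest_architecture(8000, rdma_available=False))
print(suggest_architecture(1000, rdma_available=True))
```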

Break-Even Analysis

The KV transfer overhead is amortized across output tokens. Measured data from Llama-3.1-8B-Instruct on AWS p5.48xlarge with NIXL+libfabric:

KV Transfer Overhead (TTFT min, unqueued):
- Aggregated:    ~173ms
- Disaggregated: ~210ms
- KV transfer cost: ~37ms

Performance at ISL=8000, OSL=50, concurrency=10:
- ITL improvement: 41% faster per-token generation
- Throughput gain: 22% higher output throughput

Key Insight: The KV transfer overhead via libfabric+EFA is only ~37ms. Combined with 41% faster decode (ITL), disaggregated inference delivers 22% higher throughput for prefill-bound workloads.

Metric	Aggregated	Disaggregated	Difference
TTFT (min, unqueued)	173 ms	210 ms	+37ms
TTFT (p95)	2097 ms	1752 ms	-16%
ITL (avg)	28.5 ms	16.9 ms	-41%
Output throughput (ISL=8000, OSL=50)	204 tok/s	248 tok/s	+22%
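The break-even point can be estimated from the table above with a deliberately simple latency model: time to first token plus one inter-token latency for each later token. The sketch below combines the unqueued minimum TTFT with the average ITL, which were measured under different conditions, so the result is a rough illustration rather than a measured value. All names are illustrative.

```python
# Values from the table above (milliseconds).
AGG = {"ttft_ms": 173.0, "itl_ms": 28.5}
DISAGG = {"ttft_ms": 210.0, "itl_ms": 16.9}


def latency_ms(cfg, output_tokens):
    """Simplified request latency: TTFT plus one ITL per token after the first."""
    return cfg["ttft_ms"] + (output_tokens - 1) * cfg["itl_ms"]


def break_even_tokens(agg, disagg, max_tokens=10_000):
    """Smallest output length at which disaggregated latency is strictly lower."""
    for n in range(1, max_tokens + 1):
        if latency_ms(disagg, n) < latency_ms(agg, n):
            return n
    return None


n = break_even_tokens(AGG, DISAGG)
print(f"break-even at {n} output tokens")
print(f"aggregated:    {latency_ms(AGG, n):.1f} ms")
print(f"disaggregated: {latency_ms(DISAGG, n):.1f} ms")
```

Under this model the ~37 ms TTFT penalty is recovered at 11.6 ms per token, so disaggregated serving comes out ahead from the fifth output token onward.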

Disagg advantage scales with input length (ISL) (all at OSL=50, concurrency=10):

ISL	Throughput Δ	ITL Δ	Recommendation
1000	~0%	-7%	Use aggregated
2000	+3%	-11%	Either works
4000	+14%	-18%	Disagg preferred
8000	+22%	-41%	Disagg strongly preferred

Troubleshooting Guide

Problem: TTFT is 10+ seconds

Symptoms: TTFT degrades from expected 200-500ms to 10+ seconds

Root Cause: RDMA not active, falling back to TCP

Diagnosis:

$ kubectl logs <worker-pod> | grep -i "transport\|UCX\|TCP"

Solutions:

Verify RDMA device plugin is installed
Add rdma/ib resource requests to pod spec
Add IPC_LOCK capability
Set UCX environment variables

Problem: “Unsupported operation” errors

Symptoms: Logs show Unexpected UCX error: Unsupported operation

Root Cause: UCX attempting GPU RDMA on hardware that doesn’t support it

Solutions:

Check if GPUDirect RDMA is enabled: ucx_info -d | grep cuda
If not supported, set UCX_RNDV_THRESH=inf to disable GPU RDMA
Verify nvidia-peermem module is loaded

Problem: AWS EFA not using GPU Direct

Symptoms: 3x performance degradation on AWS despite EFA configured

Root Cause: GPU Direct RDMA not functional on kernel ≥6.8 with EFA when using UCX

Solution: Use libfabric instead of UCX for AWS EFA deployments. Libfabric with GDRCopy provides efficient GPU Direct RDMA operations on AWS. See the AWS EFA Configuration section for setup instructions.

Alternative options (if libfabric is not available):

Use kernel before 6.8 (Ubuntu 22.04 with kernel 5.15)
Accept host-staging performance penalty

Problem: EFA EAGAIN errors (fi_read still retrying)

Symptoms: Decode worker logs show repeated EAGAIN errors:

fi_read still retrying EAGAIN on rail 0
fi_read still retrying EAGAIN on rail 1
...

Root Cause: Prefill and decode workers are scheduled on the same node. AWS EFA is designed for cross-node communication and does not function correctly for intra-node transfers.

Diagnosis:

# Check if workers are on the same node
$ kubectl get pods -o wide | grep vllm

If both prefill and decode workers show the same NODE, this is the problem.

Solution: Add pod anti-affinity rules to ensure workers are scheduled on different nodes:

VllmDecodeWorker:
  extraPodSpec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: nvidia.com/dynamo-component
                  operator: In
                  values:
                    - VllmPrefillWorker
            topologyKey: kubernetes.io/hostname

Note: Use nvidia.com/dynamo-component as the label key, not app.kubernetes.io/component. The Dynamo operator uses this label to identify component types.

Problem: NIXL_ERR_BACKEND at create_backend on InfiniBand

Symptoms: NIXL backend creation fails immediately with NIXL_ERR_BACKEND. UCX logs show:

mlx5dv_devx_obj_destroy(SRQ) failed: Invalid argument
mlx5dv_devx_obj_destroy(CQ) failed: Invalid argument

Or:

select.c: no active messages transport: Unsupported operation

Root Causes:

Bonded IB device with LID=0: UCX selects mlx5_bond_0 by default, but bonded devices may have LID=0 (invalid for UD transport). Fix: set UCX_NET_DEVICES to a non-bonded device with a valid LID.
UCX/OFED version mismatch: The container’s UCX mlx5 library may be compiled against a different devx ABI than the host kernel driver. Any transport using IB (rc, cuda_ipc with IB) triggers the devx crash.
Missing RDMA device injection: If rdma/ib is not requested in the pod spec, no IB devices are injected into the container.

Diagnosis:

# Check which IB devices are visible and their LIDs
$ ibv_devinfo | grep -E "hca_id|lid"

# Verify rdma/ib was requested
$ kubectl get pod <pod> -o jsonpath='{.spec.containers[0].resources}'

# Check /dev/infiniband exists
$ ls -la /dev/infiniband/

Solutions:

Request rdma/ib resources (1 per GPU) in the pod spec
Set UCX_NET_DEVICES to a non-bonded device if mlx5_bond_0 has LID=0
Ensure the container image’s UCX build matches the host OFED version

Problem: Intermittent transfer failures

Symptoms: Sporadic getXferStatus: backend 'UCX' returned error status

Diagnosis:

# Enable UCX debug logging
$ kubectl set env deployment/<worker> UCX_LOG_LEVEL=debug
$ kubectl logs <worker-pod> | grep -i error

Common causes:

Network congestion or packet loss
Mismatched UCX versions between pods
RDMA resource exhaustion
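The log signatures quoted throughout this troubleshooting guide can be collected into a small lookup that points a log line at the matching problem entry. The sketch below uses only strings that appear in this guide. The rule table and function name are hypothetical, and real logs may need looser matching.

```python
# (substring to look for, troubleshooting entry it points to).
# More specific signatures come first: "no active messages transport:
# Unsupported operation" must match the InfiniBand entry, not the generic one.
TRIAGE_RULES = [
    ("fi_read still retrying EAGAIN", "EFA EAGAIN errors"),
    ("mlx5dv_devx_obj_destroy", "NIXL_ERR_BACKEND at create_backend on InfiniBand"),
    ("no active messages transport", "NIXL_ERR_BACKEND at create_backend on InfiniBand"),
    ("getXferStatus", "Intermittent transfer failures"),
    ("no RDMA transports available", "TTFT is 10+ seconds"),
    ("Unsupported operation", "Unsupported operation errors"),
]


def triage(log_line):
    """Return the troubleshooting entry matching a log line, or None."""
    for needle, entry in TRIAGE_RULES:
        if needle in log_line:
            return entry
    return None


print(triage("fi_read still retrying EAGAIN on rail 0"))
print(triage("select.c: no active messages transport: Unsupported operation"))
print(triage("Unexpected UCX error: Unsupported operation"))
```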

Quick Reference

Minimum Viable RDMA Configuration

env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"

securityContext:
  capabilities:
    add: ["IPC_LOCK"]

resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"

Diagnostic Checklist

rdma/ib resources visible: kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'
NIXL initialized: kubectl logs <pod> | grep "Backend"
Transfer bandwidth > 1 GB/s (check Grafana metrics)

For UCX deployments:

UCX sees RDMA devices: ucx_info -d | grep "Transport: rc"
UCX sees GPU memory: ucx_info -d | grep "memory types.*cuda"

For libfabric deployments (AWS EFA):

EFA devices available: fi_info -p efa
GDRCopy installed: ls /dev/gdrdrv
