Copyright © 2026, NVIDIA Corporation.

KV Cache Transfer

For general TensorRT-LLM features and configuration, see the Reference Guide.


In disaggregated serving architectures, the KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer: NIXL (the default) and UCX used directly.

Using NIXL for KV Cache Transfer

To start the disaggregated service, see Disaggregated Serving, which describes how to launch the deployment.

Default Method: NIXL

By default, TensorRT-LLM uses NIXL (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. NIXL is NVIDIA's high-performance communication library, designed for efficient data transfer in distributed GPU environments.

Specify Backends for NIXL

TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. By default, UCX is used if no backend is explicitly specified. Dynamo currently supports both backends. For AWS EFA deployments, UCX with SRD transport is the tested and recommended backend (see AWS EFA below).

Alternative Method: UCX

TensorRT-LLM can also leverage UCX (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set cache_transceiver_config.backend: UCX in your engine configuration YAML file.

The environment variable TRTLLM_USE_UCX_KVCACHE=1 with cache_transceiver_config.backend: DEFAULT does not enable UCX. You must explicitly set backend: UCX in the configuration.
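
As a minimal sketch, the setting described above would look like this in the engine configuration YAML. Only the `cache_transceiver_config.backend` key comes from this page; any other keys your engine configuration needs are omitted here.

```yaml
# Engine configuration YAML (fragment only; other engine settings omitted)
cache_transceiver_config:
  # Must be set explicitly. Leaving this as DEFAULT and exporting
  # TRTLLM_USE_UCX_KVCACHE=1 does not enable UCX.
  backend: UCX
```

Apply the same value to the configuration used by both prefill and decode workers so that both sides of the transfer use the same backend.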

AWS EFA

On AWS, UCX uses the SRD (Scalable Reliable Datagram) transport over EFA devices. NIXL discovers the EFA rdmap* devices automatically through UCX, so no NIXL-level configuration changes are needed.

Image options:

  • Pre-built EFA image (AMD64 only): A dedicated EFA image with the EFA SDK baked in is available on NGC. This is recommended for AMD64 instances (e.g. p5.48xlarge):
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.1-efa-amd64

See Release Artifacts for all available EFA images.

  • Host-mount approach (ARM64 / GB200): No pre-built EFA ARM64 image is published. Use the standard tensorrtllm-runtime image and mount the EFA SDK from the host node. This is what we tested on GB200 NVL72:
volumeMounts:
  - name: efa-sdk
    mountPath: /opt/amazon/efa
volumes:
  - name: efa-sdk
    hostPath:
      path: /opt/amazon/efa

EFA resource requests:

resources:
  requests:
    vpc.amazonaws.com/efa: "4"
  limits:
    vpc.amazonaws.com/efa: "4"

Required environment variables for EFA workers (set on both prefill and decode):

env:
  - name: FI_PROVIDER
    value: "efa"
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "1"
  - name: FI_EFA_ENABLE_SHM_TRANSFER
    value: "0"
  - name: LD_LIBRARY_PATH
    value: "/opt/amazon/efa/lib:/usr/local/lib:/usr/lib"

FI_EFA_ENABLE_SHM_TRANSFER must be 0. SHM transfers break NIXL GPU buffer registrations.

Security context: AWS EFA currently requires privileged mode:

securityContext:
  privileged: true
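
To show how the EFA fragments above fit together, here is a rough sketch of a single worker container using the pre-built AMD64 EFA image. It is written as a plain Kubernetes Pod purely for illustration: the Pod and container names are hypothetical, GPU requests and the worker command are omitted, and the exact placement of these fields inside a Dynamo deployment manifest is not shown on this page and may differ.

```yaml
# Illustrative sketch only: a generic Kubernetes Pod, not a Dynamo manifest.
apiVersion: v1
kind: Pod
metadata:
  name: trtllm-efa-worker          # hypothetical name
spec:
  containers:
    - name: worker                 # hypothetical name
      # Pre-built EFA image (AMD64 only), so no host mount of the EFA SDK is needed.
      image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.1-efa-amd64
      securityContext:
        privileged: true           # AWS EFA currently requires privileged mode
      resources:
        requests:
          vpc.amazonaws.com/efa: "4"
        limits:
          vpc.amazonaws.com/efa: "4"
      env:
        - name: FI_PROVIDER
          value: "efa"
        - name: FI_EFA_USE_DEVICE_RDMA
          value: "1"
        - name: FI_EFA_ENABLE_SHM_TRANSFER
          value: "0"               # must be 0; SHM transfers break NIXL GPU buffer registrations
        - name: LD_LIBRARY_PATH
          value: "/opt/amazon/efa/lib:/usr/local/lib:/usr/lib"
```

On ARM64 / GB200 nodes, the same container would instead use the standard tensorrtllm-runtime image together with the /opt/amazon/efa host mount described above.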

NIXL Plugin ABI Mismatch on Decode Multinode

When running multinode decode, the decode leader launches workers via mpirun -> mgmn_worker_node, which loads TRT-LLM’s bundled NIXL rather than the system nixl_cu13. The container’s default NIXL_PLUGIN_DIR points to system plugins that are ABI-incompatible with TRT-LLM’s bundled NIXL. Override this on the decode service only:

env:
  - name: NIXL_PLUGIN_DIR
    value: "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/libs/nixl/plugins"

Do not set this on prefill workers. They use nixl_cu13, which is compatible with the system plugins.

ComputeDomain for GB200 NVL72

On GB200 NVL72 racks, NCCL requires a ComputeDomain custom resource (CR) for proper cuMem/NVLS initialization. Without it, workers fail with the NCCL error 'unhandled system error' during model loading.

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-compute-domain
spec:
  numNodes: 3  # total nodes across prefill + decode
  channel:
    resourceClaimTemplate:
      name: my-compute-domain-channel

Both prefill and decode services must include ResourceClaims:

resources:
  claims:
    - name: compute-domain-channel
extraPodSpec:
  resourceClaims:
    - name: compute-domain-channel
      resourceClaimTemplateName: my-compute-domain-channel

Required NCCL environment variables for GB200:

env:
  - name: NCCL_MNNVL_ENABLE
    value: "1"
  - name: NCCL_CUMEM_ENABLE
    value: "1"
  - name: NCCL_NVLS_ENABLE
    value: "1"
  - name: NVIDIA_GDRCOPY
    value: "1"

Verifying EFA is Active

After deployment, confirm NIXL is using SRD over EFA in the worker logs:

$ kubectl logs <prefill-pod> | grep -iE "NixlTransfer|srd|rdmap"

Expected output:

NixlTransferAgent using NIXL backend: UCX
ucp_context_2 self cfg#1 rma_am(srd/rdmap40s0:1) am(srd/rdmap40s0:1 srd/rdmap62s0:1 ...)
NixlTransferAgent mAddress: 100.x.x.x:32939

  • srd/rdmap* confirms SRD transport over EFA devices
  • Multiple rdmap entries correspond to one EFA device per GPU