Copyright © 2026, NVIDIA Corporation.

KV Cache Transfer


For general TensorRT-LLM features and configuration, see the Reference Guide.


In disaggregated serving architectures, the KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer: NIXL (the default) and UCX used directly.

Using NIXL for KV Cache Transfer

To start the disaggregated service, see Disaggregated Serving for deployment instructions.

Default Method: NIXL

By default, TensorRT-LLM uses NIXL (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. NIXL is NVIDIA’s high-performance communication library designed for efficient data transfer in distributed GPU environments.
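
As a minimal sketch, the default behavior corresponds to leaving the transceiver backend at DEFAULT in the engine configuration YAML file. The key name and the DEFAULT value both appear in the UCX section of this page. That DEFAULT resolves to NIXL is an assumption drawn from the description above, so confirm it against your TensorRT-LLM version.

```yaml
# Engine configuration fragment (sketch, not a complete config).
# Assumption: DEFAULT selects the default transfer method, which this page
# describes as NIXL with the UCX backend.
cache_transceiver_config:
  backend: DEFAULT
```

Omitting the setting entirely should have the same effect if DEFAULT is the implicit value, which is also an assumption here.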

Specify Backends for NIXL

TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. If no backend is explicitly specified, UCX is used. Dynamo currently supports only the UCX backend because LIBFABRIC support is still a work in progress, so do not change the NIXL backend in the Dynamo runtime image.

Alternative Method: UCX

TensorRT-LLM can also use UCX (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set cache_transceiver_config.backend: UCX in your engine configuration YAML file.
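
The dotted setting above expands to a nested key in the engine configuration YAML file. The fragment below is a sketch that shows only this one setting. The rest of the file depends on your deployment and is omitted.

```yaml
# Engine configuration fragment (sketch, not a complete config).
# Selects UCX directly as the KV cache transfer backend.
cache_transceiver_config:
  backend: UCX
```

The same setting presumably needs to appear in the engine configuration of both the prefill and the decode workers, since the transfer happens between them. This page does not state that explicitly, so treat it as an assumption.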

Setting the environment variable TRTLLM_USE_UCX_KVCACHE=1 while leaving cache_transceiver_config.backend: DEFAULT does not enable UCX. You must explicitly set backend: UCX in the configuration.