Copyright © 2026, NVIDIA Corporation.

KV Cache Offloading

CPU and disk offloading integrations for vLLM in Dynamo

Dynamo supports multiple KV cache offloading backends for vLLM, allowing you to extend effective KV cache capacity beyond GPU memory using CPU RAM and disk storage. Each backend integrates through vLLM’s connector interface and works with both aggregated and disaggregated serving.

| Backend | Source |
| --- | --- |
| KVBM | Dynamo |
| LMCache | GitHub |
| FlexKV | GitHub |

KVBM

KVBM (KV Block Manager) is Dynamo’s built-in KV cache offloading system. It provides a three-layer architecture (LLM runtime, logical block management, NIXL transport) with support for CPU and disk cache tiers, and integrates natively with Dynamo’s KV-aware routing and disaggregated serving.
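
To make the idea of cache tiers concrete, here is a minimal toy model of a tiered block cache. It is a sketch under stated assumptions, not KVBM's implementation: the class `TieredBlockCache`, the tier names, the capacities, and the LRU demotion policy are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict
from typing import Optional


class TieredBlockCache:
    """Toy LRU cache with ordered tiers (hypothetical, not the KVBM API).

    Blocks evicted from one tier are demoted to the next slower tier.
    """

    def __init__(self, capacities: dict[str, int]):
        # Insertion order defines the tier order, fastest first.
        self.order = list(capacities)
        self.capacities = dict(capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, block_id: str, data: bytes, tier_index: int) -> None:
        if tier_index >= len(self.order):
            return  # fell off the slowest tier: the block is dropped
        name = self.order[tier_index]
        tier = self.tiers[name]
        tier[block_id] = data
        while len(tier) > self.capacities[name]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            self._insert(victim_id, victim, tier_index + 1)  # demote

    def put(self, block_id: str, data: bytes) -> None:
        # Drop any stale copy so a block lives in exactly one tier.
        for tier in self.tiers.values():
            tier.pop(block_id, None)
        self._insert(block_id, data, 0)

    def location(self, block_id: str) -> Optional[str]:
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None

    def get(self, block_id: str) -> Optional[bytes]:
        """Return the block and promote it back to the fastest tier."""
        name = self.location(block_id)
        if name is None:
            return None
        data = self.tiers[name].pop(block_id)
        self._insert(block_id, data, 0)
        return data


cache = TieredBlockCache({"gpu": 2, "cpu": 2, "disk": 2})
for i in range(5):
    cache.put(f"b{i}", b"kv")
locations = {f"b{i}": cache.location(f"b{i}") for i in range(5)}
```

With capacities of two blocks per tier, inserting five blocks leaves the two newest on the fastest tier and pushes the oldest down to the slowest one. A later `get` on that oldest block promotes it again and demotes something else in its place.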

| Deployment | Launch Script |
| --- | --- |
| Aggregated | agg_kvbm.sh |
| Aggregated + KV routing | agg_kvbm_router.sh |
| Disaggregated (1P1D) | disagg_kvbm.sh |
| Disaggregated (2P2D) | disagg_kvbm_2p2d.sh |
| Disaggregated + KV routing | disagg_kvbm_router.sh |

For configuration details, see the KVBM Guide.

LMCache

LMCache is an open-source KV cache engine that provides prefill-once, reuse-everywhere caching with multi-level storage backends (CPU RAM, local storage, Redis, GDS, InfiniStore/Mooncake).
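
One common way to make "prefill once, reuse everywhere" work is to key cached KV blocks by a hash of their token prefix, so that any later request sharing that prefix can find the same blocks. The sketch below illustrates that general technique only. It does not claim to reproduce LMCache's actual keying scheme, and `BLOCK_SIZE`, `block_hashes`, and `reusable_prefix_blocks` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative value only


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block chained with the hash of the block before it.

    Chaining means a block's key depends on the entire prefix, so two
    requests share a key only if all preceding tokens match as well.
    """
    hashes = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_prefix_blocks(token_ids: list[int], store: set[str]) -> int:
    """Count leading blocks whose keys are already present in the store."""
    count = 0
    for key in block_hashes(token_ids):
        if key not in store:
            break
        count += 1
    return count


# First request: 10 tokens, so two full blocks are cached.
first = list(range(10))
store = set(block_hashes(first))

# Second request shares the first 8 tokens, then diverges.
second = list(range(8)) + [99, 100, 101, 102]
shared = reusable_prefix_blocks(second, store)

# Third request differs in its very first token, so nothing is reusable.
third = [1, 1, 2, 3, 4, 5, 6, 7]
unshared = reusable_prefix_blocks(third, store)
```

Here the second request can skip prefill for its first two blocks, while the third request cannot reuse anything even though its second block contains the same tokens, because the chained hash also covers the differing first block.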

| Deployment | Launch Script |
| --- | --- |
| Aggregated | agg_lmcache.sh |
| Aggregated (multiprocess metrics) | agg_lmcache_multiproc.sh |
| Disaggregated | disagg_lmcache.sh |

For configuration details, see the LMCache Integration Guide.

FlexKV

FlexKV is a scalable, distributed KV cache runtime developed by Tencent Cloud’s TACO team. It supports multi-level caching (GPU, CPU, SSD), distributed KV cache reuse across nodes, and high-performance I/O via io_uring and GPUDirect Storage.

| Deployment | Launch Script |
| --- | --- |
| Aggregated | agg_flexkv.sh |
| Aggregated + KV routing | agg_flexkv_router.sh |
| Disaggregated | disagg_flexkv.sh |

For configuration details, see the FlexKV Integration Guide.
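
The launch scripts listed for the three backends can be collected into a single lookup, which may be handy in deployment tooling. This is a minimal sketch: the script names come from the tables on this page, while the dictionary keys (such as `"aggregated+kv-routing"`) and the `launch_script` helper are hypothetical labels introduced here, not part of Dynamo.

```python
# Script names are taken from this page; the keys are hypothetical labels.
LAUNCH_SCRIPTS = {
    "kvbm": {
        "aggregated": "agg_kvbm.sh",
        "aggregated+kv-routing": "agg_kvbm_router.sh",
        "disaggregated-1p1d": "disagg_kvbm.sh",
        "disaggregated-2p2d": "disagg_kvbm_2p2d.sh",
        "disaggregated+kv-routing": "disagg_kvbm_router.sh",
    },
    "lmcache": {
        "aggregated": "agg_lmcache.sh",
        "aggregated-multiproc-metrics": "agg_lmcache_multiproc.sh",
        "disaggregated": "disagg_lmcache.sh",
    },
    "flexkv": {
        "aggregated": "agg_flexkv.sh",
        "aggregated+kv-routing": "agg_flexkv_router.sh",
        "disaggregated": "disagg_flexkv.sh",
    },
}


def launch_script(backend: str, deployment: str) -> str:
    """Return the launch script name for a backend and deployment label."""
    try:
        scripts = LAUNCH_SCRIPTS[backend.lower()]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(LAUNCH_SCRIPTS)}"
        ) from None
    try:
        return scripts[deployment]
    except KeyError:
        raise ValueError(
            f"{backend} has no {deployment!r} script; available: {sorted(scripts)}"
        ) from None


def backends_supporting(deployment: str) -> list[str]:
    """List the backends that ship a script for the given deployment label."""
    return sorted(b for b, s in LAUNCH_SCRIPTS.items() if deployment in s)
```

For example, `launch_script("KVBM", "disaggregated-2p2d")` returns `disagg_kvbm_2p2d.sh`, and `backends_supporting("aggregated+kv-routing")` shows that only KVBM and FlexKV list a script combining aggregated serving with KV routing on this page.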

See Also

  • KVBM Design: Architecture and design of Dynamo’s built-in KV cache offloading
  • Routing Concepts: Routing requests based on KV cache state
  • Disaggregated Serving: Prefill/decode separation architecture