Copyright © 2026, NVIDIA Corporation.

KVBM

The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.

KVBM offers:

  • A unified memory API spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
  • Support for block lifecycles (allocate → register → match) with event-based state transitions
  • Integration with NIXL, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
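The block lifecycle above can be pictured as a small event-driven state machine. The following is a minimal sketch only: the state names (`FREE`, `ALLOCATED`, `REGISTERED`), the `RELEASE` event, and the `Block` class are illustrative assumptions, not KVBM's actual types or API.

```python
from enum import Enum, auto


class BlockState(Enum):
    # Hypothetical state names chosen for this sketch.
    FREE = auto()
    ALLOCATED = auto()
    REGISTERED = auto()


class BlockEvent(Enum):
    ALLOCATE = auto()
    REGISTER = auto()
    MATCH = auto()
    RELEASE = auto()  # assumed event that returns a block to the free pool


# (current state, event) -> next state. Any pair not listed is illegal.
TRANSITIONS = {
    (BlockState.FREE, BlockEvent.ALLOCATE): BlockState.ALLOCATED,
    (BlockState.ALLOCATED, BlockEvent.REGISTER): BlockState.REGISTERED,
    (BlockState.REGISTERED, BlockEvent.MATCH): BlockState.REGISTERED,
    (BlockState.ALLOCATED, BlockEvent.RELEASE): BlockState.FREE,
    (BlockState.REGISTERED, BlockEvent.RELEASE): BlockState.FREE,
}


class Block:
    """A KV block whose state changes only in response to events."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.state = BlockState.FREE
        self.match_count = 0
        self.history = []  # names of the events applied so far

    def apply(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"illegal transition: {event.name} in state {self.state.name}"
            )
        self.state = TRANSITIONS[key]
        if event is BlockEvent.MATCH:
            self.match_count += 1  # a match reuses a registered block
        self.history.append(event.name)
        return self.state
```

Keeping the legal transitions in a single table makes it easy to reject out-of-order events, such as matching a block that was never registered.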

Get started: See the KVBM Guide for installation and deployment instructions.

When to Use KV Cache Offloading

KV cache offloading avoids expensive KV cache recomputation, which shortens response times and improves the user experience. Providers benefit from higher throughput and a lower cost per token, which makes inference services more scalable and efficient.

Offloading the KV cache to CPU memory or storage is most effective when the KV cache exceeds GPU memory and the savings from cache reuse outweigh the cost of transferring data. It is especially valuable in the following scenarios:

| Scenario | Benefit |
| --- | --- |
| Long sessions and multi-turn conversations | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| High concurrency | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| Shared or repeated content | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| Memory- or cost-constrained deployments | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
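Whether reuse outweighs transfer overhead can be estimated with a back-of-envelope calculation. The sketch below assumes a standard attention layout (one key and one value vector per layer, per KV head, per token) and approximates prefill cost as linear in the number of tokens. All function names are hypothetical, and the bandwidth and prefill throughput values must come from measurements on your own system.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache per token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_tokens, bytes_per_token, bandwidth_bytes_per_s):
    """Time to copy the cached prefix back to the GPU at the given bandwidth."""
    return num_tokens * bytes_per_token / bandwidth_bytes_per_s


def recompute_seconds(num_tokens, prefill_tokens_per_s):
    """Time to recompute the prefix (linear approximation of prefill cost)."""
    return num_tokens / prefill_tokens_per_s


def offload_pays_off(num_tokens, bytes_per_token, bandwidth_bytes_per_s,
                     prefill_tokens_per_s):
    """True when reloading the prefix is faster than recomputing it."""
    reload_time = transfer_seconds(num_tokens, bytes_per_token,
                                   bandwidth_bytes_per_s)
    return reload_time < recompute_seconds(num_tokens, prefill_tokens_per_s)


# Illustrative model shape only: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128, 2)  # 131072 bytes (128 KiB)
```

This estimate ignores queueing, overlap of transfers with compute, and the superlinear cost of attention on long prompts, so treat it as a first filter rather than a prediction.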

Feature Support Matrix

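| Feature | Option | Support |
| --- | --- | --- |
| Backend | Local | ✅ |
| Backend | Kubernetes | ✅ |
| LLM Framework | vLLM | ✅ |
| LLM Framework | TensorRT-LLM | ✅ |
| LLM Framework | SGLang | ❌ |
| Serving Type | Aggregated | ✅ |
| Serving Type | Disaggregated | ✅ |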

Architecture

Figure: KVBM Architecture. High-level layered architecture view of the Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem.

KVBM has three primary logical layers:

LLM Inference Runtime Layer: The top layer contains the inference runtimes (TensorRT-LLM, vLLM), which integrate with the Dynamo KVBM through dedicated connector modules. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
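The connector idea can be sketched as two adapters that turn differently shaped runtime events into one common block operation. Everything here is hypothetical: `BlockOp`, the `RuntimeA`/`RuntimeB` connectors, and the event field names are invented for illustration and do not reflect the actual vLLM, TensorRT-LLM, or KVBM interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class BlockOp:
    """Runtime-neutral, block-oriented operation (hypothetical shape)."""
    kind: str        # "allocate" or "free"
    request_id: str
    num_blocks: int


class Connector(Protocol):
    def translate(self, event: dict) -> BlockOp: ...


class RuntimeAConnector:
    """Hypothetical runtime that reports sequence events in tokens."""

    def __init__(self, block_size):
        self.block_size = block_size

    def translate(self, event):
        # Round the token count up to whole blocks (ceiling division).
        num_blocks = -(-event["num_tokens"] // self.block_size)
        kind = "allocate" if event["type"] == "seq_started" else "free"
        return BlockOp(kind, event["seq_id"], num_blocks)


class RuntimeBConnector:
    """Hypothetical runtime that already reports block counts."""

    def translate(self, event):
        return BlockOp(event["op"], event["request"], event["blocks"])


# Both runtimes describe the same request; the block manager sees one format.
op_a = RuntimeAConnector(block_size=16).translate(
    {"type": "seq_started", "seq_id": "r1", "num_tokens": 33}
)
op_b = RuntimeBConnector().translate(
    {"op": "allocate", "request": "r1", "blocks": 3}
)
```

Because the block manager only ever receives `BlockOp` values in this sketch, supporting another runtime would mean writing one more connector rather than changing the memory manager.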

KVBM Logic Layer — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
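To make the table lookup, reuse, and eviction responsibilities concrete, here is a minimal sketch of a fixed-capacity block pool keyed by content hash. It assumes a least-recently-used (LRU) eviction policy purely for illustration; the source does not state which policies KVBM implements, and `BlockPool` is a hypothetical name.

```python
from collections import OrderedDict


class BlockPool:
    """Fixed-capacity pool of KV blocks with hash lookup and LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        # content hash -> block payload, ordered from least to most recently used
        self._blocks = OrderedDict()

    def match(self, content_hash):
        """Return the cached block for this hash, or None on a miss."""
        if content_hash not in self._blocks:
            return None
        self._blocks.move_to_end(content_hash)  # mark as recently used
        return self._blocks[content_hash]

    def register(self, content_hash, payload):
        """Store a block, evicting the least recently used one if full.

        Returns the evicted (hash, payload) pair, or None if nothing was evicted.
        """
        evicted = None
        if content_hash in self._blocks:
            self._blocks.move_to_end(content_hash)
        elif len(self._blocks) >= self.capacity:
            evicted = self._blocks.popitem(last=False)
        self._blocks[content_hash] = payload
        return evicted

    def __len__(self):
        return len(self._blocks)
```

In a tiered setup, the pair returned by `register` is the natural candidate to offload to a slower tier instead of discarding it.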

NIXL Layer: The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, remote memory sharing over RDMA and NVLink, and dynamic block registration and metadata exchange. It also provides a plugin interface for storage backends, including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
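A plugin-style storage interface can be sketched as a small protocol that every tier implements, with a store that searches tiers from fastest to slowest. This is not the NIXL API: `StorageBackend`, `InMemoryBackend`, `TieredStore`, and the promote-on-read behavior are all assumptions made for illustration.

```python
from typing import Optional, Protocol


class StorageBackend(Protocol):
    """Hypothetical plugin contract: any tier that can read and write blocks."""
    name: str

    def read(self, block_id: str) -> Optional[bytes]: ...
    def write(self, block_id: str, data: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in for a real tier (GPU HBM, host DRAM, SSD, object store)."""

    def __init__(self, name):
        self.name = name
        self._data = {}

    def read(self, block_id):
        return self._data.get(block_id)

    def write(self, block_id, data):
        self._data[block_id] = data


class TieredStore:
    """Looks up blocks across tiers ordered from fastest to slowest."""

    def __init__(self, tiers):
        self.tiers = list(tiers)

    def write(self, block_id, data, tier_name):
        for tier in self.tiers:
            if tier.name == tier_name:
                tier.write(block_id, data)
                return
        raise KeyError(tier_name)

    def read(self, block_id):
        """Return (data, tier_name) for the fastest tier holding the block.

        On a hit in a slower tier, the block is copied into every faster
        tier (an assumed promotion policy). Returns (None, None) on a miss.
        """
        for i, tier in enumerate(self.tiers):
            data = tier.read(block_id)
            if data is not None:
                for faster in self.tiers[:i]:
                    faster.write(block_id, data)
                return data, tier.name
        return None, None
```

For example, a block written to an `"ssd"` tier is served from that tier on the first read and from the fastest tier on later reads, because the first read promotes it.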

Learn more: See the KVBM Design Document for detailed architecture, components, and data flows.

Next Steps

  • KVBM Guide — Installation, configuration, and deployment instructions
  • KVBM Design — Architecture deep dive, components, and data flows
  • LMCache Integration — Use LMCache with Dynamo vLLM backend
  • FlexKV Integration — Use FlexKV for KV cache management
  • SGLang HiCache — Enable SGLang’s hierarchical cache with NIXL
  • NIXL Documentation — NIXL communication library details