This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in SGLang and vLLM, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see Further Reading.
Internal Components of Dynamo KVBM
sequence_hash.
KVBM Data Flows from device to other memory hierarchies

Internal architecture and key modules in the Dynamo KVBM
The KvBlockManager<H, D> acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
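As a rough illustration of tier-aware location tracking, the sketch below records which tiers hold a block and returns the fastest one. The `Tier` enum, `BlockLocations`, and its methods are hypothetical names for this example, not part of the KVBM API.

```rust
use std::collections::HashMap;

/// Memory tiers named in the text. The enum itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum Tier {
    G1Device, // GPU memory
    G2Host,   // CPU memory within and across nodes
    G3Disk,   // local or pooled SSDs
    G4Remote, // remote storage, treated as an opaque blob store
}

/// Hypothetical location table: sequence hash -> tiers currently holding the block.
#[derive(Default)]
pub struct BlockLocations {
    by_hash: HashMap<u64, Vec<Tier>>,
}

impl BlockLocations {
    /// Records that `tier` holds a copy of the block, ignoring duplicates.
    pub fn record(&mut self, sequence_hash: u64, tier: Tier) {
        let tiers = self.by_hash.entry(sequence_hash).or_default();
        if !tiers.contains(&tier) {
            tiers.push(tier);
        }
    }

    /// Returns the fastest tier holding the block (G1 sorts before G4),
    /// or None if the block is unknown.
    pub fn fastest(&self, sequence_hash: u64) -> Option<Tier> {
        self.by_hash.get(&sequence_hash)?.iter().copied().min()
    }
}
```

Deriving `Ord` on the enum lets the declaration order double as the tier preference order, which keeps the lookup to a single `min()`.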
KvBlockManager<H, D> owns:
BlockPool<Device>
BlockPool<Host>
Implementation-wise, KvBlockManagerState holds the logic: it is initialized by KvBlockManagerConfig, which merges runtime, model, and layout configurations. NixlOptions injects remote awareness.
Each block is a 2D array [num_layers][page_size × inner_dim]. The BlockLayout trait abstracts the memory layout. The default implementation, FullyContiguous, stores all layers for all blocks in one region with alignment-aware stride computation:
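A minimal sketch of alignment-aware stride computation for a fully contiguous layout follows. It assumes each block's layers are packed back to back and that each block start is rounded up to the alignment. The struct names, field names, and exact stride rules are assumptions for illustration, not the actual FullyContiguous implementation.

```rust
/// Hypothetical configuration; field names are assumptions for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct LayoutConfig {
    pub num_blocks: usize,
    pub num_layers: usize,
    pub page_size: usize,
    pub inner_dim: usize,
    pub dtype_bytes: usize,
    pub alignment: usize, // must be a power of two
}

/// Rounds `value` up to the next multiple of `alignment` (a power of two).
fn align_up(value: usize, alignment: usize) -> usize {
    (value + alignment - 1) & !(alignment - 1)
}

pub struct FullyContiguousSketch {
    num_blocks: usize,
    num_layers: usize,
    layer_stride: usize,
    block_stride: usize,
}

impl FullyContiguousSketch {
    pub fn new(cfg: LayoutConfig) -> Self {
        assert!(cfg.alignment.is_power_of_two());
        // One layer of one block: page_size x inner_dim elements.
        let layer_stride = cfg.page_size * cfg.inner_dim * cfg.dtype_bytes;
        // Pad each block so that every block starts on an aligned address.
        let block_stride = align_up(cfg.num_layers * layer_stride, cfg.alignment);
        Self {
            num_blocks: cfg.num_blocks,
            num_layers: cfg.num_layers,
            layer_stride,
            block_stride,
        }
    }

    /// Size of the single backing region in bytes.
    pub fn total_bytes(&self) -> usize {
        self.num_blocks * self.block_stride
    }

    /// Byte offset of (block, layer) within the region, or None if out of range.
    pub fn offset(&self, block: usize, layer: usize) -> Option<usize> {
        if block >= self.num_blocks || layer >= self.num_layers {
            return None;
        }
        Some(block * self.block_stride + layer * self.layer_stride)
    }
}
```

With 3 layers of 320 bytes each and a 256-byte alignment, each block occupies 1024 bytes instead of 960, so the layer at (block 1, layer 2) starts at offset 1664.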
Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
DeviceStorage → CUDA device buffer
PinnedStorage → page-locked host memory
SystemStorage → CPU heap memory (fallback/test)
NixlStorage → remote memory through NIXL RDMA handles (includes storage)
Each layout is constructed using a LayoutConfig, and storage is either passed directly or allocated using a StorageAllocator.
Each BlockPool<T> (where T is DeviceStorage, PinnedStorage, etc.) tracks two sub-pools:
When a token block is requested (e.g., get_mutable_block()), the allocator pops from InactivePool, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
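The allocation path above can be sketched with a free list and an in-use set. `PoolSketch`, `release`, and `available` are hypothetical names, and the in-use set is an assumed stand-in for the second sub-pool; the real BlockPool<T> also carries storage and block state.

```rust
use std::collections::{HashSet, VecDeque};

/// Illustrative pool that tracks block ids only.
pub struct PoolSketch {
    inactive: VecDeque<usize>, // free block ids, ready for reuse
    in_use: HashSet<usize>,    // ids currently handed out (assumed second sub-pool)
}

impl PoolSketch {
    /// Creates a pool in which every block starts out inactive.
    pub fn new(num_blocks: usize) -> Self {
        Self {
            inactive: (0..num_blocks).collect(),
            in_use: HashSet::new(),
        }
    }

    /// Pops a block from the inactive pool and marks it as in use.
    /// Returns None when the pool is exhausted.
    pub fn get_mutable_block(&mut self) -> Option<usize> {
        let id = self.inactive.pop_front()?;
        self.in_use.insert(id);
        Some(id)
    }

    /// Returns a block to the inactive pool. Yields false if the id
    /// was not handed out, which guards against double release.
    pub fn release(&mut self, id: usize) -> bool {
        if self.in_use.remove(&id) {
            self.inactive.push_back(id);
            true
        } else {
            false
        }
    }

    /// Number of blocks that can still be allocated.
    pub fn available(&self) -> usize {
        self.inactive.len()
    }
}
```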
The state machine (BlockState) tracks block lifecycle transitions:
A sequence requests a new KV block:
init_sequence() → Transitions to Partial
commit() → State becomes Complete
register() → Block is hashed and moved to Registered. Blocks can now be used for lookup.
drop() of RAII handle returns block to Reset
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an EventManager. On registration and drop:
PublishHandle triggers Register events
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
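The lifecycle transitions listed earlier (Reset, Partial, Complete, Registered) can be modeled as a small state machine that rejects out-of-order calls. This is a sketch under assumptions: the `InvalidTransition` error type and the `reset` method are hypothetical, and the real BlockState likely carries more data per state.

```rust
/// Lifecycle states named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockState {
    Reset,
    Partial,
    Complete,
    Registered { sequence_hash: u64 },
}

/// Hypothetical error describing a rejected transition.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidTransition {
    pub from: BlockState,
    pub op: &'static str,
}

impl BlockState {
    /// Reset -> Partial
    pub fn init_sequence(self) -> Result<Self, InvalidTransition> {
        match self {
            BlockState::Reset => Ok(BlockState::Partial),
            from => Err(InvalidTransition { from, op: "init_sequence" }),
        }
    }

    /// Partial -> Complete
    pub fn commit(self) -> Result<Self, InvalidTransition> {
        match self {
            BlockState::Partial => Ok(BlockState::Complete),
            from => Err(InvalidTransition { from, op: "commit" }),
        }
    }

    /// Complete -> Registered; the block becomes available for lookup.
    pub fn register(self, sequence_hash: u64) -> Result<Self, InvalidTransition> {
        match self {
            BlockState::Complete => Ok(BlockState::Registered { sequence_hash }),
            from => Err(InvalidTransition { from, op: "register" }),
        }
    }

    /// Any state -> Reset; this is what dropping the RAII handle would trigger.
    pub fn reset(self) -> Self {
        BlockState::Reset
    }
}
```

Taking `self` by value and returning the next state makes every transition explicit at the call site, so a skipped step surfaces as an error rather than as silent state corruption.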
The NIXL agent exposes remote memory buffers using NixlBlockSet, RemoteBlocks, and layout descriptors. Key operations include:
nixl_register(): Registers memory region with NIXL runtime
serialize() / deserialize(): Converts layout and memory into transferable descriptors
import_remote_blockset(): Loads remote node’s block layouts into the manager
get_remote_blocks_mutable(): Fetches transferable memory views from another node
RemoteBlocks is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
The following describes a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL:
1. Agent Creation & Memory Registration
Each worker independently sets up a NixlAgent:
nixl_register()
Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
2. Metadata Exchange
After memory registration, workers exchange serialized layout metadata, encapsulated in a SerializedNixlBlockLayout.
Why is this step critical?
FullyContiguous layouts, their internal slicing and alignment assumptions differ
deserialize()
This enables NIXL to:
Without this step, remote fetches would result in data corruption or misaligned tokens.
3. Serialization & Deserialization: Making Layouts Portable
In the serialization stage, KVBM exports the layout, and FullyContiguous::serialize() encodes:
The system sends this using NIXL transfer and then injects it into a KVBM scheduler state.
In the deserialization stage, SerializedNixlBlockLayout::deserialize() rehydrates this into:
It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
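To make the round trip concrete, the sketch below encodes a layout descriptor into bytes and rebuilds it on the receiving side, then resolves a remote address using the peer's own strides. The `LayoutDescriptor` fields and the fixed little-endian wire format are assumptions for illustration; the actual contents of SerializedNixlBlockLayout are not specified here.

```rust
/// Hypothetical descriptor; the real serialized layout may carry other fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LayoutDescriptor {
    pub num_blocks: u64,
    pub num_layers: u64,
    pub layer_stride: u64,
    pub block_stride: u64,
    pub base_addr: u64,
}

const FIELD_COUNT: usize = 5;
const ENCODED_LEN: usize = FIELD_COUNT * 8;

impl LayoutDescriptor {
    /// Encodes all fields as little-endian u64 values, in declaration order.
    pub fn serialize(&self) -> Vec<u8> {
        let fields = [
            self.num_blocks,
            self.num_layers,
            self.layer_stride,
            self.block_stride,
            self.base_addr,
        ];
        let mut out = Vec::with_capacity(ENCODED_LEN);
        for f in fields.iter() {
            out.extend_from_slice(&f.to_le_bytes());
        }
        out
    }

    /// Rebuilds the descriptor, rejecting buffers of the wrong length.
    pub fn deserialize(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != ENCODED_LEN {
            return None;
        }
        let mut fields = [0u64; FIELD_COUNT];
        for (i, chunk) in bytes.chunks_exact(8).enumerate() {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            fields[i] = u64::from_le_bytes(buf);
        }
        Some(Self {
            num_blocks: fields[0],
            num_layers: fields[1],
            layer_stride: fields[2],
            block_stride: fields[3],
            base_addr: fields[4],
        })
    }

    /// Address of (block, layer) in the peer's region, using the peer's strides.
    pub fn remote_addr(&self, block: u64, layer: u64) -> Option<u64> {
        if block >= self.num_blocks || layer >= self.num_layers {
            return None;
        }
        Some(self.base_addr + block * self.block_stride + layer * self.layer_stride)
    }
}
```

Because the receiver computes addresses from the sender's strides rather than its own, two workers with different alignment settings still agree on where each KV block lives.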
4. Ownership Handles and Lifetime Tracking
Memory ownership in NIXL is tightly coupled with RAII-based handles:
PublishHandle which wraps a RegistrationHandle
This mechanism avoids:
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
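One way to picture the handle-based lifetime tracking is a pair of RAII types that emit events over a channel. In this sketch, creating a PublishHandle sends a Register event, and dropping the wrapped RegistrationHandle sends a Remove event. The `Event` enum, `event_channel`, `drain_batch`, and the assumption that drop emits a removal event are hypothetical; only the handle names come from the text.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical event type for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Register(u64),
    Remove(u64),
}

/// Emits a Remove event when dropped.
pub struct RegistrationHandle {
    sequence_hash: u64,
    tx: Sender<Event>,
}

impl Drop for RegistrationHandle {
    fn drop(&mut self) {
        // Ignore send errors: the receiver may already be gone at shutdown.
        let _ = self.tx.send(Event::Remove(self.sequence_hash));
    }
}

/// Emits a Register event on creation and owns the RegistrationHandle.
pub struct PublishHandle {
    _registration: RegistrationHandle,
}

impl PublishHandle {
    pub fn new(sequence_hash: u64, tx: Sender<Event>) -> Self {
        let _ = tx.send(Event::Register(sequence_hash));
        Self {
            _registration: RegistrationHandle { sequence_hash, tx },
        }
    }
}

/// Creates the channel that connects handles to the publisher side.
pub fn event_channel() -> (Sender<Event>, Receiver<Event>) {
    channel()
}

/// Publisher side: drains every pending event into one batch without blocking.
pub fn drain_batch(rx: &Receiver<Event>) -> Vec<Event> {
    rx.try_iter().collect()
}
```

Tying the Remove event to `Drop` means no caller has to remember to deregister a block; the event fires whenever the last owner lets go of the handle.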
You can integrate KVBM with a storage backend by extending or wrapping NixlEnabledStorage to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers.
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
registerVolume(descriptor): Register a logical volume for KV cache data
unregisterVolume(): Cleanly deregister and release volume mappings
get() / put(): Block-level APIs used by KVBM to fetch and store token blocks
These abstractions allow backends to be integrated without tying into the host’s file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Note that these APIs are still being finalized.
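Since these APIs are still being finalized, the following is only a guess at their shape, written as a Rust trait with an in-memory backend. The trait name `KvVolume`, the snake_case method names, the signatures, and the error type are all assumptions; only the four operations come from the text.

```rust
use std::collections::HashMap;

/// Hypothetical trait mirroring the four operations described in the text.
pub trait KvVolume {
    fn register_volume(&mut self, descriptor: &str) -> Result<(), String>;
    fn unregister_volume(&mut self);
    fn put(&mut self, sequence_hash: u64, block: &[u8]) -> Result<(), String>;
    fn get(&self, sequence_hash: u64) -> Option<Vec<u8>>;
}

/// Toy backend that keeps blocks in a map, useful for tests.
#[derive(Default)]
pub struct InMemoryVolume {
    descriptor: Option<String>,
    blocks: HashMap<u64, Vec<u8>>,
}

impl KvVolume for InMemoryVolume {
    fn register_volume(&mut self, descriptor: &str) -> Result<(), String> {
        if self.descriptor.is_some() {
            return Err("a volume is already registered".to_string());
        }
        self.descriptor = Some(descriptor.to_string());
        Ok(())
    }

    /// Releases the volume mapping and everything stored through it.
    fn unregister_volume(&mut self) {
        self.descriptor = None;
        self.blocks.clear();
    }

    /// Stores a block; fails if no volume has been registered.
    fn put(&mut self, sequence_hash: u64, block: &[u8]) -> Result<(), String> {
        if self.descriptor.is_none() {
            return Err("no volume registered".to_string());
        }
        self.blocks.insert(sequence_hash, block.to_vec());
        Ok(())
    }

    fn get(&self, sequence_hash: u64) -> Option<Vec<u8>> {
        self.blocks.get(&sequence_hash).cloned()
    }
}
```

A backend for a block device or RDMA-capable volume would implement the same trait, so the caller never touches mounting or file system details.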
To support external storage optimizations without modifying KVBM logic, we provide an event plane (supporting NATS and ZMQ transports) that emits lifecycle events for all block operations:
Each KVEvent (~100 bytes) contains:
For scalability, the system batches and publishes these events periodically (e.g., every ~10s, or dynamically based on system load).
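A minimal sketch of periodic batching follows: events accumulate until either a size limit or a time interval is reached, and then the whole batch is handed to the caller for publishing. The `KvEvent` fields, `EventKind` variants, and `EventBatcher` are assumptions for illustration. The current time is passed in explicitly so the behavior can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical event kinds; the actual set of lifecycle events may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventKind {
    Stored,
    Removed,
}

/// Hypothetical event payload built around the two hashes named in the text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct KvEvent {
    pub sequence_hash: u64,
    pub prefix_hash: u64,
    pub kind: EventKind,
}

pub struct EventBatcher {
    pending: Vec<KvEvent>,
    max_batch: usize,
    interval: Duration,
    last_flush: Instant,
}

impl EventBatcher {
    pub fn new(max_batch: usize, interval: Duration, now: Instant) -> Self {
        Self {
            pending: Vec::new(),
            max_batch,
            interval,
            last_flush: now,
        }
    }

    /// Queues an event. Returns the accumulated batch when the size limit
    /// is reached or the interval has elapsed since the last flush.
    pub fn push(&mut self, event: KvEvent, now: Instant) -> Option<Vec<KvEvent>> {
        self.pending.push(event);
        let due = now.duration_since(self.last_flush) >= self.interval;
        if self.pending.len() >= self.max_batch || due {
            self.last_flush = now;
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}
```

A load-adaptive variant could adjust `interval` or `max_batch` at runtime; this sketch keeps both fixed for clarity.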
This section provides an overview for storage providers interested in integrating as a custom backend to KVBM. This is optional for KVBM integration with a backend.
External storage systems are not tightly coupled with Dynamo’s execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
registerVolume() APIs
get() and put())
To enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream:
prefix_hash, sequence_hash, and associated metadata
With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies:
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
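As an example of what a storage provider might build from the event stream, the sketch below maintains a parent-link index keyed by sequence_hash and a simple access counter for hot/cold decisions. It assumes that prefix_hash identifies the preceding block in the sequence, which is an interpretation rather than a documented guarantee. `PrefixIndex` and all of its methods are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical storage-side index built from lifecycle events.
#[derive(Default)]
pub struct PrefixIndex {
    parent: HashMap<u64, Option<u64>>, // sequence_hash -> prefix_hash (None for roots)
    hits: HashMap<u64, u32>,           // access count per block
}

impl PrefixIndex {
    pub fn on_stored(&mut self, sequence_hash: u64, prefix_hash: Option<u64>) {
        self.parent.insert(sequence_hash, prefix_hash);
    }

    pub fn on_removed(&mut self, sequence_hash: u64) {
        self.parent.remove(&sequence_hash);
        self.hits.remove(&sequence_hash);
    }

    /// Counts an access, ignoring blocks the index does not know about.
    pub fn on_access(&mut self, sequence_hash: u64) {
        if self.parent.contains_key(&sequence_hash) {
            *self.hits.entry(sequence_hash).or_insert(0) += 1;
        }
    }

    /// Walks parent links and returns the chain from root to `leaf`,
    /// or None if any block along the way is missing.
    pub fn prefix_chain(&self, leaf: u64) -> Option<Vec<u64>> {
        let mut chain = Vec::new();
        let mut cur = Some(leaf);
        while let Some(h) = cur {
            // Guard against malformed input that forms a cycle.
            if chain.len() > self.parent.len() {
                return None;
            }
            chain.push(h);
            cur = *self.parent.get(&h)?;
        }
        chain.reverse();
        Some(chain)
    }

    /// Example tiering policy: a block is hot once it reaches `threshold` hits.
    pub fn is_hot(&self, sequence_hash: u64, threshold: u32) -> bool {
        self.hits.get(&sequence_hash).copied().unwrap_or(0) >= threshold
    }
}
```

A provider could promote hot blocks to a faster medium and demote the rest, all without any change to KVBM itself.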
KVBM integrates with inference frameworks (SGLang, TensorRT-LLM, vLLM) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
There are two components of the interface:

Typical integration of KVBM with inference frameworks (vLLM shown as example)

Onboarding blocks from Host to Device

Onboarding blocks from Disk to Device

Offloading blocks from Device to Host & Disk