---
title: KVBM Guide
subtitle: Enable KV offloading using KV Block Manager (KVBM) for Dynamo deployments
---

The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.

KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack.

This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.

## Quick Start

## Run KVBM Standalone

KVBM can be used independently without using the rest of the Dynamo stack:

```bash
pip install kvbm
```

See the [support matrix](/dynamo/resources/support-matrix) for version compatibility.

### Build from Source

To build KVBM from source, see the detailed instructions in the [KVBM bindings README](https://github.com/ai-dynamo/dynamo/tree/v1.0.1/lib/bindings/kvbm/README.md#build-from-source).

## Run KVBM in Dynamo with vLLM

### Docker Setup

```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d

# Build a dynamo vLLM container (KVBM is built in by default)
python container/render.py --framework vllm --target runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .

# Launch the container
container/run.sh --image dynamo:latest-vllm-runtime -it --mount-workspace --use-nixl-gds
```

### Aggregated Serving

```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```

#### Verify Deployment

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false,
    "max_tokens": 10
  }'
```

#### Alternative: Using Direct vllm serve

You can also use `vllm serve` directly with KVBM:

```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```

## Run KVBM in Dynamo with TensorRT-LLM

**Prerequisites:**

- Ensure `etcd` and `nats` are running before starting
- KVBM only supports TensorRT-LLM's PyTorch backend
- Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
- KVBM requires TensorRT-LLM v1.2.0rc2 or newer

### Docker Setup

```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d

# Build a dynamo TRTLLM container (KVBM is built in by default)
python container/render.py --framework trtllm --target runtime --output-short-filename
docker build -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile .

# Launch the container
container/run.sh --image dynamo:latest-trtllm-runtime -it --mount-workspace --use-nixl-gds
```

### Aggregated Serving

```bash
# Write the LLM API config
cat > "/tmp/kvbm_llm_api_config.yaml" <
```

**Learn more:** See the [SGLang HiCache Integration Guide](/dynamo/integrations/sg-lang-hi-cache) for detailed configuration, deployment examples, and troubleshooting.

## Disaggregated Serving with KVBM

KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.

### Disaggregated Serving with vLLM

```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: requires at least 2 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh

# 2P2D - two prefill workers and two decode workers
# NOTE: requires at least 4 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```

### Disaggregated Serving with TRT-LLM

```bash
# Launch prefill worker with KVBM
python3 -m dynamo.trtllm \
  --model-path Qwen/Qwen3-0.6B \
  --served-model-name Qwen/Qwen3-0.6B \
  --extra-engine-args /tmp/kvbm_llm_api_config.yaml \
  --disaggregation-mode prefill &
```

## Configuration

### Cache Tier Configuration

Configure KVBM cache tiers using environment variables:

```bash
# Option 1: CPU cache only (GPU -> CPU offloading)
export DYN_KVBM_CPU_CACHE_GB=4  # 4GB of pinned CPU memory

# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4
export DYN_KVBM_DISK_CACHE_GB=8  # 8GB of disk

# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
# NOTE: Experimental, may not provide optimal performance
# NOTE: Disk offload filtering not supported with this option
export DYN_KVBM_DISK_CACHE_GB=8
```

You can also specify exact block counts instead of GB:

- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`

> [!NOTE]
> KVBM is a write-through cache, and it can be misconfigured. Each tier you enable should be at least as large as the tier before it. For example, if your GPU dedicates 100GB of memory to KV cache storage, set `DYN_KVBM_CPU_CACHE_GB >= 100`. The same applies to the disk cache: `DYN_KVBM_DISK_CACHE_GB >= DYN_KVBM_CPU_CACHE_GB`. If the CPU cache is smaller than the device cache, _there will be no benefit from KVBM_. In many cases you will see performance degradation, because KVBM will churn by offloading blocks from the GPU to CPU after every forward pass.

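
The sizing rule in the note can be checked before launch with a small shell sketch. This is only an illustration: `check_kvbm_tiers` is a hypothetical helper name, and `GPU_KV_CACHE_GB` is a hypothetical variable that you would set by hand from your engine's KV cache configuration (it is not a KVBM setting). The check assumes whole-GB integer values.

```bash
# Hypothetical pre-launch check for the tier sizing rule (whole-GB values only).
# Args: GPU KV cache size in GB, CPU cache size in GB, optional disk cache size in GB.
check_kvbm_tiers() {
  local gpu_gb="$1" cpu_gb="$2" disk_gb="${3:-}"

  # The CPU tier must be at least as large as the GPU KV cache.
  if [ "$cpu_gb" -lt "$gpu_gb" ]; then
    echo "CPU cache (${cpu_gb}GB) is smaller than GPU KV cache (${gpu_gb}GB)"
    return 1
  fi

  # If a disk tier is given, it must be at least as large as the CPU tier.
  if [ -n "$disk_gb" ] && [ "$disk_gb" -lt "$cpu_gb" ]; then
    echo "disk cache (${disk_gb}GB) is smaller than CPU cache (${cpu_gb}GB)"
    return 1
  fi

  echo "ok"
}
```

A possible call, once the variables are exported, is `check_kvbm_tiers "$GPU_KV_CACHE_GB" "$DYN_KVBM_CPU_CACHE_GB" "$DYN_KVBM_DISK_CACHE_GB"`. The function prints `ok` when the capacities are non-decreasing and returns a non-zero status otherwise.
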

To find the minimum value of `DYN_KVBM_CPU_CACHE_GB` for your setup, consult your LLM engine's KV cache configuration.

### SSD Lifespan Protection

When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step. For example, a block starts at frequency 1, reaches 2 on its first cache hit and qualifies for disk offload, and falls back to 1 if a decay step occurs before the next hit.

To disable disk offload filtering:

```bash
export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
```

## Enable and View KVBM Metrics

### Setup Monitoring Stack

```bash
# Start basic services (etcd & natsd), along with Prometheus and Grafana
docker compose -f deploy/docker-observability.yml up -d
```

### Enable Metrics for vLLM

```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
  --enforce-eager \
  --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_connector_module_path":"kvbm.vllm_integration.connector","kv_role":"kv_both"}'
```

### Enable Metrics for TensorRT-LLM

```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python3 -m dynamo.trtllm \
  --model-path Qwen/Qwen3-0.6B \
  --served-model-name Qwen/Qwen3-0.6B \
  --extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```

### Firewall Configuration (Optional)

```bash
# If firewall blocks KVBM metrics ports
sudo ufw allow 6880/tcp
```

### View Metrics

Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.

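
Before opening Grafana, it can help to confirm that KVBM metrics are being exported at all. The sketch below assumes the metrics are served in Prometheus text format on port 6880 (the port from the firewall example) at the conventional `/metrics` path; the path is an assumption, not something this guide confirms. `kvbm_metric_lines` is a hypothetical helper name.

```bash
# Hypothetical helper: keep only KVBM sample lines from Prometheus text output.
# Reads metrics text on stdin; "# HELP" and "# TYPE" comment lines are dropped
# because they do not start with "kvbm_".
kvbm_metric_lines() {
  # "|| true" keeps the exit status zero when no KVBM lines are present.
  grep -E '^kvbm_' || true
}

# Assumed endpoint (port 6880, conventional /metrics path):
# curl -s http://localhost:6880/metrics | kvbm_metric_lines
```

If the commented `curl` pipeline prints nothing, check that `DYN_KVBM_METRICS=true` was set when the worker was launched.
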
### Available Metrics

| Metric | Description |
|--------|-------------|
| `kvbm_matched_tokens` | Number of matched tokens |
| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |

## Benchmarking KVBM

Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.

### Setup

```bash
git clone https://github.com/LMCache/LMBenchmark.git
cd LMBenchmark/synthetic-multi-round-qa
```

### Run Benchmark

```bash
# Synthetic multi-turn chat dataset
# Arguments: model, endpoint, output prefix, qps
./long_input_short_output_run.sh \
  "Qwen/Qwen3-0.6B" \
  "http://localhost:8000" \
  "benchmark_kvbm" \
  1
```

Average TTFT and other performance numbers will be in the output.

> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.

### Baseline Comparison

#### vLLM Baseline (without KVBM)

```bash
vllm serve Qwen/Qwen3-0.6B
```

#### TensorRT-LLM Baseline (without KVBM)

```bash
# Create config without kv_connector_config
cat > "/tmp/llm_api_config.yaml" <