FlexKV Integration in Dynamo#
Introduction#
FlexKV is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud’s TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM, and SGLang.
Key Features#
Multi-level caching: offloads KV cache across CPU memory, local SSD, and scalable cloud storage
Distributed KV cache reuse: Share KV cache across multiple nodes using distributed RadixTree
High-performance I/O: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
Asynchronous operations: Get and put operations can overlap with computation through prefetching
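The multi-level caching idea can be pictured as a tiered lookup: a CPU-memory hit is served directly, while an SSD hit is promoted back into the faster tier. The sketch below is purely illustrative (FlexKV's real implementation operates on token blocks and is asynchronous; none of these class or method names are FlexKV APIs):

```python
# Illustrative two-tier KV cache lookup (NOT FlexKV's actual API).
class TieredKVCache:
    def __init__(self):
        self.cpu = {}   # fast tier: host memory
        self.ssd = {}   # slower tier: local SSD

    def put(self, block_hash, kv_block):
        self.cpu[block_hash] = kv_block

    def evict_to_ssd(self, block_hash):
        self.ssd[block_hash] = self.cpu.pop(block_hash)

    def get(self, block_hash):
        if block_hash in self.cpu:
            return self.cpu[block_hash], "cpu"
        if block_hash in self.ssd:
            # Promote to the faster tier on an SSD hit.
            kv = self.ssd.pop(block_hash)
            self.cpu[block_hash] = kv
            return kv, "ssd"
        return None, "miss"

cache = TieredKVCache()
cache.put("blk0", b"kv-bytes")
cache.evict_to_ssd("blk0")
print(cache.get("blk0")[1])  # first lookup hits SSD and promotes
print(cache.get("blk0")[1])  # second lookup now hits CPU
```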
Prerequisites#
Dynamo installed with vLLM support
Infrastructure services running:
docker compose -f deploy/docker-compose.yml up -d
FlexKV dependencies (for SSD offloading):
apt install liburing-dev libxxhash-dev
Quick Start#
Enable FlexKV#
Set the DYNAMO_USE_FLEXKV environment variable and use the --connector flexkv flag:
export DYNAMO_USE_FLEXKV=1
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
Aggregated Serving#
Basic Setup#
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Start vLLM worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
With KV-Aware Routing#
For multi-worker deployments with KV-aware routing to maximize cache reuse:
# Terminal 1: Start frontend with KV router
python -m dynamo.frontend \
--router-mode kv \
--router-reset-states &
# Terminal 2: Worker 1
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
CUDA_VISIBLE_DEVICES=0 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &
# Terminal 3: Worker 2
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
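Each worker publishes KV events on its own ZMQ endpoint, so the JSON passed to --kv-events-config differs only in the port. Rather than hand-editing it per worker, the string can be generated; this small helper is just a convenience sketch using the field names from the commands above:

```python
import json

def kv_events_config(worker_index, base_port=20080):
    """Build the --kv-events-config JSON for a given worker index."""
    return json.dumps({
        "publisher": "zmq",
        "topic": "kv-events",
        "endpoint": f"tcp://*:{base_port + worker_index}",
        "enable_kv_cache_events": True,
    })

print(kv_events_config(0))  # worker 1 -> port 20080
print(kv_events_config(1))  # worker 2 -> port 20081
```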
Disaggregated Serving#
FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers.
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Decode worker (without FlexKV)
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl &
# Terminal 3: Prefill worker (with FlexKV)
DYN_VLLM_KV_EVENT_PORT=20081 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker \
--connector nixl flexkv
Configuration#
Environment Variables#
| Variable | Description | Default |
|---|---|---|
| `DYNAMO_USE_FLEXKV` | Enable FlexKV integration | Unset (disabled) |
| `FLEXKV_CPU_CACHE_GB` | CPU memory cache size in GB | Required |
| `FLEXKV_CONFIG_PATH` | Path to FlexKV YAML config file | Not set |
| `FLEXKV_SERVER_RECV_PORT` | IPC port for FlexKV server | Auto |
CPU-Only Offloading#
For simple CPU memory offloading:
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
CPU + SSD Tiered Offloading#
For multi-tier offloading with SSD storage, create a configuration file:
cat > ./flexkv_config.yml <<EOF
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
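Note that ssd_cache_dir takes multiple directories separated by semicolons. The sketch below shows how such a value splits into individual mount points, assuming (my assumption, not documented behavior) that the SSD capacity is divided evenly across them:

```python
# Illustrative parsing of the ssd_cache_dir option (not FlexKV's actual loader).
config = {
    "cpu_cache_gb": 32,
    "ssd_cache_gb": 1024,
    "ssd_cache_dir": "/data0/flexkv_ssd/;/data1/flexkv_ssd/",
    "enable_gds": False,
}

dirs = [d for d in config["ssd_cache_dir"].split(";") if d]
per_dir_gb = config["ssd_cache_gb"] / len(dirs)   # assumed even split
for d in dirs:
    print(f"{d}: ~{per_dir_gb:.0f} GB")
```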
Configuration Options#
| Option | Description |
|---|---|
| `cpu_cache_gb` | CPU memory cache size in GB |
| `ssd_cache_gb` | SSD cache size in GB |
| `ssd_cache_dir` | SSD cache directories (semicolon-separated for multiple SSDs) |
| `enable_gds` | Enable GPU Direct Storage for SSD I/O |
Note: For full configuration options, see the FlexKV Configuration Reference.
Distributed KV Cache Reuse#
FlexKV supports distributed KV cache reuse to share cache across multiple nodes, built on three mechanisms:
Distributed RadixTree: Each node maintains a local snapshot of the global index
Lease Mechanism: Ensures data validity during cross-node transfers
RDMA-based Transfer: Uses Mooncake Transfer Engine for high-performance KV cache transfer
For setup instructions, see the FlexKV Distributed Reuse Guide.
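The lease mechanism can be sketched as a time-bounded read grant: a node may copy a remote block only while its lease is unexpired, so the owner cannot invalidate the block out from under an in-flight transfer. This is a hypothetical illustration of the concept, not FlexKV's wire protocol:

```python
import time

class Lease:
    """Time-bounded read grant on a remote KV block (illustrative only)."""
    def __init__(self, block_hash, ttl_s):
        self.block_hash = block_hash
        self.expires_at = time.monotonic() + ttl_s

    def valid(self):
        return time.monotonic() < self.expires_at

def transfer_block(lease, fetch):
    # Check before and after the copy: if the lease lapsed mid-transfer,
    # the source block may have been evicted, so the data is discarded.
    if not lease.valid():
        raise TimeoutError("lease expired before transfer")
    data = fetch(lease.block_hash)
    if not lease.valid():
        raise TimeoutError("lease expired during transfer")
    return data

lease = Lease("blk42", ttl_s=5.0)
print(transfer_block(lease, lambda h: f"kv-data-for-{h}"))
```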
Architecture#
FlexKV consists of three core modules:
StorageEngine#
Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.
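Because KV cache is stored at block granularity with the same shape as in GPU memory, the footprint of one block is easy to estimate: K and V tensors for every layer and KV head, times the tokens per block. The dimensions below are illustrative placeholders, not any real model's config:

```python
def kv_block_bytes(tokens_per_block, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes for one KV cache block: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * tokens_per_block * dtype_bytes

# Illustrative dimensions (placeholders, not a real model config):
block = kv_block_bytes(tokens_per_block=64, num_layers=28, num_kv_heads=8, head_dim=128)
print(f"{block / 2**20:.1f} MiB per 64-token block")   # 7.0 MiB

# Blocks that fit in a 32 GB CPU cache (FLEXKV_CPU_CACHE_GB=32):
print(32 * 2**30 // block)
```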
GlobalCacheEngine#
The control plane that determines data transfer direction and identifies source/destination block IDs. Includes:
RadixTree for prefix matching
Memory pool to track space usage and trigger eviction
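Prefix matching over cached token blocks can be sketched as a trie keyed by block hashes: the length of the longest matched prefix tells the engine how many blocks can be reused instead of recomputed. This is a simplified stand-in for FlexKV's RadixTree, not its implementation:

```python
class RadixNode:
    def __init__(self):
        self.children = {}

class BlockTrie:
    """Simplified prefix index over sequences of block hashes."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, block_hashes):
        node = self.root
        for h in block_hashes:
            node = node.children.setdefault(h, RadixNode())

    def longest_prefix(self, block_hashes):
        node, matched = self.root, 0
        for h in block_hashes:
            if h not in node.children:
                break
            node = node.children[h]
            matched += 1
        return matched

trie = BlockTrie()
trie.insert(["h0", "h1", "h2"])                 # blocks from a cached request
print(trie.longest_prefix(["h0", "h1", "h9"]))  # 2 blocks reusable
```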
TransferEngine#
The data plane that executes data transfers:
Multi-threading for parallel transfers
High-performance I/O (io_uring, GDS)
Asynchronous operations overlapping with computation
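The overlap between transfers and computation can be sketched with a thread pool: a KV fetch is submitted asynchronously, compute proceeds, and the result is awaited only when needed. This is an illustration of the pattern; FlexKV itself uses dedicated transfer threads and io_uring/GDS underneath:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_kv(block_hash):
    time.sleep(0.05)            # simulated SSD/CPU read
    return f"kv-{block_hash}"

def compute_step():
    time.sleep(0.05)            # simulated GPU work
    return "done"

with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(fetch_kv, "blk7")   # async get (prefetch)
    result = compute_step()                  # overlaps with the fetch
    kv = future.result()                     # await only when needed

print(kv, result)  # the fetch and the compute ran concurrently
```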
Verify Deployment#
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'