FlexKV | NVIDIA Dynamo Documentation

Introduction

FlexKV is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud’s TACO team and NVIDIA in collaboration with the community. It acts as a unified KV caching layer for inference engines such as SGLang, TensorRT-LLM, and vLLM.

Key Features

Multi-level caching: CPU memory, local SSD, and scalable storage (cloud storage) for KV cache offloading
Distributed KV cache reuse: Share KV cache across multiple nodes using distributed RadixTree
High-performance I/O: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
Asynchronous operations: Get and put operations can overlap with computation through prefetching

Prerequisites

Dynamo installed with vLLM support

Infrastructure services running:

$ docker compose -f deploy/docker-compose.yml up -d

FlexKV dependencies (for SSD offloading):
```
$ apt install liburing-dev libxxhash-dev
```

Quick Start

Enable FlexKV

Set the DYNAMO_USE_FLEXKV environment variable and use the --kv-transfer-config flag:

$ export DYNAMO_USE_FLEXKV=1
$ python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
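The JSON passed to --kv-transfer-config is easy to break with shell quoting. As a minimal sketch, the same command line can be assembled in Python so the JSON is generated rather than typed. The helper name `flexkv_launch_command` is hypothetical; the flag, module, and connector values are the ones shown above.

```python
import json
import shlex


def flexkv_launch_command(model: str) -> list[str]:
    """Return the argv for a FlexKV-enabled vLLM worker as used in this guide.

    Hypothetical helper for illustration; it only assembles strings.
    """
    kv_transfer_config = {
        "kv_connector": "FlexKVConnectorV1",
        "kv_role": "kv_both",
    }
    return [
        "python", "-m", "dynamo.vllm",
        "--model", model,
        # Compact separators produce JSON without spaces, as in the command above.
        "--kv-transfer-config",
        json.dumps(kv_transfer_config, separators=(",", ":")),
    ]


argv = flexkv_launch_command("Qwen/Qwen3-0.6B")
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print("DYNAMO_USE_FLEXKV=1 " + shlex.join(argv))
```

When launching from Python, `argv` can be passed to `subprocess.run` together with an environment that sets DYNAMO_USE_FLEXKV=1, which avoids shell quoting altogether.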

Aggregated Serving

Basic Setup

$ # Terminal 1: Start frontend
$ python -m dynamo.frontend &

$ # Terminal 2: Start vLLM worker with FlexKV
$ DYNAMO_USE_FLEXKV=1 \
> FLEXKV_CPU_CACHE_GB=32 \
>   python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'

With KV-Aware Routing

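For multi-worker deployments, enable KV-aware routing on the frontend to maximize cache reuse. Give each worker its own FLEXKV_SERVER_RECV_PORT endpoint, GPU, and KV events endpoint: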

$ # Terminal 1: Start frontend with KV router
$ python -m dynamo.frontend \
>     --router-mode kv \
>     --router-reset-states &

$ # Terminal 2: Worker 1
$ DYNAMO_USE_FLEXKV=1 \
> FLEXKV_CPU_CACHE_GB=32 \
> FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
> CUDA_VISIBLE_DEVICES=0 \
> python -m dynamo.vllm \
>     --model Qwen/Qwen3-0.6B \
>     --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}' \
>     --gpu-memory-utilization 0.2 \
>     --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &

$ # Terminal 3: Worker 2
$ DYNAMO_USE_FLEXKV=1 \
> FLEXKV_CPU_CACHE_GB=32 \
> FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
> CUDA_VISIBLE_DEVICES=1 \
> python -m dynamo.vllm \
>     --model Qwen/Qwen3-0.6B \
>     --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}' \
>     --gpu-memory-utilization 0.2 \
>     --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
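The two worker commands above differ only in the GPU index, the FlexKV IPC endpoint, and the KV events port. The sketch below derives those per-worker values from a worker index, which may help when scaling past two workers. The function `worker_settings` and the rule "port = base port + index" are assumptions that simply generalize the two examples; they are not part of Dynamo or FlexKV.

```python
import json


def worker_settings(index: int, base_port: int = 20080) -> dict:
    """Derive environment variables and the KV events config for worker `index`.

    Hypothetical helper: it reproduces the pattern of the two workers shown
    above (index 0 and index 1) and extends it to further indices.
    """
    env = {
        "DYNAMO_USE_FLEXKV": "1",
        "FLEXKV_CPU_CACHE_GB": "32",
        # Each worker needs its own IPC endpoint for the FlexKV server.
        "FLEXKV_SERVER_RECV_PORT": f"ipc:///tmp/flexkv_server_{index}",
        "CUDA_VISIBLE_DEVICES": str(index),
    }
    kv_events = {
        "publisher": "zmq",
        "topic": "kv-events",
        # Each worker publishes KV events on its own TCP port.
        "endpoint": f"tcp://*:{base_port + index}",
        "enable_kv_cache_events": True,
    }
    return {
        "env": env,
        "kv_events_config": json.dumps(kv_events, separators=(",", ":")),
    }


for i in range(2):
    settings = worker_settings(i)
    print(i, settings["env"]["FLEXKV_SERVER_RECV_PORT"], settings["kv_events_config"])
```

The design point is that nothing is shared between workers on one host: colliding IPC endpoints or event ports are the most likely misconfiguration when copying a worker command by hand.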

Disaggregated Serving

FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers.

$ # Terminal 1: Start frontend
$ python -m dynamo.frontend &

$ # Terminal 2: Decode worker (without FlexKV)
$ CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &

$ # Terminal 3: Prefill worker (with FlexKV)
$ DYN_VLLM_KV_EVENT_PORT=20081 \
> VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
> DYNAMO_USE_FLEXKV=1 \
> FLEXKV_CPU_CACHE_GB=32 \
> CUDA_VISIBLE_DEVICES=1 \
>   python -m dynamo.vllm \
>   --model Qwen/Qwen3-0.6B \
>   --disaggregation-mode prefill \
>   --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'

Configuration

Environment Variables

Variable	Description	Default
`DYNAMO_USE_FLEXKV`	Enable FlexKV integration	`0` (disabled)
`FLEXKV_CPU_CACHE_GB`	CPU memory cache size in GB	Required
`FLEXKV_CONFIG_PATH`	Path to FlexKV YAML config file	Not set
`FLEXKV_SERVER_RECV_PORT`	IPC endpoint for the FlexKV server (for example, `ipc:///tmp/flexkv_server_0`)	Auto

CPU-Only Offloading

For CPU-memory-only offloading, leave FLEXKV_CONFIG_PATH unset and set the cache size:

$ unset FLEXKV_CONFIG_PATH
$ export FLEXKV_CPU_CACHE_GB=32

CPU + SSD Tiered Offloading

For multi-tier offloading with SSD storage, create a configuration file:

$ cat > ./flexkv_config.yml <<EOF
> cpu_cache_gb: 32
> ssd_cache_gb: 1024
> ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
> enable_gds: false
> EOF

$ export FLEXKV_CONFIG_PATH="./flexkv_config.yml"

Configuration Options

Option	Description
`cpu_cache_gb`	CPU memory cache size in GB
`ssd_cache_gb`	SSD cache size in GB
`ssd_cache_dir`	SSD cache directories (semicolon-separated for multiple SSDs)
`enable_gds`	Enable GPU Direct Storage for SSD I/O

Note: For full configuration options, see the FlexKV Configuration Reference.
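Because `ssd_cache_dir` packs several paths into one semicolon-separated string, a typo in one path is easy to miss. The following pre-flight sketch splits the value and reports directories that are absent. Both helper names are hypothetical, and the check assumes the directories should exist before the worker starts, which this guide does not state.

```python
from pathlib import Path


def split_ssd_cache_dirs(value: str) -> list[str]:
    """Split a semicolon-separated ssd_cache_dir value into individual paths.

    Empty segments (for example from a trailing ';') are dropped.
    """
    return [part.strip() for part in value.split(";") if part.strip()]


def missing_cache_dirs(value: str) -> list[str]:
    """Return the configured directories that do not exist on this machine."""
    return [d for d in split_ssd_cache_dirs(value) if not Path(d).is_dir()]


dirs = split_ssd_cache_dirs("/data0/flexkv_ssd/;/data1/flexkv_ssd/")
print(dirs)
print("missing:", missing_cache_dirs("/data0/flexkv_ssd/;/data1/flexkv_ssd/"))
```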

Distributed KV Cache Reuse

FlexKV supports distributed KV cache reuse, which shares cached KV blocks across multiple nodes. It is built on three mechanisms:

Distributed RadixTree: Each node maintains a local snapshot of the global index
Lease Mechanism: Ensures data validity during cross-node transfers
RDMA-based Transfer: Uses Mooncake Transfer Engine for high-performance KV cache transfer

For setup instructions, see the FlexKV Distributed Reuse Guide.
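To make the lease idea concrete, here is a toy model of how a lease can protect a block during a cross-node transfer: the owner promises not to evict a block until the lease expires, so a remote reader that starts a transfer within that window reads valid data. This is a generic illustration of leases, not FlexKV's implementation; the class `LeaseTable`, its methods, and the lease duration are all hypothetical.

```python
class LeaseTable:
    """Tracks, per block, the time until which the block must not be evicted."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self._expiry: dict[int, float] = {}  # block_id -> lease expiry time

    def grant(self, block_id: int, now: float) -> float:
        """Grant or extend a lease; an existing lease is never shortened."""
        expiry = max(now + self.lease_seconds, self._expiry.get(block_id, 0.0))
        self._expiry[block_id] = expiry
        return expiry

    def is_valid(self, block_id: int, now: float) -> bool:
        """A remote reader may rely on the block only while the lease is live."""
        return now < self._expiry.get(block_id, 0.0)

    def can_evict(self, block_id: int, now: float) -> bool:
        """The owner may evict only blocks without a live lease."""
        return not self.is_valid(block_id, now)


leases = LeaseTable(lease_seconds=5.0)
leases.grant(block_id=7, now=100.0)
print(leases.can_evict(7, now=104.0))  # lease still live
print(leases.can_evict(7, now=105.0))  # lease expired
```

Timestamps are passed in explicitly so the logic stays deterministic; a real system would also have to account for clock differences between nodes.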

Architecture

FlexKV consists of three core modules:

StorageEngine

Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.
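Block-level storage can be pictured with a short sketch that groups a token sequence into fixed-size blocks. The parameter `tokens_per_block` and the choice to leave a trailing partial block uncached are assumptions for illustration; the actual block size and handling in FlexKV are not specified here.

```python
def group_into_blocks(token_ids: list[int], tokens_per_block: int) -> list[tuple[int, ...]]:
    """Split token_ids into consecutive full blocks of tokens_per_block tokens.

    Assumption: a trailing partial block is not returned, since only complete
    blocks are treated as cacheable units in this sketch.
    """
    if tokens_per_block <= 0:
        raise ValueError("tokens_per_block must be positive")
    full_blocks = len(token_ids) // tokens_per_block
    return [
        tuple(token_ids[i * tokens_per_block:(i + 1) * tokens_per_block])
        for i in range(full_blocks)
    ]


blocks = group_into_blocks(list(range(10)), tokens_per_block=4)
print(blocks)  # two full blocks; tokens 8 and 9 remain ungrouped
```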

GlobalCacheEngine

The control plane, which determines the direction of each data transfer and identifies the source and destination block IDs. It includes:

RadixTree for prefix matching
Memory pool to track space usage and trigger eviction
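Prefix matching over blocks can be sketched with a plain trie keyed by token blocks: a lookup walks the request's blocks from the start and returns the cached block IDs for the longest matching prefix. A real radix tree additionally compresses chains of single-child nodes, and FlexKV's RadixTree will differ in detail; `BlockPrefixIndex` and its methods are hypothetical names.

```python
class BlockPrefixIndex:
    """Toy prefix index: each trie edge is one token block (a tuple of token IDs)."""

    def __init__(self) -> None:
        self._root: dict = {}

    def insert(self, blocks: list[tuple[int, ...]], block_ids: list[int]) -> None:
        """Record the cache block ID for each block along this sequence."""
        node = self._root
        for block, block_id in zip(blocks, block_ids):
            # Keep the existing entry if this prefix is already indexed.
            entry = node.setdefault(block, {"id": block_id, "children": {}})
            node = entry["children"]

    def match_prefix(self, blocks: list[tuple[int, ...]]) -> list[int]:
        """Return cached block IDs for the longest prefix of `blocks` in the index."""
        matched: list[int] = []
        node = self._root
        for block in blocks:
            entry = node.get(block)
            if entry is None:
                break  # first miss ends the reusable prefix
            matched.append(entry["id"])
            node = entry["children"]
        return matched


index = BlockPrefixIndex()
index.insert([(1, 2), (3, 4), (5, 6)], block_ids=[10, 11, 12])
print(index.match_prefix([(1, 2), (3, 4), (9, 9)]))  # first two blocks are reusable
```

Matching stops at the first miss because a KV block is only valid for the exact token prefix that produced it.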

TransferEngine

The data plane, which executes data transfers using:

Multi-threading for parallel transfers
High-performance I/O (io_uring, GDS)
Asynchronous operations overlapping with computation
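The overlap of transfers with computation can be illustrated with Python's standard thread pool: block loads are submitted first, computation proceeds while they run, and results are collected only when needed. This is a conceptual sketch, not the TransferEngine API; `load_block` and `compute_step` are hypothetical stand-ins that sleep instead of doing real I/O or GPU work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_block(block_id: int) -> bytes:
    """Stand-in for reading one KV block from CPU memory or SSD."""
    time.sleep(0.01)
    return bytes([block_id]) * 4


def compute_step(state: int) -> int:
    """Stand-in for a unit of model computation."""
    time.sleep(0.01)
    return state + 1


def run_with_prefetch(block_ids: list[int]) -> tuple[int, list[bytes]]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit all loads up front so they run in background threads.
        futures = [pool.submit(load_block, b) for b in block_ids]
        # This computation runs while the loads are in flight.
        state = compute_step(0)
        # Block only at the point where the data is actually required.
        loaded = [f.result() for f in futures]
    return state, loaded


state, loaded = run_with_prefetch([1, 2, 3])
print(state, loaded)
```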

Verify Deployment

$ curl localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-0.6B",
>     "messages": [{"role": "user", "content": "Hello!"}],
>     "stream": false,
>     "max_tokens": 30
>   }'
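The same check can be scripted with the Python standard library, which is convenient in smoke tests. The sketch below builds an equivalent request to the curl command above; `build_chat_request` and `send` are hypothetical helper names, and the base URL assumes the frontend listens on localhost:8000 as in that command.

```python
import json
import urllib.request


def build_chat_request(base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
        "max_tokens": 30,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


request = build_chat_request()
print(request.get_method(), request.full_url)
```

Calling `send(request)` performs the actual HTTP call and requires a running deployment; it is left out of the top-level code so the snippet can be imported without network access.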