FlexKV is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud’s TACO team and NVIDIA in collaboration with the community. It acts as a unified KV caching layer for inference engines such as SGLang, TensorRT-LLM, and vLLM.
Set the DYNAMO_USE_FLEXKV environment variable and use the --kv-transfer-config flag:
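As a rough illustration of how the two settings fit together, the sketch below assembles an environment and a command line in Python without launching anything. Only the variable name DYNAMO_USE_FLEXKV and the --kv-transfer-config flag come from the text above. The value "1", the JSON keys, the connector name, the entrypoint, and the model are placeholders chosen for illustration and should be replaced with the values from the engine's own documentation.

```python
import json
import os
import shlex


def build_launch():
    """Return (env, argv) for a hypothetical FlexKV-enabled worker launch."""
    # Assumption: "1" turns the feature on; the text only names the variable.
    env = dict(os.environ, DYNAMO_USE_FLEXKV="1")

    # Assumption: placeholder connector settings passed as a JSON string.
    kv_transfer_config = {
        "kv_connector": "<flexkv-connector-name>",
        "kv_role": "kv_both",
    }

    # Assumption: hypothetical entrypoint and model name.
    argv = [
        "python", "-m", "<engine-entrypoint>",
        "--model", "<model>",
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch()
    # Print the equivalent shell command instead of running it.
    print("DYNAMO_USE_FLEXKV=" + env["DYNAMO_USE_FLEXKV"], shlex.join(argv))
```

Passing the configuration through json.dumps keeps the quoting correct when the command is later handed to a shell or to subprocess.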
For multi-worker deployments with KV-aware routing to maximize cache reuse:
FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers.
For simple CPU memory offloading:
For multi-tier offloading with SSD storage, create a configuration file:
Note: For full configuration options, see the FlexKV Configuration Reference.
FlexKV supports distributed KV cache reuse to share cache across multiple nodes. This enables:
For setup instructions, see the FlexKV Distributed Reuse Guide.
FlexKV consists of three core modules:
Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.
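The block-level grouping described above can be sketched in a few lines of Python. This is not FlexKV's implementation: the block size, the function names, and the use of chained SHA-256 hashes as block keys are assumptions, chosen because prefix-chained hashing is a common way to make identical prompt prefixes map to identical cache blocks.

```python
import hashlib
from typing import List, Sequence, Tuple


def split_into_blocks(token_ids: Sequence[int], tokens_per_block: int) -> List[Tuple[int, ...]]:
    """Group tokens into full blocks; a trailing partial block is left out."""
    n_full = len(token_ids) // tokens_per_block
    return [
        tuple(token_ids[i * tokens_per_block:(i + 1) * tokens_per_block])
        for i in range(n_full)
    ]


def block_keys(blocks: Sequence[Tuple[int, ...]]) -> List[str]:
    """Chain hashes so each key depends on every token up to and including its block."""
    keys: List[str] = []
    prev = b""
    for block in blocks:
        payload = prev + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).digest()
        keys.append(digest.hex())
        prev = digest
    return keys


if __name__ == "__main__":
    # Two prompts that share their first four tokens (hypothetical token IDs).
    a = block_keys(split_into_blocks([1, 2, 3, 4, 5, 6, 7, 8, 9], 4))
    b = block_keys(split_into_blocks([1, 2, 3, 4, 9, 9, 9, 9], 4))
    print(a[0] == b[0], a[1] == b[1])
```

Because each key covers the whole prefix, two requests can reuse a cached block only when everything before it also matches, which is the property a KV cache needs.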
The control plane that determines data transfer direction and identifies source/destination block IDs. Includes:
The data plane that executes data transfers:
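The split between a control plane that decides what to move and a data plane that moves it can be illustrated with a toy model. Everything in the sketch below is hypothetical: the Tier enum, the TransferOp record, the plan_load and execute functions, and the use of bytearrays as stand-ins for GPU, CPU, and SSD block pools. It shows the division of responsibilities only, not FlexKV's actual interfaces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Sequence, Tuple


class Tier(Enum):
    GPU = "gpu"
    CPU = "cpu"
    SSD = "ssd"


@dataclass(frozen=True)
class TransferOp:
    src: Tier
    dst: Tier
    src_block_ids: Tuple[int, ...]
    dst_block_ids: Tuple[int, ...]


def plan_load(keys: Sequence[str], cpu_index: Dict[str, int],
              ssd_index: Dict[str, int], free_gpu_blocks: Sequence[int]) -> List[TransferOp]:
    """Control plane: pick a source tier and block IDs for each reusable prefix block."""
    free = list(free_gpu_blocks)
    grouped = {Tier.CPU: ([], []), Tier.SSD: ([], [])}
    for key in keys:
        if not free:
            break
        if key in cpu_index:        # prefer the faster tier
            tier, src_id = Tier.CPU, cpu_index[key]
        elif key in ssd_index:
            tier, src_id = Tier.SSD, ssd_index[key]
        else:
            break                   # prefix reuse stops at the first miss
        grouped[tier][0].append(src_id)
        grouped[tier][1].append(free.pop(0))
    return [TransferOp(tier, Tier.GPU, tuple(s), tuple(d))
            for tier, (s, d) in grouped.items() if s]


def execute(op: TransferOp, pools: Dict[Tier, bytearray], block_bytes: int) -> None:
    """Data plane: copy each source block into its destination block."""
    src, dst = pools[op.src], pools[op.dst]
    for s, d in zip(op.src_block_ids, op.dst_block_ids):
        chunk = src[s * block_bytes:(s + 1) * block_bytes]
        if s < 0 or d < 0 or len(chunk) != block_bytes or (d + 1) * block_bytes > len(dst):
            raise IndexError("block id out of range")
        dst[d * block_bytes:(d + 1) * block_bytes] = chunk


if __name__ == "__main__":
    pools = {Tier.GPU: bytearray(8),
             Tier.CPU: bytearray(b"aaaabbbbcccc"),
             Tier.SSD: bytearray(b"xxxxyyyy")}
    ops = plan_load(["k0", "k1", "k2"], {"k0": 2}, {"k1": 1}, [1, 0])
    for op in ops:
        execute(op, pools, block_bytes=4)
    print(ops, bytes(pools[Tier.GPU]))
```

Keeping the plan as plain data (tiers plus block ID lists) means the planner can be tested without touching any memory, and the executor needs no knowledge of how blocks were matched.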