LMCache is a high-performance KV cache layer that speeds up LLM serving by enabling prefill-once, reuse-everywhere semantics. According to its official documentation, LMCache lets an LLM prefill each text only once: it stores the KV caches of all reusable texts so that the KV cache of any reused text (not necessarily a prefix) can be reused on any serving engine instance.
This document describes how LMCache is integrated into Dynamo’s vLLM backend to provide enhanced performance and memory efficiency.
Important Note: LMCache integration currently supports only the x86 architecture. ARM64 is not supported at this time.
LMCache is enabled using the --kv-transfer-config flag:
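As a minimal sketch, the flag and its JSON value can be generated programmatically, which avoids hand-quoting mistakes. The connector name and role are the ones quoted in this document; the variable names are illustrative only.

```python
import json
import shlex

# Connector name and role as quoted in this document.
kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both",
}

# Compact separators produce the JSON without extra whitespace.
flag_value = json.dumps(kv_transfer_config, separators=(",", ":"))

# shlex.join quotes the JSON so a shell passes it as one argument.
flag = shlex.join(["--kv-transfer-config", flag_value])
print(flag)
```

The printed string can be appended to the worker's launch command as-is.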
LMCache configuration can be customized via environment variables listed here.
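For illustration, the sketch below builds environment overrides for the two variables named later in this document, LMCACHE_CHUNK_SIZE and LMCACHE_MAX_LOCAL_CPU_SIZE. The helper name `lmcache_env` is hypothetical, the values are placeholders rather than recommendations, and the gigabyte unit for the CPU size is an assumption that should be checked against the LMCache configuration reference.

```python
import os


def lmcache_env(chunk_size: int, max_local_cpu_size_gb: float) -> dict[str, str]:
    """Return LMCache settings as environment-variable strings (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if max_local_cpu_size_gb <= 0:
        raise ValueError("max_local_cpu_size_gb must be positive")
    return {
        "LMCACHE_CHUNK_SIZE": str(chunk_size),
        # Assumption: the value is interpreted as gigabytes of CPU memory.
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(max_local_cpu_size_gb),
    }


# Placeholder values chosen for illustration only.
overrides = lmcache_env(chunk_size=256, max_local_cpu_size_gb=20)

# Merge with the current environment; pass child_env as env= to subprocess.run.
child_env = {**os.environ, **overrides}
```

Building the environment as a dictionary keeps the overrides scoped to the launched worker instead of the whole shell session.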
For advanced configurations, LMCache supports multiple storage backends:
Use the provided launch script for quick setup:
This will:
In aggregated mode, the system uses:
kv_connector: LMCacheConnectorV1
kv_role: kv_both (handles both reading and writing)
Disaggregated serving separates prefill and decode operations into dedicated workers. This provides better resource utilization and scalability for production deployments.
Use the provided disaggregated launch script (requires at least 2 GPUs):
This will:
Decode worker: uses NixlConnector only, for KV transfer between prefill and decode workers.
Prefill worker: uses MultiConnector with both the LMCache and NIXL connectors. This lets the prefill worker use LMCache for KV offloading and NIXL for KV transfer between prefill and decode workers. It is launched with --disaggregation-mode prefill.
The system automatically configures KV transfer based on the deployment mode and worker type:
Argument Parsing (args.py):
Engine Setup (main.py):
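As a rough illustration of the selection logic described above (not the actual contents of args.py or main.py), the choice of connector by deployment mode and worker type could look like the sketch below. The function name `select_kv_transfer_config` is hypothetical, and the kv_role used for NixlConnector and the nested "connectors" layout for MultiConnector are assumptions based on vLLM's KV transfer config format.

```python
from typing import Any, Optional

LMCACHE = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
# Assumption: NixlConnector is also configured with kv_role "kv_both".
NIXL = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}


def select_kv_transfer_config(
    disaggregated: bool, worker: Optional[str] = None
) -> dict[str, Any]:
    """Pick a KV transfer config by deployment mode and worker type (hypothetical)."""
    if not disaggregated:
        # Aggregated mode: LMCache handles both reading and writing.
        return dict(LMCACHE)
    if worker == "decode":
        # Decode worker: NIXL only, for prefill-to-decode KV transfer.
        return dict(NIXL)
    if worker == "prefill":
        # Prefill worker: LMCache for offloading plus NIXL for transfer.
        # Assumption: sub-connectors are nested under "connectors".
        return {
            "kv_connector": "MultiConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
                "connectors": [dict(LMCACHE), dict(NIXL)],
            },
        }
    raise ValueError(f"unknown worker type: {worker!r}")
```

Returning plain dictionaries keeps the sketch independent of any particular vLLM version; in practice the result would be serialized to JSON or passed to the engine's own config class.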
Chunk Size Tuning: Adjust LMCACHE_CHUNK_SIZE based on your use case:
Memory Allocation: Set LMCACHE_MAX_LOCAL_CPU_SIZE conservatively:
Workload Optimization: LMCache performs best with:
When LMCache is enabled with --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' and DYN_SYSTEM_PORT is set, LMCache metrics are automatically exposed via Dynamo’s /metrics endpoint alongside vLLM and Dynamo metrics.
Requirements to access LMCache metrics:
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' - Enables LMCache
DYN_SYSTEM_PORT=8081 - Enables metrics HTTP endpoint
PROMETHEUS_MULTIPROC_DIR (optional) - If not set, Dynamo manages it internally
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the LMCache Metrics section in the vLLM Prometheus Metrics Guide.
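One way to check that LMCache metrics are being exposed is to scrape the /metrics endpoint and keep only the LMCache lines. The sketch below assumes the endpoint is reachable on localhost at the port given by DYN_SYSTEM_PORT and that LMCache metric names begin with "lmcache"; both the prefix and the helper names are assumptions, not confirmed by this document.

```python
import os
import urllib.request
from typing import Optional


def filter_metrics(exposition_text: str, prefix: str = "lmcache") -> list[str]:
    """Keep sample lines whose metric name starts with `prefix`.

    Comment lines (# HELP / # TYPE) start with '#', so they are dropped.
    """
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith(prefix)
    ]


def fetch_metrics(
    host: str = "localhost", port: Optional[int] = None, timeout: float = 5.0
) -> str:
    """Fetch the raw Prometheus text from the /metrics endpoint."""
    if port is None:
        # Fall back to 8081, the example port used in this document.
        port = int(os.environ.get("DYN_SYSTEM_PORT", "8081"))
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a worker running, `filter_metrics(fetch_metrics())` returns the LMCache sample lines; an empty list suggests that LMCache is not enabled or that the prefix assumption does not hold.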
PrometheusLogger instance already created with different metadata
You may see an error like:
Version note: We reproduced this behavior with vLLM v0.12.0. We have not reproduced it with vLLM v0.11.0, so it may be specific to (or introduced in) v0.12.0.
This is emitted by LMCache when the LMCache connector is initialized more than once in the same process (for example, once for a WORKER role and later for a SCHEDULER role). LMCache uses a process-global singleton for its Prometheus logger, so the second initialization can log this warning if its metadata differs.
To suppress this message, set LMCACHE_LOG_LEVEL=CRITICAL.
Found PROMETHEUS_MULTIPROC_DIR was set by user
vLLM v1 uses prometheus_client.multiprocess and stores intermediate metric values in PROMETHEUS_MULTIPROC_DIR.
If you set PROMETHEUS_MULTIPROC_DIR yourself, vLLM warns that the directory must be wiped between runs to avoid stale or incorrect metrics.
Dynamo sets PROMETHEUS_MULTIPROC_DIR internally to a temporary directory to avoid vLLM cleanup issues. If you still see the warning, confirm you are not exporting PROMETHEUS_MULTIPROC_DIR in your shell or container environment.
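A small sketch of that last check: before launching a worker from Python, build a copy of the environment without PROMETHEUS_MULTIPROC_DIR and report whether it had been set. The helper name `env_without_multiproc_dir` is hypothetical.

```python
import os
from typing import Mapping, Optional


def env_without_multiproc_dir(
    env: Mapping[str, str],
) -> tuple[dict[str, str], Optional[str]]:
    """Return a copy of env without PROMETHEUS_MULTIPROC_DIR, plus the removed value."""
    cleaned = dict(env)
    removed = cleaned.pop("PROMETHEUS_MULTIPROC_DIR", None)
    return cleaned, removed


child_env, previous = env_without_multiproc_dir(os.environ)
if previous is not None:
    print(f"PROMETHEUS_MULTIPROC_DIR was exported as {previous}; not passing it on")
```

Passing `child_env` as the `env=` argument to `subprocess.run` launches the worker without the user-set directory while leaving the parent shell untouched.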