HiCache
Hierarchical KV caching with tier-aware router integration
This guide covers running SGLang’s Hierarchical Cache (HiCache) with Dynamo, and how the Dynamo KV router integrates with HiCache for tier-aware worker selection when workers share an external pool such as Mooncake.
SGLang HiCache extends RadixAttention with a multi-tier KV cache that transparently moves pages between GPU HBM, host memory, and an optional external storage backend (e.g. Mooncake). For a full description of HiCache itself — flag reference, storage backends, memory layouts, prefetch policies — see SGLang’s own documentation:
What Dynamo adds on top of HiCache:
If you are running a single worker with HiCache and no shared pool, no Dynamo-side configuration is required — the worker reports KV events to the router as usual.
Launch a worker with HiCache enabled:
Then start the frontend:
The HiCache flags (--enable-hierarchical-cache, --hicache-ratio, --hicache-write-policy, --hicache-storage-backend, --hicache-mem-layout, etc.) are SGLang-native — Dynamo passes them through unchanged. See SGLang’s best-practices doc for the complete flag reference and tuning guidance.
When you scale out to multiple SGLang workers that share an external pool such as Mooncake, the Dynamo router can be made tier-aware. It tracks per-tier residency from worker events and consults the shared pool directly so that blocks cached anywhere in the cluster — not just on the candidate worker’s GPU — contribute to worker scoring.
By default the router’s radix tree only reflects blocks resident in GPU HBM on each worker. HiCache silently demotes blocks to host memory and further to Mooncake as the device pool fills, but the router never sees those transitions. A worker that has the full request prefix on host + Mooncake looks identical to a cold worker. The router ends up treating “fetchable from Mooncake in milliseconds” the same as “must be recomputed from scratch.”
SGLang’s HiRadixCache emits BlockStored / BlockRemoved events carrying a medium field on every tier transition:
A few properties the router relies on:
- store(new_tier) is emitted before remove(old_tier) so the block is never invisible to the router during a transition.
- store(CPU) for a GPU→Host copy is deferred until finish_event.synchronize() confirms the DMA landed, so events never fire before bytes are resident.

On every request the router runs two lookups in parallel:
If the shared-pool query fails, the router falls back to indexer-only scoring and logs a warning. The request still succeeds.
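The fallback behaviour can be sketched as follows. This is a minimal illustration, not the router's actual code: `query_indexer`, `query_shared_pool`, and `lookup` are hypothetical names, the worker IDs and overlap counts are made up, and the shared-pool stub always fails so that the fallback path is exercised.

```python
import asyncio
import logging

logger = logging.getLogger("router_sketch")


async def query_indexer(block_hashes):
    # Hypothetical stand-in for the indexer lookup:
    # per-worker count of leading request blocks already resident on device.
    return {"W0": 2, "W1": 0}


async def query_shared_pool(block_hashes):
    # Hypothetical stand-in for the shared-pool query; here it always fails.
    raise ConnectionError("shared pool unreachable")


async def lookup(block_hashes, indexer=query_indexer, shared=query_shared_pool):
    # Run both lookups concurrently; return_exceptions=True hands back a
    # failure as a value instead of raising it out of gather().
    overlaps, shared_hits = await asyncio.gather(
        indexer(block_hashes), shared(block_hashes), return_exceptions=True
    )
    if isinstance(overlaps, Exception):
        # The sketch assumes scoring cannot proceed without the indexer.
        raise overlaps
    if isinstance(shared_hits, Exception):
        # Shared pool unavailable: warn and score from the indexer alone.
        logger.warning("shared-pool query failed, indexer-only scoring: %s", shared_hits)
        shared_hits = None
    return overlaps, shared_hits


result = asyncio.run(lookup(["h0", "h1", "h2", "h3"]))
```

With the failing stub, `result` holds the indexer overlaps and `None` in place of the shared-pool hits, so the request can still be scored.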
For each candidate worker, the router computes a logit (lower wins):
hits_beyond(n) counts shared-cache pages at positions >= n — “pages past my device prefix that I can still fetch from Mooncake instead of recomputing.”
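As a small illustration of that definition, the sketch below models the shared-pool result as a set of block positions within the request. That representation is an assumption made for the example; the router's internal data structure may differ.

```python
def hits_beyond(shared_positions, n):
    """Count shared-pool pages at positions >= n.

    shared_positions: iterable of request block positions found in the
    shared pool (an assumed representation for this sketch).
    n: length of the candidate worker's device-resident prefix, in blocks.
    """
    return sum(1 for pos in shared_positions if pos >= n)


# Shared pool holds blocks 0-3 of a 4-block request.
shared = {0, 1, 2, 3}
cold_worker = hits_beyond(shared, 0)   # no device prefix: all 4 pages count
half_warm = hits_beyond(shared, 2)     # device prefix of 2: pages 2 and 3 count
fully_warm = hits_beyond(shared, 4)    # device prefix covers the request: 0
```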
Worked example. Request is 4 blocks, shared_cache_multiplier = 0.5, block_size = 1, overlap_weight = 1.0. Shared pool contains blocks 0–3.
W1 wins despite zero local overlap, because the shared pool covers its whole prefix. The multiplier encodes the cost ratio of a Mooncake fetch relative to a fresh GPU compute — 0.5 means “fetching from shared is half as expensive as recomputing.”
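The cost-ratio reading of the multiplier can be sketched in a few lines. This is an assumption-laden illustration, not the router's exact formula: `sketch_logit` is a hypothetical function, the decode loads are invented for the example, and the prefill cost is modelled as one unit per recomputed block plus `shared_cache_multiplier` units per block fetched from the shared pool.

```python
def hits_beyond(shared_positions, n):
    # Shared-pool pages at positions >= n (positions are request block indices).
    return sum(1 for pos in shared_positions if pos >= n)


def sketch_logit(request_blocks, device_overlap, shared_positions, decode_blocks,
                 shared_cache_multiplier=0.5, overlap_weight=1.0):
    # Blocks past the device prefix that can be fetched from the shared pool.
    fetchable = hits_beyond(shared_positions, device_overlap)
    # Remaining blocks must be recomputed from scratch.
    recompute = request_blocks - device_overlap - fetchable
    # Assumed cost model: a fetch costs `shared_cache_multiplier` of a recompute.
    prefill_cost = recompute + shared_cache_multiplier * fetchable
    return overlap_weight * prefill_cost + decode_blocks  # lower wins


pool = {0, 1, 2, 3}  # shared pool contains blocks 0-3
# Hypothetical workers: W0 is warm but busy, W1 is cold but idle.
w0 = sketch_logit(4, device_overlap=3, shared_positions=pool, decode_blocks=2)
w1 = sketch_logit(4, device_overlap=0, shared_positions=pool, decode_blocks=0)
# Same workers with an empty shared pool, for comparison.
w0_no_pool = sketch_logit(4, device_overlap=3, shared_positions=set(), decode_blocks=2)
w1_no_pool = sketch_logit(4, device_overlap=0, shared_positions=set(), decode_blocks=0)
```

Under these assumed inputs, W1 scores 2.0 against W0's 2.5 when the pool is populated, while with an empty pool W0 scores 3.0 against W1's 4.0. The shared pool is what lets the cold, idle worker win.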
Tier-aware shared cache routing requires SGLang changes from sgl-project/sglang#22894 (“fix(hicache): emit KV events for L2 host cache insertions”). This PR is not yet merged to SGLang main. Until it lands and an SGLang release includes it, the feature is not accessible from a stock pip install sglang; you must build SGLang from the PR branch (gh pr checkout 22894 && pip install -e python/ from the SGLang repo). This section will be updated with the minimum required version once #22894 ships in a release.
Without PR #22894, worker events carry only medium=GPU and the router is blind to Host-tier residency — regardless of Mooncake configuration.
You also need:
- --shared-cache-type hicache (see Configuration).
- --hicache-storage-backend mooncake.

SGLang worker, HiCache with Mooncake storage:
Launch additional workers on other GPUs / hosts with the same Mooncake config so they are all backed by the same Mooncake cluster.
Dynamo frontend — enable tier-aware routing:
Per-request overrides are available via RouterConfigOverride.shared_cache_multiplier for A/B experimentation without restarting the router.
No extra flags are required on the worker. When --hicache-storage-backend mooncake is set, Dynamo publishes the required metadata (page size, TP/PP layout, master address) via the worker’s ModelRuntimeConfig.engine_specific blob under the key sglang_hicache_mooncake.
Events carry a medium. Run the worker with --log-level debug and grep the log:
If medium is missing or always reads GPU, the worker is running an SGLang build without PR #22894.
Router sees the shared pool. Two new histograms are exposed on the frontend’s Prometheus endpoint: