SGLang for Agentic Workloads
Priority scheduling, KV cache eviction policies, and cache pinning for multi-turn agentic serving
Priority scheduling, KV cache eviction policies, and cache pinning for multi-turn agentic serving
This guide covers SGLang-specific configuration for agentic serving with Dynamo. It explains which SGLang engine flags to enable, how Dynamo’s agent hints map to SGLang behavior, and how to use experimental cache pinning to protect KV cache for high-value conversations.
Agentic workloads (tool-calling loops, multi-turn reasoning, code generation pipelines) have different performance characteristics than batch inference:
Dynamo’s agent hints give the router per-request metadata. SGLang’s engine flags control how that metadata affects scheduling and eviction on the worker.
Enable priority-based scheduling so the engine respects the priority value from nvext.agent_hints.priority:
When priority scheduling is enabled, the engine uses the priority field from nvext.agent_hints to order requests in its internal queue. Requests with higher effective priority are scheduled before lower-priority ones. Ties are broken by arrival time.
By default, SGLang evicts radix tree nodes using LRU. You can switch to priority-based eviction so that low-priority cache entries are evicted before high-priority ones:
This does not require HiCache. It controls GPU-only radix tree eviction. When the GPU KV cache is full:
lru: Evicts the least recently used leaf nodes first.priority: Evicts lowest-priority leaf nodes first. Nodes with equal priority fall back to LRU ordering.When both --radix-eviction-policy priority and --enable-hierarchical-cache are enabled, priority affects eviction at both tiers:
The practical impact depends on your write policy. With write_through, GPU eviction is just a demotion — the real deletion happens at host eviction, which is where priority ordering matters most.
Dynamo’s nvext.agent_hints fields are consumed by the router and forwarded to SGLang workers. Here is how each hint interacts with the SGLang engine:
Required PRs:
Cache pinning lets you explicitly protect KV cache for high-value conversation prefixes. When a request includes nvext.cache_control, the router fires a pin_prefix call to the SGLang worker after generation completes. Pinned nodes resist eviction for the specified TTL — even under memory pressure, they are retained (demoted to host memory with HiCache rather than deleted).
nvext.cache_control with a TTL in the request.PinState.pin_prefix RPC to the worker that served the request.pin_expiry and acquiring a host_ref_counter hold that prevents eviction.Frontend flag:
SGLang worker: The worker receives PIN requests via its cache_control service mesh endpoint. You must set the SGLANG_HICACHE_MAX_PINNED_RATIO environment variable to a non-zero value — pinning is disabled by default.
HiCache is required (--enable-hierarchical-cache). Without it, the scheduler rejects PIN requests. For best results, use write_through so that pinned nodes demote to host memory instead of being deleted when GPU memory fills:
Include cache_control as a top-level field in nvext:
The response includes prompt_tokens_details.cached_tokens in the usage object when --enable-cache-report is set on the SGLang worker:
A high cached_tokens / prompt_tokens ratio on subsequent turns confirms that the pinned prefix was preserved.
SGLANG_HICACHE_MAX_PINNED_RATIO defaults to 0.0. You must set it to a non-zero value (e.g., 0.1) or all PIN requests will be rejected.--enable-hierarchical-cache is set.SGLANG_HICACHE_MAX_PINNED_RATIO (fraction of host pool capacity). Requests exceeding this budget are rejected.pin_prefix does not set a priority on the radix tree nodes. All pinned nodes have equal eviction priority and fall back to LRU ordering among themselves when host memory fills.nvext field reference including agent hints