Agents
Workload-aware inference with agentic hints for routing, scheduling, and KV cache management
Gaps with workload-agnostic inference
Agentic LLM inference is dominated by KV-cache storage and I/O rather than computation; without leveraging the predictable structure of agent lifecycles, we leave significant optimizations on the table. Three gaps stand out with current workflows:
- Reactive vs. proactive: Current runtimes do not use signals from the harness about what will happen next—e.g. that a “Plan” step is done and “Execute” steps are coming—so they cannot prefetch, pin, or schedule proactively.
- All KV-cache blocks treated equally: Generic eviction (e.g. LRU) does not distinguish high-value, long-lived context (system prompt, tool definitions) from ephemeral context (chain-of-thought, scratchpad).
- Workload-agnostic scheduling: Agents have predictable structure—tools and system prompts repeat across turns, shallow vs. deep research have different latency needs, and the orchestrator knows which phase comes next—yet generic schedulers exploit none of it.
Dynamo as an Agentic Runtime
Dynamo exposes agentic hints and uses them at the frontend API, router, and backend scheduling layers. Together, these enable workload-aware inference instead of generic, state-of-the-moment optimization.
Agentic Hints
Agentic hints are per-request metadata that the agent client (e.g. Claude Code, Codex, NeMo Agent Toolkit) sends to Dynamo’s frontend. They are carried in the request body under nvext on chat completions. The frontend parses them and passes them to the KV router and, where applicable, to backends.
- Flow: Harness sets hints in the request → Dynamo frontend parses nvext into routing hints → KV router uses them for queue ordering and worker selection → backends use them for priority scheduling and cache eviction.
The request body includes nvext.agent_hints for routing and scheduling metadata that the frontend passes through to the KV router and backend runtime.
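As a concrete illustration, the sketch below builds a chat-completions request body carrying nvext.agent_hints. Only the nvext.agent_hints field is documented above; the specific hint keys (phase, priority, session_id) and the model name are illustrative assumptions, not a confirmed schema.

```python
import json

# Hypothetical hint keys; only the `nvext.agent_hints` envelope is documented.
payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Plan the refactor, then execute it."},
    ],
    "nvext": {
        "agent_hints": {
            "phase": "plan",         # assumed: current lifecycle phase
            "priority": 10,          # assumed: higher = scheduled sooner
            "session_id": "agent-42" # assumed: sticky-session identifier
        }
    },
}

body = json.dumps(payload)
```

The serialized body would then be POSTed to the Dynamo frontend's chat-completions endpoint; the frontend strips and interprets the hints before forwarding the request.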
Feature matrix
🚧 = Work in progress or experimental.
Using Dynamo from LangChain
Dynamo is now supported directly in LangChain using the NVIDIA AI Endpoints integration. Configure the chat model to use the Dynamo endpoint and pass agent hints directly from the LangChain client.
Features (experimental)
KV cache optimizations
- Priority-based KV cache eviction: Instead of evicting by LRU alone, the backend can evict low-priority cache entries first when the GPU (and, with HiCache, host) cache is full. The priority value in nvext.agent_hints is forwarded to the engine; with SGLang, enable --enable-priority-scheduling and --radix-eviction-policy priority.
- Subagent KV isolation (experimental): Session control holds subagent KV in dedicated streaming session slots outside the radix tree. Session KV is invisible to eviction and freed deterministically on close or timeout. The router manages sticky session affinity so subsequent turns always hit the same worker. See SGLang for Agentic Workloads — Session Control.
- Cache prefetching (future work): Using the predictable agentic lifecycle (e.g. parent-child subagents, known next turn), Dynamo could proactively prefetch or move KV cache to a different worker so that the next request hits warm cache.
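To make the priority-based eviction policy concrete, here is a minimal sketch of priority-then-LRU eviction: evict the lowest-priority block first, breaking ties by least-recent use. This is an illustration of the policy described above, not Dynamo's or SGLang's actual implementation; the class and field names are invented for the example.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheBlock:
    key: str
    priority: int  # assumed: taken from nvext.agent_hints, higher = keep longer
    last_used: float = field(default_factory=time.monotonic)

class PriorityLRUCache:
    """Sketch of priority-then-LRU eviction: the victim is the block with
    the lowest priority, with ties broken by least-recent use."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: dict[str, CacheBlock] = {}

    def touch(self, key: str, priority: int) -> None:
        if key in self.blocks:
            self.blocks[key].last_used = time.monotonic()
            return
        if len(self.blocks) >= self.capacity:
            self.evict()
        self.blocks[key] = CacheBlock(key, priority)

    def evict(self) -> str:
        victim = min(self.blocks.values(), key=lambda b: (b.priority, b.last_used))
        del self.blocks[victim.key]
        return victim.key

cache = PriorityLRUCache(capacity=2)
cache.touch("system-prompt", priority=100)  # long-lived, high value
cache.touch("scratchpad", priority=1)       # ephemeral chain-of-thought
cache.touch("tool-defs", priority=100)      # evicts "scratchpad", not the prompt
```

Under plain LRU the system prompt (oldest entry) would have been evicted here; the priority tie-break keeps the long-lived context warm at the expense of the scratchpad.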
Speculative prefill
After a turn finishes, the system can send a speculative max_tokens=1 prefill with the predicted next-turn prefix (conversation history + assistant text, e.g. thinking stripped) to the same worker. When the real next request arrives, it hits a warm KV cache. Per-turn TTFT on turns 2+ can drop significantly (e.g. up to ~3× in multiturn benchmarks). This can be extended so that Dynamo automatically sends tools and system prompt for subagents to a worker in advance, so subagent requests always hit warm cache.
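The speculative-prefill request above can be sketched as an ordinary chat-completions call with max_tokens=1. The endpoint URL, model name, and the session_id hint are assumptions for illustration; only the max_tokens=1 warm-up pattern itself is described in the text.

```python
import json
import urllib.request

DYNAMO_URL = "http://localhost:8000/v1/chat/completions"  # assumed local frontend

def build_prefill_payload(history: list) -> dict:
    """Predicted next-turn prefix: conversation history with thinking stripped.
    max_tokens=1 makes this a near-pure prefill; the single token is discarded."""
    return {
        "model": "meta/llama-3.1-8b-instruct",  # placeholder model name
        "messages": history,
        "max_tokens": 1,
        # Hypothetical hint keeping the warm-up on the same worker as the session:
        "nvext": {"agent_hints": {"session_id": "agent-42"}},
    }

def speculative_prefill(history: list) -> None:
    """Fire-and-forget warm-up sent right after a turn finishes."""
    req = urllib.request.Request(
        DYNAMO_URL,
        data=json.dumps(build_prefill_payload(history)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)  # generated token is ignored
```

When the real next request arrives with the same prefix, its prompt tokens are already resident in the worker's KV cache, so only the new suffix needs prefilling.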
Priority-aware routing
When --router-queue-threshold is set, the router maintains a priority queue. Requests with higher priority are treated as if they arrived earlier, so they are scheduled ahead of bulk or background work. Under load, this keeps median latency low for user-facing agent turns while background work can tolerate higher latency. For a runnable demo and results, see NeMo Agent Toolkit priority demo.
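The queue behavior described above—higher-priority requests treated as if they arrived earlier, FIFO among equals—can be sketched with a standard min-heap. This is an illustration of the policy, not Dynamo's router implementation.

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Sketch: higher-priority requests dequeue as if they arrived earlier;
    ties fall back to true arrival order (FIFO)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonic arrival sequence number

    def push(self, request_id: str, priority: int = 0) -> None:
        # Negate priority: heapq is a min-heap, so higher priority pops first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

q = PriorityRequestQueue()
q.push("background-summarize", priority=0)
q.push("user-facing-turn", priority=10)
q.push("background-index", priority=0)
assert q.pop() == "user-facing-turn"       # jumps ahead of earlier bulk work
assert q.pop() == "background-summarize"   # FIFO among equal priorities
```

The arrival counter is what keeps the ordering stable: without it, equal-priority requests would compare on their payloads, and FIFO fairness among background work would be lost.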