Agent Hints

Per-request serving hints for agentic workloads

Agent hints are optional per-request metadata that a harness sends under nvext.agent_hints. Dynamo parses these hints in the frontend and passes them to the router and, where supported, backend runtimes.

Use hints only for serving-relevant intent. Use Agent Context for passive trace identity.

Request Schema

{
  "model": "my-model",
  "messages": [
    { "role": "user", "content": "Continue the report." }
  ],
  "nvext": {
    "agent_hints": {
      "priority": 5,
      "osl": 1024,
      "speculative_prefill": true
    }
  }
}

| Hint | Description |
| --- | --- |
| priority | Unified request priority. Higher values move the request earlier in the router queue and are forwarded to backends that support priority scheduling or eviction. |
| osl | Expected output sequence length in tokens. Used by the router for output block tracking and load-balancing accuracy when --router-track-output-blocks is enabled. |
| speculative_prefill | When true, Dynamo can prefill the predicted next-turn prefix after the current turn completes to warm the KV cache for the next request. |
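
A harness might assemble this request body programmatically. The following is a minimal sketch, not part of Dynamo: the helper name build_chat_request and its keyword arguments are hypothetical, and it assumes that omitting a hint is acceptable because hints are optional. It only builds and prints the body; sending it to a frontend is left out.

```python
import json
from typing import Any, Optional


def build_chat_request(
    model: str,
    user_content: str,
    priority: Optional[int] = None,
    osl: Optional[int] = None,
    speculative_prefill: Optional[bool] = None,
) -> dict[str, Any]:
    """Build a chat request body, attaching only the hints that were set.

    Hypothetical helper for illustration; only the nvext.agent_hints
    field names come from the schema above.
    """
    body: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    hints = {
        "priority": priority,
        "osl": osl,
        "speculative_prefill": speculative_prefill,
    }
    # Drop unset hints so the request carries only explicit intent.
    hints = {name: value for name, value in hints.items() if value is not None}
    if hints:
        body["nvext"] = {"agent_hints": hints}
    return body


request_body = build_chat_request(
    "my-model",
    "Continue the report.",
    priority=5,
    osl=1024,
    speculative_prefill=True,
)
print(json.dumps(request_body, indent=2))
```

Calling the helper without any hint arguments yields a body with no nvext key at all, which keeps hint-free requests identical to ordinary chat requests.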

Request Flow

The frontend parses nvext.agent_hints, the router uses hints for queueing and worker selection, and supported backends use forwarded hints for engine-level scheduling and cache policy.
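
To make the queueing step concrete, the toy model below orders requests so that higher priority values are dequeued first. It is an illustrative sketch only and does not reflect Dynamo's router implementation: the class ToyPriorityQueue, the "id" field, the default priority of 0 for requests without a hint, and first-come-first-served tie-breaking are all assumptions.

```python
import heapq
import itertools
from typing import Any


class ToyPriorityQueue:
    """Toy queue: higher agent_hints.priority is dequeued earlier."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict[str, Any]]] = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def push(self, request: dict[str, Any]) -> None:
        hints = request.get("nvext", {}).get("agent_hints", {})
        # Assumption: requests without a priority hint default to 0.
        priority = hints.get("priority", 0)
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), request))

    def pop(self) -> dict[str, Any]:
        return heapq.heappop(self._heap)[2]


queue = ToyPriorityQueue()
queue.push({"id": "a"})  # no hints
queue.push({"id": "b", "nvext": {"agent_hints": {"priority": 5}}})
queue.push({"id": "c", "nvext": {"agent_hints": {"priority": 1}}})

order = [queue.pop()["id"] for _ in range(3)]
print(order)  # ['b', 'c', 'a']
```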

Backend Support

Backend support is runtime-specific. For SGLang flags and behavior, see SGLang for Agentic Workloads.

| Feature | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| Priority-aware routing | Yes | Yes | Yes |
| Priority-based cache eviction | Planned | Yes | Planned |
| Speculative prefill | Yes | Yes | Yes |
| Subagent KV isolation with session control | No | Experimental | No |

agent_hints is separate from agent_context:

  • agent_context is passive identity for traces and joins.
  • agent_hints is active serving intent for routing, scheduling, and cache behavior.

Session-control metadata for SGLang subagent KV isolation lives under nvext.session_control; see NVIDIA Request Extensions.