
NVIDIA Request Extensions (nvext)


nvext is a top-level JSON object on the request body that provides NVIDIA-specific extensions to the OpenAI-compatible API. nvext fields are consumed by the Dynamo frontend, preprocessor, router, and backend workers to control routing, preprocessing, response metadata, scheduling, and engine-level priority.

Usage

Include nvext as a top-level field alongside standard OpenAI-compatible fields:

```json
{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Hello"}],
  "nvext": {
    "greed_sampling": true,
    "extra_fields": ["worker_id", "timing"],
    "agent_hints": {
      "osl": 1024,
      "priority": 5
    }
  }
}
```
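If you are calling the endpoint from Python, the same body can be assembled programmatically. A minimal sketch (the helper name and plain-dict approach are illustrative; any OpenAI-compatible client that accepts extra body fields works the same way):

```python
import json

def build_request(model, messages, nvext=None):
    """Assemble an OpenAI-compatible request body with an optional nvext block."""
    body = {"model": model, "messages": messages}
    if nvext:
        body["nvext"] = nvext
    return body

body = build_request(
    "my-model",
    [{"role": "user", "content": "Hello"}],
    nvext={
        "greed_sampling": True,
        "extra_fields": ["worker_id", "timing"],
        "agent_hints": {"osl": 1024, "priority": 5},
    },
)
print(json.dumps(body, indent=2))
```

The resulting body is POSTed to the chat completions endpoint as usual; `nvext` rides alongside the standard fields.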

Field Reference

| Field | Type | Default | Consumed By | Description |
| --- | --- | --- | --- | --- |
| `greed_sampling` | bool | None | Preprocessor | Forces greedy sampling regardless of other sampling parameters. |
| `use_raw_prompt` | bool | None | Preprocessor | Bypasses the prompt template and passes the prompt directly to the tokenizer. |
| `annotations` | string[] | None | Preprocessor | Triggers out-of-band information in the SSE stream via the `event:` field. |
| `backend_instance_id` | u64 | None | Router | Routes the request to a specific backend instance. |
| `token_data` | u32[] | None | Preprocessor | Pre-tokenized prompt tokens. When provided with `backend_instance_id`, tokenization is skipped. |
| `max_thinking_tokens` | u32 | None | Backend | Maximum thinking tokens allowed (passed through to backends). |
| `extra_fields` | string[] | None | Response builder | Fields to include in the response `nvext`. Supported: `"worker_id"`, `"timing"`, `"routed_experts"`. |
| `prefill_worker_id` | u64 | None | Router | Routes the request to a specific prefill worker (disaggregated serving). |
| `decode_worker_id` | u64 | None | Router | Routes the request to a specific decode worker (disaggregated serving). |
| `agent_hints` | object | None | Router | Per-request hints for scheduling and load balancing. See Agent Hints. |
| `session_control` | object | None | Router | Session lifecycle and sticky routing for subagent KV isolation. See Session Control. |

Header Overrides

Routing fields can also be set via HTTP headers, which take priority over nvext values:

| Header | Overrides |
| --- | --- |
| `x-worker-instance-id` | `backend_instance_id` and `decode_worker_id` |
| `x-prefill-instance-id` | `prefill_worker_id` |
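The precedence rule can be illustrated with a small sketch (the function is illustrative, not Dynamo's actual resolution code):

```python
def resolve_backend_instance(headers, nvext):
    """Illustrative precedence: the x-worker-instance-id header, when present,
    overrides nvext.backend_instance_id (and decode_worker_id)."""
    header_val = headers.get("x-worker-instance-id")
    if header_val is not None:
        return int(header_val)
    return nvext.get("backend_instance_id")

# Header wins over the body field.
print(resolve_backend_instance({"x-worker-instance-id": "7"}, {"backend_instance_id": 3}))  # prints 7
```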

Agent Hints

The agent_hints sub-object carries per-request hints that the router uses for scheduling, load balancing, and KV cache optimization.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `priority` | i32 | None | Unified request priority. Higher values mean higher priority at the Dynamo API level. Used for router queue ordering and backend scheduling/eviction. |
| `osl` | u32 | None | Expected output sequence length (tokens). Used for output block tracking and resource estimation. |
| `speculative_prefill` | bool | false | When true, speculatively prefills the predicted next-turn prompt after the current turn completes to warm the KV cache. |

priority

priority is the single user-facing scheduling hint. Higher values mean “more important” across Dynamo.

When --router-queue-threshold is set and the queue is active, higher-priority requests are shifted earlier in the router queue. Once dispatched, Dynamo forwards the same semantic priority to the backend engine for queue ordering, preemption, and KV cache eviction. Dynamo normalizes backend-specific polarity internally, including vLLM’s lower-is-higher convention.
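As an illustration of the polarity normalization (a toy sketch, not Dynamo's internal code):

```python
def normalize_priority(priority, backend):
    """Toy sketch of polarity normalization. Dynamo's API treats higher
    values as more important; vLLM's scheduler treats lower values as more
    important, so the sign is flipped before forwarding to vLLM."""
    if backend == "vllm":
        return -priority
    return priority
```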

```json
{
  "nvext": {
    "agent_hints": {
      "priority": 5
    }
  }
}
```

osl

Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:

  1. Output block tracking: When --router-track-output-blocks is enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress toward osl.
  2. Resource estimation: Helps the router estimate total resource requirements when making routing decisions.
```json
{
  "nvext": {
    "agent_hints": {
      "osl": 1024
    }
  }
}
```
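The fractional-decay idea can be illustrated with a toy sketch (linear decay is an assumption made here for illustration; the router's actual formula may differ):

```python
def effective_placeholder_blocks(placeholder_blocks, tokens_generated, osl):
    """Toy illustration: the weight of placeholder blocks decays as
    generation progresses toward the osl hint, so a request near completion
    counts less against a worker's projected load."""
    progress = min(tokens_generated / osl, 1.0) if osl > 0 else 1.0
    return placeholder_blocks * (1.0 - progress)
```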

speculative_prefill

When set to true, the system speculatively prefills the predicted next-turn prompt after the current assistant turn completes. This is designed for multi-turn agentic workloads where the next request’s prefix is predictable.

How it works:

  1. As the assistant response streams, the system accumulates the full response text.
  2. Once the response finishes, a background task constructs the next-turn prompt by appending the assistant response to the conversation history (with thinking content stripped for non-last turns).
  3. The constructed prompt is tokenized and sent as a max_tokens=1 request to warm the KV cache on a worker.
  4. When the actual next request arrives, it benefits from the already-warm KV cache, reducing TTFT.
```json
{
  "nvext": {
    "agent_hints": {
      "speculative_prefill": true
    }
  }
}
```
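The flow above can be sketched in Python (function and variable names are illustrative, not Dynamo internals):

```python
def build_next_turn_prompt(history, assistant_response):
    """Illustrative sketch: append the finished assistant turn to the
    conversation history to form the prompt the next request is predicted
    to start with."""
    return history + [{"role": "assistant", "content": assistant_response}]

# The warm-up request is then sent with max_tokens=1, which prefills the
# predicted prompt into a worker's KV cache without generating output.
warmup_body = {
    "model": "my-model",
    "messages": build_next_turn_prompt(
        [{"role": "user", "content": "Hello"}], "Hi! How can I help?"
    ),
    "max_tokens": 1,
}
```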

Backend support for priority:

  • SGLang: Requires --enable-priority-scheduling for queue ordering and --radix-eviction-policy priority for priority-based eviction.
  • vLLM: Requires --scheduling-policy priority.
  • TensorRT-LLM: Does not currently support per-request priority.

Session Control

session_control enables subagent KV isolation with sticky routing. The router uses session_id to keep a session on the same worker and can issue open / close lifecycle RPCs around streaming sessions.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `session_control.session_id` | string | required | Unique session identifier. Present on every turn. |
| `session_control.action` | string | omitted | Optional lifecycle action: `"open"` or `"close"`. |
| `session_control.timeout` | integer | 300 | Inactivity timeout in seconds. Only used with `action: "open"`. |
```json
{
  "nvext": {
    "session_control": {
      "session_id": "subagent-1",
      "action": "open",
      "timeout": 300
    }
  }
}
```

Requires --router-mode=kv on the frontend. Session control activates automatically when requests carry nvext.session_control. See SGLang for Agentic Workloads for backend setup details.
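A typical session lifecycle, sketched in Python (the helper is illustrative; only the field names from the table above come from the source):

```python
def session_nvext(session_id, action=None, timeout=None):
    """Build the nvext.session_control block for one turn of a session."""
    sc = {"session_id": session_id}
    if action:
        sc["action"] = action
    if action == "open" and timeout is not None:
        sc["timeout"] = timeout  # inactivity timeout, only meaningful on open
    return {"session_control": sc}

open_turn = session_nvext("subagent-1", action="open", timeout=300)
mid_turn = session_nvext("subagent-1")  # sticky-routed to the same worker
close_turn = session_nvext("subagent-1", action="close")
```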

Response Extensions

When the client requests response metadata via extra_fields, the response includes an nvext object with the requested fields:

| Field | Requested Via | Description |
| --- | --- | --- |
| `worker_id` | `extra_fields: ["worker_id"]` | Prefill/decode worker IDs and data parallel ranks that processed the request. |
| `timing` | `extra_fields: ["timing"]` | Per-request timing information (TTFT, ITL, queue time, etc.). |
| `routed_experts` | `extra_fields: ["routed_experts"]` | Routed expert capture payload returned by SGLang-backed requests. |
| `token_ids` | Automatic (GAIE Stage 1) | Tokenized prompt for reuse in Stage 2 query-only mode. |

Example response nvext

```json
{
  "nvext": {
    "worker_id": {
      "prefill_worker_id": 1,
      "prefill_dp_rank": 0,
      "decode_worker_id": 2,
      "decode_dp_rank": 0
    },
    "timing": {
      "ttft_ms": 45.2,
      "itl_ms": 12.1
    }
  }
}
```
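Reading the metadata back out on the client side is straightforward (a sketch assuming the response shape shown above):

```python
def read_response_nvext(response_body):
    """Pull requested metadata out of a response body's nvext object.
    Returns (decode_worker_id, ttft_ms); either may be None if the field
    was not requested via extra_fields."""
    nvext = response_body.get("nvext", {})
    worker = nvext.get("worker_id", {})
    timing = nvext.get("timing", {})
    return worker.get("decode_worker_id"), timing.get("ttft_ms")
```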

See Also

| Document | Description |
| --- | --- |
| Frontend Guide | KServe gRPC configuration and integration |
| Configuration and Tuning | Full router configuration and CLI arguments |
| SGLang for Agentic Workloads | SGLang engine flags for priority scheduling, eviction policies, and session control |