NVIDIA Request Extensions (nvext)
NVIDIA Request Extensions (nvext)
NVIDIA Request Extensions (nvext)
nvext is a top-level JSON object on the request body that provides NVIDIA-specific extensions to the OpenAI-compatible API. nvext fields are consumed by the Dynamo frontend, preprocessor, router, and backend workers to control routing, preprocessing, response metadata, scheduling, and engine-level priority.
Include nvext as a top-level field alongside standard OpenAI-compatible fields:
Routing fields can also be set via HTTP headers, which take priority over nvext values:
The agent_hints sub-object carries per-request hints that the router uses for scheduling, load balancing, and KV cache optimization.
latency_sensitivityWhen --router-queue-threshold is set and the queue is active, this value shifts the request’s effective arrival time earlier in the queue, giving it priority over requests with lower (or no) latency_sensitivity. A value of 5.0 means the request is treated as if it arrived 5 seconds earlier than it actually did. A recommended default is 1.2 for latency-sensitive agentic requests. Has no effect when queueing is disabled.
oslExpected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:
--router-track-output-blocks is enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress toward osl.speculative_prefillWhen set to true, the system speculatively prefills the predicted next-turn prompt after the current assistant turn completes. This is designed for multi-turn agentic workloads where the next request’s prefix is predictable.
How it works:
max_tokens=1 request to warm the KV cache on a worker.priorityBackend engine scheduling priority forwarded to the engine’s generate call. Influences queue ordering, KV cache eviction under memory pressure, and preemption of running requests.
The semantics of the priority value differ between backends:
--schedule-low-priority-values-first to match vLLM’s convention. Requires --enable-priority-scheduling on the engine.priority: 0 is scheduled before priority: 10. Ties are broken by arrival time. Requires --scheduling-policy priority on the engine.When omitted, SGLang defaults to None (engine default); vLLM defaults to 0. TensorRT-LLM does not currently support per-request priority.
The cache_control object enables explicit KV cache pinning with a TTL. When set, the router fires a pin_prefix call to the backend worker after generation completes, protecting the conversation’s KV cache from eviction for the specified duration.
Requires --enable-cache-control and --router-mode=kv on the frontend. See SGLang for Agentic Workloads for full setup and usage details.
When the client requests response metadata via extra_fields, the response includes an nvext object with the requested fields:
nvext