Priority scheduling lets a client mark one request as more important than
another. In Dynamo, the user-facing request field is
nvext.agent_hints.priority.
Higher values mean higher priority at the Dynamo API layer. Clients should send the intended Dynamo value directly and should not invert the value for a specific backend. Dynamo normalizes backend-specific priority conventions before forwarding the request to the engine.
Priority can affect three different layers. They are configured separately.
These layers are additive. For example, a request can jump ahead in the router queue but still use default engine scheduling if the backend priority flag is not enabled.
The router queue only matters when requests are held before dispatch. If a request can be routed immediately, there is no pending queue to reorder and the priority hint will not change TTFT at the router layer.
--router-queue-threshold controls when the router starts holding requests. A
request waits in the router queue while every eligible worker is above the
configured threshold. The queue drains when capacity is available, and
higher-priority requests are selected before lower-priority requests according
to the configured --router-queue-policy.
The default policy is fcfs, which uses the priority value as a positive
arrival-time bump. Higher values move the request earlier in the queue. Negative
priority values are clamped to zero for router queueing, so a request cannot be
pushed behind normal first-come, first-served ordering by sending a negative
priority.
For the flag-level semantics, default value, and backend caveats, see Router Configuration and Tuning.
The backend receives the same Dynamo semantic priority, but each engine has its own native scheduling convention. Dynamo handles that conversion internally.
Do not negate nvext.agent_hints.priority in client code for vLLM. If a test
shows lower user values receiving better TTFT, first check whether the benchmark
harness or endpoint path inverted the value before it reached Dynamo.
Priority is not Kubernetes PriorityClass, GPU preemption, or a hard admission
control policy. It does not reserve capacity for high-priority requests.
Priority also does not show an effect unless there is contention at a layer that uses it:
Use a benchmark that can send different nvext.agent_hints.priority values on
individual requests. For AIPerf, use a version with per-request extra payload
support. Older AIPerf versions may only support global --extra-inputs, which
is not enough for mixed-priority tiers in the same run.
For router-priority validation:
dynamo_frontend_router_queue_pending_requests
and confirm it is greater than zero during the measured window.Expected result: higher Dynamo priority values should receive better TTFT under contention. If lower values win, first check whether the client, benchmark harness, or gateway path negated the priority before it reached Dynamo.