
Copyright © 2026, NVIDIA Corporation.

Priority Scheduling

Request priority across the Dynamo router and backend engines

Priority scheduling lets a client mark one request as more important than another. In Dynamo, the user-facing request field is nvext.agent_hints.priority.

Higher values mean higher priority at the Dynamo API layer. Clients should send the intended Dynamo value directly and should not invert the value for a specific backend. Dynamo normalizes backend-specific priority conventions before forwarding the request to the engine.

{
  "model": "my-model",
  "messages": [
    { "role": "user", "content": "Summarize this incident." }
  ],
  "nvext": {
    "agent_hints": {
      "priority": 10
    }
  }
}
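A client can also attach the hint programmatically before sending the request. The following is a minimal sketch using only the Python standard library. The helper name `with_priority` is hypothetical and is not part of any Dynamo client library; only the `nvext.agent_hints.priority` field path comes from this page.

```python
import copy
import json


def with_priority(payload: dict, priority: int) -> dict:
    """Return a copy of a chat request body with a Dynamo priority hint.

    Hypothetical helper for illustration; not a Dynamo API.
    """
    body = copy.deepcopy(payload)
    # Create nvext.agent_hints if it is missing, keeping any existing hints.
    hints = body.setdefault("nvext", {}).setdefault("agent_hints", {})
    # Send the Dynamo value as-is: higher means more important.
    hints["priority"] = priority
    return body


request = with_priority(
    {
        "model": "my-model",
        "messages": [{"role": "user", "content": "Summarize this incident."}],
    },
    priority=10,
)
print(json.dumps(request, indent=2))
```

The helper copies the payload rather than mutating it, so the same base request can be reused to build several priority tiers.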

Priority Layers

Priority can affect four different layers. They are configured separately.

| Layer | What It Controls | Required Configuration | Deep Details |
| --- | --- | --- | --- |
| Frontend API | The user-facing request schema and priority polarity. | Send nvext.agent_hints.priority on each request that needs a priority hint. | nvext.agent_hints.priority |
| Router queue | Which waiting request is dispatched first when the router queue is non-empty. | KV routing plus --router-queue-threshold set to a value that actually causes queueing. | --router-queue-threshold, --router-queue-policy |
| Backend engine | Which admitted request the engine schedules first. | Backend-specific priority scheduling flag, such as vLLM --scheduling-policy priority or SGLang --enable-priority-scheduling. | vLLM priority scheduling, SGLang priority scheduling |
| KV cache policy | Which cached blocks are retained or evicted first under memory pressure. | Backend-specific cache priority configuration, such as SGLang --radix-eviction-policy priority. | SGLang priority-based KV cache eviction |

These layers are independent, and their effects are additive. For example, a request can jump ahead in the router queue but still use default engine scheduling if the backend priority flag is not enabled.

Router Queue Priority

The router queue only matters when requests are held before dispatch. If a request can be routed immediately, there is no pending queue to reorder and the priority hint will not change TTFT at the router layer.

--router-queue-threshold controls when the router starts holding requests. A request waits in the router queue while every eligible worker is above the configured threshold. The queue drains when capacity is available, and higher-priority requests are selected before lower-priority requests according to the configured --router-queue-policy.
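The hold condition described above can be sketched as a single predicate. This is an illustration under stated assumptions, not the router's implementation: it assumes each eligible worker reports one load value in the same unit as the threshold, and the function name `should_hold` is hypothetical.

```python
def should_hold(worker_loads: list[float], threshold: float) -> bool:
    """Return True when a new request would wait in the router queue.

    Assumption: one load value per eligible worker, comparable to the threshold.
    In this sketch an empty worker list also returns True (nothing to route to).
    """
    # The request is held only while every eligible worker is above the threshold.
    return all(load > threshold for load in worker_loads)


print(should_hold([0.90, 0.95], threshold=0.8))  # every worker is busy
print(should_hold([0.90, 0.50], threshold=0.8))  # one worker has capacity
```

If at least one worker is at or below the threshold, the request is dispatched immediately and priority has nothing to reorder at this layer.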

The default policy is fcfs, which uses the priority value as a positive arrival-time bump. Higher values move the request earlier in the queue. Negative priority values are clamped to zero for router queueing, so a request cannot be pushed behind normal first-come, first-served ordering by sending a negative priority.
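One way to picture the arrival-time bump is as an ordering key on a heap. The sketch below is not Dynamo's implementation: the scale `SECONDS_PER_PRIORITY_UNIT`, the function names, and the tuple layout are all assumptions chosen for illustration. It only shows the two properties stated above: higher values move a request earlier, and negative values are clamped to zero.

```python
import heapq
import itertools

# Hypothetical scale: how much "earlier arrival" one priority unit is worth.
SECONDS_PER_PRIORITY_UNIT = 1.0

_counter = itertools.count()  # tie-breaker so equal keys keep insertion order


def queue_key(arrival_time: float, priority: int) -> float:
    """Effective arrival time; smaller keys are dispatched first."""
    bump = max(priority, 0) * SECONDS_PER_PRIORITY_UNIT  # negatives clamp to zero
    return arrival_time - bump


def enqueue(queue: list, name: str, arrival_time: float, priority: int) -> None:
    heapq.heappush(queue, (queue_key(arrival_time, priority), next(_counter), name))


queue: list = []
enqueue(queue, "a", arrival_time=0.0, priority=0)
enqueue(queue, "b", arrival_time=1.0, priority=-5)  # treated like priority 0
enqueue(queue, "c", arrival_time=2.0, priority=10)  # effective arrival -8.0

order = [heapq.heappop(queue)[2] for _ in range(3)]
print(order)  # ['c', 'a', 'b']
```

Request "b" keeps its first-come, first-served position behind "a" despite the negative value, while "c" overtakes both.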

For the flag-level semantics, default value, and backend caveats, see Router Configuration and Tuning.

Backend Engine Priority

The backend receives the same Dynamo semantic priority, but each engine has its own native scheduling convention. Dynamo handles that conversion internally.

| Backend | Engine Scheduling Requirement | Dynamo Behavior |
| --- | --- | --- |
| vLLM | Start vLLM with --scheduling-policy priority. | Dynamo forwards the user priority with the polarity vLLM expects. |
| SGLang | Start SGLang with --enable-priority-scheduling. | Dynamo forwards higher Dynamo values as higher SGLang scheduling priority and rejects the inverted SGLang flag. |
| TensorRT-LLM | Per-request engine scheduling priority is not currently exposed through Dynamo. | Priority can still affect router queueing before dispatch. |

Do not negate nvext.agent_hints.priority in client code for vLLM. If a test shows lower user values receiving better TTFT, first check whether the benchmark harness or endpoint path inverted the value before it reached Dynamo.
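To see why client-side negation backfires, it helps to model the conversion. The function below is a hypothetical illustration, not Dynamo code. It assumes that vLLM's priority policy handles lower values first, so a normalizing layer would negate the Dynamo value; the backend keys and the function name are invented for this sketch.

```python
from typing import Optional


def to_engine_priority(backend: str, dynamo_priority: int) -> Optional[int]:
    """Map a Dynamo priority (higher = more important) to an engine-native value.

    Hypothetical illustration only; Dynamo performs this conversion internally.
    """
    if backend == "vllm":
        # Assumption: vLLM schedules lower values first, so the value is negated.
        return -dynamo_priority
    if backend == "sglang":
        # Higher Dynamo values map to higher SGLang scheduling priority.
        return dynamo_priority
    if backend == "trtllm":
        # Per-request engine priority is not exposed through Dynamo.
        return None
    raise ValueError(f"unknown backend: {backend}")


print(to_engine_priority("vllm", 10))   # correct client: sends 10
print(to_engine_priority("vllm", -10))  # client negated first: double inversion
```

Under these assumptions, a client that negates its own value ends up with the opposite of the intended ordering at the engine, which matches the symptom of lower user values receiving better TTFT.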

What Priority Does Not Do

Priority is not Kubernetes PriorityClass, GPU preemption, or a hard admission control policy. It does not reserve capacity for high-priority requests.

Priority also does not show an effect unless there is contention at a layer that uses it:

  • Router priority needs a non-empty router queue.
  • Engine priority needs backend priority scheduling enabled and engine-side queueing or preemption opportunities.
  • Cache priority needs memory pressure and a priority-aware eviction policy.

Verify Priority Is Working

Use a benchmark that can send different nvext.agent_hints.priority values on individual requests. For AIPerf, use a version with per-request extra payload support. Older AIPerf versions may only support global --extra-inputs, which is not enough for mixed-priority tiers in the same run.

For router-priority validation:

  • Use a fixed request count or burst-style test so every priority tier gets the same number of measured requests.
  • Keep the model, input length, output length, streaming mode, and endpoint path identical across priority tiers.
  • Run at enough load for requests to wait in the router queue. Watch dynamo_frontend_router_queue_pending_requests and confirm it is greater than zero during the measured window.
  • Configure the backend priority flag separately if the test is meant to measure engine scheduling, not only router queue ordering.
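The queue-depth check in the list above can be scripted against Prometheus text output. This is a minimal sketch: the metric name comes from this page, but the sample text, its `model` label, and the second metric are made up, and the parser assumes sample lines carry no trailing timestamp.

```python
METRIC = "dynamo_frontend_router_queue_pending_requests"

# Made-up sample in Prometheus text exposition format, for illustration only.
SAMPLE = """\
# TYPE dynamo_frontend_router_queue_pending_requests gauge
dynamo_frontend_router_queue_pending_requests{model="my-model"} 7
some_other_metric 3
"""


def pending_requests(metrics_text: str) -> float:
    """Sum every sample of the router queue gauge found in the text."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        head, value = line.rsplit(None, 1)  # assumes no trailing timestamp
        if head.split("{", 1)[0] == METRIC:
            total += float(value)
    return total


print(pending_requests(SAMPLE))  # 7.0
```

Sampling this value repeatedly during the measured window and confirming it stays above zero shows that the router queue actually had requests to reorder.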

Expected result: higher Dynamo priority values should receive better TTFT under contention. If lower values win, first check whether the client, benchmark harness, or gateway path negated the priority before it reached Dynamo.
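The expected result can be checked mechanically once per-request TTFT has been collected. The sketch below assumes the benchmark output has already been reduced to `(priority, ttft_ms)` pairs; the function names are hypothetical and the measurements are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def median_ttft_by_priority(samples):
    """Group (priority, ttft_ms) pairs and return {priority: median TTFT}."""
    grouped = defaultdict(list)
    for priority, ttft_ms in samples:
        grouped[priority].append(ttft_ms)
    return {p: median(values) for p, values in grouped.items()}


def higher_priority_wins(medians) -> bool:
    """True if median TTFT never increases as the priority value increases."""
    ordered = [medians[p] for p in sorted(medians)]
    return all(later <= earlier for earlier, later in zip(ordered, ordered[1:]))


# Made-up measurements, not real benchmark results.
samples = [(0, 900.0), (0, 1100.0), (5, 600.0), (5, 700.0), (10, 300.0), (10, 350.0)]
medians = median_ttft_by_priority(samples)
print(medians)
print(higher_priority_wins(medians))
```

A `False` result is the cue to look for a negated priority somewhere between the client and Dynamo before suspecting the router or engine.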

Troubleshooting

| Symptom | Checks |
| --- | --- |
| Priority has no visible effect. | Confirm requests actually enter the router queue, and confirm the backend priority flag is enabled if you expect engine-level scheduling. |
| Lower numeric values appear to win. | Do not negate nvext.agent_hints.priority for vLLM. Dynamo normalizes backend polarity internally. |
| Router queue never becomes non-empty. | Lower --router-queue-threshold, increase offered load, or check the SGLang max_num_batched_tokens caveat in Router Configuration and Tuning. |
| Priority works through the frontend but not through a Kubernetes gateway path. | Confirm the gateway path preserves nvext and use Dynamo v1.2.0 or later. |
| AIPerf cannot assign a different priority per request. | Use an AIPerf build with per-request extra payload support. |

Version Notes

| Capability | Availability |
| --- | --- |
| Router priority queue and backend priority plumbing | Dynamo v1.0.0 and later. |
| Unified Dynamo API semantics where higher nvext.agent_hints.priority means higher priority | Dynamo v1.1.0 and later. |
| EPP / Inference Gateway forwarding fixes for priority hints | Dynamo v1.2.0 and later. |
| AIPerf per-request priority datasets | Required for mixed-priority benchmark runs; use an AIPerf release with per-request extra payload support. |

Related Docs

  • Agent Hints
  • NVIDIA Request Extensions
  • Router Configuration and Tuning
  • Router Queue Metrics
  • vLLM Reference Guide
  • SGLang for Agentic Workloads