
Copyright © 2026, NVIDIA Corporation.


ThunderAgent Program Scheduler

Program-level scheduling with tool-boundary pause/resume on top of KV-aware routing


Experimental — not a released component. Run it from a source checkout, not from a pip install ai-dynamo. The CLI flags, the nvext.agent_context schema, and the lifecycle hooks are all unstable and will change. Build and launch specifics live next to the code in components/src/dynamo/thunderagent_router/README.md.

dynamo.thunderagent_router is a standalone Dynamo router that schedules at the granularity of an agent run — the whole LLM turn → tool call → next turn loop — instead of individual requests. It wraps Dynamo’s native KV router and adds a program-level scheduler with tool-boundary pause/resume on top of KV-aware routing, porting the scheduler from the ThunderAgent paper (Kang et al., 2026).

The Problem

Agentic workloads (SWE-bench, browser-use, anything with a tool loop) make many short LLM calls separated by non-GPU work: docker exec, pytest, curl, waiting on a subagent. Between turns the agent’s KV cache stays resident, holding blocks while doing nothing. A request-level router (vLLM’s, SGLang’s, Dynamo’s stock KvRouter) sees each turn but not the agent behind it, which costs you two ways:

  • Cache-occupancy blowup. With N agents at step K, the working set is N × step_K_context, most of it idle between turns. The engine evicts useful blocks under pressure or refuses admission, and every next turn pays a re-prefill tax.
  • No tool-boundary backpressure. The router can’t defer a hot trajectory at a natural pause point — it can only cancel in-flight requests or queue them, both worse than waiting until the agent is between turns.
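To get a feel for the first point, here is a back-of-the-envelope sketch of the N × step_K_context working set. The agent count, context length, and busy fraction are made-up illustrative inputs, not measurements from this page, and `working_set_tokens` is a hypothetical helper.

```python
def working_set_tokens(num_agents: int, context_tokens: int, busy_fraction: float) -> tuple[int, int]:
    """Return (resident, idle) KV tokens across all agents, on average.

    busy_fraction is the hypothetical share of wall time an agent spends
    inside an LLM turn; the rest is tool time during which its KV sits idle.
    """
    resident = num_agents * context_tokens
    idle = round(resident * (1.0 - busy_fraction))
    return resident, idle


# Hypothetical workload: 64 agents, 30k tokens of context each,
# 25% of wall time spent inside LLM turns.
total, idle = working_set_tokens(num_agents=64, context_tokens=30_000, busy_fraction=0.25)
print(f"resident: {total:,} tokens, idle between turns: {idle:,} tokens")
```

Under these assumed inputs, three quarters of the resident KV is doing nothing at any given moment, which is the headroom a program-level scheduler tries to reclaim.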

The Scheduler

The algorithm groups requests by program_id (the trajectory_id from nvext.agent_context) and runs an outer scheduler that moves each program through (REASONING | ACTING) × (ACTIVE | PAUSED). A program enters ACTING at a tool boundary. Under memory pressure the scheduler pauses ACTING programs — logically, with no decode preemption — so the engine is free to evict their KV. When utilization drops it resumes the smallest-token programs first, BFD-packing them back under threshold. The payoff is working-set accounting that counts programs rather than requests, plus pause/resume aimed at tool boundaries rather than arbitrary tokens.
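The state space described above can be modeled roughly as follows. This is a minimal sketch, not the router's actual data model: the field names, the method names, and the choice to replace (rather than accumulate) the token total on each turn are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    REASONING = "reasoning"  # an LLM turn is in flight
    ACTING = "acting"        # between turns, waiting on a tool


class Status(Enum):
    ACTIVE = "active"
    PAUSED = "paused"


@dataclass
class ProgramState:
    program_id: str                  # trajectory_id from nvext.agent_context
    phase: Phase = Phase.REASONING
    status: Status = Status.ACTIVE
    token_total: int = 0

    def on_turn_end(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record real usage from the response and enter the tool boundary."""
        # Assumption: the latest turn's prompt already contains the earlier
        # context, so the working set is replaced rather than accumulated.
        self.token_total = prompt_tokens + completion_tokens
        self.phase = Phase.ACTING

    def on_turn_start(self) -> None:
        """The agent came back from its tool call with a new turn."""
        self.phase = Phase.REASONING


program = ProgramState(program_id="abc")
program.on_turn_end(prompt_tokens=1200, completion_tokens=300)
print(program.phase.name, program.status.name, program.token_total)
```

Keeping phase and status as two independent axes mirrors the (REASONING | ACTING) × (ACTIVE | PAUSED) product in the text: a program can be marked for pause while still reasoning, and only stops being admitted once it reaches a tool boundary.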

This is an in-path Dynamo service that owns a KvRouter directly and registers as a model handler, so there is no extra proxy hop, and it reads real prompt_tokens + completion_tokens off each response rather than estimating token counts from raw bytes.

Scheduler Tick

A single background task runs every --scheduler-interval-seconds (default 5.0). Each tick takes a capacity snapshot and runs three phases in a fixed order:

_apply_soft_demotes → _greedy_resume → _pause_until_safe

Resume runs before pause on purpose, following the upstream ThunderAgent ordering. Because the resume phase has already finished by the time anything is paused, a program paused in this tick cannot be resumed until the next one, which rules out pause-then-resume churn within a single tick.
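The effect of the phase ordering can be seen in a toy model. This sketch only tracks a set of paused program ids; `run_tick`, its arguments, and the action strings are hypothetical, and the soft-demote phase is reduced to a placeholder.

```python
def run_tick(paused: set, pause_candidates: list, has_headroom: bool) -> list:
    """Run one toy tick and return the ordered list of actions taken."""
    actions = ["soft_demote"]  # phase 1: placeholder, priorities are not modeled

    # Phase 2: resume only sees programs that were paused on earlier ticks.
    if has_headroom:
        for pid in sorted(paused):  # sorted() copies, so removing is safe
            paused.remove(pid)
            actions.append(f"resume:{pid}")

    # Phase 3: anything paused here stays paused until the next tick.
    for pid in pause_candidates:
        paused.add(pid)
        actions.append(f"pause:{pid}")
    return actions


paused = {"a"}
first = run_tick(paused, pause_candidates=["b"], has_headroom=True)
second = run_tick(paused, pause_candidates=[], has_headroom=True)
print(first)
print(second)
```

Program `b` is paused at the end of the first tick and is only eligible for resume in the second, so no program is ever paused and resumed by the same tick.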

Tool-Boundary Pause/Resume Semantics

  • Pause is logical. The scheduler picks the smallest ACTING programs on an over-threshold worker first and pauses them; if no ACTING candidate exists it marks the smallest REASONING program for pause at its next tool boundary. There is no decode preemption — a paused program’s in-flight turn is allowed to finish, and the program is held out of admission until a later tick resumes it.
  • Resume is greedy and BFD-packed. When a worker has headroom (see the control loop below), the scheduler resumes the smallest-token paused programs first, fitting each back under threshold and accounting for buffer_per_program. Resumed requests get a transient priority boost so they re-enter ahead of fresh admissions, and a forced-resume cap (--resume-timeout-seconds) guarantees no program is starved indefinitely.
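The two selection rules above can be sketched for a single worker as follows. This is an illustrative approximation under stated assumptions, not the router's code: the `Prog` type, the function signatures, and the exact fit test (resumed tokens plus `buffer_per_program` must stay under `pause_threshold`) are guesses at one reasonable reading of the text. The priority boost and the forced-resume timeout are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Prog:
    pid: str
    tokens: int
    acting: bool          # True at a tool boundary, False while reasoning
    paused: bool = False
    marked: bool = False  # pause requested at the next tool boundary


def pause_until_safe(progs, capacity, pause_threshold=0.95, pause_target=0.80):
    """Pause the smallest ACTING programs until utilization reaches pause_target."""
    live = [p for p in progs if not p.paused]
    used = sum(p.tokens for p in live)
    if used / capacity < pause_threshold:
        return []
    acting = sorted((p for p in live if p.acting), key=lambda p: p.tokens)
    if not acting:
        # No ACTING candidate: mark the smallest REASONING program instead.
        reasoning = [p for p in live if not p.acting]
        if reasoning:
            min(reasoning, key=lambda p: p.tokens).marked = True
        return []
    paused = []
    for p in acting:
        if used / capacity <= pause_target:
            break
        p.paused = True
        used -= p.tokens
        paused.append(p.pid)
    return paused


def greedy_resume(progs, capacity, buffer_per_program,
                  pause_threshold=0.95, resume_hysteresis=0.10):
    """Resume the smallest paused programs that still fit under the threshold."""
    used = sum(p.tokens for p in progs if not p.paused)
    if used / capacity > pause_threshold - resume_hysteresis:
        return []  # not enough headroom yet
    resumed = []
    for p in sorted((p for p in progs if p.paused), key=lambda p: p.tokens):
        cost = p.tokens + buffer_per_program
        if used + cost > pause_threshold * capacity:
            break  # candidates are sorted, so larger ones will not fit either
        p.paused = False
        used += cost
        resumed.append(p.pid)
    return resumed


progs = [Prog("a", 100, True), Prog("b", 300, True), Prog("c", 560, False)]
paused_now = pause_until_safe(progs, capacity=1000)
resumed_now = greedy_resume(progs, capacity=1000, buffer_per_program=50)
print(paused_now, resumed_now)
```

The demo calls the two functions back to back only to exercise them; in the scheduler tick the resume phase runs before the pause phase.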

Program Close (trajectory_final)

A program is created on its first turn (keyed by trajectory_id) and otherwise moves between ACTIVE and PAUSED for the lifetime of the process. A single LLM finish_reason cannot mark a program done, since the agent typically keeps issuing turns. To release a program deterministically, the harness sets the optional terminal marker nvext.agent_context.trajectory_final: true on a final request (for example, on the agent’s agent_end). When the router sees the marker, it deletes the program from its table and from the paused set, then short-circuits the request: it is never forwarded to a worker, and the response is an empty completion. This frees the program’s scheduling bookkeeping so its tokens stop counting against worker utilization.

// final request: released, no inference
{ "model": "...", "max_tokens": 1, "messages": [{"role":"user","content":"."}],
  "nvext": { "agent_context": { "session_type_id": "...", "session_id": "...",
    "trajectory_id": "abc", "trajectory_final": true } } }

The close is best-effort from the harness side. If it never arrives (crash, black-box harness), the program lingers in the table, but its token weight decays toward zero, so it stops counting against worker utilization and the scheduling impact self-heals even without an explicit close. pi-dynamo-provider fires the close automatically on agent_end / session_shutdown as a dedicated max_tokens: 1 request. A dedicated request is needed because a reactive agent loop only learns that a turn was terminal from that turn’s response: by the time the run’s end is known, its last real turn has typically already returned, leaving no live turn to flag.
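For a harness that does not use pi-dynamo-provider, the close round trip might look like the sketch below. The payload fields are the ones shown in this section; `build_close_request`, `admit`, the in-memory table layout, and the sample identifiers are hypothetical stand-ins, and the router-side function is a toy model rather than the service's handler.

```python
import json


def build_close_request(model, session_type_id, session_id, trajectory_id):
    """Harness side: the dedicated one-token request that closes a program."""
    return {
        "model": model,
        "max_tokens": 1,
        "messages": [{"role": "user", "content": "."}],
        "nvext": {
            "agent_context": {
                "session_type_id": session_type_id,
                "session_id": session_id,
                "trajectory_id": trajectory_id,
                "trajectory_final": True,
            }
        },
    }


def admit(request, table, paused):
    """Toy router side: return True if the request should reach a worker."""
    ctx = request.get("nvext", {}).get("agent_context", {})
    trajectory_id = ctx.get("trajectory_id")
    if ctx.get("trajectory_final"):
        table.pop(trajectory_id, None)   # drop scheduling bookkeeping
        paused.discard(trajectory_id)
        return False                     # short-circuit: no inference
    table.setdefault(trajectory_id, {"token_total": 0})  # created on first turn
    return True


# Hypothetical state: program "abc" exists and is currently paused.
table, paused = {"abc": {"token_total": 4096}}, {"abc"}
close = build_close_request("my-model", "swe-agent", "session-1", "abc")
forwarded = admit(close, table, paused)
print(json.dumps(close["nvext"]), forwarded, table, paused)
```

The harness would send the payload through the frontend like any other chat completion; only the `trajectory_final` marker distinguishes it.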

Utilization-Driven Control Loop

Pause/resume is driven by per-worker utilization — the program working set as a fraction of the worker’s KV pool. The loop has three bands:

  • At or above pause-threshold, the worker is over-subscribed; the tick pauses ACTING programs until utilization falls back to pause-target.
  • In the [soft-demote-threshold, pause-threshold) band, programs are soft-demoted (a negative priority jump) but not paused — early backpressure before a hard pause is needed.
  • Resume only fires once utilization has dropped at least resume-hysteresis below pause-threshold, so the loop does not oscillate between pause and resume on the threshold boundary.
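The three bands can be expressed as a small pure function. This is a sketch using the documented default values; `BandDecision` and `decide` are hypothetical names. Note that with the defaults the soft-demote band and the resume-eligible region overlap between 0.80 and 0.85, so the result is a set of flags rather than a single label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BandDecision:
    pause: bool           # at or above pause-threshold
    soft_demote: bool     # inside [soft-demote-threshold, pause-threshold)
    resume_allowed: bool  # at least resume-hysteresis below pause-threshold


def decide(util, pause_threshold=0.95, soft_demote_threshold=0.80,
           resume_hysteresis=0.10):
    """Classify one worker's utilization against the control-loop bands."""
    pause = util >= pause_threshold
    soft_demote = soft_demote_threshold <= util < pause_threshold
    resume_allowed = util <= pause_threshold - resume_hysteresis
    return BandDecision(pause, soft_demote, resume_allowed)


for util in (0.50, 0.82, 0.90, 0.97):
    print(util, decide(util))
```

At 0.90 the worker is neither paused nor allowed to resume anything, which is the dead zone that keeps the loop from oscillating around the threshold.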
| Flag | Env var | Default | Description |
| --- | --- | --- | --- |
| --pause-threshold | DYN_THUNDERAGENT_PAUSE_THRESHOLD | 0.95 | Working-set fraction of the KV pool that fires a pause cycle. |
| --soft-demote-threshold | DYN_THUNDERAGENT_SOFT_DEMOTE_THRESHOLD | 0.80 | Soft-demote band start (negative priority jump in [soft, pause)). |
| --pause-target | DYN_THUNDERAGENT_PAUSE_TARGET | 0.80 | Setpoint that pause cycles drive utilization back down to. Must be <= pause-threshold. |
| --resume-hysteresis | DYN_THUNDERAGENT_RESUME_HYSTERESIS | 0.10 | Headroom below pause-threshold required before any resume. |
| --resume-priority-boost | DYN_THUNDERAGENT_RESUME_PRIORITY_BOOST | 1.0 | Priority seconds added to a request that just resumed. |
| --resume-timeout-seconds | DYN_THUNDERAGENT_RESUME_TIMEOUT_SECONDS | 1800.0 | Forced-resume cap. Mirrors ThunderAgent’s _wait_for_resume. |
| --scheduler-interval-seconds | DYN_THUNDERAGENT_SCHEDULER_INTERVAL_SECONDS | 5.0 | Scheduler tick period. |
| --soft-demote-priority-jump | DYN_THUNDERAGENT_SOFT_DEMOTE_PRIORITY_JUMP | -2.0 | Priority seconds applied to soft-demoted programs. |
| --acting-token-weight | DYN_THUNDERAGENT_ACTING_TOKEN_WEIGHT | 1.0 | Multiplier on token_total for ACTING programs in the pause-side working set. |
| --acting-decay-tau-seconds | DYN_THUNDERAGENT_ACTING_DECAY_TAU_SECONDS | 1.0 | Tau for exponential decay of ACTING tokens in the resume-side working set. |
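The two ACTING-related knobs weight a program's tokens differently on each side of the loop. A minimal sketch of what a multiplier and an exponential decay with time constant tau look like is given below. The function names are hypothetical, and the assumption that the decay clock starts when the program enters ACTING is a guess, not something this page states.

```python
import math


def pause_side_tokens(token_total, acting_token_weight=1.0):
    """Tokens an ACTING program contributes to the pause-side working set."""
    return acting_token_weight * token_total


def resume_side_tokens(token_total, seconds_acting, tau_seconds=1.0):
    """Tokens an ACTING program contributes to the resume-side working set.

    Assumption: seconds_acting is measured from the moment the program
    entered ACTING; the contribution decays as exp(-t / tau).
    """
    return token_total * math.exp(-seconds_acting / tau_seconds)


for t in (0.0, 1.0, 5.0):
    print(t, round(resume_side_tokens(8000, t), 1))
```

After one tau the contribution has dropped to about 37% of the original, and after a few tau it is close to zero, which matches the description of an unclosed program's weight decaying away.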

Constraint: pause-target <= pause-threshold. The service rejects configs that violate it (along with 0 <= resume-hysteresis <= pause-threshold and 0 <= soft-demote-threshold <= pause-threshold).
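These constraints are easy to check before launching the service. The sketch below mirrors the three documented inequalities and reads overrides from the documented environment variable names. The `Thresholds` class and `from_env` are hypothetical; how the real service resolves precedence between flags and environment variables is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    pause_threshold: float = 0.95
    soft_demote_threshold: float = 0.80
    pause_target: float = 0.80
    resume_hysteresis: float = 0.10

    def __post_init__(self):
        if not self.pause_target <= self.pause_threshold:
            raise ValueError("pause-target must be <= pause-threshold")
        if not 0 <= self.resume_hysteresis <= self.pause_threshold:
            raise ValueError("need 0 <= resume-hysteresis <= pause-threshold")
        if not 0 <= self.soft_demote_threshold <= self.pause_threshold:
            raise ValueError("need 0 <= soft-demote-threshold <= pause-threshold")

    @classmethod
    def from_env(cls, env):
        """Build from a mapping such as os.environ, falling back to defaults."""
        prefix = "DYN_THUNDERAGENT_"
        names = ("pause_threshold", "soft_demote_threshold",
                 "pause_target", "resume_hysteresis")
        overrides = {
            name: float(env[prefix + name.upper()])
            for name in names
            if prefix + name.upper() in env
        }
        return cls(**overrides)


ok = Thresholds.from_env({"DYN_THUNDERAGENT_PAUSE_THRESHOLD": "0.90"})
try:
    Thresholds(pause_target=0.99)  # violates pause-target <= pause-threshold
    rejected = False
except ValueError:
    rejected = True
print(ok, rejected)
```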

All KvRouter flags from dynamo.router (--router-temperature, --use-kv-events, --router-track-output-blocks, …) are also accepted and forwarded. See the folder README for the remaining service flags (--endpoint, --model-name, --model-path, tool-call and reasoning parsers).

Architecture

┌─────────────────────────────────────────────────────────────┐
│  dynamo.frontend (HTTP + auth + tracing sink)               │
└────────────────────┬────────────────────────────────────────┘
                     │ chat completions, with nvext.agent_context
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  dynamo.thunderagent_router (this service)                  │
│  - ProgramTable: trajectory_id → ProgramState               │
│  - admission gate: before_request → was_paused?             │
│  - scheduler loop (every scheduler_interval_seconds):       │
│     _apply_soft_demotes → _greedy_resume → _pause_until_safe│
│  - sticky worker pin from program.assigned_worker_id        │
│  - after_request: real-token accounting                     │
└────────────────────┬────────────────────────────────────────┘
                     │ KvRouter.generate
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  KvRouter (in-process; subscribes to KV events + FPM)       │
└────────────────────┬────────────────────────────────────────┘
                     │ per-worker dispatch
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  dynamo.vllm (N workers; FPM publisher, KV events publisher)│
└─────────────────────────────────────────────────────────────┘

Observability

The scheduler emits a per-tick INFO summary on each side of the control loop, so both pause and resume activity are visible at INFO without enabling DEBUG. Per-program detail stays at DEBUG.

Pause side — logged when a worker pauses or marks any program in a tick:

scheduler.tick worker=<id> paused=<N> marked=<M> util=<X> -> <Y>

paused is the number of ACTING programs paused this tick, marked is the number of REASONING programs marked for pause at their next tool boundary, and util=X -> Y is the worker utilization before and after the pause cycle.

Resume side — logged when a worker resumes any program in a tick:

scheduler.tick resumed=<N> still_paused=<M>

resumed is the number of programs resumed this tick and still_paused is the size of the paused table afterward. This line is symmetric to the pause-side summary; before it existed, pause was observable at INFO but resume was only visible at DEBUG, leaving a gap when reconstructing a control-loop cycle from INFO logs alone.
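When reconstructing control-loop cycles from INFO logs, the two summary lines can be pulled apart with a pair of regular expressions. This is a sketch against the line shapes shown above; the sample lines, the assumption that worker ids contain no whitespace, and the assumption that utilization is printed as a plain decimal are illustrative guesses rather than guarantees about the real log output.

```python
import re

PAUSE_RE = re.compile(
    r"scheduler\.tick worker=(?P<worker>\S+) paused=(?P<paused>\d+) "
    r"marked=(?P<marked>\d+) util=(?P<before>[0-9.]+) -> (?P<after>[0-9.]+)"
)
RESUME_RE = re.compile(
    r"scheduler\.tick resumed=(?P<resumed>\d+) still_paused=(?P<still_paused>\d+)"
)


def parse_tick(line):
    """Return a dict for a pause-side or resume-side summary, else None."""
    m = PAUSE_RE.search(line)
    if m:
        return {
            "side": "pause",
            "worker": m["worker"],
            "paused": int(m["paused"]),
            "marked": int(m["marked"]),
            "util_before": float(m["before"]),
            "util_after": float(m["after"]),
        }
    m = RESUME_RE.search(line)
    if m:
        return {
            "side": "resume",
            "resumed": int(m["resumed"]),
            "still_paused": int(m["still_paused"]),
        }
    return None


# Hypothetical log lines in the documented shape.
pause_rec = parse_tick("INFO scheduler.tick worker=7 paused=2 marked=1 util=0.97 -> 0.79")
resume_rec = parse_tick("INFO scheduler.tick resumed=3 still_paused=4")
print(pause_rec)
print(resume_rec)
```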

Per-program detail (DEBUG):

Paused program <program_id> (tokens=<n>)
Resumed program <program_id> -> worker=<id> (tokens=<n>)

Enable these by lowering the log level for dynamo.thunderagent_router. They give the exact program identities behind each INFO summary count.

For per-request tracing (token counts, cache hits, worker placement, tool-event timelines), the router also integrates with Agent Tracing: set DYN_AGENT_TRACE=1 on the frontend to land a request_end record per LLM call plus the harness tool-event timeline.

Reproducing the MiniMax-M2 Results

The headline numbers (program-aware scheduling vs KV-routing-only on the same hardware, ~12-16% throughput improvement on SWE-bench-Lite with two TP4 MiniMax-M2 replicas on a single 8×H100 node) and the exact launch/repro commands live in the folder README.

References

  • ThunderAgent paper: arxiv.org/abs/2602.13692
  • Upstream ThunderAgent reference: HaoKang-Timmy/ThunderAgent
  • Repro fork (mini-swe-agent + agent_context injector): ishandhanani/ThunderAgent
  • Dynamo KV router: Router Guide
  • nvext.agent_context schema: nvext reference
  • Agent Tracing and Agent Hints