Copyright © 2026, NVIDIA Corporation.

DynoSim Sweeps

Search simulated deployment candidates across topology, router, and SLA constraints

A DynoSim sweep runs many simulated trials across candidate topologies, router settings, and timing-model inputs, then ranks the results against SLA constraints and GPU budget. Use sweeps when a single DynoSim run is not enough and you want to search the design space before validating on real GPUs.

The current Python API is dynamo.profiler.utils.replay_optimize. The docs use “DynoSim sweep” as the product term while keeping the existing implementation name for now.

What It Answers

A sweep answers a concrete deployment question:

  • given a fixed GPU budget
  • for a workload with prefix overlap
  • and latency SLAs that still permit meaningful throughput

which topology, worker split, and router settings produce the best simulated result?

For disaggregated deployments, the search can cover:

  • tensor-parallel shape for prefill and decode workers
  • prefill and decode worker counts
  • KV-router overlap credit
  • prompt-load scaling
  • throughput, TTFT, ITL, or end-to-end latency objectives

This is a heuristic search over simulated states, not an exact optimizer over every feasible configuration.

How It Works

Each candidate state is evaluated by the DynoSim run harness. The optimizer records the metrics from each run, filters candidates that violate SLA or GPU-budget constraints, and returns the best feasible state plus the full evaluated table for analysis.
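
The filter-then-rank step described above can be sketched in plain Python. This is an illustration only, not the optimizer's code: the `Trial` class, the `rank_trials` helper, and the sample numbers are all hypothetical, and the sketch assumes a throughput objective with a single TTFT bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One simulated candidate and the metrics its run reported (illustrative)."""
    name: str
    total_gpus_used: int
    output_throughput_tok_s: float
    mean_ttft_ms: float


def best_by_throughput(rows):
    """Highest-throughput row, or None when the list is empty."""
    return max(rows, key=lambda t: t.output_throughput_tok_s, default=None)


def rank_trials(trials, gpu_budget, max_ttft_ms):
    """Split trials by feasibility, then pick the best of each group."""
    feasible = [
        t for t in trials
        if t.total_gpus_used <= gpu_budget and t.mean_ttft_ms <= max_ttft_ms
    ]
    infeasible = [t for t in trials if t not in feasible]
    return best_by_throughput(feasible), best_by_throughput(infeasible), feasible


# Made-up sample data: "b" misses the TTFT bound, "c" exceeds the GPU budget.
trials = [
    Trial("a", 16, 900.0, 400.0),
    Trial("b", 16, 1200.0, 900.0),
    Trial("c", 24, 1500.0, 300.0),
    Trial("d", 12, 700.0, 350.0),
]
best_feasible, best_infeasible, feasible = rank_trials(
    trials, gpu_budget=16, max_ttft_ms=500.0
)
```

Keeping the infeasible group around mirrors the idea that a near-miss candidate can still be informative when no state satisfies every constraint.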

The descent is budget-focused: each step prunes to near-budget-edge states so the sweep ends up at a TP/worker shape that actually consumes the available GPU budget, rather than at a pure throughput-per-GPU point. Aggregated sweeps collapse the TP and worker dimensions into (tp, workers) but otherwise follow the same idea.
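
To make the "near-budget-edge" idea concrete, the sketch below enumerates disaggregated shapes and keeps only those whose footprint sits at the edge of the budget. It is a brute-force illustration, not the sweep's actual descent. The helper names, the TP choices, and the footprint formula (each worker occupies `tp` GPUs) are assumptions made for this example.

```python
from itertools import product


def gpus_used(prefill_tp, prefill_workers, decode_tp, decode_workers):
    """Simulated footprint, assuming each worker occupies `tp` GPUs."""
    return prefill_tp * prefill_workers + decode_tp * decode_workers


def near_budget_states(total_gpus, tp_choices=(1, 2, 4, 8), slack=0):
    """List (prefill_tp, prefill_workers, decode_tp, decode_workers) tuples
    whose footprint is within `slack` GPUs of the budget."""
    states = []
    for ptp, dtp in product(tp_choices, repeat=2):
        for pw in range(1, total_gpus // ptp + 1):
            remaining = total_gpus - ptp * pw
            for dw in range(1, remaining // dtp + 1):
                used = gpus_used(ptp, pw, dtp, dw)
                if total_gpus - used <= slack:
                    states.append((ptp, pw, dtp, dw))
    return states


# With a 16-GPU budget and no slack, every surviving state uses all 16 GPUs.
states = near_budget_states(16)
```

Raising `slack` admits shapes that leave a few GPUs idle, which is one way to trade budget utilization against a wider candidate pool.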

Spec Shape

The public API takes a single ReplayOptimizeSpec composed of:

| Spec | Purpose |
| --- | --- |
| EngineSpec | Model, backend, and JSON-like engine arguments. Disaggregated sweeps use prefill and decode engine args; aggregated sweeps use baseEngineArgs. |
| HardwareSpec | GPU SKU and total simulated GPU budget |
| WorkloadSpec | Synthetic workload knobs or a trace file |
| SLASpec | Optional TTFT, ITL, end-to-end latency, and p95 bounds |
| RouterSpec | Router mode, overlap-score-credit sweep, prefill-load-scale sweep, and KV-router base config |
| objective | Ranking target, such as throughput, mean TTFT, or mean end-to-end latency |
| maxParallelEvals | Number of candidate evaluations to run concurrently |

Field names use lowerCamelCase to align with DynamoGraphDeploymentRequest concepts. Method names stay snake_case to match Pydantic convention.
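
The naming convention can be pictured with a small stand-in. `WorkloadSketch` below is a hypothetical dataclass, not the real Pydantic model. Its field names and default values are taken from this page, while the `is_trace_mode` method is invented purely to show a snake_case method next to lowerCamelCase fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkloadSketch:
    """Illustrative stand-in for a workload spec; not the real model."""
    # Fields use lowerCamelCase, mirroring the documented spec fields.
    isl: int = 32768
    osl: int = 256
    requestCount: int = 5000
    concurrency: int = 200
    sharedPrefixRatio: float = 0.5
    numPrefixGroups: int = 50
    traceFile: Optional[str] = None

    # Methods use snake_case. This helper is hypothetical.
    def is_trace_mode(self) -> bool:
        return self.traceFile is not None


synthetic = WorkloadSketch()
traced = WorkloadSketch(traceFile="/tmp/toolagent_trace.jsonl")
```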

Prerequisites

Run from the repository root.

Use the project virtual environment:

$ .venv/bin/python --version

If the Python bindings are not importable yet, build them first:

$ .venv/bin/maturin develop --uv -m lib/bindings/python/Cargo.toml

The example uses AIC-backed timing by default:

  • AIC enumerates dense TP candidates
  • AIC-backed engine timing is used for candidate configs

Install aiconfigurator into the project environment:

$ uv pip install --python .venv/bin/python aiconfigurator

If a regular install fails to load usable perf data, reinstall from a source checkout that has real systems data materialized:

$ uv pip install --python .venv/bin/python --force-reinstall /path/to/aiconfigurator

If DynoSim sweep setup fails with AIC errors about missing perf databases or parse failures such as KeyError: 'gemm_dtype', inspect the installed files under:

.venv/lib/python*/site-packages/aiconfigurator/systems/data/...

If those files begin with version https://git-lfs.github.com/spec/v1, you have Git LFS pointer stubs instead of real perf tables. Install aiconfigurator from a checkout or wheel that includes the materialized LFS payloads in systems/.
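
A quick way to run that inspection is to scan the data directory for files that start with the pointer header. The helper below is a minimal sketch. The function name is hypothetical, and it relies only on the header string quoted above.

```python
from pathlib import Path

# Header quoted in the text above; a file starting with it is a pointer stub.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointer_stubs(root):
    """Return files under `root` whose first bytes match the Git LFS pointer header."""
    stubs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path)
    return stubs
```

Call it with the resolved aiconfigurator systems/data directory from your environment. A non-empty result means at least some perf tables were never materialized.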

When running directly from a source checkout, expose the in-repo Python packages:

$ export PYTHONPATH=lib/bindings/python/src:components/src

If the sweep uses multiple worker processes, prefer a real script file over a heredoc. On macOS, ProcessPoolExecutor child workers need a stable module path, and the driver module must guard its entry behind if __name__ == "__main__":.
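
The guard matters because macOS starts child processes with the spawn method by default, so each worker re-imports the driver module. The script below is a generic sketch of that layout using only the standard library. `evaluate_candidate` is a placeholder for one simulated trial, not a function from the Dynamo codebase.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_candidate(candidate):
    """Placeholder for one simulated trial.

    Defined at module top level so spawned workers can import it by name.
    """
    prefill_tp, decode_tp = candidate
    return {
        "prefill_tp": prefill_tp,
        "decode_tp": decode_tp,
        "gpus_per_pair": prefill_tp + decode_tp,
    }


def run_parallel(candidates, max_workers=2):
    """Evaluate candidates in worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_candidate, candidates))


if __name__ == "__main__":
    # Without this guard, each spawned worker would re-run the sweep on import.
    for row in run_parallel([(1, 1), (2, 1), (4, 2)]):
        print(row)
```

Saving this as a file and running it with the project interpreter gives the workers a stable module path, which a heredoc cannot provide.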

For KV-router logs, this filter keeps the run readable without hiding useful info output:

$ export DYN_LOG='info,dynamo_kv_router::scheduling::selector=warn'

Run The Example

The canonical starting point is the checked-in driver script:

$ .venv/bin/python components/src/dynamo/profiler/utils/replay_optimize/example.py \
    --max-parallel-evals 4

The default example searches a synthetic disaggregated KV-router workload using AIC-backed candidate timing. It prints the best feasible state and a table of top feasible configurations.

The example uses:

  • model: Qwen/Qwen3-32B
  • backend: vllm
  • GPU SKU: h200_sxm
  • total simulated GPUs: 16
  • router mode: kv_router
  • synthetic workload:
    • isl=32768
    • osl=256
    • requestCount=5000
    • concurrency=200
    • sharedPrefixRatio=0.5
    • numPrefixGroups=50

The GPU budget is a simulated search constraint. You do not need 16 real GPUs locally to run the search.

The base engine args stay conservative:

  • block_size=512
  • enable_prefix_caching=True
  • explicit worker_type for prefill versus decode

The example intentionally omits num_gpu_blocks; AIC-backed DynoSim estimates capacity for each candidate TP shape unless a base input explicitly pins it.
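
For intuition about the scale of the default workload, the arithmetic below converts the synthetic knobs into block and token counts. It assumes that sharedPrefixRatio is the fraction of each prompt shared within a prefix group and that requests are spread evenly across groups. Both are assumptions for illustration, not documented semantics.

```python
# Default synthetic workload and block size from the example above.
isl, osl = 32768, 256
request_count, concurrency = 5000, 200
shared_prefix_ratio, num_prefix_groups = 0.5, 50
block_size = 512

# Each prompt spans this many KV blocks.
blocks_per_prompt = isl // block_size

# Assumed: the shared prefix covers this fraction of each prompt's blocks.
shared_blocks_per_prompt = int(blocks_per_prompt * shared_prefix_ratio)

# Assumed: requests are distributed evenly across prefix groups.
requests_per_group = request_count // num_prefix_groups

# Total tokens generated over the whole run.
total_output_tokens = request_count * osl

print(blocks_per_prompt, shared_blocks_per_prompt, requests_per_group, total_output_tokens)
```

Under these assumptions each prompt is 64 blocks, half of which are reusable within its group, which is consistent with a prefix cache reuse ratio near 0.5.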

This setup does not force scheduler-specific bottlenecks such as:

  • enable_chunked_prefill
  • a small max_num_seqs
  • a pinned max_num_batched_tokens

Only add those when the experiment is specifically about scheduler limits.

Run Against A Trace

To run against a Mooncake-style trace instead of the synthetic workload:

$ .venv/bin/python components/src/dynamo/profiler/utils/replay_optimize/example.py \
    --trace-file /path/to/mooncake_trace.jsonl \
    --arrival-speedup-ratio 1.0 \
    --max-parallel-evals 4

For a public starting point, download the FAST’25 toolagent trace:

$ curl -sL \
    https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl \
    -o /tmp/toolagent_trace.jsonl

Then run:

$ .venv/bin/python components/src/dynamo/profiler/utils/replay_optimize/example.py \
    --trace-file /tmp/toolagent_trace.jsonl \
    --arrival-speedup-ratio 1.0 \
    --max-parallel-evals 4

In trace mode:

  • traceFile points at the Mooncake-style JSONL input
  • arrivalSpeedupRatio compresses or stretches the trace arrival process
  • synthetic-only knobs such as isl, osl, requestCount, concurrency, sharedPrefixRatio, and numPrefixGroups are ignored
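
One way to picture arrivalSpeedupRatio is as a divisor applied to each record's arrival timestamp. The sketch below makes that assumption explicit: a ratio above 1.0 compresses arrivals and a ratio below 1.0 stretches them. The record field names follow the public Mooncake traces. The helper is hypothetical and is not how the harness itself is implemented.

```python
import json


def rescale_arrivals(jsonl_lines, arrival_speedup_ratio):
    """Parse JSONL records and divide each timestamp by the ratio.

    Assumed semantics: ratio > 1 compresses the arrival process.
    """
    if arrival_speedup_ratio <= 0:
        raise ValueError("arrival_speedup_ratio must be positive")
    records = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace
        record = json.loads(line)
        record["timestamp"] = record["timestamp"] / arrival_speedup_ratio
        records.append(record)
    return records


# Tiny made-up trace in the Mooncake record shape.
sample = [
    '{"timestamp": 0, "input_length": 1024, "output_length": 16, "hash_ids": [0, 1]}',
    '{"timestamp": 1000, "input_length": 1024, "output_length": 16, "hash_ids": [0, 2]}',
    '{"timestamp": 3000, "input_length": 512, "output_length": 8, "hash_ids": [3]}',
]
compressed = rescale_arrivals(sample, 2.0)
```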

Important notes for the public toolagent trace:

  • the dataset uses Mooncake-style hash_ids with 512 tokens per block
  • the underlying run_trace_replay(...) API defaults trace_block_size to 512
  • WorkloadSpec does not yet expose a separate traceBlockSize field
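
Because the block size is not configurable through WorkloadSpec, it can be worth checking a custom trace before replaying it. The check below assumes that hash_ids cover the prompt in fixed-size blocks with a partial final block, so the count should equal the input length divided by the block size, rounded up. The helper name and the sample record are made up for illustration.

```python
import math


def hash_ids_match_block_size(record, block_size=512):
    """True when len(hash_ids) equals ceil(input_length / block_size)."""
    expected = math.ceil(record["input_length"] / block_size)
    return len(record["hash_ids"]) == expected


# Made-up record: 6758 tokens need 14 blocks of 512 tokens.
record = {"input_length": 6758, "output_length": 500, "hash_ids": list(range(14))}
matches_512 = hash_ids_match_block_size(record)
matches_256 = hash_ids_match_block_size(record, block_size=256)
```

A trace that fails this check at 512 was probably hashed with a different block size and would need re-hashing or a lower-level entry point.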

Customize A Sweep

Treat the example driver as a starting point, not a frozen harness. Modify it as needed for your search:

  • change the WorkloadSpec shape or switch to a trace source with traceFile
  • add SLA bounds on SLASpec, such as ttft, itl, e2eLatency, or their p95 variants
  • change RouterSpec.overlapCredits within the valid 0.0 to 1.0 range
  • change RouterSpec.prefillLoadScales when you want to weigh TTFT/prompt-side load more or less heavily
  • print different columns from result.evaluated_df or result.feasible_df
  • persist the tables to CSV or Parquet for downstream analysis

Useful axes to vary:

  • HardwareSpec.totalGpus
  • RouterSpec.overlapCredits
  • RouterSpec.prefillLoadScales
  • WorkloadSpec.sharedPrefixRatio
  • WorkloadSpec.numPrefixGroups
  • base prefill/decode engine args
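
When varying several of these axes at once, it helps to see how quickly the grid grows. The sketch below builds a Cartesian product over three of the axes and enforces the 0.0 to 1.0 range for overlap credits. The helper and the dictionary keys are hypothetical and are not the shape RouterSpec or HardwareSpec accept.

```python
from itertools import product


def build_grid(total_gpus_options, overlap_credits, prefill_load_scales):
    """Return one dict per combination, validating overlap credits first."""
    for credit in overlap_credits:
        if not 0.0 <= credit <= 1.0:
            raise ValueError(f"overlap credit {credit} is outside [0.0, 1.0]")
    return [
        {"totalGpus": gpus, "overlapCredit": credit, "prefillLoadScale": scale}
        for gpus, credit, scale in product(
            total_gpus_options, overlap_credits, prefill_load_scales
        )
    ]


# 2 budgets x 3 credits x 2 load scales = 12 combinations.
grid = build_grid([8, 16], [0.0, 0.5, 1.0], [0.5, 1.0])
```

Each combination still fans out into multiple topology candidates, so the number of simulated trials grows faster than the grid size alone suggests.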

If you want to compare routing strategies directly, use RouterSpec(mode="both") instead of the default KV-router-only search.

Outputs

The optimizer returns a DenseReplayOptimizationResult with:

  • best_feasible: best visited state that satisfies all configured SLA and GPU-budget constraints
  • best_infeasible: best visited state that misses at least one SLA bound or the budget
  • evaluated_df: all visited states
  • feasible_df: only feasible states

Common columns to inspect:

  • topology: prefill_tp, decode_tp, prefill_workers, decode_workers
  • routing: router_mode, overlap_score_credit, prefill_load_scale
  • budget: total_gpus_used. This is the simulated GPU footprint of the candidate state, not a count of GPUs actually allocated on the machine running the search.
  • throughput: output_throughput_tok_s
  • cache behavior: prefix_cache_reused_ratio
  • latency: mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms

The report DataFrame still uses the Rust DynoSim runner’s metric keys (mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms) even though the input SLASpec uses DGDR-style camelCase names (ttft, itl, e2eLatency). SLASpec carries an internal translation map.
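
When checking bounds by hand against a report row, a small translation map along these lines can help. It is a sketch, not the map SLASpec carries internally: the pairing follows the order in which the names are listed above, the bounds are assumed to be in milliseconds, and p95 keys are omitted because their column names are not given here.

```python
# Assumed pairing of SLASpec field names to report columns.
SLA_TO_METRIC = {
    "ttft": "mean_ttft_ms",
    "itl": "mean_tpot_ms",
    "e2eLatency": "mean_e2e_latency_ms",
}


def violated_bounds(sla, metrics_row):
    """Return SLA field names whose bound is exceeded; None means unbounded."""
    return [
        name
        for name, bound in sla.items()
        if bound is not None and metrics_row[SLA_TO_METRIC[name]] > bound
    ]


# Made-up bounds checked against a row shaped like the report columns.
row = {"mean_ttft_ms": 42800.0, "mean_tpot_ms": 35.0, "mean_e2e_latency_ms": 51900.0}
violations = violated_bounds({"ttft": 2000.0, "itl": 50.0, "e2eLatency": None}, row)
```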

In local testing, the default synthetic setup produced a non-trivial mean-E2E winner around:

  • prefill_tp=4, decode_tp=1, prefill_workers=3, decode_workers=4, overlap_score_credit=0.5, prefill_load_scale=1.0
  • output_throughput_tok_s ~= 970, prefix_cache_reused_ratio ~= 0.5, mean_ttft_ms ~= 42800, mean_tpot_ms ~= 35, mean_e2e_latency_ms ~= 51900

Treat those as sanity-check ranges, not fixed assertions.
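
The reported numbers can be cross-checked against each other with back-of-the-envelope arithmetic. The sketch below rests on three assumptions: end-to-end latency is roughly TTFT plus one TPOT per remaining output token, all 200 concurrency slots stay busy (so throughput follows from Little's law), and each worker occupies `tp` GPUs. It is a consistency check, not a model of the simulator.

```python
# Values quoted above for the default synthetic setup.
osl, concurrency = 256, 200
mean_ttft_ms, mean_tpot_ms = 42800.0, 35.0
reported_e2e_ms, reported_throughput_tok_s = 51900.0, 970.0
prefill_tp, prefill_workers, decode_tp, decode_workers = 4, 3, 1, 4


def relative_gap(estimate, reported):
    """Absolute difference as a fraction of the reported value."""
    return abs(estimate - reported) / reported


# E2E estimate: first token, then (osl - 1) further tokens at the mean TPOT.
estimated_e2e_ms = mean_ttft_ms + (osl - 1) * mean_tpot_ms

# Throughput estimate: in-flight requests times tokens each, per E2E second.
estimated_throughput_tok_s = concurrency * osl / (reported_e2e_ms / 1000.0)

# Footprint estimate: each worker uses `tp` GPUs.
estimated_gpus = prefill_tp * prefill_workers + decode_tp * decode_workers

print(estimated_e2e_ms, round(estimated_throughput_tok_s, 1), estimated_gpus)
```

Both latency and throughput estimates land within a few percent of the reported values, and the footprint matches the 16-GPU budget. A fresh run that diverges widely from these relationships deserves a closer look.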

Relationship To DynoSim Runs

A DynoSim run answers “how does this one configuration perform?” A DynoSim sweep answers “which configuration should I try next?”

For final validation, take feasible candidates into a live Mocker deployment or a real-GPU AIPerf benchmark. DynoSim is designed to narrow the search space before cluster validation, not to replace it.