A DynoSim sweep runs many simulated trials across candidate topologies, router settings, and timing-model inputs, then ranks the results against SLA constraints and GPU budget. Use sweeps when a single DynoSim run is not enough and you want to search the design space before validating on real GPUs.
The current Python API is dynamo.profiler.utils.replay_optimize. The docs use “DynoSim sweep” as the product term while keeping the existing implementation name for now.
A sweep answers a concrete deployment question:
which topology, worker split, and router settings produce the best simulated result?
For disaggregated deployments, the search can cover:
This is a heuristic search over simulated states, not an exact optimizer over every feasible configuration.
Each candidate state is evaluated by the DynoSim run harness. The optimizer records the metrics from each run, filters candidates that violate SLA or GPU-budget constraints, and returns the best feasible state plus the full evaluated table for analysis.
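The filter-then-rank step can be pictured with a minimal sketch. It is not the optimizer's implementation: the rows, the SLA bound, and the choice of throughput as the ranking objective are assumptions made for illustration, and only the column names come from the report columns described in this document.

```python
# Hypothetical evaluated states; numbers are made up for illustration.
evaluated = [
    {"prefill_tp": 4, "decode_tp": 1, "total_gpus_used": 16,
     "mean_ttft_ms": 900.0, "output_throughput_tok_s": 950.0},
    {"prefill_tp": 2, "decode_tp": 2, "total_gpus_used": 16,
     "mean_ttft_ms": 1500.0, "output_throughput_tok_s": 1100.0},
    {"prefill_tp": 8, "decode_tp": 2, "total_gpus_used": 20,
     "mean_ttft_ms": 700.0, "output_throughput_tok_s": 1300.0},
    {"prefill_tp": 2, "decode_tp": 1, "total_gpus_used": 12,
     "mean_ttft_ms": 950.0, "output_throughput_tok_s": 800.0},
]


def is_feasible(row, max_gpus, max_ttft_ms):
    """A state is feasible if it fits the GPU budget and meets the TTFT bound."""
    return row["total_gpus_used"] <= max_gpus and row["mean_ttft_ms"] <= max_ttft_ms


def pick_best(rows, max_gpus, max_ttft_ms):
    """Return (best feasible row or None, list of all feasible rows)."""
    feasible = [r for r in rows if is_feasible(r, max_gpus, max_ttft_ms)]
    # Assumption: rank feasible states by output throughput.
    best = max(feasible, key=lambda r: r["output_throughput_tok_s"], default=None)
    return best, feasible


best, feasible = pick_best(evaluated, max_gpus=16, max_ttft_ms=1000.0)
print(best)
```

The two fastest rows are rejected here, one for missing the TTFT bound and one for exceeding the budget, which is why the full evaluated table is worth keeping alongside the winner.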
The descent is budget-focused: each step prunes to near-budget-edge states so the sweep ends up at a TP/worker shape that actually consumes the available GPU budget, rather than at a pure throughput-per-GPU point. Aggregated sweeps collapse the TP and worker dimensions into (tp, workers) but otherwise follow the same idea.
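As a rough illustration of pruning to near-budget-edge states, the sketch below enumerates disaggregated shapes and keeps only those whose GPU footprint sits within a small slack of the budget. The footprint formula (TP times workers, summed over prefill and decode), the TP choices, and the `slack` parameter are assumptions for this example, not the optimizer's actual pruning rule.

```python
from itertools import product


def gpu_footprint(prefill_tp, prefill_workers, decode_tp, decode_workers):
    # Assumed footprint: each worker occupies `tp` GPUs.
    return prefill_tp * prefill_workers + decode_tp * decode_workers


def near_budget_states(total_gpus, tp_choices=(1, 2, 4, 8), slack=0):
    """List (prefill_tp, prefill_workers, decode_tp, decode_workers) shapes
    that leave at most `slack` GPUs of the budget unused."""
    states = []
    for p_tp, d_tp in product(tp_choices, repeat=2):
        for p_workers in range(1, total_gpus // p_tp + 1):
            remaining = total_gpus - p_tp * p_workers
            d_workers = remaining // d_tp  # largest decode worker count that fits
            if d_workers < 1:
                continue
            used = gpu_footprint(p_tp, p_workers, d_tp, d_workers)
            if total_gpus - used <= slack:
                states.append((p_tp, p_workers, d_tp, d_workers))
    return states


states = near_budget_states(16)
print(len(states), "shapes consume the full 16-GPU budget")
```

Under these assumptions, a shape such as prefill_tp=4, prefill_workers=3, decode_tp=1, decode_workers=4 uses 4*3 + 1*4 = 16 GPUs and survives the pruning.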
The public API takes a single ReplayOptimizeSpec composed of:
Field names use lowerCamelCase to align with DynamoGraphDeploymentRequest concepts. Method names stay snake_case to match Pydantic convention.
Run from the repository root.
Use the project virtual environment:
If the Python bindings are not importable yet, build them first:
The example uses AIC-backed timing by default:
Install aiconfigurator into the project environment:
If a regular install fails to load usable perf data, reinstall from a source checkout that has real systems data materialized:
If DynoSim sweep setup fails with AIC errors about missing perf databases or parse failures such as KeyError: 'gemm_dtype', inspect the installed files under:
If those files begin with version https://git-lfs.github.com/spec/v1, you have Git LFS pointer stubs instead of real perf tables. Install aiconfigurator from a checkout or wheel that includes the real LFS materialized payloads in systems/.
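A small helper can automate the pointer-stub check. This is a sketch that only looks for the Git LFS pointer header quoted above; the helper names are hypothetical, and the directory to scan is left as an argument so that you can pass the installed location yourself.

```python
from pathlib import Path

# Header that every Git LFS pointer file starts with.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file begins with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """Recursively list files under `root` that are still LFS pointer stubs."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

An empty result from `find_lfs_pointers(...)` on the installed data directory suggests the perf tables are real payloads rather than stubs.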
When running directly from a source checkout, expose the in-repo Python packages:
If the sweep uses multiple worker processes, prefer a real script file over a heredoc. On macOS, ProcessPoolExecutor child workers need a stable module path, and the driver module must guard its entry behind if __name__ == "__main__":.
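The guard pattern looks like the sketch below when saved as a real script file (for example a hypothetical `my_sweep.py`). The `evaluate_candidate` function is a stand-in that only computes a GPU count; it is not the DynoSim run harness.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_candidate(candidate):
    """Stand-in for one simulated evaluation; must live at module top level
    so that child processes can import it by name."""
    tp, workers = candidate
    return {"tp": tp, "workers": workers, "gpus": tp * workers}


def main():
    candidates = [(1, 8), (2, 4), (4, 2)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        for row in pool.map(evaluate_candidate, candidates):
            print(row)


# Without this guard, spawned child processes would re-run main() when they
# re-import the driver module.
if __name__ == "__main__":
    main()
```

Keeping the worker function at module level and the pool creation inside `main()` is what lets spawn-based platforms such as macOS start child processes cleanly.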
For KV-router logs, this filter keeps the run readable without hiding useful info output:
The canonical starting point is the checked-in driver script:
The default example searches a synthetic disaggregated KV-router workload using AIC-backed candidate timing. It prints the best feasible state and a table of top feasible configurations.
The example uses:
- model: Qwen/Qwen3-32B
- backend: vllm
- system: h200_sxm
- GPU budget: 16
- router mode: kv_router
- isl=32768
- osl=256
- requestCount=5000
- concurrency=200
- sharedPrefixRatio=0.5
- numPrefixGroups=50

The GPU budget is a simulated search constraint. You do not need 16 real GPUs locally to run the search.
The base engine args stay conservative:
- block_size=512
- enable_prefix_caching=True
- worker_type for prefill versus decode

The example intentionally omits num_gpu_blocks; AIC-backed DynoSim estimates capacity for each candidate TP shape unless a base input explicitly pins it.
This setup does not force scheduler-specific bottlenecks such as:
- enable_chunked_prefill
- max_num_seqs
- max_num_batched_tokens

Only add those when the experiment is specifically about scheduler limits.
To run against a Mooncake-style trace instead of the synthetic workload:
For a public starting point, download the FAST’25 toolagent trace:
Then run:
In trace mode:
- traceFile points at the Mooncake-style JSONL input
- arrivalSpeedupRatio compresses or stretches the trace arrival process
- isl, osl, requestCount, concurrency, sharedPrefixRatio, and numPrefixGroups are ignored

Important notes for the public toolagent trace:
- the trace uses hash_ids with 512 tokens per block
- the run_trace_replay(...) API defaults trace_block_size to 512
- WorkloadSpec does not yet expose a separate traceBlockSize field

Treat the example driver as a starting point, not a frozen harness. Modify it as needed for your search:
- change the WorkloadSpec shape or switch to a trace source with traceFile
- adjust SLASpec bounds, such as ttft, itl, e2eLatency, or their p95 variants
- vary RouterSpec.overlapCredits within the valid 0.0 to 1.0 range
- vary RouterSpec.prefillLoadScales when you want to weigh TTFT/prompt-side load more or less heavily
- inspect result.evaluated_df or result.feasible_df

Useful axes to vary:
- HardwareSpec.totalGpus
- RouterSpec.overlapCredits
- RouterSpec.prefillLoadScales
- WorkloadSpec.sharedPrefixRatio
- WorkloadSpec.numPrefixGroups

If you want to compare routing strategies directly, use RouterSpec(mode="both") instead of the default KV-router-only search.
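To get a feel for how quickly the router axes multiply, the sketch below builds the Cartesian grid of overlap credits and prefill load scales. The specific values and the `router_grid` helper are hypothetical; the sketch does not call the DynoSim API, and it only enforces the 0.0 to 1.0 range stated above for overlap credits.

```python
from itertools import product

# Hypothetical axis values chosen for illustration.
overlap_credits = [0.0, 0.5, 1.0]
prefill_load_scales = [0.5, 1.0]


def router_grid(credits, scales):
    """Return every (overlap credit, prefill load scale) combination."""
    for credit in credits:
        if not 0.0 <= credit <= 1.0:
            raise ValueError(f"overlap credit {credit} is outside 0.0 to 1.0")
    return [
        {"overlap_score_credit": c, "prefill_load_scale": s}
        for c, s in product(credits, scales)
    ]


grid = router_grid(overlap_credits, prefill_load_scales)
print(len(grid), "router settings per topology")
```

Each added axis value multiplies the number of simulated runs per topology, so widening one axis at a time keeps the sweep manageable.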
The optimizer returns a DenseReplayOptimizationResult with:
- best_feasible: best visited state that satisfies all configured SLA and GPU-budget constraints
- best_infeasible: best visited state that misses at least one SLA bound or the budget
- evaluated_df: all visited states
- feasible_df: only feasible states

Common columns to inspect:
- prefill_tp, decode_tp, prefill_workers, decode_workers
- router_mode, overlap_score_credit, prefill_load_scale
- total_gpus_used: the simulated GPU footprint of the candidate state, not a count of GPUs actually allocated on the machine running the search
- output_throughput_tok_s
- prefix_cache_reused_ratio
- mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms

The report DataFrame still uses the Rust DynoSim runner’s metric keys (mean_ttft_ms, mean_tpot_ms, mean_e2e_latency_ms) even though the input SLASpec uses DGDR-style camelCase names (ttft, itl, e2eLatency). SLASpec carries an internal translation map.
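When checking report rows against SLA inputs by hand, a translation table along these lines can help. The pairing below is inferred from the parallel ordering of the names in this document and is an assumption, as is the use of milliseconds for the bounds; it is not SLASpec's internal map, and the p95 variants are left out because their column names are not listed here.

```python
# Assumed mapping from SLASpec field names to report column names.
SLA_TO_REPORT_COLUMN = {
    "ttft": "mean_ttft_ms",
    "itl": "mean_tpot_ms",
    "e2eLatency": "mean_e2e_latency_ms",
}


def violated_bounds(row, sla):
    """Return the SLA field names whose bound (assumed ms) the row exceeds."""
    violations = []
    for name, bound in sla.items():
        column = SLA_TO_REPORT_COLUMN[name]
        if row[column] > bound:
            violations.append(name)
    return violations


# Hypothetical report row and SLA bounds for illustration.
row = {"mean_ttft_ms": 42800.0, "mean_tpot_ms": 35.0, "mean_e2e_latency_ms": 51900.0}
print(violated_bounds(row, {"ttft": 50000.0, "itl": 30.0}))
```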
In local testing, the default synthetic setup produced a non-trivial mean-E2E winner around:
- prefill_tp=4, decode_tp=1, prefill_workers=3, decode_workers=4, overlap_score_credit=0.5, prefill_load_scale=1.0
- output_throughput_tok_s ~= 970, prefix_cache_reused_ratio ~= 0.5, mean_ttft_ms ~= 42800, mean_tpot_ms ~= 35, mean_e2e_latency_ms ~= 51900

Treat those as sanity-check ranges, not fixed assertions.
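One way to use those figures as ranges rather than assertions is a relative-tolerance check, sketched below. The reference values are the approximate numbers quoted above; the 25% tolerance and the `out_of_range` helper are arbitrary choices made for this example.

```python
# Approximate reference metrics from the local run described above.
REFERENCE = {
    "output_throughput_tok_s": 970.0,
    "prefix_cache_reused_ratio": 0.5,
    "mean_ttft_ms": 42800.0,
    "mean_tpot_ms": 35.0,
    "mean_e2e_latency_ms": 51900.0,
}


def out_of_range(observed, reference=REFERENCE, rel_tol=0.25):
    """Return the metric names that deviate from the reference by more than
    rel_tol (relative to the reference value)."""
    flagged = []
    for name, expected in reference.items():
        if abs(observed[name] - expected) > rel_tol * abs(expected):
            flagged.append(name)
    return flagged


# Hypothetical observation: throughput is close, TTFT has roughly doubled.
observed = dict(REFERENCE, output_throughput_tok_s=1000.0, mean_ttft_ms=90000.0)
print(out_of_range(observed))
```

A flagged metric is a prompt to inspect the run setup, not proof of a regression.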
A DynoSim run answers “how does this one configuration perform?” A DynoSim sweep answers “which configuration should I try next?”
For final validation, take feasible candidates into a live Mocker deployment or a real-GPU AIPerf benchmark. DynoSim is designed to narrow the search space before cluster validation, not to replace it.