Sweep + Orchestrator Developer Reference
Developer reference for AIPerf’s sweep + adaptive-search machinery. Three zoom levels below: mental model -> seven-stage tour -> class/module map.
Every AIPerf run — single benchmark, parameter grid, or Bayesian search — is the same pipeline with different cardinalities:
One pipeline, every scale — by design. A single benchmark, a local multi-run for confidence intervals, a grid or scenarios sweep, a Sobol or Latin-Hypercube characterization, a local Bayesian-Optimization adaptive search, and a cluster-side BO running across hundreds of pods are not seven different code paths. They are seven cardinalities of one pipeline:
BenchmarkPlan describes what to run, MultiRunOrchestrator decides when and in what order, a SearchPlanner (optional) decides what to try next, and a RunExecutor decides how to actually run one cell. Each piece owns exactly one concern and knows nothing about the others.
That separation is what makes the system extensible without churn. Want a new sweep shape? Add a discriminated-union variant to SweepConfig: expand_sweep does the rest, and every executor, exporter, and analyzer picks it up for free. Want a new planner (a different acquisition function, a 1D SLA-saturation algorithm, a multi-fidelity scheme)? Implement SearchPlanner.ask / tell and register it under the search_planner plugin category; the orchestrator and executors don’t change. Want to run the whole thing on Kubernetes? Implement RunExecutor.execute to create an AIPerfJob CR and HTTP-pull its results instead of forking a subprocess (this is the coming-soon K8sChildJobExecutor); the plan, orchestrator, planner, analyzer, and exporters are reused byte-for-byte. The progression from a single-shot aiperf profile to a cluster-distributed BO search isn’t a rewrite; it’s the same machinery at a different cardinality with a different executor at the bottom.
Today only LocalSubprocessExecutor ships: aiperf profile -f config.yaml runs the orchestrator in the same Python process and forks aiperf.orchestrator.subprocess_runner per cell.
Cluster execution (coming soon). K8sChildJobExecutor lives on the K8s integration branch (not main yet). It runs in-cluster in a sweep-controller pod (from an AIPerfSweep CR). Each cell becomes an AIPerfJob CR, watched to completion; the operator results server supplies the child export (same shape as local). Orchestrator logic is unchanged—only RunExecutor differs. CLI: aiperf kube sweep (alongside aiperf kube profile).
The whole flow uses about a dozen types. If you know these, you can read any sweep code.
The orchestrator forks a subprocess per cell at stage 6; aggregation is pure post-hoc compute over the collected RunResults. YAML configs reach AIPerfConfig directly through load_config → AIPerfConfig.model_validate; only CLI flags travel through CLIConfig first so cyclopts can parse magic-list affordances (--concurrency 1,2,4) before they’re lifted into a typed SweepConfig.
A “cell” is one (variation, trial) slot. Inside a cell, an ExecutionStrategy decides whether to keep going. FixedTrialsStrategy stops after M trials. AdaptiveStrategy (selected automatically when multi_run.convergence is set) keeps going until a ConvergenceCriterion is satisfied, capped by multi_run.num_runs. Around each executor.execute(run), the orchestrator threads cancel-checking, sweep-wide failure thresholds, and inter-run cooldowns. Two distinct cooldown fields are in play: multi_run.cooldown_seconds (between trials within a cell, returned by strategy.get_cooldown_seconds()) and sweep.cooldown_seconds (between variations, applied in the outer loop).
The strategy is fresh per cell in INDEPENDENT mode, so adaptive trial-convergence resets between variations. In REPEATED mode there’s only one trial per cell — the “outer trial loop” replays the whole grid.
Two ways to traverse the same N variations × M trials grid. sweep.iteration_order picks; default is REPEATED. The numbers below are the order in which cells execute (example: 3 variations, 3 trials).
REPEATED interleaves trials across variations so transient effects (warm caches, thermal drift) hit every variation similarly — better for cross-variation comparison. INDEPENDENT runs one variation to completion before moving on — required for convergence-based adaptive trials, since a strategy needs to observe all of one cell’s results in sequence. Cooldowns and per-cell strategy reuse follow from the nesting; see MultiRunOrchestrator.
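The two traversal orders can be made concrete with a small sketch. It is illustrative only: `IterationOrder` and `cell_order` are hypothetical stand-ins, not the orchestrator's real types, and the only behavior assumed is the loop nesting described above (trials outermost for REPEATED, variations outermost for INDEPENDENT).

```python
from enum import Enum
from typing import Iterator, Tuple


class IterationOrder(Enum):
    """Hypothetical stand-in for the sweep.iteration_order values."""

    REPEATED = "repeated"
    INDEPENDENT = "independent"


def cell_order(
    n_variations: int, n_trials: int, order: IterationOrder
) -> Iterator[Tuple[int, int]]:
    """Yield (variation_index, trial_index) pairs in execution order."""
    if order is IterationOrder.REPEATED:
        # Trials are the outer loop: each pass replays the whole grid.
        for trial in range(n_trials):
            for variation in range(n_variations):
                yield variation, trial
    else:
        # Variations are the outer loop: one variation runs to completion.
        for variation in range(n_variations):
            for trial in range(n_trials):
                yield variation, trial


repeated = list(cell_order(3, 3, IterationOrder.REPEATED))
independent = list(cell_order(3, 3, IterationOrder.INDEPENDENT))
print(repeated[:4])     # first four cells under REPEATED
print(independent[:4])  # first four cells under INDEPENDENT
```

Both orders visit the same nine cells; only the sequence differs, which is why an adaptive per-cell strategy needs INDEPENDENT to see one variation's trials back to back.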
The artifact tree branches on three flags: whether a sweep is configured
(is_sweep), whether multiple trials run per cell (trials > 1), and
the sweep iteration order (REPEATED vs INDEPENDENT). Implemented in
_resolve_artifact_dir in src/aiperf/orchestrator/orchestrator.py.
<dir_name> is the {leaf_param_name}_{value} form (e.g.
concurrency_10, request_rate_5.0); multi-dim sweep cells join
components with __ (e.g. concurrency_10__isl_512). Inner-dir
naming is asymmetric on purpose — the no-sweep multi-run case uses
run_NNNN, the sweep + INDEPENDENT case uses trial_NNNN. Downstream
consumers (plotters, dashboards) account for this asymmetry.
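The naming rules above can be sketched as follows. This is not the code in _resolve_artifact_dir; the helper names are hypothetical, and two details are assumptions: that a parameter's leaf name is the last dotted component of its path, and that NNNN means a zero-padded four-digit index.

```python
from typing import Mapping


def leaf_name(param_path: str) -> str:
    """Last dotted component of a parameter path (assumed path syntax)."""
    return param_path.rsplit(".", 1)[-1]


def cell_dir_name(values: Mapping[str, object]) -> str:
    """Build {leaf_param_name}_{value}, joining multi-dim cells with '__'."""
    return "__".join(
        f"{leaf_name(path)}_{value}" for path, value in values.items()
    )


def inner_dir_name(index: int, is_sweep: bool) -> str:
    """run_NNNN without a sweep, trial_NNNN for sweep + INDEPENDENT."""
    prefix = "trial" if is_sweep else "run"
    return f"{prefix}_{index:04d}"  # four-digit padding is an assumption


print(cell_dir_name({"concurrency": 10, "isl": 512}))
print(cell_dir_name({"request_rate": 5.0}))
print(inner_dir_name(1, is_sweep=False), inner_dir_name(1, is_sweep=True))
```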
The sweep-level aggregate path follows a parallel rule:
<base>/aggregate/sweep_aggregate/
<base>/sweep_aggregate/
Per-variation aggregates land at <base>/aggregate/<dir_name>/ in
REPEATED mode and <base>/<dir_name>/aggregate/ otherwise (INDEPENDENT
is the explicit default fallback in _per_variation_aggregate_dir; any
non-REPEATED mode takes the else branch).
Adaptive search is the same pipeline with one swap: instead of “expand a fixed grid into N configs up front,” the planner generates configs one at a time, learning from each result.
AdaptiveSearchSweep (type: adaptive_search) instead of GridSweep / ZipSweep / ScenarioSweep.
BenchmarkPlan.configs starts with one seed config; the planner extends it as it asks.
MultiRunOrchestrator dispatches to execute_adaptive_search, which runs planner.ask() -> execute trials -> planner.tell(results) until planner.ask() returns None (or cancellation / abort).
BayesianSearchPlanner (curated Optuna+BoTorch preset; auto-selects qLogNEI / qLogNEHVI based on objective count), MonotonicSLASearchPlanner (1D probe + bisection), SmoothIsotonicSLAPlanner (isotonic regression on bootstrap-resampled trials), OptunaSearchPlanner (TPE / GP / BoTorch samplers, expert-mode flag exposure). The BayesianSearchPlanner is implemented as a thin subclass of OptunaSearchPlanner that locks in the BoTorch sampler and the curated acquisition; it is not a separate engine.
search_recipe plugins build the whole AdaptiveSearchSweep from a higher-level recipe (e.g. max-concurrency-under-sla, prefill-ttft-curve, pareto-sweep).
post_process handler (degradation_knee_detect, ttft_curve_fit, itl_surface_fit, sla_breach_knee, pareto_sweep_export) runs after the final iteration.
Each iteration adds one SearchIteration to planner.history(). Convergence terminates the loop via planner.ask() returning None; the reason (plateau / improvement-patience / max-iterations) comes from planner.convergence_reason(). search_history.json is rewritten after every iteration so a crashed sweep still has a usable trail.
The cardinality of any sweep is N variations × M trials = N×M cells. Where N and M come from depends on the path.
For adaptive search, N is the iteration count: bounded above by max_iterations, possibly less if the planner converges early. M (trials per iteration) still applies — adaptive runs M trials per planner-proposed point, then tell()s the planner the aggregate.
For a fully-indexed file map covering every entry point, see Where to look in the code in Part 3.
A guided tour of the sweep / multi-run / adaptive-search flow, focused on the big picture and the names of the types that move data between stages. Read this when you want to know what happens when you press enter.
Every aiperf profile invocation walks the same seven stages. The shape of each
stage’s input and output is named — those names are the things to remember.
The pipeline doesn’t change shape between a single benchmark, a multi-run, a grid sweep, a scenarios sweep, or a Bayesian search. Only how many cells stage 5 produces and what decides each next cell changes:
Each cell is one BenchmarkRun -> one RunResult. The next section unpacks
the N / M dimensions in detail.
The sweep cardinality has two independent dimensions. Mixing them up is the single most common source of “wait, why did this run that many times?” surprise.
N comes from SweepConfig (the sweep block on AIPerfConfig):
the sweep.parameters cartesian product, the runs[] list, or the planner’s
proposals. Without a sweep block, N = 1.
M comes from MultiRunConfig (the multi-run block on AIPerfConfig):
When M > 1, SweepAnalyzer.compute automatically produces a confidence
block (mean / std / 95% CI) per metric per variation. When M = 1 you get
point estimates only.
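As a rough illustration of what a per-metric confidence block contains, the sketch below computes mean, sample standard deviation, and a 95% interval from M trial values. It is not SweepAnalyzer's code: the function name and output keys are hypothetical, and it assumes a normal approximation (z = 1.96). With only a handful of trials, a Student's t interval would be wider, and the real analyzer may well use one.

```python
import math
import statistics
from typing import Dict, Sequence


def confidence_block(samples: Sequence[float], z: float = 1.96) -> Dict[str, float]:
    """Mean, sample std, and an approximate 95% CI for one metric."""
    if len(samples) < 2:
        # M = 1 gives a point estimate only; no interval can be formed.
        raise ValueError("need at least two trials for a confidence interval")
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(len(samples))
    return {
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


# Three trials of the same variation (made-up throughput values).
block = confidence_block([100.0, 110.0, 120.0])
print(block)
```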
Trials are not iterations. For an adaptive search, --search-max-iterations
controls N (how many points the planner gets to try), and --num-profile-runs
controls M (how many times each proposed point is benchmarked before the
planner sees the aggregate). They multiply: a max_iterations=30, num_profile_runs=3
adaptive run executes up to 90 subprocess benchmarks.
Two entry points converge on the same typed envelope AIPerfConfig, but by
different paths. YAML skips CLIConfig entirely: load_config /
load_config_from_string parse the file and call AIPerfConfig.model_validate
directly. CLI flags are parsed by cyclopts into a CLIConfig (the
human-friendly, CLI-shaped surface), then convert_cli_to_aiperf lifts magic
flags into the typed envelope. From here on, AIPerfConfig is the single
source of truth.
Why the CLI -> envelope hop? CLIConfig is the human-friendly CLI shape — magic-lists like
--concurrency 1,2,4, --prefill-concurrency 1,2,4, or --request-rate 10,20,50 mean
“sweep that field over those values.” The converter lifts those affordances into a typed
sweep block on AIPerfConfig. After conversion, every flag has one canonical home in the
envelope. YAML configs don’t need this hop — they’re already written in envelope shape, so
load_config constructs AIPerfConfig directly via model_validate and skips CLIConfig
entirely.
SweepConfig is a discriminated union
Pydantic discriminates by a type field on the YAML / dump (each variant sets a
default for type, so YAML authors do not need to write it explicitly). The orchestrator
never inspects the variant directly — it reads BenchmarkPlan.is_adaptive_search,
which is true exactly when the variant is AdaptiveSearchSweep.
BenchmarkPlan
AIPerfConfig describes intent; BenchmarkPlan lists the actual cells the
orchestrator will run. The plan-builder either short-circuits to a single seed
variation (for adaptive runs and for no-sweep runs) or calls expand_sweep
(cartesian product for grid, lockstep zip for zip, deep-merge for scenarios —
also Sobol / Latin-hypercube for QMC sweeps), then renders any per-variation
Jinja and emits one BenchmarkConfig per variation.
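The three classic expansion shapes named above (cartesian product, lockstep zip, deep-merge) can be sketched with the standard library. These helpers are hypothetical and are not the real expand_sweep; in particular, the rule that zip sweeps require equal-length lists is an assumption, and QMC sampling and Jinja rendering are left out.

```python
import copy
import itertools
from typing import Any, Dict, List, Mapping, Sequence


def expand_grid(parameters: Mapping[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Cartesian product: one variation per combination of values."""
    names = list(parameters)
    combos = itertools.product(*(parameters[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def expand_zip(parameters: Mapping[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Lockstep zip: the i-th variation takes the i-th value of every list."""
    names = list(parameters)
    if len({len(parameters[name]) for name in names}) > 1:
        raise ValueError("zip sweep needs equal-length value lists")
    return [dict(zip(names, combo)) for combo in zip(*(parameters[n] for n in names))]


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Scenario-style merge: nested mappings merge, other values replace."""
    merged = copy.deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


grid = expand_grid({"concurrency": [1, 2, 4], "isl": [128, 512]})
zipped = expand_zip({"concurrency": [1, 2], "isl": [128, 512]})
scenario = deep_merge({"load": {"concurrency": 1, "rate": 10}}, {"load": {"concurrency": 8}})
print(len(grid), zipped, scenario)
```

A grid over 3 and 2 values yields N = 6 variations, while the zip over two 2-element lists yields N = 2.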
A few useful invariants:
SweepVariation: {index, label, values}. One per variation. values is the dict of swept parameters that differ from the base config; the label is built from those for artifact directory names.
trials = M comes from MultiRunConfig.num_runs (default 1, max 10). It’s the per-cell repeat count for confidence aggregation, not the total run count.
In adaptive search, configs starts with one seed and grows as the planner asks. The plan-builder doesn’t know the final length up front.
plan.is_adaptive_search is the orchestrator’s only branch on the sweep variant; every other piece of code is variant-agnostic.
MultiRunOrchestrator.execute(plan, executor, search_planner=...) is the single
entry point. It dispatches on plan.is_adaptive_search:
For an N×M grid (N variations, M trials), there are two ways to interleave the
work. Both produce the same N×M cells; they differ only in which loop is outer.
iteration_order is a field on the grid family of sweeps (GridSweep, ZipSweep,
ScenarioSweep); AdaptiveSearchSweep does not expose this knob.
REPEATED is the default. It interleaves so transient effects (warm caches,
thermal drift) hit every variation similarly — better for cross-variation
comparison. INDEPENDENT runs one variation to completion before moving on;
required when each variation needs its own ExecutionStrategy to observe a full
cell’s worth of results before deciding to stop (the adaptive trial-convergence
case).
ask / tell loop
When plan.is_adaptive_search is true, execute_adaptive_search runs a tighter
loop driven by a SearchPlanner:
A “cell” is one (variation, trial) slot. Every cell runs a small state machine
driven by an ExecutionStrategy:
Three collaborators inside the cell — two ABCs and one Pydantic model:
FixedTrialsStrategy runs exactly M trials. AdaptiveStrategy runs until a
ConvergenceConfig says enough — capped by multi_run.num_runs so it can’t run
forever.
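The difference between the two strategies can be sketched as a single "keep going?" decision. The classes below reuse the names from the text but are simplified stand-ins: the method name should_continue, the cv_threshold and min_runs parameters, and the use of coefficient of variation as the convergence test are all assumptions made for illustration.

```python
import statistics
from typing import Iterable, List


class FixedTrialsStrategy:
    """Run exactly num_runs trials."""

    def __init__(self, num_runs: int) -> None:
        self.num_runs = num_runs

    def should_continue(self, values: List[float]) -> bool:
        return len(values) < self.num_runs


class AdaptiveStrategy:
    """Run until the results look stable, capped by num_runs."""

    def __init__(self, num_runs: int, cv_threshold: float, min_runs: int = 2) -> None:
        self.num_runs = num_runs
        self.cv_threshold = cv_threshold
        self.min_runs = min_runs

    def should_continue(self, values: List[float]) -> bool:
        if len(values) >= self.num_runs:
            return False  # hard cap so the cell cannot run forever
        if len(values) < self.min_runs:
            return True  # not enough data to judge stability yet
        mean = statistics.fmean(values)
        if mean == 0:
            return True
        cv = statistics.stdev(values) / abs(mean)
        return cv > self.cv_threshold  # still noisy: run another trial


def run_cell(strategy, measurements: Iterable[float]) -> List[float]:
    """Feed measurements to a strategy until it asks to stop."""
    seen: List[float] = []
    for value in measurements:
        if not strategy.should_continue(seen):
            break
        seen.append(value)
    return seen


fixed = run_cell(FixedTrialsStrategy(3), [1.0, 2.0, 3.0, 4.0, 5.0])
stable = run_cell(AdaptiveStrategy(10, cv_threshold=0.05), [100.0, 101.0, 100.0, 99.0])
noisy = run_cell(AdaptiveStrategy(4, cv_threshold=0.05), [100.0, 150.0, 60.0, 140.0, 90.0])
print(len(fixed), len(stable), len(noisy))
```

The stable series stops after two trials, while the noisy one never converges and is cut off by the num_runs cap.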
After the orchestrator returns list[RunResult], the CLI runner groups by
RunResult.variation_values, builds a per_combination_stats dict, and hands
it to SweepAnalyzer.compute(per_combination_stats, sweep_parameters, sla_filters=…),
which computes summary stats per group, identifies the Pareto frontier, and
returns the aggregate dict the JSON / CSV exporters write.
The aggregate JSON has three result blocks plus a metadata block:
metadata: num_combinations, swept parameter list, and (when set) sla_constraints. Downstream consumers key off this block.
per_combination_metrics: one entry per unique variation_values, with swept parameters and a metric block (mean / p99 / etc.) for every metric.
best_configurations: fixed post-hoc picks for highest throughput and lowest latency from the aggregate summary. These are not the adaptive search’s configured objectives.
pareto_optimal: fixed post-hoc throughput/latency frontier computed via _dominates. Adaptive configured objectives are reported in search_history.json["best_trials"].
Orthogonality note. best_configurations and pareto_optimal here are emitted by SweepAnalyzer, computed across the whole RunResult set, and live under sweep_aggregate/profile_export_aiperf_sweep.json. They are distinct from search_history.json["best_trials"], which is what the BO planner converged on (see Search History API). For a single-objective adaptive run with no failed iterations the two usually agree on the winner; they can disagree when iterations failed, when feasibility differs (search-history is feasibility-first lex over sla_filters, the analyzer ranks the full set), or when the analyzer’s Pareto computation includes objectives the planner wasn’t optimizing.
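A throughput/latency Pareto frontier of the kind described for pareto_optimal can be sketched in a few lines. This is the textbook dominance test, not the analyzer's _dominates; the dict keys and the function names are hypothetical, and the numbers are made up.

```python
from typing import Dict, List

Point = Dict[str, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one.

    Throughput is maximized, latency is minimized.
    """
    no_worse = a["throughput"] >= b["throughput"] and a["latency"] <= b["latency"]
    strictly_better = a["throughput"] > b["throughput"] or a["latency"] < b["latency"]
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


candidates = [
    {"concurrency": 1, "throughput": 100.0, "latency": 50.0},
    {"concurrency": 8, "throughput": 200.0, "latency": 80.0},
    {"concurrency": 16, "throughput": 150.0, "latency": 90.0},  # dominated by 8
    {"concurrency": 2, "throughput": 90.0, "latency": 60.0},    # dominated by 1
]
front = pareto_front(candidates)
print([p["concurrency"] for p in front])
```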
If the active sweep came from a search recipe with a PostProcessSpec, that
handler runs after the analyzer and emits its own JSON file (e.g.
degradation_knee.json for concurrency-ramp, pareto_sweep.json for
pareto-sweep).
A search recipe is a named preset that bundles “search space + objective +
termination + SLA filters + optional post-process” into one CLI selector
(--search-recipe <name>). It runs before stage 3 and emits the typed sweep
config the rest of the pipeline expects.
SearchRecipeContext is the recipe’s read-only view of user intent — built
BenchmarkConfig, declared SLA targets (--ttft-sla-ms, etc.), and any
sweep-knob overrides (--concurrency-min, --isl-osl-pairs, etc.).
SearchRecipeOutput carries exactly one of adaptive_search,
sweep_parameters, or scenarios (validated mutually exclusive), plus optional
sla_filters, per-request slos, and a post_process spec.
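The exactly-one-of rule on the recipe output can be illustrated with a plain dataclass. The real SearchRecipeOutput is not reproduced here: RecipeOutputSketch, its field types, and the branch property are hypothetical, and only the mutual-exclusion check mirrors the behavior described above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RecipeOutputSketch:
    """Simplified stand-in: exactly one of the three branches must be set."""

    adaptive_search: Optional[Dict[str, Any]] = None
    sweep_parameters: Optional[Dict[str, List[Any]]] = None
    scenarios: Optional[List[Dict[str, Any]]] = None
    sla_filters: Optional[List[Dict[str, Any]]] = None
    post_process: Optional[str] = None

    def __post_init__(self) -> None:
        branches = (self.adaptive_search, self.sweep_parameters, self.scenarios)
        set_count = sum(branch is not None for branch in branches)
        if set_count != 1:
            raise ValueError(
                "exactly one of adaptive_search, sweep_parameters, scenarios "
                f"must be set (got {set_count})"
            )

    @property
    def branch(self) -> str:
        """Name of the single populated branch."""
        if self.adaptive_search is not None:
            return "adaptive_search"
        if self.sweep_parameters is not None:
            return "sweep_parameters"
        return "scenarios"


ramp = RecipeOutputSketch(sweep_parameters={"concurrency": [1, 2, 4, 8]})
print(ramp.branch)
```

Because the check runs at construction time, downstream code can branch on the populated field without re-validating.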
After expansion, downstream stages don’t know a recipe ever existed — they just
see a normal AIPerfConfig.sweep with optional sla_filters attached.
One diagram from key-press to artifact:
If you remember nothing else from this doc, remember these eleven names — every other class in the sweep code is glue or helper.
For adaptive runs, three more:
End-to-end view of how a YAML config or CLI invocation becomes an AIPerfConfig envelope, expands into a BenchmarkPlan, and is executed by MultiRunOrchestrator against a backend RunExecutor. Multiple zoom levels — pick whichever matches what you’re trying to understand.
The same BenchmarkPlan / MultiRunOrchestrator / RunExecutor machinery handles single-run, grid sweep, zip sweep, scenario sweep, and adaptive search. Dispatch differs only inside MultiRunOrchestrator.execute.
cli_runner.run_benchmark peels off single-run plans (plan.is_single_run) before the orchestrator is constructed; only multi-run plans reach MultiRunOrchestrator.execute. Inside execute(), dispatch is two-way: adaptive-search vs. grid/scenarios. Grid/scenarios further branch on _plan_iteration_order(plan) which reads plan.sweep.iteration_order (REPEATED default, or INDEPENDENT).
(The artifact-tree layout table is documented above in Part 1 — Artifact directory layout reference.)
RunExecutor is a 2-method ABC: execute(run) -> RunResult and derive_id(plan, var_idx, trial) -> str. The local executor derives a stable id from the plan/variation/trial tuple for artifact naming; the cluster executor (coming soon — finalized on the K8s integration branch, not yet on main) derives a deterministic K8s-name-safe id from (plan, var_idx, trial) so child AIPerfJob creation is idempotent.
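To show why a deterministic id makes child-job creation idempotent, here is one way such an id could be derived. It is a sketch under stated assumptions, not the cluster executor's scheme: it takes a plan name string rather than a plan object, uses a SHA-256 prefix for uniqueness, and targets the 63-character lowercase-alphanumeric-and-hyphen form of Kubernetes DNS labels.

```python
import hashlib
import re


def derive_id(plan_name: str, var_idx: int, trial: int, max_len: int = 63) -> str:
    """Deterministic, K8s-name-safe id for one (plan, variation, trial) cell."""
    # Same inputs always hash to the same digest, so retries reuse the name.
    key = f"{plan_name}/{var_idx}/{trial}".encode()
    digest = hashlib.sha256(key).hexdigest()[:10]
    # Lowercase and collapse anything outside [a-z0-9-] into single hyphens.
    slug = re.sub(r"[^a-z0-9-]+", "-", plan_name.lower()).strip("-") or "plan"
    suffix = f"-v{var_idx}-t{trial}-{digest}"
    # Truncate the readable part, never the suffix that carries uniqueness.
    return slug[: max_len - len(suffix)].rstrip("-") + suffix


first = derive_id("My Sweep_01", 2, 0)
again = derive_id("My Sweep_01", 2, 0)
other = derive_id("My Sweep_01", 2, 1)
long_id = derive_id("x" * 200, 2, 0)
print(first, len(long_id))
```

Calling the function twice with the same tuple yields the same name, so a retried create request targets the existing object instead of spawning a duplicate.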
The RunResult shape returned by both backends is identical — the cluster path fetches the same profile_export_aiperf.json schema over HTTP that the local path reads off disk. Downstream SweepAnalyzer.compute(), aggregate_and_export(), and the search_history.json writer don’t know which backend produced the inputs.
The orchestrator layer’s extension points are abstract base classes; implementations are registered as plugins or instantiated directly by category-aware factories.
How the types from the class diagram actually flow through a sweep run. Read it as: each box is an instance of a class from the class diagram; arrows show what produces what; cardinality annotations make the fan-out explicit (1 plan -> N variations × M trials -> N×M results -> 1 aggregate).
The two views together: the flowchart shows cardinality and which class produces which (the data shape of a sweep); the sequence shows the temporal call pattern between the same classes. Both use only the types from the class diagram — no module-internal helpers.
The adaptive search path layers atop the same BenchmarkPlan / MultiRunOrchestrator / RunExecutor core. Adaptive config is not a separate field — it’s the AdaptiveSearchSweep variant of the SweepConfig discriminated union (type: adaptive_search). Two plugin categories cooperate: a search_planner (drives the outer loop) and an optional search_recipe (curates the search space / objective / post-process from a higher-level recipe template). The optional terminal post_process is a single PostProcessSpec resolved via search_recipe_post_process plugins.
Built-in search_recipe plugins (src/aiperf/search_recipes/):
max-throughput-ttft-sla, max-throughput-itl-sla
concurrency-ramp
prefill-ttft-curve, decode-itl-curve
max-goodput-under-slo, max-concurrency-under-sla
pareto-sweep
Recipes choose one of three output branches: adaptive_search (BO-style), sweep_parameters (grid-style, e.g. concurrency-ramp, prefill-ttft-curve, decode-itl-curve), or scenarios (deep-merge variants, e.g. pareto-sweep). The SearchRecipeOutput validator enforces exactly-one-of, so downstream code can branch cleanly.
Built-in search_recipe_post_process plugins: degradation_knee_detect, ttft_curve_fit, itl_surface_fit, sla_breach_knee, pareto_sweep_export.
The BO outer loop is a propose -> execute -> record cycle inside MultiRunOrchestrator.execute_adaptive_search. BenchmarkRun and RunExecutor are the same as in the grid path; the difference is that BenchmarkPlan.configs starts with one seed config and grows by one per iteration as the planner asks for the next point.
A user can either author an AdaptiveSearchSweep directly under sweep: (low level) or pick a search_recipe plugin (high level) that builds one from a recipe + the user’s existing benchmark config. The adaptive block lives entirely on sweep; there is no separate adaptive-search field on MultiRunConfig.
How AIPerf’s Bayesian-Optimization outer loop is wired together: the protocols, the runtime sequence, and the config-to-execution flow.
The planner and the orchestrator talk through narrow protocols. MultiRunOrchestrator doesn’t know about Bayesian Optimization — it only knows SearchPlanner and RunExecutor. The Optuna+BoTorch dependency is hidden inside OptunaSearchPlanner and its BayesianSearchPlanner curated-preset subclass; MonotonicSLASearchPlanner and SmoothIsotonicSLAPlanner are 1D-feasibility-search planners that plug in at the same SearchPlanner ABC. Future planners (random-search baseline, MORBO, etc.) plug in identically.
Planner modules below are relative to aiperf.orchestrator.search_planner. (e.g. bayesian.py -> aiperf.orchestrator.search_planner.bayesian).
All four are registered in src/aiperf/plugin/plugins.yaml under the search_planner: category and resolved via plugins.get_class(PluginType.SEARCH_PLANNER, name).
The CLI grammar lives in aiperf.orchestrator.search_planner.parsing.parse_search_space(values), which converts --search-space "path:lo,hi[:kind]" strings into SearchSpaceDimension instances. The v1->v2 converter (build_multi_run in aiperf.config.flags._converter_optionals) packages everything into a typed AdaptiveSearchSweep carried on AIPerfConfig.sweep.
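The "path:lo,hi[:kind]" grammar can be illustrated with a small parser. This is a hypothetical re-implementation for reading purposes, not parse_search_space itself: DimensionSketch stands in for SearchSpaceDimension, kind is kept as an opaque optional string because the accepted kinds are not listed here, and the lo < hi check is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class DimensionSketch:
    """Simplified stand-in for one search-space dimension."""

    path: str
    lo: float
    hi: float
    kind: Optional[str] = None


def parse_dimension(spec: str) -> DimensionSketch:
    """Parse one 'path:lo,hi[:kind]' string."""
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 'path:lo,hi[:kind]', got {spec!r}")
    path, bounds = parts[0], parts[1]
    kind = parts[2] if len(parts) == 3 else None
    lo_text, comma, hi_text = bounds.partition(",")
    if not path or not comma:
        raise ValueError(f"expected 'path:lo,hi[:kind]', got {spec!r}")
    lo, hi = float(lo_text), float(hi_text)  # raises ValueError on bad numbers
    if lo >= hi:
        raise ValueError(f"lower bound must be below upper bound in {spec!r}")
    return DimensionSketch(path=path, lo=lo, hi=hi, kind=kind)


def parse_all(values: Sequence[str]) -> List[DimensionSketch]:
    """Parse every --search-space value in order."""
    return [parse_dimension(value) for value in values]


# "concurrency" and "int" are illustrative inputs only.
dims = parse_all(["concurrency:1,64:int", "request_rate:0.5,20"])
print(dims)
```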
MultiRunOrchestrator.execute_adaptive_search is a thin loop. Every iteration: ask the planner for a (BenchmarkConfig, SweepVariation); run all configured trials at that point via the same _run_independent_cell grid sweeps use; tell the planner what happened; write search_history.json incrementally. When ask() returns None, surface the planner’s convergence_reason() and exit.
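The loop described above can be sketched with a toy planner. Nothing here is AIPerf code: DoublingPlanner is an invented planner that doubles concurrency until the objective stops improving, run_trial is a fake benchmark, and the mean is used as the per-iteration aggregate. The ask / tell / history / convergence_reason method names follow the text; their signatures are assumptions.

```python
import statistics
from typing import Callable, Dict, List, Optional


class DoublingPlanner:
    """Toy planner: doubles concurrency until throughput stops improving."""

    def __init__(self, start: int, max_iterations: int) -> None:
        self._next = start
        self._max_iterations = max_iterations
        self._history: List[Dict[str, float]] = []
        self._reason: Optional[str] = None

    def ask(self) -> Optional[Dict[str, int]]:
        if self._reason is None and len(self._history) >= self._max_iterations:
            self._reason = "max-iterations"
        if self._reason is not None:
            return None  # converged: the loop should exit
        return {"concurrency": self._next}

    def tell(self, point: Dict[str, int], objective: float) -> None:
        if self._history and objective <= self._history[-1]["objective"]:
            self._reason = "plateau"
        self._history.append({"concurrency": point["concurrency"], "objective": objective})
        self._next = point["concurrency"] * 2

    def history(self) -> List[Dict[str, float]]:
        return list(self._history)

    def convergence_reason(self) -> Optional[str]:
        return self._reason


def run_adaptive_search(
    planner: DoublingPlanner, run_trial: Callable[[int], float], trials_per_point: int
) -> Optional[str]:
    """ask -> run M trials -> tell, until ask() returns None."""
    while True:
        point = planner.ask()
        if point is None:
            return planner.convergence_reason()
        results = [run_trial(point["concurrency"]) for _ in range(trials_per_point)]
        planner.tell(point, statistics.fmean(results))
        # The real loop rewrites search_history.json at this point.


planner = DoublingPlanner(start=1, max_iterations=10)
# Fake benchmark: throughput saturates once concurrency reaches 8.
reason = run_adaptive_search(planner, lambda c: min(c, 8) * 10.0, trials_per_point=3)
print(reason, [int(h["concurrency"]) for h in planner.history()])
```

The loop itself never inspects how the planner chooses points, which is the property that lets different planners share one orchestrator.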
A few things this view makes explicit:
With objective_pooling=mean the planner tells Optuna the mean of finite per-trial objective values; with pooled percentile mode it tells Optuna the pooled percentile objective computed from the raw record samples. Per-trial RunResult objects remain on SearchIteration.results in memory for search-history derivation, but separate per-trial Optuna observations are not recorded in v1.
When an iteration produces no usable objective, the planner still calls study.tell(trial, fallback_objective) so Optuna’s ask/tell pairing stays consistent. The fallback is a strictly worse-than-prior sentinel used only inside the study; search_history.json persists objective_values: null for that iteration.
is_converged() checks max_iterations, then improvement-over-best patience, then coefficient-of-variation plateau. The first to fire wins; the reason is recorded in search_history.json.
The CLI feeds a CLIConfig through src/aiperf/config/flags/converter.py, which packages the search-space + objective + filters into a typed AdaptiveSearchSweep carried on AIPerfConfig.sweep. From there the plan builder produces a BenchmarkPlan, and MultiRunOrchestrator.execute dispatches on plan.is_adaptive_search to the BO loop.
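The convergence order noted above (max_iterations first, then improvement patience, then coefficient-of-variation plateau) can be sketched as one function that returns the first reason to fire. The parameter names and the exact definitions of "patience" and "plateau" are assumptions for illustration; only the check order and the reason strings come from the text, and a higher-is-better objective is assumed.

```python
import statistics
from typing import List, Optional


def convergence_reason(
    objectives: List[float],
    max_iterations: int,
    patience: int,
    min_improvement: float,
    cv_window: int,
    cv_threshold: float,
) -> Optional[str]:
    """Return the first stopping reason that fires, or None to keep going."""
    # 1. Hard budget.
    if len(objectives) >= max_iterations:
        return "max-iterations"
    # 2. Patience: the last `patience` iterations did not beat the earlier best.
    if len(objectives) > patience:
        best_before = max(objectives[:-patience])
        best_recent = max(objectives[-patience:])
        if best_recent - best_before < min_improvement:
            return "improvement-patience"
    # 3. Plateau: recent objectives vary very little relative to their mean.
    if cv_window >= 2 and len(objectives) >= cv_window:
        window = objectives[-cv_window:]
        mean = statistics.fmean(window)
        if mean != 0 and statistics.stdev(window) / abs(mean) < cv_threshold:
            return "plateau"
    return None


print(convergence_reason([1.0, 2.0, 3.0], 3, 2, 1.0, 3, 0.01))
print(convergence_reason([10.0, 20.0, 20.1, 20.05], 30, 2, 1.0, 3, 0.01))
print(convergence_reason([10.0, 10.5, 11.6], 30, 1, 1.0, 3, 0.1))
print(convergence_reason([10.0, 20.0, 40.0], 30, 1, 1.0, 3, 0.1))
```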
To add a planner, subclass SearchPlanner and implement the four abstract methods (ask/tell/is_converged/history); optionally override convergence_reason (default returns None) and boundary_summary (1D feasibility planners only). No orchestrator changes required: MonotonicSLASearchPlanner, SmoothIsotonicSLAPlanner, and OptunaSearchPlanner are existing examples reusing the same ABC. Wiring is already generic: AdaptiveSearchSweep.planner is a SearchPlannerType (ExtensibleStrEnum in src/aiperf/plugin/enums.pyi), and cli_runner._run_multi_benchmark instantiates the planner via plugins.get_class(PluginType.SEARCH_PLANNER, sweep.planner). To register a new backend, add an entry under search_planner: in src/aiperf/plugin/plugins.yaml and a matching enum value; no dispatch code changes.
RunExecutor: LocalSubprocessExecutor iterates one (variation, trial) at a time via execute(BenchmarkRun), so the seam is adaptive-shaped by construction.
The OptunaSearchPlanner boundary (and its BayesianSearchPlanner curated-preset subclass) is the only Optuna-aware code in the project. BoTorch-specific acquisitions live behind the optional botorch extra in pyproject.toml. The qlognei / qlognehvi acquisitions, posterior-regret stopping (--optuna-terminator regret/emmr), pooled-percentile aggregation (--search-percentile-pooling pooled), and the Hvarfner-DSP Matern-5/2 kernel (arXiv:2402.02229) are all plumbed through _optuna_helpers.py. The remaining principled upgrade path is wiring per-iteration heteroscedastic noise estimates from the pooled-percentile JSONL helper into a HeteroskedasticSingleTaskGP-based custom candidates_func. Evidence-gated: ship only if observed within-trial variance varies meaningfully across the search space on real workloads.
smooth_isotonic as novel-in-composition
The SmoothIsotonicSLAPlanner algorithm (PAVA monotonic regression as denoiser -> PCHIP cubic Hermite interpolant -> root-find for SLA-threshold crossing, plus a PAVA-residual changepoint detector for the cliff-guard exit) does not appear in published BO literature in this exact composition. The components are textbook (PAVA: pool-adjacent-violators; PCHIP: shape-preserving piecewise-cubic Hermite interpolation; bracketed root-find: classical numerical analysis), but their composition for SLA-saturation in noisy GPU-serving benchmarking is original. Adjacent prior art:
MonotonicSLASearchPlanner reproduces DistServe’s algorithm; SmoothIsotonicSLAPlanner is a strict improvement (denoised + continuous-space root-find).
qNEHVI on BoTorch with ModelListGP. Different problem (serving-system optimization rather than benchmark-side adaptive sweep), same machinery family.
smooth_isotonic is defensibly novel in the systems-benchmarking literature even though every individual piece is classical statistics; worth a section in a future technical report.