Sweep + Orchestrator Developer Reference


Developer reference for AIPerf’s sweep + adaptive-search machinery. Three zoom levels below: mental model -> seven-stage tour -> class/module map.


Part 1 — Mental model

The big idea

Every AIPerf run — single benchmark, parameter grid, or Bayesian search — is the same pipeline with different cardinalities:

One pipeline, every scale — by design. A single benchmark, a local multi-run for confidence intervals, a grid or scenarios sweep, a Sobol or Latin-Hypercube characterization, a local Bayesian-Optimization adaptive search, and a cluster-side BO running across hundreds of pods are not separate code paths. They are different cardinalities of one pipeline: BenchmarkPlan describes what to run, MultiRunOrchestrator decides when and in what order, a SearchPlanner (optional) decides what to try next, and a RunExecutor decides how to actually run one cell. Each piece owns exactly one concern and knows nothing about the others.

That separation is what makes the system extensible without churn. Want a new sweep shape? Add a discriminated-union variant to SweepConfig; expand_sweep does the rest, and every executor, exporter, and analyzer picks it up for free. Want a new planner (a different acquisition function, a 1D SLA-saturation algorithm, a multi-fidelity scheme)? Implement SearchPlanner.ask/tell and register it under the search_planner plugin category — the orchestrator and executors don’t change. Want to run the whole thing on Kubernetes? Implement RunExecutor.execute to create an AIPerfJob CR and HTTP-pull its results instead of forking a subprocess (this is the coming-soon K8sChildJobExecutor) — the plan, orchestrator, planner, analyzer, and exporters are reused byte-for-byte. The progression from a single-shot aiperf profile to a cluster-distributed BO search isn’t a rewrite; it’s the same machinery at a different cardinality with a different executor at the bottom.

Execution

Today only LocalSubprocessExecutor ships: aiperf profile -f config.yaml runs the orchestrator in the same Python process and forks aiperf.orchestrator.subprocess_runner per cell.

Cluster execution (coming soon). K8sChildJobExecutor lives on the K8s integration branch (not main yet). It runs in-cluster in a sweep-controller pod (from an AIPerfSweep CR). Each cell becomes an AIPerfJob CR, watched to completion; the operator results server supplies the child export (same shape as local). Orchestrator logic is unchanged—only RunExecutor differs. CLI: aiperf kube sweep (alongside aiperf kube profile).

Key types

The whole flow uses about a dozen types. If you know these, you can read any sweep code.

| Type | Role |
| --- | --- |
| AIPerfConfig | Top-level envelope. Holds a BenchmarkConfig body plus envelope-level knobs: sweep, multi_run, variables, random_seed. |
| BenchmarkConfig | The actual benchmark settings (models, endpoint, datasets, phases, artifacts, …). The unit of “what to benchmark.” |
| SweepConfig | Discriminated union: GridSweep (YAML: type: grid, cartesian over parameters), ZipSweep (type: zip, lockstep / element-wise over parameters, all lists equal length), ScenarioSweep (type: scenarios, deep-merge runs[i]), or AdaptiveSearchSweep (type: adaptive_search, BO / monotonic). |
| SobolSweep / LatinHypercubeSweep | Fixed-budget space-filling samplers. N = samples; each variation drawn from scipy.stats.qmc. Reuses the grid-style iteration_order / cooldown / SLA-filter mechanics. |
| MultiRunConfig | Trial mechanics: num_runs (= trials per variation), cooldown, optional convergence: ConvergenceConfig. |
| SweepVariation | {index, label, values}. One per variation; carries the parameter values that differ from base. Also exposes dir_name: the {leaf}_{value} form (e.g. concurrency_10) used as the per-variation directory name. |
| BenchmarkPlan | The “expanded” form: configs[N], variations[N], trials=M, plus the originating sweep + multi_run. Output of build_benchmark_plan. |
| BenchmarkRun | One cell: (cfg, variation, trial, artifact_dir). The smallest unit of work. |
| RunResult | {success, summary_metrics, artifacts_path, variation_label, variation_values, trial_index, error}. One per BenchmarkRun. |
| MultiRunOrchestrator | Drives the N×M loop. Picks REPEATED (trials outer) or INDEPENDENT (variations outer) based on sweep.iteration_order; dispatches to execute_adaptive_search if the sweep is adaptive. |
| RunExecutor | ABC with execute(run) -> RunResult plus a second abstract derive_id(plan, var_idx, trial) -> str for stable per-cell identifiers. LocalSubprocessExecutor is the only shipping implementation today; K8sChildJobExecutor (one child AIPerfJob CR per call) is finalized but unmerged — see Execution above. |
| SweepAnalyzer | Post-hoc aggregator. CLI helpers group list[RunResult] by variation_values into per_combination_stats; SweepAnalyzer.compute() then produces best_configurations, pareto_optimal, per_combination_metrics. Written to sweep_aggregate/profile_export_aiperf_sweep.{json,csv}. |
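A minimal illustrative envelope showing how these types nest in YAML. The sweep / multi_run keys and type: grid follow the table above; the benchmark-body keys (model, endpoint) are hypothetical placeholders, not the real field spellings.

```yaml
# Hypothetical AIPerfConfig envelope — body keys are illustrative only.
benchmark:              # BenchmarkConfig body ("what to benchmark")
  model: my-model
  endpoint: http://localhost:8000/v1
sweep:                  # SweepConfig, GridSweep variant
  type: grid
  parameters:
    concurrency: [1, 2, 4]
multi_run:              # MultiRunConfig
  num_runs: 3           # M trials per variation -> 3 variations x 3 trials = 9 cells
  cooldown_seconds: 5
```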

End-to-end pipeline (canonical)

The orchestrator forks a subprocess per cell at stage 6; aggregation is pure post-hoc compute over the collected RunResults. YAML configs reach AIPerfConfig directly through load_config -> AIPerfConfig.model_validate; only CLI flags travel through CLIConfig first so cyclopts can parse magic-list affordances (--concurrency 1,2,4) before they’re lifted into a typed SweepConfig.

What happens between runs (per-cell loop)

A “cell” is one (variation, trial) slot. Inside a cell, an ExecutionStrategy decides whether to keep going. FixedTrialsStrategy stops after M trials. AdaptiveStrategy (selected automatically when multi_run.convergence is set) keeps going until a ConvergenceCriterion is satisfied, capped by multi_run.num_runs. Around each executor.execute(run), the orchestrator threads cancel-checking, sweep-wide failure thresholds, and inter-run cooldowns. Two distinct cooldown fields are in play: multi_run.cooldown_seconds (between trials within a cell, returned by strategy.get_cooldown_seconds()) and sweep.cooldown_seconds (between variations, applied in the outer loop).

The strategy is fresh per cell in INDEPENDENT mode, so adaptive trial-convergence resets between variations. In REPEATED mode there’s only one trial per cell — the “outer trial loop” replays the whole grid.
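The per-cell loop described above can be sketched as follows. This is an illustrative reduction, not the real orchestrator: the method names should_run_trial and the run_cell helper are assumptions; only the shape (strategy-gated trials with an inter-trial cooldown, mirroring multi_run.cooldown_seconds) comes from the text.

```python
# Illustrative per-cell loop — names like should_run_trial / run_cell are
# assumptions; the real orchestrator also threads cancel-checking and
# sweep-wide failure thresholds around each execute().
import time

class FixedTrialsStrategy:
    """Stop after a fixed number of trials (the M = num_runs case)."""
    def __init__(self, num_trials, cooldown_seconds=0.0):
        self.num_trials = num_trials
        self.cooldown_seconds = cooldown_seconds
        self.completed = 0

    def should_run_trial(self):
        return self.completed < self.num_trials

    def get_cooldown_seconds(self):
        return self.cooldown_seconds

def run_cell(strategy, execute, cancelled=lambda: False):
    """Run one (variation) cell: keep executing trials until the strategy stops."""
    results = []
    while strategy.should_run_trial() and not cancelled():
        results.append(execute(trial=strategy.completed))
        strategy.completed += 1
        # Cooldown only between trials, not after the last one.
        if strategy.should_run_trial() and strategy.get_cooldown_seconds():
            time.sleep(strategy.get_cooldown_seconds())
    return results
```

An AdaptiveStrategy would keep the same loop but consult a convergence criterion (capped by num_runs) inside should_run_trial.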

REPEATED vs INDEPENDENT — loop nesting

Two ways to traverse the same N variations × M trials grid. sweep.iteration_order picks; default is REPEATED. The numbers below are the order in which cells execute (example: 3 variations, 3 trials).

REPEATED interleaves trials across variations so transient effects (warm caches, thermal drift) hit every variation similarly — better for cross-variation comparison. INDEPENDENT runs one variation to completion before moving on — required for convergence-based adaptive trials, since a strategy needs to observe all of one cell’s results in sequence. Cooldowns and per-cell strategy reuse follow from the nesting; see MultiRunOrchestrator.
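The two traversal orders can be made concrete with a small sketch. The cell_order helper is hypothetical; it only demonstrates which loop is outer in each mode.

```python
# Illustrative only: enumerate (variation, trial) cells in execution order
# for the two iteration_order modes described above.
def cell_order(n_variations, n_trials, order="REPEATED"):
    if order == "REPEATED":
        # Trials outer: the whole grid is replayed once per trial index.
        return [(v, t) for t in range(n_trials) for v in range(n_variations)]
    if order == "INDEPENDENT":
        # Variations outer: each cell runs to completion before the next.
        return [(v, t) for v in range(n_variations) for t in range(n_trials)]
    raise ValueError(f"unknown iteration order: {order}")
```

Both orders cover the same N×M cells; only the interleaving differs, which is why cooldown placement and per-cell strategy lifetime follow from the nesting.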

Artifact directory layout reference

The artifact tree branches on three flags: whether a sweep is configured (is_sweep), whether multiple trials run per cell (trials > 1), and the sweep iteration order (REPEATED vs INDEPENDENT). Implemented in _resolve_artifact_dir in src/aiperf/orchestrator/orchestrator.py.

| sweep | trials | order | layout |
| --- | --- | --- | --- |
| no | 1 | — | <base>/ |
| no | >1 | — | <base>/profile_runs/run_NNNN/ |
| yes | 1 | — | <base>/<dir_name>/ |
| yes | >1 | REPEATED | <base>/profile_runs/trial_NNNN/<dir_name>/ |
| yes | >1 | INDEPENDENT | <base>/<dir_name>/profile_runs/trial_NNNN/ |
| adaptive | any | — | <base>/search_iter_NNNN/profile_runs/run_NNNN/ |

<dir_name> is the {leaf_param_name}_{value} form (e.g. concurrency_10, request_rate_5.0); multi-dim sweep cells join components with __ (e.g. concurrency_10__isl_512). Inner-dir naming is asymmetric on purpose — the no-sweep multi-run case uses run_NNNN, the sweep + INDEPENDENT case uses trial_NNNN. Downstream consumers (plotters, dashboards) account for this asymmetry.
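The <dir_name> rule reduces to a few lines. This helper is hypothetical (the real logic lives on SweepVariation.dir_name); the leaf-of-a-dotted-path assumption is inferred from the {leaf}_{value} wording above.

```python
# Illustrative sketch of the {leaf}_{value} directory-name rule.
# Assumes parameter keys may be dotted paths whose last segment is the leaf.
def dir_name(values: dict) -> str:
    """Multi-dim sweep cells join per-parameter components with '__'."""
    return "__".join(f"{leaf.split('.')[-1]}_{value}" for leaf, value in values.items())
```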

The sweep-level aggregate path follows a parallel rule:

  • REPEATED + multi-run: <base>/aggregate/sweep_aggregate/
  • everything else (sweep-only, sweep + INDEPENDENT): <base>/sweep_aggregate/

Per-variation aggregates land at <base>/aggregate/<dir_name>/ in REPEATED mode and <base>/<dir_name>/aggregate/ otherwise (INDEPENDENT is the explicit default fallback in _per_variation_aggregate_dir; any non-REPEATED mode takes the else branch).

Adaptive outer loop (ask / tell)

Adaptive search is the same pipeline with one swap: instead of “expand a fixed grid into N configs up front,” the planner generates configs one at a time, learning from each result.

  • The sweep block is AdaptiveSearchSweep (type: adaptive_search) instead of GridSweep / ZipSweep / ScenarioSweep.
  • BenchmarkPlan.configs starts with one seed config; the planner extends it as it asks.
  • MultiRunOrchestrator dispatches to execute_adaptive_search, which runs planner.ask() -> execute trials -> planner.tell(results) until planner.ask() returns None (or cancellation / abort).
  • Four planner plugins ship: BayesianSearchPlanner (curated Optuna+BoTorch preset; auto-selects qLogNEI / qLogNEHVI based on objective count), MonotonicSLASearchPlanner (1D probe + bisection), SmoothIsotonicSLAPlanner (isotonic regression on bootstrap-resampled trials), OptunaSearchPlanner (TPE / GP / BoTorch samplers, expert-mode flag exposure). The BayesianSearchPlanner is implemented as a thin subclass of OptunaSearchPlanner that locks in the BoTorch sampler and the curated acquisition; it is not a separate engine.
  • Optional search_recipe plugins build the whole AdaptiveSearchSweep from a higher-level recipe (e.g. max-concurrency-under-sla, prefill-ttft-curve, pareto-sweep).
  • An optional post_process handler (degradation_knee_detect, ttft_curve_fit, itl_surface_fit, sla_breach_knee, pareto_sweep_export) runs after the final iteration.

Each iteration adds one SearchIteration to planner.history(). Convergence terminates the loop via planner.ask() returning None; the reason (plateau / improvement-patience / max-iterations) comes from planner.convergence_reason(). search_history.json is rewritten after every iteration so a crashed sweep still has a usable trail.
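The ask / tell loop above has a small fixed skeleton. This sketch is an assumption about shape only: the helper names adaptive_loop, run_trials, and CountdownPlanner are invented for illustration; the real execute_adaptive_search also handles cancellation, trial fan-out, and incremental search_history.json writes.

```python
# Illustrative ask/tell outer loop — names are hypothetical.
def adaptive_loop(planner, run_trials, max_safety=1000):
    history = []
    for _ in range(max_safety):
        proposal = planner.ask()            # None => converged or budget spent
        if proposal is None:
            break
        result = run_trials(proposal)       # M trials at the proposed point
        planner.tell(proposal, result)      # planner updates its model
        history.append((proposal, result))  # mirrors the SearchIteration trail
    return history

class CountdownPlanner:
    """Toy planner: proposes 0, 1, 2, then signals convergence with None."""
    def __init__(self):
        self.i = 0

    def ask(self):
        if self.i >= 3:
            return None
        self.i += 1
        return self.i - 1

    def tell(self, proposal, result):
        pass  # a real planner would fold the result into its model here
```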

Fan-out math

The cardinality of any sweep is N variations × M trials = N×M cells. Where N and M come from depends on the path.

For adaptive search, N is the iteration count: bounded above by max_iterations, possibly less if the planner converges early. M (trials per iteration) still applies — adaptive runs M trials per planner-proposed point, then tell()s the planner the aggregate.

Where the code lives

| Concept | File |
| --- | --- |
| Envelope, body | src/aiperf/config/config.py (AIPerfConfig, BenchmarkConfig) |
| Multi-run / convergence | src/aiperf/config/sweep/multi_run.py |
| Sweep variants | src/aiperf/config/sweep/config.py + sampling.py (QMC) + adaptive.py |
| Sweep expansion | src/aiperf/config/sweep/expand.py + expand_qmc.py |
| Plan loader (CLI/YAML -> plan) | src/aiperf/config/loader/plan.py |
| BenchmarkPlan / BenchmarkRun models | src/aiperf/config/resolution/plan.py |
| Orchestrator | src/aiperf/orchestrator/orchestrator.py |
| Executors | src/aiperf/orchestrator/{executor,local_executor}.py |
| Aggregation | src/aiperf/orchestrator/aggregation/sweep.py |
| Search planners + recipes | src/aiperf/orchestrator/search_planner/, src/aiperf/search_recipes/ |

For a fully-indexed file map covering every entry point, see Where to look in the code in Part 3.


Part 2 — Seven-stage tour

A guided tour of the sweep / multi-run / adaptive-search flow, focused on the big picture and the names of the types that move data between stages. Read this when you want to know what happens when you press enter.

The seven stages

Every aiperf profile invocation walks the same seven stages. The shape of each stage’s input and output is named — those names are the things to remember.

The pipeline doesn’t change shape between a single benchmark, a multi-run, a grid sweep, a scenarios sweep, or a Bayesian search. Only how many cells stage 5 produces and what decides each next cell changes:

| Mode | N (variations) | M (trials per variation) | Total cells |
| --- | --- | --- | --- |
| Single benchmark | 1 | 1 | 1 |
| Multi-run | 1 | MultiRunConfig.num_runs (1–10) | M |
| Grid sweep | cartesian product of sweep.parameters | MultiRunConfig.num_runs (default 1) | N × M |
| Scenarios sweep | len(runs[]) | MultiRunConfig.num_runs (default 1) | N × M |
| Adaptive search | grows by 1 every planner.ask(); capped by max_iterations | MultiRunConfig.num_runs (default 1) | ≤ max_iter × M |

Each cell is one BenchmarkRun -> one RunResult. The next section unpacks the N / M dimensions in detail.

Two dimensions: N variations × M trials

The sweep cardinality has two independent dimensions. Mixing them up is the single most common source of “wait, why did this run that many times?” surprise.

N comes from SweepConfig (the sweep block on AIPerfConfig): the sweep.parameters cartesian product, the runs[] list, or the planner’s proposals. Without a sweep block, N = 1.

M comes from MultiRunConfig (the multi-run block on AIPerfConfig):

| Field | Default | Meaning |
| --- | --- | --- |
| num_runs | 1 | Trial count per variation. M = 1 is “single run, no repeats.” Max 10 (both CLI flag and typed field share the cap). |
| cooldown_seconds | 0 | Sleep between trials so server caches / thermals reset. |
| convergence | unset | Optional ConvergenceConfig — stop early when results stabilize. |

Total runs = N × M:

  • N=1, M=1 -> 1 run: single benchmark, no confidence
  • N=1, M=5 -> 5 runs: one config, repeated for confidence intervals
  • N=4, M=1 -> 4 runs: a 4-point sweep, one shot per point
  • N=4, M=3 -> 12 runs: sweep with 3-trial confidence per variation

When M > 1, SweepAnalyzer.compute automatically produces a confidence block (mean / std / 95% CI) per metric per variation. When M = 1 you get point estimates only.
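The confidence block can be sketched as a normal-approximation interval over the M trial samples. This is an assumption about the math: the exact estimator SweepAnalyzer uses (z vs t, population vs sample std) is not specified in this document.

```python
# Illustrative mean / std / 95% CI over M trial samples (normal approximation,
# sample stdev, z = 1.96); the analyzer's exact estimator may differ.
from math import sqrt
from statistics import mean, stdev

def confidence_block(samples, z=1.96):
    m = mean(samples)
    s = stdev(samples) if len(samples) > 1 else 0.0
    half = z * s / sqrt(len(samples))
    return {"mean": m, "std": s, "ci95": (m - half, m + half)}
```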

Trials are not iterations. For an adaptive search, --search-max-iterations controls N (how many points the planner gets to try), and --num-profile-runs controls M (how many times each proposed point is benchmarked before the planner sees the aggregate). They multiply: a max_iterations=30, num_profile_runs=3 adaptive run executes up to 90 subprocess benchmarks.

Stages 1–2 — User input -> typed config

Two entry points converge on the same typed envelope AIPerfConfig, but by different paths. YAML skips CLIConfig entirely: load_config / load_config_from_string parse the file and call AIPerfConfig.model_validate directly. CLI flags are parsed by cyclopts into a CLIConfig (the human-friendly, CLI-shaped surface), then convert_cli_to_aiperf lifts magic flags into the typed envelope. From here on, AIPerfConfig is the single source of truth.

Why the CLI -> envelope hop? CLIConfig is the human-friendly CLI shape — magic-lists like --concurrency 1,2,4, --prefill-concurrency 1,2,4, or --request-rate 10,20,50 mean “sweep that field over those values.” The converter lifts those affordances into a typed sweep block on AIPerfConfig. After conversion, every flag has one canonical home in the envelope. YAML configs don’t need this hop — they’re already written in envelope shape, so load_config constructs AIPerfConfig directly via model_validate and skips CLIConfig entirely.

SweepConfig is a discriminated union

Pydantic discriminates by a type field on the YAML / dump (each variant sets a default for type, so YAML authors do not need to write it explicitly). The orchestrator never inspects the variant directly — it reads BenchmarkPlan.is_adaptive_search, which is true exactly when the variant is AdaptiveSearchSweep.
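The discrimination mechanism can be sketched without Pydantic. These dataclasses and the parse_sweep / is_adaptive_search helpers are illustrative stand-ins, not the real union; only the behavior (dispatch on a defaulted type field, one is_adaptive_search branch) reflects the text.

```python
# Plain-Python stand-in for the Pydantic discriminated union (illustrative).
from dataclasses import dataclass, field

@dataclass
class GridSweep:
    parameters: dict = field(default_factory=dict)
    type: str = "grid"

@dataclass
class AdaptiveSearchSweep:
    max_iterations: int = 30
    type: str = "adaptive_search"

_VARIANTS = {"grid": GridSweep, "adaptive_search": AdaptiveSearchSweep}

def parse_sweep(raw: dict):
    """Pick the variant by the 'type' discriminator; default is grid."""
    kind = raw.get("type", "grid")
    data = {k: v for k, v in raw.items() if k != "type"}
    return _VARIANTS[kind](**data)

def is_adaptive_search(sweep) -> bool:
    # Mirrors plan.is_adaptive_search — the orchestrator's only variant branch.
    return isinstance(sweep, AdaptiveSearchSweep)
```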

Stage 3 — Expand into a BenchmarkPlan

AIPerfConfig describes intent; BenchmarkPlan lists the actual cells the orchestrator will run. The plan-builder either short-circuits to a single seed variation (for adaptive runs and for no-sweep runs) or calls expand_sweep (cartesian product for grid, lockstep zip for zip, deep-merge for scenarios — also Sobol / Latin-hypercube for QMC sweeps), then renders any per-variation Jinja and emits one BenchmarkConfig per variation.

A few useful invariants:

  • SweepVariation{index, label, values}. One per variation. values is the dict of swept parameters that differ from the base config; the label is built from those for artifact directory names.
  • trials = M comes from MultiRunConfig.num_runs (default 1, max 10). It’s the per-cell repeat count for confidence aggregation, not the total run count.
  • For adaptive search, configs starts with one seed and grows as the planner asks. The plan-builder doesn’t know the final length up front.
  • plan.is_adaptive_search is the orchestrator’s only branch on the sweep variant — every other piece of code is variant-agnostic.

Stages 4–5 — Orchestrator dispatch

MultiRunOrchestrator.execute(plan, executor, search_planner=...) is the single entry point. It dispatches on plan.is_adaptive_search:

Grid / scenarios — REPEATED vs INDEPENDENT

For an N×M grid (N variations, M trials), there are two ways to interleave the work. Both produce the same N×M cells; they differ only in which loop is outer. iteration_order is a field on the grid family of sweeps (GridSweep, ZipSweep, ScenarioSweep); AdaptiveSearchSweep does not expose this knob.

REPEATED is the default. It interleaves so transient effects (warm caches, thermal drift) hit every variation similarly — better for cross-variation comparison. INDEPENDENT runs one variation to completion before moving on; required when each variation needs its own ExecutionStrategy to observe a full cell’s worth of results before deciding to stop (the adaptive trial-convergence case).

Adaptive — ask / tell loop

When plan.is_adaptive_search is true, execute_adaptive_search runs a tighter loop driven by a SearchPlanner:

| Step | What actually happens |
| --- | --- |
| planner.ask() | returns the next (BenchmarkConfig, SweepVariation) — or None to terminate. State: a Gaussian process (BayesianSearchPlanner), a bisection bracket (MonotonicSLASearchPlanner), an isotonic-fit history (SmoothIsotonicSLAPlanner), or an Optuna study (OptunaSearchPlanner). |
| _run_independent_cell | runs M trials at the proposed point — the same per-cell loop INDEPENDENT mode uses. |
| planner.tell(...) | feeds the M-trial aggregate back so the planner can update its model and propose a better next point. |
| planner.is_converged() | checked inside ask(). When max-iter / plateau / improvement-patience fires, ask() returns None. |
| search_history.json | rewritten after every iteration. A crashed run still has a usable trail. |

Stage 6 — Inside one cell

A “cell” is one (variation, trial) slot. Every cell runs a small state machine driven by an ExecutionStrategy:

Three collaborators inside the cell — two ABCs and one Pydantic model:

| Type | Implementations | Job |
| --- | --- | --- |
| ExecutionStrategy (ABC) | FixedTrialsStrategy, AdaptiveStrategy | Decide whether to run another trial in this cell. |
| RunExecutor (ABC) | LocalSubprocessExecutor (only one shipping) | Turn one BenchmarkRun into one RunResult by spawning a fresh subprocess of aiperf.orchestrator.subprocess_runner. |
| BenchmarkRun (Pydantic model) | — | The smallest unit of work — essentially (cfg, variation, trial, artifact_dir), plus identity fields (benchmark_id, sweep_id, label, cli_command, random_seed) that the orchestrator uses for the artifact tree and sweep grouping. |

FixedTrialsStrategy runs exactly M trials. AdaptiveStrategy runs until a ConvergenceConfig says enough — capped by multi_run.num_runs so it can’t run forever.

Stage 7 — Aggregate

After the orchestrator returns list[RunResult], the CLI runner groups by RunResult.variation_values, builds a per_combination_stats dict, and hands it to SweepAnalyzer.compute(per_combination_stats, sweep_parameters, sla_filters=…), which computes summary stats per group, identifies the Pareto frontier, and returns the aggregate dict the JSON / CSV exporters write.

The aggregate JSON has three result blocks plus a metadata block:

  • metadata — num_combinations, swept parameter list, and (when set) sla_constraints. Downstream consumers key off this block.
  • per_combination_metrics — one entry per unique variation_values, with swept parameters and a metric block (mean / p99 / etc.) for every metric.
  • best_configurations — fixed post-hoc picks for highest throughput and lowest latency from the aggregate summary. These are not the adaptive search’s configured objectives.
  • pareto_optimal — fixed post-hoc throughput/latency frontier computed via _dominates. Adaptive configured objectives are reported in search_history.json["best_trials"].
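A dominance check in the spirit of _dominates can be sketched as follows. This is an illustrative two-objective version (maximize throughput, minimize latency); the analyzer's actual objective set and tie-breaking are not specified here.

```python
# Illustrative throughput/latency dominance and Pareto filter.
def dominates(a, b):
    """a dominates b: no worse on both axes, strictly better on at least one."""
    return (a["throughput"] >= b["throughput"] and a["latency"] <= b["latency"]
            and (a["throughput"] > b["throughput"] or a["latency"] < b["latency"]))

def pareto_optimal(points):
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```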

Orthogonality note. best_configurations and pareto_optimal here are emitted by SweepAnalyzer, computed across the whole RunResult set, and live under sweep_aggregate/profile_export_aiperf_sweep.json. They are distinct from search_history.json["best_trials"], which is what the BO planner converged on (see Search History API). For a single-objective adaptive run with no failed iterations the two usually agree on the winner; they can disagree when iterations failed, when feasibility differs (search-history is feasibility-first lex over sla_filters, the analyzer ranks the full set), or when the analyzer’s Pareto computation includes objectives the planner wasn’t optimizing.

If the active sweep came from a search recipe with a PostProcessSpec, that handler runs after the analyzer and emits its own JSON file (e.g. degradation_knee.json for concurrency-ramp, pareto_sweep.json for pareto-sweep).

How search recipes plug in

A search recipe is a named preset that bundles “search space + objective + termination + SLA filters + optional post-process” into one CLI selector (--search-recipe <name>). It runs before stage 3 and emits the typed sweep config the rest of the pipeline expects.

SearchRecipeContext is the recipe’s read-only view of user intent — built BenchmarkConfig, declared SLA targets (--ttft-sla-ms, etc.), and any sweep-knob overrides (--concurrency-min, --isl-osl-pairs, etc.).

SearchRecipeOutput carries exactly one of adaptive_search, sweep_parameters, or scenarios (validated mutually exclusive), plus optional sla_filters, per-request slos, and a post_process spec.

The eight built-in recipes

| Recipe | Branch | What it builds |
| --- | --- | --- |
| max-throughput-ttft-sla | adaptive_search | BO over concurrency, objective = throughput, SLA = TTFT |
| max-throughput-itl-sla | adaptive_search | BO over concurrency, objective = throughput, SLA = ITL |
| max-concurrency-under-sla | adaptive_search (smooth_isotonic default) or grid | 1D feasibility — max concurrency where every SLA filter passes |
| max-goodput-under-slo | adaptive_search | BO maximizing goodput at >= attainment-fraction SLO compliance |
| concurrency-ramp | sweep_parameters + degradation_knee_detect | log-spaced concurrency grid, finds p99 degradation knee |
| prefill-ttft-curve | sweep_parameters + ttft_curve_fit | ISL grid at concurrency=1, linear / quadratic fit |
| decode-itl-curve | sweep_parameters + itl_surface_fit | 2D (concurrency × OSL) grid, surface fit |
| pareto-sweep | scenarios + pareto_sweep_export | paired ISL/OSL × concurrency Pareto frontier |

After expansion, downstream stages don’t know a recipe ever existed — they just see a normal AIPerfConfig.sweep with optional sla_filters attached.

End-to-end — putting it all together

One diagram from key-press to artifact:

Names worth remembering

If you remember nothing else from this doc, remember these eleven names — every other class in the sweep code is glue or helper.

| Name | What it is | Where in the flow |
| --- | --- | --- |
| AIPerfConfig | Typed envelope. Everything user-supplied lands here. | Stage 2 out -> 3 in |
| BenchmarkConfig | Benchmark body — models, endpoint, datasets, phases. | Field of AIPerfConfig |
| SweepConfig | Discriminated union — GridSweep, ZipSweep, ScenarioSweep, AdaptiveSearchSweep, SobolSweep, LatinHypercubeSweep. | Field of AIPerfConfig |
| SearchRecipe | Pluggable preset that emits a SearchRecipeOutput. | Pre-stage 3 |
| BenchmarkPlan | Expanded plan — configs[], variations[], trials, sweep. | Stage 3 out -> 4 in |
| MultiRunOrchestrator | Drives the cell loop; dispatches grid vs adaptive. | Stage 4 |
| ExecutionStrategy | Per-cell “should I keep going?” — FixedTrialsStrategy / AdaptiveStrategy. | Stages 5–6 |
| BenchmarkRun | One (cfg, variation, trial) plus identity (benchmark_id, sweep_id, label, cli_command, random_seed). Smallest unit of work. | Stage 6 in |
| RunExecutor | ABC. Only impl: LocalSubprocessExecutor. | Stage 6 |
| RunResult | One BenchmarkRun’s output (metrics + variation metadata). | Stage 6 out -> 7 in |
| SweepAnalyzer | Pure compute: list[RunResult] -> grouped / best / Pareto JSON. | Stage 7 |

For adaptive runs, three more:

| Name | What it is |
| --- | --- |
| SearchPlanner | ABC. BayesianSearchPlanner, MonotonicSLASearchPlanner, SmoothIsotonicSLAPlanner, OptunaSearchPlanner. |
| SearchIteration | Per-iteration record — proposal + measured objective + feasibility. |
| PostProcessHandler | Recipe artifact emitter — degradation_knee_detect, ttft_curve_fit, itl_surface_fit, sla_breach_knee, pareto_sweep_export. |

Part 3 — Class & module map

End-to-end view of how a YAML config or CLI invocation becomes an AIPerfConfig envelope, expands into a BenchmarkPlan, and is executed by MultiRunOrchestrator against a backend RunExecutor. Multiple zoom levels — pick whichever matches what you’re trying to understand.

The same BenchmarkPlan / MultiRunOrchestrator / RunExecutor machinery handles single-run, grid sweep, zip sweep, scenario sweep, and adaptive search. Dispatch differs only inside MultiRunOrchestrator.execute.

30,000 ft — what happens, period

10,000 ft — local end-to-end (with cluster path coming soon)

Sub-flow — config layer (YAML/CLI -> BenchmarkPlan)

Sub-flow — orchestrator iteration

cli_runner.run_benchmark peels off single-run plans (plan.is_single_run) before the orchestrator is constructed; only multi-run plans reach MultiRunOrchestrator.execute. Inside execute(), dispatch is two-way: adaptive-search vs. grid/scenarios. Grid/scenarios further branch on _plan_iteration_order(plan) which reads plan.sweep.iteration_order (REPEATED default, or INDEPENDENT).

(The artifact-tree layout table is documented above in Part 1 — Artifact directory layout reference.)

Sub-flow — RunExecutor backends

RunExecutor is a 2-method ABC: execute(run) -> RunResult and derive_id(plan, var_idx, trial) -> str. The local executor derives a stable id from the plan/variation/trial tuple for artifact naming; the cluster executor (coming soon — finalized on the K8s integration branch, not yet on main) derives a deterministic K8s-name-safe id from (plan, var_idx, trial) so child AIPerfJob creation is idempotent.

The RunResult shape returned by both backends is identical — the cluster path fetches the same profile_export_aiperf.json schema over HTTP that the local path reads off disk. Downstream SweepAnalyzer.compute(), aggregate_and_export(), and the search_history.json writer don’t know which backend produced the inputs.
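The two-method contract can be sketched as a minimal ABC. The signatures here are assumptions paraphrased from this document (the real ABC is in src/aiperf/orchestrator/executor.py), and ToyLocalExecutor is an invented stand-in, not LocalSubprocessExecutor.

```python
# Illustrative sketch of the 2-method RunExecutor contract.
from abc import ABC, abstractmethod

class RunExecutor(ABC):
    @abstractmethod
    def execute(self, run) -> dict:
        """Turn one BenchmarkRun into one RunResult."""

    @abstractmethod
    def derive_id(self, plan, var_idx: int, trial: int) -> str:
        """Stable, deterministic per-cell identifier (artifact naming locally;
        idempotent child-job naming on the cluster path)."""

class ToyLocalExecutor(RunExecutor):
    def execute(self, run):
        # A real executor would spawn a subprocess (or create a CR) here.
        return {"success": True, "trial_index": run["trial"]}

    def derive_id(self, plan, var_idx, trial):
        # Deterministic and name-safe: same inputs always yield the same id.
        return f"{plan}-v{var_idx}-t{trial}"
```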

Class / module map

Sequence — a sweep run end to end

Where to look in the code

| Concept | File |
| --- | --- |
| AIPerfConfig envelope, BenchmarkConfig body | src/aiperf/config/config.py |
| BenchmarkPlan, BenchmarkRun, ResolvedConfig | src/aiperf/config/resolution/plan.py |
| MultiRunConfig, ConvergenceConfig | src/aiperf/config/sweep/multi_run.py |
| SweepConfig union / GridSweep / ZipSweep / ScenarioSweep / AdaptiveSearchSweep / Objective / OutcomeConstraint / SweepVariation | src/aiperf/config/sweep/config.py |
| SobolSweep / LatinHypercubeSweep / QMC sampling helpers | src/aiperf/config/sweep/sampling.py, src/aiperf/config/sweep/expand_qmc.py |
| expand_sweep (definition) | src/aiperf/config/sweep/expand.py (re-exported from src/aiperf/config/sweep/__init__.py) |
| SearchSpaceDimension, SLAFilter | src/aiperf/config/sweep/adaptive.py |
| PostProcessSpec, SearchRecipe, SearchRecipeContext, SearchRecipeOutput | src/aiperf/search_recipes/_base.py (PostProcessSpec defined in src/aiperf/search_recipes/_post_process.py, re-exported from _base.py) |
| PostProcessHandler Protocol + built-ins | src/aiperf/search_recipes/post_process.py |
| build_benchmark_plan (load -> plan) | src/aiperf/config/loader/plan.py |
| MultiRunOrchestrator | src/aiperf/orchestrator/orchestrator.py |
| RunExecutor ABC + RunResult | src/aiperf/orchestrator/executor.py, src/aiperf/orchestrator/models.py |
| LocalSubprocessExecutor | src/aiperf/orchestrator/local_executor.py |
| Subprocess runner entry (python -m) | src/aiperf/orchestrator/subprocess_runner.py |
| SearchPlanner ABC + SearchIteration | src/aiperf/orchestrator/search_planner/base.py |
| BayesianSearchPlanner / MonotonicSLASearchPlanner / SmoothIsotonicSLAPlanner / OptunaSearchPlanner | src/aiperf/orchestrator/search_planner/{bayesian,monotonic,smooth_isotonic,optuna_planner}.py |
| parse_sla_filter, parse_search_space | src/aiperf/orchestrator/search_planner/parsing.py |
| SweepAnalyzer + exporters | src/aiperf/orchestrator/aggregation/sweep.py |
| aggregate_sweep_and_export (file writer) | src/aiperf/cli_runner/_sweep_aggregate.py (re-exported from cli_runner/_aggregate.py) |
| write_search_history | src/aiperf/exporters/search_history.py |
| run_benchmark (single vs multi dispatch) + _reject_in_process_sweep_under_operator | src/aiperf/cli_runner.py |
| Plugin registry + categories | src/aiperf/plugin/{plugins.py,categories.yaml,types.py,schema/} |

ABC hierarchy — orchestrator-side

The orchestrator layer’s extension points are abstract base classes; implementations are registered as plugins or instantiated directly by category-aware factories.

Sweep execution flow — class module map in motion

How the types from the class diagram actually flow through a sweep run. Read it as: each box is an instance of a class from the class diagram; arrows show what produces what; cardinality annotations make the fan-out explicit (1 plan -> N variations × M trials -> N×M results -> 1 aggregate).

The two views together: the flowchart shows cardinality and which class produces which (the data shape of a sweep); the sequence shows the temporal call pattern between the same classes. Both use only the types from the class diagram — no module-internal helpers.

Adaptive search — class types

The adaptive search path layers atop the same BenchmarkPlan / MultiRunOrchestrator / RunExecutor core. Adaptive config is not a separate field — it’s the AdaptiveSearchSweep variant of the SweepConfig discriminated union (type: adaptive_search). Two plugin categories cooperate: a search_planner (drives the outer loop) and an optional search_recipe (curates the search space / objective / post-process from a higher-level recipe template). The optional terminal post_process is a single PostProcessSpec resolved via search_recipe_post_process plugins.

Built-in search_recipe plugins (src/aiperf/search_recipes/):

  • max-throughput-ttft-sla, max-throughput-itl-sla
  • concurrency-ramp
  • prefill-ttft-curve, decode-itl-curve
  • max-goodput-under-slo, max-concurrency-under-sla
  • pareto-sweep

Recipes choose one of three output branches: adaptive_search (BO-style), sweep_parameters (grid-style — e.g. concurrency-ramp, prefill-ttft-curve, decode-itl-curve), or scenarios (deep-merge variants — e.g. pareto-sweep). The SearchRecipeOutput validator enforces exactly-one-of, so downstream code can branch cleanly.

Built-in search_recipe_post_process plugins: degradation_knee_detect, ttft_curve_fit, itl_surface_fit, sla_breach_knee, pareto_sweep_export.

Adaptive search — execution flow

The BO outer loop is a propose -> execute -> record cycle inside MultiRunOrchestrator.execute_adaptive_search. BenchmarkRun and RunExecutor are the same as in the grid path; the difference is that BenchmarkPlan.configs starts with one seed config and grows by one per iteration as the planner asks for the next point.

Adaptive search — recipe -> AdaptiveSearchSweep

A user can either author an AdaptiveSearchSweep directly under sweep: (low level) or pick a search_recipe plugin (high level) that builds one from a recipe + the user’s existing benchmark config. The adaptive block lives entirely on sweep; there is no separate adaptive-search field on MultiRunConfig.

SearchPlanner — protocols, planners, and extension points

How AIPerf’s Bayesian-Optimization outer loop is wired together: the protocols, the runtime sequence, and the config-to-execution flow.

The planner and the orchestrator talk through narrow protocols. MultiRunOrchestrator doesn’t know about Bayesian Optimization — it only knows SearchPlanner and RunExecutor. The Optuna+BoTorch dependency is hidden inside OptunaSearchPlanner and its BayesianSearchPlanner curated-preset subclass; MonotonicSLASearchPlanner and SmoothIsotonicSLAPlanner are 1D-feasibility-search planners that plug in at the same SearchPlanner ABC. Future planners (random-search baseline, MORBO, etc.) plug in identically.

Registered planners

Planner modules below are relative to aiperf.orchestrator.search_planner (e.g. bayesian.py -> aiperf.orchestrator.search_planner.bayesian).

| Plugin name | Class | Module | Purpose |
| --- | --- | --- | --- |
| bayesian | BayesianSearchPlanner | bayesian.py | Curated Optuna preset (subclass of OptunaSearchPlanner); uses BoTorch qLogNEI/qLogNEHVI when available and falls back to TPE with a warning when the optional BoTorch stack is unavailable |
| monotonic_sla | MonotonicSLASearchPlanner | monotonic.py | 1D exponential probe + bisection mirroring perf_analyzer's --binary-search. Margin-magnitude-blind. |
| smooth_isotonic | SmoothIsotonicSLAPlanner | smooth_isotonic.py (+ helpers _smooth_isotonic_fit.py, _replicate_budget.py, _cliff_detect.py, _margin_normalize.py) | 1D PAVA + PCHIP smooth-isotonic fit; opt-in replicates and bootstrap CI; cliff-curve guard. Default for max-concurrency-under-sla. |
| optuna | OptunaSearchPlanner | optuna_planner.py | Expert-mode Optuna BO (TPE / GP / BoTorch samplers exposed via --optuna-sampler); Optuna ships by default, BoTorch requires the optional botorch extra. |

All four are registered in src/aiperf/plugin/plugins.yaml under the search_planner: category and resolved via plugins.get_class(PluginType.SEARCH_PLANNER, name).
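The resolution step is a plain name-to-class lookup. The sketch below is a hypothetical, much-simplified stand-in for the real plugins module (which is backed by plugins.yaml, not an in-memory dict); only the `get_class(PluginType.SEARCH_PLANNER, name)` call shape comes from the source:

```python
from enum import Enum

class PluginType(str, Enum):
    SEARCH_PLANNER = "search_planner"

# Hypothetical in-memory stand-in for the plugins.yaml-backed registry.
_REGISTRY: dict[PluginType, dict[str, type]] = {PluginType.SEARCH_PLANNER: {}}

def register(plugin_type: PluginType, name: str):
    """Decorator registering a class under (category, name)."""
    def deco(cls: type) -> type:
        _REGISTRY[plugin_type][name] = cls
        return cls
    return deco

def get_class(plugin_type: PluginType, name: str) -> type:
    """Resolve a plugin name to its class, with a helpful error on miss."""
    try:
        return _REGISTRY[plugin_type][name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY[plugin_type]))
        raise ValueError(
            f"unknown {plugin_type.value} {name!r}; known: {known}"
        ) from None

@register(PluginType.SEARCH_PLANNER, "random")
class RandomPlanner:  # placeholder registrant for the sketch
    pass
```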

SearchPlanner class diagram

The CLI grammar lives in aiperf.orchestrator.search_planner.parsing.parse_search_space(values), which converts --search-space "path:lo,hi[:kind]" strings into SearchSpaceDimension instances. The v1->v2 converter (build_multi_run in aiperf.config.flags._converter_optionals) packages everything into a typed AdaptiveSearchSweep carried on AIPerfConfig.sweep.
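The grammar can be illustrated with a hypothetical re-implementation. Only the `"path:lo,hi[:kind]"` shape and the names parse_search_space / SearchSpaceDimension come from the source; the field names, the default kind, and the validation behavior here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SearchSpaceDimension:
    # Illustrative shape; the real class lives in
    # aiperf.orchestrator.search_planner.parsing.
    path: str
    low: float
    high: float
    kind: str = "int"  # assumed default when [:kind] is omitted

def parse_search_space(values: list[str]) -> list[SearchSpaceDimension]:
    """Hypothetical parser for 'path:lo,hi[:kind]' strings."""
    dims = []
    for value in values:
        parts = value.split(":")
        if len(parts) not in (2, 3):
            raise ValueError(f"expected 'path:lo,hi[:kind]', got {value!r}")
        path, bounds = parts[0], parts[1]
        kind = parts[2] if len(parts) == 3 else "int"
        lo, hi = (float(x) for x in bounds.split(","))
        if lo >= hi:
            raise ValueError(f"lo must be < hi in {value!r}")
        dims.append(SearchSpaceDimension(path, lo, hi, kind))
    return dims
```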

Runtime sequence — one BO iteration

MultiRunOrchestrator.execute_adaptive_search is a thin loop. Every iteration: ask the planner for a (BenchmarkConfig, SweepVariation); run all configured trials at that point via the same _run_independent_cell grid sweeps use; tell the planner what happened; write search_history.json incrementally. When ask() returns None, surface the planner’s convergence_reason() and exit.

A few things this view makes explicit:

  • Aggregate observations to the GP. The planner currently reports one Optuna trial per search point. With objective_pooling=mean it tells Optuna the mean of finite per-trial objective values; with pooled percentile mode it tells Optuna the pooled percentile objective computed from the raw record samples. Per-trial RunResult objects remain on SearchIteration.results in memory for search-history derivation, but separate per-trial Optuna observations are not recorded in v1.
  • Failed-iteration handling. When zero trials produce a usable objective, the planner still calls study.tell(trial, fallback_objective) so Optuna’s ask/tell pairing stays consistent. The fallback is a strictly worse-than-prior sentinel used only inside the study; search_history.json persists objective_values: null for that iteration.
  • Three convergence signals. is_converged() checks max_iterations, then improvement-over-best patience, then coefficient-of-variation plateau. The first to fire wins; the reason is recorded in search_history.json.
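The two pooling modes in the first bullet can be sketched numerically. This is an illustrative helper, not the real aggregation code; the mode names follow the prose above, the nearest-rank percentile and None-for-empty fallback are assumptions:

```python
import math

def aggregate_objective(per_trial_objectives, raw_samples, mode, percentile=99.0):
    """One search point's trials -> the single value reported to Optuna."""
    if mode == "mean":
        # Mean of finite per-trial objective values only.
        finite = [v for v in per_trial_objectives
                  if v is not None and math.isfinite(v)]
        return sum(finite) / len(finite) if finite else None
    if mode == "pooled":
        # Pool raw record samples across all trials, then take one percentile.
        pooled = sorted(s for trial in raw_samples for s in trial)
        if not pooled:
            return None
        # Nearest-rank percentile, for illustration only.
        rank = max(0, math.ceil(percentile / 100 * len(pooled)) - 1)
        return pooled[rank]
    raise ValueError(f"unknown objective pooling mode {mode!r}")
```

The pooled mode matters for tail percentiles: pooling raw samples before taking p99 weights every request equally, whereas averaging per-trial p99s weights every trial equally.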

Config flow — CLI / YAML -> execution

The CLI feeds a CLIConfig through src/aiperf/config/flags/converter.py, which packages the search-space + objective + filters into a typed AdaptiveSearchSweep carried on AIPerfConfig.sweep. From there the plan builder produces a BenchmarkPlan, and MultiRunOrchestrator.execute dispatches on plan.is_adaptive_search to the BO loop.

Notes on extension points

  • Adding a new planner backend (random-search baseline, etc.): subclass SearchPlanner, implement the four abstract methods (ask/tell/is_converged/history); optionally override convergence_reason (default returns None) and boundary_summary (1D feasibility planners only). No orchestrator changes required — MonotonicSLASearchPlanner, SmoothIsotonicSLAPlanner, and OptunaSearchPlanner are existing examples reusing the same ABC. Wiring is already generic: AdaptiveSearchSweep.planner is a SearchPlannerType (ExtensibleStrEnum in src/aiperf/plugin/enums.pyi), and cli_runner._run_multi_benchmark instantiates the planner via plugins.get_class(PluginType.SEARCH_PLANNER, sweep.planner). To register a new backend, add an entry under search_planner: in src/aiperf/plugin/plugins.yaml and a matching enum value — no dispatch code changes.
  • Adding a new executor backend: subclass RunExecutor. LocalSubprocessExecutor iterates one (variation, trial) at a time via execute(BenchmarkRun) — the seam is adaptive-shaped by construction.
  • Replacing the BO backend. The OptunaSearchPlanner boundary (and its BayesianSearchPlanner curated-preset subclass) is the only Optuna-aware code in the project. BoTorch-specific acquisitions live behind the optional botorch extra in pyproject.toml. The qlognei / qlognehvi acquisitions, posterior-regret stopping (--optuna-terminator regret/emmr), pooled-percentile aggregation (--search-percentile-pooling pooled), and the Hvarfner-DSP Matern-5/2 kernel (arXiv:2402.02229) are all plumbed through _optuna_helpers.py. The remaining principled upgrade path is wiring per-iteration heteroscedastic noise estimates from the pooled-percentile JSONL helper into a HeteroskedasticSingleTaskGP-based custom candidates_func. Evidence-gated: ship only if observed within-trial variance varies meaningfully across the search space on real workloads.
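As a concrete instance of the first extension point, here is a sketch of the random-search baseline mentioned above. In the real code this would subclass SearchPlanner and be registered under search_planner: in plugins.yaml; here the ABC is elided so the example is self-contained, and the dimension tuples and constructor arguments are assumptions:

```python
import random

class RandomSearchPlanner:
    """Sketch of a random-search baseline implementing the four planner methods."""

    def __init__(self, dimensions, max_iterations=20, seed=0):
        self.dimensions = dimensions          # assumed shape: [(path, lo, hi), ...]
        self.max_iterations = max_iterations
        self._history = []
        self._rng = random.Random(seed)

    def ask(self):
        # None signals convergence to the orchestrator loop.
        if self.is_converged():
            return None
        return {path: self._rng.uniform(lo, hi)
                for path, lo, hi in self.dimensions}

    def tell(self, point, objective):
        self._history.append((point, objective))

    def is_converged(self):
        return len(self._history) >= self.max_iterations

    def history(self):
        return list(self._history)

    def convergence_reason(self):
        return "max_iterations" if self.is_converged() else None
```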

smooth_isotonic as novel-in-composition

The SmoothIsotonicSLAPlanner algorithm (PAVA monotonic regression as denoiser -> PCHIP cubic Hermite interpolant -> root-find for SLA-threshold crossing, plus a PAVA-residual changepoint detector for the cliff-guard exit) does not appear in published BO literature in this exact composition. The components are textbook (PAVA: pool-adjacent-violators; PCHIP: shape-preserving piecewise-cubic Hermite interpolation; bracketed root-find: classical numerical analysis), but their composition for SLA-saturation in noisy GPU-serving benchmarking is original. Adjacent prior art:

  • Letham et al. 2017 (arXiv:1706.07094) — the noise-modeling anchor; per-trial-observations in BO with feasibility-product constraints.
  • DistServe (Zhong et al. OSDI ‘24, arXiv:2401.09670) — “DistServe simply enumerates the placements via binary search and finds the maximum rate that meets the SLO attainment target with simulation trials.” MonotonicSLASearchPlanner reproduces DistServe’s algorithm; SmoothIsotonicSLAPlanner is a strict improvement (denoised + continuous-space root-find).
  • BOute (Jiang et al. 2026, arXiv:2602.10729) — closest contemporary work using BO for LLM serving; constrained qNEHVI on BoTorch with ModelListGP. Different problem (serving-system optimization rather than benchmark-side adaptive sweep), same machinery family.

smooth_isotonic is defensibly novel in the systems-benchmarking literature even though every individual piece is classical statistics; worth a section in a future technical report.
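The PAVA -> PCHIP -> root-find pipeline described above can be sketched end to end. This is a minimal illustration, not the planner's implementation: PAVA is written inline, the PCHIP fit and bracketed root-find come from SciPy, and the real planner adds replicate budgeting, bootstrap CIs, margin normalization, and the cliff guard:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from scipy.optimize import brentq

def pava_increasing(y, w=None):
    """Pool-adjacent-violators: weighted least-squares nondecreasing fit."""
    y = list(map(float, y))
    w = [1.0] * len(y) if w is None else list(map(float, w))
    blocks = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while adjacent blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out

def sla_crossing(xs, latencies, sla):
    """Denoise with PAVA, interpolate with PCHIP, root-find the SLA crossing."""
    fit = np.array(pava_increasing(latencies))
    curve = PchipInterpolator(xs, fit)  # shape-preserving cubic Hermite
    if fit[0] >= sla:
        return float(xs[0])   # already at/over SLA at the smallest point
    if fit[-1] < sla:
        return None           # SLA never breached inside the probed range
    return brentq(lambda x: float(curve(x)) - sla, xs[0], xs[-1])
```

Because the PAVA output is nondecreasing and PCHIP is shape-preserving, the interpolant is monotone, so the bracketed root is unique: the continuous-space SLA-saturation estimate.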