Bayesian-Optimization Outer Loop
New users start here: Search Recipes bundle the BO knobs below into named presets such as --search-recipe max-throughput-ttft-sla --ttft-sla-ms 200. The 1D-saturation case (max-passing-concurrency under an SLA) is covered in 1D SLA saturation below. Use the explicit --search-* flags documented on this page when no recipe matches your workflow.
Kubernetes execution: coming soon. Every --search-* flag documented below is designed to work unchanged under cluster execution via the AIPerfSweep CRD + aiperf kube sweep CLI. The cluster-side path is finalized on the upcoming K8s integration branch but not yet on main. When it ships, BO will run inside an in-cluster sweep-controller pod that creates one child AIPerfJob CR per iteration; search_history.json and sweep_aggregate/ artifacts are served via the operator’s results API instead of being written to the local artifacts directory. Until then, run BO with aiperf profile locally.
aiperf profile --search-space ... --search-metric ... --search-direction ... --search-max-iterations ... runs an adaptive outer loop instead of a grid sweep. Each iteration the planner asks Optuna for the next point in the search space, runs --num-profile-runs benchmarks at it, scores the configured objective, and feeds the result back to the optimizer.
AIPerf ships an Optuna-backed adaptive-search engine, exposed under two --search-planner names:
- --search-planner=bayesian (recommended default): a curated preset that uses the BoTorch sampler when available, auto-selects qLogNoisyExpectedImprovement for single-objective and qLogNoisyExpectedHypervolumeImprovement for multi-objective, and applies the Hvarfner-DSP Matern-5/2 kernel (Hvarfner et al. ICML 2024, arXiv:2402.02229) to every GP fit. If the optional BoTorch stack is unavailable, it logs a warning and falls back to Optuna’s core TPE sampler.
- --search-planner=optuna (expert mode): same engine, but exposes --optuna-sampler {tpe,gp,botorch} and --optuna-acquisition {logei,qlogei,qnei,qlognei,qehvi,qnehvi,qlognehvi} for users who specifically want TPE / GP-EI / a non-default acquisition. Defaults to the BoTorch path when available and falls back to TPE only when that default was implicit; explicit botorch requests require the optional BoTorch stack and fail clearly when it is unavailable.
Optuna is installed by default. The BoTorch sampler is optional:
The BoTorch extra pulls in optuna-integration, botorch>=0.10, gpytorch, and torch.
Use BO when:
Use a grid sweep when:
BO runs in-process via aiperf profile --search-*. The orchestrator owns the planner state and drives one benchmark per iteration.
This runs 30 search iterations × 3 trials each = 90 benchmarks. --search-planner=bayesian is the implicit default. Output:
- <artifact_dir>/search_iter_NNNN/profile_runs/run_NNNN/: per-trial artifacts.
- <artifact_dir>/search_history.json: BO trajectory, written incrementally.
- <artifact_dir>/aggregate/sweep_aggregate/profile_export_aiperf_sweep.{json,csv}: same per-combination aggregate the grid path produces. (For sweep-only runs without --num-profile-runs, this lands at <artifact_dir>/sweep_aggregate/ instead; multi-run wrapping nests it under aggregate/.)
PATH:LO,HI[:KIND]
- PATH is a dotted path resolved by _set_nested_value (the same primitive grid sweeps use). For named-list segments like phases.profiling.concurrency, the segment matches against the name field. Typos error loudly with the available names listed.
- LO and HI are inclusive bounds parsed as floats. For :int, integer rounding happens at the planner boundary so candidates always coerce to integers before the run.
- :int produces an integer-valued dimension, :real a real-valued dimension. Categorical dimensions are not supported by the current SearchSpaceDimension model.
Multi-dim:
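To make the PATH:LO,HI[:KIND] grammar concrete, the following is a minimal Python sketch of a parser and of the integer coercion applied to candidates. The names SearchDim, parse_search_space, and coerce_candidate are hypothetical and do not reflect AIPerf's internal code; the default kind used when :KIND is omitted is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDim:
    path: str   # dotted path, e.g. "phases.profiling.concurrency"
    lo: float   # inclusive lower bound
    hi: float   # inclusive upper bound
    kind: str   # "int" or "real"


def parse_search_space(spec: str, default_kind: str = "real") -> SearchDim:
    """Parse 'PATH:LO,HI[:KIND]' (illustrative helper, not AIPerf's parser)."""
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected PATH:LO,HI[:KIND], got {spec!r}")
    path, bounds = parts[0], parts[1]
    # default_kind is an assumption; the source does not state the default.
    kind = parts[2] if len(parts) == 3 else default_kind
    if kind not in ("int", "real"):
        raise ValueError(f"unsupported kind {kind!r}; categorical is not supported")
    lo_text, sep, hi_text = bounds.partition(",")
    if not sep:
        raise ValueError(f"bounds must be LO,HI, got {bounds!r}")
    lo, hi = float(lo_text), float(hi_text)
    if lo > hi:
        raise ValueError(f"LO must not exceed HI, got {lo} > {hi}")
    return SearchDim(path, lo, hi, kind)


def coerce_candidate(dim: SearchDim, value: float) -> float | int:
    """Clamp to the inclusive bounds; round to an integer for :int dimensions."""
    clamped = min(max(value, dim.lo), dim.hi)
    return int(round(clamped)) if dim.kind == "int" else clamped
```

For example, parse_search_space("phases.profiling.concurrency:1,64:int") yields an integer dimension on [1, 64], and a raw optimizer suggestion of 12.6 would be coerced to 13 before the run.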
--search-metric must match a key in RunResult.summary_metrics produced by the run — that is, the bare metric tag (output_token_throughput, time_to_first_token), not the flattened _avg/_p99 aggregator-suffixed form.
Aggregation across --num-profile-runs. Per iteration, the planner records a single Optuna trial whose scalar objective is the arithmetic mean of finite per-trial values across the N successful benchmark trials (or the pooled-percentile point estimate when --search-percentile-pooling=pooled and --search-stat is a percentile). The GP therefore sees one observation per search point, not N, and fits a single homoscedastic noise term over the whole study. The within-point spread is not currently exposed to the GP; heteroscedastic-noise modelling via HeteroskedasticSingleTaskGP is listed as a deferred upgrade under What this implementation isn’t.
The same scalar is what search_history.json records as the first entry of the per-iteration objective_values list (length-1 for single-objective).
Failed trials. Skipped when extracting the objective. An iteration with zero successful trials is reported to Optuna via a per-direction failure sentinel passed to study.tell() (large value when minimizing, small when maximizing), and the loop continues. A warning is logged.
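The aggregation and failure handling described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name iteration_objective is hypothetical, failed trials are represented as None or non-finite values, and the sentinel magnitude is a placeholder rather than the constant AIPerf uses.

```python
import math


def iteration_objective(trial_values, direction, sentinel=1e18):
    """Mean of finite per-trial values, or a failure sentinel if none succeeded.

    `sentinel` is an illustrative magnitude, not AIPerf's actual constant.
    """
    # Failed trials (None / NaN / inf) are skipped when extracting the objective.
    finite = [v for v in trial_values if v is not None and math.isfinite(v)]
    if finite:
        return sum(finite) / len(finite)
    # Zero successful trials: report a value the optimizer will treat as bad
    # (large when minimizing, small when maximizing).
    return sentinel if direction == "minimize" else -sentinel
```

The optimizer therefore receives exactly one scalar per search point, regardless of how many of the N trials succeeded.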
Mean of percentiles vs pooled percentiles. When --search-stat is a percentile (p50/p99/…), the BO objective defaults to the expected per-trial percentile (mean across trials). Pass --search-percentile-pooling pooled to switch to the percentile of the pooled per-request samples across all --num-profile-runs trials: the planner walks each trial’s profile_export.jsonl and computes numpy.percentile over the pooled bag. The two differ for skewed distributions: pooled-p99 over N×requests exposes more tail mass than mean(per-trial p99). When BO is only locating the best config, the choice rarely changes the optimum’s location; for SLO claims it does, and pooled is the statistic that satisfies the claim. pooled requires --export-level records (or raw) so the per-request JSONL is on disk; missing JSONL falls back to mean-of-percentiles with a one-time warning. See Nakayama 2014, Confidence Intervals for Quantiles Using Sectioning, for the canonical bias/variance analysis of pooled-vs-sectioned quantile estimation; the WSC tutorial by Pasupathy & Yeh 2022, Input Uncertainty Quantification for Quantiles, is the recommended starting point for engineers wiring up sectioning-based confidence intervals on top of the point estimate.
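The difference between the two statistics is easy to see on a toy input. The sketch below uses only the standard library; its percentile helper follows the linear-interpolation rule that numpy.percentile uses by default. The function names are hypothetical, and the per-trial sample lists stand in for the per-request values the planner would read from each trial's JSONL.

```python
def percentile(samples, q):
    """Linear-interpolation percentile (numpy.percentile's default rule)."""
    data = sorted(samples)
    if not data:
        raise ValueError("no samples")
    rank = (len(data) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(data) - 1)
    frac = rank - lower
    return data[lower] + (data[upper] - data[lower]) * frac


def mean_of_percentiles(trials, q):
    """Default behaviour: compute the percentile per trial, then average."""
    return sum(percentile(t, q) for t in trials) / len(trials)


def pooled_percentile(trials, q):
    """Pooled behaviour: one percentile over all per-request samples."""
    pooled = [x for t in trials for x in t]
    return percentile(pooled, q)


trials = [[1, 2, 3], [10, 20, 30]]
print(mean_of_percentiles(trials, 50))  # 11.0
print(pooled_percentile(trials, 50))    # 6.5
```

The two trials here have very different latency scales, so averaging per-trial medians and taking the median of the pooled samples give different answers.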
SLA filters (--search-sla, --ttft-sla-ms, etc.) and outcome_constraints are wired into Optuna via the native constraints_func interface. Each trial’s signed-violation vector (v_i = observed_i - threshold_i, sign-aligned so v_i <= 0 means feasible) is written to trial.user_attrs["constraints"] and the BoTorch sampler consumes it through constraints_func. Per Letham et al. 2019 (arXiv:1706.07094), this lets the acquisition function downweight infeasible regions without requiring a hand-rolled penalty merit, and lets the GP learn the feasibility surface as a separate output.
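As a rough illustration of the sign convention, the sketch below computes a signed-violation vector in which v_i <= 0 means feasible. The helper names and the le/ge operator vocabulary are assumptions made for the example; the real wiring goes through Optuna's constraints_func and is not shown.

```python
def signed_violation(observed, threshold, op):
    """Return v such that v <= 0 means the constraint is satisfied.

    The "le"/"ge" operator names are assumed for illustration.
    """
    if op == "le":
        # Upper bound (e.g. latency): exceeding the limit gives a positive value.
        return observed - threshold
    if op == "ge":
        # Lower bound (e.g. a success fraction): flip the sign so that
        # falling short of the limit gives a positive value.
        return threshold - observed
    raise ValueError(f"unknown op {op!r}")


def constraint_vector(metrics, constraints):
    """constraints: iterable of (metric_name, op, threshold) tuples."""
    return [signed_violation(metrics[name], thr, op) for name, op, thr in constraints]


def is_feasible(vector):
    return all(v <= 0 for v in vector)
```

A trial with p99 TTFT of 220 ms against a 200 ms upper bound yields a violation of 20, so the trial is infeasible on that constraint even if every other entry is negative.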
The loop terminates when any of:
- --search-max-iterations iterations have been run.
- Improvement patience (improvement_patience, default 10): no successful iteration has improved the running best for that many consecutive iterations. “We’ve stopped finding better points” is a stronger termination signal than “values stopped fluctuating.”
- Plateau detection (plateau_window, plateau_threshold): on the last plateau_window (default 8) successful iterations, the sample CV (stddev/|mean|, Bessel’s correction) falls below plateau_threshold (default 0.01 = 1% relative spread). Refused when |mean| is essentially zero, because CV has no scale in that regime.
Whichever signal fires first wins; the reason is logged and recorded under convergence_reason in search_history.json so a post-run audit can tell which signal terminated the loop. The value is one of "max_iterations", "improvement_patience", "plateau_cv", or, when --optuna-terminator is set under --search-planner=optuna, "posterior_regret_bound" (Makarova 2022) / "emmr" (Ishibashi 2023).
Plateau detection is scale-free — works for throughput (~1000) and latency (~50) without tuning. Convergence can fire as early as iteration plateau_window if the first random-Sobol points happen to land in a flat region; this is correct behavior, not a bug.
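The two adaptive stopping signals can be sketched as follows. This is a simplified illustration, not the planner's code: the function names are hypothetical, the near-zero cutoff eps is an assumed value, and values is taken to be the objective series of successful iterations only.

```python
import statistics


def plateau_reached(values, window=8, threshold=0.01, eps=1e-12):
    """Sample CV (Bessel-corrected stddev / |mean|) over the last `window` values."""
    if len(values) < window:
        return False
    recent = values[-window:]
    mean = statistics.fmean(recent)
    if abs(mean) < eps:
        # CV has no scale when the mean is essentially zero, so refuse to fire.
        return False
    cv = statistics.stdev(recent) / abs(mean)  # stdev() uses Bessel's correction
    return cv < threshold


def patience_exhausted(values, direction, patience=10):
    """True when the running best has not improved for `patience` iterations."""
    best, stale = None, 0
    for v in values:
        if best is None or (v > best if direction == "maximize" else v < best):
            best, stale = v, 0
        else:
            stale += 1
    return stale >= patience
```

Because the plateau check divides by |mean|, the same threshold applies to a throughput series around 1000 and a latency series around 50.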
--search-* is mutually exclusive with:
- Magic-list sweeps (--concurrency 10,20,30).
- sweep: blocks in YAML.
- --convergence-metric (adaptive trial-level early stop). Reason: the trial-level convergence semantics are orthogonal to outer-loop convergence; their composition is undefined. Rejected at config-validate time in _converter_optionals._reject_search_plus_convergence.
search_history.json:
Single-objective is the length-1 special case of the multi-objective shape: best_trials always contains the global argmax/argmin. For len(objectives) > 1, best_trials is the Pareto front (one entry per non-dominated trial). Schema reference: Search History API.
convergence_reason is one of "max_iterations", "improvement_patience", "plateau_cv", "posterior_regret_bound", "emmr", or null (still running / written mid-loop). The monotonic-planner path (see 1D SLA saturation) adds "monotonic_precision_reached", "monotonic_no_failure_in_range", and "monotonic_no_pass_in_range". The smooth_isotonic planner additionally emits "smooth_isotonic_no_pass_in_range" and "smooth_isotonic_no_failure_in_range".
The file is rewritten after every iteration, so a crashed run still leaves the partial trajectory on disk.
boundary_summary (1D SLA-saturation)
When the search has exactly one search-space dimension, search_history.json carries a boundary_summary block reporting the literal feasibility boundary: the highest swept value that passed (feasible_max) and, when at least one SLA filter is configured, the lowest that failed (infeasible_min, with the breaching filter recorded under first_breach). For runs with no SLA filters, every iteration is feasible by definition, so infeasible_min is null and feasible_max reports the highest swept value seen. For multi-dim searches the entire block is null. See the boundary_summary block section under 1D SLA saturation for the full schema and examples.
The monotonic_sla planner (registered alongside bayesian under the search_planner plugin category) writes the same boundary_summary shape directly from its bisection state, so consumers don’t branch on planner choice.
When len(objectives) > 1, BO maximizes the dominated hypervolume of the Pareto front rather than a scalar. Use it when you want the trade-off between two-or-more metrics (e.g. throughput vs. p99 TTFT) instead of a single best point. AIPerf supports multi-objective Bayesian optimization via Optuna+BoTorch’s qLogNoisyExpectedHypervolumeImprovement acquisition (qlognehvi; Daulton et al. 2021, arXiv:2105.08195).
The single-objective path is the length-1 special case of the same machinery; this section documents the additional knobs that come into play when len(objectives) > 1.
- 0.7*throughput - 0.3*ttft) you have to commit to weights up front; with multi-objective you defer the choice.
- --search-planner=bayesian (the curated preset auto-selects qlognehvi when len(objectives) > 1) or under the explicit-flag form --search-planner=optuna --optuna-sampler=botorch --optuna-acquisition=qlognehvi. The 1D-saturation planners monotonic_sla and smooth_isotonic are single-objective by design.
- 0.7*throughput - 0.3*ttft, or a goodput metric that already encodes the SLA). Single-objective BO is faster, has tighter convergence guarantees, and produces a single number.
- len(objectives) == 1 is just single-objective BO.
- sla_filters instead. See Three knobs that look similar.
A ModelListGP fits one independent GP per objective, so the planner doesn’t assume the objectives are correlated: the GP for throughput and the GP for TTFT each get their own kernel hyperparameters. qLogNEHVI scores candidate points by expected hypervolume gain over the current dominated region, using a numerically stable log-space formulation (Ament et al. 2023, arXiv:2310.20708).
Objective.threshold, OutcomeConstraint, and sla_filters are all “thresholds on metrics” but they do different things. Get them right. This table is the canonical reference for the three-knob distinction; other docs link here.
You can combine all three. A typical multi-objective recipe might use objectives: [throughput, ttft] with Objective.threshold on each, plus OutcomeConstraint on error_request_count to keep BO out of failure regions, with no sla_filters because the trade-off itself is the point.
AdaptiveSearchSweep.objectives is a list of Objective entries, each with metric, stat, direction, and an optional threshold (Pareto reference point used to bound hypervolume). outcome_constraints is a parallel list of OutcomeConstraint feasibility gates on metrics the optimizer is not optimizing — distinct from Objective.threshold (reference point) and from sla_filters (post-hoc benchmark eligibility).
The CLI shorthand --search-metric / --search-direction produces a length-1 objectives list. Multi-objective requires either an explicit objectives: block in YAML (preferred, since the CLI shape doesn’t repeat well for N>1) or a multi-objective search recipe.
Multi-objective is supported under both:
- --search-planner=bayesian: the curated preset auto-detects len(objectives) > 1 and selects qlognehvi. No further flags required.
- --search-planner=optuna --optuna-sampler=botorch --optuna-acquisition=qlognehvi: the explicit-flag form for users who want to pin every choice.
The 1D-saturation planners monotonic_sla and smooth_isotonic are intrinsically single-objective (1D bisection / 1D isotonic regression) and reject len(objectives) > 1 at config-time.
Multi-objective uses BoTorch’s qLogNoisyExpectedHypervolumeImprovement (Daulton et al. 2021, arXiv:2105.08195) — the noise-aware, log-space-numerically-stable Pareto BO default. qehvi and qnehvi are the older non-log variants kept under --search-planner=optuna for parity studies.
AdaptiveSearchSweep enforces that the acquisition matches the number of objectives:
- Single-objective acquisitions (logei, qlogei, qnei, qlognei) reject len(objectives) > 1 with a config-time error suggesting qlognehvi.
- Multi-objective acquisitions (qehvi, qnehvi, qlognehvi) reject len(objectives) == 1 with a config-time error suggesting qlognei.
- The 1D-saturation planners (monotonic_sla, smooth_isotonic) reject len(objectives) > 1 outright.
Objective.threshold is the Pareto reference point for hypervolume computation: trials worse than threshold on this objective contribute zero hypervolume. When threshold: null, the planner auto-derives one from the worst observed value among the Sobol initial points. Set it explicitly when you have a defensible “anything past this is unusable” bound (e.g. p99 TTFT > 250ms is unacceptable for the workload).
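The acquisition and objective-count rules listed before the Objective.threshold note amount to a small validation step. A hedged sketch follows; check_objective_count is a hypothetical name, the real checks live on AdaptiveSearchSweep, and the error messages here are illustrative only.

```python
SINGLE_OBJECTIVE_ACQ = {"logei", "qlogei", "qnei", "qlognei"}
MULTI_OBJECTIVE_ACQ = {"qehvi", "qnehvi", "qlognehvi"}
ONE_D_PLANNERS = {"monotonic_sla", "smooth_isotonic"}


def check_objective_count(planner, acquisition, n_objectives):
    """Raise ValueError when planner/acquisition and objective count disagree."""
    if planner in ONE_D_PLANNERS and n_objectives > 1:
        raise ValueError(f"{planner} is single-objective; got {n_objectives} objectives")
    if acquisition in SINGLE_OBJECTIVE_ACQ and n_objectives > 1:
        raise ValueError(f"{acquisition} is single-objective; consider qlognehvi")
    if acquisition in MULTI_OBJECTIVE_ACQ and n_objectives == 1:
        raise ValueError(f"{acquisition} is multi-objective; consider qlognei")
```

Running the check at config time means a mismatched acquisition fails before any benchmark is launched.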
The choice matters because hypervolume is computed relative to the reference point: trials worse than the reference contribute zero. Two consequences:
For most workloads, the auto-derived reference (worst Sobol value) is fine. Override when you have an operational floor: “TTFT past 250 ms is unacceptable for this workload” → threshold: 250.0 on the TTFT objective.
improvement_patience and plateau_cv operate on the hypervolume time series in multi-objective mode (rather than the scalar objective). Hypervolume is monotone non-decreasing across iterations (a non-dominated point either expands the front or doesn’t), so improvement-patience fires when the front has stopped growing for improvement_patience consecutive iterations.
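To illustrate how the reference point shapes the hypervolume series, here is a small two-objective sketch with throughput maximized and TTFT minimized. The function hypervolume_2d is a hypothetical helper written for this example; BoTorch computes hypervolume internally and in arbitrary dimension.

```python
def hypervolume_2d(points, reference):
    """Dominated area for (throughput, ttft) points.

    throughput is maximized, ttft is minimized; `reference` is
    (lowest acceptable throughput, highest acceptable ttft).
    """
    ref_tp, ref_ttft = reference
    # Points that are not strictly better than the reference on both
    # objectives contribute zero, so drop them. Negating throughput turns
    # the problem into pure minimization.
    mins = sorted((-tp, ttft) for tp, ttft in points if tp > ref_tp and ttft < ref_ttft)
    ref_x = -ref_tp
    area, prev_y = 0.0, ref_ttft
    for x, y in mins:  # ascending x, i.e. descending throughput
        if y < prev_y:
            # New strip between the previous best ttft and this one.
            area += (ref_x - x) * (prev_y - y)
            prev_y = y
    return area


history = [(1000, 200), (500, 400), (1500, 250)]
series = [hypervolume_2d(history[: i + 1], (0, 300)) for i in range(len(history))]
print(series)  # [100000.0, 100000.0, 125000.0]
```

The second point lies past the TTFT reference and adds nothing, while the third extends the front, so the series never decreases. That is the property improvement-patience relies on in multi-objective mode.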
Pareto fronts plateau later than scalar objectives — there’s more “frontier” to explore. The default improvement_patience=10 and plateau_window=8 work, but consider:
- Setting max_iterations to 50+ for len(objectives) >= 2.
- Setting n_initial_points to 10+ so the auto-derived reference points are well-conditioned.
- Setting plateau_threshold to 0.005 (0.5% relative hypervolume CV) if you want the loop to run longer.
The full list of convergence_reason values is unchanged from single-objective; see Search History API, Convergence Reasons.
search_history.json carries the same shape as single-objective, with objective_values becoming a length-N tuple per iteration and best_trials becoming the Pareto front. See Search History API — Interpreting best_trials for the full schema.
Each entry of best_trials carries pareto_rank: 0 (all front members are non-dominated by definition). The front is unranked; pick a point afterward by whatever scalar criterion fits your deployment (e.g. “throughput at the highest concurrency where p99 TTFT < 200 ms”).
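Post-hoc selection from an unranked front can be as simple as the sketch below. It assumes each front entry has been reduced to a (throughput, p99 TTFT) pair; the helper names are hypothetical and the real best_trials entries carry more fields than this.

```python
def dominates(a, b):
    """a, b are (throughput, ttft): higher throughput and lower ttft are better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_under_ttft(front, ttft_limit):
    """Highest-throughput front member whose ttft is under the limit, else None."""
    eligible = [p for p in front if p[1] < ttft_limit]
    return max(eligible, key=lambda p: p[0]) if eligible else None


points = [(1000, 150), (1500, 250), (900, 180), (1200, 190)]
front = pareto_front(points)
print(front)                        # [(1000, 150), (1500, 250), (1200, 190)]
print(pick_under_ttft(front, 200))  # (1200, 190)
```

The point (900, 180) is dropped because (1000, 150) beats it on both metrics. The remaining three are mutually non-dominated, and the 200 ms criterion selects among them.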
- Multi-objective is supported under --search-planner=bayesian (curated preset auto-selects qlognehvi) or under --search-planner=optuna --optuna-sampler=botorch --optuna-acquisition=qlognehvi. The 1D SLA-saturation planners (monotonic_sla, smooth_isotonic) are single-objective by design and reject len(objectives) > 1 at config-time.
- outcome_constraints are soft, not hard. They mask BoTorch’s acquisition score but don’t reject infeasible trials. For hard cutoffs use sla_filters instead. The two compose: outcome_constraints keeps BO out of the failure region; sla_filters makes any trial that lands there infeasible in best_trials selection.
- ModelListGP. The MORBO (arXiv:2109.10964) approach for ≥20D Pareto BO is not on the roadmap. See What this implementation isn’t.
- Objective.threshold is for hypervolume, not for filtering. A trial worse than the threshold still flows into the GP; it just contributes zero hypervolume. If you want to actively avoid a region, use outcome_constraints.
max-concurrency-under-sla and max-goodput-under-slo
The classic LLM-serving capacity question: what is the highest concurrency at which the SUT still meets its SLA? AIPerf answers it with an adaptive search that names both the maximum passing concurrency and the first failing concurrency in O(log N) trials. The max-concurrency-under-sla search recipe is the canonical entry point; the goodput-formulation alternative is max-goodput-under-slo. Both are plugin-registered presets that compose with the BO engine described above and with Search Recipes.
The research basis (industry survey + academic citations) for the adaptive SLA search lived in a companion design-notes document that is not part of the repository.
The plugin registry ships two recipes built on top of the engine:
The generic --search-sla "metric:stat:op:threshold" flag (repeatable) attaches arbitrary SLA filters to the explicit --search-space path, with no recipe involved. See the SLA flags table below for the format.
The recipe expands in the CLI assembly pipeline into the same AdaptiveSearchSweep (set on AIPerfConfig.sweep) machinery a hand-written --search-space invocation would produce.
The four issue-named SLA flags are sugar over the generic --search-sla syntax. All five may be combined; recipe-named flags compose first, then --search-sla entries in CLI order.
Malformed --search-sla values raise TypeError naming the offending flag. Unknown stat or op keys are validated against the SLAFilter Literal types, so typos fail loudly at parse time.
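A minimal sketch of parsing the metric:stat:op:threshold shape is shown below. The names SLAFilterSpec and parse_search_sla are hypothetical, and the allowed operator set is an assumption for illustration; the authoritative vocabulary is whatever the SLAFilter Literal types define.

```python
from typing import NamedTuple


class SLAFilterSpec(NamedTuple):
    metric: str
    stat: str
    op: str
    threshold: float


# Assumed operator vocabulary; the real set comes from the SLAFilter Literal types.
ALLOWED_OPS = {"lt", "le", "gt", "ge"}


def parse_search_sla(value: str) -> SLAFilterSpec:
    """Parse one --search-sla value of the form metric:stat:op:threshold."""
    parts = value.split(":")
    if len(parts) != 4:
        raise TypeError(f"--search-sla expects metric:stat:op:threshold, got {value!r}")
    metric, stat, op, raw_threshold = parts
    if op not in ALLOWED_OPS:
        raise TypeError(f"--search-sla: unknown op {op!r} in {value!r}")
    try:
        threshold = float(raw_threshold)
    except ValueError:
        raise TypeError(f"--search-sla: threshold {raw_threshold!r} is not a number") from None
    return SLAFilterSpec(metric, stat, op, threshold)
```

Since the flag is repeatable, a caller would apply this to each occurrence in CLI order and collect the results into a list.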
max-concurrency-under-sla
--search-style selects which planner the recipe expands to. The defaults match the issue’s exact ask.
The monotonic planner mirrors Triton perf_analyzer’s --binary-search: each point’s verdict is provisional until 2 trials agree (configurable via AdaptiveSearchSweep.monotonic_stability_trials, default 2).
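The following sketch illustrates the idea of a stability-gated monotonic search: doubling until the first failure, then bisecting the bracket. It is not the planner's implementation. The function names, the cap on repeated trials, the conservative fallback when trials never agree, and the choice to bisect down to adjacent integers are all assumptions made for the example.

```python
def stable_verdict(passes, x, stability_trials=2, max_trials=10):
    """Repeat the probe until the last `stability_trials` results agree."""
    results = []
    while len(results) < max_trials:
        results.append(bool(passes(x)))
        tail = results[-stability_trials:]
        if len(tail) == stability_trials and len(set(tail)) == 1:
            return tail[0]
    return False  # never stabilized: treat as failing (assumed fallback)


def max_passing(passes, x_min, x_max, stability_trials=2):
    """Return (max passing, first failing) integer load; x_min must be >= 1."""
    def ok(x):
        return stable_verdict(passes, x, stability_trials)

    if not ok(x_min):
        return None, x_min
    lo, hi = x_min, None
    while lo < x_max:  # doubling phase
        nxt = min(lo * 2, x_max)
        if ok(nxt):
            lo = nxt
        else:
            hi = nxt
            break
    if hi is None:
        return lo, None  # every probed point passed
    while hi - lo > 1:  # bisection phase
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi


print(max_passing(lambda x: x <= 37, 1, 1024))  # (37, 38)
```

With a deterministic probe each point costs exactly stability_trials benchmark runs; a noisy probe costs more, because a disagreement restarts the agreement window.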
smooth_isotonic (default)
The smooth-isotonic planner is a drop-in replacement for monotonic that fixes its core accuracy gap: bisection uses sign-only feedback at every probe, so a single noisy probe at the boundary can flip the verdict and corrupt the next root estimate. smooth_isotonic instead fits a smooth, monotone curve to all probe margins and root-finds the boundary on the curve.
The algorithm runs in five phases:
1. x = x_min, 2·x_min, 4·x_min, …) until the first SLO breach, identical to monotonic. Output: [x_lo, x_hi].
2. [x_lo, x_hi], then for each per-SLO margin series: PAVA (scipy.optimize.isotonic_regression) denoises by pooling adjacent violators into a monotone step function, then PCHIP (scipy.interpolate.PchipInterpolator) interpolates the denoised points to give a smooth, monotone, root-findable curve. Solve m̂(x*) = 0 per SLO and aggregate via σ-normalized max-of-margins to pick the candidate boundary. PAVA-then-PCHIP composition fixes both PCHIP’s noise-fragility (vLLM’s deleted serve_sla.py pattern) and isotonic regression’s piecewise-constant ambiguous-root problem.
3. sla_replicates: N > 0 in YAML (or the auto-budget formula triggers N ≥ 3), re-run the candidate x* N times under Common Random Numbers (same BenchmarkConfig + same random_seed) to estimate per-replicate margin variance. Bootstrap CI on the binding margin → if CI brackets zero, expand to x* ± δ and refit; otherwise terminate. Capped at 20 replicates to bound runaway under noisy degenerate constraints.
4. |m_observed - m̂| exceeds 3·σ_local AND the bracket gap exceeds precision · x_hi, the planner declares boundary_type: "cliff" and reports (boundary_low, boundary_high) instead of pretending the curve is smooth across a discontinuity. Otherwise boundary_type: "smooth". Catches the prefill-prioritizing-server pattern documented in Sarathi-Serve.
5. (infeasible_min - feasible_max) / infeasible_min < SLA_PRECISION_DEFAULT (5% by default), OR the Phase-3 bootstrap CI on the binding-constraint margin no longer brackets zero, OR --search-max-iterations exhausted. Reasons emitted in convergence_reason: smooth_isotonic_precision_reached, smooth_isotonic_cliff_precision_reached, smooth_isotonic_no_pass_in_range, smooth_isotonic_no_failure_in_range, smooth_isotonic_pchip_fallback_bisection, or max_iterations.
Power-user knobs (all optional; the defaults are sized for typical LLM-serving workloads).
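The denoise-then-root-find step of the phases above can be illustrated with the standard library alone. This sketch deliberately swaps in simpler pieces: a hand-written pool-adjacent-violators pass in place of scipy.optimize.isotonic_regression, and linear interpolation in place of PCHIP. The margin here is expressed as a signed violation (observed minus threshold, so values at or below zero pass), which is an assumed sign convention, and the function names are hypothetical.

```python
def pava_increasing(values):
    """Pool-adjacent-violators: least-squares nondecreasing fit, unit weights."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def boundary_estimate(xs, violations):
    """Return x* where the denoised violation curve crosses zero, else None.

    xs must be ascending. Linear interpolation stands in for PCHIP here.
    """
    fitted = pava_increasing(violations)
    for i in range(1, len(xs)):
        lo_v, hi_v = fitted[i - 1], fitted[i]
        if lo_v <= 0 < hi_v:
            t = (0 - lo_v) / (hi_v - lo_v)
            return xs[i - 1] + t * (xs[i] - xs[i - 1])
    return None


# The probe at x=24 looks better than the one at x=16 purely because of noise;
# PAVA pools the two before the root is located.
print(boundary_estimate([8, 16, 24, 32], [-40, -10, -20, 30]))
```

Sign-only bisection would see pass, pass, pass, fail for this input and learn nothing about where inside the last interval the boundary lies. The fitted curve uses the magnitudes to place the estimate about a third of the way from 24 to 32.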
These are YAML-only fields on the AdaptiveSearchSweep schema (src/aiperf/config/sweep/config.py); they are not exposed as CLI flags and have no AIPERF_SEARCH_PLANNER_* env-var binding. Set them under a sweep: block in your AIPerf YAML config:
- sla_replicates: N. Phase-3 replicate count override. Default 0 (auto). Set to a fixed integer to override the auto budget.
- sla_precision: tight|normal|coarse. Per-probe sample budget. Maps to n_requests_per_probe ∈ {10000, 1000, 300}. Default normal → p99 CI ≈ ±10%.
- sla_warmup_seconds: N. Per-probe warmup discard before computing margins. Default None → 30s flat floor (AIPERF_SEARCH_PLANNER_DEFAULT_WARMUP_SECONDS). First-probe-at-each-x is floored at 60s (FIRST_PROBE_WARMUP_FLOOR); replicate probes are floored at 15s (REPLICATE_WARMUP_FLOOR).
The boundary_summary block in search_history.json carries three new optional fields when smooth_isotonic ran: boundary_type ("smooth" or "cliff"), binding_constraint (the SLO key with the worst σ-normalized margin at termination), and boundary_ci ({lo, hi} bootstrap CI on the binding margin) when Phase-3 replicates ran. See Search History API Reference.
No new dependencies — the planner uses only scipy.optimize.isotonic_regression and scipy.interpolate.PchipInterpolator, both already part of the scipy>=1.13.0 hard dep.
search_history.json: boundary_summary block
The BO and monotonic paths write search_history.json incrementally per iteration (same file documented in Output schema). The 1D-feasibility extension is the boundary_summary block:
boundary_summary is null when the search space has more than one dimension — the field is intentionally narrow and its semantics are only well-defined in 1D. For monotonic_sla and smooth_isotonic, the planner writes the summary directly from its internal state; for the BO style, the field is derived post-hoc from the iteration history (highest feasible swept value, lowest infeasible). The smooth_isotonic planner additionally writes boundary_type, binding_constraint, and (when Phase-3 replicates ran) boundary_ci — see Search History API Reference. All shapes share the same base so consumers don’t branch on style.
sla_breach.json (grid style only)
The grid style emits a dedicated artifact under sweep_aggregate/sla_breach.json. Its keys substitute the leaf parameter name (here concurrency) for clarity:
Edge cases: max_passing_concurrency: null when every point fails; first_failing_concurrency: null when every point passes. monotonicity_check: false when feasibility alternates along the swept axis — informational, never an error (it usually means the SUT is unstable, not that the search broke).
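The edge cases above follow directly from how the three fields are derived. The sketch below computes them from a list of (concurrency, passed) pairs; summarize_breach is a hypothetical helper, and the real artifact may carry additional keys that are not modeled here.

```python
def summarize_breach(points):
    """points: iterable of (concurrency, passed) pairs from a grid sweep."""
    ordered = sorted(points)
    passing = [c for c, ok in ordered if ok]
    failing = [c for c, ok in ordered if not ok]
    verdicts = [ok for _, ok in ordered]
    # Monotonic means: once a point fails, no higher point passes again.
    monotonic = all(not (b and not a) for a, b in zip(verdicts, verdicts[1:]))
    return {
        "max_passing_concurrency": max(passing) if passing else None,
        "first_failing_concurrency": min(failing) if failing else None,
        "monotonicity_check": monotonic,
    }


print(summarize_breach([(10, True), (20, False), (30, True)]))
```

In that example feasibility alternates along the axis, so monotonicity_check is false and max_passing_concurrency (30) is larger than first_failing_concurrency (20). Such a result should be read as a sign of an unstable system under test, not as a capacity claim.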
max-goodput-under-slo writes the same search_history.json shape, but the BO objective is the goodput metric tag and the per-request SLO threshold-set (TTFT/TPOT/E2E) is wired into the goodput-metric configuration channel; only the attainment-fraction gate (good_request_fraction:avg:ge:<attainment>) appears as an SLAFilter row. Per the DistServe formulation (Zhong et al. OSDI ‘24), a request counts as “good” only when all three thresholds are simultaneously met, and the attainment fraction (default 0.95) is the minimum acceptable share of good requests.
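The goodput gate can be sketched as follows, assuming each request record exposes its three latencies under the hypothetical keys ttft_ms, tpot_ms, and e2e_ms. The helper names are illustrative and the metric's actual configuration channel is not shown.

```python
def is_good(request, slo):
    """A request is good only when every SLO threshold is met simultaneously."""
    return all(request[key] <= limit for key, limit in slo.items())


def good_request_fraction(requests, slo):
    if not requests:
        return 0.0
    return sum(is_good(r, slo) for r in requests) / len(requests)


def meets_attainment(requests, slo, attainment=0.95):
    """The attainment gate: share of good requests must reach the target."""
    return good_request_fraction(requests, slo) >= attainment


slo = {"ttft_ms": 200, "tpot_ms": 50, "e2e_ms": 5000}
requests = [
    {"ttft_ms": 150, "tpot_ms": 40, "e2e_ms": 3000},  # good
    {"ttft_ms": 150, "tpot_ms": 60, "e2e_ms": 3000},  # TPOT breach, so not good
]
print(good_request_fraction(requests, slo))  # 0.5
```

A request that meets two of the three thresholds still counts as not good, which is why goodput can fall well below what any single per-metric pass rate suggests.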
Monotonicity assumption. Bisection assumes feasibility is monotonic along the swept axis (high concurrency fails, low passes). Real systems can violate this under cold-cache conditions or memory pressure. Watch for monotonicity_check: false in sla_breach.json and the non_monotonic_warning flag on the iteration history — when set, treat boundary_summary.feasible_max as the largest observed passing value, not a proof of optimality.
“First failing” semantics. Well-defined for monotonic and grid paths. For bo, the BO trajectory is non-monotonic by design; boundary_summary.infeasible_min.value reports the lowest seen failing concurrency, which is a lower bound on the true first-failing point — not a tight one.
Stability under noise. A single trial’s verdict can flip with run-to-run variance. Pass --num-profile-runs >= 2 so each point’s verdict averages over trials; the monotonic planner’s stability window kicks in automatically. The cost is linear in the number of trials, but the boundary location is more robust.
Streaming requirement. The TTFT and TPOT/ITL filters are streaming-only metrics. The recipe rejects --no-streaming at expand time when any SLA references a streaming-only metric. E2E latency and error-rate filters work without streaming.
Mutual exclusion. As with all recipes, --search-recipe is mutually exclusive with explicit --search-* flags and with magic-list sweeps; see Search Recipes — mutual-exclusion rules for the full matrix.
- qLogNoisyExpectedImprovement is what the bayesian preset selects for single-objective.
- SingleTaskGP fit on the bayesian and --optuna-sampler=botorch paths.
- constraints_func consumes.
- RegretBoundEvaluator, available via --optuna-terminator regret.
- EMMREvaluator, available via --optuna-terminator emmr.
The current planner family is a noisy-objective BO with the conventional knobs plus native Pareto BO. It is not (and we know it is not) the literature state of the art for every HPO regime. Several upgrades have already landed under explicit flags; the remaining deferrals, for context:
- qLogNoisyExpectedHypervolumeImprovement on a ModelListGP (Daulton et al. 2021, arXiv:2105.08195); see Multi-objective Pareto BO. MORBO targets the ≥20-dimensional regime; AIPerf’s search spaces are 1D–3D, so qNEHVI on a single ModelListGP is the right choice and MORBO is not on the roadmap.
- candidates_func that builds botorch.models.gp_regression.HeteroskedasticSingleTaskGP from per-trial sample variances. Evidence-gated: ship only if observed within-trial variance varies meaningfully across the search space on real workloads.
- search_history.json schema, including the multi-objective best_trials shape and boundary_summary fields.
- max-goodput-under-slo.