Search Recipes
Search Recipes are named, plugin-registered presets that bundle a search space, an optimization objective (or grid), termination conditions, optional SLA constraints, and an optional post-process step into a single CLI selector. They lift the user-facing surface from "write `--search-space` / `--search-metric` / `--search-direction` / `--search-max-iterations` and pick the right combination" to `--search-recipe <name>`.
Kubernetes execution — coming soon. Every recipe in this catalog is designed to run unmodified under
`aiperf kube sweep` once the K8s integration branch lands on `main`. The recipe selector, post-process hooks, and output artifacts are execution-mode-independent; the cluster path swaps in the in-cluster `sweep-controller` pod + child `AIPerfJob` CRs for the local subprocess executor.
Recipes expand in the CLI assembly pipeline into the same machinery the explicit --search-* / sweep flags drive — the runtime path is unchanged. See Bayesian-Optimization Outer Loop for the underlying engine and search_history.json schema.
When to use a recipe
Power users can keep the explicit --search-* flags; recipes are mutually exclusive with them at the converter (clear error on collision).
How it feels — a walkthrough
This section shows the user experience end-to-end. Every recipe collapses several BO/grid flags into one named selector, and emits artifacts the user can read directly.
Before / after
Flow at a glance
Recipes by interaction shape
Concrete BO interaction (max-throughput-ttft-sla)
The user reads `best_trials[0].variation_values` and gets a concrete answer: deploy at `concurrency=401` to maximize throughput while keeping p95 TTFT under 200 ms. Without the recipe they’d have written ~5 BO flags by hand and post-hoc filtered for the SLA themselves. (`best_trials` is a list because multi-objective recipes surface the full Pareto front; single-objective recipes emit a length-1 list.)
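Reading the answer programmatically is a few lines; a minimal sketch, assuming the `search_history.json` layout this page describes (a best-first `best_trials` list whose entries carry `variation_values`):

```python
import json

def best_config(search_history_path: str) -> dict:
    """Return the variation values of the top feasible trial.

    Assumes the search_history.json layout described in the
    Bayesian-Optimization Outer Loop doc: a top-level "best_trials"
    list, best-first, each entry carrying "variation_values".
    """
    with open(search_history_path) as f:
        history = json.load(f)
    if not history.get("best_trials"):
        raise ValueError("no feasible trials — every point breached the SLA")
    return history["best_trials"][0]["variation_values"]
```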
The above terminal log is illustrative — the actual progress format depends on the dashboard / progress UI mode.
Concrete grid + curve interaction (prefill-ttft-curve)
The user gets a usable equation: TTFT(ms) = 0.0641 × ISL + 1.83 — feed it into a capacity planner directly. The quadratic fallback fires automatically if linear r² < 0.85; `below_floor` flags low-confidence fits.
Failure paths fail loud
What stays invisible
That whole pipeline — Protocol dispatch, mutual-exclusion checking, model_dump round-trips, soft-penalty math, lexicographic best, post-process plugin lookup — is invisible to the user. They typed two flags. They got an answer.
Catalog
All recipes whose metric is streaming-only (TTFT, ITL) require `--streaming`; the recipe rejects non-streaming endpoints at expand time with a message naming the recipe and the missing flag. `max-concurrency-under-sla` checks streaming only when a streaming-only SLA filter (`--ttft-sla-ms` / `--tpot-sla-ms` / `--itl-sla-ms`) is configured; runs that configure only `--e2e-sla-ms` and/or `--error-rate-sla` do not require streaming.
Per-recipe usage
max-throughput-ttft-sla
Bayesian-optimized over phases.profiling.concurrency in [1, 1000]. Lifts the SLA p95(time_to_first_token) < ttft-sla-ms into a soft penalty in the GP score and a strict feasibility filter on best_trials. See Bayesian-Optimization Outer Loop for the scoring details.
max-throughput-itl-sla
Identical shape to the TTFT twin, but on p95(inter_token_latency) < itl-sla-ms. Accepts --itl-sla-ms or its alias --tpot-sla-ms (passing both raises a conflict error).
max-concurrency-under-sla
Find the largest concurrency at which every configured SLA filter passes. Composes any combination of --ttft-sla-ms / --tpot-sla-ms / --e2e-sla-ms / --error-rate-sla / --search-sla. Five search styles (--search-style {smooth_isotonic|monotonic|bo|optuna|grid}, default smooth_isotonic):
- `smooth_isotonic` — PAVA-denoised isotonic regression + PCHIP root-find on per-SLO margin curves; opt-in Phase-3 replicates with bootstrap CI; cliff-curve guard. Strictly more accurate than `monotonic` under noise. ~13–25 iterations on `[1, 1000]` at 5% precision (more with replicates).
- `monotonic` — exponential probe + bisection; ~10 iterations on `[1, 1000]` at 5% precision; the direct equivalent of perf_analyzer’s `--binary-search`. Margin-magnitude-blind.
- `bo` — penalty-BO maximizing `output_token_throughput` within the feasibility region.
- `optuna` — same penalty-BO formulation as `bo`, routed through the `OptunaSearchPlanner` (TPE / GP / BoTorch samplers, selected via `--optuna-sampler`). Optuna ships by default; BoTorch requires the optional `botorch` extra.
- `grid` — 8 log-spaced points + `sla_breach_knee` post-process emitting `sweep_aggregate/sla_breach.json`.
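The `monotonic` style is the easiest to picture. A hedged sketch of the exponential-probe-plus-bisection idea, where `passes_sla` stands in for a full benchmark run at a given concurrency (the real planner's details differ):

```python
def max_passing_concurrency(passes_sla, lo=1, hi=1000, precision=0.05):
    """Largest concurrency in [lo, hi] where passes_sla(c) is True,
    assuming pass/fail is monotone in concurrency.

    Phase 1: exponential probe up from lo until the first failure.
    Phase 2: bisect the bracketing interval to the given relative precision.
    """
    if not passes_sla(lo):
        return None  # even the floor breaches the SLA
    good, c = lo, lo * 2
    while c <= hi and passes_sla(c):  # exponential probe
        good, c = c, c * 2
    bad = min(c, hi + 1)
    while bad - good > max(1, int(good * precision)):  # bisection
        mid = (good + bad) // 2
        if passes_sla(mid):
            good = mid
        else:
            bad = mid
    return good
```

With a noiseless oracle that passes up to concurrency 400, this lands on 400 in roughly ten probes, which matches the iteration count quoted above.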
The full reference — including artifact schemas, comparison-to-other-tools, and caveats — is at Bayesian Optimization — 1D SLA saturation.
max-goodput-under-slo
The DistServe canonical formulation (Zhong et al., OSDI '24). BO over concurrency with the goodput metric tag as the maximization objective. A request counts as “good” only when all three per-request thresholds (TTFT, TPOT, E2E) are simultaneously satisfied; the `--slo-attainment-fraction` (default 0.95) sets the minimum acceptable share. Streaming required.
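The "good request" predicate is a plain conjunction. A sketch of the attainment computation, with illustrative field names for the per-request measurements:

```python
def goodput_fraction(requests, ttft_ms, tpot_ms, e2e_ms):
    """Share of requests meeting ALL three per-request SLOs.

    `requests` is a list of dicts with measured ttft_ms / tpot_ms /
    e2e_ms (illustrative keys); a request is "good" only when every
    threshold holds simultaneously.
    """
    good = sum(
        1 for r in requests
        if r["ttft_ms"] <= ttft_ms
        and r["tpot_ms"] <= tpot_ms
        and r["e2e_ms"] <= e2e_ms
    )
    return good / len(requests) if requests else 0.0
```

A run is acceptable when this fraction meets `--slo-attainment-fraction` (0.95 by default).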
concurrency-ramp
8-step log-spaced grid over concurrency in [1, 1000]; post-process detects the first concurrency where p99(request_latency) exceeds baseline * (1 + --degradation-threshold). Streaming is not required (request_latency is end-to-end).
Output: sweep_aggregate/degradation_knee.json with baseline_concurrency, knee_concurrency (or null if no knee found), threshold, and the full point series.
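The knee rule is simple enough to show inline. A sketch of the detection step under stated assumptions: `points` is the ascending `(concurrency, p99_ms)` series, and the lowest-concurrency point is the baseline:

```python
def find_degradation_knee(points, degradation_threshold=0.5):
    """First concurrency whose p99 latency exceeds
    baseline * (1 + threshold); None if the curve never degrades.

    `points`: ascending list of (concurrency, p99_request_latency_ms);
    the lowest-concurrency point is the baseline, as in concurrency-ramp.
    """
    if not points:
        return None
    _, baseline_p99 = points[0]
    cutoff = baseline_p99 * (1 + degradation_threshold)
    for concurrency, p99 in points[1:]:
        if p99 > cutoff:
            return concurrency
    return None
```

The `null` case in `degradation_knee.json` corresponds to the `None` return here: no point on the grid crossed the cutoff.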
prefill-ttft-curve
8-step log-spaced grid over ISL in [--isl-min, --isl-max] (defaults 256, 32768) at concurrency=1; post-process fits TTFT = a*ISL + b and falls back to a quadratic fit when r² < 0.85.
Output: sweep_aggregate/prefill_curve.json with fit_form (linear | quadratic), coefficients, r_squared, r_squared_floor, and the raw (isl, ttft_ms) points.
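The fit-and-fallback decision in miniature. A pure-Python sketch of the linear least-squares fit plus the r² < 0.85 fallback check; the real handler also computes the quadratic coefficients on fallback:

```python
def fit_ttft_curve(points, r_squared_floor=0.85):
    """Least-squares TTFT = a*ISL + b over (isl, ttft_ms) points;
    reports whether the recipe would fall back to a quadratic fit."""
    n = len(points)
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in points)
    ss_tot = sum((y - my) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return {
        "a": a,
        "b": b,
        "r_squared": r_squared,
        "fallback_to_quadratic": r_squared < r_squared_floor,
    }
```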
decode-itl-curve
Two-axis grid: 6 log-spaced concurrency points in [1, 200] x 4 log-spaced OSL points in [64, 1024]. Post-process emits an axis-aligned grid surface; cells where no triple was measured stay null (the handler refuses to invent values for missing cells).
Output: sweep_aggregate/decode_itl_surface.json with surface.concurrency_axis, surface.osl_axis, surface.itl_grid (2D, indexed [concurrency_idx][osl_idx]), and the raw (concurrency, osl, itl_ms) triples.
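Consumers must tolerate the `null` cells. A sketch that averages ITL over OSL per concurrency, skipping unmeasured cells rather than imputing them (field names follow the schema above):

```python
def mean_itl_per_concurrency(surface):
    """Per-concurrency mean ITL across measured OSL cells only.

    `surface` mirrors decode_itl_surface.json's "surface" object:
    itl_grid is indexed [concurrency_idx][osl_idx]; unmeasured cells
    are None (JSON null) and are skipped, never imputed.
    """
    out = {}
    for c, row in zip(surface["concurrency_axis"], surface["itl_grid"]):
        measured = [v for v in row if v is not None]
        out[c] = sum(measured) / len(measured) if measured else None
    return out
```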
pareto-sweep
Sweeps paired ISL/OSL workload shapes from --isl-osl-pairs against a list of concurrency values, pre-flattened to a ScenarioSweep so the pairs stay paired (vs the Cartesian product a grid would emit). Each cell runs as its own benchmark; the pareto_sweep_export post-process walks the per-combination metrics and marks each cell pareto_optimal: true iff no other cell has both lower time_to_first_token.p95 and higher output_token_throughput.avg. Streaming required (the recipe’s y-axis is output_token_throughput, a streaming-only metric). --concurrency defaults to [1, 4, 16, 64, 256] when omitted; this recipe consumes the magic-list flag directly.
The above expands to 3 pairs × 5 concurrency values = 15 benchmark runs and writes sweep_aggregate/pareto_sweep.json with one cell per run plus a per-cell pareto_optimal flag.
Scenario
You want a single chart for a capacity-planning doc that shows, for the same model and deployment, how throughput trades off against latency under several distinct workload shapes — short chat turns (128/128), RAG-style prompts (512/256), long-doc summarization (2048/512) — across a range of concurrency. A grid sweep is the wrong tool: it would Cartesian-product isl × osl × concurrency, and most of those cells (isl=128 with osl=512, or isl=2048 with osl=128) aren’t workload shapes you care about. You want the ISL and OSL to stay paired, with concurrency swept inside each pair. pareto-sweep is built for exactly this.
How it works
The recipe pre-flattens the (pairs × concurrency) grid into a ScenarioSweep — one scenario per cell, with internal label shape_<isl>_<osl>_c<conc> and swept values {isl, osl, concurrency}. The orchestrator then runs each scenario as a separate benchmark, producing the same per-run artifact tree a --sweep invocation would. The on-disk directory name is derived from the swept values (not the internal label). After all runs complete and SweepAnalyzer.compute() finishes, the pareto_sweep_export post-process handler walks the per-combination metrics and writes the frontier JSON. Failures in the post-process step are logged into sweep_aggregate/post_process_errors.json but do not fail the sweep — the per-run profile exports are already on disk.
--isl-osl-pairs syntax
Syntax: <isl>/<osl>,<isl>/<osl>,.... Each side is a positive integer. Whitespace around commas and slashes is tolerated. Pairs must be unique. Valid:
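For illustration, a re-implementation of the stated rules; this is a sketch, not the actual `parse_isl_osl_pairs` from the aiperf codebase:

```python
def parse_isl_osl_pairs(spec):
    """Parse "<isl>/<osl>,<isl>/<osl>,..." into unique (isl, osl) pairs.

    Whitespace around commas and slashes is tolerated; each side must be
    a positive integer; duplicate pairs are rejected. Raises ValueError
    naming the bad token, as the real parser does.
    """
    pairs = []
    for token in spec.split(","):
        parts = [p.strip() for p in token.split("/")]
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            raise ValueError(f"bad ISL/OSL token: {token.strip()!r}")
        isl, osl = int(parts[0]), int(parts[1])
        if isl <= 0 or osl <= 0:
            raise ValueError(f"ISL/OSL must be positive: {token.strip()!r}")
        if (isl, osl) in pairs:
            raise ValueError(f"duplicate pair: {token.strip()!r}")
        pairs.append((isl, osl))
    return pairs
```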
Invalid inputs raise a ValueError from parse_isl_osl_pairs at expand time, naming the bad token:
A single-cell sweep is also rejected — a one-point “Pareto frontier” is meaningless:
The recipe also rejects non-streaming endpoints at expand time:
--isl-osl-pairs is recipe-only and is silently ignored unless --search-recipe pareto-sweep is set. The full flag entry lives in CLI Options.
Artifacts
Standard sweep artifacts are written under <artifact_dir>/:
- `sweep_aggregate/profile_export_aiperf_sweep.{json,csv}` — the cross-cell summary table the grid path always emits.
- `isl_<isl>__osl_<osl>__concurrency_<conc>/profile_export_aiperf.json` — full per-run metrics for each `(isl, osl, concurrency)` cell (default single-trial layout; `SweepVariation.dir_name` joins the swept values with `__`). With `--num-profile-runs N` (N > 1) and the default `REPEATED` iteration order, per-trial outputs live under `<artifact_dir>/profile_runs/trial_NNNN/isl_<isl>__osl_<osl>__concurrency_<conc>/profile_export_aiperf.json`.
- `sweep_aggregate/pareto_sweep.json` — the recipe-specific frontier file, with fixed axes `x_metric=time_to_first_token/p95` (lower-is-better) vs `y_metric=output_token_throughput/avg` (higher-is-better):
A cell is marked pareto_optimal: true iff no other cell weakly dominates it — i.e. no other cell has x <= cell.x AND y >= cell.y with strict inequality on at least one axis. The frontier is computed across all cells in the file — over every shape and every concurrency together — so the optimal set typically includes the lowest-latency cell of the smallest shape AND the highest-throughput cell of the largest shape, with intermediate cells filling in between. If you need per-shape frontiers (one Pareto curve per (isl, osl)) rather than a single global one, group cells on (isl, osl) client-side and do the dominance check yourself — see the plotting snippet below.
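The dominance rule, spelled out in code. A sketch of the weak-dominance check exactly as defined above, with `x` lower-is-better and `y` higher-is-better:

```python
def mark_pareto_optimal(cells):
    """Attach pareto_optimal to each cell dict with keys "x" and "y".

    Cell A weakly dominates B when A.x <= B.x and A.y >= B.y with strict
    inequality on at least one axis; a cell is Pareto-optimal iff no
    other cell weakly dominates it.
    """
    for b in cells:
        b["pareto_optimal"] = not any(
            a is not b
            and a["x"] <= b["x"] and a["y"] >= b["y"]
            and (a["x"] < b["x"] or a["y"] > b["y"])
            for a in cells
        )
    return cells
```

To get per-shape frontiers instead, run this same function once per `(isl, osl)` group rather than over all cells at once.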
Plotting the frontier
aiperf plot does not currently render pareto_sweep.json directly, and pareto-sweep does not opt in to --auto-plot (only the curve recipes — concurrency-ramp, prefill-ttft-curve, decode-itl-curve — set auto_plot_default = True). Plot it yourself with matplotlib:
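A minimal sketch, assuming illustrative cell field names (`isl`, `osl`, `concurrency`, `x`, `y`, `pareto_optimal` under a top-level `cells` list); check your `pareto_sweep.json` for the real keys:

```python
import json
from collections import defaultdict

def load_shapes(path):
    """Group pareto_sweep.json cells by (isl, osl), sorted by concurrency.
    Field names here are illustrative — verify against your file."""
    with open(path) as f:
        cells = json.load(f)["cells"]
    by_shape = defaultdict(list)
    for c in cells:
        by_shape[(c["isl"], c["osl"])].append(c)
    for shape_cells in by_shape.values():
        shape_cells.sort(key=lambda c: c["concurrency"])
    return by_shape

def plot_frontier(path="sweep_aggregate/pareto_sweep.json"):
    import matplotlib.pyplot as plt  # plotting only; load_shapes stays importable
    by_shape = load_shapes(path)
    for (isl, osl), shape_cells in sorted(by_shape.items()):
        plt.plot([c["x"] for c in shape_cells],
                 [c["y"] for c in shape_cells],
                 marker="o", label=f"{isl}/{osl}")
    front = [c for cells in by_shape.values() for c in cells if c["pareto_optimal"]]
    plt.scatter([c["x"] for c in front], [c["y"] for c in front],
                facecolors="none", edgecolors="red", s=160, zorder=3)
    plt.xlabel("TTFT p95 (ms)")
    plt.ylabel("Output token throughput (tok/s)")
    plt.legend(title="ISL/OSL")
    plt.savefig("pareto_frontier.png")
```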
Each line traces one workload shape across concurrency; circles ringed in red are globally Pareto-optimal across all shapes.
When NOT to use this
- If you want a single optimal concurrency rather than a frontier, use `adaptive-search` (BO over concurrency for one objective).
- If you want adaptive multi-objective BO (the optimizer steers toward the front instead of enumerating a grid) rather than a discrete grid frontier, see Multi-Objective Pareto BO and the Adaptive Search tutorial’s “Going multi-objective” section.
- If you want a TTFT(ISL) curve at a single concurrency, use the `prefill-ttft-curve` recipe.
- If you want the throughput-maximizing concurrency under an SLA, use `max-throughput-ttft-sla` or `max-throughput-itl-sla`.
- If your shapes are paired but you want full control over the per-scenario YAML (different `request_count`, `duration`, `phases`, or different `dataset` types per shape), write a `ScenarioSweep` directly — see `docs/tutorials/sweeps.md` → Paired ISL/OSL via Scenarios. `pareto-sweep` is the one-liner for the common case where every cell shares the same per-run config and you only want to vary `(isl, osl, concurrency)`.
Limits and common follow-ups
- **Coarse concurrency list.** If `--concurrency 1,4,16,64,256` leaves a 4× gap on either side of the knee, the frontier you plot will visibly miss the actual knee. Re-run with a denser list around where the curve bends — e.g. `--concurrency 16,32,48,64,96,128,192,256`.
- **Asymmetric pairs.** ISL/OSL don’t have to match (`128/64`, `512/256`, `2048/512` all parse fine). Mirror the production traffic shape, not symmetric powers of two.
- **Single-shape sweep.** Pass exactly one pair plus a list of concurrency values to characterize one workload shape across concurrency — it works fine; the post-process JSON just degenerates to a single curve.
- **Statistic axes are fixed.** The recipe wires `time_to_first_token.p95` and `output_token_throughput.avg` into the post-process spec; there is no CLI flag to swap them. If you need a different pair, copy the recipe under a new name and adjust the `PostProcessSpec` params (see Writing your own recipe).
- **Streaming-only.** `output_token_throughput` requires `--streaming`. There is no non-streaming variant of this recipe; chat-completions and similar endpoints must be in streaming mode.
- **Pareto-optimality is global, not per-shape.** The `pareto_optimal` flag in the JSON is computed across every cell, not within each `(isl, osl)` group. Group cells client-side (as the plotting snippet above shows) if you want per-shape frontiers.
Mutual-exclusion rules
- `--search-recipe` is rejected alongside any defining `--search-*` flag (`--search-space`, `--search-metric`, `--search-direction`, `--search-stat`, `--search-planner`, `--search-percentile-pooling`, `--optuna-sampler`, `--optuna-acquisition`, `--optuna-terminator`, `--bo-constraint-mode`). Drop one or the other.
- Tunable `--search-*` flags (`--search-max-iterations`, `--search-initial-points`, `--search-random-seed`) are accepted on BO recipes and override the recipe’s defaults; they are rejected on grid recipes (which have no BO loop to tune).
- Grid recipes are rejected alongside magic-list flags (`--concurrency 10,20,30`, etc.). The recipe owns the swept variables. Exception: `pareto-sweep` consumes `--concurrency` directly (declared in `consumed_magic_lists`), so passing a `--concurrency` list alongside `--search-recipe pareto-sweep` is allowed and forms one axis of the sweep.
- BO recipes are rejected alongside `--convergence-metric` (trial-level adaptive early-stop). The two operate at different levels.
Errors name both the recipe and the conflicting flag list.
Writing your own recipe
A recipe is a stateless class implementing the SearchRecipe Protocol in aiperf.search_recipes._base:
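A hypothetical BO recipe as a sketch; the attribute and method names below are assumptions reconstructed from the fields this page mentions (`adaptive_search`, `consumed_magic_lists`, `auto_plot_default`), not the verbatim Protocol:

```python
# Hypothetical sketch — not a verbatim copy of
# aiperf.search_recipes._base.SearchRecipe; names are illustrative.
class MaxThroughputE2eSla:
    name = "max-throughput-e2e-sla"
    auto_plot_default = False       # only the curve recipes set True
    consumed_magic_lists = ()       # this recipe owns all swept variables

    def expand(self, args):
        """Translate recipe defaults + user flags into the same config
        the explicit --search-* flags would have produced."""
        return {
            "adaptive_search": {
                "space": {"phases.profiling.concurrency": [1, 1000]},
                "metric": "output_token_throughput",
                "direction": "maximize",
                # tunable flags override recipe defaults on BO recipes
                "max_iterations": getattr(args, "search_max_iterations", None) or 25,
            },
            "sla": {"p95(request_latency)": args.e2e_sla_ms},
        }
```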
Then register the recipe in your plugins.yaml:
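An illustrative registration entry; the key names are assumptions (the handler category `search_recipe_post_process` named below suggests a sibling `search_recipe` category), so validate against your actual registry with `aiperf plugins --validate`:

```yaml
# Illustrative plugins.yaml fragment — key names are assumptions.
plugins:
  search_recipe:
    max-throughput-e2e-sla: mypkg.recipes:MaxThroughputE2eSla
```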
The plugin loader picks it up at startup; aiperf plugins --validate exercises the registry. See Plugin System for the broader registry shape.
Returning a grid recipe instead of BO
Set sweep_parameters (a path -> list-of-values map) instead of adaptive_search; the converter writes the dict into sweep.parameters so expand_sweep materializes one variation per cartesian-product cell. Optionally attach a PostProcessSpec to emit a derived artifact under sweep_aggregate/:
Writing a post-process handler
Handlers implement PostProcessHandler in aiperf.search_recipes.post_process and register under the search_recipe_post_process plugin category. They run after SweepAnalyzer.compute() and emit a JSON artifact under sweep_aggregate/<output_filename>:
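A sketch under stated assumptions: the method and field names are guesses, while the contract (runs after `SweepAnalyzer.compute()`, writes JSON under `sweep_aggregate/<output_filename>`, failures logged rather than fatal) is from this page:

```python
import json
from pathlib import Path

# Hypothetical handler — method/field names are illustrative, not the
# verbatim aiperf.search_recipes.post_process.PostProcessHandler API.
class SlaBreachKnee:
    output_filename = "sla_breach.json"

    def run(self, results, aggregate_dir):
        """`results`: list of {"concurrency": int, "sla_pass": bool},
        assumed pre-sorted ascending by concurrency. Emits the first
        breaching concurrency (or null) plus the full point series."""
        knee = next(
            (r["concurrency"] for r in results if not r["sla_pass"]), None
        )
        out = {"breach_concurrency": knee, "points": results}
        Path(aggregate_dir, self.output_filename).write_text(
            json.dumps(out, indent=2)
        )
```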
Failures in a handler are logged and recorded in sweep_aggregate/post_process_errors.json but do not fail the sweep — standard artifacts are already written.
See also
- Bayesian Optimization — 1D SLA saturation — `max-concurrency-under-sla` and `max-goodput-under-slo` deep dive: SLA flag table, search styles, output-artifact schemas, comparison to perf_analyzer / k6 / Triton Model Analyzer.
- Bayesian-Optimization Outer Loop — engine details, search-space grammar, SLA scoring, `search_history.json`.
- Adaptive Search Tutorial — narrative walkthrough.
- Plugin System — registry shape, validation, override priorities.