Search Recipes are named, plugin-registered presets that bundle a search space, an optimization objective (or grid), termination conditions, optional SLA constraints, and an optional post-process step into a single CLI selector. They lift the user-facing surface from “write --search-space / --search-metric / --search-direction / --search-max-iterations and pick the right combination” to --search-recipe <name>.
Kubernetes execution: coming soon. Every recipe in this catalog is designed to run unmodified under aiperf kube sweep once the K8s integration branch lands on main. The recipe selector, post-process hooks, and output artifacts are execution-mode-independent; the cluster path swaps in the in-cluster sweep-controller pod + child AIPerfJob CRs for the local subprocess executor.
Recipes expand in the CLI assembly pipeline into the same machinery the explicit --search-* / sweep flags drive — the runtime path is unchanged. See Bayesian-Optimization Outer Loop for the underlying engine and search_history.json schema.
Power users can keep the explicit --search-* flags; recipes are mutually exclusive with them at the converter (clear error on collision).
This section shows the user experience end-to-end. Every recipe collapses several BO/grid flags into one named selector, and emits artifacts the user can read directly.
max-throughput-ttft-sla
The user reads best_trials[0].variation_values and gets a concrete answer: deploy at concurrency=401 to maximize throughput while keeping p95 TTFT under 200 ms. Without the recipe they’d have written ~5 BO flags by hand and post-hoc filtered for the SLA themselves. (best_trials is a list because multi-objective recipes surface the full Pareto front; single-objective recipes emit a length-1 list.)
The above terminal log is illustrative — the actual progress format depends on the dashboard / progress UI mode.
prefill-ttft-curve
The user gets a usable equation, TTFT(ms) = 0.0641 × ISL + 1.83, that can be fed into a capacity planner directly. Quadratic fallback fires automatically if linear r² < 0.85; below_floor flags low-confidence fits.
That whole pipeline — Protocol dispatch, mutual-exclusion checking, model_dump round-trips, soft-penalty math, lexicographic best, post-process plugin lookup — is invisible to the user. They typed two flags. They got an answer.
All recipes whose metric is streaming-only (TTFT, ITL) require --streaming; the recipe rejects non-streaming endpoints at expand time with a message naming the recipe and the missing flag. max-concurrency-under-sla checks streaming only when a streaming-only SLA filter (--ttft-sla-ms / --tpot-sla-ms / --itl-sla-ms) is configured; --e2e-sla-ms-only and --error-rate-sla-only runs do not require streaming.
max-throughput-ttft-sla
Bayesian-optimized over phases.profiling.concurrency in [1, 1000]. Lifts the SLA p95(time_to_first_token) < ttft-sla-ms into a soft penalty in the GP score and a strict feasibility filter on best_trials. See Bayesian-Optimization Outer Loop for the scoring details.
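To illustrate the two roles the SLA plays, the sketch below scores trials with a soft penalty and then applies a strict feasibility filter before reporting a best trial. The penalty form (linear in the relative overshoot), the `penalty_weight` parameter, the `Trial` type, and the sample numbers are all assumptions made for illustration; the scoring aiperf actually uses is documented in Bayesian-Optimization Outer Loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    concurrency: int
    throughput: float   # objective to maximize (tokens/s)
    ttft_p95_ms: float  # p95 time to first token


def penalized_score(trial: Trial, sla_ms: float, penalty_weight: float = 1.0) -> float:
    """Soft penalty: shrink the objective by the relative SLA overshoot.

    ASSUMPTION: this linear-overshoot form is illustrative, not aiperf's formula.
    """
    overshoot = max(0.0, trial.ttft_p95_ms - sla_ms) / sla_ms
    return trial.throughput * (1.0 - penalty_weight * overshoot)


def best_feasible(trials: list[Trial], sla_ms: float) -> list[Trial]:
    """Strict filter: only trials that satisfy the SLA can be reported as best."""
    feasible = [t for t in trials if t.ttft_p95_ms < sla_ms]
    if not feasible:
        return []
    return [max(feasible, key=lambda t: t.throughput)]


# Made-up measurements, for illustration only.
trials = [
    Trial(concurrency=100, throughput=5000.0, ttft_p95_ms=80.0),
    Trial(concurrency=400, throughput=9000.0, ttft_p95_ms=190.0),
    Trial(concurrency=800, throughput=9500.0, ttft_p95_ms=300.0),
]
best = best_feasible(trials, sla_ms=200.0)
```

The penalty steers the optimizer away from infeasible regions without making the score discontinuous, while the filter guarantees that an SLA-violating trial is never reported as the answer, even if its raw throughput is highest.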
max-throughput-itl-sla
Identical shape to the TTFT twin, but on p95(inter_token_latency) < itl-sla-ms. Accepts --itl-sla-ms or its alias --tpot-sla-ms (passing both raises a conflict error).
max-concurrency-under-sla
Find the largest concurrency at which every configured SLA filter passes. Composes any combination of --ttft-sla-ms / --tpot-sla-ms / --e2e-sla-ms / --error-rate-sla / --search-sla. Five search styles (--search-style {smooth_isotonic|monotonic|bo|optuna|grid}, default smooth_isotonic):
smooth_isotonic: PAVA-denoised isotonic regression + PCHIP root-find on per-SLO margin curves; opt-in Phase-3 replicates with bootstrap CI; cliff-curve guard. Strictly more accurate than monotonic under noise. ~13–25 iterations on [1, 1000] at 5% precision (more with replicates).
monotonic: exponential probe + bisection; ~10 iterations on [1, 1000] at 5% precision; the direct equivalent of perf_analyzer’s --binary-search. Margin-magnitude-blind.
bo: penalty-BO maximizing output_token_throughput within the feasibility region.
optuna: same penalty-BO formulation as bo, routed through the OptunaSearchPlanner (TPE / GP / BoTorch samplers, selected via --optuna-sampler). Optuna ships by default; BoTorch requires the optional botorch extra.
grid: 8 log-spaced points + sla_breach_knee post-process emitting sweep_aggregate/sla_breach.json.
The full reference, including artifact schemas, comparison-to-other-tools, and caveats, is at Bayesian Optimization — 1D SLA saturation.
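The monotonic style's "exponential probe + bisection" idea can be sketched in a few lines. This is a generic illustration of the technique, not aiperf's implementation: `passes_sla` is a hypothetical callback standing in for "run a benchmark at this concurrency and check every SLA filter", and the stopping rule (relative bracket width) is an assumption.

```python
from typing import Callable, Optional


def max_passing_concurrency(
    passes_sla: Callable[[int], bool],
    lo: int = 1,
    hi: int = 1000,
    precision: float = 0.05,
) -> Optional[int]:
    """Largest passing concurrency in [lo, hi], assuming pass/fail is monotonic."""
    if not passes_sla(lo):
        return None  # even the smallest load breaches the SLA

    # Exponential probe: double until the first failure or the upper bound.
    good, bad = lo, None
    probe = lo
    while probe < hi:
        probe = min(probe * 2, hi)
        if passes_sla(probe):
            good = probe
        else:
            bad = probe
            break
    if bad is None:
        return good  # every probe up to hi passed

    # Bisection: shrink (good, bad) until the bracket is within the precision.
    while bad - good > 1 and (bad - good) / good > precision:
        mid = (good + bad) // 2
        if passes_sla(mid):
            good = mid
        else:
            bad = mid
    return good


# Toy system whose SLA holds up to concurrency 401.
result = max_passing_concurrency(lambda c: c <= 401)
```

Because the search only looks at pass/fail, it cannot tell a near miss from a large breach, which is what "margin-magnitude-blind" refers to.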
max-goodput-under-slo
The DistServe canonical formulation (Zhong et al., OSDI ’24). BO over concurrency with the goodput metric tag as the maximization objective. A request counts as “good” only when all three per-request thresholds (TTFT, TPOT, E2E) are simultaneously satisfied; the --slo-attainment-fraction (default 0.95) sets the minimum acceptable share. Streaming required.
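A minimal sketch of the goodput accounting described above, under stated assumptions: the `RequestLatency` and `Slo` types are hypothetical, thresholds are treated as inclusive (`<=`), and goodput is expressed as good requests per second. The metric's actual definition in aiperf may differ in those details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLatency:
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float


@dataclass(frozen=True)
class Slo:
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float


def is_good(req: RequestLatency, slo: Slo) -> bool:
    """A request is good only if all three thresholds hold at once."""
    return (
        req.ttft_ms <= slo.ttft_ms
        and req.tpot_ms <= slo.tpot_ms
        and req.e2e_ms <= slo.e2e_ms
    )


def goodput(requests: list[RequestLatency], slo: Slo, duration_s: float) -> float:
    """Good requests per second (ASSUMPTION: the unit is illustrative)."""
    return sum(is_good(r, slo) for r in requests) / duration_s


def meets_attainment(requests: list[RequestLatency], slo: Slo, fraction: float = 0.95) -> bool:
    """True when the share of good requests reaches the attainment fraction."""
    good = sum(is_good(r, slo) for r in requests)
    return good / len(requests) >= fraction


slo = Slo(ttft_ms=200.0, tpot_ms=50.0, e2e_ms=5000.0)
requests = [
    RequestLatency(150.0, 40.0, 3000.0),
    RequestLatency(180.0, 45.0, 4000.0),
    RequestLatency(120.0, 30.0, 2500.0),
    RequestLatency(150.0, 60.0, 3000.0),  # TPOT breach makes this one "bad"
]
rate = goodput(requests, slo, duration_s=2.0)
```

Note that a single breached threshold disqualifies a request even when the other two are comfortably met, which is what distinguishes goodput from per-metric percentile SLAs.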
concurrency-ramp
8-step log-spaced grid over concurrency in [1, 1000]; post-process detects the first concurrency where p99(request_latency) exceeds baseline * (1 + --degradation-threshold). Streaming is not required (request_latency is end-to-end).
Output: sweep_aggregate/degradation_knee.json with baseline_concurrency, knee_concurrency (or null if no knee found), threshold, and the full point series.
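The grid and the knee rule can be sketched as follows. This is an illustration under assumptions: the rounding used for the log-spaced grid, the choice of the lowest-concurrency point as the baseline, and the `points` key name are hypothetical; only baseline_concurrency, knee_concurrency, and threshold are field names taken from the description above.

```python
def log_spaced_grid(lo: int, hi: int, steps: int) -> list[int]:
    """Integer points evenly spaced in log space between lo and hi."""
    ratio = (hi / lo) ** (1 / (steps - 1))
    return sorted({round(lo * ratio**i) for i in range(steps)})


def find_degradation_knee(points: list[tuple[int, float]], threshold: float) -> dict:
    """points are (concurrency, p99_latency_ms) pairs in any order."""
    ordered = sorted(points)
    # ASSUMPTION: the lowest-concurrency point serves as the baseline.
    baseline_concurrency, baseline_latency = ordered[0]
    limit = baseline_latency * (1 + threshold)
    # First concurrency whose p99 latency exceeds the limit, or None.
    knee = next((c for c, latency in ordered[1:] if latency > limit), None)
    return {
        "baseline_concurrency": baseline_concurrency,
        "knee_concurrency": knee,
        "threshold": threshold,
        "points": ordered,  # hypothetical key name for the point series
    }


grid = log_spaced_grid(1, 1000, 8)
# Made-up latencies, for illustration only.
series = [(1, 100.0), (3, 105.0), (7, 120.0), (19, 160.0), (52, 400.0)]
report = find_degradation_knee(series, threshold=0.5)
```

With a threshold of 0.5 the limit is 1.5 times the baseline latency, so the knee is the first point above 150 ms in this toy series.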
prefill-ttft-curve
8-step log-spaced grid over ISL in [--isl-min, --isl-max] (defaults 256, 32768) at concurrency=1; post-process fits TTFT = a*ISL + b and falls back to a quadratic fit when r² < 0.85.
Output: sweep_aggregate/prefill_curve.json with fit_form (linear | quadratic), coefficients, r_squared, r_squared_floor, and the raw (isl, ttft_ms) points.
decode-itl-curve
Two-axis grid: 6 log-spaced concurrency points in [1, 200] x 4 log-spaced OSL points in [64, 1024]. Post-process emits an axis-aligned grid surface; cells where no triple was measured stay null (the handler refuses to invent values for missing cells).
Output: sweep_aggregate/decode_itl_surface.json with surface.concurrency_axis, surface.osl_axis, surface.itl_grid (2D, indexed [concurrency_idx][osl_idx]), and the raw (concurrency, osl, itl_ms) triples.
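The sketch below shows one way to turn raw triples into an axis-aligned surface indexed [concurrency_idx][osl_idx], leaving unmeasured cells as `None` (serialized as null in JSON). It is an illustration only: deriving the axes from the measured triples is an assumption, and the function name is hypothetical.

```python
import json


def build_itl_surface(triples: list[tuple[int, int, float]]) -> dict:
    """triples are (concurrency, osl, itl_ms) measurements."""
    # ASSUMPTION: axes come from the measured values themselves.
    concurrency_axis = sorted({c for c, _, _ in triples})
    osl_axis = sorted({o for _, o, _ in triples})
    measured = {(c, o): itl for c, o, itl in triples}
    # Missing cells stay None; nothing is interpolated.
    itl_grid = [
        [measured.get((c, o)) for o in osl_axis]
        for c in concurrency_axis
    ]
    return {
        "concurrency_axis": concurrency_axis,
        "osl_axis": osl_axis,
        "itl_grid": itl_grid,
    }


# Made-up measurements; the (8, 256) cell was never measured.
surface = build_itl_surface([(1, 64, 10.0), (1, 256, 11.0), (8, 64, 12.5)])
serialized = json.dumps(surface)
```

Keeping gaps explicit lets a consumer decide whether to skip, interpolate, or re-run the missing cells.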
pareto-sweep
Sweeps paired ISL/OSL workload shapes from --isl-osl-pairs against a list of concurrency values, pre-flattened to a ScenarioSweep so the pairs stay paired (vs the Cartesian product a grid would emit). Each cell runs as its own benchmark; the pareto_sweep_export post-process walks the per-combination metrics and marks each cell pareto_optimal: true iff no other cell has both lower time_to_first_token.p95 and higher output_token_throughput.avg. Streaming required (the recipe’s y-axis is output_token_throughput, a streaming-only metric). --concurrency defaults to [1, 4, 16, 64, 256] when omitted; this recipe consumes the magic-list flag directly.
The above expands to 3 pairs × 5 concurrency values = 15 benchmark runs and writes sweep_aggregate/pareto_sweep.json with one cell per run plus a per-cell pareto_optimal flag.
You want a single chart for a capacity-planning doc that shows, for the same model and deployment, how throughput trades off against latency under several distinct workload shapes, such as short chat turns (128/128), RAG-style prompts (512/256), and long-doc summarization (2048/512), across a range of concurrency. A grid sweep is the wrong tool: it would Cartesian-product isl × osl × concurrency, and most of those cells (for example isl=128, osl=512 or isl=2048, osl=128) aren’t workload shapes you care about. You want the ISL and OSL to stay paired, with concurrency swept inside each pair. pareto-sweep is built for exactly this.
The recipe pre-flattens the (pairs × concurrency) grid into a ScenarioSweep — one scenario per cell, with internal label shape_<isl>_<osl>_c<conc> and swept values {isl, osl, concurrency}. The orchestrator then runs each scenario as a separate benchmark, producing the same per-run artifact tree a --sweep invocation would. The on-disk directory name is derived from the swept values (not the internal label). After all runs complete and SweepAnalyzer.compute() finishes, the pareto_sweep_export post-process handler walks the per-combination metrics and writes the frontier JSON. Failures in the post-process step are logged into sweep_aggregate/post_process_errors.json but do not fail the sweep — the per-run profile exports are already on disk.
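The pre-flattening step can be pictured as a nested loop over pairs and concurrency values. The sketch below is illustrative: the function and dictionary keys are hypothetical, while the label shape and the directory-name pattern (swept values joined with `__`) follow the formats quoted in this document.

```python
def flatten_pareto_sweep(
    pairs: list[tuple[int, int]],
    concurrencies: list[int],
) -> list[dict]:
    """One scenario per (pair, concurrency) cell; ISL and OSL stay paired."""
    scenarios = []
    for isl, osl in pairs:
        for conc in concurrencies:
            scenarios.append(
                {
                    "label": f"shape_{isl}_{osl}_c{conc}",
                    "values": {"isl": isl, "osl": osl, "concurrency": conc},
                    "dir_name": f"isl_{isl}__osl_{osl}__concurrency_{conc}",
                }
            )
    return scenarios


pairs = [(128, 128), (512, 256), (2048, 512)]
concurrencies = [1, 4, 16, 64, 256]  # the recipe's default list
scenarios = flatten_pareto_sweep(pairs, concurrencies)
```

Three pairs times five concurrency values gives 15 scenarios, whereas a Cartesian grid over the same ISL, OSL, and concurrency values would produce 3 × 3 × 5 = 45 cells.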
--isl-osl-pairs syntax
Syntax: <isl>/<osl>,<isl>/<osl>,.... Each side is a positive integer. Whitespace around commas and slashes is tolerated. Pairs must be unique. For example, 128/128,512/256,2048/512 is valid.
Invalid inputs raise a ValueError from parse_isl_osl_pairs at expand time, naming the bad token.
A single-cell sweep is also rejected, because a one-point “Pareto frontier” is meaningless.
The recipe also rejects non-streaming endpoints at expand time.
--isl-osl-pairs is recipe-only and is silently ignored unless --search-recipe pareto-sweep is set. The full flag entry lives in CLI Options.
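The syntax rules above (positive integers, tolerated whitespace, unique pairs, errors that name the bad token) can be expressed as a small parser. This is a stand-alone sketch written for illustration; it is not the source of parse_isl_osl_pairs, and the function name and error messages here are hypothetical.

```python
def parse_pairs_sketch(raw: str) -> list[tuple[int, int]]:
    """Parse '<isl>/<osl>,<isl>/<osl>,...' into a list of (isl, osl) tuples."""
    pairs: list[tuple[int, int]] = []
    for token in raw.split(","):
        parts = [p.strip() for p in token.split("/")]
        well_formed = len(parts) == 2 and all(
            p.isascii() and p.isdigit() for p in parts
        )
        if not well_formed:
            raise ValueError(f"invalid pair {token.strip()!r}: expected <isl>/<osl>")
        isl, osl = int(parts[0]), int(parts[1])
        if isl <= 0 or osl <= 0:
            raise ValueError(f"invalid pair {token.strip()!r}: values must be positive")
        if (isl, osl) in pairs:
            raise ValueError(f"duplicate pair {token.strip()!r}")
        pairs.append((isl, osl))
    return pairs


def rejects(raw: str) -> bool:
    """Helper for the examples: True when the input raises ValueError."""
    try:
        parse_pairs_sketch(raw)
    except ValueError:
        return True
    return False


parsed = parse_pairs_sketch("128/128, 512 / 256,2048/512")
```

Inputs such as `128/abc`, `0/64`, `128`, or a repeated `128/128,128/128` are rejected by this sketch, each with a message that quotes the offending token.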
Standard sweep artifacts are written under <artifact_dir>/:
sweep_aggregate/profile_export_aiperf_sweep.{json,csv}: the cross-cell summary table the grid path always emits.
isl_<isl>__osl_<osl>__concurrency_<conc>/profile_export_aiperf.json: full per-run metrics for each (isl, osl, concurrency) cell (default single-trial layout; SweepVariation.dir_name joins the swept values with __). With --num-profile-runs N (N > 1) and the default REPEATED iteration order, per-trial outputs live under <artifact_dir>/profile_runs/trial_NNNN/isl_<isl>__osl_<osl>__concurrency_<conc>/profile_export_aiperf.json.
sweep_aggregate/pareto_sweep.json: the recipe-specific frontier file, with fixed axes x_metric=time_to_first_token/p95 (lower-is-better) vs y_metric=output_token_throughput/avg (higher-is-better).
A cell is marked pareto_optimal: true iff no other cell weakly dominates it, i.e. no other cell has x <= cell.x AND y >= cell.y with strict inequality on at least one axis. The frontier is computed across all cells in the file, over every shape and every concurrency together, so the optimal set typically includes the lowest-latency cell of the smallest shape AND the highest-throughput cell of the largest shape, with intermediate cells filling in between. If you need per-shape frontiers (one Pareto curve per (isl, osl)) rather than a single global one, group cells on (isl, osl) client-side and do the dominance check yourself; see the plotting snippet below.
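The dominance rule, and the client-side grouping for per-shape frontiers, can be sketched as follows. The cell dictionaries and their keys (`isl`, `osl`, `x`, `y`) are assumptions for illustration, with `x` standing for p95 TTFT (lower is better) and `y` for average output token throughput (higher is better); the real file layout may differ.

```python
from collections import defaultdict


def dominates(a: dict, b: dict) -> bool:
    """a is at least as good as b on both axes and strictly better on one."""
    return (
        a["x"] <= b["x"]
        and a["y"] >= b["y"]
        and (a["x"] < b["x"] or a["y"] > b["y"])
    )


def mark_pareto_optimal(cells: list[dict]) -> list[dict]:
    """Return copies of the cells with a pareto_optimal flag added."""
    return [
        {**cell, "pareto_optimal": not any(dominates(other, cell) for other in cells)}
        for cell in cells
    ]


def per_shape_frontiers(cells: list[dict]) -> dict:
    """Group cells on (isl, osl) and run the dominance check inside each group."""
    groups = defaultdict(list)
    for cell in cells:
        groups[(cell["isl"], cell["osl"])].append(cell)
    return {shape: mark_pareto_optimal(group) for shape, group in groups.items()}


# Made-up cells, for illustration only.
cells = [
    {"isl": 128, "osl": 128, "concurrency": 1, "x": 50.0, "y": 1000.0},
    {"isl": 128, "osl": 128, "concurrency": 4, "x": 60.0, "y": 900.0},
    {"isl": 2048, "osl": 512, "concurrency": 16, "x": 200.0, "y": 3000.0},
    {"isl": 2048, "osl": 512, "concurrency": 1, "x": 150.0, "y": 800.0},
]
global_flags = [c["pareto_optimal"] for c in mark_pareto_optimal(cells)]
by_shape = per_shape_frontiers(cells)
```

In this toy data the last cell is dominated globally by the first one, yet it is optimal within its own (2048, 512) group, which is exactly the difference between the global flag and a per-shape frontier.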
aiperf plot does not currently render pareto_sweep.json directly, and pareto-sweep does not opt in to --auto-plot (only the curve recipes — concurrency-ramp, prefill-ttft-curve, decode-itl-curve — set auto_plot_default = True). Plot it yourself with matplotlib:
Each line traces one workload shape across concurrency; circles ringed in red are globally Pareto-optimal across all shapes.
adaptive-search (BO over concurrency for one objective).
prefill-ttft-curve recipe.
max-throughput-ttft-sla or max-throughput-itl-sla.
request_count, duration, phases, or different dataset types per shape), write a ScenarioSweep directly; see docs/tutorials/sweeps.md -> Paired ISL/OSL via Scenarios. pareto-sweep is the one-liner for the common case where every cell shares the same per-run config and you only want to vary (isl, osl, concurrency).
If --concurrency 1,4,16,64,256 lands a 4× jump on either side of the knee, the frontier you plot will visibly miss the actual knee. Re-run with a denser list around where the curve bends, e.g. --concurrency 16,32,48,64,96,128,192,256.
128/64, 512/256, 2048/512 all parse fine). Mirror the production traffic shape, not symmetric powers of two.
time_to_first_token.p95 and output_token_throughput.avg into the post-process spec; there is no CLI flag to swap them. If you need a different pair, copy the recipe under a new name and adjust the PostProcessSpec params (see Writing your own recipe).
output_token_throughput requires --streaming. There is no non-streaming variant of this recipe; chat-completions and similar endpoints must be in streaming mode.
pareto_optimal flag in the JSON is computed across every cell, not within each (isl, osl) group. Group cells client-side (as the plotting snippet above shows) if you want per-shape frontiers.
--search-recipe is rejected alongside any defining --search-* flag (--search-space, --search-metric, --search-direction, --search-stat, --search-planner, --search-percentile-pooling, --optuna-sampler, --optuna-acquisition, --optuna-terminator, --bo-constraint-mode). Drop one or the other.
--search-* flags (--search-max-iterations, --search-initial-points, --search-random-seed) are accepted on BO recipes and override the recipe’s defaults; they are rejected on grid recipes (which have no BO loop to tune).
--concurrency 10,20,30, etc.). The recipe owns the swept variables. Exception: pareto-sweep consumes --concurrency directly (declared in consumed_magic_lists), so passing a --concurrency list alongside --search-recipe pareto-sweep is allowed and forms one axis of the sweep.
--convergence-metric (trial-level adaptive early-stop). The two operate at different levels.
Errors name both the recipe and the conflicting flag list.
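The flag-collision rules can be summarized in a short validation sketch. Everything here other than the flag names is hypothetical: the function, the "defining" versus "tuning" grouping labels, and the error wording are illustrative and are not taken from the converter's source.

```python
DEFINING_FLAGS = (
    "--search-space", "--search-metric", "--search-direction", "--search-stat",
    "--search-planner", "--search-percentile-pooling", "--optuna-sampler",
    "--optuna-acquisition", "--optuna-terminator", "--bo-constraint-mode",
)
TUNING_FLAGS = (
    "--search-max-iterations", "--search-initial-points", "--search-random-seed",
)


def check_recipe_conflicts(recipe: str, passed_flags: list[str], is_bo_recipe: bool) -> None:
    """Raise ValueError naming the recipe and every conflicting flag."""
    conflicts = [f for f in passed_flags if f in DEFINING_FLAGS]
    if not is_bo_recipe:
        # Grid recipes have no BO loop, so BO tuning flags are also rejected.
        conflicts += [f for f in passed_flags if f in TUNING_FLAGS]
    if conflicts:
        raise ValueError(
            f"--search-recipe {recipe} cannot be combined with: {', '.join(conflicts)}"
        )


def is_rejected(recipe: str, flags: list[str], is_bo_recipe: bool) -> bool:
    """Helper for the examples: True when the combination raises."""
    try:
        check_recipe_conflicts(recipe, flags, is_bo_recipe)
    except ValueError:
        return True
    return False
```

For example, `is_rejected("max-throughput-ttft-sla", ["--search-max-iterations"], True)` is False because tuning flags override defaults on BO recipes, while the same flag with a grid recipe such as concurrency-ramp is rejected.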
A recipe is a stateless class implementing the SearchRecipe Protocol in aiperf.search_recipes._base:
Then register the recipe in your plugins.yaml:
The plugin loader picks it up at startup; aiperf plugins --validate exercises the registry. See Plugin System for the broader registry shape.
Set sweep_parameters (a path -> list-of-values map) instead of adaptive_search; the converter writes the dict into sweep.parameters so expand_sweep materializes one variation per cartesian-product cell. Optionally attach a PostProcessSpec to emit a derived artifact under sweep_aggregate/:
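To make the "one variation per Cartesian-product cell" behavior concrete, the sketch below expands a path-to-values map with `itertools.product`. It illustrates the idea only and is not the code of expand_sweep; the function name and the second parameter path are hypothetical.

```python
from itertools import product


def expand_cartesian(sweep_parameters: dict[str, list]) -> list[dict]:
    """Return one {path: value} dict per cell of the Cartesian product."""
    paths = list(sweep_parameters)
    value_lists = [sweep_parameters[path] for path in paths]
    return [dict(zip(paths, combo)) for combo in product(*value_lists)]


variations = expand_cartesian(
    {
        "phases.profiling.concurrency": [1, 8, 64],
        "hypothetical.path.to.osl": [64, 256],  # placeholder path, not a real key
    }
)
```

Three concurrency values and two OSL values yield six variations. This is also why paired shapes need the pre-flattened scenario approach instead: a product cannot keep two axes in lockstep.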
Handlers implement PostProcessHandler in aiperf.search_recipes.post_process and register under the search_recipe_post_process plugin category. They run after SweepAnalyzer.compute() and emit a JSON artifact under sweep_aggregate/<output_filename>:
Failures in a handler are logged and recorded in sweep_aggregate/post_process_errors.json but do not fail the sweep — standard artifacts are already written.
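The failure-isolation behavior can be sketched as a wrapper that catches handler exceptions and records them instead of propagating. This is a hypothetical stand-in, not the aiperf runner: the function name, the handler signature (a zero-argument callable), and the layout of the error records are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_post_process(
    handler: Callable[[], dict],
    output_filename: str,
    aggregate_dir: Path,
) -> bool:
    """Write the handler's artifact; on failure, record the error and carry on."""
    aggregate_dir.mkdir(parents=True, exist_ok=True)
    try:
        artifact = handler()
        (aggregate_dir / output_filename).write_text(json.dumps(artifact, indent=2))
        return True
    except Exception as exc:  # a handler failure must not fail the sweep
        errors_path = aggregate_dir / "post_process_errors.json"
        errors = json.loads(errors_path.read_text()) if errors_path.exists() else []
        errors.append({"output_filename": output_filename, "error": repr(exc)})
        errors_path.write_text(json.dumps(errors, indent=2))
        return False


with tempfile.TemporaryDirectory() as tmp:
    aggregate = Path(tmp) / "sweep_aggregate"
    ok = run_post_process(lambda: {"knee_concurrency": None}, "degradation_knee.json", aggregate)
    failed = run_post_process(lambda: {"value": 1 / 0}, "broken.json", aggregate)
    recorded = json.loads((aggregate / "post_process_errors.json").read_text())
    artifact_written = (aggregate / "degradation_knee.json").exists()
```

The design choice is that derived artifacts are optional extras: the per-run exports are already on disk, so a broken handler costs one missing JSON file rather than the whole sweep.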
max-concurrency-under-sla and max-goodput-under-slo deep dive: SLA flag table, search styles, output-artifact schemas, comparison to perf_analyzer / k6 / Triton Model Analyzer.
search_history.json.