Sweeps Error Troubleshooting Guide
This page covers configuration and runtime errors for both grid-style parameter sweeps and adaptive (Bayesian) search. For algorithm semantics, see Bayesian-Optimization Outer Loop. For the YAML reference, see Parameter Sweeps.
Each entry quotes the literal error/warning string raised by the code today, with a source-file pointer so you can verify against main.
Grid / Zip / Scenarios Errors
1. Invalid Concurrency Value
Error Message (Pydantic, from CLI parse):
Cause: You provided a non-numeric value for --concurrency. parse_int_or_int_list calls int(s) directly, so the stdlib ValueError propagates and Pydantic wraps it as the int_parsing error above on the concurrency field.
Where it’s raised: src/aiperf/config/loader/parsing.py (parser), src/aiperf/config/flags/cli_config.py (field).
Solution:
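Pass a plain integer or a comma-separated integer list. A minimal sketch (endpoint and model flags elided):

```bash
aiperf profile --concurrency 32 ...        # single integer
aiperf profile --concurrency 10,20,30 ...  # comma-separated list (magic-list sweep)
# aiperf profile --concurrency high ...    # rejected: int("high") raises ValueError
```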
2. Invalid Concurrency List
Error Message (stdlib, surfaced through Pydantic):
Cause: One element of a comma-separated --concurrency list is not a valid integer. The list parser does [int(p) for p in parts] and the stdlib ValueError is raised on the first bad token, with no list context or position information.
Where it’s raised: src/aiperf/config/loader/parsing.py.
Solution:
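Fix the offending element; every comma-separated token must parse as an integer. A minimal sketch:

```bash
aiperf profile --concurrency 10,20,30 ...
# aiperf profile --concurrency 10,2O,30 ...  # rejected: "2O" (letter O) fails int()
```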
3. Negative or Zero Concurrency Values
Error Message (Pydantic):
Cause: A concurrency value is zero or negative. PhaseConfig.concurrency is constrained to ge=1, so each value is rejected individually with the standard Pydantic greater_than_equal error — there is no aggregated, position-aware message.
Where it’s raised: src/aiperf/config/phases.py.
Solution:
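Use values of 1 or greater. A minimal sketch:

```bash
aiperf profile --concurrency 1,8,64 ...
# aiperf profile --concurrency 0,8,64 ...  # rejected: 0 fails the ge=1 constraint
```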
Why: Concurrency represents the number of in-flight requests. Zero or negative is meaningless.
4. Dashboard UI with Parameter Sweeps or Multi-Run
Error Message (late-stage, plan validation — covers both sweep and multi-run):
Where it’s raised: src/aiperf/cli_runner.py (_validate_multi_benchmark_plan).
Earlier sweep-only message (fires first when --ui dashboard is explicitly set on a sweep config):
Where it’s raised: src/aiperf/config/config.py (validate_sweep_no_dashboard_ui, model-validator). Only triggers when runtime.ui is explicitly set by the user and a sweep is configured; multi-run alone does not trip this early check.
Cause: The dashboard UI requires exclusive terminal control and would overwrite itself between sequential runs.
Solution:
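Drop the explicit `--ui dashboard` so a non-dashboard UI is used for the sequential runs. A minimal sketch:

```bash
# aiperf profile --concurrency 10,20,30 --ui dashboard ...  # rejected
aiperf profile --concurrency 10,20,30 ...                   # OK: default UI
```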
5. Invalid Cooldown Duration
CLI path (Pydantic, fires first):
--parameter-sweep-cooldown-seconds has Field(ge=0), so any negative value is rejected at config-parse time before the strategy ever sees it.
Where it’s raised: src/aiperf/config/flags/cli_config.py.
Programmatic path (FixedTrialsStrategy direct construction):
Where it’s raised: src/aiperf/orchestrator/strategies.py.
Solution:
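Use a cooldown of 0 or greater. A minimal sketch (the cooldown value is illustrative):

```bash
aiperf profile --concurrency 10,20,30 --parameter-sweep-cooldown-seconds 5 ...
# aiperf profile ... --parameter-sweep-cooldown-seconds -1   # rejected at parse time
```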
6. Empty Sweep-Block Value List
Error Message (grid sweep):
Error Message (zip sweep):
Cause: A sweep block (in a YAML config) declared a parameter with an empty values: list. This applies to YAML-defined sweeps only; the magic-list CLI path (e.g. --concurrency 10,20,30) collapses --concurrency "" to None and never enters this sweep-block code, so there is no CLI-side trigger for these messages.
Where it’s raised: src/aiperf/config/sweep/expand.py (grid), src/aiperf/config/sweep/expand.py (zip).
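Solution: give every swept parameter at least one value. A hedged sketch of the offending YAML shape (the key layout here is illustrative; see the Parameter Sweeps reference for the exact sweep-block schema):

```yaml
sweep:
  phases.profiling.concurrency:
    values: []              # rejected: empty values list
    # values: [10, 20, 30]  # give at least one value
```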
7. Insufficient Successful Runs for Aggregation
Warning Message (sweep mode, per-variation):
Where it’s raised: src/aiperf/cli_runner/_sweep_aggregate.py.
Note: Sweep mode does not require at least 2 successful runs. ConfidenceAggregation has a documented single-run degraded mode (std=0, CI collapsed to mean, single_run: True in metadata), and per-variation aggregation explicitly lets single-success cells through — see the comment at src/aiperf/cli_runner/_sweep_aggregate.py. Only cells with zero successful runs are skipped.
Related sweep-level warnings:
- `Skipping per-variation aggregate for '<label>': ConfidenceAggregation raised <exc>` — aggregation crashed for that cell (`cli_runner/_sweep_aggregate.py`).
- `Sweep aggregate skipped: no successful runs across all variations.` — the whole-sweep summary is skipped only when every variation had zero successes (`cli_runner/_sweep_aggregate.py`).
Warning Message (non-sweep multi-run path):
Where it’s raised: src/aiperf/cli_runner.py. This message applies to plain --num-profile-runs runs (no sweep), where the “need at least 2” rule does hold.
Solution:
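For sweep cells skipped with zero successes, investigate why every run in that cell failed. For the non-sweep warning, raise the run count. A minimal sketch (the run count is illustrative):

```bash
# Plain multi-run: the "at least 2 successful runs" rule applies here.
aiperf profile --num-profile-runs 5 ...
```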
Silently-Ignored Flag Combinations
Some flag combinations that look incorrect do not currently raise. They are listed here so users searching for an error message don’t waste time looking for one:
- Sweep-only flags used without a sweep. `--parameter-sweep-mode`, `--parameter-sweep-cooldown-seconds`, and `--parameter-sweep-same-seed` are silently no-ops when no sweep is configured. The sweep-override pathway in `src/aiperf/config/flags/converter.py` only consults these fields when a sweep block is present. No validator exists today.
- Multi-run-only flags used in single-run mode. `--confidence-level`, `--profile-run-cooldown-seconds`, and `--profile-run-disable-warmup-after-first` are silently ignored when `--num-profile-runs` is 1. The CLI help text for `--confidence-level` says “Only applies when --num-profile-runs > 1” but this is informational, not enforced (`src/aiperf/config/flags/cli_config.py`). `--set-consistent-seed` also applies in sweep-without-multi-run mode (`src/aiperf/config/config.py`), so it is not strictly multi-run-only.
If you hit one of these and were expecting an error, please file an issue — these are good UX targets for future validators.
Quick Reference: Common Patterns
Single Concurrency (No Sweep)
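A minimal sketch (endpoint and model flags elided throughout this section):

```bash
aiperf profile --concurrency 32 ...
```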
Parameter Sweep (No Confidence)
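The magic-list form, one run per value:

```bash
aiperf profile --concurrency 10,20,30 ...
```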
Parameter Sweep + Confidence Reporting
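The sweep combined with the multi-run confidence flags (flag values are illustrative):

```bash
aiperf profile --concurrency 10,20,30 --num-profile-runs 5 --confidence-level 0.95 ...
```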
Adaptive Search Errors
This section covers errors and warnings from AIPerf’s adaptive-search feature — `aiperf profile --search-space ... --search-metric ... --search-direction ... --search-max-iterations ...`. AIPerf wraps Optuna+BoTorch to drive a Bayesian-Optimization (BO) outer loop; most errors come from input validation and a small set of mutual-exclusion guards.
For the deeper “why does BO behave this way,” see /aiperf/sweeping-adaptive-search/bayesian-optimization.
1. Missing Optional BoTorch Dependency
Error message:
Cause:
OptunaSearchPlanner uses Optuna core by default, but its implicit preferred sampler is BoTorch. Explicit --optuna-sampler botorch or BoTorch-only acquisitions require optuna-integration, botorch>=0.10, gpytorch, and torch. When BoTorch is only the implicit default, AIPerf falls back to TPE with a warning if this optional stack is unavailable; explicit BoTorch requests fail instead of silently changing semantics.
Fix:
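One way to satisfy the optional stack in a pip-managed environment, using the package names listed above:

```bash
pip install optuna-integration "botorch>=0.10" gpytorch torch
```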
2. Malformed --search-space String
Error message:
Other shapes from the same parser:
Cause:
parse_search_space in src/aiperf/orchestrator/search_planner/parsing.py implements the grammar PATH:LO,HI[:KIND] with KIND in {int, real} (default real). Common bugs: missing the : separator, swapping HI/LO, non-numeric bound, or a kind outside int|real.
Fix:
--search-space is repeatable; pass it once per dimension.
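A minimal sketch of the grammar (paths are from the list in entry 3; bounds are illustrative):

```bash
# PATH:LO,HI[:KIND]; KIND defaults to real.
aiperf profile \
  --search-space phases.profiling.concurrency:1,256:int \
  --search-space phases.profiling.request_rate:0.5,100:real \
  ...
```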
3. Search Path Doesn’t Resolve
Error message:
Cause:
The dotted path is resolved by _set_nested_value in src/aiperf/config/sweep/expand.py against the dict form of BenchmarkConfig. Named-list segments (e.g. phases.profiling.*) match on the entry’s name field. Typos like phase.profiling.concurrency (no s) or phases.profilling.concurrency (extra l) error loudly rather than silently creating a phantom phase.
Fix:
Common top-level segments: phases.<name>.<field> (typically profiling or warmup; <field> is a BasePhaseConfig scalar like concurrency, request_rate, request_count), endpoint.<field>, runtime.<field>.
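A minimal sketch of a path that resolves versus the typos named above:

```bash
--search-space phases.profiling.concurrency:1,256:int      # resolves
# --search-space phase.profiling.concurrency:1,256:int     # rejected: missing "s"
# --search-space phases.profilling.concurrency:1,256:int   # rejected: extra "l"
```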
4. --search-metric Uses an Aggregator-Suffixed Key
Cause:
The BO objective is the bare metric tag (e.g. output_token_throughput, time_to_first_token) — not the flattened _avg / _p99 form that appears in CSV/JSON exports. The statistic is selected separately via --search-stat (one of avg, p50, p90, p95, p99; default avg). See _extract_objective_vector in src/aiperf/orchestrator/search_planner/optuna_planner.py and AdaptiveSearchSweep.objectives[0].metric in src/aiperf/config/sweep/config.py.
Fix:
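Use the bare tag plus `--search-stat`. A minimal sketch:

```bash
aiperf profile --search-metric output_token_throughput --search-stat p99 ...
# aiperf profile --search-metric output_token_throughput_p99 ...  # aggregator-suffixed key
```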
See “Objective Semantics” in /aiperf/sweeping-adaptive-search/bayesian-optimization for which metric tags are produced and how stats map to JSON fields.
5. --search-metric Names a Metric the Run Doesn’t Produce
Warning message:
Cause:
_extract_objective_vector in src/aiperf/orchestrator/search_planner/optuna_planner.py keeps trials only if r.summary_metrics[self._cfg.objectives[0].metric] is present. If the metric never appears (e.g. time_to_first_token against a non-streaming endpoint, or inter_token_latency for a single-token completion), every trial is filtered out, the iteration produces no usable objective, and the planner feeds Optuna a per-objective sentinel vector — see entry 6 for the mechanics.
Fix:
Confirm the metric is produced before driving a long BO run:
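A hedged sketch of one way to check (the output path is illustrative; look in your run’s actual artifact directory):

```bash
# Run one short profile without --search-* flags...
aiperf profile --concurrency 32 ...
# ...then look for the bare metric tag in the JSON/CSV exports:
grep -R "time_to_first_token" <output-dir>/
```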
If the desired metric is missing, pick one that is produced or adjust the run to produce it (e.g. enable streaming for time-to-first-token).
6. All Trials in an Iteration Failed
Warning message:
Same as entry 5. The corresponding entry in search_history.json has objective_values: null.
Cause:
When every trial fails, the planner builds a per-objective sentinel via _failure_sentinel_vector (see src/aiperf/orchestrator/search_planner/optuna_planner.py) and feeds it to study.tell(trial, ...) so the ask/tell pairing stays consistent. Each sentinel is the worst-of-prior value for that objective plus a 10%-or-1.0 margin in the worse direction; if no prior history exists for that objective, it falls back to +/- NO_DATA_SENTINEL_LOSS. The sentinel value IS observed by Optuna’s surrogate (the GP sees a strictly-worse-than-anything-seen point so it deprioritizes that region), but the fallback value is NOT persisted to search_history.json — objective_values is set to null for that iteration, matching what /aiperf/api/search-history-api-reference describes.
This keeps the ask/tell loop consistent and lets the loop continue rather than aborting.
Fix:
The fallback is a degraded mode, not a clean signal — investigate the failures rather than letting them accumulate:
Common causes: server timeouts, OOM at high concurrency, endpoint refusing streaming, metric-collection error. Tighten server availability or narrow the search-space bounds before re-running. See /aiperf/api/search-history-api-reference for the search_history.json schema and how to filter sentinel iterations.
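A hedged jq sketch for pulling out sentinel iterations, assuming the history file is a JSON array of iteration entries (check the Search History API Reference for the actual schema):

```bash
# Iterations where every trial failed persist objective_values as null:
jq '[.[] | select(.objective_values == null)]' search_history.json
```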
7. Mutual Exclusion: --search-* + Magic-List Flag
Error message:
Cause:
Magic-list flags (--concurrency 10,20,30) are promoted to a top-level sweep: block by _promote_magic_lists_to_sweep_block in src/aiperf/config/flags/converter.py. The converter’s Pydantic validation of AdaptiveSearchSweep (declared with extra="forbid" in src/aiperf/config/sweep/config.py) then rejects the combination — BO chooses iterations adaptively from continuous ranges, while a magic-list expects you to enumerate the discrete points up front.
Fix:
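Pick one mode per invocation. A minimal sketch (search flag values are illustrative):

```bash
# Either enumerate discrete points up front...
aiperf profile --concurrency 10,20,30 ...
# ...or let BO choose points from a continuous range:
aiperf profile --search-space phases.profiling.concurrency:1,256:int \
  --search-max-iterations 20 ...
```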
See the “grid vs BO” decision matrix in /aiperf/sweeping-adaptive-search/bayesian-optimization.
8. Mutual Exclusion: --search-* + Explicit sweep: YAML Block
Error message:
Cause:
Same guard as entry 7: AdaptiveSearchSweep’s extra="forbid" validator in src/aiperf/config/sweep/config.py rejects the merged dict. Triggered when an aiperf-config.yaml contains a top-level sweep: block AND the CLI invocation passes --search-* flags.
Fix:
Drop one or the other. If your config carries a leftover sweep: block from an earlier experiment, remove it before adding --search-*:
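A hedged sketch of the leftover block to delete (key shapes are illustrative; see the Parameter Sweeps YAML reference for the exact schema):

```yaml
# aiperf-config.yaml: delete the whole top-level block before adding --search-*
sweep:
  phases.profiling.concurrency:
    values: [10, 20, 30]
```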
9. Mutual Exclusion: --search-* + --convergence-metric
Error message:
Raised as TypeError from _reject_search_plus_convergence in src/aiperf/config/flags/_converter_optionals.py when both --search-space (with its companion --search-* flags) and --convergence-metric are set on the same aiperf profile invocation.
Cause:
--convergence-metric is a trial-level adaptive stop (stop trials at a single benchmark point once the metric stabilizes); --search-* is an outer-loop adaptive search (choose the next benchmark point). The two are conceptually orthogonal, but their composition is not yet well-defined: it is unclear which value to report to the planner when a trial stops early, and whether convergence-stopped trials should count toward the per-iteration trial budget.
Fix:
Pick one until composition is supported:
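A minimal sketch of the two alternatives (flag values and the metric placeholder are illustrative):

```bash
# Outer-loop adaptive search only:
aiperf profile --search-space phases.profiling.concurrency:1,256:int \
  --search-metric output_token_throughput --search-max-iterations 20 ...
# ...or trial-level convergence stop only:
aiperf profile --concurrency 32 --convergence-metric <metric> ...
```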
10. --search-initial-points >= --search-max-iterations
Error message:
Cause:
AdaptiveSearchSweep._check_initial_points_below_max_iterations in src/aiperf/config/sweep/config.py rejects the configuration. BO needs at least one iteration after the random Sobol-seeded initial points so the GP can fit and the sampler can propose informed points. Default for --search-initial-points is 5; --search-max-iterations has no default and is required whenever --search-space is set.
Fix:
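Keep `--search-initial-points` strictly below `--search-max-iterations`. A minimal sketch (with the default of 5 initial points, max iterations must be at least 6):

```bash
aiperf profile --search-space phases.profiling.concurrency:1,256:int \
  --search-initial-points 5 --search-max-iterations 20 ...
```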
Why this rule exists:
The Sobol-random phase exists to seed the GP with diverse points before it can fit a meaningful posterior. If the entire iteration budget is consumed by the random phase, the run is just expensive uniform sampling — there’s no BO-shaped value left to extract. The strict < ensures at least one GP-driven iteration runs.
Getting Help
If you encounter an error not covered in this guide:
- Check the error message carefully: Pydantic errors include the field path, the constraint that failed, and the offending input value.
- Review the documentation (see the links under “See also” below).
- Report a bug if:
  - The error message is unclear or unhelpful
  - You believe the error is incorrect
  - The suggested fix doesn’t work
Include in your bug report:
- Full command line you ran
- Complete error message
- AIPerf version (`aiperf --version`)
- What you expected to happen
See also
- Bayesian-Optimization Outer Loop — Canonical BO reference: algorithm choice, objective semantics, convergence criteria, grid-vs-BO decision matrix.
- Parameter Sweeps — Parameter sweeping tutorial and YAML reference.
- Adaptive Search — Adaptive search tutorial.
- Search History API Reference — `search_history.json` schema and how to inspect per-iteration objective values.