Multi-Run Confidence Reporting
Multi-run confidence reporting allows you to run the same benchmark configuration multiple times to quantify measurement variance, assess repeatability, and compute confidence intervals for key metrics. This helps answer the critical question: “Is this performance difference real or just noise?”
When you run a single benchmark, the results can vary due to:
By running multiple trials of the same benchmark, you can:
In multi-run mode, the dashboard UI is rejected at startup; simple and none are supported. Whichever non-dashboard --ui you pass is used as-is — there is no automatic rewrite of the UI type based on --num-profile-runs.
Simple UI
Shows progress bars for each run - works well with multi-run mode.
No UI
Minimal output, fastest execution - ideal for automated runs or CI/CD pipelines.
The dashboard UI (--ui dashboard) is incompatible with multi-run mode due to terminal control constraints. If you explicitly try to use it, you’ll get an error:
This is a fundamental architectural limitation - Textual requires exclusive terminal control, which isn’t possible when the orchestrator coordinates multiple subprocess runs.
If you need live dashboard updates, run benchmarks individually:
Run the same benchmark 5 times:
Use 99% confidence intervals instead of the default 95%:
Add a 10-second cooldown between runs to reduce correlation:
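As a rough sketch of how the flags in these three examples combine, the helper below appends the multi-run options to an existing argument list. The `aiperf profile` entry point, the placeholder model and concurrency values, and the helper name `build_multi_run_command` are assumptions for illustration; only `--num-profile-runs`, `--confidence-level`, `--profile-run-cooldown-seconds`, and `--ui` are taken from this page.

```python
def build_multi_run_command(base_args, num_runs, confidence_level=None,
                            cooldown_seconds=None):
    """Append multi-run flags to an existing argument list (hypothetical helper)."""
    if num_runs < 2:
        # This helper's own guard: multi-run behavior applies when the count is > 1.
        raise ValueError("multi-run mode needs --num-profile-runs > 1")
    cmd = list(base_args) + ["--num-profile-runs", str(num_runs)]
    if confidence_level is not None:
        cmd += ["--confidence-level", str(confidence_level)]
    if cooldown_seconds is not None:
        cmd += ["--profile-run-cooldown-seconds", str(cooldown_seconds)]
    return cmd


# Assumed base command; the model and concurrency values are placeholders.
base = ["aiperf", "profile", "--model", "llama-3-8b",
        "--concurrency", "10", "--ui", "simple"]

five_runs = build_multi_run_command(base, 5)
ci_99 = build_multi_run_command(base, 5, confidence_level=0.99)
with_cooldown = build_multi_run_command(base, 5, cooldown_seconds=10)

for cmd in (five_runs, ci_99, with_cooldown):
    print(" ".join(cmd))
```

Building the list programmatically is only a convenience for scripted runs; typing the same flags directly on the command line is equivalent.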
When --num-profile-runs > 1, AIPerf creates a hierarchical output structure with an auto-generated directory name:
The artifact tree branches on three flags: whether a sweep is configured
(is_sweep), whether multiple trials run per cell (trials > 1), and
the sweep iteration order (REPEATED vs INDEPENDENT).
<dir_name> is the sweep variation {leaf_param_name}_{value} form (e.g.
concurrency_10). The example above is the second row (no sweep,
multi-run), so it does not use <dir_name>; its auto-generated base artifact
directory uses concurrency10. Sweep + INDEPENDENT multi-run uses
trial_NNNN for the inner dir.
The directory name is automatically generated based on your benchmark configuration:
llama-3-8b (from --model)
openai-chat (from --endpoint-type)
concurrency10 (from --concurrency) or request_rate100 (from --request-rate)
Examples:
artifacts/gpt-4-openai-chat-concurrency50/
artifacts/mistral-7b-openai-completions-request_rate10/
artifacts/llama-2-13b-nim-embeddings-concurrency20/
Each run’s artifacts are stored in separate directories (run_0001, run_0002, etc. for no-sweep multi-run; trial_0001, trial_0002, etc. for sweep + multi-run) and include:
profile_export_aiperf.json - Complete metrics for that run
profile_export_aiperf.csv - CSV export for that run
profile_export.jsonl - Per-request records
inputs.json - Input prompts used
This allows you to:
The aggregate/ directory contains statistics computed across all runs:
profile_export_aiperf_aggregate.json - Aggregated statistics
profile_export_aiperf_aggregate.csv - Tabular view of aggregated metrics
For each metric, the aggregate output includes:
This section provides detailed mathematical definitions for each aggregate statistic computed across multiple runs.
Type: Aggregate Statistic
The average value of the metric across all successful runs.
Formula:
Example: If TTFT p99 values across 5 runs are [150ms, 152ms, 148ms, 155ms, 151ms], the mean is 151.2ms.
Type: Aggregate Statistic
Measures the spread or dispersion of metric values across runs. Uses sample standard deviation (N-1 degrees of freedom).
Formula:
Example: For the TTFT values above, std ≈ 2.59ms, indicating low variability.
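The mean and sample standard deviation quoted in these examples can be reproduced with Python's standard library. This is a minimal sketch, not AIPerf's own code; the variable names are illustrative.

```python
import statistics

# TTFT p99 values (ms) from the five runs in the example above.
ttft_p99_ms = [150, 152, 148, 155, 151]

mean = statistics.mean(ttft_p99_ms)
# statistics.stdev uses N-1 degrees of freedom (sample standard deviation).
std = statistics.stdev(ttft_p99_ms)

print(f"mean={mean:.1f}ms std={std:.2f}ms")
```

Note that `statistics.pstdev` would divide by N instead of N-1 and give a slightly smaller value, which is not the definition used here.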
Type: Aggregate Statistic
The smallest value observed across all runs.
Example: For the TTFT values above, min = 148ms.
Type: Aggregate Statistic
The largest value observed across all runs.
Example: For the TTFT values above, max = 155ms.
Type: Aggregate Statistic
A normalized measure of variability, expressed as a ratio (not percentage). Useful for comparing variability across metrics with different scales.
Formula:
Notes:
abs(mean) to handle metrics that can be negative
inf when mean is zero (division by zero)
Example: For the TTFT values above, CV = 2.59 / 151.2 ≈ 0.017 (1.7%), indicating excellent repeatability.
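The two notes above (dividing by the absolute mean, and returning inf for a zero mean) can be sketched as follows. The function name `coefficient_of_variation` is an assumption for illustration.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Return std / abs(mean) as a ratio, or inf when the mean is zero."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (N-1)
    if mean == 0:
        return math.inf  # avoid division by zero
    return std / abs(mean)


cv = coefficient_of_variation([150, 152, 148, 155, 151])
print(f"CV = {cv:.3f}")
```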
Type: Aggregate Statistic
Measures the uncertainty in the estimated mean. Decreases as sample size increases.
Formula:
Example: For the TTFT values above with n=5, SE = 2.59 / sqrt(5) ≈ 1.16ms.
Notes:
Type: Aggregate Statistic
A range that likely contains the true population mean with a specified confidence level (default 95%).
Formula:
Where t_critical is the critical value from the t-distribution with (n-1) degrees of freedom.
Example: For the TTFT values above with 95% confidence:
[151.2 - 2.776 * 1.16, 151.2 + 2.776 * 1.16] = [148.0ms, 154.4ms]
We’re 95% confident the true mean TTFT is between 148.0ms and 154.4ms.
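The standard error and confidence interval arithmetic above can be checked with a short sketch. The critical value 2.776 is passed in as given in the example; the function name `confidence_interval` is an assumption.

```python
import math
import statistics


def confidence_interval(values, t_critical):
    """Return (ci_low, ci_high, standard_error) for the mean of `values`."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    margin = t_critical * se
    return mean - margin, mean + margin, se


ci_low, ci_high, se = confidence_interval([150, 152, 148, 155, 151],
                                          t_critical=2.776)
print(f"SE={se:.2f}ms CI=[{ci_low:.1f}ms, {ci_high:.1f}ms]")
```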
Notes:
--confidence-level (default 0.95)
Type: Aggregate Statistic
The critical value from the t-distribution used to compute confidence intervals. Depends on sample size and confidence level.
Formula:
Where:
alpha = 1 - confidence_level
df = n - 1 (degrees of freedom)
t.ppf is the percent point function (inverse CDF) of the t-distribution
Example:
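As an illustration, the critical value can be approximated with the standard library alone by numerically inverting the t-distribution CDF. This sketch assumes a two-sided interval, that is, the quantile at 1 - alpha / 2, which matches the 2.776 used earlier for n=5 at 95%. It is not AIPerf's implementation; with SciPy available, `scipy.stats.t.ppf(1 - alpha / 2, df)` should return the same value.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t-distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0, using Simpson's rule on [0, x] plus the 0.5 below zero."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(confidence_level, n):
    """Two-sided critical value, found by bisection on the CDF."""
    if n < 2:
        raise ValueError("need at least 2 runs for n - 1 degrees of freedom")
    alpha = 1 - confidence_level
    df = n - 1
    target = 1 - alpha / 2
    lo, hi = 0.0, 200.0  # search bracket; wide enough for common settings
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(f"n=5, 95%: {t_critical(0.95, 5):.3f}")
print(f"n=5, 99%: {t_critical(0.99, 5):.3f}")
```

The critical value grows as the confidence level rises or the number of runs falls, which is why small run counts produce wide intervals.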
Notes:
The CV is a normalized measure of variability: CV = std / abs(mean)
Interpretation Guidelines:
CV < 0.05 (5%): Excellent repeatability, low noise
CV 0.05-0.10 (5-10%): Good repeatability, acceptable noise
CV 0.10-0.20 (10-20%): Fair repeatability, moderate variance
CV > 0.20 (>20%): High variability
Example:
This indicates good repeatability. The p99 TTFT varies by about 8% between runs, which is acceptable for most use cases.
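The interpretation bands above can be encoded as a small lookup. How the exact boundary values (0.05, 0.10, 0.20) are assigned is an assumption of this sketch, as is the function name `classify_cv`.

```python
def classify_cv(cv):
    """Map a CV ratio (not a percentage) to the repeatability bands above."""
    if cv < 0.05:
        return "excellent"
    if cv < 0.10:
        return "good"
    if cv <= 0.20:
        return "fair"
    return "high variability"


for value in (0.017, 0.08, 0.15, 0.30):
    print(f"CV={value:.3f} -> {classify_cv(value)}")
```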
The confidence interval tells you: “If we repeated this experiment many times, X% of the intervals computed this way would contain the true mean.”
Interpretation Guidelines:
Narrow CI: High precision, confident in the estimate
Wide CI: Lower precision, more uncertainty
--num-profile-runs
Example:
We’re 95% confident the true mean p99 TTFT is between 140.3ms and 165.1ms. The 24.8ms width suggests moderate uncertainty with 5 runs.
When comparing two configurations, consider:
Do the confidence intervals overlap?
Is the relative difference between the means larger than the CV?
Example:
No overlap in CIs → Strong evidence that Config B is slower.
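The overlap check can be sketched as below. Config A reuses the interval from the earlier example; the Config B interval is a hypothetical value chosen only to illustrate a non-overlapping case.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True when two (low, high) confidence intervals share any point."""
    (a_low, a_high), (b_low, b_high) = ci_a, ci_b
    return a_low <= b_high and b_low <= a_high


config_a = (140.3, 165.1)  # from the example above (ms)
config_b = (170.0, 190.0)  # hypothetical interval for illustration (ms)

if intervals_overlap(config_a, config_b):
    print("CIs overlap: the difference may be noise")
else:
    print("CIs do not overlap: strong evidence of a real difference")
```

Non-overlapping intervals are strong evidence of a difference, but overlapping intervals do not by themselves prove that two configurations perform the same.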
Quick check: 3 runs
Standard benchmarking: 5 runs
High-precision: 10 runs
Important: All runs use the same workload (prompts, ordering, scheduling) to ensure fair comparison.
AIPerf automatically:
--random-seed 42 if not specified (for multi-run consistency)
This ensures that observed variance is due to real system noise, not artificial differences in the workload.
You can specify your own seed:
All 5 runs will use seed 123, ensuring identical workloads.
When using multi-run with warmup:
--no-profile-run-disable-warmup-after-first to run warmup before each run (useful for long cooldown periods)
This default behavior is more efficient and provides more accurate aggregate statistics by measuring steady-state performance.
Possible causes:
Solutions:
--profile-run-cooldown-seconds to reduce correlation
--warmup-request-count to stabilize server
--num-profile-runs to better characterize variance
If some runs fail, AIPerf will:
Example output:
If exactly one run succeeds, AIPerf does not raise — instead it enters a degraded single-run mode and emits a warning:
The aggregate JSON is still produced, with std=0, ci_low == ci_high == mean, and a single_run: true flag in metadata so downstream consumers can render an “n=1, no CI” badge instead of mistaking the zero-width interval for a real measurement.
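A downstream consumer might handle the degraded case roughly as follows. The `metadata`, `single_run`, `mean`, `ci_low`, and `ci_high` names come from the description above; the surrounding layout (a top-level `metrics` mapping keyed by metric name) and the function name `format_metric` are assumptions, so check the actual aggregate file before relying on them.

```python
def format_metric(aggregate, metric_name):
    """Render a metric, suppressing the CI when only one run succeeded."""
    # Assumed layout: {"metadata": {...}, "metrics": {name: {stat: value}}}.
    stats = aggregate["metrics"][metric_name]
    if aggregate.get("metadata", {}).get("single_run"):
        # A zero-width interval from n=1 is not a real measurement of variance.
        return f"{metric_name}: {stats['mean']:.1f} (n=1, no CI)"
    return (f"{metric_name}: {stats['mean']:.1f} "
            f"[{stats['ci_low']:.1f}, {stats['ci_high']:.1f}]")


degraded = {
    "metadata": {"single_run": True},
    "metrics": {"ttft_p99": {"mean": 151.2, "std": 0.0,
                             "ci_low": 151.2, "ci_high": 151.2}},
}
print(format_metric(degraded, "ttft_p99"))
```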
If zero runs succeed, AIPerf raises:
Solution: Inspect the per-run logs for the underlying failure, fix the configuration or environment issue, and re-run.
If --num-profile-runs is large and each run takes a long time:
Reduce run duration:
--request-count 500 instead of --request-count 5000 (keep --num-prompts only when you need a larger synthetic dataset pool)
--synthetic-input-tokens-mean 100
Use cooldown strategically:
Run overnight:
This provides a good balance of precision and time investment.
After running, look at the CV for your key metrics:
Always use warmup to eliminate cold-start effects:
For reproducible experiments:
Save your command and results for future reference:
When comparing configurations:
--num-profile-runs
--random-seed
Instead of always running a fixed number of trials, you can specify a convergence criterion that stops benchmarking early once metrics stabilize. This saves time when results converge quickly and runs to the maximum when they don’t.
When --convergence-metric is set, AIPerf switches from FixedTrialsStrategy to AdaptiveStrategy:
min(3, num_profile_runs) trials
--num-profile-runs
The IID property of runs is preserved — convergence operates on independent run-level statistics and never feeds aggregated data back into the decision loop.
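The control flow described above can be sketched as a loop that always completes min(3, num_profile_runs) trials, then checks for convergence after each further trial, and stops at the --num-profile-runs maximum. This is a simplified illustration, not the actual AdaptiveStrategy code; `run_trial` and `has_converged` are hypothetical callables.

```python
def run_adaptive(run_trial, has_converged, num_profile_runs):
    """Run trials until the criterion is met or the maximum is reached."""
    min_runs = min(3, num_profile_runs)
    results = []
    for trial_index in range(1, num_profile_runs + 1):
        # Each trial is independent; only run-level results are collected.
        results.append(run_trial(trial_index))
        if len(results) >= min_runs and has_converged(results):
            break
    return results


# Stand-in trial that always reports the same value, so it converges at once.
stable = run_adaptive(lambda i: 100.0,
                      lambda r: max(r) - min(r) < 1.0,
                      num_profile_runs=10)
# Stand-in criterion that never passes, so all 10 trials run.
noisy = run_adaptive(lambda i: 100.0 * i,
                     lambda r: False,
                     num_profile_runs=10)
print(len(stable), len(noisy))
```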
Three statistical methods are available via --convergence-mode:
CI Width (default): Stops when the Student’s t confidence interval width relative to the mean falls below the threshold. Operates on run-level summary statistics.
CV (Coefficient of Variation): Stops when the CV (std/mean) across run-level values drops below the threshold.
Distribution (KS Test): Uses a two-sample Kolmogorov-Smirnov test on per-request JSONL data to detect when the latest run’s distribution matches prior runs. Catches bimodal behavior and tail shifts that summary statistics miss.
Distribution mode requires --export-level records or --export-level raw because it reads per-request JSONL data. It is rejected with --export-level summary.
For ci_width and cv, a lower threshold is stricter (harder to converge). For distribution, the threshold is a KS test p-value — convergence triggers when p_value > threshold, so a higher threshold is stricter. AIPerf logs this at runtime:
All convergence flags require --num-profile-runs > 1. The --convergence-stat flag applies to ci_width and cv modes only (not distribution).
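The opposite threshold directions described above can be summarized in a few lines. The mode strings match the names used on this page; the function name `is_converged` and the idea of passing in a precomputed value are assumptions of this sketch.

```python
def is_converged(mode, value, threshold):
    """Apply the per-mode threshold direction.

    For ci_width and cv, `value` is the relative CI width or the CV and must
    fall below the threshold. For distribution, `value` is the KS test
    p-value and must exceed the threshold.
    """
    if mode in ("ci_width", "cv"):
        return value < threshold  # lower threshold is stricter
    if mode == "distribution":
        return value > threshold  # higher threshold is stricter
    raise ValueError(f"unknown convergence mode: {mode}")


print(is_converged("cv", 0.03, 0.05))            # CV below threshold
print(is_converged("distribution", 0.03, 0.05))  # p-value below threshold
```

The same numeric pair gives opposite answers in the two calls, which is the reason to double-check the threshold whenever the mode changes.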
When --num-profile-runs is 2, AIPerf adjusts the minimum runs for convergence checks accordingly and logs a warning about reduced statistical power:
For meaningful convergence, 3+ runs is recommended.
If --convergence-metric contains a typo, AIPerf warns after the minimum runs complete with zero matching values:
When adaptive convergence is enabled and --export-level is records or raw, AIPerf produces an additional profile_export_aiperf_collated.json in the aggregate directory. This reads per-request JSONL from all runs, combines them into a single population per metric, and computes true combined percentiles (p50, p90, p95, p99).
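To illustrate what "true combined percentiles" means, the sketch below pools per-request values from several runs into one population before computing percentiles, rather than averaging per-run percentiles. The linear-interpolation percentile method and the function names are assumptions; AIPerf's exact method may differ.

```python
def percentile(sorted_values, q):
    """Percentile by linear interpolation between the two closest ranks."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def collate(runs):
    """Pool per-request values from all runs, then compute p50/p90/p95/p99."""
    combined = sorted(value for run in runs for value in run)
    return {f"p{q}": percentile(combined, q) for q in (50, 90, 95, 99)}


# Illustrative per-request latencies for three runs (arbitrary units).
runs = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11]]
print(collate(runs))
```

Pooling weights every request equally, so a run with more requests contributes more to the combined tail than a short run does.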
This complements the standard confidence aggregation — confidence aggregation operates on run-level summary stats, while detailed aggregation gives a combined distribution view over all requests.
Multi-run works with all AIPerf features:
With GPU telemetry:
With server metrics:
With trace replay:
Load aggregate results in Python:
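A minimal sketch for reading the aggregate file is shown below. The file name comes from the output description earlier on this page; the function name `load_aggregate` is an assumption, and the structure of the returned dictionary should be inspected rather than assumed.

```python
import json
from pathlib import Path


def load_aggregate(aggregate_dir):
    """Load profile_export_aiperf_aggregate.json from an aggregate directory."""
    path = Path(aggregate_dir) / "profile_export_aiperf_aggregate.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Usage (pass the path to your own run's aggregate directory):
#   data = load_aggregate("<your-artifact-dir>/aggregate")
#   print(list(data.keys()))  # inspect the top-level layout first
```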
Multi-run confidence reporting helps you:
Quick Start:
With Adaptive Convergence:
Key Metrics:
For more details, see: