Multi-Run Confidence Reporting
Overview
Multi-run confidence reporting allows you to run the same benchmark configuration multiple times to quantify measurement variance, assess repeatability, and compute confidence intervals for key metrics. This helps answer the critical question: “Is this performance difference real or just noise?”
What is Confidence Reporting?
When you run a single benchmark, the results can vary due to:
- System jitter (GPU clocks, background tasks)
- Network variance
- Server internal scheduling and batching dynamics
- Periodic stalls or transient errors
By running multiple trials of the same benchmark, you can:
- Quantify variance: Understand how much results vary between runs
- Assess repeatability: Determine if your measurements are stable
- Compute confidence intervals: Get honest uncertainty estimates
- Make informed decisions: Know if performance differences are statistically meaningful
UI Behavior in Multi-Run Mode
Multi-run mode automatically uses the simple UI by default for the best experience. The dashboard UI is not supported due to terminal control limitations.
Default UI Selection
When using --num-profile-runs > 1, AIPerf automatically sets --ui simple unless you explicitly specify a different UI:
You’ll see an informational message:
Supported UI Options
Simple UI (Default)
Shows progress bars for each run - works well with multi-run mode.
No UI
Minimal output, fastest execution - ideal for automated runs or CI/CD pipelines.
Dashboard UI Not Supported
The dashboard UI (--ui dashboard) is incompatible with multi-run mode due to terminal control constraints. If you explicitly try to use it, you’ll get an error:
This is a fundamental architectural limitation - Textual requires exclusive terminal control, which isn’t possible when the orchestrator coordinates multiple subprocess runs.
For Live Dashboard Monitoring
If you need live dashboard updates, run benchmarks individually:
Basic Usage
Simple Multi-Run Benchmark
Run the same benchmark 5 times:
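A representative invocation might look like the following; the `aiperf profile` entrypoint and the model/endpoint/concurrency values are illustrative, so adapt them to your setup:

```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type chat \
  --concurrency 10 \
  --num-profile-runs 5
```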
With Custom Confidence Level
Use 99% confidence intervals instead of the default 95%:
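For example (entrypoint and workload flags are illustrative, as above):

```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type chat \
  --concurrency 10 \
  --num-profile-runs 5 \
  --confidence-level 0.99
```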
With Cooldown Between Runs
Add a 10-second cooldown between runs to reduce correlation:
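For example (entrypoint and workload flags are illustrative, as above):

```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type chat \
  --concurrency 10 \
  --num-profile-runs 5 \
  --profile-run-cooldown-seconds 10
```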
Output Structure
When --num-profile-runs > 1, AIPerf creates a hierarchical output structure with an auto-generated directory name:
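A sketch of the layout, based on the per-run and aggregate artifacts described below (the directory name follows the auto-generated convention; exact contents may vary by version):

```
artifacts/llama-3-8b-openai-chat-concurrency_10/
├── run_0001/
│   ├── profile_export_aiperf.json
│   ├── profile_export_aiperf.csv
│   ├── profile_export.jsonl
│   └── inputs.json
├── run_0002/
│   └── ...
└── aggregate/
    ├── profile_export_aiperf_aggregate.json
    └── profile_export_aiperf_aggregate.csv
```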
Auto-Generated Directory Name
The directory name is automatically generated based on your benchmark configuration:
- Model name: e.g., `llama-3-8b` (from `--model`)
- Service kind and endpoint type: e.g., `openai-chat` (from `--endpoint-type`)
- Stimulus: e.g., `concurrency_10` (from `--concurrency`) or `request_rate_100` (from `--request-rate`)
Examples:
- `artifacts/gpt-4-openai-chat-concurrency_50/`
- `artifacts/mistral-7b-openai-completions-request_rate_10/`
- `artifacts/llama-2-13b-nim-embeddings-concurrency_20/`
Per-Run Artifacts
Each run’s artifacts are stored in separate directories (run_0001, run_0002, etc.) and include:
- `profile_export_aiperf.json` - Complete metrics for that run
- `profile_export_aiperf.csv` - CSV export for that run
- `profile_export.jsonl` - Per-request records
- `inputs.json` - Input prompts used
This allows you to:
- Debug outliers by examining specific runs
- Compare individual runs
- Investigate anomalies
Aggregate Artifacts
The aggregate/ directory contains statistics computed across all runs:
- `profile_export_aiperf_aggregate.json` - Aggregated statistics
- `profile_export_aiperf_aggregate.csv` - Tabular view of aggregated metrics
Understanding Aggregate Statistics
For each metric, the aggregate output includes:
- mean: Average value across all runs
- std: Standard deviation (measure of spread)
- min: Minimum value observed
- max: Maximum value observed
- cv: Coefficient of Variation (normalized variability)
- se: Standard Error (uncertainty in the mean)
- ci_low, ci_high: Confidence interval bounds
- t_critical: t-distribution critical value used
Example Aggregate Output
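A hedged sketch of the aggregate JSON: the metric name and nesting here are hypothetical, but the statistical fields match the list above, using the worked TTFT example from the metric definitions:

```json
{
  "time_to_first_token": {
    "p99": {
      "mean": 151.2,
      "std": 2.59,
      "min": 148.0,
      "max": 155.0,
      "cv": 0.017,
      "se": 1.16,
      "ci_low": 148.0,
      "ci_high": 154.4,
      "t_critical": 2.776
    }
  }
}
```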
Detailed Metric Definitions
This section provides detailed mathematical definitions for each aggregate statistic computed across multiple runs.
Mean (μ)
Type: Aggregate Statistic
The average value of the metric across all successful runs.
Formula: `mean = sum(x_i) / n`
Example: If TTFT p99 values across 5 runs are [150ms, 152ms, 148ms, 155ms, 151ms], the mean is 151.2ms.
Standard Deviation (σ)
Type: Aggregate Statistic
Measures the spread or dispersion of metric values across runs. Uses sample standard deviation (N-1 degrees of freedom).
Formula: `std = sqrt(sum((x_i - mean)^2) / (n - 1))`
Example: For the TTFT values above, std ≈ 2.59ms, indicating low variability.
Minimum
Type: Aggregate Statistic
The smallest value observed across all runs.
Example: For the TTFT values above, min = 148ms.
Maximum
Type: Aggregate Statistic
The largest value observed across all runs.
Example: For the TTFT values above, max = 155ms.
Coefficient of Variation (CV)
Type: Aggregate Statistic
A normalized measure of variability, expressed as a ratio (not percentage). Useful for comparing variability across metrics with different scales.
Formula: `cv = std / abs(mean)`
Notes:
- Uses `abs(mean)` to handle metrics that can be negative
- Returns `inf` when the mean is zero (division by zero)
- Lower CV indicates more consistent measurements
Example: For the TTFT values above, CV = 2.59 / 151.2 ≈ 0.017 (1.7%), indicating excellent repeatability.
Standard Error (SE)
Type: Aggregate Statistic
Measures the uncertainty in the estimated mean. Decreases as sample size increases.
Formula: `se = std / sqrt(n)`
Example: For the TTFT values above with n=5, SE = 2.59 / sqrt(5) ≈ 1.16ms.
Notes:
- Smaller SE indicates more precise estimate of the true mean
- SE decreases proportionally to 1/sqrt(n)
Confidence Interval (CI)
Type: Aggregate Statistic
A range that likely contains the true population mean with a specified confidence level (default 95%).
Formula: `ci = [mean - t_critical * se, mean + t_critical * se]`
Where t_critical is the critical value from the t-distribution with (n-1) degrees of freedom.
Example: For the TTFT values above with 95% confidence:
- t_critical ≈ 2.776 (for n=5, df=4)
- CI = `[151.2 - 2.776 * 1.16, 151.2 + 2.776 * 1.16]` = [148.0ms, 154.4ms]
We’re 95% confident the true mean TTFT is between 148.0ms and 154.4ms.
Notes:
- Uses t-distribution (not normal) for mathematically precise critical values
- Confidence level configurable via `--confidence-level` (default 0.95)
- CI width decreases with more runs (larger n)
t-Critical Value
Type: Aggregate Statistic
The critical value from the t-distribution used to compute confidence intervals. Depends on sample size and confidence level.
Formula: `t_critical = t.ppf(1 - alpha / 2, df)`
Where:
- `alpha = 1 - confidence_level`
- `df = n - 1` (degrees of freedom)
- `t.ppf` is the percent point function (inverse CDF) of the t-distribution
Example:
- For n=5 runs and 95% confidence: t_critical ≈ 2.776
- For n=10 runs and 95% confidence: t_critical ≈ 2.262
- For n=5 runs and 99% confidence: t_critical ≈ 4.604
Notes:
- Computed using `scipy.stats.t.ppf()` for mathematical precision
- Larger sample sizes have smaller t-critical values (approach normal distribution)
- Higher confidence levels have larger t-critical values (wider intervals)
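The worked examples above can be reproduced with a few lines of standard-library Python. The t-critical value 2.776 is taken from the table above; `scipy.stats.t.ppf(0.975, 4)` computes it exactly if SciPy is available:

```python
import math
import statistics

# TTFT p99 values (ms) from the five hypothetical runs used in the examples above
runs = [150, 152, 148, 155, 151]

n = len(runs)
mean = statistics.mean(runs)        # 151.2
std = statistics.stdev(runs)        # sample std (N-1 dof), ~2.59
cv = std / abs(mean)                # ~0.017
se = std / math.sqrt(n)             # ~1.16
t_critical = 2.776                  # t.ppf(0.975, df=4) from the table above
ci_low = mean - t_critical * se     # ~148.0
ci_high = mean + t_critical * se    # ~154.4

print(round(mean, 1), round(std, 2), round(cv, 3),
      round(se, 2), round(ci_low, 1), round(ci_high, 1))
```

The output matches the values quoted in the metric definitions above.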
Interpreting Results
Coefficient of Variation (CV)
The CV is a normalized measure of variability: CV = std / mean
Interpretation Guidelines:
- `CV < 0.05` (5%): Excellent repeatability, low noise
  - Results are very stable
  - High confidence in measurements
  - Small differences are likely meaningful
- `CV 0.05-0.10` (5-10%): Good repeatability, acceptable noise
  - Results are reasonably stable
  - Moderate confidence in measurements
  - Medium-sized differences are likely meaningful
- `CV 0.10-0.20` (10-20%): Fair repeatability, moderate variance
  - Results show noticeable variation
  - Consider running more trials
  - Only large differences are clearly meaningful
- `CV > 0.20` (>20%): High variability
  - Results are unstable
  - Investigate sources of variance
  - Increase the number of runs or use cooldown
  - Be cautious about drawing conclusions
Example:
This indicates good repeatability. The p99 TTFT varies by about 8% between runs, which is acceptable for most use cases.
Confidence Intervals (CI)
The confidence interval tells you: “If we repeated this experiment many times, X% of the time the true mean would fall in this range.”
Interpretation Guidelines:
- Narrow CI: High precision, confident in the estimate
  - The true mean is likely very close to the measured mean
  - A small sample size may still be sufficient
- Wide CI: Lower precision, more uncertainty
  - The true mean could be anywhere in a broad range
  - Consider increasing `--num-profile-runs`
  - May need to investigate sources of variance
Example:
We’re 95% confident the true mean p99 TTFT is between 140.3ms and 165.1ms. The 24.8ms width suggests moderate uncertainty with 5 runs.
Comparing Configurations
When comparing two configurations, consider:
- Do the confidence intervals overlap?
  - No overlap → Strong evidence of a real difference
  - Partial overlap → Likely a real difference, but less certain
  - Complete overlap → Difference may not be meaningful
- Is the difference larger than the CV?
  - If Config A has mean=100ms (CV=10%) and Config B has mean=120ms
  - The difference is 20%, which is 2× the CV
  - This suggests a real difference
Example:
No overlap in CIs → Strong evidence that Config B is slower.
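The overlap check can be sketched as a tiny helper; the interval endpoints below are hypothetical CIs reusing the numbers from the examples above:

```python
def intervals_overlap(a, b):
    """True if two (low, high) confidence intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical p99 TTFT confidence intervals for two configurations (ms)
config_a = (140.3, 165.1)
config_b = (171.0, 183.9)

print(intervals_overlap(config_a, config_b))  # no overlap -> evidence of a real difference
```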
When to Use More Runs
Recommended Number of Runs
- Quick check: 3 runs
  - Minimum for basic statistics
  - Good for initial exploration
- Standard benchmarking: 5 runs
  - Good balance of time and precision
  - Recommended for most use cases
- High-precision: 10 runs
  - When you need very precise estimates
  - When comparing small differences
  - When variance is high
Signs You Need More Runs
- High CV (>10%): More runs will reduce uncertainty
- Wide confidence intervals: More runs will narrow the CI
- Overlapping CIs when comparing: More runs may separate them
- Inconsistent results: More runs will clarify the true mean
Workload Consistency
Important: All runs use the same workload (prompts, ordering, scheduling) to ensure fair comparison.
AIPerf automatically:
- Sets `--random-seed 42` if not specified (for multi-run consistency)
- Uses the same prompts in the same order for all runs
- Uses the same request timing patterns
This ensures that observed variance is due to real system noise, not artificial differences in the workload.
Manual Seed Control
You can specify your own seed:
All 5 runs will use seed 123, ensuring identical workloads.
Warmup Behavior
When using multi-run with warmup:
- By default, warmup runs once before the first profile run only
- Subsequent profile runs (e.g., runs 2-5 in a 5-run benchmark) measure steady-state performance without warmup
- Warmup metrics are automatically excluded from results
- Use `--profile-run-disable-warmup-after-first false` to run warmup before each run (useful for long cooldown periods)
This default behavior is more efficient and provides more accurate aggregate statistics by measuring steady-state performance.
Troubleshooting
High Variance (CV > 20%)
Possible causes:
- System is under load from other processes
- Network instability
- Server batching/scheduling dynamics
- Insufficient warmup
Solutions:
- Use `--profile-run-cooldown-seconds` to reduce correlation
- Increase `--warmup-request-count` to stabilize the server
- Run benchmarks during low-load periods
- Investigate server configuration
- Increase `--num-profile-runs` to better characterize variance
Failed Runs
If some runs fail, AIPerf will:
- Continue with remaining runs
- Compute statistics over successful runs only
- Report failed runs in aggregate metadata
Example output:
Insufficient Successful Runs
If fewer than 2 runs succeed, you’ll get an error:
Solution: Increase --num-profile-runs or fix the underlying issue causing failures.
Very Long Benchmark Times
If --num-profile-runs is large and each run takes a long time:
- Reduce run duration:
  - Use fewer prompts: `--num-prompts 500` instead of `--num-prompts 5000`
  - Use shorter prompts: `--synthetic-input-tokens-mean 100`
- Use cooldown strategically:
  - Only add cooldown if you see high correlation between runs
  - Start without cooldown and add it if needed
- Run overnight:
  - For production validation with many runs
Best Practices
1. Start with 5 Runs
This provides a good balance of precision and time investment.
2. Check CV First
After running, look at the CV for your key metrics:
- CV < 10%: Results are trustworthy
- CV > 10%: Consider more runs or investigate variance
3. Use Warmup
Always use warmup to eliminate cold-start effects:
4. Set Random Seed for Reproducibility
For reproducible experiments:
5. Document Your Configuration
Save your command and results for future reference:
6. Compare Apples to Apples
When comparing configurations:
- Use the same `--num-profile-runs`
- Use the same `--random-seed`
- Use the same workload parameters
Adaptive Convergence (Early Stopping)
Instead of always running a fixed number of trials, you can specify a convergence criterion that stops benchmarking early once metrics stabilize. This saves time when results converge quickly and runs to the maximum when they don’t.
How It Works
When --convergence-metric is set, AIPerf switches from FixedTrialsStrategy to AdaptiveStrategy:
- Runs at least `min(3, num_profile_runs)` trials
- After each run, checks whether the convergence criterion is satisfied
- Stops early if converged; otherwise continues up to `--num-profile-runs`
The IID property of runs is preserved — convergence operates on independent run-level statistics and never feeds aggregated data back into the decision loop.
Convergence Modes
Three statistical methods are available via --convergence-mode:
CI Width (default): Stops when the Student’s t confidence interval width relative to the mean falls below the threshold. Operates on run-level summary statistics.
CV (Coefficient of Variation): Stops when the CV (std/mean) across run-level values drops below the threshold.
Distribution (KS Test): Uses a two-sample Kolmogorov-Smirnov test on per-request JSONL data to detect when the latest run’s distribution matches prior runs. Catches bimodal behavior and tail shifts that summary statistics miss.
Distribution mode requires `--export-level records` or `--export-level raw` because it reads per-request JSONL data. It is rejected with `--export-level summary`.
Threshold Semantics
For ci_width and cv, a lower threshold is stricter (harder to converge). For distribution, the threshold is a KS test p-value — convergence triggers when p_value > threshold, so a higher threshold is stricter. AIPerf logs this at runtime:
CLI Flags Reference
All convergence flags require --num-profile-runs > 1. The --convergence-stat flag applies to ci_width and cv modes only (not distribution).
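The ci_width and cv checks can be sketched as follows, assuming run-level summary values as input. This mirrors the semantics described above, not AIPerf's internal implementation:

```python
import math
import statistics

def ci_width_converged(run_values, threshold, t_critical):
    # ci_width mode: relative CI width below threshold => converged (lower = stricter)
    mean = statistics.mean(run_values)
    se = statistics.stdev(run_values) / math.sqrt(len(run_values))
    return (2 * t_critical * se) / abs(mean) < threshold

def cv_converged(run_values, threshold):
    # cv mode: coefficient of variation below threshold => converged (lower = stricter)
    return statistics.stdev(run_values) / abs(statistics.mean(run_values)) < threshold

runs = [150, 152, 148, 155, 151]  # hypothetical run-level p99 TTFT values (ms)
print(ci_width_converged(runs, 0.05, 2.776), cv_converged(runs, 0.05))
```

Distribution mode instead compares per-request samples with a two-sample KS test (e.g., `scipy.stats.ks_2samp`) and converges when the p-value exceeds the threshold.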
Reduced Run Counts
When --num-profile-runs is 2, AIPerf adjusts the minimum runs for convergence checks accordingly and logs a warning about reduced statistical power:
For meaningful convergence, 3+ runs is recommended.
Unrecognized Metric Names
If --convergence-metric contains a typo, AIPerf warns after the minimum runs complete with zero matching values:
Detailed Aggregation
When adaptive convergence is enabled and --export-level is records or raw, AIPerf produces an additional profile_export_aiperf_collated.json in the aggregate directory. This reads per-request JSONL from all runs, combines them into a single population per metric, and computes true combined percentiles (p50, p90, p95, p99).
This complements the standard confidence aggregation — confidence aggregation operates on run-level summary stats, while detailed aggregation gives a combined distribution view over all requests.
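Conceptually, the collated export pools per-request values from all runs into one population and takes percentiles over that pool. A minimal sketch, with hypothetical data and a nearest-rank percentile (AIPerf's exact interpolation method may differ):

```python
import math

def percentile(sorted_values, p):
    # Nearest-rank percentile over a pre-sorted list (interpolation method is an assumption)
    k = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]

runs = [                       # hypothetical per-request latencies (ms) from three runs
    [100, 110, 120, 130],
    [105, 115, 125, 135],
    [102, 112, 122, 132],
]
pooled = sorted(v for run in runs for v in run)   # one combined population
print(percentile(pooled, 50), percentile(pooled, 99))
```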
Output Structure with Adaptive Convergence
Example Collated Output
Advanced Usage
Combining with Other Features
Multi-run works with all AIPerf features:
With GPU telemetry:
With server metrics:
With trace replay:
Analyzing Results Programmatically
Load aggregate results in Python:
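A hedged sketch: the exact schema of `profile_export_aiperf_aggregate.json` is an assumption, but the field names (mean, std, cv, se, ci_low, ci_high) follow the statistics documented above. Inspect your own file and adapt the keys:

```python
import json
from pathlib import Path

def load_aggregate(path):
    # Schema is an assumption; adapt key access to your actual file
    return json.loads(Path(path).read_text())

def summarize(entry):
    # One-line summary for a single metric's aggregate statistics
    return (f"mean={entry['mean']:.1f} cv={entry['cv']:.3f} "
            f"ci=[{entry['ci_low']:.1f}, {entry['ci_high']:.1f}]")

example = {"mean": 151.2, "std": 2.59, "cv": 0.017, "se": 1.16,
           "ci_low": 148.0, "ci_high": 154.4}
print(summarize(example))
```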
Summary
Multi-run confidence reporting helps you:
- ✅ Quantify measurement variance
- ✅ Assess repeatability with CV
- ✅ Compute confidence intervals
- ✅ Make statistically informed decisions
- ✅ Debug outliers with per-run artifacts
Quick Start:
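As above, the entrypoint and workload flags here are illustrative placeholders:

```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type chat \
  --concurrency 10 \
  --num-profile-runs 5
```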
With Adaptive Convergence:
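A hedged sketch; the metric name `time_to_first_token` is hypothetical, and entrypoint and workload flags are illustrative as above:

```bash
aiperf profile \
  --model llama-3-8b \
  --endpoint-type chat \
  --concurrency 10 \
  --num-profile-runs 10 \
  --convergence-metric time_to_first_token \
  --convergence-mode ci_width
```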
Key Metrics:
- CV < 10%: Good repeatability
- Narrow CI: High precision
- No CI overlap: Strong evidence of difference
For more details, see:
- CLI Options - Full parameter reference
- Metrics Reference - Detailed metric descriptions
- Architecture - How multi-run orchestration works