Parameter Sweeping
Overview
Parameter sweeping allows you to benchmark across multiple parameter values (e.g., concurrency levels) in a single command. This enables systematic performance characterization, identification of optimal configurations, and understanding of how your system scales.
Instead of running separate benchmarks for each concurrency level, parameter sweeping automates the process and provides comprehensive analysis including:
- Pareto optimal configurations (best trade-offs)
- Best configurations for key metrics (throughput, latency)
- Confidence intervals when combined with multi-run mode
- Organized hierarchical output structure
What is Parameter Sweeping?
When you run a parameter sweep, AIPerf:
- Executes benchmarks at each parameter value sequentially
- Organizes results hierarchically for easy navigation
- Computes aggregate statistics across all values
- Identifies optimal configurations based on your objectives
- Analyzes performance across parameter combinations
This helps answer questions like:
- “What’s the optimal concurrency for my workload?”
- “How does throughput scale with concurrency?”
- “Where does latency start to degrade?”
- “What’s the best trade-off between throughput and latency?”
UI Behavior in Parameter Sweep Mode
Parameter sweep mode defaults to the simple UI, which handles per-value progress reporting well. The dashboard UI is not supported due to terminal control limitations.
Default UI Selection
When using --concurrency with a list of values, AIPerf automatically sets --ui simple unless you explicitly specify a different UI:
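For example (a sketch; the aiperf profile entry point, model name, and URL are assumptions to adapt to your setup):

```bash
# The list form of --concurrency enables sweep mode; AIPerf selects --ui simple.
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40
```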
You’ll see an informational message confirming the UI selection.
Supported UI Options
Simple UI (Default)
Shows progress bars for each sweep value - works well with parameter sweeps.
No UI
Minimal output, fastest execution - ideal for automated runs or CI/CD pipelines.
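A sketch of selecting each option explicitly; the none value for --ui is an assumption, so verify it against the CLI reference:

```bash
# Explicitly request the simple UI (the parameter sweep default):
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 --ui simple

# Disable the UI entirely for CI/CD runs (value assumed; check the CLI reference):
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 --ui none
```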
Dashboard UI Not Supported
The dashboard UI (--ui dashboard) is incompatible with parameter sweep mode. If you explicitly try to use it, you’ll get an error.
Basic Usage
Simple Concurrency Sweep
Sweep across multiple concurrency values:
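A representative invocation (model and endpoint are placeholders):

```bash
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40
```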
This runs 4 separate benchmarks with concurrency values of 10, 20, 30, and 40.
Output Structure (Single Sweep)
When running a simple sweep without confidence runs:
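An illustrative layout (the artifacts root and exact directory names may vary by version):

```
artifacts/
├── concurrency_10/     # complete benchmark results at concurrency 10
├── concurrency_20/
├── concurrency_30/
├── concurrency_40/
└── sweep_aggregate/    # analysis across all sweep values
```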
Each concurrency value has its own directory with complete benchmark results. The sweep_aggregate/ directory contains analysis across all values.
Combining Sweep with Confidence Reporting
You can combine parameter sweeping with multi-run confidence reporting to quantify variance at each parameter value. This provides the most comprehensive analysis.
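For example (placeholders as before):

```bash
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30 --num-profile-runs 5
```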
Understanding Sweep Modes
When combining sweeps with confidence runs, you can choose between two execution modes:
Repeated Mode (Default)
Executes the full sweep pattern multiple times. This preserves dynamic system behavior as load changes.
Execution pattern with --concurrency 10,20,30 --num-profile-runs 5:
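```
Trial 1: concurrency 10 → 20 → 30
Trial 2: concurrency 10 → 20 → 30
Trial 3: concurrency 10 → 20 → 30
Trial 4: concurrency 10 → 20 → 30
Trial 5: concurrency 10 → 20 → 30
```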
Use when:
- You want to capture how the system behaves as load changes
- You’re testing dynamic scaling or batching behavior
- You want to measure real-world performance patterns
Independent Mode
Executes all trials at each sweep value before moving to the next. This isolates each parameter value for independent measurement.
Execution pattern with --concurrency 10,20,30 --num-profile-runs 5:
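```
Concurrency 10: trial 1 → trial 2 → trial 3 → trial 4 → trial 5
Concurrency 20: trial 1 → trial 2 → trial 3 → trial 4 → trial 5
Concurrency 30: trial 1 → trial 2 → trial 3 → trial 4 → trial 5
```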
Use when:
- You want to isolate each concurrency level
- You’re measuring steady-state performance at each value
- You want to minimize correlation between different parameter values
To use independent mode:
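A sketch of the invocation; the mode flag name is an assumption (the other sweep flags in this guide follow the --parameter-sweep-* pattern), so confirm it in the CLI reference:

```bash
# NOTE: --parameter-sweep-mode is assumed; verify against the CLI Options reference.
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30 --num-profile-runs 5 \
  --parameter-sweep-mode independent
```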
Output Structure (Sweep + Confidence)
When combining sweep with confidence runs (repeated mode):
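Illustratively (the artifacts root is an assumption; the subdirectories follow the structure explanation below):

```
artifacts/
├── profile_runs/
│   ├── trial_0001/          # raw results for every sweep value in trial 1
│   ├── trial_0002/
│   └── ...
├── aggregate/
│   ├── concurrency_10/      # confidence statistics across trials
│   ├── concurrency_20/
│   └── concurrency_30/
└── sweep_aggregate/         # cross-value comparison and analysis
```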
Structure explanation:
- profile_runs/trial_NNNN/: Each trial’s raw results for all sweep values
- aggregate/concurrency_VV/: Confidence statistics for each concurrency value across all trials
- sweep_aggregate/: Cross-value comparison and analysis
Understanding Sweep Aggregates
The sweep aggregate output provides comprehensive analysis across all parameter values.
Example Sweep Aggregate
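An abridged, illustrative aggregate. The pareto_optimal entries and the statistic names match this guide; the surrounding field names and the numeric spreads are assumptions, so consult the Sweep Aggregates API Reference for the exact schema:

```json
{
  "pareto_optimal": [
    {"concurrency": 10},
    {"concurrency": 20},
    {"concurrency": 30},
    {"concurrency": 40}
  ],
  "per_combination": {
    "concurrency_30": {
      "request_throughput": {
        "mean": 245.3, "std": 4.1, "cv": 1.7,
        "ci_low": 240.2, "ci_high": 250.4,
        "min": 239.8, "max": 251.0
      },
      "ttft_p99": {
        "mean": 180.5, "std": 6.2, "cv": 3.4,
        "ci_low": 172.8, "ci_high": 188.2,
        "min": 171.4, "max": 189.9
      }
    }
  }
}
```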
Interpreting Per-Combination Metrics
For each parameter combination, you get:
- mean: Average across all trials (if using confidence runs)
- std: Standard deviation (variability between trials)
- cv: Coefficient of Variation (normalized variability)
- ci_low, ci_high: Confidence interval bounds
- min, max: Range of observed values
What to look for:
- Low CV (<10%): Consistent performance at this concurrency level
- High CV (>20%): High variability; may need more trials or investigation
- Narrow CI: High confidence in the mean estimate
- Wide CI: More uncertainty; consider more trials
Pareto Optimal Configurations
A configuration is Pareto optimal if no other configuration is strictly better on ALL objectives simultaneously. For parameter sweeps, AIPerf uses two competing objectives:
- Throughput (maximize)
- Latency (minimize, using p99 TTFT)
Understanding Pareto Optimality
In the example above, "pareto_optimal": [{"concurrency": 10}, {"concurrency": 20}, {"concurrency": 30}, {"concurrency": 40}] means:
- Concurrency 10: Best latency (125.4ms), but lower throughput (95.2 req/s)
  - No other config has both better latency AND better throughput
- Concurrency 20: Good latency (145.2ms) with moderate throughput (175.8 req/s)
  - Not dominated by any other configuration
  - Better latency than 30 and 40, though lower throughput
- Concurrency 30: Good balance (245.3 req/s, 180.5ms)
  - Better throughput than 10, better latency than 40
  - Represents a middle ground trade-off
- Concurrency 40: Best throughput (255.1 req/s), but higher latency (285.7ms)
  - No other config has both better throughput AND better latency
Concurrency 20 is ALSO Pareto optimal because:
- While concurrency 30 has higher throughput (245.3 vs 175.8), it has WORSE latency (180.5 vs 145.2)
- For a configuration to dominate another, it must be better or equal on ALL objectives
- Since 30’s latency is worse, it does not dominate 20
Choosing from Pareto Optimal Points
All Pareto optimal points are valid choices depending on your priorities:
- Latency-sensitive applications (real-time chat, interactive): Choose concurrency 10
- Moderate latency with good throughput: Choose concurrency 20
- Balanced workloads (general purpose): Choose concurrency 30
- Throughput-focused (batch processing, high load): Choose concurrency 40
There’s no single “best” - it depends on your service level objectives (SLOs).
Visualizing the Pareto Frontier
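A rough sketch using the example values (positions approximate):

```
  p99 TTFT (ms)
  300 |                                      ● c=40
  250 |
  200 |                              ● c=30
  150 |               ● c=20
  100 |   ● c=10
      +---------------------------------------------
         100       150       200       250
                   Throughput (req/s)
```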
All points on the frontier (●) are Pareto optimal. Each represents a different trade-off between throughput and latency.
Mode Comparison: Repeated vs Independent
When to Use Repeated Mode (Default)
Use repeated mode when:
- You want to capture dynamic system behavior as load changes
- You’re testing systems with dynamic batching or scaling
- You want to measure real-world performance patterns
- You care about how the system transitions between load levels
Example:
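```bash
# Model and endpoint are placeholders; repeated mode is the default, so no extra flag is needed.
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 --num-profile-runs 5
```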
Execution: Each trial runs the full sweep [10→20→30→40], preserving dynamic behavior.
Benefits:
- Captures system warm-up and adaptation effects
- Measures performance as load changes (realistic)
- Identifies if previous load affects current performance
Drawbacks:
- Results may show correlation between consecutive values
- Harder to isolate individual parameter effects
When to Use Independent Mode
Use independent mode when:
- You want to isolate each parameter value
- You’re measuring steady-state performance
- You want to minimize correlation between values
- You’re comparing configurations independently
Example:
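```bash
# Model and endpoint are placeholders; the mode flag name is assumed (see the earlier note).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 --num-profile-runs 5 \
  --parameter-sweep-mode independent
```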
Execution: All 5 trials at concurrency 10, then all 5 at 20, etc.
Benefits:
- Each value measured independently
- No correlation between different parameter values
- Clearer isolation of parameter effects
Drawbacks:
- Doesn’t capture dynamic behavior
- May miss system adaptation effects
- Longer total runtime (no shared warm-up)
Comparison Table
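The trade-offs above can be summarized as follows:

| Aspect | Repeated (default) | Independent |
|---|---|---|
| Execution order | Full sweep per trial (10 → 20 → 30, repeated) | All trials at one value, then the next value |
| Dynamic behavior (warm-up, adaptation) | Captured | Not captured |
| Correlation between values | Possible between consecutive values | Minimized |
| Isolation of parameter effects | Harder | Clearer |
| Best for | Real-world load patterns, dynamic batching/scaling | Steady-state, independent comparisons |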
Workload Consistency and Random Seeds
Default Seed Behavior
By default, AIPerf uses different random seeds for each sweep value to avoid artificial correlation.
Seed derivation:
- Base seed: 42 (auto-set) or user-specified via --random-seed
- Per-value seeds: base_seed + sweep_index
  - Concurrency 10: seed = 42 + 0 = 42
  - Concurrency 20: seed = 42 + 1 = 43
  - Concurrency 30: seed = 42 + 2 = 44
  - Concurrency 40: seed = 42 + 3 = 45
Why different seeds?
- Avoids artificial correlation between sweep values
- Each value gets a different but reproducible workload
- More realistic performance characterization
Using Same Seed Across Values
If you want to use the same workload for all sweep values:
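```bash
# Apply the same base seed to every sweep value (model and endpoint are placeholders).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 \
  --parameter-sweep-same-seed
```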
Seed behavior:
- All sweep values use the same seed (42 or user-specified)
- Identical prompts, ordering, and timing patterns
- Useful for comparing how different concurrency levels handle the exact same workload
When to use same seed:
- You want to isolate the effect of the parameter change
- You’re debugging specific workload behavior
- You want perfectly correlated comparisons
When NOT to use same seed:
- General performance characterization (use default)
- You want to avoid artificial correlation
- You’re measuring typical performance
Custom Base Seed
Specify your own base seed:
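```bash
# Model and endpoint are placeholders.
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 \
  --random-seed 123
```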
Per-value seeds will be: 123, 124, 125, 126 (unless --parameter-sweep-same-seed is used).
Cooldown Between Sweep Values
Use cooldown to allow the system to recover between parameter values:
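```bash
# Pause 30 seconds between consecutive sweep values (placeholders as above).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 \
  --parameter-sweep-cooldown-seconds 30
```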
When to Use Cooldown
Use cooldown when:
- System needs time to stabilize between load changes
- You’re testing systems with caching or memory effects
- You want to minimize correlation between consecutive values
- You’re running on shared infrastructure
Typical values:
- 0 seconds (default): No cooldown, fastest execution
- 10-30 seconds: Light cooldown for basic stabilization
- 60+ seconds: Heavy cooldown for systems with long memory effects
Combining Trial and Sweep Cooldowns
When using both sweep and confidence runs, you can set cooldowns at both levels:
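```bash
# 30 s between sweep values, 60 s between trials (placeholders as above).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30 --num-profile-runs 5 \
  --parameter-sweep-cooldown-seconds 30 \
  --profile-run-cooldown-seconds 60
```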
Cooldown application (repeated mode):
- --profile-run-cooldown-seconds: Between trials (between complete sweeps)
- --parameter-sweep-cooldown-seconds: Between sweep values within a trial
Cooldown application (independent mode):
- --profile-run-cooldown-seconds: Between trials within a sweep value
- --parameter-sweep-cooldown-seconds: Between sweep values
Troubleshooting
High Variance at Some Values
Symptom: Some concurrency values show high CV (>20%) while others are stable.
Possible causes:
- That concurrency level is near a system threshold
- Resource contention at that load level
- Batching or scheduling effects
Solutions:
- Increase --num-profile-runs for that value
- Add --parameter-sweep-cooldown-seconds to reduce correlation
- Investigate system behavior at that load level
- Check for resource bottlenecks (CPU, memory, GPU)
Unexpected Pareto Optimal Points
Symptom: A configuration you expected to be dominated is Pareto optimal.
Possible causes:
- High variance in measurements
- System has non-linear scaling behavior
- Measurement artifacts
Solutions:
- Increase --num-profile-runs to reduce variance
- Check CV for those values; high CV indicates instability
- Examine per-trial results for outliers
- Add cooldown to reduce correlation
No Clear Inflection Points
Symptom: Trend analysis doesn’t show clear inflection points.
Possible causes:
- Linear scaling across the range tested
- Need wider range of parameter values
- System hasn’t reached capacity
Solutions:
- Extend the sweep range (e.g., --concurrency 10,20,30,40,50,60)
- Use finer granularity (e.g., --concurrency 10,15,20,25,30)
- Push the system harder to find limits
Very Long Benchmark Times
Symptom: Sweep takes too long to complete.
Solutions:
- Reduce prompts per run: --num-prompts 500 instead of --num-prompts 5000
- Reduce trials: --num-profile-runs 3 instead of --num-profile-runs 5
- Remove cooldown: Set cooldowns to 0 if not needed
- Reduce sweep range: Test fewer values initially
- Run overnight: For comprehensive production validation
Failed Sweep Values
Symptom: Some sweep values fail while others succeed.
Behavior:
- AIPerf continues with remaining values
- Failed values excluded from aggregate analysis
- Failure details in sweep aggregate metadata
Solutions:
- Investigate why that value fails (too high load?)
- Adjust server configuration for higher load
- Increase timeout values if needed
- Check system resources at that load level
Best Practices
1. Start with a Wide Range
Begin with a wide range to understand the full performance envelope:
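```bash
# A deliberately wide first pass (model and endpoint are placeholders).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 1,5,10,25,50,100
```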
Then narrow down based on results.
2. Use Confidence Runs for Production
For production validation, always combine sweep with confidence runs:
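```bash
# Five trials per sweep value yield confidence intervals (placeholders as above).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 --num-profile-runs 5
```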
This quantifies variance and provides confidence intervals.
3. Check CV Before Drawing Conclusions
Always check the Coefficient of Variation (CV) for each value:
- CV < 10%: Results are trustworthy
- CV > 20%: Need more trials or investigation
4. Use Warmup
Always use warmup to eliminate cold-start effects:
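A sketch; the warmup flag name is an assumption, so verify it in the CLI reference:

```bash
# NOTE: --warmup-request-count is assumed; check the CLI Options reference.
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 \
  --warmup-request-count 10
```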
5. Document Your Findings
Save your sweep aggregate and document your conclusions:
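For example (all paths are hypothetical):

```bash
# Archive the aggregate analysis next to your notes (hypothetical paths).
mkdir -p benchmarks/2025-q1-sweep
cp -r artifacts/my_run/sweep_aggregate/ benchmarks/2025-q1-sweep/
echo "Chose concurrency 30: best throughput/latency balance" \
  >> benchmarks/2025-q1-sweep/NOTES.md
```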
6. Compare Apples to Apples
When comparing different configurations:
- Use the same sweep values
- Use the same number of trials
- Use the same random seed (or same seed derivation)
- Use the same workload parameters
7. Understand Your Objectives
Choose Pareto optimal points based on your SLOs:
- Latency SLO: Choose the lowest latency Pareto point
- Throughput SLO: Choose the highest throughput Pareto point
- Balanced: Choose the middle Pareto point
Advanced Usage
Combining with Other Features
Parameter sweeping works with all other AIPerf features, including GPU telemetry collection, server metrics, and goodput constraints. Add the corresponding flags to the sweep command exactly as you would for a single benchmark; see the CLI Options reference for the flag names.
Analyzing Results Programmatically
Load sweep aggregate results in Python:
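A minimal sketch, assuming the aggregate is written as JSON (the path is hypothetical; pareto_optimal is the key shown earlier in this guide):

```python
import json
from pathlib import Path

# Hypothetical path: point this at the sweep_aggregate JSON produced by your run.
aggregate_path = Path("artifacts/my_run/sweep_aggregate/sweep_aggregate.json")
data = json.loads(aggregate_path.read_text())

# Print the Pareto-optimal parameter combinations identified by the sweep.
for combo in data["pareto_optimal"]:
    print("Pareto optimal:", combo)
```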
Creating Custom Visualizations
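One simple option is to plot throughput against p99 TTFT with matplotlib; the values below are the example numbers used throughout this guide:

```python
import matplotlib.pyplot as plt

# Example values from the sweep discussed in this guide.
concurrency = [10, 20, 30, 40]
throughput = [95.2, 175.8, 245.3, 255.1]  # req/s
ttft_p99 = [125.4, 145.2, 180.5, 285.7]   # ms

fig, ax = plt.subplots()
ax.plot(throughput, ttft_p99, "o-")
for c, x, y in zip(concurrency, throughput, ttft_p99):
    ax.annotate(f"c={c}", (x, y), textcoords="offset points", xytext=(6, 6))
ax.set_xlabel("Throughput (req/s)")
ax.set_ylabel("p99 TTFT (ms)")
ax.set_title("Throughput vs. latency across the sweep")
fig.savefig("pareto_frontier.png", dpi=150)
```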
Summary
Parameter sweeping helps you:
- ✅ Systematically characterize performance across parameter values
- ✅ Identify optimal configurations with Pareto analysis
- ✅ Compare performance across different parameter combinations
- ✅ Quantify variance with confidence intervals
- ✅ Make data-driven capacity planning decisions
Quick Start:
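```bash
# A representative starting point (model and endpoint are placeholders).
aiperf profile --model my-model --url http://localhost:8000 \
  --concurrency 10,20,30,40 --num-profile-runs 5
```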
Key Concepts:
- Pareto optimal: Best trade-off configurations
- Best configurations: Highest throughput and lowest latency points
- Sweep modes: Repeated (dynamic) vs Independent (isolated)
- CV < 10%: Good repeatability
For more details, see:
- Sweep Aggregates API Reference - Complete data format documentation
- Multi-Run Confidence - Understanding confidence intervals
- CLI Options - Full parameter reference
- Metrics Reference - Detailed metric descriptions