Parameter Sweeping

Overview

Parameter sweeping allows you to benchmark across multiple parameter values (e.g., concurrency levels) in a single command. This enables systematic performance characterization, identification of optimal configurations, and understanding of how your system scales.

Instead of running separate benchmarks for each concurrency level, parameter sweeping automates the process and provides comprehensive analysis including:

  • Pareto optimal configurations (best trade-offs)
  • Best configurations for key metrics (throughput, latency)
  • Confidence intervals when combined with multi-run mode
  • Organized hierarchical output structure

What is Parameter Sweeping?

When you run a parameter sweep, AIPerf:

  1. Executes benchmarks at each parameter value sequentially
  2. Organizes results hierarchically for easy navigation
  3. Computes aggregate statistics across all values
  4. Identifies optimal configurations based on your objectives
  5. Analyzes performance across parameter combinations

This helps answer questions like:

  • “What’s the optimal concurrency for my workload?”
  • “How does throughput scale with concurrency?”
  • “Where does latency start to degrade?”
  • “What’s the best trade-off between throughput and latency?”

UI Behavior in Parameter Sweep Mode

Parameter sweep mode defaults to the simple UI, which works best for sweeps. The dashboard UI is not supported due to terminal control limitations.

Default UI Selection

When using --concurrency with a list of values, AIPerf automatically sets --ui simple unless you explicitly specify a different UI:

$# These are equivalent - simple UI is auto-selected
$aiperf profile --concurrency 10,20,30,40 ...
$aiperf profile --concurrency 10,20,30,40 --ui simple ...

You’ll see an informational message:

Parameter sweep mode: UI automatically set to 'simple' (use '--ui none' to disable UI output)

Supported UI Options

Simple UI (Default)

$aiperf profile \
> --concurrency 10,20,30,40 \
> --ui simple \
> ...

Shows progress bars for each sweep value - works well with parameter sweeps.

No UI

$aiperf profile \
> --concurrency 10,20,30,40 \
> --ui none \
> ...

Minimal output, fastest execution - ideal for automated runs or CI/CD pipelines.

Dashboard UI Not Supported

The dashboard UI (--ui dashboard) is incompatible with parameter sweep mode. If you explicitly try to use it, you’ll get an error:

$aiperf profile --concurrency 10,20,30,40 --ui dashboard ...
ValueError: Dashboard UI is not supported with parameter sweeps
due to terminal control limitations. Please use '--ui simple' or '--ui none' instead.

Basic Usage

Simple Concurrency Sweep

Sweep across multiple concurrency values:

$aiperf profile \
> --model llama-3-8b \
> --endpoint-type chat \
> --url http://localhost:8000/v1/chat/completions \
> --concurrency 10,20,30,40 \
> --num-prompts 1000

This runs 4 separate benchmarks with concurrency values of 10, 20, 30, and 40.
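Conceptually, the comma-separated list expands into one benchmark per value. A minimal sketch of that expansion (illustrative only, not AIPerf's actual parser):

```python
# Sketch of how a comma-separated sweep list expands into individual
# benchmark values (illustrative; not AIPerf's actual parser).
def parse_sweep_values(spec: str) -> list[int]:
    """Parse a value like '10,20,30,40' into a list of sweep values."""
    return [int(v) for v in spec.split(",") if v.strip()]

print(parse_sweep_values("10,20,30,40"))  # [10, 20, 30, 40]
```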

Output Structure (Single Sweep)

When running a simple sweep without confidence runs:

artifacts/
  llama-3-8b-openai-chat-concurrency_sweep/
    concurrency_10/
      profile_export_aiperf.json
      profile_export_aiperf.csv
      profile_export.jsonl
      inputs.json
    concurrency_20/
      profile_export_aiperf.json
      ...
    concurrency_30/
      ...
    concurrency_40/
      ...
    sweep_aggregate/
      profile_export_aiperf_sweep.json
      profile_export_aiperf_sweep.csv

Each concurrency value has its own directory with complete benchmark results. The sweep_aggregate/ directory contains analysis across all values.
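For custom analysis, the per-value result files can be loaded directly from this layout. A minimal sketch, assuming the directory structure above (the helper name is hypothetical; adjust `root` to your artifact directory):

```python
# Sketch: load each per-value result file from the layout above.
# The helper name is hypothetical; not part of AIPerf itself.
import json
from pathlib import Path

def load_sweep_results(root: Path) -> dict[int, dict]:
    """Map each swept concurrency value to its parsed result JSON."""
    results = {}
    for value_dir in sorted(root.glob("concurrency_*")):
        value = int(value_dir.name.split("_")[1])
        results[value] = json.loads(
            (value_dir / "profile_export_aiperf.json").read_text())
    return results

# Example usage (path taken from the layout above):
# results = load_sweep_results(Path("artifacts/llama-3-8b-openai-chat-concurrency_sweep"))
```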

Combining Sweep with Confidence Reporting

You can combine parameter sweeping with multi-run confidence reporting to quantify variance at each parameter value. This provides the most comprehensive analysis.

$aiperf profile \
> --model llama-3-8b \
> --endpoint-type chat \
> --url http://localhost:8000/v1/chat/completions \
> --concurrency 10,20,30,40 \
> --num-profile-runs 5 \
> --num-prompts 1000

Understanding Sweep Modes

When combining sweeps with confidence runs, you can choose between two execution modes:

Repeated Mode (Default)

Executes the full sweep pattern multiple times. This preserves dynamic system behavior as load changes.

Execution pattern with --concurrency 10,20,30 --num-profile-runs 5:

Trial 1: [10 → 20 → 30]
Trial 2: [10 → 20 → 30]
Trial 3: [10 → 20 → 30]
Trial 4: [10 → 20 → 30]
Trial 5: [10 → 20 → 30]

Use when:

  • You want to capture how the system behaves as load changes
  • You’re testing dynamic scaling or batching behavior
  • You want to measure real-world performance patterns

Independent Mode

Executes all trials at each sweep value before moving to the next. This isolates each parameter value for independent measurement.

Execution pattern with --concurrency 10,20,30 --num-profile-runs 5:

Value 10: [trial1, trial2, trial3, trial4, trial5]
Value 20: [trial1, trial2, trial3, trial4, trial5]
Value 30: [trial1, trial2, trial3, trial4, trial5]

Use when:

  • You want to isolate each concurrency level
  • You’re measuring steady-state performance at each value
  • You want to minimize correlation between different parameter values

To use independent mode:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --num-profile-runs 5 \
> --parameter-sweep-mode independent \
> ...
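The two execution patterns above can be sketched as plain orderings of (trial, value) pairs (an illustration of the scheduling, not AIPerf internals):

```python
# Sketch of the two sweep execution orders (illustrative; not AIPerf internals).
# Each element is a (trial, concurrency) pair in the order it would run.
def execution_order(values, num_trials, mode="repeated"):
    if mode == "repeated":
        # Each trial runs the full sweep: [10 -> 20 -> 30], repeated per trial.
        return [(t, v) for t in range(1, num_trials + 1) for v in values]
    # Independent: all trials at one value before moving to the next.
    return [(t, v) for v in values for t in range(1, num_trials + 1)]

print(execution_order([10, 20, 30], 2, "repeated"))
# [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (2, 30)]
print(execution_order([10, 20, 30], 2, "independent"))
# [(1, 10), (2, 10), (1, 20), (2, 20), (1, 30), (2, 30)]
```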

Output Structure (Sweep + Confidence)

When combining sweep with confidence runs (repeated mode):

artifacts/
  llama-3-8b-openai-chat-concurrency_sweep/
    profile_runs/
      trial_0001/
        concurrency_10/
          profile_export_aiperf.json
          ...
        concurrency_20/
          ...
        concurrency_30/
          ...
        concurrency_40/
          ...
      trial_0002/
        concurrency_10/
        concurrency_20/
        concurrency_30/
        concurrency_40/
      ...
      trial_0005/
        ...
    aggregate/
      concurrency_10/
        profile_export_aiperf_aggregate.json  # Confidence stats across 5 trials
        profile_export_aiperf_aggregate.csv
      concurrency_20/
        ...
      concurrency_30/
        ...
      concurrency_40/
        ...
    sweep_aggregate/
      profile_export_aiperf_sweep.json  # Comparison across concurrency values
      profile_export_aiperf_sweep.csv

Structure explanation:

  • profile_runs/trial_NNNN/: Each trial’s raw results for all sweep values
  • aggregate/concurrency_VV/: Confidence statistics for each concurrency value across all trials
  • sweep_aggregate/: Cross-value comparison and analysis

Understanding Sweep Aggregates

The sweep aggregate output provides comprehensive analysis across all parameter values.

Example Sweep Aggregate

{
  "metadata": {
    "aggregation_type": "sweep",
    "sweep_parameters": [
      {
        "name": "concurrency",
        "values": [10, 20, 30, 40]
      }
    ],
    "num_combinations": 4,
    "num_trials_per_value": 5,
    "sweep_mode": "repeated",
    "confidence_level": 0.95
  },
  "per_combination_metrics": [
    {
      "parameters": {
        "concurrency": 10
      },
      "metrics": {
        "request_throughput_avg": {
          "mean": 95.2,
          "std": 3.1,
          "min": 91.5,
          "max": 99.0,
          "cv": 0.033,
          "ci_low": 91.6,
          "ci_high": 98.8,
          "unit": "requests/sec"
        },
        "ttft_p99_ms": {
          "mean": 125.4,
          "std": 8.2,
          "cv": 0.065,
          "ci_low": 115.7,
          "ci_high": 135.1,
          "unit": "ms"
        }
      }
    },
    {
      "parameters": {
        "concurrency": 20
      },
      "metrics": {
        "request_throughput_avg": {
          "mean": 175.8,
          "std": 5.4,
          "cv": 0.031,
          "unit": "requests/sec"
        },
        "ttft_p99_ms": {
          "mean": 145.2,
          "std": 10.1,
          "cv": 0.070,
          "unit": "ms"
        }
      }
    },
    {
      "parameters": {
        "concurrency": 30
      },
      "metrics": {
        "request_throughput_avg": {
          "mean": 245.3,
          "std": 8.2,
          "cv": 0.033,
          "unit": "requests/sec"
        },
        "ttft_p99_ms": {
          "mean": 180.5,
          "std": 12.4,
          "cv": 0.069,
          "unit": "ms"
        }
      }
    },
    {
      "parameters": {
        "concurrency": 40
      },
      "metrics": {
        "request_throughput_avg": {
          "mean": 255.1,
          "std": 12.3,
          "cv": 0.048,
          "unit": "requests/sec"
        },
        "ttft_p99_ms": {
          "mean": 285.7,
          "std": 18.5,
          "cv": 0.065,
          "unit": "ms"
        }
      }
    }
  ],
  "best_configurations": {
    "best_throughput": {
      "parameters": {
        "concurrency": 40
      },
      "metric": 255.1,
      "unit": "requests/sec"
    },
    "best_latency_p99": {
      "parameters": {
        "concurrency": 10
      },
      "metric": 125.4,
      "unit": "ms"
    }
  },
  "pareto_optimal": [
    {"concurrency": 10},
    {"concurrency": 20},
    {"concurrency": 30},
    {"concurrency": 40}
  ]
}

Interpreting Per-Combination Metrics

For each parameter combination, you get:

  • mean: Average across all trials (if using confidence runs)
  • std: Standard deviation (variability between trials)
  • cv: Coefficient of Variation (normalized variability)
  • ci_low, ci_high: Confidence interval bounds
  • min, max: Range of observed values

What to look for:

  • Low CV (<10%): Consistent performance at this concurrency level
  • High CV (>20%): High variability, may need more trials or investigation
  • Narrow CI: High confidence in the mean estimate
  • Wide CI: More uncertainty, consider more trials
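These statistics can be reproduced from raw per-trial values. A sketch using a normal-approximation 95% CI (z = 1.96); AIPerf's exact method may differ (e.g., a t-distribution for small trial counts):

```python
# Sketch of deriving mean/std/cv/CI from per-trial values.
# Uses a normal-approximation 95% CI; AIPerf's exact method may differ.
import statistics

def summarize(trial_values, z=1.96):
    mean = statistics.mean(trial_values)
    std = statistics.stdev(trial_values)       # sample standard deviation
    half = z * std / len(trial_values) ** 0.5  # CI half-width
    return {
        "mean": mean, "std": std, "cv": std / mean,
        "ci_low": mean - half, "ci_high": mean + half,
        "min": min(trial_values), "max": max(trial_values),
    }

# Throughput across 5 hypothetical trials at one concurrency value:
stats = summarize([91.5, 94.0, 95.8, 96.2, 99.0])
print(f"mean={stats['mean']:.1f} cv={stats['cv']:.3f}")
```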

Pareto Optimal Configurations

A configuration is Pareto optimal if no other configuration is strictly better on ALL objectives simultaneously. For parameter sweeps, AIPerf uses two competing objectives:

  • Throughput (maximize)
  • Latency (minimize, using p99 TTFT)

Understanding Pareto Optimality

In the example above, "pareto_optimal": [{"concurrency": 10}, {"concurrency": 20}, {"concurrency": 30}, {"concurrency": 40}] means:

  • Concurrency 10: Best latency (125.4ms), but lower throughput (95.2 req/s)

    • No other config has both better latency AND better throughput
  • Concurrency 20: Good latency (145.2ms) with moderate throughput (175.8 req/s)

    • Not dominated by any other configuration
    • Better latency than 30 and 40, though lower throughput
  • Concurrency 30: Good balance (245.3 req/s, 180.5ms)

    • Better throughput than 10, better latency than 40
    • Represents a middle ground trade-off
  • Concurrency 40: Best throughput (255.1 req/s), but higher latency (285.7ms)

    • No other config has both better throughput AND better latency

Concurrency 20 is ALSO Pareto optimal because:

  • While concurrency 30 has higher throughput (245.3 vs 175.8), it has WORSE latency (180.5 vs 145.2)
  • For a configuration to dominate another, it must be better or equal on ALL objectives
  • Since 30’s latency is worse, it does not dominate 20
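The dominance rule above can be sketched directly in a few lines (illustrative; not AIPerf's implementation): a configuration dominates another only if it is at least as good on both objectives and strictly better on at least one.

```python
# Sketch of the Pareto-dominance rule described above (illustrative only).
def pareto_optimal(configs):
    """configs: list of (name, throughput, latency) tuples."""
    def dominates(a, b):
        better_or_equal = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return better_or_equal and strictly_better
    return [c[0] for c in configs
            if not any(dominates(o, c) for o in configs if o is not c)]

# (concurrency, throughput req/s, TTFT p99 ms) from the example above:
configs = [(10, 95.2, 125.4), (20, 175.8, 145.2),
           (30, 245.3, 180.5), (40, 255.1, 285.7)]
print(pareto_optimal(configs))  # [10, 20, 30, 40] - none dominates another
```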

Choosing from Pareto Optimal Points

All Pareto optimal points are valid choices depending on your priorities:

  • Latency-sensitive applications (real-time chat, interactive): Choose concurrency 10
  • Moderate latency with good throughput: Choose concurrency 20
  • Balanced workloads (general purpose): Choose concurrency 30
  • Throughput-focused (batch processing, high load): Choose concurrency 40

There’s no single “best” - it depends on your service level objectives (SLOs).

Visualizing the Pareto Frontier

Throughput (req/s)
    ^
260 |                                                  ● 40 (Pareto optimal)
240 |                  ● 30 (Pareto optimal)
220 |
200 |
180 |       ● 20 (Pareto optimal)
160 |
140 |
120 |
100 |  ● 10 (Pareto optimal)
 80 |
    +----------------------------------------------------> Latency (ms)
     120   140   160   180   200   220   240   260   280

All points on the frontier (●) are Pareto optimal. Each represents a different trade-off between throughput and latency.

Mode Comparison: Repeated vs Independent

When to Use Repeated Mode (Default)

Use repeated mode when:

  • You want to capture dynamic system behavior as load changes
  • You’re testing systems with dynamic batching or scaling
  • You want to measure real-world performance patterns
  • You care about how the system transitions between load levels

Example:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --num-profile-runs 5 \
> --parameter-sweep-mode repeated \
> ...

Execution: Each trial runs the full sweep [10→20→30→40], preserving dynamic behavior.

Benefits:

  • Captures system warm-up and adaptation effects
  • Measures performance as load changes (realistic)
  • Identifies if previous load affects current performance

Drawbacks:

  • Results may show correlation between consecutive values
  • Harder to isolate individual parameter effects

When to Use Independent Mode

Use independent mode when:

  • You want to isolate each parameter value
  • You’re measuring steady-state performance
  • You want to minimize correlation between values
  • You’re comparing configurations independently

Example:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --num-profile-runs 5 \
> --parameter-sweep-mode independent \
> ...

Execution: All 5 trials at concurrency 10, then all 5 at 20, etc.

Benefits:

  • Each value measured independently
  • No correlation between different parameter values
  • Clearer isolation of parameter effects

Drawbacks:

  • Doesn’t capture dynamic behavior
  • May miss system adaptation effects
  • Longer total runtime (no shared warm-up)

Comparison Table

| Aspect | Repeated Mode | Independent Mode |
|---|---|---|
| Execution | [sweep] × N trials | N trials × [sweep] |
| Dynamic behavior | ✅ Preserved | ❌ Not captured |
| Isolation | ❌ May have correlation | ✅ Fully isolated |
| Use case | Real-world patterns | Steady-state comparison |
| Warm-up | Shared across sweep | Per value |
| Default | ✅ Yes | No |

Workload Consistency and Random Seeds

Default Seed Behavior

By default, AIPerf uses different random seeds for each sweep value to avoid artificial correlation:

$# Default behavior
$aiperf profile --concurrency 10,20,30,40 ...

Seed derivation:

  • Base seed: 42 (auto-set) or user-specified via --random-seed
  • Per-value seeds: base_seed + sweep_index
    • Concurrency 10: seed = 42 + 0 = 42
    • Concurrency 20: seed = 42 + 1 = 43
    • Concurrency 30: seed = 42 + 2 = 44
    • Concurrency 40: seed = 42 + 3 = 45

Why different seeds?

  • Avoids artificial correlation between sweep values
  • Each value gets a different but reproducible workload
  • More realistic performance characterization
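The derivation above amounts to `base_seed + sweep_index` per value. A one-function sketch (illustrative; not AIPerf internals):

```python
# Sketch of the per-value seed derivation described above (illustrative only).
def per_value_seed(base_seed: int, sweep_index: int, same_seed: bool = False) -> int:
    # Default: offset the base seed by the sweep index for a distinct,
    # reproducible workload per value. With same_seed, reuse the base seed.
    return base_seed if same_seed else base_seed + sweep_index

values = [10, 20, 30, 40]
print([per_value_seed(42, i) for i in range(len(values))])  # [42, 43, 44, 45]
print([per_value_seed(42, i, same_seed=True) for i in range(len(values))])  # [42, 42, 42, 42]
```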

Using Same Seed Across Values

If you want to use the same workload for all sweep values:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --parameter-sweep-same-seed \
> ...

Seed behavior:

  • All sweep values use the same seed (42 or user-specified)
  • Identical prompts, ordering, and timing patterns
  • Useful for comparing how different concurrency levels handle the exact same workload

When to use same seed:

  • You want to isolate the effect of the parameter change
  • You’re debugging specific workload behavior
  • You want perfectly correlated comparisons

When NOT to use same seed:

  • General performance characterization (use default)
  • You want to avoid artificial correlation
  • You’re measuring typical performance

Custom Base Seed

Specify your own base seed:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --random-seed 123 \
> ...

Per-value seeds will be: 123, 124, 125, 126 (unless --parameter-sweep-same-seed is used).

Cooldown Between Sweep Values

Use cooldown to allow the system to recover between parameter values:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --parameter-sweep-cooldown-seconds 30.0 \
> ...

When to Use Cooldown

Use cooldown when:

  • System needs time to stabilize between load changes
  • You’re testing systems with caching or memory effects
  • You want to minimize correlation between consecutive values
  • You’re running on shared infrastructure

Typical values:

  • 0 seconds (default): No cooldown, fastest execution
  • 10-30 seconds: Light cooldown for basic stabilization
  • 60+ seconds: Heavy cooldown for systems with long memory effects

Combining Trial and Sweep Cooldowns

When using both sweep and confidence runs, you can set cooldowns at both levels:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --num-profile-runs 5 \
> --profile-run-cooldown-seconds 10.0 \
> --parameter-sweep-cooldown-seconds 30.0 \
> ...

Cooldown application (repeated mode):

  • --profile-run-cooldown-seconds: Between trials (between complete sweeps)
  • --parameter-sweep-cooldown-seconds: Between sweep values within a trial

Cooldown application (independent mode):

  • --profile-run-cooldown-seconds: Between trials within a sweep value
  • --parameter-sweep-cooldown-seconds: Between sweep values
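From the rules above, the total cooldown overhead in repeated mode can be estimated with simple arithmetic (a back-of-the-envelope sketch; actual scheduling details belong to AIPerf):

```python
# Back-of-the-envelope total cooldown time for repeated mode (sketch only).
def total_cooldown_seconds(num_values, num_trials, sweep_cooldown, trial_cooldown):
    # Within each trial: (num_values - 1) cooldowns between sweep values.
    # Between trials: (num_trials - 1) cooldowns between complete sweeps.
    return (num_trials * (num_values - 1) * sweep_cooldown
            + (num_trials - 1) * trial_cooldown)

# 4 sweep values, 5 trials, 30 s sweep cooldown, 10 s trial cooldown:
print(total_cooldown_seconds(4, 5, 30.0, 10.0))  # 490.0
```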

Troubleshooting

High Variance at Some Values

Symptom: Some concurrency values show high CV (>20%) while others are stable.

Possible causes:

  • That concurrency level is near a system threshold
  • Resource contention at that load level
  • Batching or scheduling effects

Solutions:

  1. Increase --num-profile-runs for that value
  2. Add --parameter-sweep-cooldown-seconds to reduce correlation
  3. Investigate system behavior at that load level
  4. Check for resource bottlenecks (CPU, memory, GPU)

Unexpected Pareto Optimal Points

Symptom: A configuration you expected to be dominated is Pareto optimal.

Possible causes:

  • High variance in measurements
  • System has non-linear scaling behavior
  • Measurement artifacts

Solutions:

  1. Increase --num-profile-runs to reduce variance
  2. Check CV for those values - high CV indicates instability
  3. Examine per-trial results for outliers
  4. Add cooldown to reduce correlation

No Clear Inflection Points

Symptom: Trend analysis doesn’t show clear inflection points.

Possible causes:

  • Linear scaling across the range tested
  • Need wider range of parameter values
  • System hasn’t reached capacity

Solutions:

  1. Extend the sweep range (e.g., --concurrency 10,20,30,40,50,60)
  2. Use finer granularity (e.g., --concurrency 10,15,20,25,30)
  3. Push the system harder to find limits

Very Long Benchmark Times

Symptom: Sweep takes too long to complete.

Solutions:

  1. Reduce prompts per run: --num-prompts 500 instead of --num-prompts 5000
  2. Reduce trials: --num-profile-runs 3 instead of --num-profile-runs 5
  3. Remove cooldown: Set cooldowns to 0 if not needed
  4. Reduce sweep range: Test fewer values initially
  5. Run overnight: For comprehensive production validation

Failed Sweep Values

Symptom: Some sweep values fail while others succeed.

Behavior:

  • AIPerf continues with remaining values
  • Failed values excluded from aggregate analysis
  • Failure details in sweep aggregate metadata

Example output:

{
  "metadata": {
    "num_combinations": 4,
    "num_successful_runs": 3,
    "failed_runs": [
      {
        "run_index": 3,
        "error": "Connection timeout after 60s"
      }
    ]
  }
}

Solutions:

  1. Investigate why that value fails (too high load?)
  2. Adjust server configuration for higher load
  3. Increase timeout values if needed
  4. Check system resources at that load level

Best Practices

1. Start with a Wide Range

Begin with a wide range to understand the full performance envelope:

$aiperf profile --concurrency 5,10,20,40,80 ...

Then narrow down based on results.

2. Use Confidence Runs for Production

For production validation, always combine sweep with confidence runs:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --num-profile-runs 5 \
> ...

This quantifies variance and provides confidence intervals.

3. Check CV Before Drawing Conclusions

Always check the Coefficient of Variation (CV) for each value:

  • CV < 10%: Results are trustworthy
  • CV > 20%: Need more trials or investigation

4. Use Warmup

Always use warmup to eliminate cold-start effects:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --warmup-request-count 100 \
> ...

5. Document Your Findings

Save your sweep aggregate and document your conclusions:

$# Save command for reproducibility
$echo "aiperf profile --concurrency 10,20,30,40 ..." > benchmark_command.txt
$
$# Document findings
$cat > findings.md << EOF
>## Benchmark Results
>
>- Optimal concurrency: 30 (best balance)
>- Pareto optimal points: 10, 30, 40
>- Throughput inflection: 30 (scaling slows)
>- Latency inflection: 40 (sharp degradation)
>
>Recommendation: Operate at concurrency 30 for production.
>EOF

6. Compare Apples to Apples

When comparing different configurations:

  • Use the same sweep values
  • Use the same number of trials
  • Use the same random seed (or same seed derivation)
  • Use the same workload parameters

7. Understand Your Objectives

Choose Pareto optimal points based on your SLOs:

  • Latency SLO: Choose the lowest latency Pareto point
  • Throughput SLO: Choose the highest throughput Pareto point
  • Balanced: Choose the middle Pareto point

Advanced Usage

Combining with Other Features

Parameter sweeping works with all AIPerf features:

With GPU telemetry:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --gpu-telemetry http://localhost:9400/metrics \
> ...

With server metrics:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --server-metrics http://localhost:8000/metrics \
> ...

With goodput constraints:

$aiperf profile \
> --concurrency 10,20,30,40 \
> --goodput "time_to_first_token:100 inter_token_latency:10" \
> ...

Analyzing Results Programmatically

Load sweep aggregate results in Python:

import json
import pandas as pd

# Load sweep aggregate
with open('artifacts/.../sweep_aggregate/profile_export_aiperf_sweep.json') as f:
    sweep = json.load(f)

# Extract throughput and latency for each combination
data = []
for combo in sweep['per_combination_metrics']:
    params = combo['parameters']
    metrics = combo['metrics']
    data.append({
        'concurrency': params['concurrency'],
        'throughput': metrics['request_throughput_avg']['mean'],
        'latency_p99': metrics['ttft_p99_ms']['mean'],
        'throughput_cv': metrics['request_throughput_avg'].get('cv', 0),
        'latency_cv': metrics['ttft_p99_ms'].get('cv', 0),
    })

df = pd.DataFrame(data).sort_values('concurrency')
print(df)

# Identify Pareto optimal points
pareto_optimal = sweep['pareto_optimal']
pareto_concurrency_values = [p['concurrency'] for p in pareto_optimal]
print(f"Pareto optimal concurrency values: {pareto_concurrency_values}")

# Get best configurations
best_configs = sweep['best_configurations']
print(f"Best throughput: {best_configs['best_throughput']['parameters']}")
print(f"Best latency: {best_configs['best_latency_p99']['parameters']}")

Creating Custom Visualizations

import matplotlib.pyplot as plt

# Plot throughput vs latency (Pareto frontier)
fig, ax = plt.subplots(figsize=(10, 6))

for _, row in df.iterrows():
    is_pareto = row['concurrency'] in pareto_concurrency_values
    marker = 'o' if is_pareto else 'x'
    color = 'blue' if is_pareto else 'gray'
    label = f"C={row['concurrency']}" + (" (Pareto)" if is_pareto else "")
    ax.scatter(row['latency_p99'], row['throughput'],
               marker=marker, s=100, color=color,
               label=label)

ax.set_xlabel('Latency P99 (ms)')
ax.set_ylabel('Throughput (req/s)')
ax.set_title('Pareto Frontier: Throughput vs Latency')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('pareto_frontier.png')

Summary

Parameter sweeping helps you:

  • ✅ Systematically characterize performance across parameter values
  • ✅ Identify optimal configurations with Pareto analysis
  • ✅ Compare performance across different parameter combinations
  • ✅ Quantify variance with confidence intervals
  • ✅ Make data-driven capacity planning decisions

Quick Start:

$# Simple sweep
$aiperf profile --concurrency 10,20,30,40 [other options]
$
$# Sweep with confidence (recommended)
$aiperf profile --concurrency 10,20,30,40 --num-profile-runs 5 [other options]

Key Concepts:

  • Pareto optimal: Best trade-off configurations
  • Best configurations: Highest throughput and lowest latency points
  • Sweep modes: Repeated (dynamic) vs Independent (isolated)
  • CV < 10%: Good repeatability

For more details, see: