Warmup Phase Configuration
The warmup phase runs before your actual benchmark to prepare the system for steady-state measurement. This guide explains when and how to configure warmup for accurate benchmarking results.
Why Use Warmup?
When benchmarking starts, several “cold-start” effects can pollute your measurements: fresh TCP/TLS connections, empty server-side caches (for example, KV or prefix caches), lazy model and runtime initialization, and autoscalers that have not yet reacted to load. A warmup phase lets these effects settle so that profiling measures steady-state behavior.
Quick Start
Add warmup with a simple request count:
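As a sketch, assuming the CLI entry point is `aiperf profile` and the count-based flag is named `--warmup-request-count` (model and URL are placeholders):

```shell
# Placeholder model/URL; verify flag names with your AIPerf version.
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --warmup-request-count 50 \
  --request-count 500
```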
This sends 50 warmup requests before the 500 profiling requests begin. Warmup metrics are discarded.
Warmup Trigger Options
You can trigger warmup with count-based or duration-based stopping:
Count-Based Warmup
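Count-based warmup stops after a fixed number of completed requests. A hedged example, with a placeholder model/URL and an assumed flag name:

```shell
# Warm up with exactly 100 requests, then profile with 1000.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 100 --request-count 1000
```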
Duration-Based Warmup
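Duration-based warmup stops after a time limit. The `--warmup-duration` flag named later in this guide is assumed here to take seconds:

```shell
# Warm up for 30 seconds (units assumed), then profile.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-duration 30 --request-count 1000
```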
Combined (First One Wins)
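Specifying both conditions ends warmup as soon as either one is satisfied. An illustrative sketch (flag names assumed):

```shell
# Warmup ends at 100 requests or 30 seconds, whichever comes first.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 100 --warmup-duration 30 \
  --request-count 1000
```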
Warmup-Specific Load Settings
By default, warmup inherits your profiling settings. Override them for different warmup behavior:
Different Concurrency
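For example, warming up at 20 concurrent requests while profiling at 100 might look like this (the `--warmup-concurrency` flag name is an assumption):

```shell
# Gentler concurrency during warmup, full concurrency for profiling.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 50 --warmup-concurrency 20 \
  --request-count 500 --concurrency 100
```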
Warmup runs at 20 concurrent requests, then profiling runs at 100.
Different Request Rate
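A sketch of a 10 QPS warmup followed by 50 QPS profiling (`--warmup-request-rate` is an assumed flag name):

```shell
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 50 --warmup-request-rate 10 \
  --request-count 500 --request-rate 50
```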
Warmup sends at 10 QPS, then profiling runs at 50 QPS.
Different Arrival Pattern
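One possible invocation; `--warmup-arrival-pattern` and `--arrival-smoothness` are illustrative flag names rather than confirmed ones:

```shell
# Constant arrivals during warmup; gamma arrivals with smoothness 2.0 for profiling.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 50 --warmup-arrival-pattern constant \
  --request-rate 50 --arrival-pattern gamma --arrival-smoothness 2.0
```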
Warmup uses predictable constant arrivals; profiling uses gamma-distributed arrivals with reduced variance (a smoothness value greater than 1 produces arrivals more regular than Poisson).
Warmup with Ramping
Warmup can include its own gradual ramp-up:
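A sketch of ramped warmup; `--warmup-ramp-duration` is a hypothetical flag name used here for illustration:

```shell
# Ramp warmup load toward its target concurrency over the first 15 seconds.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-duration 60 --warmup-concurrency 50 --warmup-ramp-duration 15 \
  --request-count 500 --concurrency 100
```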
Grace Period
By default, AIPerf waits indefinitely for all warmup responses before starting profiling. When using duration-based warmup (`--warmup-duration`), you can limit this wait:
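For example (the `--warmup-grace-period` flag name and its units are assumptions):

```shell
# Stop waiting for in-flight warmup responses 10s after the warmup window ends.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-duration 30 --warmup-grace-period 10 \
  --request-count 500
```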
This prevents slow warmup responses from delaying the profiling phase indefinitely.
Multi-Turn Warmup
For multi-turn benchmarks, warmup by session count ensures complete conversations:
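A sketch, assuming a `--warmup-session-count` flag:

```shell
# Complete 10 full multi-turn conversations before profiling starts.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-session-count 10 \
  --request-count 500
```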
This completes 10 full conversations (each ~5 turns) before profiling begins.
Prefill Concurrency Warmup
When using prefill concurrency to limit simultaneous prefill operations, you can configure warmup separately:
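An illustrative command; the warmup-specific prefill flag name and the profiling limits shown (100 concurrent, 8 prefill) are assumptions:

```shell
# Lower limits during warmup (20 concurrent, 2 prefill); full limits for profiling.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 50 --warmup-concurrency 20 --warmup-prefill-concurrency 2 \
  --request-count 500 --concurrency 100 --prefill-concurrency 8
```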
Warmup runs with lower limits (20 concurrent, 2 prefill), then profiling uses full limits.
Examples
Minimal Warmup
Just warm up connections and caches:
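A minimal sketch (flag names assumed):

```shell
# A handful of requests is often enough to open connections and populate caches.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 10 --request-count 500
```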
Production-Like Warmup
Simulate gradual traffic increase:
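One possible sketch; `--warmup-ramp-duration` is a hypothetical flag name:

```shell
# Ramp warmup traffic up gradually, then hold the profiling rate steady.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-duration 60 --warmup-request-rate 50 --warmup-ramp-duration 30 \
  --request-rate 50 --request-count 1000
```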
Long-Context Warmup
For long prompts, use lower warmup concurrency to avoid OOM:
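For instance (flag names and the specific limits are illustrative):

```shell
# Low warmup concurrency keeps long-prompt memory pressure down.
aiperf profile --model my-model --url http://localhost:8000 \
  --warmup-request-count 20 --warmup-concurrency 4 \
  --concurrency 32 --request-count 200
```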
CLI Reference
Stop Conditions (at least one required for warmup)
Load Settings (inherit from profiling if not set)
Ramping (inherit from profiling if not set)
Other
Troubleshooting
Related Documentation
- Gradual Ramping — Smooth ramp-up for concurrency and rate
- Prefill Concurrency — Memory-safe long-context benchmarking
- Timing Modes Reference — Complete CLI compatibility matrix