Gradual Ramping
Gradual ramping lets you increase concurrency and request rate smoothly over time, rather than jumping to full load immediately. This prevents overwhelming your server at benchmark start.
Why Use Ramping?
When a benchmark starts, immediately hitting your target load can cause problems:
Problems without ramping:
- Connection storms — Hundreds of simultaneous connections overwhelming the server
- Memory spikes — Sudden KV-cache allocation causing OOM or degraded performance
- Misleading metrics — Cold-start effects polluting your steady-state measurements
Benefits of ramping:
- Server warms up gradually (caches, JIT compilation, connection pools)
- Early detection of capacity limits before hitting full load
- Cleaner measurements once you reach steady state
What Can Be Ramped?
AIPerf supports ramping three dimensions:
- Concurrency — the number of in-flight requests
- Request rate — the target queries per second (QPS)
- Prefill concurrency — the number of requests in the prefill phase
Each ramps from a low starting value up to your target over the specified duration.
Basic Usage
Ramping Concurrency
Gradually increase from 1 concurrent request to 100 over 30 seconds.
Ramping Request Rate
Gradually increase from a low starting rate to 100 QPS over 60 seconds.
What happens:
The starting rate is calculated proportionally: start = target * (update_interval / duration). With default settings (0.1s updates), ramping to 100 QPS over 60 seconds starts at ~0.17 QPS (not zero).
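That proportional-start formula can be sketched directly (the function name here is illustrative, not an AIPerf API):

```python
def ramp_start_rate(target_qps: float, duration_s: float,
                    update_interval_s: float = 0.1) -> float:
    """Proportional starting rate: start = target * (update_interval / duration)."""
    return target_qps * (update_interval_s / duration_s)

# Ramping to 100 QPS over 60 seconds with the default 0.1 s update interval
# starts at ~0.17 QPS rather than zero.
print(round(ramp_start_rate(100, 60), 2))  # 0.17
```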
Combining Rate and Concurrency Ramping
Ramp both dimensions together for maximum control.
Both ramp in parallel, reaching their targets at 30 seconds.
Prefill Concurrency Ramping
For long-context workloads, you may want to limit how many requests are in the “prefill” phase (processing input tokens) simultaneously. This prevents memory spikes from multiple large prompts being processed at once.
This limits prefill to 20 concurrent requests (ramped over 20 seconds), while allowing up to 100 total concurrent requests.
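The prefill-limiting idea can be illustrated with a semaphore that caps how many requests occupy the prefill phase at once, while total concurrency stays higher. This is a conceptual sketch, not AIPerf's implementation:

```python
import asyncio

PREFILL_LIMIT = 3  # illustrative limit; the example above uses 20

async def send_request(prefill_sem: asyncio.Semaphore, tracker: dict) -> None:
    async with prefill_sem:          # at most PREFILL_LIMIT requests in prefill
        tracker["in_prefill"] += 1
        tracker["max_prefill"] = max(tracker["max_prefill"], tracker["in_prefill"])
        await asyncio.sleep(0.01)    # stand-in for processing input tokens
        tracker["in_prefill"] -= 1
    await asyncio.sleep(0.01)        # stand-in for token generation (decode)

async def main() -> int:
    tracker = {"in_prefill": 0, "max_prefill": 0}
    sem = asyncio.Semaphore(PREFILL_LIMIT)
    # 10 requests run concurrently overall, but prefill is capped at 3.
    await asyncio.gather(*(send_request(sem, tracker) for _ in range(10)))
    return tracker["max_prefill"]

print(asyncio.run(main()))  # never exceeds PREFILL_LIMIT
```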
Warmup Phase Ramping
Each phase can have its own ramp settings. Warmup uses --warmup-* prefixed options.
Common Scenarios
High-Concurrency Stress Test
Ramp slowly to avoid overwhelming the server, then sustain full load.
The 60-second ramp gives the server time to allocate resources (~8 new connections per second).
Long-Context Memory Protection
Limit memory spikes from large prompts with prefill concurrency.
Only 5 requests process their 32K input tokens simultaneously, preventing KV-cache OOM.
Capacity Discovery
Find your server’s limits by ramping slowly and watching for degradation.
Watch latency and throughput metrics as load increases. When latency spikes or errors appear, you’ve found the limit.
How It Works
Concurrency Ramping (Discrete Steps)
Concurrency increases by +1 at evenly spaced intervals:
- 100 concurrency over 30 seconds = +1 every ~0.3 seconds
- 500 concurrency over 60 seconds = +1 every ~0.12 seconds
Each step allows one more concurrent request immediately.
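The resulting step schedule can be sketched as follows (the helper is ours, not an AIPerf API; it assumes evenly spaced +1 increments with the last step landing at the end of the ramp):

```python
def concurrency_step_times(target: int, duration_s: float, start: int = 1) -> list[float]:
    """Times (in seconds) at which concurrency increments by +1, evenly
    spaced so that the ramp completes exactly at duration_s."""
    steps = target - start                 # number of +1 increments
    interval = duration_s / steps
    return [round(i * interval, 3) for i in range(1, steps + 1)]

# 100 concurrency over 30 seconds: 99 increments, one every ~0.3 seconds.
times = concurrency_step_times(target=100, duration_s=30)
print(len(times), times[0], times[-1])  # 99 0.303 30.0
```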
Request Rate Ramping (Smooth Interpolation)
Rate updates continuously (every 0.1 seconds by default):
- 100 QPS over 60 seconds = updates ~600 times, smoothly increasing
- Linear interpolation from start rate to target rate
This creates smooth traffic curves without sudden jumps.
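Putting the proportional start and linear interpolation together, the rate at any point in the ramp looks roughly like this (a sketch of the behavior described above; the clamping at the end of the ramp is our assumption):

```python
def ramped_rate(t: float, target: float, duration: float,
                update_interval: float = 0.1) -> float:
    """Linearly interpolated request rate (QPS) at time t seconds into the ramp."""
    start = target * (update_interval / duration)  # proportional starting rate
    if t >= duration:
        return target                              # hold at target after the ramp
    return start + (target - start) * (t / duration)

# 100 QPS over 60 s: ~600 updates of 0.1 s each, rising smoothly to the target.
for t in (0, 30, 60):
    print(t, round(ramped_rate(t, target=100, duration=60), 2))
```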
Quick Reference
Key behaviors:
- Concurrency starts at 1 and increases by +1 at even intervals
- Request rate starts proportionally low and interpolates smoothly
- Ramps complete exactly at the specified duration
- After ramping, values stay at the target for the rest of the phase
Related Documentation
- Prefill Concurrency — Memory-safe long-context benchmarking with prefill limiting
- Request Rate with Concurrency — Combining rate and concurrency controls
- Timing Modes Reference — Complete CLI compatibility matrix