Prefill Concurrency: Fine-Grained Benchmarking Control
Prefill concurrency (--prefill-concurrency) limits how many requests can be in the prefill phase simultaneously—the compute-intensive phase where the LLM processes input tokens before generating output. Instead of tuning request rate broadly, this gives you fine-grained control over how much queueing occurs at the prefill stage—especially valuable for disaggregated serving architectures where you want to directly control TTFT behavior.
Why Prefill Concurrency Matters
Every LLM request has two phases: prefill, where the model processes the entire input prompt before emitting the first output token, and decode, where it generates the remaining tokens one at a time. Prefill is the compute-intensive phase, and its cost grows with prompt length, so many simultaneous prefills can saturate the server and inflate TTFT. Limiting simultaneous prefills also prevents memory exhaustion when benchmarking long prompts.
How It Works
AIPerf limits how many requests can be in the prefill phase at once:
Once a request receives its first token, it releases its prefill slot and moves to decode—allowing the next request to start prefilling.
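The slot lifecycle can be modeled with a counting semaphore. The sketch below is a simplified illustration of the idea, not AIPerf's implementation; the timings and names are invented:

```python
import asyncio

async def run_request(prefill_slots: asyncio.Semaphore, events: list, rid: int) -> None:
    # A request must hold a prefill slot while waiting for its first token.
    await prefill_slots.acquire()
    try:
        await asyncio.sleep(0.001)           # simulated prefill time
        events.append(("first_token", rid))  # first token arrives (streaming)
    finally:
        # First token received: free the slot so the next request can prefill.
        prefill_slots.release()
    await asyncio.sleep(0.001)               # simulated decode time
    events.append(("done", rid))

async def main() -> list:
    prefill_slots = asyncio.Semaphore(5)     # like --prefill-concurrency 5
    events: list = []
    # 50 concurrent sessions (like --concurrency 50), at most 5 prefilling.
    await asyncio.gather(*(run_request(prefill_slots, events, i) for i in range(50)))
    return events

events = asyncio.run(main())
print(sum(1 for kind, _ in events if kind == "done"))  # 50
```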
Requires --streaming to be enabled. Without streaming, AIPerf can’t detect when the first token arrives.
Coordinated omission trade-off: When requests wait for prefill slots, the benchmark operates as a closed loop, throttling itself to match server capacity. This is coordinated omission—your measured latencies will be lower than what users would experience if traffic kept arriving at the original rate. For accurate latency measurement, use open-loop benchmarking (request rate without prefill limits).
Two Concurrency Limits
AIPerf has two separate limits that work together:
- --concurrency — Session concurrency: total active requests at once (per-request in single-turn mode, per-conversation in multi-turn mode)
- --prefill-concurrency — Prefill concurrency: how many requests can be in the prefill phase at once
Example:
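A sketch of combining the two limits (the aiperf profile entry point is an assumption, and model/endpoint flags are omitted for brevity):

```shell
aiperf profile \
  --streaming \
  --concurrency 50 \
  --prefill-concurrency 5
```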
This means:
- Up to 50 requests can be active at once
- But only 5 can be reading their prompts (prefilling) at the same time
- The other 45 are either waiting to prefill OR already generating responses
Examples
Controlling Prefill Queue Depth
Benchmark with 16K token prompts, limiting how many can prefill simultaneously:
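A sketch of such a run (the aiperf profile entry point and the --synthetic-input-tokens-mean flag are assumptions here; check the CLI reference for exact names):

```shell
aiperf profile \
  --streaming \
  --synthetic-input-tokens-mean 16000 \
  --concurrency 30 \
  --prefill-concurrency 3
```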
What happens:
- 30 total concurrent sessions allowed
- Only 3 can prefill their 16K tokens simultaneously
Gradual Prefill Ramp-Up
Ramp prefill concurrency gradually to observe how TTFT changes as queue depth increases:
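The ramp semantics can be sketched by growing a semaphore's permit count over time. This is a toy model of the behavior, not AIPerf's implementation; the actual ramp flags are covered in the Gradual Ramping guide:

```python
import asyncio

async def ramp_prefill_limit(sem: asyncio.Semaphore, start: int, end: int, step_s: float) -> None:
    # The semaphore starts with `start` permits; add one per step until `end`,
    # mimicking a gradual increase of the prefill-concurrency limit.
    for _ in range(end - start):
        await asyncio.sleep(step_s)
        sem.release()  # an extra release grows the effective limit by one

async def main() -> int:
    prefill_slots = asyncio.Semaphore(2)                  # ramp start: 2
    await ramp_prefill_limit(prefill_slots, 2, 5, 0.001)  # ramp to 5
    acquired = 0
    while not prefill_slots.locked():                     # drain all permits
        await prefill_slots.acquire()
        acquired += 1
    return acquired

print(asyncio.run(main()))  # 5
```

As the limit grows, more requests prefill in parallel, so queueing at the prefill stage (and therefore TTFT) can be observed at each step.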
Combined with Request Rate
Prefill concurrency works with all scheduling modes:
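For example, pairing a request rate with both limits (a sketch assuming the aiperf profile entry point; model and endpoint flags omitted):

```shell
aiperf profile \
  --streaming \
  --request-rate 10 \
  --concurrency 100 \
  --prefill-concurrency 10
```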
Requests arrive at 10 QPS, up to 100 can be active, but only 10 can prefill at once.
Troubleshooting
CLI Reference
Constraints:
- --prefill-concurrency must be ≤ --concurrency (if both set)
- Requires --streaming to be enabled
- Works with all scheduling modes (--request-rate, --user-centric-rate, --fixed-schedule, burst mode)
Related Documentation
- Gradual Ramping — Smooth ramp-up for all concurrency dimensions
- Request Rate with Concurrency — Combining rate and concurrency controls
- User-Centric Timing — Multi-turn benchmarking for KV cache
- Timing Modes Reference — Complete CLI compatibility matrix