This guide provides a comprehensive reference for all load generator CLI options in AIPerf, including a compatibility matrix showing which options work together.
AIPerf determines how to schedule requests based on which CLI options you specify:
When multiple options are specified, AIPerf uses this priority:
--fixed-schedule or mooncake_trace dataset → Timestamp-based scheduling--user-centric-rate → Per-user turn gap scheduling--request-rate → Rate-based scheduling with arrival patterns--concurrency only → Burst mode (as fast as possible within limits)Arrival Pattern Values:
constant - Fixed inter-arrival times (1/rate)poisson - Exponential inter-arrivals (default with --request-rate)gamma - Tunable smoothness via --arrival-smoothnessconcurrency_burst - As fast as possible within concurrency limits (auto-set when no rate specified)Concurrency behavior by configuration:
--request-rate: Concurrency acts as a ceiling; requests scheduled by rate are blocked if at limit--concurrency only (no rate options): Concurrency is the primary driver; sends as fast as possible within limit--fixed-schedule: Concurrency acts as a ceiling; requests fire at scheduled times but blocked if at limit--user-centric-rate: Concurrency acts as a ceiling; user turns fire based on turn_gap but blocked if at limitImportant: If
--concurrencyis not set, session concurrency limiting is disabled (unlimited). For--user-centric-ratemode, consider setting--concurrencyto at least--num-usersto ensure all users can have in-flight requests.
See also: Prefill Concurrency Tutorial for detailed guidance on memory-safe long-context benchmarking.
Warmup options work independently of the main benchmark configuration. The warmup phase always uses rate-based scheduling internally.
--request-rate (Rate-Based Scheduling)Sends requests at a target average rate with configurable arrival patterns.
--concurrency Only (Burst Mode)Sends requests as fast as possible within concurrency limits. Triggered when no rate option is specified.
--fixed-schedule (Trace Replay)Replays requests at exact timestamps from dataset metadata. Used for trace replay benchmarking.
--user-centric-rate (KV Cache Benchmarking)Per-user rate limiting for KV cache benchmarking. Each user has a consistent gap between their turns.
Key formula: turn_gap = num_users / user_centric_rate
With --num-users 15 and --user-centric-rate 1.0, each user has 15 seconds between their turns.
For complete KV cache benchmarking, also configure shared system prompts and user context prompts. See the User-Centric Timing Tutorial for full configuration including
--shared-system-prompt-length,--user-context-prompt-length, and other prompt options.
See also: Multi-URL Load Balancing Tutorial for detailed configuration and examples.