Combining --request-rate with --concurrency gives you precise control over both request timing and the maximum number of concurrent connections. This dual-control approach is essential for:
When both parameters are specified, AIPerf uses a sleep-then-gate pattern for each request:
The sleep controls when requests attempt to launch (the rate), while the semaphore controls whether they can proceed (the concurrency ceiling).
No catch-up behavior: When the concurrency limit is reached, the system does not attempt to “catch up” by issuing requests faster once slots free up. The schedule continues at the configured rate.
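The sleep-then-gate pattern and the no-catch-up behavior described above can be sketched with asyncio. This is a minimal illustration, not AIPerf's implementation: `run_load`, `fake_request`, and their parameters are hypothetical names, the request is simulated with a sleep, and a constant interval stands in for whichever arrival pattern is selected.

```python
import asyncio


async def fake_request(latency_s: float) -> None:
    """Stand-in for a real HTTP call (hypothetical; it only sleeps)."""
    await asyncio.sleep(latency_s)


async def run_load(request_rate: float, concurrency: int,
                   total_requests: int, latency_s: float) -> int:
    """Issue requests with sleep-then-gate; return the peak number in flight."""
    interval_s = 1.0 / request_rate          # constant spacing, for simplicity
    gate = asyncio.Semaphore(concurrency)    # the concurrency ceiling
    in_flight = 0
    peak = 0
    tasks = []

    async def send_one() -> None:
        nonlocal in_flight
        try:
            await fake_request(latency_s)
        finally:
            in_flight -= 1
            gate.release()                   # free the slot for a later request

    for _ in range(total_requests):
        await asyncio.sleep(interval_s)      # sleep: when the request may launch
        await gate.acquire()                 # gate: whether it can proceed now
        in_flight += 1
        peak = max(peak, in_flight)
        tasks.append(asyncio.create_task(send_one()))
        # The next iteration sleeps a full interval again, so time spent
        # blocked on the gate is never made up later (no catch-up).

    await asyncio.gather(*tasks)
    return peak


if __name__ == "__main__":
    # 200 req/s against a 0.2 s simulated latency, capped at 5 in flight.
    print(asyncio.run(run_load(200, 5, 20, 0.2)))
```

With these numbers the offered load (200 req/s × 0.2 s = 40 requests in flight) far exceeds the cap, so once five requests are outstanding the gate, not the sleep, sets the pace.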
Sustaining max concurrency: If requests arrive faster than the server completes them, the system will naturally reach and sustain max concurrency. More precisely, this happens when request rate × average response time exceeds the concurrency limit. For example, at 100 req/s with a 1 second average response time, new requests arrive every 10 ms but each takes 1 second to complete, so roughly 100 requests would be in flight without a cap. Any concurrency limit below that stays saturated, because a new request is always ready when a slot frees up. Conversely, if request rate × average response time is below the limit, slots free up faster than new requests arrive, so concurrency may never reach the maximum.
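A quick way to check whether a given configuration will sit at the ceiling is to compare the limit with the average number of requests that would be in flight without a cap (Little's law: rate × response time). The helper names below are hypothetical and are not part of AIPerf. The estimate assumes steady state and a roughly stable average response time.

```python
def offered_concurrency(request_rate: float, avg_response_s: float) -> float:
    """Average in-flight requests if there were no cap (Little's law)."""
    return request_rate * avg_response_s


def sustains_ceiling(request_rate: float, avg_response_s: float,
                     concurrency: int) -> bool:
    """True when the offered load is enough to keep every slot busy."""
    return offered_concurrency(request_rate, avg_response_s) >= concurrency


if __name__ == "__main__":
    # The example from the text: 100 req/s with a 1 s average response time.
    print(offered_concurrency(100, 1.0))
    print(sustains_ceiling(100, 1.0, 50))
    # A slow arrival rate against a fast server never fills 50 slots.
    print(sustains_ceiling(10, 0.5, 50))
```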
Ramp-up time formula: ramp_up_time = concurrency / request_rate
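The formula is simple enough to evaluate by hand, but a small helper makes it easy to try different settings. `ramp_up_time` is a hypothetical function name, not an AIPerf API. The result assumes that no request completes during the ramp, which holds when the response time is longer than the ramp-up itself.

```python
def ramp_up_time(concurrency: int, request_rate: float) -> float:
    """Seconds needed to launch `concurrency` requests at `request_rate` req/s."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    return concurrency / request_rate


if __name__ == "__main__":
    print(ramp_up_time(100, 200))  # 100 slots filled at 200 req/s
    print(ramp_up_time(50, 50))    # 50 slots filled at 50 req/s
```

These two calls correspond to the 0.5 second and 1 second ramp-ups used in the examples in this guide.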
The sleep intervals in step 1 are determined by your chosen arrival pattern (--arrival-pattern or --request-rate-mode). AIPerf supports four distribution patterns:
poisson (default) — Uses exponentially-distributed inter-arrival times to mimic natural traffic with randomized spacing. Ideal for realistic load testing and capacity planning. Add --random-seed for reproducible random patterns.
constant — Requests arrive at precisely evenly-spaced intervals for deterministic, predictable load. Ideal for reproducible benchmarks and regression testing.
gamma — Uses gamma-distributed inter-arrival times with tunable smoothness via --arrival-smoothness. Values <1.0 create bursty traffic, values >1.0 create smoother traffic, and 1.0 is equivalent to Poisson. Compatible with vLLM’s --burstiness parameter.
concurrency_burst — Issues requests as fast as possible (zero interval), useful for stress testing or when concurrency is the only rate limiter. Often used internally for warmup phases.
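The four patterns differ only in how the interval before each request is drawn. The sketch below shows one way to sample those intervals with Python's standard library. It is an illustration under assumptions, not AIPerf's code: `inter_arrival_s` is a hypothetical function, and the gamma case assumes that the smoothness value acts as the gamma shape parameter with the mean interval held at 1 / request_rate.

```python
import random


def inter_arrival_s(pattern: str, request_rate: float,
                    rng: random.Random, smoothness: float = 1.0) -> float:
    """Return the sleep (in seconds) before the next request."""
    mean_s = 1.0 / request_rate
    if pattern == "constant":
        return mean_s                              # evenly spaced
    if pattern == "poisson":
        return rng.expovariate(request_rate)       # exponential, mean 1/rate
    if pattern == "gamma":
        # Assumed mapping: shape = smoothness, scale = mean / smoothness,
        # so the mean stays at 1/rate and shape 1.0 reduces to exponential.
        return rng.gammavariate(smoothness, mean_s / smoothness)
    if pattern == "concurrency_burst":
        return 0.0                                 # no spacing at all
    raise ValueError(f"unknown pattern: {pattern}")


if __name__ == "__main__":
    rng = random.Random(42)                        # fixed seed for repeatability
    for name in ("constant", "poisson", "gamma", "concurrency_burst"):
        samples = [inter_arrival_s(name, 200, rng, smoothness=0.5)
                   for _ in range(5)]
        print(name, [round(s, 4) for s in samples])
```

Seeding the generator, as `--random-seed` does for the CLI, makes the random patterns repeat from run to run.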
The following examples demonstrate both request rate modes in action. Make sure you’ve set up the server first (see above).
This example demonstrates realistic traffic simulation with a fast ramp-up (0.5 seconds). The Poisson distribution creates natural variance in request timing while maintaining an average rate of 200 req/s, capped at 100 concurrent requests.
Sample Output (Successful Run):
The Poisson distribution creates natural request spacing. You’ll notice requests don’t complete at exactly even intervals, but maintain the 200 req/s average rate. With max concurrency set to 100, the system reaches that ceiling after ~0.5 seconds (100 requests / 200 req/s).
This example uses evenly-spaced requests with a moderate ramp-up (1 second). Requests arrive at exactly 20ms intervals (50 req/s), making results highly reproducible for benchmarking and regression testing.
Sample Output (Successful Run):
With constant mode, requests arrive at precisely 20ms intervals (1 / 50 req/s = 20ms). This creates predictable, reproducible load patterns ideal for regression testing. The system reaches max concurrency of 50 after exactly 1 second (50 requests / 50 req/s).
Here are practical scenarios where combining request rate with max concurrency is particularly valuable:
Test how your server handles thousands of concurrent users with a controlled ramp-up to avoid overwhelming connection establishment. A longer ramp-up gives the server time to allocate resources gradually.
Recommended mode: Either poisson (for realistic variance) or constant (for predictable ramp-up)
Validate server behavior when clients respect both throughput and concurrency constraints. This tests rate-limiting logic and ensures proper 429/503 responses when appropriate.
Recommended mode: constant for precise, reproducible rate testing
Simulate organic user behavior with natural variance in request timing. The Poisson distribution models real-world patterns where users don’t arrive at perfectly regular intervals.
Recommended mode: poisson with --random-seed for reproducible realistic traffic
Key Parameters:
--request-rate <number> — Target requests per second
--concurrency <number> — Maximum concurrent requests (acts as ceiling)
--arrival-pattern <pattern> — Request timing distribution (default: poisson)
  poisson, constant, gamma, concurrency_burst
--arrival-smoothness <number> — Smoothness for gamma distribution (default: 1.0)
--random-seed <number> — Makes random patterns reproducible
Behavior:
Choosing a pattern:
poisson for realistic traffic simulation with natural variance
constant for reproducible benchmarks with predictable timing
gamma with --arrival-smoothness for tunable burstiness
concurrency_burst for maximum throughput stress tests
--random-seed for reproducible random patterns