Request Rate with Max Concurrency
Combining --request-rate with --concurrency gives you precise control over both request timing and the maximum number of concurrent connections. This dual-control approach is essential for:
- Avoiding thundering herd: prevents simultaneous request bursts that overwhelm servers
- Testing real-world API constraints: validates behavior under actual rate and concurrency limits
- Simulating realistic clients: models bandwidth constraints and connection pool limits
- Validating resource protection: ensures servers handle properly constrained traffic
- Controlled capacity testing: scales load gradually without resource exhaustion
How It Works
When both parameters are specified, AIPerf uses a sleep-then-gate pattern for each request:
1. Sleep: wait according to request rate timing (based on arrival pattern)
2. Check concurrency: attempt to acquire a semaphore slot
3. Issue request: send to server if slot acquired
4. On completion: release semaphore slot for next request
The sleep controls when requests attempt to launch (the rate), while the semaphore controls whether they can proceed (the concurrency ceiling).
No catch-up behavior: When the concurrency limit is reached, the system does not attempt to “catch up” by issuing requests faster once slots free up. The schedule continues at the configured rate.
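The interaction between the rate schedule and the concurrency gate can be sketched as a small discrete-time simulation. This is a minimal illustration, not AIPerf's implementation: the function name simulate_sleep_then_gate and its parameters are hypothetical, arrivals are evenly spaced, every request takes the same fixed time, and a request that finds no free slot is assumed to be skipped rather than queued, which models the no catch-up behavior.

```python
import heapq


def simulate_sleep_then_gate(interval_ms, concurrency, service_ms, duration_ms):
    """Toy model of a rate schedule gated by a concurrency ceiling.

    Hypothetical helper for illustration only; times are integer milliseconds.
    Returns (issued, skipped, peak_in_flight).
    """
    in_flight = []  # min-heap of completion times for requests still running
    issued = skipped = peak = 0

    # "Sleep": the schedule ticks at a fixed interval, whatever the gate does.
    for now in range(0, duration_ms, interval_ms):
        # "On completion": free the slots of requests that have finished.
        while in_flight and in_flight[0] <= now:
            heapq.heappop(in_flight)

        # "Check concurrency": is a slot available at this tick?
        if len(in_flight) < concurrency:
            # "Issue request": occupy a slot until the request completes.
            heapq.heappush(in_flight, now + service_ms)
            issued += 1
            peak = max(peak, len(in_flight))
        else:
            # Assumed skip semantics: the missed tick is not made up later.
            skipped += 1

    return issued, skipped, peak


if __name__ == "__main__":
    # 100 req/s (10 ms interval), ceiling of 10, 1 s per request, 2 s run.
    print(simulate_sleep_then_gate(10, 10, 1000, 2000))  # (20, 180, 10)
```

In this run each request holds a slot for a full second, so only 10 requests can start per second. The concurrency ceiling, not the 100 req/s schedule, sets the effective throughput.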
Sustaining max concurrency: If requests arrive faster than the server completes them, the system naturally reaches and sustains max concurrency. More precisely, this happens when request rate × average response time is at least the concurrency limit. For example, at 100 req/s with a 1 second average response time, new requests arrive every 10 ms but each takes 1 second to complete, so about 100 requests would be in flight if nothing capped them; any concurrency limit of 100 or lower therefore stays saturated. Conversely, if request rate × average response time is below the limit, slots free up faster than new requests arrive, so concurrency may never reach the maximum.
Ramp-up time formula: ramp_up_time = concurrency / request_rate
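The ramp-up formula and the saturation condition are simple enough to check numerically. The sketch below is illustrative only: the function names are hypothetical, and the saturation check relies on Little's law (average requests in flight = arrival rate × average response time) rather than on anything AIPerf reports.

```python
def ramp_up_time(concurrency, request_rate):
    """Seconds needed to launch `concurrency` requests at `request_rate` req/s."""
    return concurrency / request_rate


def offered_concurrency(request_rate, avg_response_s):
    """Average in-flight requests if no ceiling applied (Little's law)."""
    return request_rate * avg_response_s


def saturates_ceiling(request_rate, avg_response_s, concurrency):
    """True when the offered load is enough to keep every slot busy."""
    return offered_concurrency(request_rate, avg_response_s) >= concurrency


if __name__ == "__main__":
    print(ramp_up_time(100, 200))            # 0.5
    print(ramp_up_time(50, 50))              # 1.0
    print(offered_concurrency(100, 1.0))     # 100.0
    print(saturates_ceiling(100, 1.0, 100))  # True
    print(saturates_ceiling(10, 0.5, 100))   # False
```

The ramp-up estimate assumes no request completes before the ceiling is reached. If responses return sooner than that, the ceiling takes longer to reach or is never reached.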
Choosing Your Arrival Pattern
The sleep intervals in step 1 are determined by your chosen arrival pattern (--arrival-pattern or --request-rate-mode). AIPerf supports four distribution patterns:
- poisson (default): Uses exponentially distributed inter-arrival times to mimic natural traffic with randomized spacing. Ideal for realistic load testing and capacity planning. Add --random-seed for reproducible random patterns.
- constant: Requests arrive at evenly spaced intervals for deterministic, predictable load. Ideal for reproducible benchmarks and regression testing.
- gamma: Uses gamma-distributed inter-arrival times with tunable smoothness via --arrival-smoothness. Values below 1.0 create burstier traffic, values above 1.0 create smoother traffic, and 1.0 is equivalent to Poisson. Compatible with vLLM’s --burstiness parameter.
- concurrency_burst: Issues requests as fast as possible (zero interval), useful for stress testing or when concurrency is the only rate limiter. Often used internally for warmup phases.
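To make the four patterns concrete, here is a minimal sketch of how the sleep interval could be drawn for each one, using only Python's standard random module. It is not AIPerf's implementation: next_interval is a hypothetical helper, and the smoothness value is assumed to act as the gamma shape parameter, with the scale chosen so the mean gap stays at 1 / request_rate.

```python
import random


def next_interval(pattern, request_rate, rng, smoothness=1.0):
    """Return one inter-arrival gap in seconds for the given pattern.

    Hypothetical helper for illustration; not AIPerf's implementation.
    """
    mean_gap = 1.0 / request_rate
    if pattern == "constant":
        return mean_gap
    if pattern == "poisson":
        # Exponential gaps with mean 1 / request_rate.
        return rng.expovariate(request_rate)
    if pattern == "gamma":
        # Assumed mapping: shape = smoothness, scale keeps the mean gap fixed.
        return rng.gammavariate(smoothness, mean_gap / smoothness)
    if pattern == "concurrency_burst":
        return 0.0
    raise ValueError(f"unknown arrival pattern: {pattern}")


if __name__ == "__main__":
    rng = random.Random(42)  # a fixed seed makes the random gaps repeatable
    for pattern in ("constant", "poisson", "gamma", "concurrency_burst"):
        gaps = [next_interval(pattern, 50, rng, smoothness=4.0) for _ in range(10_000)]
        print(pattern, round(sum(gaps) / len(gaps), 4))
```

All patterns except concurrency_burst target the same mean gap and differ only in how much individual gaps vary around it. With a shape of 1.0 the gamma distribution reduces to the exponential distribution, which is why a smoothness of 1.0 matches the Poisson pattern.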
Setting Up the Server
Running the Examples
The following examples demonstrate two of the arrival patterns, Poisson and constant, in action. Make sure you’ve set up the server first (see above).
Poisson Mode: Natural Traffic Patterns
This example demonstrates realistic traffic simulation with a fast ramp-up (0.5 seconds). The Poisson distribution creates natural variance in request timing while maintaining an average rate of 200 req/s, capped at 100 concurrent requests.
Sample Output (Successful Run):
The Poisson distribution creates natural request spacing. You’ll notice requests don’t complete at exactly even intervals, but maintain the 200 req/s average rate. With max concurrency set to 100, the system reaches that ceiling after ~0.5 seconds (100 requests / 200 req/s).
Constant Mode: Deterministic Timing
This example uses evenly-spaced requests with a moderate ramp-up (1 second). Requests arrive at exactly 20ms intervals (50 req/s), making results highly reproducible for benchmarking and regression testing.
Sample Output (Successful Run):
With constant mode, requests arrive at precisely 20ms intervals (1 / 50 req/s = 20ms). This creates predictable, reproducible load patterns ideal for regression testing. The system reaches max concurrency of 50 after exactly 1 second (50 requests / 50 req/s).
Common Use Cases
Here are practical scenarios where combining request rate with max concurrency is particularly valuable:
High-Concurrency Testing at Scale
Test how your server handles thousands of concurrent users with a controlled ramp-up to avoid overwhelming connection establishment. A longer ramp-up gives the server time to allocate resources gradually.
Recommended mode: Either poisson (for realistic variance) or constant (for predictable ramp-up)
API Rate Limit Validation
Validate server behavior when clients respect both throughput and concurrency constraints. This tests rate-limiting logic and ensures proper 429/503 responses when appropriate.
Recommended mode: constant for precise, reproducible rate testing
Realistic User Traffic Simulation
Simulate organic user behavior with natural variance in request timing. The Poisson distribution models real-world patterns where users don’t arrive at perfectly regular intervals.
Recommended mode: poisson with --random-seed for reproducible realistic traffic
Quick Reference
Key Parameters:
- --request-rate <number>: Target requests per second
- --concurrency <number>: Maximum concurrent requests (acts as ceiling)
- --arrival-pattern <pattern>: Request timing distribution (default: poisson)
  - Options: poisson, constant, gamma, concurrency_burst
- --arrival-smoothness <number>: Smoothness for gamma distribution (default: 1.0)
- --random-seed <number>: Makes random patterns reproducible
Behavior:
- Sleep-then-gate pattern: sleep based on rate, then acquire concurrency slot
- Continuation turns block on concurrency; new sessions skip if no slot available
- No catch-up: the schedule continues at the configured rate regardless of blocking
- Sustained concurrency: achieved when request rate × average response time meets or exceeds the concurrency limit
Choosing a pattern:
- Use poisson for realistic traffic simulation with natural variance
- Use constant for reproducible benchmarks with predictable timing
- Use gamma with --arrival-smoothness for tunable burstiness
- Use concurrency_burst for maximum throughput stress tests
- Add --random-seed for reproducible random patterns