Arrival Patterns: Simulating Realistic Traffic

When benchmarking with --request-rate, AIPerf can vary how requests arrive over time. The --arrival-pattern option controls the distribution of inter-arrival times, letting you simulate everything from perfectly regular traffic to bursty real-world patterns.

Why Arrival Patterns Matter

Real traffic doesn’t arrive at perfectly regular intervals. Traffic comes in bursts—quiet periods followed by sudden spikes. How your server handles this variance affects real-world performance.

Constant Pattern:        Poisson Pattern:         Gamma (bursty):
| | | | | | | |          | || | | |               ||| | ||| |
└──────────────────▶     └──────────────────▶     └──────────────────▶
Perfect spacing          Natural variance         Clustered bursts
(unrealistic)            (typical traffic)        (stress testing)

Quick Start

# Default: Poisson (realistic)
aiperf profile --request-rate 50 ...

# Explicit: Constant (deterministic)
aiperf profile --request-rate 50 --arrival-pattern constant ...

# Bursty: Gamma with low smoothness
aiperf profile --request-rate 50 --arrival-pattern gamma --arrival-smoothness 0.5 ...

Available Patterns

Constant

--arrival-pattern constant

Requests arrive at perfectly regular intervals: exactly 1/rate seconds apart.

Inter-arrival times:
10 QPS → every 100ms: |····|····|····|····|····|····|
0 100 200 300 400 500 600 ms

Use cases:

  • Baseline measurements with no variance
  • Debugging timing issues
  • Comparing against variable patterns
  • Deterministic, reproducible tests

Poisson (Default)

--arrival-pattern poisson

Requests arrive according to a Poisson process—the mathematical model for random events at a constant average rate. Inter-arrival times follow an exponential distribution.

Inter-arrival times (exponential):
10 QPS average: |··|······|·|···|····|··|·······|···|
Varied gaps, same average rate over time

Characteristics:

  • Mean inter-arrival = 1/rate (same as constant)
  • Variance = (1/rate)² (natural randomness)
  • Sometimes requests cluster, sometimes gaps appear
  • Models real user behavior where arrivals are independent

Use cases:

  • Default realistic traffic simulation
  • Standard load testing
  • Comparing to theoretical queueing models
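
The Poisson characteristics above can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the function name `poisson_arrival_times` is hypothetical and is not part of AIPerf, whose internal scheduler may be implemented differently.

```python
import random
import statistics


def poisson_arrival_times(rate: float, n: int, seed: int = 0) -> list[float]:
    """Return n arrival timestamps (seconds) for a Poisson process at `rate` req/s."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(n):
        # Exponential inter-arrival gap with mean 1/rate.
        t += rng.expovariate(rate)
        times.append(t)
    return times


if __name__ == "__main__":
    rate = 10.0
    times = poisson_arrival_times(rate, 100_000)
    gaps = [b - a for a, b in zip([0.0] + times, times)]
    print(f"mean gap: {statistics.fmean(gaps):.4f} s (theory {1 / rate:.4f})")
    print(f"variance: {statistics.pvariance(gaps):.5f} s^2 (theory {1 / rate ** 2:.5f})")
```

With a large sample, the printed mean and variance should land close to 1/rate and (1/rate)² respectively.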

Gamma (Tunable Burstiness)

--arrival-pattern gamma --arrival-smoothness <value>

Gamma distribution generalizes Poisson with a smoothness parameter that controls how bursty or regular arrivals are:

| Smoothness | Behavior | Variance | Use Case |
|------------|----------|----------|----------|
| < 1.0 | Bursty: clustered arrivals with gaps | Higher | Stress testing, worst-case scenarios |
| = 1.0 | Poisson: natural randomness | Medium | Same as --arrival-pattern poisson |
| > 1.0 | Smooth: more regular arrivals | Lower | Controlled testing, less noise |

Smoothness = 0.5 (bursty):
|||| ||| ||||| ||
Clusters of requests with quiet gaps

Smoothness = 1.0 (Poisson):
| || | | | || | | || |
Natural variance

Smoothness = 2.0 (smooth):
| | | | | | | | | | | | | |
More regular, approaches constant

Mathematical note: The smoothness parameter is the Gamma distribution’s shape parameter (k). Scale is automatically computed to maintain the correct mean rate.
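
To make the shape/scale relationship concrete, here is a minimal sketch of a Gamma inter-arrival sampler. It assumes the scale is chosen as 1/(k·rate) so that the mean gap stays at 1/rate; the function name `gamma_inter_arrival` is hypothetical and is not AIPerf's actual code.

```python
import random


def gamma_inter_arrival(rate: float, smoothness: float, rng: random.Random) -> float:
    """Draw one inter-arrival gap (seconds) from Gamma(shape=k, scale=1/(k*rate))."""
    if rate <= 0 or smoothness <= 0:
        raise ValueError("rate and smoothness must be positive")
    # Mean of a Gamma distribution is shape * scale, so this keeps the mean at 1/rate.
    scale = 1.0 / (smoothness * rate)
    return rng.gammavariate(smoothness, scale)


if __name__ == "__main__":
    rng = random.Random(42)
    for k in (0.5, 1.0, 2.0):
        gaps = [gamma_inter_arrival(50.0, k, rng) for _ in range(5)]
        print(k, [f"{g * 1000:.1f} ms" for g in gaps])
```

Whatever the smoothness, the long-run average gap stays at 1/rate; only the spread around that average changes.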

Concurrency Burst

# No --request-rate, just --concurrency
aiperf profile --concurrency 50 ...

When you omit --request-rate and only specify --concurrency, AIPerf uses burst mode: zero delay between request dispatches, limited only by the concurrency semaphore.

Burst mode (concurrency=3):
[Req1]────────────────────────────▶
[Req2]────────────────────────────▶
[Req3]────────────────────────────▶
[Req4]──────────────────────▶ ← Starts when any slot frees
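
The dispatch behavior in the diagram can be sketched with an asyncio semaphore. This is an illustration only: `run_burst` and the simulated `service_time` are hypothetical stand-ins, not AIPerf internals.

```python
import asyncio


async def run_burst(num_requests: int, concurrency: int, service_time: float = 0.01) -> int:
    """Dispatch requests with zero delay; return the peak number in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            await asyncio.sleep(service_time)  # stand-in for the real server call
        finally:
            in_flight -= 1
            semaphore.release()  # frees a slot so the next request can start

    tasks = []
    for _ in range(num_requests):
        # No inter-arrival delay: this only waits when every slot is busy.
        await semaphore.acquire()
        tasks.append(asyncio.create_task(fake_request()))
    await asyncio.gather(*tasks)
    return peak


if __name__ == "__main__":
    print("peak in flight:", asyncio.run(run_burst(num_requests=10, concurrency=3)))
```

The dispatcher never sleeps between requests, so the only thing pacing the load is how quickly slots free up.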

Use cases:

  • Maximum throughput discovery
  • Saturation testing
  • Finding server capacity limits

vLLM Compatibility

AIPerf’s --arrival-smoothness is compatible with vLLM’s --burstiness parameter:

# Same distribution as vLLM with --burstiness 0.5
aiperf profile \
--request-rate 50 \
--arrival-pattern gamma \
--arrival-smoothness 0.5 \
...

This allows direct comparison between AIPerf and vLLM benchmark results when using the same smoothness/burstiness value.

Examples

Baseline vs Realistic Comparison

Compare how your server handles ideal vs realistic traffic:

# Run 1: Constant (baseline)
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern constant \
--benchmark-duration 60 \
--output-dir results/constant

Expected Output (Run 1):

INFO Starting AIPerf System
INFO Using Request_Rate strategy with constant arrival pattern
INFO AIPerf System is PROFILING
Profiling: [01:00] - Running for 60 seconds...
INFO Benchmark completed successfully
INFO Results saved to: results/constant/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 178.45 │ 156.23 │ 212.34 │ 205.67 │ 176.89 │
│ Time to First Token (ms) │ 45.67 │ 38.12 │ 58.34 │ 56.23 │ 44.90 │
│ Inter Token Latency (ms) │ 11.23 │ 9.45 │ 14.67 │ 14.12 │ 11.01 │
│ Request Throughput (req/s) │ 98.45 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: results/constant/profile_export_aiperf.json

# Run 2: Poisson (realistic)
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern poisson \
--benchmark-duration 60 \
--output-dir results/poisson

Expected Output (Run 2):

INFO Starting AIPerf System
INFO Using Request_Rate strategy with poisson arrival pattern
INFO AIPerf System is PROFILING
Profiling: [01:00] - Running for 60 seconds...
INFO Benchmark completed successfully
INFO Results saved to: results/poisson/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 182.34 │ 148.56 │ 267.89 │ 245.67 │ 179.12 │
│ Time to First Token (ms) │ 47.89 │ 35.67 │ 78.23 │ 72.45 │ 46.34 │
│ Inter Token Latency (ms) │ 11.67 │ 8.90 │ 19.34 │ 17.89 │ 11.23 │
│ Request Throughput (req/s) │ 96.78 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: results/poisson/profile_export_aiperf.json

Compare TTFT, tail latency (p99 and max), and throughput between runs. A noticeably wider spread under Poisson indicates that the server is sensitive to arrival variance.

Stress Testing with Bursty Traffic

Test how your server handles request bursts:

aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern gamma \
--arrival-smoothness 0.3 \
--benchmark-duration 120

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Using Request_Rate strategy with gamma arrival pattern (smoothness: 0.3)
INFO AIPerf System is PROFILING
Profiling: [02:00] - Running for 120 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/your-model-chat-rate100/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 198.67 │ 142.34 │ 398.12 │ 356.78 │ 189.45 │
│ Time to First Token (ms) │ 52.34 │ 34.56 │ 112.34 │ 98.67 │ 49.23 │
│ Inter Token Latency (ms) │ 12.89 │ 8.23 │ 28.45 │ 24.67 │ 12.01 │
│ Request Throughput (req/s) │ 93.45 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: artifacts/your-model-chat-rate100/profile_export_aiperf.json

Smoothness of 0.3 creates highly bursty traffic (CV = 1/√0.3 ≈ 1.83): several requests arrive nearly simultaneously, followed by quiet periods.

Smooth Traffic for Noise Reduction

Reduce variance in measurements for controlled experiments:

aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 50 \
--arrival-pattern gamma \
--arrival-smoothness 5.0 \
--benchmark-duration 60

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Using Request_Rate strategy with gamma arrival pattern (smoothness: 5.0)
INFO AIPerf System is PROFILING
Profiling: [01:00] - Running for 60 seconds...
INFO Benchmark completed successfully
INFO Results saved to: artifacts/your-model-chat-rate50/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 165.23 │ 148.90 │ 189.45 │ 184.56 │ 164.12 │
│ Time to First Token (ms) │ 42.67 │ 36.89 │ 52.34 │ 50.12 │ 42.01 │
│ Inter Token Latency (ms) │ 10.89 │ 9.23 │ 13.45 │ 13.01 │ 10.67 │
│ Request Throughput (req/s) │ 49.23 │ - │ - │ - │ - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: artifacts/your-model-chat-rate50/profile_export_aiperf.json

Smoothness of 5.0 produces fairly regular arrivals (CV = 1/√5 ≈ 0.45), reducing measurement noise while retaining some natural variance.

Progressive Burstiness Test

Run multiple benchmarks with increasing burstiness to find where performance degrades:

for smoothness in 2.0 1.0 0.7 0.5 0.3; do
  aiperf profile \
    --model your-model \
    --url localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --request-rate 100 \
    --arrival-pattern gamma \
    --arrival-smoothness "$smoothness" \
    --benchmark-duration 60 \
    --output-dir "results/smoothness_$smoothness"
done

Warmup with Stable Pattern, Profile with Realistic

Use constant arrivals during warmup, then realistic patterns for profiling:

aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--streaming \
--request-rate 100 \
--arrival-pattern gamma \
--arrival-smoothness 0.8 \
--warmup-arrival-pattern constant \
--warmup-duration 30 \
--benchmark-duration 120

CLI Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| --arrival-pattern | str | poisson | Pattern for request arrivals: constant, poisson, gamma |
| --arrival-smoothness | float | None | Gamma smoothness: <1 = bursty, 1 = Poisson, >1 = smooth. Defaults to 1.0 when using gamma pattern. |
| --warmup-arrival-pattern | str | Inherits | Override pattern for warmup phase |

Constraints:

  • --arrival-pattern requires --request-rate to be set
  • --arrival-smoothness only applies with --arrival-pattern gamma
  • Arrival-pattern options cannot be combined with --user-centric-rate (deterministic per-user scheduling)
  • Arrival-pattern options cannot be combined with --fixed-schedule (timestamp-based scheduling)
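
If you generate AIPerf commands from scripts, the constraints above can be encoded as a small pre-flight check. The sketch below is purely illustrative: `check_arrival_options` is a hypothetical helper, not an AIPerf API, and it treats --user-centric-rate and --fixed-schedule as simple presence flags, which is an assumption. How AIPerf itself reports these conflicts is not covered here.

```python
from typing import Optional

VALID_PATTERNS = {"constant", "poisson", "gamma"}


def check_arrival_options(
    request_rate: Optional[float] = None,
    arrival_pattern: Optional[str] = None,  # None means the flag was not passed
    arrival_smoothness: Optional[float] = None,
    user_centric_rate: bool = False,  # assumption: only presence matters here
    fixed_schedule: bool = False,
) -> list[str]:
    """Return human-readable constraint violations (empty list if none)."""
    problems: list[str] = []
    if arrival_pattern is not None:
        if arrival_pattern not in VALID_PATTERNS:
            problems.append(f"unknown arrival pattern: {arrival_pattern!r}")
        if request_rate is None:
            problems.append("--arrival-pattern requires --request-rate")
        if user_centric_rate:
            problems.append("--arrival-pattern cannot be combined with --user-centric-rate")
        if fixed_schedule:
            problems.append("--arrival-pattern cannot be combined with --fixed-schedule")
    if arrival_smoothness is not None and arrival_pattern != "gamma":
        problems.append("--arrival-smoothness only applies with --arrival-pattern gamma")
    return problems


if __name__ == "__main__":
    print(check_arrival_options(arrival_pattern="gamma", arrival_smoothness=0.5))
```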

Pattern Selection Guide

| Goal | Pattern | Smoothness |
|------|---------|------------|
| Reproducible baseline | constant | N/A |
| Realistic traffic simulation | poisson | N/A |
| Match vLLM benchmark | gamma | Same as vLLM --burstiness |
| Stress test burst handling | gamma | 0.3 - 0.7 |
| Reduce measurement noise | gamma | 2.0 - 5.0 |
| Maximum throughput | N/A (burst mode) | N/A |

Understanding the Math

For those who want to understand the statistical properties:

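| Pattern | Distribution | Mean | Variance | CV (Coeff. of Variation) |
|---------|--------------|------|----------|--------------------------|
| Constant | Degenerate | 1/λ | 0 | 0 |
| Poisson | Exponential | 1/λ | 1/λ² | 1 |
| Gamma(k) | Gamma | 1/λ | 1/(k·λ²) | 1/√k |

Mean and variance refer to the inter-arrival time, where λ = request rate (requests per second) and k = smoothness.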

  • CV (Coefficient of Variation) = standard deviation / mean
  • Lower CV = more regular arrivals
  • Gamma with k=1 equals Poisson (CV=1)
  • As k→∞, Gamma approaches Constant (CV→0)
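
The CV column follows from the Gamma parameterization: with shape k and scale θ = 1/(k·λ), the mean is k·θ = 1/λ and the variance is k·θ² = 1/(k·λ²), so CV = 1/√k. The sketch below checks this empirically with the standard library. The helper name `empirical_cv` is hypothetical, and the sampler is a stand-in for AIPerf's own generator, not a copy of it.

```python
import math
import random
import statistics


def empirical_cv(rate: float, smoothness: float, n: int = 100_000, seed: int = 0) -> float:
    """Estimate the coefficient of variation of Gamma inter-arrival gaps."""
    rng = random.Random(seed)
    scale = 1.0 / (smoothness * rate)  # keeps the mean gap at 1/rate
    gaps = [rng.gammavariate(smoothness, scale) for _ in range(n)]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


if __name__ == "__main__":
    for k in (0.3, 0.5, 1.0, 2.0, 5.0):
        theory = 1 / math.sqrt(k)
        print(f"k={k:<4} empirical CV={empirical_cv(100.0, k):.3f}  theory={theory:.3f}")
```

The empirical values should track 1/√k closely, and they do not depend on the request rate, which only rescales the gaps.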