Load Generator Options Reference

This guide provides a comprehensive reference for all load generator CLI options in AIPerf, including a compatibility matrix showing which options work together.

Request Scheduling Options

AIPerf determines how to schedule requests based on which CLI options you specify:

| CLI Option | Use Case | Description |
|---|---|---|
| --request-rate | Rate-based load testing | Schedule requests at a target QPS with configurable arrival patterns |
| --concurrency (alone) | Saturation/throughput testing | Send requests as fast as possible within concurrency limits |
| --fixed-schedule | Trace replay | Replay requests at exact timestamps from the dataset |
| --user-centric-rate | KV cache benchmarking | Per-user rate limiting with consistent turn gaps |

Option Priority

When multiple options are specified, AIPerf uses this priority:

  1. --fixed-schedule or mooncake_trace dataset → Timestamp-based scheduling
  2. --user-centric-rate → Per-user turn gap scheduling
  3. --request-rate → Rate-based scheduling with arrival patterns
  4. --concurrency only → Burst mode (as fast as possible within limits)
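The priority order above can be expressed as a small resolver, shown here ignoring the hard conflicts that AIPerf rejects at validation time (a minimal sketch; `resolve_scheduling_mode` and its return values are illustrative names, not AIPerf internals):

```python
def resolve_scheduling_mode(fixed_schedule=False, user_centric_rate=None,
                            request_rate=None, concurrency=None):
    """Pick a scheduling mode using the documented priority order."""
    if fixed_schedule:
        return "fixed_schedule"     # 1. timestamp-based scheduling
    if user_centric_rate is not None:
        return "user_centric"       # 2. per-user turn gap scheduling
    if request_rate is not None:
        return "request_rate"       # 3. rate-based with arrival patterns
    if concurrency is not None:
        return "concurrency_burst"  # 4. as fast as possible within limits
    raise ValueError("no scheduling option specified")

# Rate takes priority over concurrency alone, so this prints "request_rate"
print(resolve_scheduling_mode(request_rate=10, concurrency=5))
```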

Compatibility Matrix

Legend

  • ✅ Compatible - Option works with this configuration
  • ⚠️ Conditional - Works with restrictions (see notes)
  • ❌ Incompatible - Option conflicts or is ignored
  • 🔧 Required - Option is required for this configuration

Scheduling Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --request-rate | ✅ | ❌ | ❌ | Conflicts with --user-centric-rate |
| --user-centric-rate | ❌ | ❌ | 🔧 | Requires --num-users |
| --fixed-schedule | ❌ | 🔧 | ❌ | Requires trace dataset with timestamps |
| --num-users | ❌ | ❌ | 🔧 | Required with --user-centric-rate; raises error otherwise |
| --request-rate-ramp-duration | ✅ | ❌ | ❌ | Raises error with --fixed-schedule or --user-centric-rate |

Stop Conditions (at least one required)

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --request-count | ✅ | ✅ | ✅ | Mutually exclusive with --num-sessions |
| --num-sessions | ✅ | ✅ | ✅ | Mutually exclusive with --request-count |
| --benchmark-duration | ✅ | ✅ | ✅ | Enables --benchmark-grace-period |

Arrival Pattern Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --arrival-pattern | ✅ | ❌ | ❌ | Conflicts with --user-centric-rate; values: constant, poisson, gamma |
| --arrival-smoothness | ⚠️ | ❌ | ❌ | Only with --arrival-pattern gamma |

Arrival Pattern Values:

  • constant - Fixed inter-arrival times (1/rate)
  • poisson - Exponential inter-arrivals (default with --request-rate)
  • gamma - Tunable smoothness via --arrival-smoothness
  • concurrency_burst - As fast as possible within concurrency limits (auto-set when no rate specified)
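The three rate-based patterns differ only in how each inter-arrival delay is drawn. Here is a sketch using Python's standard library; the gamma parameterization (shape = smoothness, scale chosen so the mean stays at 1/rate) is an assumption about how smoothness maps onto the distribution, not AIPerf's exact implementation:

```python
import random

def inter_arrival(pattern: str, rate: float, smoothness: float = 1.0) -> float:
    """Sample one inter-arrival delay (seconds) with mean 1/rate."""
    if pattern == "constant":
        return 1.0 / rate                # fixed gap between requests
    if pattern == "poisson":
        return random.expovariate(rate)  # exponential inter-arrivals
    if pattern == "gamma":
        # Shape k = smoothness; scale theta = 1/(rate*k), so mean k*theta = 1/rate.
        # Higher smoothness -> lower variance -> more regular arrivals.
        return random.gammavariate(smoothness, 1.0 / (rate * smoothness))
    raise ValueError(f"unknown pattern: {pattern}")

random.seed(0)
delays = [inter_arrival("poisson", rate=10.0) for _ in range(10000)]
print(round(sum(delays) / len(delays), 2))  # mean gap is ~0.1 s at 10 QPS
```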

Concurrency Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --concurrency | ✅ | ✅ | ✅ | Limits concurrent sessions with any scheduling option |
| --prefill-concurrency | ⚠️ | ⚠️ | ⚠️ | Requires --streaming; must be ≤ --concurrency |
| --concurrency-ramp-duration | ✅ | ✅ | ✅ | Works with any scheduling option |
| --prefill-concurrency-ramp-duration | ⚠️ | ⚠️ | ⚠️ | Requires --streaming; works with any scheduling option |

Concurrency behavior by configuration:

  • With --request-rate: Concurrency acts as a ceiling; rate-scheduled requests are blocked while the limit is reached
  • With --concurrency only (no rate options): Concurrency is the primary driver; requests are sent as fast as possible within the limit
  • With --fixed-schedule: Concurrency acts as a ceiling; requests fire at their scheduled times but are blocked while at the limit
  • With --user-centric-rate: Concurrency acts as a ceiling; user turns fire based on turn_gap but are blocked while at the limit

Important: If --concurrency is not set, session concurrency limiting is disabled (unlimited). For --user-centric-rate mode, consider setting --concurrency to at least --num-users to ensure all users can have in-flight requests.
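The ceiling behavior described above can be sketched with a semaphore: the scheduler keeps its own cadence, but every request must claim a slot before it can start (an illustrative asyncio sketch, not AIPerf's internals):

```python
import asyncio

async def run_with_ceiling(num_requests: int, concurrency: int, gap: float) -> int:
    """Rate-scheduled requests gated by a concurrency ceiling; returns peak concurrency."""
    sem = asyncio.Semaphore(concurrency)  # the --concurrency ceiling
    in_flight, peak = 0, 0

    async def request():
        nonlocal in_flight, peak
        async with sem:                    # blocks while at the limit
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.05)      # stand-in for the actual request
            in_flight -= 1

    tasks = []
    for _ in range(num_requests):
        tasks.append(asyncio.create_task(request()))
        await asyncio.sleep(gap)           # the rate scheduler keeps its cadence
    await asyncio.gather(*tasks)
    return peak

peak = asyncio.run(run_with_ceiling(num_requests=20, concurrency=3, gap=0.001))
print(peak)  # 3 — in-flight requests never exceed the ceiling
```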

See also: Prefill Concurrency Tutorial for detailed guidance on memory-safe long-context benchmarking.

Grace Period Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --benchmark-grace-period | ⚠️ | ⚠️ | ⚠️ | Requires --benchmark-duration; default: 30s (--user-centric-rate defaults to ∞ when duration-based) |

Fixed Schedule Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --fixed-schedule-auto-offset | ❌ | ✅ | ❌ | Raises error without --fixed-schedule; conflicts with --fixed-schedule-start-offset |
| --fixed-schedule-start-offset | ❌ | ✅ | ❌ | Raises error without --fixed-schedule; conflicts with --fixed-schedule-auto-offset |
| --fixed-schedule-end-offset | ❌ | ✅ | ❌ | Raises error without --fixed-schedule; must be ≥ start offset |

Request Cancellation Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --request-cancellation-rate | ✅ | ✅ | ✅ | Percentage (0-100) |
| --request-cancellation-delay | ⚠️ | ⚠️ | ⚠️ | Requires --request-cancellation-rate; raises error otherwise |
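How the two cancellation options combine can be sketched as follows: each request is selected for cancellation with probability --request-cancellation-rate percent, and the cancel is issued --request-cancellation-delay seconds after the request starts (illustrative names and mechanics, not AIPerf's implementation):

```python
import asyncio, random

async def send(cancellation_rate: float, cancellation_delay: float) -> str:
    """Issue one request; cancel it after a delay if selected."""
    work = asyncio.create_task(asyncio.sleep(0.2))   # stand-in for a 200 ms request
    if random.uniform(0, 100) < cancellation_rate:   # --request-cancellation-rate is 0-100
        await asyncio.sleep(cancellation_delay)      # --request-cancellation-delay
        work.cancel()
    try:
        await work
        return "completed"
    except asyncio.CancelledError:
        return "cancelled"

async def run(n: int, rate: float, delay: float):
    return await asyncio.gather(*(send(rate, delay) for _ in range(n)))

# With a 100% cancellation rate and a 10 ms delay, every request is cancelled
# well before its 200 ms response would have arrived.
results = asyncio.run(run(5, 100.0, 0.01))
print(results.count("cancelled"))  # 5
```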

Dataset Options

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --dataset-sampling-strategy | ✅ | ❌ | ✅ | Not compatible with --fixed-schedule |

Session Configuration

| Option | --request-rate | --fixed-schedule | --user-centric-rate | Notes |
|---|---|---|---|---|
| --session-turns-mean | ✅ | ✅ | ⚠️ | --user-centric-rate requires ≥ 2 |
| --session-turns-stddev | ✅ | ✅ | ✅ | |

Warmup Options

Warmup options work independently of the main benchmark configuration. The warmup phase always uses rate-based scheduling internally.

| Option | All Configurations | Notes |
|---|---|---|
| --warmup-request-count | ✅ | Stop condition for warmup; mutually exclusive with --num-warmup-sessions |
| --warmup-duration | ✅ | Stop condition for warmup |
| --num-warmup-sessions | ✅ | Stop condition for warmup; mutually exclusive with --warmup-request-count |
| --warmup-concurrency | ✅ | Falls back to --concurrency |
| --warmup-prefill-concurrency | ⚠️ | Requires --streaming |
| --warmup-request-rate | ✅ | Falls back to --request-rate |
| --warmup-arrival-pattern | ✅ | Falls back to --arrival-pattern |
| --warmup-grace-period | ⚠️ | Requires warmup to be enabled; default: ∞ |
| --warmup-concurrency-ramp-duration | ✅ | Falls back to --concurrency-ramp-duration |
| --warmup-prefill-concurrency-ramp-duration | ⚠️ | Requires --streaming |
| --warmup-request-rate-ramp-duration | ✅ | Falls back to --request-rate-ramp-duration |
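The "falls back to" rule is a simple per-option override: a warmup value applies when explicitly set, otherwise the main benchmark value is reused (a trivial sketch; the helper name is hypothetical):

```python
def effective_warmup(warmup_value, main_value):
    """Warmup options fall back to their main-benchmark counterparts when unset."""
    return warmup_value if warmup_value is not None else main_value

# --warmup-request-rate unset: warmup reuses --request-rate 10
print(effective_warmup(None, 10))  # 10
# --warmup-concurrency 2 overrides --concurrency 8 during warmup
print(effective_warmup(2, 8))      # 2
```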

Configuration Examples

Using --request-rate (Rate-Based Scheduling)

Sends requests at a target average rate with configurable arrival patterns.

```bash
# Poisson arrivals at 10 QPS
aiperf profile --url localhost:8000 --model llama \
  --request-rate 10 \
  --arrival-pattern poisson \
  --request-count 100

# Constant arrivals with concurrency limit
aiperf profile --url localhost:8000 --model llama \
  --request-rate 20 \
  --arrival-pattern constant \
  --concurrency 5 \
  --benchmark-duration 60
```

Using --concurrency Only (Burst Mode)

Sends requests as fast as possible within concurrency limits. Triggered when no rate option is specified.

```bash
# Maximum throughput within concurrency=10
aiperf profile --url localhost:8000 --model llama \
  --concurrency 10 \
  --request-count 100

# Prefill-limited throughput
aiperf profile --url localhost:8000 --model llama \
  --concurrency 20 \
  --prefill-concurrency 5 \
  --streaming \
  --benchmark-duration 60
```

Using --fixed-schedule (Trace Replay)

Replays requests at exact timestamps from dataset metadata. Used for trace replay benchmarking.

```bash
# Replay mooncake trace
aiperf profile --url localhost:8000 --model llama \
  --input-file trace.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule

# With time window filtering
aiperf profile --url localhost:8000 --model llama \
  --input-file trace.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --fixed-schedule-start-offset 60000 \
  --fixed-schedule-end-offset 120000
```
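Timestamp-based replay boils down to sleeping until each trace entry's offset from the benchmark start. The sketch below also applies the start/end offsets; whether AIPerf rebases timestamps against the start offset, as done here, is an assumption:

```python
import time

def replay(timestamps_ms, fire, start_offset_ms=0, end_offset_ms=None):
    """Fire each request at its trace timestamp, honoring the offset window."""
    t0 = time.monotonic()
    for ts in timestamps_ms:
        if ts < start_offset_ms:
            continue                # before --fixed-schedule-start-offset
        if end_offset_ms is not None and ts > end_offset_ms:
            break                   # past --fixed-schedule-end-offset
        delay = (ts - start_offset_ms) / 1000.0 - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)       # wait until the scheduled instant
        fire(ts)

fired = []
replay([0, 10, 20, 500], fired.append, start_offset_ms=0, end_offset_ms=100)
print(fired)  # [0, 10, 20] — the 500 ms entry falls outside the window
```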

Using --user-centric-rate (KV Cache Benchmarking)

Per-user rate limiting for KV cache benchmarking. Each user has a consistent gap between their turns.

```bash
# 15 users at 1 QPS total (basic example)
aiperf profile --url localhost:8000 --model llama \
  --user-centric-rate 1.0 \
  --num-users 15 \
  --session-turns-mean 20 \
  --streaming \
  --benchmark-duration 300
```

Key formula: turn_gap = num_users / user_centric_rate

With --num-users 15 and --user-centric-rate 1.0, each user has 15 seconds between their turns.
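The arithmetic, spelled out (user_centric_rate is the aggregate QPS across all users):

```python
def turn_gap(num_users: int, user_centric_rate: float) -> float:
    """Seconds between consecutive turns of the same user."""
    return num_users / user_centric_rate

gap = turn_gap(15, 1.0)
print(gap)       # 15.0 seconds between each user's turns
print(15 / gap)  # 1.0 QPS in aggregate: 15 users, each firing every 15 s
```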

For complete KV cache benchmarking, also configure shared system prompts and user context prompts. See the User-Centric Timing Tutorial for full configuration including --shared-system-prompt-length, --user-context-prompt-length, and other prompt options.


Common Validation Errors

| Error | Cause | Solution |
|---|---|---|
| --user-centric-rate cannot be used together with --request-rate or --arrival-pattern | Conflicting options | Use only one scheduling option |
| --user-centric-rate requires --num-users to be set | Missing required option | Add --num-users |
| --user-centric-rate requires multi-turn conversations (--session-turns-mean >= 2) | Single-turn with --user-centric-rate | Use --request-rate for single-turn or increase --session-turns-mean |
| --benchmark-grace-period can only be used with duration-based benchmarking | Grace period without duration | Add --benchmark-duration |
| --warmup-grace-period can only be used when warmup is enabled | Warmup grace without warmup | Add --warmup-request-count, --warmup-duration, or --num-warmup-sessions |
| --prefill-concurrency requires --streaming to be enabled | Prefill without streaming | Add --streaming |
| --arrival-smoothness can only be used with --arrival-pattern gamma | Wrong arrival pattern | Change to --arrival-pattern gamma |
| Dataset sampling strategy is not compatible with fixed schedule mode | Sampling with --fixed-schedule | Remove --dataset-sampling-strategy |
| Both a request-count and number of conversations are set | Conflicting stop conditions | Use only one of --request-count or --num-sessions |
| Both --warmup-request-count and --num-warmup-sessions are set | Conflicting warmup stop conditions | Use only one of --warmup-request-count or --num-warmup-sessions |
| --num-users can only be used with --user-centric-rate | --num-users without --user-centric-rate | Add --user-centric-rate or remove --num-users |
| --request-cancellation-delay can only be used with --request-cancellation-rate | Delay without cancellation rate | Add --request-cancellation-rate or remove --request-cancellation-delay |
| --fixed-schedule-* can only be used with --fixed-schedule | Fixed schedule options without --fixed-schedule | Add --fixed-schedule or remove the offset options |
| --request-rate-ramp-duration cannot be used with --user-centric-rate | Rate ramping with --user-centric-rate | Remove --request-rate-ramp-duration |
| --request-rate-ramp-duration cannot be used with --fixed-schedule | Rate ramping with --fixed-schedule | Remove --request-rate-ramp-duration |

Quick Reference: Which Options to Use

  • Replaying a trace with timestamps? → --fixed-schedule (with mooncake_trace dataset)
  • Multi-turn KV cache benchmarking? → --user-centric-rate + --num-users
  • Controlled request rate testing? → --request-rate (+ optional --arrival-pattern)
  • Maximum throughput / saturation testing? → --concurrency only (no rate options)

Full Options Reference

Scheduling Options

| Option | Type | Default | Description |
|---|---|---|---|
| --request-rate | float | None | Target QPS; enables rate-based scheduling |
| --user-centric-rate | float | None | Per-user QPS; enables turn-gap scheduling (requires --num-users) |
| --fixed-schedule | bool | false | Enable timestamp-based scheduling from dataset |
| --num-users | int | None | Concurrent users (required with --user-centric-rate) |
| --arrival-pattern | enum | poisson | Request arrival distribution: constant, poisson, gamma (only with --request-rate) |
| --arrival-smoothness | float | 1.0 | Gamma distribution shape (only with --arrival-pattern gamma) |
| --request-rate-ramp-duration | float | None | Seconds to ramp request rate from proportional minimum to target (only with --request-rate) |

Concurrency Options

| Option | Type | Default | Description |
|---|---|---|---|
| --concurrency | int | None | Max concurrent sessions; drives throughput when no rate option is specified |
| --prefill-concurrency | int | None | Max requests in prefill stage (requires --streaming) |
| --concurrency-ramp-duration | float | None | Seconds to ramp concurrency from 1 to target |
| --prefill-concurrency-ramp-duration | float | None | Seconds to ramp prefill concurrency |

Stop Conditions

| Option | Type | Default | Description |
|---|---|---|---|
| --benchmark-duration | float | None | Max duration in seconds for benchmarking |
| --benchmark-grace-period | float | 30.0 | Grace period after duration ends (requires --benchmark-duration) |
| --request-count | int | Auto | Max requests to send |
| --num-sessions | int | None | Number of conversations to run |

Request Cancellation

| Option | Type | Default | Description |
|---|---|---|---|
| --request-cancellation-rate | float | None | Percentage of requests to cancel (0-100) |
| --request-cancellation-delay | float | 0.0 | Seconds to wait before cancelling (requires --request-cancellation-rate) |

Warmup Options

| Option | Type | Default | Description |
|---|---|---|---|
| --warmup-request-count | int | None | Max warmup requests; mutually exclusive with --num-warmup-sessions |
| --warmup-duration | float | None | Max warmup duration in seconds |
| --num-warmup-sessions | int | None | Number of warmup sessions; mutually exclusive with --warmup-request-count |
| --warmup-concurrency | int | --concurrency | Warmup max concurrent requests |
| --warmup-prefill-concurrency | int | --prefill-concurrency | Warmup prefill concurrency |
| --warmup-request-rate | float | --request-rate | Warmup request rate |
| --warmup-arrival-pattern | enum | --arrival-pattern | Warmup arrival pattern |
| --warmup-grace-period | float | ∞ | Seconds to wait for warmup responses |
| --warmup-concurrency-ramp-duration | float | --concurrency-ramp-duration | Warmup concurrency ramp |
| --warmup-prefill-concurrency-ramp-duration | float | --prefill-concurrency-ramp-duration | Warmup prefill ramp |
| --warmup-request-rate-ramp-duration | float | --request-rate-ramp-duration | Warmup rate ramp |

Fixed Schedule Options

| Option | Type | Default | Description |
|---|---|---|---|
| --fixed-schedule-auto-offset | bool | false | Auto-offset timestamps to start at 0 (requires --fixed-schedule) |
| --fixed-schedule-start-offset | int | None | Start offset in milliseconds (requires --fixed-schedule) |
| --fixed-schedule-end-offset | int | None | End offset in milliseconds (requires --fixed-schedule) |

Session Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| --session-turns-mean | float | 1.0 | Mean turns per session (--user-centric-rate requires ≥ 2) |
| --session-turns-stddev | float | 0.0 | Standard deviation of turns |
| --dataset-sampling-strategy | enum | shuffle | Dataset sampling: sequential, shuffle (not with --fixed-schedule) |

Multi-URL Load Balancing

| Option | Type | Default | Description |
|---|---|---|---|
| --url | list | localhost:8000 | One or more endpoint URLs; multiple URLs enable load balancing |
| --url-strategy | enum | round_robin | Strategy for distributing requests across multiple URLs |
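With round_robin, requests simply cycle through the configured URLs in order, wrapping around when the list is exhausted (a sketch with hypothetical hostnames):

```python
from itertools import cycle

# Three endpoints as might be passed via repeated --url flags (hypothetical hosts).
urls = ["node-a:8000", "node-b:8000", "node-c:8000"]
pick = cycle(urls)  # round_robin: each request takes the next URL in order
print([next(pick) for _ in range(5)])
# ['node-a:8000', 'node-b:8000', 'node-c:8000', 'node-a:8000', 'node-b:8000']
```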

See also: Multi-URL Load Balancing Tutorial for detailed configuration and examples.