Multi-Turn Conversations
Multi-Turn Conversations
Multi-turn conversations allow you to benchmark chat-based models with realistic back-and-forth dialogue patterns. This feature simulates real-world scenarios where users engage in extended conversations with multiple exchanges, rather than isolated single-turn queries.
Overview
Multi-turn benchmarking provides several advantages:
- Realistic Chat Simulation: Model actual user interactions with conversational AI systems
- Context Window Testing: Evaluate performance as conversation history grows
- Session-based Load: Test how servers handle sustained multi-turn sessions
- Memory and State Management: Identify issues with conversation state handling
- Conversation Flow Analysis: Measure performance degradation over multiple turns
Understanding Request Control Options
AIPerf provides different options for controlling the number of requests depending on whether you’re running single-turn or multi-turn benchmarks:
--request-count: Controls the total number of single-turn requests to send. Use this for traditional single-turn benchmarks.--conversation-num: Controls the total number of conversations (sessions) to send in multi-turn scenarios. Each conversation may contain multiple turns (requests).
These options are mutually exclusive in their intent - use --request-count for single-turn benchmarking and --conversation-num for multi-turn benchmarking to avoid confusion.
Dataset Generation vs Request Execution
The --num-dataset-entries option controls how many unique prompts are generated in the dataset. This is separate from the number of requests or conversations:
--num-dataset-entries: Number of unique prompt entries to generate in the dataset--request-count: Number of single-turn requests to send (for single-turn benchmarks)--conversation-num: Number of conversations to send (for multi-turn benchmarks)
The dataset entries are reused/sampled as needed to fulfill the total request or conversation count. For example, you might generate 100 unique prompts (--num-dataset-entries 100) but send 1000 requests that sample from those prompts. --dataset-sampling-strategy determines how the pool of prompts is sampled when building payloads.
Core Parameters
Conversation Control
--conversation-num <N>: Total number of unique conversation sessions to execute- Aliases:
--num-conversations,--num-sessions - Each conversation represents a complete multi-turn dialogue session
- Aliases:
Turn Configuration
-
--conversation-turn-mean <N>: Average number of turns per conversation- Default: 1 (single-turn)
- Aliases:
--session-turns-mean
-
--conversation-turn-stddev <N>: Standard deviation for number of turns- Default: 0 (fixed number of turns)
- Aliases:
--session-turns-stddev
Turn Delays
-
--conversation-turn-delay-mean <MS>: Average delay between turns in milliseconds- Default: 0ms
- Simulates realistic user “think time” between messages
- Aliases:
--session-turn-delay-mean
-
--conversation-turn-delay-stddev <MS>: Standard deviation for turn delays- Default: 0ms
- Adds natural variance to delays
- Aliases:
--session-turn-delay-stddev
Setting Up the Server
Basic Multi-Turn Examples
Fixed-Length Conversations
Run a simple multi-turn benchmark with a fixed number of turns per conversation:
Sample Output (Successful Run):
This command will:
- Execute 10 separate conversation sessions
- Each conversation will have exactly 3 turns (requests)
- Total requests sent: 10 conversations × 3 turns = 30 requests
- 2 conversations will run concurrently
Variable-Length Conversations
Add variance to the number of turns per conversation for more realistic patterns:
This creates conversations with varying lengths (typically 3-7 turns), simulating natural conversation patterns where some users ask quick questions and others engage in deeper discussions.
Advanced Multi-Turn Scenarios
Realistic User Behavior with Turn Delays
Simulate real user “think time” between turns to model actual human interaction patterns:
The turn delays simulate realistic pauses as users read responses and formulate follow-up questions. This is critical for:
- Testing connection keep-alive mechanisms
- Evaluating server-side session state management
- Measuring sustained performance under realistic load
High-Concurrency Multi-Turn Sessions
Test how your server handles many simultaneous multi-turn conversations:
This benchmark:
- Maintains 50 active conversations simultaneously
- Tests session isolation and resource management
- Identifies scalability bottlenecks with multiple concurrent sessions
Request Rate with Multi-Turn Conversations
Combine request rate control with multi-turn conversations for controlled, sustained load:
This approach is ideal for:
- Modeling steady conversation arrival patterns
- Avoiding thundering herd problems during testing
- Measuring performance under controlled, sustained multi-turn load
Use Cases
Customer Support Chatbot Testing
Simulate realistic customer support interactions with varying conversation lengths:
Context Window Stress Testing
Test model performance with long conversations that accumulate substantial context:
Each turn in a conversation includes the full conversation history, so:
- Turn 1: ~300 tokens input
- Turn 5: ~300 + (4 × 250) = ~1300 tokens input
- Turn 15: ~300 + (14 × 250) = ~3800 tokens input
This helps identify performance degradation as context grows.
Burst Traffic Simulation
Simulate sudden spikes in conversation activity:
How Multi-Turn Works
Message History Accumulation
In multi-turn conversations, each subsequent turn includes the complete conversation history:
Turn 1:
Turn 2:
Turn 3:
This accumulation means:
- Input token count grows with each turn
- Later turns have increasingly large context to process
Real-World Conversation Flow
AIPerf simulates realistic multi-turn conversations by modeling natural user behavior patterns. Here’s how a typical multi-turn conversation flows:
Turn 0 (First Turn):
- User sends initial message → AI responds
- No delay before this turn (users don’t wait to start conversations)
Turn 1 (Second Turn):
- DELAY: User reads AI’s response and thinks about next message (configurable delay applied)
- User sends follow-up message → AI responds
Turn 2 (Third Turn):
- DELAY: User reads AI’s response and thinks about next message (configurable delay applied)
- User sends next message → AI responds
…and so on for subsequent turns
This flow pattern ensures benchmarks reflect real-world usage where:
- Users need time to read and process AI responses
- There’s natural thinking/typing time between messages
- The first message is sent immediately when starting a conversation
- Delays are applied before sending each subsequent turn
The delays between turns are controlled by:
--conversation-turn-delay-mean: Average delay in milliseconds (e.g., 2000ms = 2 seconds)--conversation-turn-delay-stddev: Variation in delays to simulate natural human behavior--conversation-turn-delay-ratio: Scaling factor for all delays
Execution Flow
- Dataset Generation: AIPerf generates the specified number of conversations, each with a random number of turns based on your mean and stddev
- Conversation Distribution: Conversations are distributed to workers according to concurrency and rate limits
- Turn Execution: For each conversation:
- Execute turn 1 (first turn, no delay), wait for response
- Append assistant’s response to conversation history
- Apply turn delay (simulating user reading/thinking time)
- Execute turn 2 with accumulated history, wait for response
- Apply turn delay
- Repeat for all remaining turns in the conversation
- Metrics Collection: Metrics are collected per-turn and aggregated across all conversations
Quick Reference
Conversation Control:
--conversation-num <N>— Number of conversation sessions (for multi-turn)--request-count <N>— Number of requests (for single-turn)--num-dataset-entries <N>— Number of unique prompts to generate
Turn Configuration:
--conversation-turn-mean <N>— Average turns per conversation (default: 1)--conversation-turn-stddev <N>— Standard deviation of turns (default: 0)
Turn Delays:
--conversation-turn-delay-mean <MS>— Average delay between turns in ms (default: 0)--conversation-turn-delay-stddev <MS>— Standard deviation of delays in ms (default: 0)
Best Practices:
- Start with lower concurrency when testing multi-turn (2-5) to understand baseline behavior
- Use turn delays to model realistic user interaction patterns
- Monitor context window growth in long conversations (turns × output tokens)
- Consider using
--request-rateto control conversation start rate for more predictable load - Use
--random-seedfor reproducible conversation patterns
See also:
- Conversation Context Mode — Control how conversation history accumulates (deltas vs message arrays, with or without pre-canned responses)