# Multi-Turn Conversations

Multi-turn conversations allow you to benchmark chat-based models with realistic back-and-forth dialogue patterns. This feature simulates real-world scenarios where users engage in extended conversations with multiple exchanges, rather than isolated single-turn queries.

## Overview

Multi-turn benchmarking provides several advantages:

- **Realistic Chat Simulation**: Model actual user interactions with conversational AI systems
- **Context Window Testing**: Evaluate performance as conversation history grows
- **Session-based Load**: Test how servers handle sustained multi-turn sessions
- **Memory and State Management**: Identify issues with conversation state handling
- **Conversation Flow Analysis**: Measure performance degradation over multiple turns

**Understanding Request Control Options**

AIPerf provides different options for controlling the number of requests depending on whether you're running single-turn or multi-turn benchmarks:

- **`--request-count`**: Controls the total number of **single-turn requests** to send. Use this for traditional single-turn benchmarks.
- **`--conversation-num`**: Controls the total number of **conversations (sessions)** to send in multi-turn scenarios. Each conversation may contain multiple turns (requests).

These options are mutually exclusive in their intent - use `--request-count` for single-turn benchmarking and `--conversation-num` for multi-turn benchmarking to avoid confusion.

**Dataset Generation vs Request Execution**

The `--num-dataset-entries` option controls how many **unique prompts** are generated in the dataset.
This is separate from the number of requests or conversations:

- `--num-dataset-entries`: Number of unique prompt entries to generate in the dataset
- `--request-count`: Number of single-turn requests to send (for single-turn benchmarks)
- `--conversation-num`: Number of conversations to send (for multi-turn benchmarks)

The dataset entries are reused/sampled as needed to fulfill the total request or conversation count. For example, you might generate 100 unique prompts (`--num-dataset-entries 100`) but send 1000 requests that sample from those prompts. `--dataset-sampling-strategy` determines how the pool of prompts is sampled when building payloads.

## Core Parameters

### Conversation Control

- **`--conversation-num`**: Total number of unique conversation sessions to execute
  - Aliases: `--num-conversations`, `--num-sessions`
  - Each conversation represents a complete multi-turn dialogue session

### Turn Configuration

- **`--conversation-turn-mean`**: Average number of turns per conversation
  - Default: 1 (single-turn)
  - Aliases: `--session-turns-mean`
- **`--conversation-turn-stddev`**: Standard deviation for number of turns
  - Default: 0 (fixed number of turns)
  - Aliases: `--session-turns-stddev`

### Turn Delays

- **`--conversation-turn-delay-mean`**: Average delay between turns in milliseconds
  - Default: 0ms
  - Simulates realistic user "think time" between messages
  - Aliases: `--session-turn-delay-mean`
- **`--conversation-turn-delay-stddev`**: Standard deviation for turn delays
  - Default: 0ms
  - Adds natural variance to delays
  - Aliases: `--session-turn-delay-stddev`

## Setting Up the Server

```bash
# Start vLLM server for multi-turn benchmarking
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model Qwen/Qwen3-0.6B \
    --host 0.0.0.0 --port 8000 &
```

## Basic Multi-Turn Examples

### Fixed-Length Conversations

Run a simple multi-turn benchmark with a fixed number of turns per conversation:

```bash
# Run 10 conversations, each with exactly 3 turns
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 10 \
    --conversation-turn-mean 3 \
    --conversation-turn-stddev 0 \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 150 \
    --concurrency 2 \
    --random-seed 42
```

**Sample Output (Successful Run):**

```
INFO Starting AIPerf System
INFO Multi-turn mode: 10 conversations, 3 turns each (30 total requests)
INFO AIPerf System is PROFILING

Profiling: 30/30 |████████████████████████| 100% [00:45<00:00]
Conversations: 10/10 complete

INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency2/

                        NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric                      ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms)        │ 234.56 │ 198.23 │ 287.45 │ 276.12 │ 232.45 │
│ Time to First Token (ms)    │  52.34 │  45.12 │  68.23 │  65.45 │  51.89 │
│ Output Token Count (tokens) │ 150.00 │ 150.00 │ 150.00 │ 150.00 │ 150.00 │
└─────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency2/profile_export_aiperf.json
```

This command will:

- Execute 10 separate conversation sessions
- Each conversation will have exactly 3 turns (requests)
- Total requests sent: 10 conversations × 3 turns = 30 requests
- 2 conversations will run concurrently

### Variable-Length Conversations

Add variance to the number of turns per conversation for more realistic patterns:

```bash
# Run conversations with variable lengths (mean: 5, stddev: 2)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 20 \
    --conversation-turn-mean 5 \
    --conversation-turn-stddev 2 \
    --synthetic-input-tokens-mean 150 \
    --output-tokens-mean 100 \
    --concurrency 4 \
    --random-seed 42
```

This creates conversations with varying lengths (typically 3-7 turns), simulating natural conversation patterns where some users ask quick questions and others engage in deeper discussions.

## Advanced Multi-Turn Scenarios

### Realistic User Behavior with Turn Delays

Simulate real user "think time" between turns to model actual human interaction patterns:

```bash
# Add realistic delays between turns (mean: 2000ms, stddev: 500ms)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 15 \
    --conversation-turn-mean 4 \
    --conversation-turn-stddev 1 \
    --conversation-turn-delay-mean 2000 \
    --conversation-turn-delay-stddev 500 \
    --synthetic-input-tokens-mean 180 \
    --output-tokens-mean 120 \
    --concurrency 3 \
    --random-seed 42
```

The turn delays simulate realistic pauses as users read responses and formulate follow-up questions.
This is critical for:

- Testing connection keep-alive mechanisms
- Evaluating server-side session state management
- Measuring sustained performance under realistic load

### High-Concurrency Multi-Turn Sessions

Test how your server handles many simultaneous multi-turn conversations:

```bash
# Run 100 conversations with variable lengths, 50 at a time
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 100 \
    --conversation-turn-mean 6 \
    --conversation-turn-stddev 2 \
    --synthetic-input-tokens-mean 250 \
    --output-tokens-mean 200 \
    --concurrency 50 \
    --random-seed 42
```

This benchmark:

- Maintains 50 active conversations simultaneously
- Tests session isolation and resource management
- Identifies scalability bottlenecks with multiple concurrent sessions

### Request Rate with Multi-Turn Conversations

Combine request rate control with multi-turn conversations for controlled, sustained load:

```bash
# Start new conversations at 5 conversations/second
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 30 \
    --conversation-turn-mean 4 \
    --request-rate 5 \
    --request-rate-mode poisson \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 150 \
    --random-seed 42
```

This approach is ideal for:

- Modeling steady conversation arrival patterns
- Avoiding thundering herd problems during testing
- Measuring performance under controlled, sustained multi-turn load

## Use Cases

### Customer Support Chatbot Testing

Simulate realistic customer support interactions with varying conversation lengths:

```bash
# Model customer support conversations:
# - Average 6-8 turns per conversation
# - Natural delays between user messages
# - Mix of short and long input/output sequences
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 50 \
    --conversation-turn-mean 7 \
    --conversation-turn-stddev 2 \
    --conversation-turn-delay-mean 3000 \
    --conversation-turn-delay-stddev 1000 \
    --synthetic-input-tokens-mean 150 \
    --synthetic-input-tokens-stddev 50 \
    --output-tokens-mean 200 \
    --output-tokens-stddev 80 \
    --concurrency 10 \
    --random-seed 42
```

### Context Window Stress Testing

Test model performance with long conversations that accumulate substantial context:

```bash
# Test long conversations with growing context
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 10 \
    --conversation-turn-mean 15 \
    --conversation-turn-stddev 3 \
    --synthetic-input-tokens-mean 300 \
    --output-tokens-mean 250 \
    --concurrency 2 \
    --random-seed 42
```

Each turn in a conversation includes the full conversation history, so the input grows with every earlier user message and assistant response. Assuming token counts stay near their configured means (~300 per user message, ~250 per response):

- Turn 1: ~300 tokens input
- Turn 5: (5 × 300) + (4 × 250) = ~2500 tokens input
- Turn 15: (15 × 300) + (14 × 250) = ~8000 tokens input

This helps identify performance degradation as context grows.
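The growth estimate can be reproduced with a few lines of Python. This is a minimal sketch, assuming every earlier user message and assistant response is resent on each turn and that token counts land on their configured means. The function name and defaults are hypothetical and not part of AIPerf. Because it counts earlier user messages as well as earlier assistant responses, treat the result as a rough planning figure rather than a measured value.

```python
def estimated_input_tokens(turn: int, input_mean: int = 300, output_mean: int = 250) -> int:
    """Approximate prompt size for a 1-based turn number when the full history is resent.

    Hypothetical helper for planning only; not an AIPerf API.
    """
    if turn < 1:
        raise ValueError("turn numbers start at 1")
    user_tokens = turn * input_mean              # every user message so far, including the new one
    assistant_tokens = (turn - 1) * output_mean  # every earlier assistant response
    return user_tokens + assistant_tokens


if __name__ == "__main__":
    for turn in (1, 5, 15):
        print(f"turn {turn:>2}: ~{estimated_input_tokens(turn)} input tokens")
```

Comparing the estimate for the longest expected conversation against the model's context limit is a quick way to choose `--conversation-turn-mean` and the token means before running a stress test.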
### Burst Traffic Simulation

Simulate sudden spikes in conversation activity:

```bash
# Simulate burst of conversation starts
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --conversation-num 100 \
    --conversation-turn-mean 3 \
    --concurrency 50 \
    --synthetic-input-tokens-mean 180 \
    --output-tokens-mean 120 \
    --random-seed 42
```

## How Multi-Turn Works

### Message History Accumulation

In multi-turn conversations, each subsequent turn includes the complete conversation history:

**Turn 1:**

```json
{
  "messages": [
    {"role": "user", "content": "What is machine learning?"}
  ]
}
```

**Turn 2:**

```json
{
  "messages": [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
  ]
}
```

**Turn 3:**

```json
{
  "messages": [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"},
    {"role": "assistant", "content": "Sure! One example is..."},
    {"role": "user", "content": "How does it differ from traditional programming?"}
  ]
}
```

This accumulation means:

- Input token count grows with each turn
- Later turns have increasingly large context to process

### Real-World Conversation Flow

AIPerf simulates realistic multi-turn conversations by modeling natural user behavior patterns.
Here's how a typical multi-turn conversation flows:

**Turn 1 (First Turn):**

- User sends initial message → AI responds
- **No delay** before this turn (users don't wait to start conversations)

**Turn 2 (Second Turn):**

- **DELAY**: User reads AI's response and thinks about next message (configurable delay applied)
- User sends follow-up message → AI responds

**Turn 3 (Third Turn):**

- **DELAY**: User reads AI's response and thinks about next message (configurable delay applied)
- User sends next message → AI responds

**...and so on for subsequent turns**

This flow pattern ensures benchmarks reflect real-world usage where:

- Users need time to read and process AI responses
- There's natural thinking/typing time between messages
- The first message is sent immediately when starting a conversation
- Delays are applied **before** sending each subsequent turn

The delays between turns are controlled by:

- `--conversation-turn-delay-mean`: Average delay in milliseconds (e.g., 2000ms = 2 seconds)
- `--conversation-turn-delay-stddev`: Variation in delays to simulate natural human behavior
- `--conversation-turn-delay-ratio`: Scaling factor for all delays

### Execution Flow

1. **Dataset Generation**: AIPerf generates the specified number of conversations, each with a random number of turns based on your mean and stddev
2. **Conversation Distribution**: Conversations are distributed to workers according to concurrency and rate limits
3. **Turn Execution**: For each conversation:
   - Execute turn 1 (first turn, no delay), wait for response
   - Append assistant's response to conversation history
   - Apply turn delay (simulating user reading/thinking time)
   - Execute turn 2 with accumulated history, wait for response
   - Apply turn delay
   - Repeat for all remaining turns in the conversation
4. **Metrics Collection**: Metrics are collected per-turn and aggregated across all conversations

## Quick Reference

**Conversation Control:**

- `--conversation-num`: Number of conversation sessions (for multi-turn)
- `--request-count`: Number of requests (for single-turn)
- `--num-dataset-entries`: Number of unique prompts to generate

**Turn Configuration:**

- `--conversation-turn-mean`: Average turns per conversation (default: 1)
- `--conversation-turn-stddev`: Standard deviation of turns (default: 0)

**Turn Delays:**

- `--conversation-turn-delay-mean`: Average delay between turns in ms (default: 0)
- `--conversation-turn-delay-stddev`: Standard deviation of delays in ms (default: 0)

**Best Practices:**

- Start with lower concurrency when testing multi-turn (2-5) to understand baseline behavior
- Use turn delays to model realistic user interaction patterns
- Monitor context window growth in long conversations (each turn adds its input and output tokens to the history)
- Consider using `--request-rate` to control conversation start rate for more predictable load
- Use `--random-seed` for reproducible conversation patterns

**See also:**

- [Conversation Context Mode](../reference/conversation-context-mode.md): Control how conversation history accumulates (deltas vs message arrays, with or without pre-canned responses)
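
As a rough illustration of the per-conversation flow described in this guide (first turn sent immediately, a delay before each later turn, history growing by one assistant reply and one user message per turn), the following Python sketch plans the payloads for a single conversation. It is a sketch under stated assumptions, not AIPerf's implementation: `plan_conversation`, its parameters, and the Gaussian delay clamped at zero are all hypothetical choices made for the example.

```python
import random


def plan_conversation(user_messages, assistant_replies,
                      delay_mean_ms=0.0, delay_stddev_ms=0.0, seed=42):
    """Return one (delay_before_ms, payload) pair per turn.

    Hypothetical helper: the name, arguments, and clamped-Gaussian delay are
    assumptions for illustration, not AIPerf internals. `assistant_replies`
    must hold at least len(user_messages) - 1 entries.
    """
    rng = random.Random(seed)  # seeded so the plan is reproducible
    history = []
    plan = []
    for index, user_text in enumerate(user_messages):
        if index == 0:
            delay_ms = 0.0  # the first turn is sent without any delay
        else:
            # The previous assistant reply joins the history before the next turn.
            history.append({"role": "assistant", "content": assistant_replies[index - 1]})
            # Sample the "think time"; clamp so a negative draw never occurs.
            delay_ms = max(0.0, rng.gauss(delay_mean_ms, delay_stddev_ms))
        history.append({"role": "user", "content": user_text})
        # Copy the list so later turns do not change earlier payloads.
        plan.append((delay_ms, {"messages": list(history)}))
    return plan


if __name__ == "__main__":
    turns = plan_conversation(
        ["What is machine learning?", "Can you give an example?"],
        ["Machine learning is..."],
        delay_mean_ms=2000, delay_stddev_ms=500,
    )
    for delay_ms, payload in turns:
        print(f"wait {delay_ms:.0f} ms, then send {len(payload['messages'])} messages")
```

Each payload carries two more messages than the one before it, which is why later turns take longer to process even when the new user message is short.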