Benchmark Multi-Turn Chat with GenAI-Perf

GenAI-Perf allows you to benchmark multi-turn chat, simulating the turns of a conversation in a way that matches real-world user behavior.

You can use either synthetic data or a custom dataset. This tutorial will guide you through setting up a model server and running a profiling session with simulated conversations or a predefined dataset.

Start a Chat Model Server

First, launch a vLLM server with a chat endpoint:

docker run -it --net=host --rm --gpus=all \
  vllm/vllm-openai:latest \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --dtype float16 \
  --max-model-len 1024
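
Before profiling, you can confirm the server is ready. A minimal sketch, assuming vLLM's default port 8000 on localhost:

# Poll the health endpoint until the server reports ready (assumes default port 8000).
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for the vLLM server..."
  sleep 5
done

# List the served models to confirm the endpoint is available.
curl -s http://localhost:8000/v1/models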

Approach 1: Benchmark with Synthetic Data

Use synthetic data to simulate multiple chat sessions with controlled input and output token lengths.

genai-perf profile \
  -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --endpoint-type chat \
  --num-sessions 10 \
  --session-concurrency 5 \
  --session-turns-mean 2 \
  --session-turns-stddev 0 \
  --session-turn-delay-mean 1000 \
  --session-turn-delay-stddev 5 \
  --synthetic-input-tokens-mean 50 \
  --output-tokens-mean 50 \
  --num-prefix-prompts 3 \
  --prefix-prompt-length 15

Understand Key Arguments

Required Arguments

  • --num-sessions 10: Simulates 10 independent chat sessions.

  • --session-concurrency 5: Enables session mode and runs up to 5 sessions in parallel.

Optional Arguments

  • --session-turns-mean 2: Each session has an average of 2 turns.

  • --session-turn-delay-mean 1000: Waits an average of 1000 ms (1 second) between user turns, simulating real-world interaction.

  • --synthetic-input-tokens-mean 50: Each user input averages 50 tokens.

  • --output-tokens-mean 50: Each model response averages 50 tokens.

  • --num-prefix-prompts 3: Uses a pool of 3 system prompts for the first turn in each session.

  • --prefix-prompt-length 15: Each prefix prompt contains 15 tokens.
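
Taken together, these settings send roughly --num-sessions × --session-turns-mean = 10 × 2 = 20 requests in total, with an average of one second of simulated think time between the turns of each session.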


Approach 2: Benchmark with a Custom Dataset

If you prefer to benchmark with a predefined dataset, create a JSONL input file containing it.

Example Input File

echo '{"session_id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6", "delay": 1000, "input_length": 50, "output_length": 10}
{"session_id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6", "delay": 2000, "input_length": 50, "output_length": 10}
{"session_id": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6", "input_length": 100, "output_length": 10}
{"session_id": "113059749145936325402354257176981405696", "delay": 1000, "input_length": 25, "output_length": 20}
{"session_id": "113059749145936325402354257176981405696", "input_length": 20, "output_length": 20}' > inputs.jsonl

Understand Key Arguments

Most of the arguments are the same as in the synthetic data approach. The new ones are detailed below.

Optional Arguments

  • --session-delay-ratio: Multiplies every delay in the payload file by the given ratio, making it easy to scale the delays to represent different scenarios. For example, --session-delay-ratio 0.5 turns a payload delay of 1000 ms into a 500 ms wait.

Understand Key Fields

Required Fields

  • session_id: Identifies the session a request belongs to; lines that share a session_id are sent in order as the turns of one conversation.

  • delay: Sets the delay in milliseconds to wait after receiving a response before sending the session's next request. This field is required for every turn except the last in a session.

Optional Fields

  • input_length: Sets the token length of the input for this request.

  • output_length: Sets the token length of the output for this request.

  • text: Provides the prompt text directly, if you prefer to supply your own rather than have it synthetically generated (see the example after this list).
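
For illustration, a hypothetical two-turn session that supplies its own prompts through text might look like this (the session_id, prompts, and file name are made up for this sketch):

echo '{"session_id": "a1b2c3d4-7dec-11d0-a765-00a0c91e6bf6", "delay": 1000, "text": "What is the capital of France?", "output_length": 20}
{"session_id": "a1b2c3d4-7dec-11d0-a765-00a0c91e6bf6", "text": "And what is its population?", "output_length": 20}' > custom_inputs.jsonl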

Run GenAI-Perf with Custom Input

genai-perf profile \
  -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --endpoint-type chat \
  --input-file payload:inputs.jsonl \
  --session-concurrency 2 \
  --session-delay-ratio 0.5

Review the Output

Example output:

                             NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃                         Statistic ┃    avg ┃   min ┃    max ┃    p99 ┃    p90 ┃   p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│              Request Latency (ms) │  80.88 │ 50.25 │ 124.29 │ 123.21 │ 113.50 │ 97.31 │
│   Output Sequence Length (tokens) │  14.00 │ 10.00 │  20.00 │  20.00 │  20.00 │ 20.00 │
│    Input Sequence Length (tokens) │  38.80 │ 22.00 │  50.00 │  50.00 │  50.00 │ 50.00 │
│ Output Token Throughput (per sec) │ 315.70 │   N/A │    N/A │    N/A │    N/A │   N/A │
│      Request Throughput (per sec) │  22.55 │   N/A │    N/A │    N/A │    N/A │   N/A │
│             Request Count (count) │   5.00 │   N/A │    N/A │    N/A │    N/A │   N/A │
└───────────────────────────────────┴────────┴───────┴────────┴────────┴────────┴───────┘
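
As a quick consistency check, the Request Count of 5 matches the five payload lines in inputs.jsonl, and the Output Sequence Length statistics (between 10 and 20 tokens) line up with the output_length values specified in the file.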