Trace Replay with Mooncake Traces


This tutorial covers replaying production traces using the Mooncake trace format. Trace replay benchmarking reproduces real-world traffic patterns with precise timing control, enabling performance validation and capacity planning under realistic load.

When to Use This Tutorial

Use this approach when you need to:

  • Replay production traffic patterns captured from real systems
  • Validate performance with industry-standard Mooncake FAST’25 traces
  • Test system behavior under specific temporal load patterns
  • Reproduce benchmark results for regression testing


Start a vLLM Server

Launch a vLLM server with a chat model:

$ docker pull vllm/vllm-openai:latest
$ docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
>   --model Qwen/Qwen3-0.6B

Verify the server is ready:

$ curl -s localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

Mooncake Trace Format

Mooncake provides a trace format specification and sample datasets that can be replayed for performance benchmarking.

Mooncake traces use a JSONL file where each line represents a request with timing information.

Fields for trace replay:

  • timestamp: Request arrival time in milliseconds
  • input_length: Number of input tokens
  • output_length: Number of output tokens
  • hash_ids: List of block hashes (optional)
  • tools: List of OpenAI-compatible tool definitions (optional, requires messages)

Example entry:

{"timestamp": 0, "input_length": 655, "output_length": 52, "hash_ids": [0, 1, 2]}
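Before replaying a trace, it can help to sanity-check each entry against the fields above. The sketch below is illustrative (not part of AIPerf) and covers only the synthetic-prompt format, where input_length is present:

```python
import json

REQUIRED = {"timestamp", "input_length", "output_length"}

def validate_entry(line: str) -> dict:
    """Parse one JSONL trace line and check the synthetic-prompt Mooncake fields."""
    entry = json.loads(line)
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if entry["timestamp"] < 0:
        raise ValueError("timestamp must be non-negative milliseconds")
    return entry

entry = validate_entry(
    '{"timestamp": 0, "input_length": 655, "output_length": 52, "hash_ids": [0, 1, 2]}'
)
```

Running a check like this over a file before a long benchmark avoids discovering a malformed line halfway through a replay.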

Profile using a Custom Trace File

Create a trace file with timing information:

$ cat > custom_trace.jsonl << 'EOF'
> {"timestamp": 0, "input_length": 1200, "output_length": 52, "hash_ids": [0, 1, 2]}
> {"timestamp": 105, "input_length": 1800, "output_length": 26, "hash_ids": [0, 3, 4, 5]}
> {"timestamp": 274, "input_length": 1300, "output_length": 52, "hash_ids": [1, 4, 6]}
> EOF
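For larger experiments, hand-writing entries does not scale; a trace can be generated programmatically instead. A minimal sketch (the helper name, length ranges, and exponential inter-arrival gaps are all illustrative assumptions, not part of the Mooncake format):

```python
import json
import random

def make_trace(path: str, n: int = 100, mean_gap_ms: float = 100.0, seed: int = 0) -> list:
    """Write a synthetic Mooncake-format trace with exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t = 0.0
    entries = []
    with open(path, "w") as f:
        for i in range(n):
            entry = {
                "timestamp": int(t),                     # milliseconds since trace start
                "input_length": rng.randint(500, 2000),  # synthetic prompt size
                "output_length": rng.randint(20, 100),
                "hash_ids": [i],
            }
            f.write(json.dumps(entry) + "\n")
            entries.append(entry)
            t += rng.expovariate(1.0 / mean_gap_ms)      # Poisson-like arrivals
    return entries

entries = make_trace("generated_trace.jsonl", n=10)
```

Fixing the seed keeps the generated trace reproducible across runs, which matters when comparing benchmark results.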

Run AIPerf with the trace file:

$ aiperf profile \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --input-file custom_trace.jsonl \
> --custom-dataset-type mooncake_trace \
> --fixed-schedule

The --fixed-schedule flag tells AIPerf to send requests at the exact timestamps specified in the trace. This reproduces the original timing pattern.
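Conceptually, fixed-schedule dispatch sleeps until each entry's recorded offset from the replay start. The sketch below illustrates the idea; it is not AIPerf's actual implementation:

```python
import time

def replay(entries, send):
    """Dispatch each entry at its recorded millisecond offset from replay start."""
    start = time.monotonic()
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        # Sleep until this entry's timestamp relative to when replay began.
        delay = start + entry["timestamp"] / 1000.0 - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        send(entry)  # in a real tool this would issue the HTTP request

sent = []
replay([{"timestamp": 50}, {"timestamp": 0}], sent.append)
```

Note that if a request falls behind schedule (delay is negative), it is sent immediately rather than shifted, so delays do not compound across the trace.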

Using Pre-formatted Messages

Instead of synthetic prompts generated from input_length and hash_ids, you can provide an OpenAI-compatible messages array directly per trace entry. This is useful for replaying captured conversations (e.g., coding agent sessions) with exact prompt content.

Each entry’s messages field contains the full conversation history up to that point. In multi-turn sessions, later entries include prior turns so the server receives the complete context:

{"session_id": "sess-1", "messages": [{"role": "user", "content": "Hello"}], "output_length": 50, "timestamp": 0}
{"session_id": "sess-1", "messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}, {"role": "user", "content": "How are you?"}], "output_length": 30, "timestamp": 2000}

The messages field is mutually exclusive with input_length and text_input. When set, the messages array is sent directly to the API payload, bypassing prompt synthesis entirely. The model’s actual response is not carried forward between turns — each turn uses its pre-defined messages.
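Because each entry must carry a snapshot of the conversation up to that point, multi-turn entries are naturally built by accumulating history. A minimal sketch (the helper name, fixed output_length, and turn gap are illustrative):

```python
def session_entries(session_id, turns, gap_ms=2000):
    """Expand (user_text, assistant_text) turn pairs into Mooncake entries whose
    messages field carries the full conversation history up to each user turn."""
    history = []
    entries = []
    for i, (user_text, assistant_text) in enumerate(turns):
        history.append({"role": "user", "content": user_text})
        entries.append({
            "session_id": session_id,
            "messages": list(history),  # snapshot: later appends must not mutate it
            "output_length": 50,
            "timestamp": i * gap_ms,
        })
        # The pre-recorded assistant reply becomes context for the next turn.
        history.append({"role": "assistant", "content": assistant_text})
    return entries

entries = session_entries("sess-1", [("Hello", "Hi!"), ("How are you?", "Great!")])
```

The `list(history)` copy is the important detail: without it, every entry would end up referencing the same fully-grown history list.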

Tool Definitions

When replaying conversations that involve tool use (function calling), include the tools field alongside messages to provide the tool definitions the model needs:

{"messages": [{"role": "user", "content": "What's the weather?"}], "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}}], "output_length": 50, "timestamp": 0}

The tools field is only valid when messages is provided. It is injected directly into the API payload as the tools parameter.
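Tool-use entries are verbose to write by hand; a small helper can assemble them. A sketch under the same OpenAI function-calling schema shown above (the helper and its defaults are illustrative):

```python
def tool_entry(prompt, tool_name, params_schema, timestamp=0):
    """Build a trace entry pairing a user message with one OpenAI-style tool definition."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": tool_name,
                "description": f"Call {tool_name}",
                "parameters": params_schema,  # JSON Schema for the tool's arguments
            },
        }],
        "output_length": 50,
        "timestamp": timestamp,
    }

entry = tool_entry(
    "What's the weather?",
    "get_weather",
    {"type": "object", "properties": {"location": {"type": "string"}}},
)
```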

Profile using a Real Mooncake Trace

For real-world benchmarking, use the FAST'25 production trace published with the Mooncake research paper:

$ # Download the Mooncake trace data
$ curl -Lo mooncake_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl
$
$ # Create a subset for quick testing
$ head -n 10 mooncake_trace.jsonl > mooncake_trace_short.jsonl
$
$ # Run the trace replay
$ aiperf profile \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --input-file mooncake_trace_short.jsonl \
> --custom-dataset-type mooncake_trace \
> --fixed-schedule
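Before replaying the full trace, a quick summary can guide how large a subset to use. A minimal sketch (the helper name is illustrative; fields follow the format described earlier):

```python
import json

def trace_stats(path):
    """Summarize a Mooncake trace: request count, time span, and token totals."""
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    timestamps = [e["timestamp"] for e in entries]
    return {
        "requests": len(entries),
        "duration_ms": max(timestamps) - min(timestamps),
        "total_input_tokens": sum(e.get("input_length", 0) for e in entries),
        "total_output_tokens": sum(e.get("output_length", 0) for e in entries),
    }

# Write a tiny sample so the function can be demonstrated standalone.
with open("sample_trace.jsonl", "w") as f:
    f.write('{"timestamp": 0, "input_length": 655, "output_length": 52}\n')
    f.write('{"timestamp": 105, "input_length": 1800, "output_length": 26}\n')

stats = trace_stats("sample_trace.jsonl")
```

Knowing the total token volume and time span of a subset helps estimate how long the replay will run and whether the server can sustain the implied request rate.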