# Trace Replay with Mooncake Traces
This tutorial covers replaying production traces using the Mooncake trace format. Trace replay benchmarking reproduces real-world traffic patterns with precise timing control, enabling performance validation and capacity planning under realistic load.
## When to Use This Tutorial
Use this approach when you need to:
- Replay production traffic patterns captured from real systems
- Validate performance with industry-standard Mooncake FAST’25 traces
- Test system behavior under specific temporal load patterns
- Reproduce benchmark results for regression testing
For other use cases:
- Custom prompts without timing: See Custom Prompt Benchmarking
- Precise timestamp control for any dataset: See Fixed Schedule
- Multi-turn conversations from files: See Multi-Turn Conversations
## Start a vLLM Server
Launch a vLLM server with a chat model:
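For example (the model name is illustrative; substitute any chat-capable model you have access to):

```shell
# Serves an OpenAI-compatible API on port 8000
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
```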
Verify the server is ready:
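Once the server is up, the OpenAI-compatible models endpoint should respond:

```shell
# Returns the list of served models as JSON
curl http://localhost:8000/v1/models
```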
## Mooncake Trace Format
Mooncake provides a trace specification and sample datasets that can be replayed for performance benchmarking.
Mooncake traces use a JSONL file where each line represents a request with timing information.
Required fields for trace replay:
- `timestamp`: Request arrival time in milliseconds
- `input_length`: Number of input tokens
- `output_length`: Number of output tokens
- `hash_ids`: List of block hashes (optional)
- `tools`: List of OpenAI-compatible tool definitions (optional, requires `messages`)
Example entry:
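A minimal entry might look like the following (token counts and hash IDs are illustrative):

```json
{"timestamp": 0, "input_length": 655, "output_length": 52, "hash_ids": [46, 47]}
```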
## Profile Using a Custom Trace File
Create a trace file with timing information:
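For instance, a three-request trace spanning 2.5 seconds (all values are illustrative):

```shell
cat > trace.jsonl << 'EOF'
{"timestamp": 0, "input_length": 300, "output_length": 128}
{"timestamp": 1000, "input_length": 500, "output_length": 256}
{"timestamp": 2500, "input_length": 400, "output_length": 64}
EOF
```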
Run AIPerf with the trace file:
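A sketch of the invocation; apart from `--fixed-schedule`, the exact flag names here are assumptions, so check `aiperf --help` for your installed version:

```shell
aiperf profile \
  --model Qwen/Qwen2.5-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --input-file trace.jsonl \
  --fixed-schedule
```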
The `--fixed-schedule` flag tells AIPerf to send requests at the exact timestamps specified in the trace. This reproduces the original timing pattern.
## Using Pre-formatted Messages
Instead of synthetic prompts generated from `input_length` and `hash_ids`, you can provide an OpenAI-compatible `messages` array directly per trace entry. This is useful for replaying captured conversations (e.g., coding agent sessions) with exact prompt content.
Each entry’s `messages` field contains the full conversation history up to that point. In multi-turn sessions, later entries include prior turns so the server receives the complete context:
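For example (contents illustrative), a two-turn coding session could be captured as two entries, the second carrying the full history:

```jsonl
{"timestamp": 0, "output_length": 120, "messages": [{"role": "user", "content": "Refactor utils.py to use pathlib."}]}
{"timestamp": 8000, "output_length": 80, "messages": [{"role": "user", "content": "Refactor utils.py to use pathlib."}, {"role": "assistant", "content": "Here is the refactored file..."}, {"role": "user", "content": "Now add type hints."}]}
```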
The `messages` field is mutually exclusive with `input_length` and `text_input`. When set, the `messages` array is sent directly to the API payload, bypassing prompt synthesis entirely. The model’s actual response is not carried forward between turns; each turn uses its pre-defined messages.
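The field rules above can be sketched as a small validator. This is a hypothetical helper for checking your own trace files, not part of AIPerf:

```python
REQUIRED = {"timestamp", "output_length"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of rule violations for one trace entry (empty if valid)."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if "messages" in entry:
        # messages is mutually exclusive with the synthetic-prompt fields
        for field in ("input_length", "text_input"):
            if field in entry:
                errors.append(f"'messages' conflicts with '{field}'")
    elif "input_length" not in entry:
        errors.append("need either 'messages' or 'input_length'")
    if "tools" in entry and "messages" not in entry:
        errors.append("'tools' requires 'messages'")
    return errors
```

Running it over each parsed JSONL line before a benchmark run catches malformed entries early, rather than mid-replay.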
## Tool Definitions
When replaying conversations that involve tool use (function calling), include the tools field alongside messages to provide the tool definitions the model needs:
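For example (tool name and schema are illustrative), an entry with a single function definition — shown pretty-printed here, though each entry occupies a single line in the JSONL file:

```json
{
  "timestamp": 0,
  "output_length": 64,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
```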
The `tools` field is only valid when `messages` is provided. It is injected directly into the API payload as the `tools` parameter.
## Profile Using a Real Mooncake Trace
For real-world benchmarking, use the FAST’25 production trace data from the Mooncake research paper:
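A sketch of downloading and replaying the published trace; the exact file path within the Mooncake repository is an assumption, so verify it on the kvcache-ai/Mooncake GitHub page before running:

```shell
# Download the published Mooncake trace (path assumed; check the repo)
curl -LO https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/mooncake_trace.jsonl

# Replay it with the same fixed-schedule invocation used for custom traces
aiperf profile \
  --model Qwen/Qwen2.5-7B-Instruct \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --input-file mooncake_trace.jsonl \
  --fixed-schedule
```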