This tutorial covers replaying production traces using the Mooncake trace format. Trace replay benchmarking reproduces real-world traffic patterns with precise timing control, enabling performance validation and capacity planning under realistic load.
Use this approach when you need to:
For other use cases:
Launch a vLLM server with a chat model:
Verify the server is ready:
Mooncake provides a specification and sample datasets for trace replay that can be replayed for performance benchmarking.
Mooncake traces use a JSONL file where each line represents a request with timing information.
Required fields for trace replay:
timestamp: Request arrival time in millisecondsinput_length: Number of input tokensoutput_length: Number of output tokenshash_ids: List of block hashes (optional)tools: List of OpenAI-compatible tool definitions (optional, requires messages)extra: Dict of vendor extras (optional). Shallow-merged into the top of the request body at dispatch; user-supplied keys win over --extra-inputs.Example entry:
Create a trace file with timing information:
Run AIPerf with the trace file:
The --fixed-schedule flag tells AIPerf to send requests at the exact timestamps specified in the trace. This reproduces the original timing pattern.
When you supply a trace dataset (--custom-dataset-type mooncake_trace, bailian_trace, burst_gpt_trace, …) and the file’s first record carries a timestamp field, AIPerf automatically switches the profiling phase to fixed-schedule mode and fills --request-count from the number of trace entries. You can pass --fixed-schedule explicitly for clarity, but it’s no longer required.
To override the auto-promotion — for example, to replay the same trace under a fresh --concurrency or --request-rate setting and ignore the captured timestamps — pass --no-fixed-schedule:
AIPerf refuses parameter sweeps (e.g. --concurrency 1,2,4) against an auto-promoted trace; either pin a single value or pass --no-fixed-schedule to keep your sweep semantics.
Instead of synthetic prompts generated from input_length and hash_ids, you can provide an OpenAI-compatible messages array directly per trace entry. This is useful for replaying captured conversations (e.g., coding agent sessions) with exact prompt content.
Each entry’s messages field contains the full conversation history up to that point. In multi-turn sessions, later entries include prior turns so the server receives the complete context:
The messages field is mutually exclusive with input_length and text_input. When set, the messages array is sent directly to the API payload, bypassing prompt synthesis entirely. The model’s actual response is not carried forward between turns — each turn uses its pre-defined messages.
When replaying conversations that involve tool use (function calling), include the tools field alongside messages to provide the tool definitions the model needs:
The tools field is only valid when messages is provided. It is injected directly into the API payload as the tools parameter.
Use the extra field to inject arbitrary key-value pairs into the HTTP payload for individual trace entries. This works alongside (and after) the global --extra-inputs flag, so per-entry values override global defaults for the same top-level key.
Merge semantics: Merging is shallow — a per-entry {"nvext": {...}} replaces the entire global nvext key. Deep merge is not performed.
For real-world benchmarking, use the FAST25 production trace data from the Mooncake research paper: