Trace Replay with Mooncake Traces | NVIDIA AIPerf Documentation

This tutorial covers replaying production traces using the Mooncake trace format. Trace replay benchmarking reproduces real-world traffic patterns with precise timing control, enabling performance validation and capacity planning under realistic load.

When to Use This Tutorial

Use this approach when you need to:

Replay production traffic patterns captured from real systems
Validate performance with industry-standard Mooncake FAST’25 traces
Test system behavior under specific temporal load patterns
Reproduce benchmark results for regression testing

For other use cases:

Custom prompts without timing: See Custom Prompt Benchmarking
Precise timestamp control for any dataset: See Fixed Schedule
Multi-turn conversations from files: See Multi-Turn Conversations

Start a vLLM Server

Launch a vLLM server with a chat model:

$ docker pull vllm/vllm-openai:latest
$ docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
>   --model Qwen/Qwen3-0.6B

Verify the server is ready:

$ curl -s localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

Mooncake Trace Format

Mooncake provides a specification and sample datasets for trace replay that can be replayed for performance benchmarking.

Mooncake traces use a JSONL file where each line represents a request with timing information.

Required fields for trace replay:

timestamp: Request arrival time in milliseconds
input_length: Number of input tokens
output_length: Number of output tokens
hash_ids: List of block hashes (optional)
tools: List of OpenAI-compatible tool definitions (optional, requires messages)
extra: Dict of vendor extras (optional). Shallow-merged into the top of the request body at dispatch; user-supplied keys win over --extra-inputs.

Example entry:

1 {"timestamp": 0, "input_length": 655, "output_length": 52, "hash_ids": [0, 1, 2]}

Profile using a Custom Trace File

Create a trace file with timing information:

$ cat > custom_trace.jsonl << 'EOF'
$ {"timestamp": 0, "input_length": 1200, "output_length": 52, "hash_ids": [0, 1, 2]}
$ {"timestamp": 105, "input_length": 1800, "output_length": 26, "hash_ids": [0, 3, 4, 5]}
$ {"timestamp": 274, "input_length": 1300, "output_length": 52, "hash_ids": [1, 4, 6]}
$ EOF

Run AIPerf with the trace file:

$ aiperf profile \
>     --model Qwen/Qwen3-0.6B \
>     --endpoint-type chat \
>     --streaming \
>     --url localhost:8000 \
>     --input-file custom_trace.jsonl \
>     --custom-dataset-type mooncake_trace \
>     --fixed-schedule

The --fixed-schedule flag tells AIPerf to send requests at the exact timestamps specified in the trace. This reproduces the original timing pattern.

Automatic Fixed-Schedule Promotion

When you supply a trace dataset (--custom-dataset-type mooncake_trace, bailian_trace, burst_gpt_trace, …) and the file’s first record carries a timestamp field, AIPerf automatically switches the profiling phase to fixed-schedule mode and fills --request-count from the number of trace entries. You can pass --fixed-schedule explicitly for clarity, but it’s no longer required.

To override the auto-promotion — for example, to replay the same trace under a fresh --concurrency or --request-rate setting and ignore the captured timestamps — pass --no-fixed-schedule:

$ aiperf profile \
>     --model Qwen/Qwen3-0.6B \
>     --endpoint-type chat \
>     --url localhost:8000 \
>     --input-file custom_trace.jsonl \
>     --custom-dataset-type mooncake_trace \
>     --no-fixed-schedule \
>     --concurrency 4 \
>     --request-count 100

AIPerf refuses parameter sweeps (e.g. --concurrency 1,2,4) against an auto-promoted trace; either pin a single value or pass --no-fixed-schedule to keep your sweep semantics.

Using Pre-formatted Messages

Instead of synthetic prompts generated from input_length and hash_ids, you can provide an OpenAI-compatible messages array directly per trace entry. This is useful for replaying captured conversations (e.g., coding agent sessions) with exact prompt content.

Each entry’s messages field contains the full conversation history up to that point. In multi-turn sessions, later entries include prior turns so the server receives the complete context:

1 {"session_id": "sess-1", "messages": [{"role": "user", "content": "Hello"}], "output_length": 50, "timestamp": 0}
2 {"session_id": "sess-1", "messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}, {"role": "user", "content": "How are you?"}], "output_length": 30, "timestamp": 2000}

The messages field is mutually exclusive with input_length and text_input. When set, the messages array is sent directly to the API payload, bypassing prompt synthesis entirely. The model’s actual response is not carried forward between turns — each turn uses its pre-defined messages.

Tool Definitions

When replaying conversations that involve tool use (function calling), include the tools field alongside messages to provide the tool definitions the model needs:

1 {"messages": [{"role": "user", "content": "What's the weather?"}], "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}}], "output_length": 50, "timestamp": 0}

The tools field is only valid when messages is provided. It is injected directly into the API payload as the tools parameter.

Per-Request Extra Inputs

Use the extra field to inject arbitrary key-value pairs into the HTTP payload for individual trace entries. This works alongside (and after) the global --extra-inputs flag, so per-entry values override global defaults for the same top-level key.

1 {"input_length": 100, "output_length": 50, "timestamp": 0, "extra": {"nvext": {"priority": 99}}}
2 {"input_length": 200, "output_length": 30, "timestamp": 500}
3 {"messages": [{"role": "user", "content": "Hello"}], "output_length": 50, "timestamp": 1000, "extra": {"routing": "fast"}}

Merge semantics: Merging is shallow — a per-entry {"nvext": {...}} replaces the entire global nvext key. Deep merge is not performed.

Profile using real Mooncake Trace

For real-world benchmarking, use the FAST25 production trace data from the Mooncake research paper:

$ # Download the Mooncake trace data
$ curl -Lo mooncake_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl
$ 
$ # Create a subset for quick testing
$ head -n 10 mooncake_trace.jsonl > mooncake_trace_short.jsonl
$ 
$ # Run the trace replay
$ aiperf profile \
>     --model Qwen/Qwen3-0.6B \
>     --endpoint-type chat \
>     --streaming \
>     --url localhost:8000 \
>     --input-file mooncake_trace_short.jsonl \
>     --custom-dataset-type mooncake_trace \
>     --fixed-schedule