Prefix Data Synthesis Tutorial
Learn how to analyze and generate synthetic traces with controlled prefix-sharing patterns for KV cache benchmarking.
Overview
The prefix synthesis feature enables you to:
- Analyze existing traces to understand prefix patterns and cache characteristics
- Synthesize new traces that preserve structural properties while allowing controlled scaling
- Benchmark with realistic prefix-sharing patterns from production traces
Prerequisites
- AIPerf installed and configured
- A mooncake-format trace file (JSONL format)
- Basic understanding of prefix caching and KV cache mechanics
What is Prefix Synthesis?
In Large Language Models, prefix caching allows reusing previously computed KV cache entries when the same text prefix appears in multiple requests. The prefix synthesis feature helps you:
- Understand prefix-sharing patterns in your workload
- Generate synthetic traces that maintain these patterns
- Scale traces (more requests, longer contexts, etc.) while preserving statistical properties
Step 1: Analyze Your Traces
Analyze an existing trace file to extract statistics:
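A sketch of the command (the analyze-trace subcommand name comes from this guide; the input-file flag shown here is a placeholder, so check aiperf analyze-trace --help for your version):

```shell
# Hypothetical invocation; the flag for the trace file may differ.
aiperf analyze-trace --input-file trace.jsonl
```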
Understanding the Statistics
Summary metrics:
- Total requests: Number of individual requests in the trace
- Unique prefixes: How many distinct prefix patterns were observed
- Prefix groups: Number of distinct shared first blocks (first blocks appearing in 2+ sequences)
- Cache hit rate: Percentage of tokens that could be reused (assuming infinite cache)
- Prefix reuse ratio: The fraction of prefixes that appear more than once
Percentile statistics (computed for ISL, OSL, context length, unique prompt length, and hit rate):
Percentiles are calculated using linear interpolation: for percentile p with n sorted values, compute index k = (n - 1) * p, then interpolate between values[floor(k)] and values[ceil(k)].
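As a quick sanity check of that formula, here is a small sketch using sort and awk (the values are illustrative, not from a real trace):

```shell
# p90 of [10, 20, 30, 40, 50]: k = (5-1)*0.9 = 3.6, so interpolate
# between the 4th (40) and 5th (50) sorted values.
printf '%s\n' 10 20 30 40 50 | sort -n | awk -v p=0.9 '
  { v[NR] = $1 }
  END {
    k = (NR - 1) * p
    lo = int(k); hi = (lo == k) ? lo : lo + 1
    print v[lo + 1] * (1 - (k - lo)) + v[hi + 1] * (k - lo)
  }'
```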
These metrics help you understand how much prefix caching could benefit your workload.
Step 2: Run Benchmarks with Synthesis Parameters
Synthesis happens automatically when you run aiperf profile with mooncake traces and synthesis parameters. The trace is transformed in-memory before benchmarking:
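A sketch of such a run with every multiplier at its default (the --model, --url, and --input-file values are placeholders for your environment; the synthesis flags are the ones described below):

```shell
# Placeholder model/endpoint/trace flags; synthesis flags at defaults.
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-speedup-ratio 1.0 \
  --synthesis-prefix-len-multiplier 1.0 \
  --synthesis-prompt-len-multiplier 1.0
```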
This runs a benchmark using the original trace characteristics. Adjust the multipliers to scale different aspects.
Understanding Synthesis Parameters
--synthesis-speedup-ratio (default: 1.0)
Scale timestamps to simulate faster or slower request rates:
- 1.0: No change, request times identical
- 2.0: 2x faster (timestamps halved)
- 0.5: 2x slower (timestamps doubled)
Example: Simulate 2x more concurrent load:
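One possible invocation (model, endpoint, and trace-input flags are placeholders for your setup):

```shell
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-speedup-ratio 2.0   # timestamps halved -> 2x request rate
```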
--synthesis-prefix-len-multiplier (default: 1.0)
Scale the length of core prefix paths (shared prefixes):
- 1.0: No change
- 1.5: Extend shared prefixes by 50%
- 0.5: Reduce shared prefixes by 50%
Example: Simulate longer context windows:
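For instance (non-synthesis flags are placeholders):

```shell
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-prefix-len-multiplier 1.5   # shared prefixes 50% longer
```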
--synthesis-prefix-root-multiplier (default: 1)
Distribute traces across N independent radix trees:
- 1: All traces share the same prefix tree (default)
- 2: Traces randomly assigned to 2 independent trees (50% each)
- 3: Traces randomly assigned to 3 independent trees (33% each)
Each tree has identical structure but different hash IDs, so traces in different trees cannot share prefixes. This reduces the effective cache hit rate by splitting the workload.
Example: Simulate lower cache hit rates with more diverse prefix roots:
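A hedged sketch, splitting the workload across three trees (placeholder model/endpoint/trace flags):

```shell
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-prefix-root-multiplier 3   # 3 independent prefix trees
```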
--synthesis-prompt-len-multiplier (default: 1.0)
Scale the length of unique prompts (non-shared portions):
- 1.0: No change
- 2.0: Double unique prompt lengths
- 0.5: Halve unique prompt lengths
Example: Simulate shorter user prompts:
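For example (placeholder model/endpoint/trace flags):

```shell
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-prompt-len-multiplier 0.5   # unique prompt portions halved
```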
--synthesis-max-isl (optional)
Filter traces by maximum input sequence length. Traces with input_length > max_isl are skipped:
- Not set: No filtering
- 4096: Skip traces with more than 4,096 input tokens
Example: Filter out long contexts:
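A sketch (placeholder model/endpoint/trace flags):

```shell
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-max-isl 4096   # drop requests with input_length > 4096
```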
--synthesis-max-osl (optional)
Cap traces to a maximum output sequence length. Traces with output_length > max_osl are capped to max_osl:
- Not set: No capping
- 2048: Cap output_length to 2,048 tokens
Example: Cap output lengths to 2,048 tokens:
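For example (placeholder model/endpoint/trace flags):

```shell
aiperf profile \
  --model my-model \
  --url http://localhost:8000 \
  --input-file trace.jsonl \
  --synthesis-max-osl 2048   # output_length above 2048 is capped
```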
Advanced Examples
Scenario 1: Simulate High Cache Hit Rate
Analyze original traces to understand their cache characteristics, then benchmark with boosted prefix reuse:
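One way this pair of steps might look (input-file and model/endpoint flags are placeholders; the 2.0 multiplier is illustrative):

```shell
# Step 1: inspect baseline cache characteristics
aiperf analyze-trace --input-file trace.jsonl

# Step 2: benchmark with shared prefixes doubled in length
aiperf profile \
  --model my-model --url http://localhost:8000 --input-file trace.jsonl \
  --synthesis-prefix-len-multiplier 2.0
```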
Scenario 2: Load Testing with Scaled Timeline
Compress timestamps to simulate 10x faster request rate:
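A hedged sketch (placeholder model/endpoint/trace flags):

```shell
aiperf profile \
  --model my-model --url http://localhost:8000 --input-file trace.jsonl \
  --synthesis-speedup-ratio 10.0   # timestamps divided by 10
```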
Scenario 3: Stress Testing with Extended Context
Benchmark with longer contexts while maintaining prefix patterns:
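For instance, stretching both the shared and unique portions (placeholder model/endpoint/trace flags; multiplier values are illustrative):

```shell
aiperf profile \
  --model my-model --url http://localhost:8000 --input-file trace.jsonl \
  --synthesis-prefix-len-multiplier 2.0 \
  --synthesis-prompt-len-multiplier 1.5
```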
Scenario 4: Controlled Multi-Turn Simulation
Benchmark with more diverse prefix patterns for multi-turn scenarios:
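One possible invocation (placeholder model/endpoint/trace flags):

```shell
aiperf profile \
  --model my-model --url http://localhost:8000 --input-file trace.jsonl \
  --synthesis-prefix-root-multiplier 2   # two independent prefix trees
```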
Understanding Trace Format
The mooncake trace format is JSONL (JSON Lines), where each line is a JSON object representing one request:
Required fields:
- input_length: Number of input tokens
Optional fields:
- output_length: Expected output tokens
- timestamp: Absolute timestamp in milliseconds (for fixed schedules)
- hash_ids: List of hash IDs representing prefix blocks
- session_id: Conversation/session identifier for multi-turn
- delay: Milliseconds to wait before sending (for multi-turn)
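Putting those fields together, a minimal two-line trace might look like this (all values are made up; note that both requests share hash blocks 0 and 1, i.e. a common prefix):

```shell
cat > sample_trace.jsonl <<'EOF'
{"input_length": 512, "output_length": 128, "timestamp": 0, "hash_ids": [0, 1], "session_id": "sess-1"}
{"input_length": 768, "output_length": 64, "timestamp": 250, "hash_ids": [0, 1, 2], "session_id": "sess-1"}
EOF
# Sanity-check that every line parses as JSON:
python3 -c "import json; [json.loads(l) for l in open('sample_trace.jsonl')]" && echo "valid JSONL"
```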
Tips and Best Practices
1. Analyze Before Benchmarking
Always run analyze-trace first to understand your data before applying synthesis parameters.
2. Start with Small Changes
Test parameters incrementally rather than changing everything at once: adjust one multiplier at a time, re-run, and observe the effect before combining flags.
3. Compare Multiple Parameter Sets
Run benchmarks with several different synthesis parameter sets and compare the resulting metrics against each other and against your baseline.
4. Preserve Real Patterns
The synthesis preserves statistical properties. For best results:
- Use realistic input traces from production
- Avoid extreme multiplier values (typically 0.5-3.0)
- Compare results against baseline (no synthesis parameters)
Troubleshooting
Issue: “Input file not found”
Solution: Verify the file path is correct:
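For example:

```shell
# Confirm the trace file exists (and is non-empty) at the path you passed:
ls -lh trace.jsonl
```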
Issue: “No unique prefixes found”
Solution: Your trace file doesn’t have hash_ids. Synthesis will still work with input_length and output_length fields, but prefix caching information won’t be available.
Issue: Low cache hit rate
Solution: Your workload has low prefix reuse. Try:
- Increasing --synthesis-prefix-len-multiplier to extend shared prefixes
- Keeping --synthesis-prefix-root-multiplier at 1, since values above 1 split traces across independent trees and lower the hit rate further
- Analyzing a different trace file that has more reuse