Prefix Data Synthesis Tutorial

Learn how to analyze existing traces and generate synthetic traces with controlled prefix-sharing patterns for KV cache benchmarking.

Overview

The prefix synthesis feature enables you to:

  • Analyze existing traces to understand prefix patterns and cache characteristics
  • Synthesize new traces that preserve structural properties while allowing controlled scaling
  • Benchmark with realistic prefix-sharing patterns from production traces

Prerequisites

  • AIPerf installed and configured
  • A mooncake-format trace file (JSONL format)
  • Basic understanding of prefix caching and KV cache mechanics

What is Prefix Synthesis?

In Large Language Models, prefix caching allows reusing previously computed KV cache entries when the same text prefix appears in multiple requests. The prefix synthesis feature helps you:

  1. Understand prefix-sharing patterns in your workload
  2. Generate synthetic traces that maintain these patterns
  3. Scale traces (more requests, longer contexts, etc.) while preserving statistical properties

Step 1: Analyze Your Traces

Analyze an existing trace file to extract statistics:

$aiperf analyze-trace traces/production.jsonl \
> --block-size 512 \
> --output-file analysis.json

Output example:

Trace Analysis Report
============================================================
Total requests: 10,000
Unique prefixes: 2,543
Prefix groups: 156
Cache hit rate: 68.5%
Prefix reuse ratio: 45.2%
ISL (Input Sequence Length):
  Min:     512
  P25:     1,024
  Median:  1,920
  P75:     2,816
  Max:     4,096
  Mean:    1,920.3
  Std Dev: 856.2
OSL (Output Sequence Length):
  Min:     64
  P25:     96
  Median:  144
  P75:     208
  Max:     512
  Mean:    156.8
  Std Dev: 72.4
============================================================

Understanding the Statistics

Summary metrics:

  • Total requests: Number of individual requests in the trace
  • Unique prefixes: How many distinct prefix patterns were observed
  • Prefix groups: Number of distinct shared first blocks (first blocks appearing in 2+ sequences)
  • Cache hit rate: Percentage of tokens that could be reused (assuming infinite cache)
  • Prefix reuse ratio: Fraction of prefixes that appear more than once
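
As a sketch of how a hit rate like this can be derived: treat each entry in hash_ids as one fixed-size prefix block, and count a block as a hit if any earlier request already produced it (the "infinite cache" assumption above). This is an illustrative reconstruction, not AIPerf's exact algorithm:

```python
def cache_hit_rate(requests, block_size):
    """Estimate the fraction of input tokens reusable from an infinite cache.

    Each request is a dict with `hash_ids` (prefix blocks) and `input_length`.
    A block counts as a hit if an earlier request already produced it.
    Illustrative sketch only; AIPerf's real accounting may differ.
    """
    seen = set()
    hit_tokens = 0
    total_tokens = 0
    for req in requests:
        total_tokens += req["input_length"]
        for block in req.get("hash_ids", []):
            if block in seen:
                hit_tokens += block_size  # reusable from cache
            else:
                seen.add(block)  # first occurrence: a cache miss
    return hit_tokens / total_tokens if total_tokens else 0.0
```

With two identical requests, the second request's blocks are all hits, so half of all input tokens are reusable.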

Percentile statistics (computed for ISL, OSL, context length, unique prompt length, and hit rate):

  • min: Minimum value
  • p25: 25th percentile (Q1)
  • median: 50th percentile (P50)
  • p75: 75th percentile (Q3)
  • max: Maximum value
  • mean: Arithmetic mean
  • std_dev: Standard deviation (population)

Percentiles are calculated using linear interpolation: for percentile p with n sorted values, compute index k = (n - 1) * p, then interpolate between values[floor(k)] and values[ceil(k)].
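
The interpolation rule above can be written out directly (here p is a fraction, e.g. 0.25 for p25):

```python
import math

def percentile(values, p):
    """Linear-interpolation percentile as described above:
    k = (n - 1) * p, then interpolate between the two surrounding
    sorted values."""
    s = sorted(values)
    k = (len(s) - 1) * p
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:  # k landed exactly on a sample
        return float(s[lo])
    return s[lo] + (s[hi] - s[lo]) * (k - lo)
```

For example, with values [1, 2, 3, 4] and p = 0.25, k = 0.75, so the result interpolates three quarters of the way from 1 to 2.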

These metrics help you understand how much prefix caching could benefit your workload.

Step 2: Run Benchmarks with Synthesis Parameters

Synthesis happens automatically when you run aiperf profile with mooncake traces and synthesis parameters. The trace is transformed in-memory before benchmarking:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat \
> --synthesis-speedup-ratio 1.0 \
> --synthesis-prefix-len-multiplier 1.0 \
> --synthesis-prefix-root-multiplier 1 \
> --synthesis-prompt-len-multiplier 1.0

This runs a benchmark using the original trace characteristics. Adjust the multipliers to scale different aspects.

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Loaded mooncake trace with 10,000 requests
INFO Synthesis parameters: speedup=1.0, prefix_len=1.0, root=1, prompt_len=1.0
INFO AIPerf System is PROFILING
Profiling: 10000/10000 |████████████████████████| 100% [15:34<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency1/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric                         ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms)           │ 1234.56 │ 456.78 │ 3456.89 │ 3123.45 │ 1189.23 │
│ Time to First Token (ms)       │  156.78 │  89.34 │  345.67 │  321.45 │  145.23 │
│ Inter Token Latency (ms)       │   13.45 │  10.23 │   18.90 │   18.34 │   13.12 │
│ Input Sequence Length (tokens) │ 1920.30 │ 512.00 │ 4096.00 │ 3876.54 │ 1920.00 │
│ Output Token Count (tokens)    │  156.80 │  64.00 │  512.00 │  487.23 │  144.00 │
│ Request Throughput (req/s)     │   10.78 │      - │       - │       - │       - │
└────────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency1/profile_export_aiperf.json

Understanding Synthesis Parameters

--synthesis-speedup-ratio (default: 1.0)

Scale timestamps to simulate faster or slower request rates:

  • 1.0: No change, request times identical
  • 2.0: 2x faster (timestamps halved)
  • 0.5: 2x slower (timestamps doubled)
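
The effect on the timeline is a simple division, which can be sketched as (illustrative only, not AIPerf's internal code):

```python
def scale_timestamps(timestamps_ms, speedup_ratio):
    """Divide each timestamp by the speedup ratio: 2.0 halves every
    inter-arrival gap (2x faster), 0.5 doubles them (2x slower)."""
    return [t / speedup_ratio for t in timestamps_ms]
```

Requests originally sent at 0 ms, 1000 ms, and 2000 ms are replayed at 0 ms, 500 ms, and 1000 ms with a ratio of 2.0.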

Example: Simulate 2x more concurrent load:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-speedup-ratio 2.0 \
> ...

--synthesis-prefix-len-multiplier (default: 1.0)

Scale the length of core prefix paths (shared prefixes):

  • 1.0: No change
  • 1.5: Extend shared prefixes by 50%
  • 0.5: Reduce shared prefixes by 50%

Example: Simulate longer context windows:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-len-multiplier 1.5 \
> ...

--synthesis-prefix-root-multiplier (default: 1)

Distribute traces across N independent radix trees:

  • 1: All traces share the same prefix tree (default)
  • 2: Traces randomly assigned to 2 independent trees (50% each)
  • 3: Traces randomly assigned to 3 independent trees (33% each)

Each tree has identical structure but different hash IDs, so traces in different trees cannot share prefixes. This reduces the effective cache hit rate by splitting the workload.
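
One way to picture the tree assignment is remapping hash IDs by a per-tree offset, so blocks in different trees can never collide. The `id_stride` offset scheme below is hypothetical, chosen only to illustrate the idea; AIPerf's real ID scheme may differ:

```python
import random

def assign_prefix_trees(requests, root_multiplier, id_stride=1_000_000, seed=0):
    """Randomly assign each request to one of N independent prefix trees
    by offsetting its hash IDs. `id_stride` is a hypothetical offset large
    enough that trees never share a block ID."""
    rng = random.Random(seed)
    out = []
    for req in requests:
        tree = rng.randrange(root_multiplier)  # uniform tree assignment
        remapped = [h + tree * id_stride for h in req.get("hash_ids", [])]
        out.append({**req, "hash_ids": remapped})
    return out
```

With root_multiplier=1 nothing changes; with larger values, requests in different trees stop sharing prefixes even when their original hash IDs matched.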

Example: Simulate lower cache hit rates with more diverse prefix roots:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-root-multiplier 3 \
> ...

--synthesis-prompt-len-multiplier (default: 1.0)

Scale the length of unique prompts (non-shared portions):

  • 1.0: No change
  • 2.0: Double unique prompt lengths
  • 0.5: Halve unique prompt lengths

Example: Simulate shorter user prompts:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prompt-len-multiplier 0.7 \
> ...

--synthesis-max-isl (optional)

Filter traces by maximum input sequence length. Traces with input_length > max_isl are skipped:

  • Not set: No filtering
  • 4096: Skip traces with more than 4,096 input tokens

Example: Filter out long contexts:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-max-isl 4096 \
> ...

--synthesis-max-osl (optional)

Cap traces to a maximum output sequence length. Traces with output_length > max_osl are capped to max_osl:

  • Not set: No capping
  • 2048: Cap output_length to 2,048 tokens

Example: Cap output lengths to 2,048 tokens:

$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-max-osl 2048 \
> ...

Advanced Examples

Scenario 1: Simulate High Cache Hit Rate

Analyze original traces to understand their cache characteristics, then benchmark with boosted prefix reuse:

$# Analyze original
$aiperf analyze-trace prod.jsonl --output-file analysis.json
$
$# Benchmark with more prefix reuse: longer shared prefixes,
$# shorter unique portions
$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-len-multiplier 1.5 \
> --synthesis-prompt-len-multiplier 0.8 \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat

Scenario 2: Load Testing with Scaled Timeline

Compress timestamps to simulate 10x faster request rate:

$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-speedup-ratio 10.0 \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat

Scenario 3: Stress Testing with Extended Context

Benchmark with longer contexts while maintaining prefix patterns:

$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-len-multiplier 2.0 \
> --synthesis-max-isl 8192 \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat

Scenario 4: Controlled Multi-Turn Simulation

Benchmark with more diverse prefix patterns for multi-turn scenarios:

$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-root-multiplier 10 \
> --synthesis-prompt-len-multiplier 1.2 \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat

Understanding Trace Format

The mooncake trace format is JSONL (JSON Lines), where each line is a JSON object representing one request:

{
  "input_length": 512,
  "output_length": 128,
  "timestamp": 0,
  "hash_ids": [1, 2, 3],
  "session_id": "session-1",
  "delay": 0
}

Required fields:

  • input_length: Number of input tokens

Optional fields:

  • output_length: Expected output tokens
  • timestamp: Absolute timestamp in milliseconds (for fixed schedules)
  • hash_ids: List of hash IDs representing prefix blocks
  • session_id: Conversation/session identifier for multi-turn
  • delay: Milliseconds to wait before sending (for multi-turn)
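
A minimal loader for this format can enforce the required field and skip blank lines. This is an illustrative sketch, not AIPerf's actual trace loader:

```python
import json

def load_mooncake_trace(path):
    """Parse a mooncake JSONL trace: one JSON object per non-empty line,
    with input_length required and all other fields optional."""
    requests = []
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "input_length" not in record:
                raise ValueError(f"line {line_no}: missing required input_length")
            requests.append(record)
    return requests
```
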

Tips and Best Practices

1. Analyze Before Benchmarking

Always run analyze-trace first to understand your data:

$aiperf analyze-trace your_trace.jsonl --output-file analysis.json

2. Start with Small Changes

Test parameters incrementally rather than changing everything at once:

$# Test prefix scaling alone
$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-len-multiplier 1.2 \
> --model Qwen/Qwen3-0.6B --endpoint-type chat
$
$# Test speedup alone
$aiperf profile \
> --input-file traces/production.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-speedup-ratio 2.0 \
> --model Qwen/Qwen3-0.6B --endpoint-type chat

3. Compare Multiple Parameter Sets

Run benchmarks with different synthesis parameters to compare:

$# Conservative: mild prefix scaling, split across 2 trees
$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-len-multiplier 1.1 \
> --synthesis-prefix-root-multiplier 2 \
> --model Qwen/Qwen3-0.6B --endpoint-type chat
$
$# Aggressive: strong prefix scaling, split across 5 trees
$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-prefix-len-multiplier 2.0 \
> --synthesis-prefix-root-multiplier 5 \
> --model Qwen/Qwen3-0.6B --endpoint-type chat
$
$# Load test: faster request rate
$aiperf profile \
> --input-file prod.jsonl \
> --custom-dataset-type mooncake_trace \
> --synthesis-speedup-ratio 5.0 \
> --model Qwen/Qwen3-0.6B --endpoint-type chat

4. Preserve Real Patterns

The synthesis preserves statistical properties. For best results:

  • Use realistic input traces from production
  • Avoid extreme multiplier values (typically 0.5-3.0)
  • Compare results against baseline (no synthesis parameters)

Troubleshooting

Issue: “Input file not found”

Error: Input file not found: traces/production.jsonl

Solution: Verify the file path is correct:

$ls -la traces/production.jsonl

Issue: “No unique prefixes found”

Total requests: 1000
Unique prefixes: 0

Solution: Your trace file doesn’t have hash_ids. Synthesis will still work with input_length and output_length fields, but prefix caching information won’t be available.

Issue: Low cache hit rate

Cache hit rate: 5.2%

Solution: Your workload has low prefix reuse. Try:

  • Increasing --synthesis-prefix-len-multiplier to extend shared prefixes
  • Decreasing --synthesis-prompt-len-multiplier to shrink the unique, non-shared portions
  • Analyzing a different trace file that has more reuse