KV Router A/B Testing
This guide walks you through setting up and running A/B benchmarks to compare Dynamo’s KV Smart Router against standard round-robin routing on a Kubernetes cluster.
Overview
Dynamo’s KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
- Deploy two otherwise identical Dynamo configurations:
  - A vLLM server for Qwen3-32B with 8 workers (aggregated) WITHOUT the KV Smart Router enabled
  - A vLLM server for Qwen3-32B with 8 workers (aggregated) WITH the KV Smart Router enabled
- Run controlled benchmarks using AIPerf
- Compare performance metrics to evaluate KV router effectiveness
Prerequisites
Required Tools
- `kubectl` (configured with cluster access)
- `helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
- GPU nodes (H100, H200, or similar)
- Sufficient GPU capacity (8+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace
Knowledge Requirements
- Basic Kubernetes concepts (namespaces, pods, services)
- Familiarity with LLM inference concepts
- Command-line proficiency
Architecture
This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.
Key Difference: Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.
Phase 1: Namespace and Infrastructure Setup
Step 1.1: Create Namespace
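This step is a single command; `dynamo-bench` is the namespace name used throughout this guide:

```shell
kubectl create namespace dynamo-bench
```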
Step 1.2: Create HuggingFace Token Secret (optional)
If the model you’re deploying requires a HuggingFace token to download (Llama-family models do), replace YOUR_HF_TOKEN with your actual HuggingFace token:
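For example (the secret name `hf-token-secret` is an assumption; match whatever name your deployment YAML references):

```shell
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=YOUR_HF_TOKEN \
  -n dynamo-bench
```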
Step 1.3: Install Dynamo Platform
If your cluster uses namespace-restricted Dynamo operators, you’ll need to install the Dynamo platform in the workload namespace. Follow the Dynamo Kubernetes Installation Guide to install the platform in dynamo-bench.
Key Configuration Notes:
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
- Adjust version tags to match your cluster’s available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation
Step 1.4: Verify Infrastructure
Expect operator, etcd, and nats pods Running before deploying the graph.
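A quick check (pod names vary with your Helm release name):

```shell
kubectl get pods -n dynamo-bench
# Expect the operator controller-manager, etcd, and nats pods
# all in Running status before proceeding.
```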
Phase 2: Deploy Model Serving
Step 2.1: Create Deployment YAMLs
Create router-off-deployment.yaml (baseline):
Create router-on-deployment.yaml (KV router ON):
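The full manifests are environment-specific and not reproduced here. As a sketch only — resource names, image, and field layout are assumptions modeled on public DynamoGraphDeployment examples, so verify them against your Dynamo version — `router-on-deployment.yaml` might look like:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-router-on          # use e.g. vllm-router-off in the baseline file
  namespace: dynamo-bench
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv             # omit this env in router-off-deployment.yaml
      extraPodSpec:
        mainContainer:
          image: <dynamo-vllm-runtime-image>
    VllmDecodeWorker:
      componentType: worker
      replicas: 8
      envFromSecret: hf-token-secret   # name of the HF token secret from Step 1.2 (assumption)
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: <dynamo-vllm-runtime-image>
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-32B
```

Per the Key Difference note above, the two files should differ only in the `DYN_ROUTER_MODE` setting.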
Step 2.2: Deploy Router-ON First
💡 Optimization Tip: Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with ReadWriteMany access mode to cache the model.
First, create the PVC in the same namespace as your deployment (e.g. dynamo-bench). Use a storage class that supports ReadWriteMany:
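For example (storage class and size are assumptions; Qwen3-32B weights alone are roughly 65 GB, so leave headroom):

```yaml
# pvc-model-cache.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: dynamo-bench
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: <your-rwx-storage-class>   # must support ReadWriteMany (e.g. NFS-backed)
  resources:
    requests:
      storage: 200Gi
```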
Apply it: `kubectl apply -f pvc-model-cache.yaml`
Then reference the existing PVC in your DynamoGraphDeployment by adding the following under spec (and under VllmDecodeWorker, add volumeMounts):
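A sketch of the PVC wiring, assuming the pod-template fields live under `extraPodSpec` as in public DynamoGraphDeployment examples (verify the exact schema for your Dynamo version):

```yaml
spec:
  services:
    VllmDecodeWorker:
      extraPodSpec:
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache            # must match the name in pvc-model-cache.yaml
        mainContainer:
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface   # HF default cache location
```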
With this configuration, the first run has one worker download; the rest load from cache. The main benefit is on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.
Step 2.3: Monitor Deployment Progress
Wait for all pods to reach Running status and pass readiness probes.
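Watching pod status is enough here:

```shell
kubectl get pods -n dynamo-bench -w
```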
Expected Timeline:
- With shared PVC (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
- Without shared PVC: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget 1-2 hours for full deployment (workers start in parallel but are limited by node scheduling)
The deployment’s startup probe (initialDelaySeconds: 120, periodSeconds: 30, failureThreshold: 60) allows up to 32 minutes per pod for model download and initialization.
Step 2.4: Verify Workers Are Healthy
⚠️ CRITICAL CHECKPOINT: Before running benchmarks, you MUST verify equal worker health. Unequal worker counts will invalidate your comparison results.
All 8 must show 1/1 Running and Ready. Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).
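One way to count Ready workers (the grep pattern assumes worker pod names contain the component name; confirm with `kubectl get pods` first):

```shell
# Should print 8 before you proceed.
kubectl get pods -n dynamo-bench --no-headers | grep -i decodeworker | grep -c '1/1'
```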
Phase 3: Prepare Benchmark Dataset
Understanding the Mooncake Toolagent Trace
For this A/B comparison, we use the Mooncake FAST’25 Toolagent Trace, published with Moonshot AI’s Mooncake project (USENIX FAST’25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production tool-agent workloads — AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains 23,608 requests spanning ~59 minutes of real-time traffic.
Why the toolagent trace? Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router’s real-world performance gains.
What’s in the dataset? Each trace entry contains:
- Timestamp: When the request arrived (for realistic request timing)
- Input/output lengths: Number of tokens in prompts and responses
- Block hash IDs: Cryptographic hashes representing KV cache blocks (no user text; explained below)
Sample trace entries (showing prefix reuse):
These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a 512-token block, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The KV Smart Router routes requests with matching hash IDs to the same worker, maximizing cache hits.
Key Dataset Properties:
- ✅ Realistic timing: Request arrival patterns from production tool-agent workloads
- ✅ High prefix overlap: 59% cache ratio (Mooncake FAST’25 paper); iterative tool calls within sessions produce natural prefix reuse
- ✅ Privacy-preserving: No actual text — only hash-based cache block identifiers
- ✅ Reproducible: Public dataset enables fair comparisons across different systems
Download and Prepare the Dataset
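A sketch of the download step. The trace is published in the Mooncake GitHub repository (kvcache-ai/Mooncake); substitute the raw-file URL of the FAST’25 toolagent trace for the placeholder:

```shell
# Replace <TRACE_URL> with the raw URL of the toolagent trace
# from the Mooncake repository.
curl -L -o mooncake_trace.jsonl "<TRACE_URL>"
wc -l mooncake_trace.jsonl   # the full trace contains 23,608 requests
```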
Phase 4: Set Up Benchmark Environment
Step 4.1: Deploy Benchmark Pod
Create benchmark-job.yaml:
This pod installs tmux and aiperf on startup so benchmarks can run inside a tmux session that survives kubectl exec disconnects.
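The original `benchmark-job.yaml` is not reproduced here; a minimal sketch matching the description above (image choice and pod name are assumptions):

```yaml
# benchmark-job.yaml (sketch)
apiVersion: v1
kind: Pod
metadata:
  name: aiperf-benchmark
  namespace: dynamo-bench
spec:
  restartPolicy: Never
  containers:
    - name: bench
      image: python:3.11-slim
      command: ["/bin/bash", "-lc"]
      args:
        - |
          apt-get update && apt-get install -y tmux curl &&
          pip install aiperf &&
          sleep infinity   # keep the pod alive for exec/tmux sessions
```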
Deploy:
Wait for pod to be ready (the init takes ~1-2 minutes to install packages):
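For example (substitute the pod name from your `benchmark-job.yaml` for `<benchmark-pod>`):

```shell
kubectl apply -f benchmark-job.yaml
kubectl wait --for=condition=Ready pod/<benchmark-pod> \
  -n dynamo-bench --timeout=300s
```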
Step 4.2: Copy Dataset to Benchmark Pod
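For example, with `kubectl cp` (substitute your benchmark pod’s name; the target path is an assumption — just keep it consistent with the benchmark command later):

```shell
kubectl cp mooncake_trace.jsonl \
  dynamo-bench/<benchmark-pod>:/tmp/mooncake_trace.jsonl
```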
Phase 5: Run Benchmarks
Step 5.1: Benchmark Router-ON
Verify the frontend service is reachable (the operator creates a service named {deployment-name}-frontend):
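For example (assumes `curl` is available in the benchmark pod and that the frontend listens on port 8000, the common Dynamo default):

```shell
kubectl get svc -n dynamo-bench
# Smoke-test the OpenAI-compatible endpoint from inside the cluster:
kubectl exec -n dynamo-bench <benchmark-pod> -- \
  curl -s http://<deployment-name>-frontend:8000/v1/models
```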
Launch the benchmark inside a tmux session so it survives kubectl exec disconnects:
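A sketch of the launch command. Only `--model`, `--tokenizer`, and `--custom-dataset-type` are flags this guide references elsewhere; the remaining flag names and values are assumptions — check `aiperf profile --help` for your installed version:

```shell
kubectl exec -n dynamo-bench <benchmark-pod> -- \
  tmux new-session -d -s router_on \
  "aiperf profile \
     --model Qwen/Qwen3-32B \
     --tokenizer Qwen/Qwen3-32B \
     --url http://<deployment-name>-frontend:8000 \
     --custom-dataset-type mooncake_trace \
     --input-file /tmp/mooncake_trace.jsonl \
     --artifact-dir /tmp/aiperf_router_on"
```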
AIPerf writes the run to /tmp/aiperf_router_on on the pod (summary JSON and profile_export.jsonl).
Monitoring Benchmarks
Benchmarks run inside a tmux session so they survive kubectl exec disconnects.
Attach to the live TUI (detach with Ctrl+B then D):
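Assuming the session was started with the name `router_on`:

```shell
kubectl exec -it -n dynamo-bench <benchmark-pod> -- tmux attach -t router_on
```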
Step 5.2: Switch to Router-OFF and Benchmark
Tear down router-ON and deploy the baseline:
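Using the manifest files created in Step 2.1:

```shell
kubectl delete -f router-on-deployment.yaml
kubectl apply -f router-off-deployment.yaml
```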
Wait for 8/8 workers to be Ready again (re-run the health check from Step 2.4), then clean up the previous tmux session and launch the baseline benchmark:
Step 5.3: Collect Results
Copy the artifact directories (or the summary/export files inside them) to your machine:
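For example (the router-ON path is from Step 5.1; `/tmp/aiperf_router_off` is the assumed counterpart used for the baseline run):

```shell
kubectl cp dynamo-bench/<benchmark-pod>:/tmp/aiperf_router_on ./aiperf_router_on
kubectl cp dynamo-bench/<benchmark-pod>:/tmp/aiperf_router_off ./aiperf_router_off
```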
Each artifact directory contains:
- `profile_export_aiperf.json` — summary with aggregated metrics (TTFT, latency percentiles, throughput)
- `profile_export.jsonl` — per-request records (one JSON object per completed request)
Step 5.4: Quick Comparison
Extract and compare key metrics from the two summary files:
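A minimal comparison script. The JSON layout it reads (a `records` mapping from metric name to an `avg` value) is an assumption — adapt the key lookups to the structure your AIPerf version actually emits:

```python
import json
import sys

def compare(on_path, off_path, metrics=("time_to_first_token", "request_latency")):
    """Compare average metric values between the router-on and router-off runs."""
    with open(on_path) as f:
        on = json.load(f)
    with open(off_path) as f:
        off = json.load(f)
    results = {}
    for m in metrics:
        # Assumed layout: {"records": {"<metric>": {"avg": ...}}} — adapt as needed.
        a = on["records"][m]["avg"]
        b = off["records"][m]["avg"]
        results[m] = {"on": a, "off": b, "improvement_pct": 100.0 * (b - a) / b}
    return results

if __name__ == "__main__" and len(sys.argv) >= 3:
    for metric, r in compare(sys.argv[1], sys.argv[2]).items():
        print(f"{metric}: ON={r['on']:.1f}  OFF={r['off']:.1f}  "
              f"({r['improvement_pct']:+.1f}% better with router)")
```

Run it as `python compare.py aiperf_router_on/profile_export_aiperf.json aiperf_router_off/profile_export_aiperf.json`.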
Phase 6: Analyze Results
Key Metrics to Compare
Interpreting Results
Your Results May Vary: The improvement from KV Smart Router depends heavily on your workload characteristics:
Factors that increase KV router benefit:
- High prefix overlap (shared system prompts, templates, document contexts)
- Long prompts (>2000 tokens) where caching saves significant compute
- Multi-turn conversations with context carryover
- Batch workloads with similar queries
Factors that reduce KV router benefit:
- Unique prompts with no prefix reuse
- Short prompts (less than 1000 tokens) where routing overhead exceeds benefit
- Evenly distributed load where round-robin is already optimal
- Low request rate where cache eviction negates benefits
KV Smart Router is beneficial when:
- TTFT improvements > 20%
- No significant degradation in other metrics
- Workload demonstrates measurable prefix reuse patterns
Standard routing is better when:
- KV router shows less than 10% improvement
- Increased latency variance is observed
- Load distribution across workers is more important than cache affinity
Example Comparison
From our Dynamo Operator benchmark with the full toolagent trace at 0.80× replay speed:
In this example with all 8 workers healthy, the KV router dramatically outperformed the baseline:
- 96% faster TTFT — Users see first token in ~2.6s instead of ~64s
- 79% lower E2E latency — Requests complete in ~19s instead of ~93s
- 95% faster TTFT p99 — Tail latency drops from ~333s to ~18s
The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.
Phase 7: Cleanup
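Tear everything down when you are done:

```shell
kubectl delete -f router-off-deployment.yaml --ignore-not-found
kubectl delete -f router-on-deployment.yaml --ignore-not-found
kubectl delete namespace dynamo-bench   # removes the benchmark pod, secret, and PVC too
```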
Troubleshooting
Issue: Pods Stuck in Pending
Cause: Insufficient GPU resources
Solution:
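Diagnose with:

```shell
kubectl describe pod <pending-pod> -n dynamo-bench   # Events show the scheduling reason
kubectl describe nodes | grep -i "nvidia.com/gpu"    # compare GPU capacity vs. requests
```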
Issue: ImagePullBackOff Errors
Cause: Version mismatch or missing credentials
Solution:
Issue: Operator Not Processing Deployment
Cause: Namespace restrictions
Solution:
- Ensure Dynamo platform is Helm-installed in the namespace
- Verify the operator has the `--restrictedNamespace=dynamo-bench` argument
- Check operator logs: `kubectl logs -n dynamo-bench deployment/dynamo-platform-dynamo-operator-controller-manager`
Issue: Workers Not Becoming Ready
Cause: Model download failures or probe configuration
Solution:
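Inspect the failing worker:

```shell
kubectl logs <worker-pod> -n dynamo-bench --tail=100   # look for download/load errors
kubectl describe pod <worker-pod> -n dynamo-bench      # probe failures appear under Events
```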
Issue: Workers Restarting in CrashLoopBackOff
Cause: Startup probe timeout — workers killed before finishing initialization
Symptoms:
- Pods show “Container main failed startup probe, will be restarted”
- Logs show model still downloading or loading when pod is killed
Solution:
The deployment YAMLs in this guide set failureThreshold: 60, allowing up to 32 minutes (120s + 60×30s). If you lowered this value or are using a larger model that needs more time, increase it:
The relevant startup probe fields:
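Using the values quoted above, the probe block might look like this (the probe endpoint and port are assumptions — keep whatever your deployment YAML already uses):

```yaml
startupProbe:
  httpGet:
    path: /live          # assumption; match your existing manifest
    port: 9090           # assumption
  initialDelaySeconds: 120
  periodSeconds: 30
  failureThreshold: 60   # 120s + 60 × 30s ≈ 32 minutes max startup time
```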
Model Loading Times (approximate):
- Qwen3-32B: ~20-25 minutes (first download)
- With cached model on node: ~2-5 minutes
Issue: Unequal Worker Health
Cause: Resource constraints, image pull issues, or configuration errors
Solution:
Advanced Configuration
Testing Different Models
Replace `Qwen/Qwen3-32B` with your model in:
- The deployment YAML `args` section
- The AIPerf `--model` and `--tokenizer` parameters
Adjusting Worker Count
Change replicas: 8 in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.
Using Custom Datasets
Replace the Mooncake trace with your own JSONL file:
- Format: one request per line with a `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
Disaggregated Prefill/Decode
For advanced testing, add separate prefill workers:
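As a sketch, you might add a prefill service alongside VllmDecodeWorker; the component name and the `--is-prefill-worker` flag follow Dynamo’s disaggregated examples but should be verified against your version:

```yaml
    VllmPrefillWorker:
      componentType: worker
      replicas: 4              # keep identical across both deployments
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: <dynamo-vllm-runtime-image>
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-32B --is-prefill-worker
```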
Best Practices
- Equal Conditions: Ensure both deployments have identical worker counts and health before benchmarking
- Warm-Up: Run a small test (100 requests) before the full benchmark to warm up caches
- Multiple Runs: Run benchmarks 3+ times and average results for statistical significance
- Monitor Workers: Watch for any pod restarts or issues during benchmark runs
- Document Conditions: Record cluster state, worker health, and any anomalies
- Consistent Configuration: Use the same trace file and AIPerf options for both runs
Conclusion
This guide provides a complete methodology for A/B testing Dynamo’s KV Smart Router. The KV router’s effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the Tuning Guidelines.
For questions or issues, consult the Dynamo documentation or open an issue on GitHub.
Appendix: Files Reference
- `router-off-deployment.yaml`: standard routing deployment
- `router-on-deployment.yaml`: KV router enabled deployment
- `benchmark-job.yaml`: AIPerf benchmark pod
- AIPerf artifact dirs: summary JSON and `profile_export.jsonl` per run
Repository: https://github.com/ai-dynamo/dynamo