KV Router A/B Testing
This guide walks you through setting up and running A/B benchmarks to compare Dynamo’s KV Smart Router against standard round-robin routing on a Kubernetes cluster.
Dynamo’s KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
Prerequisites: Kubernetes cluster with GPUs, kubectl, helm
kubectl (configured with cluster access)
helm (v3+)
This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.
Key Difference: Deployment B (router-ON, router-on-deployment.yaml) sets DYN_ROUTER_MODE=kv on the frontend to enable KV cache-aware routing.
If the model you want to deploy requires a Hugging Face token to download (Llama-family models do), replace YOUR_HF_TOKEN with your actual token:
If your cluster uses namespace-restricted Dynamo operators, you’ll need to install the Dynamo platform in the workload namespace. Follow the Dynamo Kubernetes Installation Guide to install the platform in dynamo-bench.
Key Configuration Notes:
dynamo-operator.namespaceRestriction.enabled=true is set during installation
Expect operator, etcd, and nats pods Running before deploying the graph.
Create router-off-deployment.yaml (baseline):
Create router-on-deployment.yaml (KV router ON):
💡 Optimization Tip: Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with ReadWriteMany access mode to cache the model.
First, create the PVC in the same namespace as your deployment (e.g. dynamo-bench). Use a storage class that supports ReadWriteMany:
Apply it: kubectl apply -f pvc-model-cache.yaml
Then reference the existing PVC in your DynamoGraphDeployment by adding the following under spec (and under VllmDecodeWorker, add volumeMounts):
With this configuration, only one worker downloads the model on the first run; the rest load it from the cache. The main benefit is on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.
Wait for all pods to reach Running status and pass readiness probes.
Expected Timeline:
The deployment’s startup probe (initialDelaySeconds: 120, periodSeconds: 30, failureThreshold: 60) allows up to 32 minutes per pod for model download and initialization.
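The 32-minute figure follows from the probe fields quoted above. As a minimal sketch (the helper names are illustrative and not part of Dynamo or Kubernetes), the same arithmetic can be used to estimate the failureThreshold a slower-loading model would need:

```python
import math


def startup_budget_seconds(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate upper bound on startup time, using the guide's arithmetic:
    initialDelaySeconds + failureThreshold * periodSeconds."""
    return initial_delay_s + period_s * failure_threshold


def min_failure_threshold(required_s: int, initial_delay_s: int, period_s: int) -> int:
    """Smallest failureThreshold whose budget covers required_s seconds."""
    remaining = max(0, required_s - initial_delay_s)
    return math.ceil(remaining / period_s)


# Values from this guide: initialDelaySeconds=120, periodSeconds=30, failureThreshold=60.
budget = startup_budget_seconds(120, 30, 60)
print(budget, budget / 60)  # 1920 32.0

# Hypothetical case: a model that needs 45 minutes to download and load.
print(min_failure_threshold(45 * 60, 120, 30))  # 86
```

The 45-minute case is only an example input; measure your own model's load time before choosing a value.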
⚠️ CRITICAL CHECKPOINT: Before running benchmarks, you MUST verify equal worker health. Unequal worker counts will invalidate your comparison results.
All 8 must show 1/1 Running and Ready. Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).
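The readiness check can also be scripted. The sketch below counts ready workers in the JSON that `kubectl get pods -n dynamo-bench -o json` prints. It relies only on standard Pod status fields (`status.phase`, `status.containerStatuses[].ready`). Matching workers by the substring "worker" in the pod name is an assumption; adjust it to whatever names your deployment produces. The sample pod names are made up for illustration.

```python
def ready_worker_count(pod_list: dict, name_fragment: str = "worker") -> int:
    """Count pods whose name contains name_fragment and that are Running
    with every container reporting ready."""
    count = 0
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        if name_fragment not in name.lower():
            continue  # skip frontend, operator, etcd, nats, etc.
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        if status.get("phase") == "Running" and containers and all(
            c.get("ready") for c in containers
        ):
            count += 1
    return count


# Hypothetical pod list with the same shape as `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "demo-vllmdecodeworker-0"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "demo-vllmdecodeworker-1"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": False}]}},
        {"metadata": {"name": "demo-frontend-0"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
    ]
}
print(ready_worker_count(sample))  # 1
```

In practice you would load the real pod list with `json.load(sys.stdin)` and refuse to start the benchmark unless the count equals the configured replica count (8 in this guide).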
For this A/B comparison, we use the Mooncake FAST’25 Toolagent Trace, published by Mooncake AI (USENIX FAST’25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production tool-agent workloads — AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains 23,608 requests spanning ~59 minutes of real-time traffic.
Why the toolagent trace? Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router’s real-world performance gains.
What’s in the dataset? Each trace entry contains:
Sample trace entries (showing prefix reuse):
These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a 512-token block, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The KV Smart Router routes requests with matching hash IDs to the same worker, maximizing cache hits.
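To make the prefix arithmetic concrete, here is a small sketch that measures the shared prefix of two hash ID lists. The two lists are illustrative stand-ins that share blocks 46–57 as described above; the trailing IDs are placeholders, not values quoted from the trace.

```python
BLOCK_TOKENS = 512  # each hash ID covers one 512-token block


def shared_prefix_blocks(a: list[int], b: list[int]) -> int:
    """Number of leading hash IDs that two requests have in common."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


# Illustrative requests: both start with blocks 46..57, then diverge.
first = list(range(46, 58)) + [2353, 2354]
second = list(range(46, 58)) + [2366]

blocks = shared_prefix_blocks(first, second)
print(blocks, blocks * BLOCK_TOKENS)  # 12 6144
```

Only the leading run counts because each hash covers its block and every block before it, so the shared prefix ends at the first mismatch.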
Key Dataset Properties:
Create benchmark-job.yaml:
This pod installs tmux and aiperf on startup so benchmarks can run inside a tmux session that survives kubectl exec disconnects.
Deploy:
Wait for pod to be ready (the init takes ~1-2 minutes to install packages):
Verify the frontend service is reachable (the operator creates a service named {deployment-name}-frontend):
Launch the benchmark inside a tmux session so it survives kubectl exec disconnects:
AIPerf writes the run to /tmp/aiperf_router_on on the pod (summary JSON and profile_export.jsonl).
Benchmarks run inside a tmux session so they survive kubectl exec disconnects.
Attach to the live TUI (detach with Ctrl+B then D):
Tear down router-ON and deploy the baseline:
Wait for 8/8 workers to be Ready again (re-run the health check from Step 2.4), then clean up the previous tmux session and launch the baseline benchmark:
Copy the artifact directories (or the summary/export files inside them) to your machine:
Each artifact directory contains:
profile_export_aiperf.json: summary with aggregated metrics (TTFT, latency percentiles, throughput)
profile_export.jsonl: per-request records (one JSON object per completed request)
Extract and compare key metrics from the two summary files:
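A minimal comparison sketch follows. It assumes you have already pulled the metrics you care about out of each summary file into a flat dictionary of name to value. The summary file's exact schema is not shown in this guide, so the metric names and numbers below are placeholders chosen for easy arithmetic, not benchmark results.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) * 100.0 / baseline


def compare(router_off: dict[str, float], router_on: dict[str, float]) -> dict[str, float]:
    """Percent change (router-ON vs router-OFF) for every metric present in both runs."""
    common = sorted(router_off.keys() & router_on.keys())
    return {name: percent_change(router_off[name], router_on[name]) for name in common}


# Placeholder values only; substitute numbers extracted from your own summary files.
off = {"ttft_ms": 100.0, "throughput": 50.0}
on = {"ttft_ms": 80.0, "throughput": 60.0}

for name, delta in compare(off, on).items():
    print(f"{name}: {delta:+.1f}%")
```

For latency metrics a negative change is an improvement; for throughput a positive change is.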
Your Results May Vary: The improvement from KV Smart Router depends heavily on your workload characteristics:
Factors that increase KV router benefit:
Factors that reduce KV router benefit:
KV Smart Router is beneficial when:
Standard routing is better when:
From our Dynamo Operator benchmark with the full toolagent trace at 0.80× replay speed:
In this example with all 8 workers healthy, the KV router dramatically outperformed the baseline:
The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.
Cause: Insufficient GPU resources
Solution:
Cause: Version mismatch or missing credentials
Solution:
Cause: Namespace restrictions
Solution:
--restrictedNamespace=dynamo-bench argument
kubectl logs -n dynamo-bench deployment/dynamo-platform-dynamo-operator-controller-manager
Cause: Model download failures or probe configuration
Solution:
Cause: Startup probe timeout — workers killed before finishing initialization
Symptoms:
Solution:
The deployment YAMLs in this guide set failureThreshold: 60, allowing up to 32 minutes (120s + 60×30s). If you lowered this value or are using a larger model that needs more time, increase it:
The relevant startup probe fields:
Model Loading Times (approximate):
Cause: Resource constraints, image pull issues, or configuration errors
Solution:
Replace Qwen/Qwen3-32B with your model in:
args section
--model and --tokenizer parameters
Change replicas: 8 in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.
Replace the Mooncake trace with your own JSONL file:
timestamp field
--custom-dataset-type
For advanced testing, add separate prefill workers:
This guide provides a complete methodology for A/B testing Dynamo’s KV Smart Router. The KV router’s effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the Tuning Guidelines.
For questions or issues, consult the Dynamo documentation or open an issue on GitHub.
router-off-deployment.yaml: Standard routing deployment
router-on-deployment.yaml: KV router enabled deployment
benchmark-job.yaml: AIPerf benchmark pod
profile_export.jsonl per run
Repository: https://github.com/ai-dynamo/dynamo