KV Router A/B Testing | NVIDIA Dynamo Documentation

This guide walks you through setting up and running A/B benchmarks to compare Dynamo’s KV Smart Router against standard round-robin routing on a Kubernetes cluster.

Overview

Dynamo’s KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:

Deploy two identical Dynamo configurations: a. A vllm server for Qwen3-32B with 8 workers (aggregated) WITHOUT KV Smart Router enabled b. A vllm server for Qwen3-32B with 8 workers (aggregated) WITH KV Smart Router enabled
Run controlled benchmarks using AIPerf
Compare performance metrics to evaluate KV router effectiveness

Prerequisites: Kubernetes cluster with GPUs, kubectl, helm

Prerequisites

Required Tools

kubectl (configured with cluster access)
helm (v3+)
HuggingFace account and token (if model downloads are gated)
Kubernetes cluster with:
- GPU nodes (H100, H200, or similar)
- Sufficient GPU capacity (8+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace

Knowledge Requirements

Basic Kubernetes concepts (namespaces, pods, services)
Familiarity with LLM inference concepts
Command-line proficiency

Architecture

This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.

┌──────────────────────────────────────────────┐
│ Namespace: dynamo-bench                       │
│ (one of A or B active at a time)              │
│                                              │
│  Deployment A: Router OFF                     │
│    ├─ Frontend (Standard Routing)              │
│    └─ 8x Decode Workers (1 GPU each)          │
│                                              │
│  Deployment B: Router ON                      │
│    ├─ Frontend (KV Smart Router)               │
│    └─ 8x Decode Workers (1 GPU each)          │
│                                              │
│  Benchmark Pod (AIPerf + Dataset)             │
└──────────────────────────────────────────────┘

Key Difference: Deployment B sets DYN_ROUTER_MODE=kv on the frontend to enable KV cache-aware routing.

Phase 1: Namespace and Infrastructure Setup

Step 1.1: Create Namespace

$ kubectl create namespace dynamo-bench

Step 1.2: Create HuggingFace Token Secret (optional)

If the model you’re seeking to deploy requires HF token to download (Llama family models require this), replace YOUR_HF_TOKEN with your actual HuggingFace token:

$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
>   -n dynamo-bench

Step 1.3: Install Dynamo Platform

If your cluster uses namespace-restricted Dynamo operators, you’ll need to install the Dynamo platform in the workload namespace. Follow the Dynamo Kubernetes Installation Guide to install the platform in dynamo-bench.

Key Configuration Notes:

If your cluster uses namespace restrictions, ensure dynamo-operator.namespaceRestriction.enabled=true is set during installation
Adjust version tags to match your cluster’s available Dynamo versions
If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation

Step 1.4: Verify Infrastructure

$ kubectl get pods -n dynamo-bench

Expect operator, etcd, and nats pods Running before deploying the graph.

Phase 2: Deploy Model Serving

Step 2.1: Create Deployment YAMLs

Create router-off-deployment.yaml (baseline):

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: vllm-agg-no-router
5 spec:
6   services:
7     Frontend:
8       dynamoNamespace: vllm-agg-no-router
9       componentType: frontend
10       replicas: 1
11       extraPodSpec:
12         mainContainer:
13           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
14           env:
15             - name: POD_UID
16               valueFrom:
17                 fieldRef:
18                   fieldPath: metadata.uid
19     VllmDecodeWorker:
20       envFromSecret: hf-token-secret
21       dynamoNamespace: vllm-agg-no-router
22       componentType: worker
23       replicas: 8
24       resources:
25         limits:
26           gpu: "1"
27       extraPodSpec:
28         affinity:
29           nodeAffinity:
30             requiredDuringSchedulingIgnoredDuringExecution:
31               nodeSelectorTerms:
32                 - matchExpressions:
33                     - key: node.kubernetes.io/instance-type
34                       operator: In
35                       values:
36                         - gpu-h100-sxm  # Adjust to your GPU node type
37         mainContainer:
38           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
39           workingDir: /workspace
40           command:
41             - /bin/sh
42             - -c
43           args:
44             - >-
45               python3 -m dynamo.vllm
46               --model Qwen/Qwen3-32B
47               --quantization fp8
48               --kv-cache-dtype fp8
49               --max-model-len 131072
50               --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
51               --gpu-memory-utilization 0.90
52               --block-size 64
53               --async-scheduling
54               --disable-log-requests
55           env:
56             - name: DYN_HEALTH_CHECK_ENABLED
57               value: "false"
58             - name: POD_UID
59               valueFrom:
60                 fieldRef:
61                   fieldPath: metadata.uid
62           startupProbe:
63             httpGet:
64               path: /health
65               port: 9090
66             initialDelaySeconds: 120
67             periodSeconds: 30
68             timeoutSeconds: 10
69             failureThreshold: 60
70           livenessProbe:
71             httpGet:
72               path: /live
73               port: 9090
74             initialDelaySeconds: 300
75             periodSeconds: 30
76             timeoutSeconds: 10
77             failureThreshold: 10
78           readinessProbe:
79             httpGet:
80               path: /live
81               port: 9090
82             initialDelaySeconds: 300
83             periodSeconds: 30
84             timeoutSeconds: 10
85             failureThreshold: 10
86       subComponentType: decode

Create router-on-deployment.yaml (KV router ON):

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: vllm-agg-router
5 spec:
6   services:
7     Frontend:
8       dynamoNamespace: vllm-agg-router
9       componentType: frontend
10       replicas: 1
11       extraPodSpec:
12         mainContainer:
13           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
14           env:
15             - name: POD_UID
16               valueFrom:
17                 fieldRef:
18                   fieldPath: metadata.uid
19       envs:
20         - name: DYN_ROUTER_MODE
21           value: kv  # KEY DIFFERENCE: Enable KV Smart Router
22     VllmDecodeWorker:
23       envFromSecret: hf-token-secret
24       dynamoNamespace: vllm-agg-router
25       componentType: worker
26       replicas: 8
27       resources:
28         limits:
29           gpu: "1"
30       extraPodSpec:
31         affinity:
32           nodeAffinity:
33             requiredDuringSchedulingIgnoredDuringExecution:
34               nodeSelectorTerms:
35                 - matchExpressions:
36                     - key: node.kubernetes.io/instance-type
37                       operator: In
38                       values:
39                         - gpu-h100-sxm  # Adjust to your GPU node type
40         mainContainer:
41           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
42           workingDir: /workspace
43           command:
44             - /bin/sh
45             - -c
46           args:
47             - >-
48               python3 -m dynamo.vllm
49               --model Qwen/Qwen3-32B
50               --quantization fp8
51               --kv-cache-dtype fp8
52               --max-model-len 131072
53               --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
54               --gpu-memory-utilization 0.90
55               --block-size 64
56               --async-scheduling
57               --disable-log-requests
58           env:
59             - name: DYN_HEALTH_CHECK_ENABLED
60               value: "false"
61             - name: POD_UID
62               valueFrom:
63                 fieldRef:
64                   fieldPath: metadata.uid
65           startupProbe:
66             httpGet:
67               path: /health
68               port: 9090
69             initialDelaySeconds: 120
70             periodSeconds: 30
71             timeoutSeconds: 10
72             failureThreshold: 60
73           livenessProbe:
74             httpGet:
75               path: /live
76               port: 9090
77             initialDelaySeconds: 300
78             periodSeconds: 30
79             timeoutSeconds: 10
80             failureThreshold: 10
81           readinessProbe:
82             httpGet:
83               path: /live
84               port: 9090
85             initialDelaySeconds: 300
86             periodSeconds: 30
87             timeoutSeconds: 10
88             failureThreshold: 10
89       subComponentType: decode

Step 2.2: Deploy Router-ON First

$ kubectl apply -f router-on-deployment.yaml -n dynamo-bench

💡 Optimization Tip: Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with ReadWriteMany access mode to cache the model.

First, create the PVC in the same namespace as your deployment (e.g. dynamo-bench). Use a storage class that supports ReadWriteMany:

$ kubectl get storageclass   # choose one with ReadWriteMany (e.g. azurefile-csi-premium, nfs, efs)

1 apiVersion: v1
2 kind: PersistentVolumeClaim
3 metadata:
4   name: model-cache
5   namespace: dynamo-bench
6 spec:
7   accessModes:
8     - ReadWriteMany
9   storageClassName: "azurefile-csi-premium"   # Adjust to your cluster
10   resources:
11     requests:
12       storage: 100Gi

Apply it: kubectl apply -f pvc-model-cache.yaml

Then reference the existing PVC in your DynamoGraphDeployment by adding the following under spec (and under VllmDecodeWorker, add volumeMounts):

1 spec:
2   pvcs:
3     - create: false
4       name: model-cache
5       size: "0"
6   services:
7     VllmDecodeWorker:
8       volumeMounts:
9         - mountPoint: /root/.cache/huggingface
10           name: model-cache
11           useAsCompilationCache: false

With this configuration, the first run has one worker download; the rest load from cache. The main benefit is on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.

Step 2.3: Monitor Deployment Progress

$ kubectl get pods -n dynamo-bench -w

Wait for all pods to reach Running status and pass readiness probes.

Expected Timeline:

With shared PVC (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
Without shared PVC: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget 1-2 hours for full deployment (workers start in parallel but are limited by node scheduling)

The deployment’s startup probe (initialDelaySeconds: 120, periodSeconds: 30, failureThreshold: 60) allows up to 32 minutes per pod for model download and initialization.

Step 2.4: Verify Workers Are Healthy

⚠️ CRITICAL CHECKPOINT: Before running benchmarks, you MUST verify equal worker health. Unequal worker counts will invalidate your comparison results.

$ # Quick health check - should show "8/8"
$ echo "Workers: $(kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
$ 
$ # Detailed view
$ kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker

All 8 must show 1/1 Running and Ready. Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).

Phase 3: Prepare Benchmark Dataset

Understanding the Mooncake Toolagent Trace

For this A/B comparison, we use the Mooncake FAST’25 Toolagent Trace, published by Mooncake AI (USENIX FAST’25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production tool-agent workloads — AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains 23,608 requests spanning ~59 minutes of real-time traffic.

Why the toolagent trace? Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router’s real-world performance gains.

What’s in the dataset? Each trace entry contains:

Timestamp: When the request arrived (for realistic request timing)
Input/output lengths: Number of tokens in prompts and responses
Block hash IDs: Cryptographic hashes representing KV cache blocks (no user text; explained below)

Sample trace entries (showing prefix reuse):

1 {"timestamp": 0, "input_length": 9013, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}
2 {"timestamp": 0, "input_length": 6506, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 64]}

These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a 512-token block, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The KV Smart Router routes requests with matching hash IDs to the same worker, maximizing cache hits.

Key Dataset Properties:

✅ Realistic timing: Request arrival patterns from production tool-agent workloads
✅ High prefix overlap: 59% cache ratio (Mooncake FAST’25 paper); iterative tool calls within sessions produce natural prefix reuse
✅ Privacy-preserving: No actual text — only hash-based cache block identifiers
✅ Reproducible: Public dataset enables fair comparisons across different systems

Download and Prepare the Dataset

$ # Download the Mooncake FAST'25 toolagent trace
> curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl -o toolagent_trace.jsonl
> 
> # Slow down timestamps to 0.80× replay speed (~5.3 req/s instead of ~6.7 req/s)
> python3 - <<'PY'
> import json
> 
> with open("toolagent_trace.jsonl") as src, open("toolagent_trace_080x.jsonl", "w") as dst:
>     for line in src:
>         rec = json.loads(line)
>         rec["timestamp"] = int(rec["timestamp"] / 0.80)
>         dst.write(json.dumps(rec) + "\n")
> PY
> 
> echo "Dataset ready: toolagent_trace_080x.jsonl (23,608 requests, 0.80x speed)"

Phase 4: Set Up Benchmark Environment

Step 4.1: Deploy Benchmark Pod

Create benchmark-job.yaml:

1 apiVersion: batch/v1
2 kind: Job
3 metadata:
4   name: aiperf-benchmark
5 spec:
6   backoffLimit: 1
7   template:
8     spec:
9       restartPolicy: Never
10       containers:
11       - name: benchmark
12         image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
13         securityContext:
14           runAsUser: 0  # Required: apt-get and pip install need root in ephemeral benchmark pod
15         command:
16           - /bin/bash
17           - -lc
18           - |
19             apt-get update -qq && apt-get install -y -qq tmux > /dev/null 2>&1
20             pip install -q aiperf==0.5.0
21             echo "Benchmark pod ready (tmux + aiperf installed)."
22             sleep infinity
23         imagePullPolicy: IfNotPresent
24         resources:
25           limits:
26             nvidia.com/gpu: 0

This pod installs tmux and aiperf on startup so benchmarks can run inside a tmux session that survives kubectl exec disconnects.

Deploy:

$ kubectl apply -f benchmark-job.yaml -n dynamo-bench

Wait for pod to be ready (the init takes ~1-2 minutes to install packages):

$ kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -w

Step 4.2: Copy Dataset to Benchmark Pod

$ POD_NAME=$(kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n dynamo-bench cp toolagent_trace_080x.jsonl ${POD_NAME}:/tmp/toolagent_trace_080x.jsonl

Phase 5: Run Benchmarks

Step 5.1: Benchmark Router-ON

Verify the frontend service is reachable (the operator creates a service named {deployment-name}-frontend):

$ kubectl get svc -n dynamo-bench | grep frontend

Launch the benchmark inside a tmux session so it survives kubectl exec disconnects:

$ kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
>   tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
>     AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
>       -m Qwen/Qwen3-32B \
>       --tokenizer Qwen/Qwen3-32B \
>       --input-file /tmp/toolagent_trace_080x.jsonl \
>       --custom-dataset-type mooncake_trace \
>       --fixed-schedule \
>       --url http://vllm-agg-router-frontend.dynamo-bench.svc.cluster.local:8000 \
>       --streaming \
>       --random-seed 42 \
>       --workers-max 200 \
>       --request-timeout-seconds 1000 \
>       --profile-export-level records \
>       --record-processors 8 \
>       --artifact-dir /tmp/aiperf_router_on \
>       --goodput \"time_to_first_token:5000 inter_token_latency:100\""
> '

AIPerf writes the run to /tmp/aiperf_router_on on the pod (summary JSON and profile_export.jsonl).

Monitoring Benchmarks

Benchmarks run inside a tmux session so they survive kubectl exec disconnects.

Attach to the live TUI (detach with Ctrl+B then D):

$ kubectl -n dynamo-bench exec -it ${POD_NAME} -- tmux a -t benchmark

Step 5.2: Switch to Router-OFF and Benchmark

Tear down router-ON and deploy the baseline:

$ kubectl delete dynamographdeployment vllm-agg-router -n dynamo-bench
$ kubectl apply -f router-off-deployment.yaml -n dynamo-bench

Wait for 8/8 workers to be Ready again (re-run the health check from Step 2.4), then clean up the previous tmux session and launch the baseline benchmark:

$ kubectl -n dynamo-bench exec ${POD_NAME} -- tmux kill-session -t benchmark 2>/dev/null
$ 
$ kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
>   tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
>     AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
>       -m Qwen/Qwen3-32B \
>       --tokenizer Qwen/Qwen3-32B \
>       --input-file /tmp/toolagent_trace_080x.jsonl \
>       --custom-dataset-type mooncake_trace \
>       --fixed-schedule \
>       --url http://vllm-agg-no-router-frontend.dynamo-bench.svc.cluster.local:8000 \
>       --streaming \
>       --random-seed 42 \
>       --workers-max 200 \
>       --request-timeout-seconds 1000 \
>       --profile-export-level records \
>       --record-processors 8 \
>       --artifact-dir /tmp/aiperf_router_off \
>       --goodput \"time_to_first_token:5000 inter_token_latency:100\""
> '

Step 5.3: Collect Results

Copy the artifact directories (or the summary/export files inside them) to your machine:

$ kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_on ./aiperf_router_on
$ kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_off ./aiperf_router_off

Each artifact directory contains:

profile_export_aiperf.json — summary with aggregated metrics (TTFT, latency percentiles, throughput)
profile_export.jsonl — per-request records (one JSON object per completed request)

Step 5.4: Quick Comparison

Extract and compare key metrics from the two summary files:

$ python3 -c "
> import json, pathlib
> 
> def load(d):
>     return json.loads(pathlib.Path(d, 'profile_export_aiperf.json').read_text())
> 
> on, off = load('aiperf_router_on'), load('aiperf_router_off')
> 
> metrics = [
>     ('TTFT avg (ms)',             'time_to_first_token', 'avg'),
>     ('TTFT p99 (ms)',             'time_to_first_token', 'p99'),
>     ('E2E Latency avg (ms)',      'request_latency',     'avg'),
>     ('E2E Latency p99 (ms)',      'request_latency',     'p99'),
>     ('Output Throughput (tok/s)', 'output_token_throughput', 'avg'),
> ]
> 
> print(f\"{'Metric':<28} {'Router-OFF':>12} {'Router-ON':>12} {'Speedup':>10}\")
> print('-' * 66)
> for label, key, stat in metrics:
>     v_off = off.get(key, {}).get(stat, 0)
>     v_on  = on.get(key, {}).get(stat, 0)
>     if 'throughput' in key.lower():
>         speedup = v_on / v_off if v_off else 0
>     else:
>         speedup = v_off / v_on if v_on else 0
>     print(f'{label:<28} {v_off:>12.1f} {v_on:>12.1f} {speedup:>9.1f}x')
> "

Phase 6: Analyze Results

Key Metrics to Compare

Metric	Description	What to Look For
Time to First Token (TTFT)	Latency until first token arrives	Lower is better; KV router may reduce with prefix reuse
Inter Token Latency (ITL)	Average time between tokens	Lower is better; indicates generation speed
Request Latency	Total end-to-end latency	Lower is better; overall user experience
Output Token Throughput	Tokens generated per second (system-wide)	Higher is better; system efficiency
Request Throughput	Requests completed per second	Higher is better; capacity

Interpreting Results

Your Results May Vary: The improvement from KV Smart Router depends heavily on your workload characteristics:

Factors that increase KV router benefit:

High prefix overlap (shared system prompts, templates, document contexts)
Long prompts (>2000 tokens) where caching saves significant compute
Multi-turn conversations with context carryover
Batch workloads with similar queries

Factors that reduce KV router benefit:

Unique prompts with no prefix reuse
Short prompts (less than 1000 tokens) where routing overhead exceeds benefit
Evenly distributed load where round-robin is already optimal
Low request rate where cache eviction negates benefits

KV Smart Router is beneficial when:

TTFT improvements > 20%
No significant degradation in other metrics
Workload demonstrates measurable prefix reuse patterns

Standard routing is better when:

KV router shows less than 10% improvement
Increased latency variance is observed
Load distribution across workers is more important than cache affinity

Example Comparison

From our Dynamo Operator benchmark with the full toolagent trace at 0.80× replay speed:

Metric	Router-OFF (Baseline)	Router-ON (KV Router)	Improvement	Speedup
TTFT avg	63,652 ms	2,586 ms	96% faster	24.6x ✅
TTFT p99	332,974 ms	17,871 ms	95% faster	18.6x ✅
E2E Latency avg	92,856 ms	19,112 ms	79% faster	4.9x ✅
E2E Latency p99	411,252 ms	88,274 ms	79% faster	4.7x ✅

In this example with all 8 workers healthy, the KV router dramatically outperformed the baseline:

96% faster TTFT — Users see first token in ~2.6s instead of ~64s
79% lower E2E latency — Requests complete in ~19s instead of ~93s
95% faster TTFT p99 — Tail latency drops from ~333s to ~18s

The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.

Phase 7: Cleanup

$ kubectl delete dynamographdeployment --all -n dynamo-bench
$ kubectl delete job aiperf-benchmark -n dynamo-bench
$ kubectl delete namespace dynamo-bench

Troubleshooting

Issue: Pods Stuck in Pending

Cause: Insufficient GPU resources

Solution:

$ # Check GPU availability
$ kubectl describe nodes | grep -A 10 "Allocated resources"
$ 
$ # Reduce worker replicas if needed
$ kubectl edit dynamographdeployment -n dynamo-bench

Issue: ImagePullBackOff Errors

Cause: Version mismatch or missing credentials

Solution:

$ # Check available versions
$ kubectl get pods -n dynamo-bench -o yaml | grep image:
$ 
$ # Update deployment YAML to match cluster version

Issue: Operator Not Processing Deployment

Cause: Namespace restrictions

Solution:

Ensure Dynamo platform is Helm-installed in the namespace
Verify operator has --restrictedNamespace=dynamo-bench argument
Check operator logs: kubectl logs -n dynamo-bench deployment/dynamo-platform-dynamo-operator-controller-manager

Issue: Workers Not Becoming Ready

Cause: Model download failures or probe configuration

Solution:

$ # Check worker logs
$ kubectl logs -n dynamo-bench <worker-pod-name>
$ 
$ # Common issues:
$ # - Invalid HuggingFace token
$ # - Network connectivity
$ # - Insufficient disk space for model

Issue: Workers Restarting in CrashLoopBackOff

Cause: Startup probe timeout — workers killed before finishing initialization

Symptoms:

Pods show “Container main failed startup probe, will be restarted”
Logs show model still downloading or loading when pod is killed

Solution: The deployment YAMLs in this guide set failureThreshold: 60, allowing up to 32 minutes (120s + 60×30s). If you lowered this value or are using a larger model that needs more time, increase it:

$ kubectl patch dynamographdeployment <deployment-name> -n dynamo-bench --type='json' \
>   -p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 80}]'

The relevant startup probe fields:

1 startupProbe:
2   httpGet:
3     path: /health
4     port: 9090
5   initialDelaySeconds: 120
6   periodSeconds: 30
7   timeoutSeconds: 10
8   failureThreshold: 60  # 32 minutes total (120s + 60*30s); increase for larger models

Model Loading Times (approximate):

Qwen3-32B: ~20-25 minutes (first download)
With cached model on node: ~2-5 minutes

Issue: Unequal Worker Health

Cause: Resource constraints, image pull issues, or configuration errors

Solution:

$ # Check all worker status
$ kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker
$ 
$ # Describe problematic pods
$ kubectl describe pod <pod-name> -n dynamo-bench
$ 
$ # Fix issues before benchmarking or results will be skewed

Advanced Configuration

Testing Different Models

Replace Qwen/Qwen3-32B with your model in:

Deployment YAML args section
AIPerf --model and --tokenizer parameters

Adjusting Worker Count

Change replicas: 8 in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.

Using Custom Datasets

Replace the Mooncake trace with your own JSONL file:

Format: One request per line with timestamp field
AIPerf supports various formats via --custom-dataset-type

Disaggregated Prefill/Decode

For advanced testing, add separate prefill workers:

1 VllmPrefillWorker:
2   componentType: worker
3   replicas: 2
4   # ... configuration

Best Practices

Equal Conditions: Ensure both deployments have identical worker counts and health before benchmarking
Warm-Up: Run a small test (100 requests) before the full benchmark to warm up caches
Multiple Runs: Run benchmarks 3+ times and average results for statistical significance
Monitor Workers: Watch for any pod restarts or issues during benchmark runs
Document Conditions: Record cluster state, worker health, and any anomalies
Consistent Configuration: Use the same trace file and AIPerf options for both runs

Conclusion

This guide provides a complete methodology for A/B testing Dynamo’s KV Smart Router. The KV router’s effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the Tuning Guidelines.

For questions or issues, consult the Dynamo documentation or open an issue on GitHub.

Appendix: Files Reference

router-off-deployment.yaml: Standard routing deployment
router-on-deployment.yaml: KV router enabled deployment
benchmark-job.yaml: AIPerf benchmark pod
AIPerf artifact dirs: summary JSON and profile_export.jsonl per run

Repository: https://github.com/ai-dynamo/dynamo