KV Router A/B Testing

This guide walks you through setting up and running A/B benchmarks to compare Dynamo’s KV Smart Router against standard round-robin routing on a Kubernetes cluster.

Overview

Dynamo’s KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:

  1. Deploy two identical Dynamo configurations:
     a. A vLLM server for Qwen3-32B with 8 workers (aggregated) WITHOUT the KV Smart Router enabled
     b. A vLLM server for Qwen3-32B with 8 workers (aggregated) WITH the KV Smart Router enabled
  2. Run controlled benchmarks using AIPerf
  3. Compare performance metrics to evaluate KV router effectiveness

Prerequisites: Kubernetes cluster with GPUs, kubectl, helm


Prerequisites

Required Tools

  • kubectl (configured with cluster access)
  • helm (v3+)
  • HuggingFace account and token (if model downloads are gated)
  • Kubernetes cluster with:
    • GPU nodes (H100, H200, or similar)
    • Sufficient GPU capacity (8+ GPUs recommended for this example)
    • Dynamo platform installed globally OR ability to install per-namespace

Knowledge Requirements

  • Basic Kubernetes concepts (namespaces, pods, services)
  • Familiarity with LLM inference concepts
  • Command-line proficiency

Architecture

This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.

┌────────────────────────────────────┐
│ Namespace: dynamo-bench            │
│ (one of A or B active at a time)   │
│                                    │
│ Deployment A: Router OFF           │
│ ├─ Frontend (Standard Routing)     │
│ └─ 8x Decode Workers (1 GPU each)  │
│                                    │
│ Deployment B: Router ON            │
│ ├─ Frontend (KV Smart Router)      │
│ └─ 8x Decode Workers (1 GPU each)  │
│                                    │
│ Benchmark Pod (AIPerf + Dataset)   │
└────────────────────────────────────┘

Key Difference: Deployment B sets DYN_ROUTER_MODE=kv on the frontend to enable KV cache-aware routing.


Phase 1: Namespace and Infrastructure Setup

Step 1.1: Create Namespace

$kubectl create namespace dynamo-bench

Step 1.2: Create HuggingFace Token Secret (optional)

If the model you're deploying requires a HuggingFace token to download (Llama-family models do), replace YOUR_HF_TOKEN with your actual HuggingFace token:

$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
> -n dynamo-bench

Step 1.3: Install Dynamo Platform

If your cluster uses namespace-restricted Dynamo operators, you’ll need to install the Dynamo platform in the workload namespace. Follow the Dynamo Kubernetes Installation Guide to install the platform in dynamo-bench.

Key Configuration Notes:

  • If your cluster uses namespace restrictions, ensure dynamo-operator.namespaceRestriction.enabled=true is set during installation
  • Adjust version tags to match your cluster’s available Dynamo versions
  • If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation

Step 1.4: Verify Infrastructure

$kubectl get pods -n dynamo-bench

Expect operator, etcd, and nats pods Running before deploying the graph.


Phase 2: Deploy Model Serving

Step 2.1: Create Deployment YAMLs

Create router-off-deployment.yaml (baseline):

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-no-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-no-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          env:
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-no-router
      componentType: worker
      subComponentType: decode
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h100-sxm # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model Qwen/Qwen3-32B
              --quantization fp8
              --kv-cache-dtype fp8
              --max-model-len 131072
              --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
              --gpu-memory-utilization 0.90
              --block-size 64
              --async-scheduling
              --disable-log-requests
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10

Create router-on-deployment.yaml (KV router ON):

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          env:
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
      envs:
        - name: DYN_ROUTER_MODE
          value: kv # KEY DIFFERENCE: Enable KV Smart Router
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-router
      componentType: worker
      subComponentType: decode
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h100-sxm # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model Qwen/Qwen3-32B
              --quantization fp8
              --kv-cache-dtype fp8
              --max-model-len 131072
              --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
              --gpu-memory-utilization 0.90
              --block-size 64
              --async-scheduling
              --disable-log-requests
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10

Step 2.2: Deploy Router-ON First

$kubectl apply -f router-on-deployment.yaml -n dynamo-bench

💡 Optimization Tip: Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with ReadWriteMany access mode to cache the model.

First, create the PVC in the same namespace as your deployment (e.g. dynamo-bench). Use a storage class that supports ReadWriteMany:

$kubectl get storageclass # choose one with ReadWriteMany (e.g. azurefile-csi-premium, nfs, efs)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: dynamo-bench
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "azurefile-csi-premium" # Adjust to your cluster
  resources:
    requests:
      storage: 100Gi

Apply it: kubectl apply -f pvc-model-cache.yaml

Then reference the existing PVC in your DynamoGraphDeployment by adding the following under spec (and under VllmDecodeWorker, add volumeMounts):

spec:
  pvcs:
    - create: false
      name: model-cache
      size: "0"
  services:
    VllmDecodeWorker:
      volumeMounts:
        - mountPoint: /root/.cache/huggingface
          name: model-cache
          useAsCompilationCache: false

With this configuration, the first run has one worker download; the rest load from cache. The main benefit is on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.

Step 2.3: Monitor Deployment Progress

$kubectl get pods -n dynamo-bench -w

Wait for all pods to reach Running status and pass readiness probes.

Expected Timeline:

  • With shared PVC (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
  • Without shared PVC: 20-30 minutes per worker (workers download independently)
    • For 8 workers: Budget 1-2 hours for full deployment (workers start in parallel but are limited by node scheduling)

The deployment’s startup probe (initialDelaySeconds: 120, periodSeconds: 30, failureThreshold: 60) allows up to 32 minutes per pod for model download and initialization.

Step 2.4: Verify Workers Are Healthy

⚠️ CRITICAL CHECKPOINT: Before running benchmarks, you MUST verify equal worker health. Unequal worker counts will invalidate your comparison results.

$# Quick health check - should show "8/8"
$echo "Workers: $(kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
$
$# Detailed view
$kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker

All 8 must show 1/1 Running and Ready. Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).


Phase 3: Prepare Benchmark Dataset

Understanding the Mooncake Toolagent Trace

For this A/B comparison, we use the Mooncake FAST'25 Toolagent Trace, published by the Mooncake team at Moonshot AI (USENIX FAST'25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production tool-agent workloads — AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains 23,608 requests spanning ~59 minutes of real-time traffic.

Why the toolagent trace? Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router’s real-world performance gains.

What’s in the dataset? Each trace entry contains:

  • Timestamp: When the request arrived (for realistic request timing)
  • Input/output lengths: Number of tokens in prompts and responses
  • Block hash IDs: Cryptographic hashes representing KV cache blocks (no user text; explained below)

Sample trace entries (showing prefix reuse):

{"timestamp": 0, "input_length": 9013, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}
{"timestamp": 0, "input_length": 6506, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 64]}

These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a 512-token block, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The KV Smart Router routes requests with matching hash IDs to the same worker, maximizing cache hits.
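The block matching described above is easy to verify by hand. A minimal sketch (not Dynamo's implementation) that computes the shared-prefix token count from two entries' hash_ids, using the trace's 512-token block size:

```python
# Each hash ID represents one 512-token KV cache block; a common leading run
# of IDs is a shared prompt prefix.
BLOCK_TOKENS = 512

def shared_prefix_tokens(hash_ids_a, hash_ids_b, block_tokens=BLOCK_TOKENS):
    """Number of prompt tokens covered by the common leading blocks."""
    shared = 0
    for a, b in zip(hash_ids_a, hash_ids_b):
        if a != b:
            break
        shared += 1
    return shared * block_tokens

# The two sample entries above share blocks 46-57:
req1 = [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]
req2 = [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 64]
print(shared_prefix_tokens(req1, req2))  # 12 blocks * 512 = 6144 tokens
```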

Key Dataset Properties:

  • Realistic timing: Request arrival patterns from production tool-agent workloads
  • High prefix overlap: 59% cache ratio (Mooncake FAST’25 paper); iterative tool calls within sessions produce natural prefix reuse
  • Privacy-preserving: No actual text — only hash-based cache block identifiers
  • Reproducible: Public dataset enables fair comparisons across different systems

Download and Prepare the Dataset

$# Download the Mooncake FAST'25 toolagent trace
>curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl -o toolagent_trace.jsonl
>
># Slow down timestamps to 0.80× replay speed (~5.3 req/s instead of ~6.7 req/s)
>python3 - <<'PY'
>import json
>
>with open("toolagent_trace.jsonl") as src, open("toolagent_trace_080x.jsonl", "w") as dst:
> for line in src:
> rec = json.loads(line)
> rec["timestamp"] = int(rec["timestamp"] / 0.80)
> dst.write(json.dumps(rec) + "\n")
>PY
>
>echo "Dataset ready: toolagent_trace_080x.jsonl (23,608 requests, 0.80x speed)"
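As a sanity check after rescaling, a short script can confirm the request count and replay span (assuming millisecond timestamps, as in the Mooncake traces):

```python
import json
import os

def trace_stats(path):
    """Request count and replay span (seconds) of a Mooncake-format trace."""
    timestamps = [json.loads(line)["timestamp"] for line in open(path)]
    span_s = (max(timestamps) - min(timestamps)) / 1000  # ms -> s
    return len(timestamps), span_s

if os.path.exists("toolagent_trace_080x.jsonl"):
    n, span = trace_stats("toolagent_trace_080x.jsonl")
    print(f"{n} requests over {span / 60:.1f} min (~{n / span:.1f} req/s)")
```

At 0.80x speed the span should come out around 74 minutes with an average rate near 5.3 req/s.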

Phase 4: Set Up Benchmark Environment

Step 4.1: Deploy Benchmark Pod

Create benchmark-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: benchmark
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          securityContext:
            runAsUser: 0 # Required: apt-get and pip install need root in this ephemeral pod
          command:
            - /bin/bash
            - -lc
            - |
              apt-get update -qq && apt-get install -y -qq tmux > /dev/null 2>&1
              pip install -q aiperf==0.5.0
              echo "Benchmark pod ready (tmux + aiperf installed)."
              sleep infinity
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              nvidia.com/gpu: 0

This pod installs tmux and aiperf on startup so benchmarks can run inside a tmux session that survives kubectl exec disconnects.

Deploy:

$kubectl apply -f benchmark-job.yaml -n dynamo-bench

Wait for pod to be ready (the init takes ~1-2 minutes to install packages):

$kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -w

Step 4.2: Copy Dataset to Benchmark Pod

$POD_NAME=$(kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
$kubectl -n dynamo-bench cp toolagent_trace_080x.jsonl ${POD_NAME}:/tmp/toolagent_trace_080x.jsonl

Phase 5: Run Benchmarks

Step 5.1: Benchmark Router-ON

Verify the frontend service is reachable (the operator creates a service named {deployment-name}-frontend):

$kubectl get svc -n dynamo-bench | grep frontend

Launch the benchmark inside a tmux session so it survives kubectl exec disconnects:

$kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
> tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
> AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
> -m Qwen/Qwen3-32B \
> --tokenizer Qwen/Qwen3-32B \
> --input-file /tmp/toolagent_trace_080x.jsonl \
> --custom-dataset-type mooncake_trace \
> --fixed-schedule \
> --url http://vllm-agg-router-frontend.dynamo-bench.svc.cluster.local:8000 \
> --streaming \
> --random-seed 42 \
> --workers-max 200 \
> --request-timeout-seconds 1000 \
> --profile-export-level records \
> --record-processors 8 \
> --artifact-dir /tmp/aiperf_router_on \
> --goodput \"time_to_first_token:5000 inter_token_latency:100\""
>'

AIPerf writes the run to /tmp/aiperf_router_on on the pod (summary JSON and profile_export.jsonl).

Monitoring Benchmarks

Benchmarks run inside a tmux session so they survive kubectl exec disconnects.

Attach to the live TUI (detach with Ctrl+B then D):

$kubectl -n dynamo-bench exec -it ${POD_NAME} -- tmux a -t benchmark

Step 5.2: Switch to Router-OFF and Benchmark

Tear down router-ON and deploy the baseline:

$kubectl delete dynamographdeployment vllm-agg-router -n dynamo-bench
$kubectl apply -f router-off-deployment.yaml -n dynamo-bench

Wait for 8/8 workers to be Ready again (re-run the health check from Step 2.4), then clean up the previous tmux session and launch the baseline benchmark:

$kubectl -n dynamo-bench exec ${POD_NAME} -- tmux kill-session -t benchmark 2>/dev/null
$
$kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
> tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
> AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
> -m Qwen/Qwen3-32B \
> --tokenizer Qwen/Qwen3-32B \
> --input-file /tmp/toolagent_trace_080x.jsonl \
> --custom-dataset-type mooncake_trace \
> --fixed-schedule \
> --url http://vllm-agg-no-router-frontend.dynamo-bench.svc.cluster.local:8000 \
> --streaming \
> --random-seed 42 \
> --workers-max 200 \
> --request-timeout-seconds 1000 \
> --profile-export-level records \
> --record-processors 8 \
> --artifact-dir /tmp/aiperf_router_off \
> --goodput \"time_to_first_token:5000 inter_token_latency:100\""
>'

Step 5.3: Collect Results

Copy the artifact directories (or the summary/export files inside them) to your machine:

$kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_on ./aiperf_router_on
$kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_off ./aiperf_router_off

Each artifact directory contains:

  • profile_export_aiperf.json — summary with aggregated metrics (TTFT, latency percentiles, throughput)
  • profile_export.jsonl — per-request records (one JSON object per completed request)

Step 5.4: Quick Comparison

Extract and compare key metrics from the two summary files:

$python3 -c "
>import json, pathlib
>
>def load(d):
> return json.loads(pathlib.Path(d, 'profile_export_aiperf.json').read_text())
>
>on, off = load('aiperf_router_on'), load('aiperf_router_off')
>
>metrics = [
> ('TTFT avg (ms)', 'time_to_first_token', 'avg'),
> ('TTFT p99 (ms)', 'time_to_first_token', 'p99'),
> ('E2E Latency avg (ms)', 'request_latency', 'avg'),
> ('E2E Latency p99 (ms)', 'request_latency', 'p99'),
> ('Output Throughput (tok/s)', 'output_token_throughput', 'avg'),
>]
>
>print(f\"{'Metric':<28} {'Router-OFF':>12} {'Router-ON':>12} {'Speedup':>10}\")
>print('-' * 66)
>for label, key, stat in metrics:
> v_off = off.get(key, {}).get(stat, 0)
> v_on = on.get(key, {}).get(stat, 0)
> if 'throughput' in key.lower():
> speedup = v_on / v_off if v_off else 0
> else:
> speedup = v_off / v_on if v_on else 0
> print(f'{label:<28} {v_off:>12.1f} {v_on:>12.1f} {speedup:>9.1f}x')
>"
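The summary files don't break goodput out per request. If you want that, the per-request profile_export.jsonl can be post-processed with the same SLOs passed to --goodput. A sketch — the record field names here ("time_to_first_token", "inter_token_latency", in ms) are assumptions, so verify them against your export's actual schema first:

```python
import json
import pathlib

def goodput_fraction(records, ttft_ms=5000.0, itl_ms=100.0):
    """Fraction of requests meeting both SLOs from the --goodput flag.

    Assumed per-record fields: time_to_first_token, inter_token_latency (ms).
    """
    if not records:
        return 0.0
    good = sum(
        1 for r in records
        if r.get("time_to_first_token", float("inf")) <= ttft_ms
        and r.get("inter_token_latency", float("inf")) <= itl_ms
    )
    return good / len(records)

for run in ("aiperf_router_on", "aiperf_router_off"):
    path = pathlib.Path(run, "profile_export.jsonl")
    if path.exists():
        records = [json.loads(line) for line in path.open()]
        print(f"{run}: goodput = {goodput_fraction(records):.1%}")
```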

Phase 6: Analyze Results

Key Metrics to Compare

| Metric | Description | What to Look For |
|---|---|---|
| Time to First Token (TTFT) | Latency until first token arrives | Lower is better; KV router may reduce it with prefix reuse |
| Inter Token Latency (ITL) | Average time between tokens | Lower is better; indicates generation speed |
| Request Latency | Total end-to-end latency | Lower is better; overall user experience |
| Output Token Throughput | Tokens generated per second (system-wide) | Higher is better; system efficiency |
| Request Throughput | Requests completed per second | Higher is better; capacity |

Interpreting Results

Your Results May Vary: The improvement from KV Smart Router depends heavily on your workload characteristics:

Factors that increase KV router benefit:

  • High prefix overlap (shared system prompts, templates, document contexts)
  • Long prompts (>2000 tokens) where caching saves significant compute
  • Multi-turn conversations with context carryover
  • Batch workloads with similar queries

Factors that reduce KV router benefit:

  • Unique prompts with no prefix reuse
  • Short prompts (less than 1000 tokens) where routing overhead exceeds benefit
  • Evenly distributed load where round-robin is already optimal
  • Low request rate where cache eviction negates benefits

KV Smart Router is beneficial when:

  • TTFT improvements > 20%
  • No significant degradation in other metrics
  • Workload demonstrates measurable prefix reuse patterns

Standard routing is better when:

  • KV router shows less than 10% improvement
  • Increased latency variance is observed
  • Load distribution across workers is more important than cache affinity

Example Comparison

From our Dynamo Operator benchmark with the full toolagent trace at 0.80× replay speed:

| Metric | Router-OFF (Baseline) | Router-ON (KV Router) | Improvement | Speedup |
|---|---|---|---|---|
| TTFT avg | 63,652 ms | 2,586 ms | 96% faster | 24.6x ✅ |
| TTFT p99 | 332,974 ms | 17,871 ms | 95% faster | 18.6x ✅ |
| E2E Latency avg | 92,856 ms | 19,112 ms | 79% faster | 4.9x ✅ |
| E2E Latency p99 | 411,252 ms | 88,274 ms | 79% faster | 4.7x ✅ |

In this example with all 8 workers healthy, the KV router dramatically outperformed the baseline:

  • 96% faster TTFT — Users see first token in ~2.6s instead of ~64s
  • 79% lower E2E latency — Requests complete in ~19s instead of ~93s
  • 95% faster TTFT p99 — Tail latency drops from ~333s to ~18s

The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.
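A toy simulation (not Dynamo's actual routing algorithm) illustrates the mechanism: when sessions resend a growing shared prefix, pinning matching requests to one worker yields a far higher cache hit rate than round-robin. The session/turn structure below is synthetic:

```python
import random

def simulate(route, requests, workers=8):
    """Replay requests through a routing policy; return the KV block hit rate."""
    caches = [set() for _ in range(workers)]
    hits = total = 0
    for i, blocks in enumerate(requests):
        w = route(i, blocks, caches)
        hits += sum(1 for b in blocks if b in caches[w])
        total += len(blocks)
        caches[w].update(blocks)
    return hits / total

def round_robin(i, blocks, caches):
    return i % len(caches)

def prefix_affinity(i, blocks, caches):
    # Route to the worker whose cache overlaps the request the most,
    # breaking ties toward the emptiest cache (crude load balancing).
    bs = set(blocks)
    return max(range(len(caches)),
               key=lambda w: (len(caches[w] & bs), -len(caches[w])))

# Synthetic tool-agent sessions: each turn resends the session's growing prefix.
random.seed(42)
requests = []
for session in range(40):
    prefix = [f"s{session}-base{b}" for b in range(20)]
    for turn in range(10):
        prefix = prefix + [f"s{session}-turn{turn}"]
        requests.append(prefix)
random.shuffle(requests)

print(f"round-robin hit rate:     {simulate(round_robin, requests):.0%}")
print(f"prefix-affinity hit rate: {simulate(prefix_affinity, requests):.0%}")
```

Prefix-affinity routing wins because every request of a session finds that session's earlier blocks already cached on its pinned worker, while round-robin scatters the session across workers and recomputes the prefix on each.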


Phase 7: Cleanup

$kubectl delete dynamographdeployment --all -n dynamo-bench
$kubectl delete job aiperf-benchmark -n dynamo-bench
$kubectl delete namespace dynamo-bench

Troubleshooting

Issue: Pods Stuck in Pending

Cause: Insufficient GPU resources

Solution:

$# Check GPU availability
$kubectl describe nodes | grep -A 10 "Allocated resources"
$
$# Reduce worker replicas if needed
$kubectl edit dynamographdeployment -n dynamo-bench

Issue: ImagePullBackOff Errors

Cause: Version mismatch or missing credentials

Solution:

$# Check available versions
$kubectl get pods -n dynamo-bench -o yaml | grep image:
$
$# Update deployment YAML to match cluster version

Issue: Operator Not Processing Deployment

Cause: Namespace restrictions

Solution:

  • Ensure Dynamo platform is Helm-installed in the namespace
  • Verify operator has --restrictedNamespace=dynamo-bench argument
  • Check operator logs: kubectl logs -n dynamo-bench deployment/dynamo-platform-dynamo-operator-controller-manager

Issue: Workers Not Becoming Ready

Cause: Model download failures or probe configuration

Solution:

$# Check worker logs
$kubectl logs -n dynamo-bench <worker-pod-name>
$
$# Common issues:
$# - Invalid HuggingFace token
$# - Network connectivity
$# - Insufficient disk space for model

Issue: Workers Restarting in CrashLoopBackOff

Cause: Startup probe timeout — workers killed before finishing initialization

Symptoms:

  • Pods show “Container main failed startup probe, will be restarted”
  • Logs show model still downloading or loading when pod is killed

Solution: The deployment YAMLs in this guide set failureThreshold: 60, allowing up to 32 minutes (120s + 60×30s). If you lowered this value or are using a larger model that needs more time, increase it:

$kubectl patch dynamographdeployment <deployment-name> -n dynamo-bench --type='json' \
> -p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 80}]'

The relevant startup probe fields:

startupProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 60 # 32 minutes total (120s + 60*30s); increase for larger models

Model Loading Times (approximate):

  • Qwen3-32B: ~20-25 minutes (first download)
  • With cached model on node: ~2-5 minutes
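If you need a different budget, the probe arithmetic is simple enough to script. A small helper (using the guide's default probe values as assumptions) that computes the startup window and the failureThreshold needed for a target:

```python
import math

def startup_budget_s(initial_delay=120, period=30, failure_threshold=60):
    """Longest a pod can take to pass its startup probe before being restarted."""
    return initial_delay + failure_threshold * period

def threshold_for(target_minutes, initial_delay=120, period=30):
    """failureThreshold needed to allow roughly target_minutes of startup time."""
    return math.ceil((target_minutes * 60 - initial_delay) / period)

print(startup_budget_s() / 60)  # 32.0 -- the guide's default budget in minutes
print(threshold_for(45))        # 86 -- threshold for a ~45-minute budget
```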

Issue: Unequal Worker Health

Cause: Resource constraints, image pull issues, or configuration errors

Solution:

$# Check all worker status
$kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker
$
$# Describe problematic pods
$kubectl describe pod <pod-name> -n dynamo-bench
$
$# Fix issues before benchmarking or results will be skewed

Advanced Configuration

Testing Different Models

Replace Qwen/Qwen3-32B with your model in:

  • Deployment YAML args section
  • AIPerf --model and --tokenizer parameters

Adjusting Worker Count

Change replicas: 8 in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.

Using Custom Datasets

Replace the Mooncake trace with your own JSONL file:

  • Format: One request per line with timestamp field
  • AIPerf supports various formats via --custom-dataset-type
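If you don't have a real trace, you can generate a small synthetic one in the same shape as the Mooncake sample entries shown earlier. The filename custom_trace.jsonl and the session/turn structure are illustrative; verify the exact fields your AIPerf version expects for mooncake_trace:

```python
import json
import random

# Illustrative generator: records mirror the Mooncake sample entries,
# assuming 512 tokens per KV block and millisecond timestamps.
random.seed(0)
next_id = 0
t = 0
with open("custom_trace.jsonl", "w") as f:
    for session in range(4):
        hash_ids = list(range(next_id, next_id + 10))  # shared session prefix
        next_id += 10
        for turn in range(5):
            hash_ids = hash_ids + [next_id]  # context grows each turn
            next_id += 1
            rec = {
                "timestamp": t,  # keep non-decreasing for --fixed-schedule
                "input_length": len(hash_ids) * 512,
                "output_length": random.randint(50, 300),
                "hash_ids": hash_ids,
            }
            f.write(json.dumps(rec) + "\n")
            t += random.randint(100, 500)
print("wrote custom_trace.jsonl")
```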

Disaggregated Prefill/Decode

For advanced testing, add separate prefill workers:

VllmPrefillWorker:
  componentType: worker
  replicas: 2
  # ... configuration

Best Practices

  1. Equal Conditions: Ensure both deployments have identical worker counts and health before benchmarking
  2. Warm-Up: Run a small test (100 requests) before the full benchmark to warm up caches
  3. Multiple Runs: Run benchmarks 3+ times and average results for statistical significance
  4. Monitor Workers: Watch for any pod restarts or issues during benchmark runs
  5. Document Conditions: Record cluster state, worker health, and any anomalies
  6. Consistent Configuration: Use the same trace file and AIPerf options for both runs

Conclusion

This guide provides a complete methodology for A/B testing Dynamo’s KV Smart Router. The KV router’s effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the Tuning Guidelines.

For questions or issues, consult the Dynamo documentation or open an issue on GitHub.


Appendix: Files Reference

  • router-off-deployment.yaml: Standard routing deployment
  • router-on-deployment.yaml: KV router enabled deployment
  • benchmark-job.yaml: AIPerf benchmark pod
  • AIPerf artifact dirs: summary JSON and profile_export.jsonl per run

Repository: https://github.com/ai-dynamo/dynamo