Kimi-K2.6

Serve moonshotai/Kimi-K2.6 with Dynamo and vLLM, tuned per GPU and workload.

Each target below is a validated aggregated vLLM deployment of Kimi-K2.6 (text and image input, reasoning, and tool calling) with KV-aware routing, Eagle3 MLA speculative decoding, and LMCache CPU KV-cache offload, benchmarked to roughly 50 output tok/s per user on its trace. Pick your GPU and workload, then use the commands for that target in each section below.

Choose your deployment target

| Target | Checkpoint | Precision | GPUs per worker | MoE backend | Attention | Workload |
|---|---|---|---|---|---|---|
| B200 chat | nvidia/Kimi-K2.6-NVFP4 | NVFP4 + FP8 KV cache | 4x B200, TP4 | FlashInfer-TRTLLM | TokenSpeed MLA | Chat, 70% KV reuse |
| B200 agentic | nvidia/Kimi-K2.6-NVFP4 | NVFP4 + FP8 KV cache | 4x B200, TP4 | FlashInfer-TRTLLM | TokenSpeed MLA | Agentic, 64K-median ISL, 90% KV reuse |
| H200 chat | moonshotai/Kimi-K2.6 (native INT4) | INT4 | 8x H200, TP8 | Marlin | FlashAttention MLA | Chat, 70% KV reuse |
| H200 agentic | moonshotai/Kimi-K2.6 (native INT4) | INT4 | 8x H200, TP8 | Marlin | FlashAttention MLA | Agentic, 64K-median ISL, 90% KV reuse |

Prerequisites

  • A Kubernetes cluster with the Dynamo platform installed and enough GPUs per worker replica for your target: 4x B200 for B200 targets, 8x H200 for H200 targets (results below were measured at 4 replicas).
  • A Hugging Face token with access to lightseekorg/kimi-k2.6-eagle3-mla and to your target's checkpoint: nvidia/Kimi-K2.6-NVFP4 for B200 targets, moonshotai/Kimi-K2.6 for H200 targets.

Create the namespace and token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="your-token" \
> -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, and cluster-specific placement in the manifests before applying them.

Deploy

Prepare the model cache and download the checkpoint and Eagle3 head:

$# 1. Storage: edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$kubectl apply -f recipes/kimi-k2.6/model-cache/model-cache.yaml -n ${NAMESPACE}
$
$# 2. Download. B200 targets use the NVFP4 checkpoint: remove the native INT4
$# download from model-download.yaml before applying. H200 targets use the native
$# INT4 checkpoint: remove the NVFP4 download instead.
$kubectl apply -f recipes/kimi-k2.6/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

Then deploy. Apply only the manifest for your target:

$# B200 chat
$kubectl apply -f recipes/kimi-k2.6/vllm/agg-b200-chat/deploy.yaml -n ${NAMESPACE}
$# B200 agentic
$kubectl apply -f recipes/kimi-k2.6/vllm/agg-b200-agentic/deploy.yaml -n ${NAMESPACE}
$# H200 chat
$kubectl apply -f recipes/kimi-k2.6/vllm/agg-h200-chat/deploy.yaml -n ${NAMESPACE}
$# H200 agentic
$kubectl apply -f recipes/kimi-k2.6/vllm/agg-h200-agentic/deploy.yaml -n ${NAMESPACE}
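
The manifest paths above and the frontend Service names used in the following sections appear to share one naming pattern keyed on GPU and workload. As a minimal sketch, assuming that pattern holds for all four targets, a small helper can derive both names from a single choice. The function name target_names and the MANIFEST and FRONTEND labels are illustrative and not part of the recipe.

```shell
# Hypothetical helper: maps a GPU SKU and workload to the manifest path and
# frontend Service name used on this page. Not part of the Dynamo recipes.
target_names() {
  gpu="$1"
  workload="$2"
  # Accept only the four combinations documented on this page.
  case "$gpu" in
    b200|h200) ;;
    *) echo "unknown GPU: $gpu" >&2; return 1 ;;
  esac
  case "$workload" in
    chat|agentic) ;;
    *) echo "unknown workload: $workload" >&2; return 1 ;;
  esac
  echo "MANIFEST=recipes/kimi-k2.6/vllm/agg-${gpu}-${workload}/deploy.yaml"
  echo "FRONTEND=kimi-k26-agg-${gpu}-${workload}-frontend"
}

target_names h200 agentic
```

The printed manifest path can be passed to kubectl apply -f, and the Service name to kubectl port-forward, for whichever target you selected.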

Smoke Test

Send a test request to verify the deployment serves traffic. First forward the frontend Service for your target. Run only one of these commands, since they all bind local port 8000 and stay in the foreground:

$# B200 chat
$kubectl port-forward svc/kimi-k26-agg-b200-chat-frontend 8000:8000 -n ${NAMESPACE}
$# B200 agentic
$kubectl port-forward svc/kimi-k26-agg-b200-agentic-frontend 8000:8000 -n ${NAMESPACE}
$# H200 chat
$kubectl port-forward svc/kimi-k26-agg-h200-chat-frontend 8000:8000 -n ${NAMESPACE}
$# H200 agentic
$kubectl port-forward svc/kimi-k26-agg-h200-agentic-frontend 8000:8000 -n ${NAMESPACE}

Then, in a second terminal, send the request:

$curl http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{"model":"moonshotai/Kimi-K2.6","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
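
To turn the request into a pass/fail check, one option is to test the response body for the fields a chat completion normally carries. The sketch below assumes the frontend returns an OpenAI-style JSON body with a top-level "choices" array on success and that error bodies lack it. The helper name is_chat_completion is hypothetical.

```shell
# Sketch: succeed only if the body on stdin looks like a chat completion.
# is_chat_completion is a hypothetical helper, not a Dynamo or vLLM tool.
# Assumption: successful responses contain a "choices" key; errors do not.
is_chat_completion() {
  grep -q '"choices"'
}
```

To use it, add -s to the curl command above and pipe its output into is_chat_completion; the exit status is 0 when the key is present and non-zero otherwise. This checks only the shape of the response, not the quality of the generated text.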

Benchmark

A single AIPerf trace-replay Job (recipes/kimi-k2.6/perf/perf.yaml) covers every target; only ENDPOINT and TRACE_FILE change. Before running it, point the worker’s SPECULATIVE_CONFIG at the speculative-config-synthetic ConfigMap key (synthetic Eagle3 acceptance, AL=2.49), set worker replicas to your target, and stage the traces from recipes/kimi-k2.6/perf/traces/ onto the model-cache PVC:

$kubectl run pvc-helper -n ${NAMESPACE} \
> --image=busybox:1.36 --restart=Never \
> --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/model-cache"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"model-cache"}}]}}' \
> --command -- sleep 3600
$kubectl cp recipes/kimi-k2.6/perf/traces ${NAMESPACE}/pvc-helper:/model-cache/
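
The copy can fail if it runs before the helper pod is ready, and the pod otherwise lingers until its sleep expires. As a sketch under those assumptions, the copy step can be wrapped with a readiness wait and a cleanup. The function name stage_traces and the 120-second timeout are illustrative choices, not part of the recipe.

```shell
# Sketch only: wraps the copy step with a readiness wait and cleanup.
# stage_traces is a hypothetical helper name; the timeout is an assumption.
stage_traces() {
  ns="$1"
  # Block until the helper pod is ready to receive files.
  kubectl wait --for=condition=Ready pod/pvc-helper -n "$ns" --timeout=120s || return 1
  # Same copy as above, with the namespace passed as an argument.
  kubectl cp recipes/kimi-k2.6/perf/traces "$ns/pvc-helper:/model-cache/" || return 1
  # Remove the helper so it does not keep the PVC mounted for the full hour.
  kubectl delete pod pvc-helper -n "$ns"
}
```

Call it as stage_traces "${NAMESPACE}" after the kubectl run command above, in place of the bare kubectl cp.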

Set ENDPOINT to kimi-k26-agg-b200-chat-frontend:8000 and TRACE_FILE to the chat trace, then apply. The Job wraps this AIPerf run:

$aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-b200-chat-frontend:8000 --streaming \
> --use-server-token-count --extra-inputs ignore_eos:true --concurrency 48 --random-seed 42

Set ENDPOINT to kimi-k26-agg-b200-agentic-frontend:8000 and TRACE_FILE to the agentic trace, then apply. The Job wraps this AIPerf run:

$aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-b200-agentic-frontend:8000 --streaming \
> --use-server-token-count --extra-inputs ignore_eos:true --concurrency 64 --random-seed 42

Set ENDPOINT to kimi-k26-agg-h200-chat-frontend:8000 and TRACE_FILE to the chat trace, then apply. The Job wraps this AIPerf run:

$aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-h200-chat-frontend:8000 --streaming \
> --use-server-token-count --extra-inputs ignore_eos:true --concurrency 32 --random-seed 42

Set ENDPOINT to kimi-k26-agg-h200-agentic-frontend:8000 and TRACE_FILE to the agentic trace, then apply. The Job wraps this AIPerf run:

$aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-h200-agentic-frontend:8000 --streaming \
> --use-server-token-count --extra-inputs ignore_eos:true --concurrency 48 --random-seed 42

After setting ENDPOINT and TRACE_FILE for your target, apply the Job and wait for it to complete:

$kubectl apply -f recipes/kimi-k2.6/perf/perf.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/kimi-k26-bench -n ${NAMESPACE} --timeout=7200s

15% and 30% trace subsets are provided for shorter runs. To sweep concurrencies, delete the DynamoGraphDeployment (DGD) worker pods between runs so residual KV-cache and prefix-cache state does not skew results. See the benchmark README for the full workflow, artifact layout, and tunable environment variables.

Expected Performance

Each target is tuned for its workload shape at a 50 tok/s/user interactivity target:

| Workload | Median ISL | Median OSL | KV cache hit rate | User output tok/s |
|---|---|---|---|---|
| Chat | 1K | 1K | 70% | 50 |
| Agentic | 64K | 400 | 90% | 50 |
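
For a sense of what the 50 tok/s/user target means for a single response, the decode time is roughly the output length divided by the per-user rate. The sketch below is a back-of-the-envelope estimate only: it takes 1K as 1024 tokens (an assumption), uses the median OSL from the table, and ignores time to first token. The helper name decode_seconds is hypothetical.

```shell
# Sketch: seconds of decoding for one response at a steady per-user rate.
# decode_seconds is a hypothetical helper; it ignores time to first token.
decode_seconds() {
  awk -v osl="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", osl / rate }'
}

decode_seconds 1024 50   # chat, taking 1K as 1024 tokens (assumption) -> 20.48
decode_seconds 400 50    # agentic -> 8.00
```

At 50 tok/s each token takes about 20 ms, so a median chat response streams for about 20 seconds and a median agentic response for about 8 seconds.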

Measured results below were collected by replaying the 15% trace subsets (*_short_15perc.jsonl) with 4 worker replicas per deployment:

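| Recipe | SKU | Worker replicas | Concurrency | User output tok/s | System output tok/s/GPU |
|---|---|---|---|---|---|
| Chat (15% subset) | B200 | 4 | 48 | 49.86 | 107.8 |
| Agentic (15% subset) | B200 | 4 | 64 | 55.50 | 166.5 |
| Chat (15% subset) | H200 | 4 | 32 | 54.86 | 38.7 |
| Agentic (15% subset) | H200 | 4 | 48 | 56.06 | 66.5 |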
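
The per-GPU column can be scaled back to a deployment-wide figure by multiplying by the GPU count. The sketch below assumes the per-GPU number is averaged over every GPU in the deployment (GPUs per worker times worker replicas); the source does not state the normalization, so treat the totals as estimates. The helper name total_output_tps is hypothetical.

```shell
# Sketch: total output tok/s = per-GPU rate x GPUs per worker x replicas.
# Assumes the per-GPU column is averaged over every GPU in the deployment.
total_output_tps() {
  awk -v per_gpu="$1" -v gpus="$2" -v replicas="$3" \
    'BEGIN { printf "%.1f\n", per_gpu * gpus * replicas }'
}

total_output_tps 107.8 4 4   # B200 chat    -> 1724.8
total_output_tps 166.5 4 4   # B200 agentic -> 2664.0
total_output_tps 38.7 8 4    # H200 chat    -> 1238.4
total_output_tps 66.5 8 4    # H200 agentic -> 2128.0
```

Because the H200 targets use twice as many GPUs per worker, compare targets on the per-GPU column rather than on these totals.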

Compare All Targets

All four targets serve moonshotai/Kimi-K2.6 on aggregated vLLM 0.21.0 with KV-aware routing, Eagle3 MLA speculative decoding (3 draft tokens), and LMCache CPU KV-cache offload. They differ in checkpoint, parallelism, kernel backends, and the trace they are benchmarked against:

| | B200 chat | H200 chat | B200 agentic | H200 agentic |
|---|---|---|---|---|
| GPUs per worker | 4x B200 | 8x H200 | 4x B200 | 8x H200 |
| Precision | NVFP4 + FP8 KV | INT4 (native) | NVFP4 + FP8 KV | INT4 (native) |
| Parallelism | TP4 | TP8 | TP4 | TP8 |
| MoE backend | FlashInfer-TRTLLM | Marlin | FlashInfer-TRTLLM | Marlin |
| Attention backend | TokenSpeed MLA | FlashAttention MLA | TokenSpeed MLA | FlashAttention MLA |
| AllReduce | NCCL symmetric memory | NCCL | NCCL symmetric memory | NCCL |
| Workload | Chat trace | Chat trace | Agentic trace | Agentic trace |

Notes

  • Dynamo’s KV cache router does not support all LMCache KV events, so routing can be sub-optimal.
  • Some 400 HTTP errors raised by workers on invalid inputs can surface as 500 errors through the frontend.
  • B200 targets run nvidia/Kimi-K2.6-NVFP4 with FP8 KV cache; H200 targets run the native INT4 moonshotai/Kimi-K2.6 checkpoint. Both serve under the name moonshotai/Kimi-K2.6.
  • The chat- and agentic-tuned deployments for a given SKU share the same engine configuration; they are separate targets because each is benchmarked and sized against its own trace, and either trace can be replayed against either DGD.
  • For benchmarking, swap SPECULATIVE_CONFIG to the speculative-config-synthetic key so Eagle3 acceptance is synthetic and deterministic (AL=2.49); production deployments use the standard speculative-config key.
