Kimi-K2.6 | NVIDIA Dynamo Documentation

Each target below is a validated aggregated vLLM deployment of Kimi-K2.6 (text and image input, reasoning, and tool calling) with KV-aware routing, Eagle3 MLA speculative decoding, and LMCache CPU KV-cache offload, benchmarked to roughly 50 output tok/s per user on its trace. Pick your GPU and workload, then use the commands for that target in each section below.

Choose your deployment target

GPU: B200 (recommended) or H200

Workload: Chat or Agentic

Target	Checkpoint	Precision	GPUs	MoE backend	Attention	Workload
B200 chat	nvidia/Kimi-K2.6-NVFP4	NVFP4 + FP8 KV cache	4x B200 per worker, TP4	FlashInfer-TRTLLM	TokenSpeed MLA	Chat, 70% KV reuse
B200 agentic	nvidia/Kimi-K2.6-NVFP4	NVFP4 + FP8 KV cache	4x B200 per worker, TP4	FlashInfer-TRTLLM	TokenSpeed MLA	Agentic, 64K-median ISL, 90% KV reuse
H200 chat	moonshotai/Kimi-K2.6 (native INT4)	INT4	8x H200 per worker, TP8	Marlin	FlashAttention MLA	Chat, 70% KV reuse
H200 agentic	moonshotai/Kimi-K2.6 (native INT4)	INT4	8x H200 per worker, TP8	Marlin	FlashAttention MLA	Agentic, 64K-median ISL, 90% KV reuse

Prerequisites

B200 targets:

A Kubernetes cluster with the Dynamo platform installed and 4x B200 per worker replica available (results below were measured at 4 replicas).
A Hugging Face token with access to nvidia/Kimi-K2.6-NVFP4 and lightseekorg/kimi-k2.6-eagle3-mla.

H200 targets:

A Kubernetes cluster with the Dynamo platform installed and 8x H200 per worker replica available (results below were measured at 4 replicas).
A Hugging Face token with access to moonshotai/Kimi-K2.6 and lightseekorg/kimi-k2.6-eagle3-mla.
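
The GPU budget for a target is its GPUs per worker multiplied by the worker replica count. A minimal sketch of that arithmetic follows; `gpus_needed` is a hypothetical helper name used only for illustration and is not part of the recipe.

```shell
# gpus_needed GPUS_PER_WORKER REPLICAS
# Hypothetical helper: total GPUs a target needs at a given replica count.
gpus_needed() {
  echo $(( $1 * $2 ))
}

gpus_needed 4 4   # B200: TP4 workers at 4 replicas, prints 16
gpus_needed 8 4   # H200: TP8 workers at 4 replicas, prints 32
```

At the 4 replicas used for the measured results, this comes to 16 B200 GPUs or 32 H200 GPUs.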

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token" \
>   -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, and cluster-specific placement in the manifests before applying them.

Deploy

Prepare the model cache and download the checkpoint and Eagle3 head:

B200:

$ # 1. Storage: edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$ kubectl apply -f recipes/kimi-k2.6/model-cache/model-cache.yaml -n ${NAMESPACE}
$ # 2. Download. B200 uses the NVFP4 checkpoint, so remove the native INT4
$ #    download from model-download.yaml before applying.
$ kubectl apply -f recipes/kimi-k2.6/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

H200:

$ # 1. Storage: edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$ kubectl apply -f recipes/kimi-k2.6/model-cache/model-cache.yaml -n ${NAMESPACE}
$ # 2. Download. H200 uses the native INT4 checkpoint, so remove the NVFP4
$ #    download from model-download.yaml before applying.
$ kubectl apply -f recipes/kimi-k2.6/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

Then deploy:

B200 chat:

$ kubectl apply -f recipes/kimi-k2.6/vllm/agg-b200-chat/deploy.yaml -n ${NAMESPACE}

B200 agentic:

$ kubectl apply -f recipes/kimi-k2.6/vllm/agg-b200-agentic/deploy.yaml -n ${NAMESPACE}

H200 chat:

$ kubectl apply -f recipes/kimi-k2.6/vllm/agg-h200-chat/deploy.yaml -n ${NAMESPACE}

H200 agentic:

$ kubectl apply -f recipes/kimi-k2.6/vllm/agg-h200-agentic/deploy.yaml -n ${NAMESPACE}

Smoke Test

Send a test request to verify the deployment serves traffic. First forward your target's frontend service to local port 8000. kubectl port-forward stays in the foreground, so leave it running and send the request from a second terminal.

B200 chat:

$ kubectl port-forward svc/kimi-k26-agg-b200-chat-frontend 8000:8000 -n ${NAMESPACE}

B200 agentic:

$ kubectl port-forward svc/kimi-k26-agg-b200-agentic-frontend 8000:8000 -n ${NAMESPACE}

H200 chat:

$ kubectl port-forward svc/kimi-k26-agg-h200-chat-frontend 8000:8000 -n ${NAMESPACE}

H200 agentic:

$ kubectl port-forward svc/kimi-k26-agg-h200-agentic-frontend 8000:8000 -n ${NAMESPACE}

The request is the same for every target:

$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"moonshotai/Kimi-K2.6","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
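
To turn the smoke test into a pass/fail check, the response body can be piped through a small filter. The sketch below assumes the frontend returns an OpenAI-style chat-completions body with a top-level "choices" field on success. That assumption is not stated on this page, and `has_choices` is a hypothetical helper name.

```shell
# has_choices: hypothetical helper. Reads a response body on stdin and
# succeeds only if it contains a "choices" field (assumed success marker).
has_choices() {
  grep -q '"choices"'
}

# Usage against the port-forwarded frontend (same request as above):
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"moonshotai/Kimi-K2.6","messages":[{"role":"user","content":"ping"}],"max_tokens":8}' \
#     | has_choices && echo "deployment is serving"
```

A substring match is deliberately crude. It only distinguishes a completion from an error body and does not validate the JSON.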

Benchmark

A single AIPerf trace-replay Job (recipes/kimi-k2.6/perf/perf.yaml) covers every target; only ENDPOINT and TRACE_FILE change. Before running it, point the worker’s SPECULATIVE_CONFIG at the speculative-config-synthetic ConfigMap key (synthetic Eagle3 acceptance, AL=2.49), set worker replicas to your target, and stage the traces from recipes/kimi-k2.6/perf/traces/ onto the model-cache PVC:

$ kubectl run pvc-helper -n ${NAMESPACE} \
>   --image=busybox:1.36 --restart=Never \
>   --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/model-cache"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"model-cache"}}]}}' \
>   --command -- sleep 3600
$ # Wait until the helper pod is running; kubectl cp fails against a pod that has not started.
$ kubectl wait --for=condition=Ready pod/pvc-helper -n ${NAMESPACE} --timeout=120s
$ kubectl cp recipes/kimi-k2.6/perf/traces ${NAMESPACE}/pvc-helper:/model-cache/
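
After the copy, it can be worth confirming that the traces landed on the PVC and then removing the helper pod, which otherwise sleeps for an hour. A minimal sketch follows; `verify_traces` and `remove_helper` are hypothetical wrapper names, and the pod name and mount path are the ones used in the commands above.

```shell
# verify_traces NAMESPACE
# Hypothetical wrapper: list the staged trace files through the helper pod.
verify_traces() {
  kubectl exec pvc-helper -n "$1" -- ls /model-cache/traces
}

# remove_helper NAMESPACE
# Hypothetical wrapper: delete the helper pod once the copy is verified.
remove_helper() {
  kubectl delete pod pvc-helper -n "$1"
}

# Usage:
#   verify_traces "${NAMESPACE}" && remove_helper "${NAMESPACE}"
```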

Set ENDPOINT to kimi-k26-agg-b200-chat-frontend:8000 and TRACE_FILE to the chat trace, then apply. The Job wraps this AIPerf run:

$ aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-b200-chat-frontend:8000 --streaming \
>   --use-server-token-count --extra-inputs ignore_eos:true --concurrency 48 --random-seed 42

Set ENDPOINT to kimi-k26-agg-b200-agentic-frontend:8000 and TRACE_FILE to the agentic trace, then apply. The Job wraps this AIPerf run:

$ aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-b200-agentic-frontend:8000 --streaming \
>   --use-server-token-count --extra-inputs ignore_eos:true --concurrency 64 --random-seed 42

Set ENDPOINT to kimi-k26-agg-h200-chat-frontend:8000 and TRACE_FILE to the chat trace, then apply. The Job wraps this AIPerf run:

$ aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-h200-chat-frontend:8000 --streaming \
>   --use-server-token-count --extra-inputs ignore_eos:true --concurrency 32 --random-seed 42

Set ENDPOINT to kimi-k26-agg-h200-agentic-frontend:8000 and TRACE_FILE to the agentic trace, then apply. The Job wraps this AIPerf run:

$ aiperf profile -m moonshotai/Kimi-K2.6 --tokenizer moonshotai/Kimi-K2.6 --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 --url http://kimi-k26-agg-h200-agentic-frontend:8000 --streaming \
>   --use-server-token-count --extra-inputs ignore_eos:true --concurrency 48 --random-seed 42

For any target, apply the Job and wait for it to complete:

$ kubectl apply -f recipes/kimi-k2.6/perf/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/kimi-k26-bench -n ${NAMESPACE} --timeout=7200s

15% and 30% trace subsets are provided for shorter runs. To sweep concurrencies, delete the DGD worker pods between runs so residual KV-cache and prefix-cache state does not skew results — see the benchmark README for the full workflow, artifact layout, and tunable environment variables.

Expected Performance

Each target is tuned for its workload shape at a 50 tok/s/user interactivity target:

Workload	Median ISL	Median OSL	KV cache hit rate	User output tok/s
Chat	8K	1K	70%	50
Agentic	64K	400	90%	50

Measured results below were collected by replaying the 15% trace subsets (*_short_15perc.jsonl) with 4 worker replicas per deployment:

Recipe	SKU	Worker replicas	Concurrency	User output tok/s	System output tok/s/GPU
Chat (15% subset)	B200	4	48	49.86	107.8
Agentic (15% subset)	B200	4	64	55.50	166.5
Chat (15% subset)	H200	4	32	54.86	38.7
Agentic (15% subset)	H200	4	48	56.06	66.5
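
The per-GPU column can be converted back to whole-deployment throughput by multiplying by the GPU count. The sketch below assumes that "System output tok/s/GPU" is normalized over every GPU in the deployment (GPUs per worker times worker replicas). This page does not spell that out, and `system_tok_s` is a hypothetical helper name.

```shell
# system_tok_s PER_GPU_TOK_S GPUS_PER_WORKER REPLICAS
# Hypothetical helper: aggregate output tok/s for the whole deployment,
# assuming the per-GPU figure is averaged over all worker GPUs.
system_tok_s() {
  awk -v r="$1" -v g="$2" -v n="$3" 'BEGIN { printf "%.1f\n", r * g * n }'
}

system_tok_s 107.8 4 4   # B200 chat: prints 1724.8
system_tok_s 38.7 8 4    # H200 chat: prints 1238.4
```

Under the same assumption, the agentic rows work out to 2664.0 tok/s on B200 and 2128.0 tok/s on H200.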

Compare All Targets

All four targets serve moonshotai/Kimi-K2.6 on aggregated vLLM 0.21.0 with KV-aware routing, Eagle3 MLA speculative decoding (3 draft tokens), and LMCache CPU KV-cache offload. They differ in checkpoint, parallelism, kernel backends, and the trace they are benchmarked against:

	B200 chat	H200 chat	B200 agentic	H200 agentic
GPUs per worker	4x B200	8x H200	4x B200	8x H200
Precision	NVFP4 + FP8 KV	INT4 (native)	NVFP4 + FP8 KV	INT4 (native)
Parallelism	TP4	TP8	TP4	TP8
MoE backend	FlashInfer-TRTLLM	Marlin	FlashInfer-TRTLLM	Marlin
Attention backend	TokenSpeed MLA	FlashAttention MLA	TokenSpeed MLA	FlashAttention MLA
AllReduce	NCCL symmetric memory	NCCL	NCCL symmetric memory	NCCL
Workload	Chat trace	Chat trace	Agentic trace	Agentic trace

Notes

Dynamo’s KV cache router does not support all LMCache KV events, so routing can be sub-optimal.
Some 400 HTTP errors raised by workers on invalid inputs can surface as 500 errors through the frontend.
B200 targets run nvidia/Kimi-K2.6-NVFP4 with FP8 KV cache; H200 targets run the native INT4 moonshotai/Kimi-K2.6 checkpoint. Both serve under the name moonshotai/Kimi-K2.6.
The chat- and agentic-tuned deployments for a given SKU share the same engine configuration; they are separate targets because each is benchmarked and sized against its own trace, and either trace can be replayed against either DGD.
For benchmarking, swap SPECULATIVE_CONFIG to the speculative-config-synthetic key so Eagle3 acceptance is synthetic and deterministic (AL=2.49); production deployments use the standard speculative-config key.
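
For intuition about the synthetic setting, AL can be related to a per-draft acceptance fraction. This is a sketch under an assumption this page does not confirm: that AL is the mean number of tokens emitted per verification step, counting the one token the target model always contributes. With the 3 draft tokens used by these targets, AL could then range from 1 to 4. `draft_accept_rate` is a hypothetical helper name.

```shell
# draft_accept_rate AL NUM_DRAFT_TOKENS
# Hypothetical helper: average fraction of draft tokens accepted per step,
# assuming AL counts one target-model token plus the accepted draft tokens.
draft_accept_rate() {
  awk -v al="$1" -v k="$2" 'BEGIN { printf "%.3f\n", (al - 1) / k }'
}

draft_accept_rate 2.49 3   # prints 0.497
```

Read this way, AL=2.49 corresponds to roughly half of the 3 draft tokens being accepted on average.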

Source

Source README: recipes/kimi-k2.6/README.md
Benchmark README: recipes/kimi-k2.6/perf/README.md and recipes/kimi-k2.6/perf/perf.yaml
B200 chat: recipes/kimi-k2.6/vllm/agg-b200-chat/deploy.yaml
H200 chat: recipes/kimi-k2.6/vllm/agg-h200-chat/deploy.yaml
B200 agentic: recipes/kimi-k2.6/vllm/agg-b200-agentic/deploy.yaml
H200 agentic: recipes/kimi-k2.6/vllm/agg-h200-agentic/deploy.yaml
Setup assets: recipes/kimi-k2.6/model-cache/model-cache.yaml and recipes/kimi-k2.6/model-cache/model-download.yaml