GPT-OSS-120B | NVIDIA Dynamo Documentation

Validated deployment targets for openai/gpt-oss-120b (MXFP4 MoE, 128 experts, top-4) across two runtimes. The vLLM targets serve the Mooncake agentic trace (64K ISL / 400 OSL, 90% KV cache hit) on B200 or H200, aggregated or disaggregated, with KV-aware routing and EAGLE3 speculative decoding. The TensorRT-LLM targets cover short-prompt high-concurrency and long-context generation on GB200. The GPU choice selects the runtime: B200 and H200 run vLLM, GB200 runs TensorRT-LLM. Because the targets use different traffic shapes, this page is not a backend benchmark. Pick your GPU and topology, then follow the commands for that target.

Choose your deployment target

GPU: B200 (vLLM) · H200 (vLLM) · GB200 (TRT-LLM)

Topology: Aggregated · Disaggregated

vLLM aggregated (B200)
Checkpoint: openai/gpt-oss-120b
Runtime: vLLM · MXFP4 + FP8 KV cache
GPUs: 8x B200, 8x TP1 replicas
Techniques: KV-aware routing, EAGLE3-v3, flashinfer_trtllm MoE
Workload: Agentic (64K ISL / 400 OSL, 90% KV hit)

vLLM disaggregated (B200)
Checkpoint: openai/gpt-oss-120b
Runtime: vLLM · MXFP4 + FP8 KV cache
GPUs: 8x B200, 2 prefill + 6 decode (in-pod)
Techniques: KV-aware routing, EAGLE3-v3, flashinfer_trtllm MoE
Workload: Agentic (64K ISL / 400 OSL, 90% KV hit)

vLLM aggregated (H200)
Checkpoint: openai/gpt-oss-120b
Runtime: vLLM · MXFP4 + FP8 KV cache
GPUs: 8x H200, 8x TP1 replicas
Techniques: KV-aware routing, EAGLE3-v3, SimpleCPUOffload, flashinfer_cutlass MoE
Workload: Agentic (64K ISL / 400 OSL, 90% KV hit)

vLLM disaggregated (H200)
Checkpoint: openai/gpt-oss-120b
Runtime: vLLM · MXFP4 + FP8 KV cache
GPUs: 8x H200, 4 prefill + 4 decode (in-pod)
Techniques: KV-aware routing, EAGLE3-v3, flashinfer_cutlass MoE
Workload: Agentic (64K ISL / 400 OSL, 90% KV hit)

TRT-LLM aggregated (GB200)
Checkpoint: openai/gpt-oss-120b
Runtime: TensorRT-LLM (tensorrtllm-runtime:1.2.1)
GPUs: 4x GB200 (ARM64), TP4, EP4 + attention-DP
Workload: Short prompts, long outputs, high concurrency (128 ISL / 1000 OSL)

TRT-LLM disaggregated (GB200/B200)
Checkpoint: openai/gpt-oss-120b
Runtime: TensorRT-LLM (tensorrtllm-runtime:1.2.1)
GPUs: 5x GB200/B200 (TP1 prefill + TP4 decode)
Quantization: W4A8_MXFP4_MXFP8
Workload: Long-context generation (8K ISL / 1K OSL)

Prerequisites

vLLM targets on B200:
A Kubernetes cluster with the Dynamo platform installed and 8x B200 on a single node (the disaggregated target co-locates prefill and decode workers in one Pod).
A Hugging Face token with access to openai/gpt-oss-120b and nvidia/gpt-oss-120b-Eagle3-v3 (the EAGLE3 speculative-decoding head).

vLLM targets on H200:
A Kubernetes cluster with the Dynamo platform installed and 8x H200 on a single node (the disaggregated target co-locates prefill and decode workers in one Pod).
A Hugging Face token with access to openai/gpt-oss-120b and nvidia/gpt-oss-120b-Eagle3-v3 (the EAGLE3 speculative-decoding head).

TensorRT-LLM aggregated target:
A Kubernetes cluster with the Dynamo platform installed and 4x GB200 available on ARM64 nodes. The aggregated target will not run on x86 Hopper/Ampere hardware.
A Hugging Face token with access to openai/gpt-oss-120b.

TensorRT-LLM disaggregated target:
A Kubernetes cluster with the Dynamo platform installed and 5x GB200 or B200 available (1 prefill + 4 decode GPUs).
A Hugging Face token with access to openai/gpt-oss-120b.

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token" \
>   -n ${NAMESPACE}
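
Typing the token directly into the command leaves it in shell history. The following is a minimal sketch of an alternative, assuming the token is already exported as HF_TOKEN in your shell. The function name create_hf_secret is hypothetical, and the render-then-apply pipeline is a common kubectl idiom rather than something this recipe prescribes:

```shell
# Hypothetical helper: create or update hf-token-secret from environment variables.
create_hf_secret() {
  # Abort with a message if either variable is unset or empty.
  : "${NAMESPACE:?set NAMESPACE first}"
  : "${HF_TOKEN:?export HF_TOKEN first}"
  # Render the Secret locally, then apply it so a rerun updates the
  # existing Secret instead of failing with "already exists".
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="${HF_TOKEN}" \
    -n "${NAMESPACE}" \
    --dry-run=client -o yaml | kubectl apply -n "${NAMESPACE}" -f -
}
```

Call create_hf_secret after exporting NAMESPACE and HF_TOKEN; the secret name and key match the command above.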

Before deploying, update storageClassName in model-cache/model-cache.yaml and set the container image tag in deploy.yaml to match your Dynamo release. Also adjust the namespace, node selectors, and other cluster-specific placement settings in the manifests.
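
The storage-class edit can be scripted. This is a sketch only: it assumes the manifest contains one or more plain `storageClassName: <value>` lines, which has not been checked against the actual file, and set_storage_class is a hypothetical helper name:

```shell
# Hypothetical helper: point every storageClassName entry in a manifest at a new class.
set_storage_class() {
  file="$1"
  class="$2"
  # Keep the key and its indentation; replace only the value.
  # -i.bak leaves a backup copy and works with both GNU and BSD sed.
  sed -i.bak -E "s|^([[:space:]]*storageClassName:).*|\1 ${class}|" "${file}"
}
```

For example, `set_storage_class recipes/gpt-oss-120b/model-cache/model-cache.yaml my-storage-class` (the full path is inferred from the apply command in the Deploy step, and my-storage-class is a placeholder). Compare the result with the .bak copy before applying it.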

Deploy

Prepare the model cache (shared by all targets; the vLLM targets also pull the EAGLE3 head):

$ kubectl apply -f recipes/gpt-oss-120b/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

Then deploy the manifest for your target.

vLLM aggregated (B200):

$ kubectl apply -f recipes/gpt-oss-120b/vllm/agg-b200-agentic/deploy.yaml -n ${NAMESPACE}

vLLM disaggregated (B200):

$ kubectl apply -f recipes/gpt-oss-120b/vllm/disagg-b200-agentic/deploy.yaml -n ${NAMESPACE}

vLLM aggregated (H200):

$ kubectl apply -f recipes/gpt-oss-120b/vllm/agg-h200-agentic/deploy.yaml -n ${NAMESPACE}

vLLM disaggregated (H200):

$ kubectl apply -f recipes/gpt-oss-120b/vllm/disagg-h200-agentic/deploy.yaml -n ${NAMESPACE}

TRT-LLM aggregated:

$ kubectl apply -f recipes/gpt-oss-120b/trtllm/agg/deploy.yaml -n ${NAMESPACE}

TRT-LLM disaggregated:

$ kubectl apply -f recipes/gpt-oss-120b/trtllm/disagg/deploy.yaml -n ${NAMESPACE}

Model loading takes roughly 15-30 minutes depending on storage speed. Watch the pods until they are ready:

$ kubectl get pods -n ${NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=gpt-oss-disagg -w
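
Watching pods with -w needs someone to read the output. For scripted rollouts, a blocking wait is one option. The sketch below reuses the label from the command above; the helper name wait_for_graph and the 1800s default (the upper end of the stated loading time) are assumptions:

```shell
# Hypothetical helper: block until every pod of a Dynamo graph deployment is Ready.
wait_for_graph() {
  name="$1"
  timeout="${2:-1800s}"
  # kubectl wait can fail right away if no pod matches the selector yet,
  # so run this only after the pods have been created.
  kubectl wait --for=condition=Ready pod \
    -l "nvidia.com/dynamo-graph-deployment-name=${name}" \
    -n "${NAMESPACE}" --timeout="${timeout}"
}
```

With NAMESPACE exported, `wait_for_graph gpt-oss-disagg` returns 0 once all matching pods report Ready and non-zero on timeout.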

Smoke Test

Send a test request to verify the deployment serves traffic. First forward the frontend port for your target:

vLLM aggregated (B200):

$ kubectl port-forward svc/turbo-gptoss-120b-agg-b200-agentic-frontend 8000:8000 -n ${NAMESPACE}

vLLM disaggregated (B200). This target is a single co-located Pod exposed as one Service (no separate frontend):

$ kubectl port-forward svc/turbo-gptoss-120b-disagg-b200-agentic 8000:8000 -n ${NAMESPACE}

vLLM aggregated (H200):

$ kubectl port-forward svc/turbo-gptoss-120b-agg-h200-agentic-frontend 8000:8000 -n ${NAMESPACE}

vLLM disaggregated (H200). This target is a single co-located Pod exposed as one Service (no separate frontend):

$ kubectl port-forward svc/turbo-gptoss-120b-disagg-h200-agentic 8000:8000 -n ${NAMESPACE}

TRT-LLM aggregated:

$ kubectl port-forward svc/gpt-oss-agg-frontend 8000:8000 -n ${NAMESPACE}

TRT-LLM disaggregated:

$ kubectl port-forward svc/gpt-oss-disagg-frontend 8000:8000 -n ${NAMESPACE}

All targets serve openai/gpt-oss-120b:

$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
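
For automated checks it can help to turn the request above into a pass/fail result. This is a minimal sketch, assuming that an HTTP 200 from the endpoint is an adequate readiness signal; smoke_test is a hypothetical helper and it does not inspect the generated text:

```shell
# Hypothetical helper: send the smoke-test request and report only the HTTP status.
smoke_test() {
  base_url="${1:-http://localhost:8000}"
  # -o /dev/null discards the body; -w prints just the status code.
  status="$(curl -s -o /dev/null -w '%{http_code}' \
    "${base_url}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}')"
  if [ "${status}" = "200" ]; then
    echo "ok"
  else
    echo "unexpected HTTP status: ${status}" >&2
    return 1
  fi
}
```

Run `smoke_test` while the port-forward is active; a non-zero exit code means the frontend did not answer with 200.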

Benchmark

A single AIPerf trace-replay Job — perf/perf.yaml — covers every vLLM variant. It replays the Mooncake agentic trace (reused from the Kimi-K2.6 recipe; fetch it with git lfs pull --include "recipes/kimi-k2.6/perf/traces/*") at one concurrency value and writes artifacts to the shared model-cache PVC. Edit the env block to set ENDPOINT, TRACE_FILE, and CONCURRENCY for your target:

Target	`ENDPOINT`	`CONCURRENCY`
vLLM aggregated (B200)	`turbo-gptoss-120b-agg-b200-agentic-frontend:8000`	`512`
vLLM aggregated (H200)	`turbo-gptoss-120b-agg-h200-agentic-frontend:8000`	`256`
vLLM disaggregated (B200)	`turbo-gptoss-120b-disagg-b200-agentic:8000`	`256`
vLLM disaggregated (H200)	`turbo-gptoss-120b-disagg-h200-agentic:8000`	`256`
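
The table above can be kept as a small lookup so the right values are at hand when editing the env block. In this sketch the short target keys (agg-b200 and so on) and the function name bench_settings are hypothetical; the endpoints and concurrency values are copied from the table:

```shell
# Hypothetical lookup: print "<ENDPOINT> <CONCURRENCY>" for a vLLM target key.
bench_settings() {
  case "$1" in
    agg-b200)    echo "turbo-gptoss-120b-agg-b200-agentic-frontend:8000 512" ;;
    agg-h200)    echo "turbo-gptoss-120b-agg-h200-agentic-frontend:8000 256" ;;
    disagg-b200) echo "turbo-gptoss-120b-disagg-b200-agentic:8000 256" ;;
    disagg-h200) echo "turbo-gptoss-120b-disagg-h200-agentic:8000 256" ;;
    *) echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Split the two fields into the variable names used by perf.yaml.
read -r ENDPOINT CONCURRENCY <<EOF
$(bench_settings agg-b200)
EOF
echo "ENDPOINT=${ENDPOINT} CONCURRENCY=${CONCURRENCY}"
```

The printed values still have to be copied into the env block of perf/perf.yaml by hand; the sketch does not modify the manifest.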

$ kubectl apply -f recipes/gpt-oss-120b/perf/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/turbo-gptoss-120b-bench -n ${NAMESPACE} --timeout=7200s

To measure multiple concurrencies, clear server state between runs — otherwise residual KV/prefix-cache hits skew results. See the benchmark README.

The TensorRT-LLM aggregated target ships a perf.yaml Kubernetes Job that runs AIPerf at ISL 128 / OSL 1000 with a total concurrency of 3,600 (900 per GPU x 4 GPUs) and a request count of 10x concurrency (36,000). The Job wraps this AIPerf run:

$ aiperf profile \
>   --model openai/gpt-oss-120b \
>   --endpoint-type chat --endpoint /v1/chat/completions --streaming \
>   --url http://gpt-oss-agg-frontend:8000 \
>   --synthetic-input-tokens-mean 128 --output-tokens-mean 1000 \
>   --extra-inputs ignore_eos:true \
>   --concurrency 3600 --request-count 36000

$ kubectl apply -f recipes/gpt-oss-120b/trtllm/agg/perf.yaml -n ${NAMESPACE}
$ kubectl logs -f -l job-name=gpt-oss-120b-bench -n ${NAMESPACE}

The TensorRT-LLM disaggregated target ships a perf.yaml Kubernetes Job that runs AIPerf at ISL 8192 / OSL 1024 with a total concurrency of 1,536 and a request count of 10x concurrency (15,360). The Job wraps this AIPerf run:

$ aiperf profile \
>   --model openai/gpt-oss-120b \
>   --endpoint-type chat --endpoint /v1/chat/completions --streaming \
>   --url http://gpt-oss-disagg-frontend:8000 \
>   --synthetic-input-tokens-mean 8192 --output-tokens-mean 1024 \
>   --extra-inputs ignore_eos:true \
>   --concurrency 1536 --request-count 15360

$ kubectl apply -f recipes/gpt-oss-120b/trtllm/disagg/perf.yaml -n ${NAMESPACE}
$ kubectl logs -f -l job-name=gpt-oss-120b-disagg-bench -n ${NAMESPACE}

Expected Performance

Measured on the agentic 15% trace (8 GPUs) with the synthetic-acceptance EAGLE3 throughput proxy (AL=2.72). Use these numbers for relative comparison only: under synthetic acceptance the generated text is intentionally garbage. Per-GPU throughput is system throughput divided by 8.

Target	Concurrency	tok/s/GPU	tok/s/user	TTFT avg
vLLM aggregated (B200)	512	2896	58	4.4s
vLLM aggregated (H200)	256	1256	47.5	1.4s
vLLM disaggregated (B200)	256	2069	99	5.6s
vLLM disaggregated (H200)	256	1046	43	4.4s
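
Because per-GPU throughput is system throughput divided by 8, multiplying a table entry by 8 gives back an approximate system figure. A small sketch of that arithmetic, with hypothetical target keys and a hypothetical helper name system_tps:

```shell
# Convert a per-GPU throughput figure back to system throughput.
system_tps() {
  per_gpu="$1"
  gpus="${2:-8}"   # all four vLLM targets run on 8 GPUs
  echo $((per_gpu * gpus))
}

# tok/s/GPU values copied from the table above.
while read -r target per_gpu; do
  echo "${target}: $(system_tps "${per_gpu}") tok/s"
done <<EOF
agg-b200 2896
agg-h200 1256
disagg-b200 2069
disagg-h200 1046
EOF
```

This prints 23168, 10048, 16552, and 8368 tok/s respectively. The table values are presumably rounded, so treat the results as approximate.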

The TensorRT-LLM targets ship benchmark Jobs but no published numbers — reproduce them with the Benchmark step above.

Compare All Targets

vLLM targets (agentic Mooncake trace, 8 GPUs, MXFP4 + FP8 KV, EAGLE3-v3, KV-aware routing):

	vLLM agg B200	vLLM agg H200	vLLM disagg B200	vLLM disagg H200
GPUs	8x B200, 8x TP1	8x H200, 8x TP1	8x B200, 2P6D in-pod	8x H200, 4P4D in-pod
MoE backend	flashinfer_trtllm	flashinfer_cutlass	flashinfer_trtllm	flashinfer_cutlass
KV offload	none	SimpleCPUOffload	none (NIXL-incompatible)	none (NIXL-incompatible)
Concurrency	512	256	256	256

TensorRT-LLM targets (GB200, static synthetic traffic):

	TRT-LLM aggregated	TRT-LLM disaggregated
GPUs	4x GB200 (ARM64 required)	5x GB200/B200
Topology	TP4, EP4 + attention-DP	TP1 prefill + TP4 decode
Workload	128 ISL / 1000 OSL, 3,600 concurrency	8K ISL / 1K OSL, 1,536 concurrency
Quantization	Checkpoint default	W4A8_MXFP4_MXFP8
KV transfer	—	UCX cache transceiver

Notes

vLLM targets:
Reasoning and tool calling use the gpt-oss “harmony” format, wired with --dyn-reasoning-parser gpt_oss --dyn-tool-call-parser harmony; a tool-calling response populates tool_calls and returns finish_reason: tool_calls.
The vLLM disaggregated targets run single-node in-pod (prefill and decode co-located in one Pod). Multi-pod disaggregation is not supported.
Structured output (response_format: json_object / json_schema) may return invalid JSON while speculative decoding is enabled — use tool calling or validate client-side.
The H200 aggregated target enables SimpleCPUOffload (about +9% throughput, quality-neutral); the B200 targets leave it off by default.
Speculative decoding uses the nvidia/gpt-oss-120b-Eagle3-v3 head (EAGLE3-v3, draft length 3).
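
Client-side validation of structured output, as suggested in the notes above, can be as simple as a parse check before the reply is used. The sketch below assumes jq is installed; is_valid_json is a hypothetical helper and the reply value is a stand-in, not real model output:

```shell
# Hypothetical helper: succeed only if the argument is non-empty, parseable JSON.
is_valid_json() {
  # "jq empty" parses the input and prints nothing; it fails on a parse error.
  [ -n "$1" ] && printf '%s' "$1" | jq empty >/dev/null 2>&1
}

reply='{"status":"ready"}'   # stand-in for the message content returned by the model
if is_valid_json "${reply}"; then
  echo "valid"
else
  echo "invalid JSON: retry the request or fall back to tool calling" >&2
fi
```

This checks only that the text parses as JSON; validating it against a json_schema would need a separate schema validator.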

TensorRT-LLM targets:
The aggregated target requires ARM64 (GB200) nodes; the disaggregated target accepts GB200 or B200.
Do not read the two TensorRT-LLM targets as an aggregated-vs-disaggregated benchmark; their traffic shapes differ by design.
The disaggregated deployment uses 5 GPUs (1x TP1 prefill + 1x TP4 decode), but its perf.yaml computes total concurrency from a 6-GPU count (256 x 6 = 1,536). Adjust DEPLOYMENT_GPU_COUNT if you want strict per-GPU normalization.
The disaggregated target selects W4A8_MXFP4_MXFP8 quantization through the OVERRIDE_QUANT_ALGO environment variable and uses UCX-based KV transfer (max_tokens_in_buffer=9216).
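
To illustrate the per-GPU normalization mentioned above, the arithmetic for a 5-GPU count works out as follows. Only DEPLOYMENT_GPU_COUNT is a name taken from the recipe; the other variable names are hypothetical, and how perf.yaml consumes the value is not shown here:

```shell
PER_GPU_CONCURRENCY=256      # per-GPU value implied by 256 x 6 = 1,536
DEPLOYMENT_GPU_COUNT=5       # 1 prefill GPU + 4 decode GPUs actually deployed
TOTAL_CONCURRENCY=$((PER_GPU_CONCURRENCY * DEPLOYMENT_GPU_COUNT))
REQUEST_COUNT=$((TOTAL_CONCURRENCY * 10))   # request count is 10x concurrency
echo "--concurrency ${TOTAL_CONCURRENCY} --request-count ${REQUEST_COUNT}"
```

With 5 GPUs this yields a concurrency of 1280 and a request count of 12800, compared with 1,536 and 15,360 in the shipped Job.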

Source

Source README: recipes/gpt-oss-120b/README.md
Benchmark README: recipes/gpt-oss-120b/perf/README.md
vLLM aggregated: B200 and H200 deploy.yaml
vLLM disaggregated: B200 and H200 deploy.yaml
vLLM benchmark manifest: perf.yaml
TRT-LLM aggregated: deploy.yaml and perf.yaml
TRT-LLM disaggregated: deploy.yaml and perf.yaml